public class Modelling
extends java.lang.Object
Functions related to training and evaluating ML models. The functions are organized into a few groups.
For the purpose of this system, categorical data means a column of data that is not numeric. It could be strings, keywords, or arbitrary objects.
Minimal example of the extra dependencies needed for PCA:
[uncomplicate/neanderthal "0.43.3"]
Fit results can be serialized to nippy automatically, because dtype-next includes nippy extensions that work with tensors.
| Modifier and Type | Method and Description |
|---|---|
| static java.util.Map | correlationTable(java.lang.Object ds)<br>Return a map from each column name to a sequence of (column name, Pearson correlation coefficient) tuples, sorted from greatest to least coefficient. |
| static java.util.Map | correlationTable(java.lang.Object ds, java.lang.Object options)<br>Return a map from each column name to a sequence of (column name, coefficient) tuples, sorted from greatest to least coefficient. |
| static java.util.Map | fillRangeReplace(java.lang.Object ds, java.lang.Object cname, double maxSpan, java.lang.Object missingStrategy)<br>Expand a dataset ensuring that the difference between two successive values is less than max-span. |
| static java.util.Map | fitCategorical(java.lang.Object ds, java.lang.Object cname)<br>Fit an object->integer transformation. |
| static java.util.Map | fitCategorical(java.lang.Object ds, java.lang.Object cname, java.lang.Object options)<br>Fit an object->integer transform that takes each value and assigns an integer to it. |
| static java.lang.Object | fitMinMax(java.lang.Object ds)<br>Fit a minmax transformation that will transform each column to a minimum of -0.5 and a maximum of 0.5. |
| static java.lang.Object | fitMinMax(java.lang.Object ds, java.lang.Object options)<br>Fit a bias and scale that transform each column to a target min-max range. |
| static java.util.Map | fitOneHot(java.lang.Object ds, java.lang.Object cname)<br>Fit a mapping from a categorical column to a group of one-hot encoded columns. |
| static java.util.Map | fitOneHot(java.lang.Object ds, java.lang.Object cname, java.lang.Object options)<br>Fit a transformation from a single column of categorical values to a one-hot encoded group of columns. |
| static java.lang.Object | fitPCA(java.lang.Object ds)<br>Fit a PCA transformation onto a dataset keeping 95% of the variance. |
| static java.lang.Object | fitPCA(java.lang.Object ds, java.lang.Object options)<br>Fit a PCA transformation on a dataset. |
| static java.lang.Object | fitStdScale(java.lang.Object ds)<br>Calculate per-column mean, stddev. |
| static java.util.Map | inferenceTargetLabelMap(java.lang.Object ds)<br>Return a map of val->idx for the inference target. |
| static java.util.Map | interpolateLOESS(java.lang.Object ds, java.lang.Object xColname, java.lang.Object yColname)<br>Perform a LOESS interpolation using the default parameters. |
| static java.util.Map | interpolateLOESS(java.lang.Object ds, java.lang.Object xColname, java.lang.Object yColname, java.lang.Object options)<br>Map a LOESS-interpolation transformation onto a dataset. |
| static java.util.Map | invertCategorical(java.lang.Object ds, java.lang.Object catFitData)<br>Reverse a previously transformed categorical mapping. |
| static java.util.Map | invertOneHot(java.lang.Object ds, java.lang.Object fitData)<br>Reverse a previously transformed one-hot mapping. |
| static java.lang.Iterable | kFold(java.lang.Object ds, long k)<br>Return k maps of the form {:test-ds :train-ds}. |
| static java.lang.Iterable | kFold(java.lang.Object ds, long k, java.lang.Object options)<br>Produce 2*k datasets from 1 dataset using the k-fold algorithm. |
| static java.lang.Object | labels(java.lang.Object ds)<br>Find the inference column. |
| static java.lang.Object | probabilityDistributionToLabels(java.lang.Object ds)<br>Given a dataset where the column names are labels and each row is a probability distribution across the labels, produce a Buffer of labels, choosing the label with the highest probability for each row. |
| static java.util.Map | setInferenceTarget(java.lang.Object ds, java.lang.Object cname)<br>Set a column in the dataset as the inference target. |
| static java.util.Map | trainTestSplit(java.lang.Object ds)<br>Randomize then split dataset using 70% of the data for training and the rest for testing. |
| static java.util.Map | trainTestSplit(java.lang.Object ds, java.lang.Object options)<br>Split the dataset returning a map of {:train-ds :test-ds}. |
| static java.util.Map | transformCategorical(java.lang.Object ds, java.lang.Object catFitData)<br>Apply an object->integer transformation with data obtained from fitCategorical. |
| static java.util.Map | transformMinMax(java.lang.Object ds, java.lang.Object fitData)<br>Transform a dataset using a previously fit minmax transformation. |
| static java.util.Map | transformOneHot(java.lang.Object ds, java.lang.Object fitData)<br>Transform a dataset using a fitted one-hot mapping. |
| static java.util.Map | transformPCA(java.lang.Object ds, java.lang.Object fitData)<br>Transform a dataset by the PCA fit data. |
| static java.util.Map | transformStdScale(java.lang.Object ds, java.lang.Object fitData)<br>Transform dataset to mean of zero and a standard deviation of 1. |
public static java.util.Map fitCategorical(java.lang.Object ds, java.lang.Object cname, java.lang.Object options)
Fit an object->integer transform that takes each value and assigns an integer to it. The returned value can be used in transformCategorical to transform the dataset.
Options:
:table-args - Either a sequence of vectors [col-val, idx] or a sorted sequence of column values where integers will be assigned as per the sorted sequence. Any values found outside the specified values will be auto-mapped to the next largest integer.
:res-dtype - Datatype of result column. Defaults to :float64.
public static java.util.Map fitCategorical(java.lang.Object ds, java.lang.Object cname)
Fit an object->integer transformation. Integers will be assigned in random order. For more control over the transform see the 3-arity version of the function.
public static java.util.Map transformCategorical(java.lang.Object ds, java.lang.Object catFitData)
Apply an object->integer transformation with data obtained from fitCategorical.
public static java.util.Map invertCategorical(java.lang.Object ds, java.lang.Object catFitData)
Reverse a previously transformed categorical mapping.
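The fit, transform, and invert steps for a categorical column can be pictured with a small plain-Java sketch. This is only an illustration of the idea on a `List`, not the library's implementation: the class and method names are hypothetical, codes are assigned in first-seen order (the library states that the 2-arity fit assigns them in random order), and values not seen during fit are rejected rather than auto-mapped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the library.
final class CategoricalSketch {
    // Fit: assign each distinct value the next unused integer, in first-seen order.
    static Map<Object, Integer> fit(List<?> column) {
        Map<Object, Integer> table = new LinkedHashMap<>();
        for (Object value : column) {
            table.putIfAbsent(value, table.size());
        }
        return table;
    }

    // Transform: replace each value with its integer code, stored as a double
    // to mirror the documented :float64 default result type.
    static double[] transform(List<?> column, Map<Object, Integer> table) {
        double[] codes = new double[column.size()];
        for (int i = 0; i < codes.length; i++) {
            Integer idx = table.get(column.get(i));
            if (idx == null) {
                throw new IllegalArgumentException("Value not seen during fit: " + column.get(i));
            }
            codes[i] = idx;
        }
        return codes;
    }

    // Invert: map integer codes back to the original values.
    static List<Object> invert(double[] codes, Map<Object, Integer> table) {
        Object[] byIndex = new Object[table.size()];
        for (Map.Entry<Object, Integer> e : table.entrySet()) {
            byIndex[e.getValue()] = e.getKey();
        }
        List<Object> values = new ArrayList<>();
        for (double code : codes) {
            values.add(byIndex[(int) code]);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> column = List.of("cat", "dog", "cat", "bird");
        Map<Object, Integer> table = fit(column);
        double[] codes = transform(column, table);
        System.out.println(Arrays.toString(codes));
        System.out.println(invert(codes, table));
    }
}
```

Keeping the fitted table separate from the transform step is what lets the same mapping be applied to a later dataset and then reversed.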
public static java.util.Map fitOneHot(java.lang.Object ds, java.lang.Object cname, java.lang.Object options)
Fit a transformation from a single column of categorical values to a one-hot encoded group of columns.
Options:
:table-args - Either a sequence of vectors [col-val, idx] or a sorted sequence of column values where integers will be assigned as per the sorted sequence. Any values found outside the specified values will be auto-mapped to the next largest integer.
:res-dtype - Datatype of result column. Defaults to :float64.
public static java.util.Map fitOneHot(java.lang.Object ds, java.lang.Object cname)
Fit a mapping from a categorical column to a group of one-hot encoded columns.
public static java.util.Map transformOneHot(java.lang.Object ds, java.lang.Object fitData)
Transform a dataset using a fitted one-hot mapping.
public static java.util.Map invertOneHot(java.lang.Object ds, java.lang.Object fitData)
Reverse a previously transformed one-hot mapping.
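As a rough picture of what a one-hot mapping does, the following plain-Java sketch encodes a list of values into a 0/1 matrix and decodes it again. It is a hedged illustration only: the class and method names are hypothetical, the column order is first-seen order (an assumption), and it works on arrays rather than on a dataset.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the library.
final class OneHotSketch {
    // Fit: give each distinct value its own output column index (first-seen order).
    static Map<Object, Integer> fit(List<?> column) {
        Map<Object, Integer> table = new LinkedHashMap<>();
        for (Object value : column) {
            table.putIfAbsent(value, table.size());
        }
        return table;
    }

    // Transform: one row per input value, with 1.0 in that value's column and 0.0 elsewhere.
    static double[][] transform(List<?> column, Map<Object, Integer> table) {
        double[][] encoded = new double[column.size()][table.size()];
        for (int row = 0; row < column.size(); row++) {
            Integer idx = table.get(column.get(row));
            if (idx == null) {
                throw new IllegalArgumentException("Value not seen during fit: " + column.get(row));
            }
            encoded[row][idx] = 1.0;
        }
        return encoded;
    }

    // Invert: recover each value from the position of the largest entry in its row.
    static List<Object> invert(double[][] encoded, Map<Object, Integer> table) {
        Object[] byIndex = new Object[table.size()];
        for (Map.Entry<Object, Integer> e : table.entrySet()) {
            byIndex[e.getValue()] = e.getKey();
        }
        List<Object> values = new ArrayList<>();
        for (double[] row : encoded) {
            int best = 0;
            for (int i = 1; i < row.length; i++) {
                if (row[i] > row[best]) {
                    best = i;
                }
            }
            values.add(byIndex[best]);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> column = List.of("red", "green", "red");
        Map<Object, Integer> table = fit(column);
        double[][] encoded = transform(column, table);
        System.out.println(java.util.Arrays.deepToString(encoded));
        System.out.println(invert(encoded, table));
    }
}
```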
public static java.util.Map correlationTable(java.lang.Object ds, java.lang.Object options)
Return a map from each column name to a sequence of (column name, coefficient) tuples, sorted from greatest to least coefficient.
Options:
:correlation-type - One of :pearson, :spearman, or :kendall. Defaults to :pearson.
public static java.util.Map correlationTable(java.lang.Object ds)
Return a map from each column name to a sequence of (column name, Pearson correlation coefficient) tuples, sorted from greatest to least coefficient.
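To make the shape of the result concrete, here is a minimal plain-Java sketch that computes Pearson coefficients between named `double[]` columns and sorts each column's entries from greatest to least. The class name, the use of `Map.Entry` for tuples, and the choice to leave a column out of its own list are all assumptions made for illustration; the library's actual output may differ in those details.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the library.
final class CorrelationSketch {
    // Pearson correlation: covariance of x and y divided by the product of their spreads.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Columns must have equal length of at least 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // For each column, list every other column with its coefficient, greatest first.
    static Map<String, List<Map.Entry<String, Double>>> correlationTable(Map<String, double[]> columns) {
        Map<String, List<Map.Entry<String, Double>>> table = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> a : columns.entrySet()) {
            List<Map.Entry<String, Double>> row = new ArrayList<>();
            for (Map.Entry<String, double[]> b : columns.entrySet()) {
                if (!a.getKey().equals(b.getKey())) {
                    double r = pearson(a.getValue(), b.getValue());
                    row.add(new AbstractMap.SimpleEntry<String, Double>(b.getKey(), r));
                }
            }
            row.sort((p, q) -> Double.compare(q.getValue(), p.getValue()));
            table.put(a.getKey(), row);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, double[]> columns = new LinkedHashMap<>();
        columns.put("a", new double[] {1, 2, 3, 4});
        columns.put("b", new double[] {2, 4, 6, 8});
        columns.put("c", new double[] {4, 3, 2, 1});
        System.out.println(correlationTable(columns));
    }
}
```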
public static java.util.Map fillRangeReplace(java.lang.Object ds, java.lang.Object cname, double maxSpan, java.lang.Object missingStrategy)
Expand a dataset ensuring that the difference between two successive values is less than max-span.
maxSpan - The minimal span value. For datetime types this is interpreted in millisecond or epoch-millisecond space.
missingStrategy - Same missing strategy types from TMD.replaceMissing.
public static java.lang.Object fitPCA(java.lang.Object ds, java.lang.Object options)
Fit a PCA transformation on a dataset. Returns a map of {:means, :eigenvalues, :eigenvectors}.
Options:
:method - Either :svd or :cov. Use either SVD transformation or covariance-matrix-based PCA. The :cov method is somewhat slower but returns accurate variances and thus is the default.
:variance-amount - Keep columns until variance is just less than variance-amount. Defaults to 0.95.
:n-components - Return a fixed number of components. Overrides :variance-amount.
:covariance-bias - When using :cov, divide by n-rows if true and n-rows - 1 if false. Defaults to false.
public static java.lang.Object fitPCA(java.lang.Object ds)
Fit a PCA transformation onto a dataset keeping 95% of the variance. See documentation for 2-arity form.
public static java.util.Map transformPCA(java.lang.Object ds, java.lang.Object fitData)
Transform a dataset by the PCA fit data.
public static java.lang.Object fitStdScale(java.lang.Object ds)
Calculate per-column mean, stddev.
Options:
:mean? - Produce per-column means. Defaults to true.
:stddev? - Produce per-column standard deviation. Defaults to true.
public static java.util.Map transformStdScale(java.lang.Object ds, java.lang.Object fitData)
Transform dataset to mean of zero and a standard deviation of 1.
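The standard-scale fit and transform can be sketched for a single numeric column in plain Java. This is an illustration under stated assumptions, not the library's code: the class name is hypothetical, the fit result is a two-element array, and the sample (n - 1) standard deviation is used, which the source does not specify.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the library.
final class StdScaleSketch {
    // Fit: returns {mean, stddev} for one column.
    // Assumption: sample standard deviation (divide by n - 1).
    static double[] fit(double[] column) {
        int n = column.length;
        if (n < 2) {
            throw new IllegalArgumentException("Need at least two values");
        }
        double sum = 0.0;
        for (double v : column) {
            sum += v;
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double v : column) {
            squares += (v - mean) * (v - mean);
        }
        return new double[] {mean, Math.sqrt(squares / (n - 1))};
    }

    // Transform: subtract the fitted mean and divide by the fitted stddev.
    // A constant column has stddev 0 and is not handled by this sketch.
    static double[] transform(double[] column, double[] fitData) {
        double mean = fitData[0];
        double stddev = fitData[1];
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = (column[i] - mean) / stddev;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] column = {2.0, 4.0, 6.0};
        double[] fitData = fit(column);
        System.out.println(Arrays.toString(transform(column, fitData)));
    }
}
```

Fitting on the training data and reusing the same fit data on the test data keeps both on the same scale.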
public static java.lang.Object fitMinMax(java.lang.Object ds, java.lang.Object options)
Fit a bias and scale that transform each column to a target min-max range.
Options:
:min - Target minimum value. Defaults to -0.5.
:max - Target maximum value. Defaults to 0.5.
public static java.lang.Object fitMinMax(java.lang.Object ds)
Fit a minmax transformation that will transform each column to a minimum of -0.5 and a maximum of 0.5.
public static java.util.Map transformMinMax(java.lang.Object ds, java.lang.Object fitData)
Transform a dataset using a previously fit minmax transformation.
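The "bias and scale" behind a minmax fit can be illustrated for one numeric column in plain Java. This is a sketch under assumptions rather than the library's implementation: the class name and the `{scale, bias}` array layout are hypothetical, and a constant column is mapped to the target minimum as a simplification.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the library.
final class MinMaxSketch {
    // Fit: returns {scale, bias} so that value * scale + bias maps
    // the column's [min, max] onto [targetMin, targetMax].
    static double[] fit(double[] column, double targetMin, double targetMax) {
        if (column.length == 0) {
            throw new IllegalArgumentException("Column must not be empty");
        }
        double min = column[0];
        double max = column[0];
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        // Simplification: a constant column gets scale 0, so every value maps to targetMin.
        double scale = range == 0.0 ? 0.0 : (targetMax - targetMin) / range;
        double bias = targetMin - min * scale;
        return new double[] {scale, bias};
    }

    // Fit with the documented default target range of -0.5 to 0.5.
    static double[] fit(double[] column) {
        return fit(column, -0.5, 0.5);
    }

    // Transform: apply the fitted scale and bias to every value.
    static double[] transform(double[] column, double[] fitData) {
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = column[i] * fitData[0] + fitData[1];
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] column = {0.0, 5.0, 10.0};
        System.out.println(Arrays.toString(transform(column, fit(column))));
    }
}
```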
public static java.util.Map interpolateLOESS(java.lang.Object ds, java.lang.Object xColname, java.lang.Object yColname, java.lang.Object options)
Map a LOESS-interpolation transformation onto a dataset. This can be used to, among other things, smooth out a column before graphing. For the meaning of the options, see the documentation for org.apache.commons.math3.analysis.interpolation.LoessInterpolator.
Option defaults have been chosen to map somewhat closely to the R defaults.
Options:
:bandwidth - Defaults to 0.75.
:iterations - Defaults to 4.
:accuracy - Defaults to LoessInterpolator/DEFAULT_ACCURACY.
:result-name - Result column name. Defaults to yColname.toString + "-loess".
public static java.util.Map interpolateLOESS(java.lang.Object ds, java.lang.Object xColname, java.lang.Object yColname)
Perform a LOESS interpolation using the default parameters. For options see 4-arity form of function.
public static java.lang.Iterable kFold(java.lang.Object ds, long k, java.lang.Object options)
Produce 2*k datasets from 1 dataset using the k-fold algorithm. Returns k maps of the form {:test-ds :train-ds}.
Options:
:randomize-dataset? - When true, shuffle dataset. Defaults to true.
:seed - When randomizing dataset, seed may be either an integer or an implementation of java.util.Random.
public static java.lang.Iterable kFold(java.lang.Object ds, long k)
Return k maps of the form {:test-ds :train-ds}. For options see 3-arity form.
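The k-fold idea of producing k test/train pairs from one dataset can be sketched over row indices in plain Java. Everything here is illustrative: the class, the `Fold` holder, and the contiguous-slice partitioning are assumptions, and the library may partition rows differently. Passing a `java.util.Random` shuffles the rows first; passing `null` keeps the original order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of the library.
final class KFoldSketch {
    // Row indices for one fold: a held-out test slice and the remaining training rows.
    static final class Fold {
        final List<Integer> testRows;
        final List<Integer> trainRows;

        Fold(List<Integer> testRows, List<Integer> trainRows) {
            this.testRows = testRows;
            this.trainRows = trainRows;
        }
    }

    // Split row indices 0..nRows-1 into k folds; each row is a test row in exactly one fold.
    static List<Fold> kFold(int nRows, int k, Random rng) {
        if (k < 2 || k > nRows) {
            throw new IllegalArgumentException("k must be between 2 and nRows");
        }
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < nRows; i++) {
            rows.add(i);
        }
        if (rng != null) {
            Collections.shuffle(rows, rng);
        }
        List<Fold> folds = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            // Integer arithmetic spreads any remainder rows across the folds.
            int start = (int) ((long) fold * nRows / k);
            int end = (int) ((long) (fold + 1) * nRows / k);
            List<Integer> test = new ArrayList<>(rows.subList(start, end));
            List<Integer> train = new ArrayList<>(rows.subList(0, start));
            train.addAll(rows.subList(end, nRows));
            folds.add(new Fold(test, train));
        }
        return folds;
    }

    public static void main(String[] args) {
        for (Fold fold : kFold(10, 5, new Random(42))) {
            System.out.println("test=" + fold.testRows + " train=" + fold.trainRows);
        }
    }
}
```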
public static java.util.Map trainTestSplit(java.lang.Object ds, java.lang.Object options)
Split the dataset returning a map of {:train-ds :test-ds}.
Options:
:randomize-dataset? - Defaults to true.
:seed - When provided must be an integer or an implementation of java.util.Random.
:train-fraction - Fraction of dataset to use as training set. Defaults to 0.7.
public static java.util.Map trainTestSplit(java.lang.Object ds)
Randomize then split dataset using 70% of the data for training and the rest for testing.
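A randomized train/test split can be pictured over row indices with the short plain-Java sketch below. It is a hedged illustration: the class name, the `"train"` and `"test"` map keys, and rounding the training row count to the nearest integer are assumptions, not details confirmed by the source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration class; not part of the library.
final class TrainTestSplitSketch {
    // Shuffle row indices (when rng is non-null), then cut them at trainFraction.
    static Map<String, List<Integer>> split(int nRows, double trainFraction, Random rng) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < nRows; i++) {
            rows.add(i);
        }
        if (rng != null) {
            Collections.shuffle(rows, rng);
        }
        // Assumption: round to the nearest whole row.
        int nTrain = (int) Math.round(nRows * trainFraction);
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        result.put("train", new ArrayList<>(rows.subList(0, nTrain)));
        result.put("test", new ArrayList<>(rows.subList(nTrain, nRows)));
        return result;
    }

    public static void main(String[] args) {
        // A fixed seed makes the split repeatable across runs.
        System.out.println(split(10, 0.7, new Random(7)));
    }
}
```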
public static java.util.Map setInferenceTarget(java.lang.Object ds, java.lang.Object cname)
Set a column in the dataset as the inference target. This information is stored in the column metadata. This function is short form for:
Object col = column(ds, cname);
return assoc(ds, cname, varyMeta(col, assocFn, kw("inference-target?"), true));
public static java.lang.Object labels(java.lang.Object ds)
Find the inference column. If column was the result of a categorical mapping, reverse that mapping. Return data in a form that can be efficiently converted to a Buffer.
public static java.lang.Object probabilityDistributionToLabels(java.lang.Object ds)
Given a dataset where the column names are labels and each row is a probability distribution across the labels, produce a Buffer of labels, choosing the label with the highest probability for each row.
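Choosing the highest-probability label per row is an argmax over each row. The plain-Java sketch below shows that step on a `double[][]` with a parallel list of label names; the class and method names are hypothetical, it returns a `List` rather than a Buffer, and ties go to the first label, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the library.
final class ProbabilityLabelsSketch {
    // For each row, pick the label whose column holds the largest probability.
    static List<String> toLabels(List<String> labels, double[][] probabilities) {
        List<String> chosen = new ArrayList<>();
        for (double[] row : probabilities) {
            if (row.length != labels.size() || row.length == 0) {
                throw new IllegalArgumentException("Each row needs one probability per label");
            }
            int best = 0;
            for (int i = 1; i < row.length; i++) {
                // Strict comparison keeps the first label on ties.
                if (row[i] > row[best]) {
                    best = i;
                }
            }
            chosen.add(labels.get(best));
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("cat", "dog");
        double[][] probabilities = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(toLabels(labels, probabilities));
    }
}
```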
public static java.util.Map inferenceTargetLabelMap(java.lang.Object ds)
Return a map of val->idx for the inference target.