tech.v3.dataset.modelling

Methods related specifically to machine learning such as setting the inference target. This file integrates tightly with tech.v3.dataset.categorical which provides categorical -> number and one-hot transformation pathways.

The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta.

column-values->categorical

(column-values->categorical dataset src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values. In the case of one-hot mappings, src-column must be the original column name before the one-hot map.
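A minimal sketch of the round trip described above. It assumes tech.ml.dataset is on the classpath and that tech.v3.dataset/categorical->number accepts a vector of column names as its column selector; the dataset and its :fruit column are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical data: one string column and one numeric column.
(def fruit-ds
  (ds/->dataset {:fruit  ["apple" "pear" "apple"]
                 :weight [1.0 2.0 1.5]}))

;; Encode :fruit as numbers (assumed call shape: dataset, then column names).
(def encoded-ds (ds/categorical->number fruit-ds [:fruit]))

;; Map the encoded values back to the original strings.
(def decoded (ds-mod/column-values->categorical encoded-ds :fruit))
```

If the encoding step behaves as assumed, decoded should hold the same strings that went into :fruit, in row order.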

dataset->categorical-xforms

(dataset->categorical-xforms ds)

Given a dataset, return a map of column-name->xform information.

feature-ecount

(feature-ecount dataset)

Number of feature columns. Feature columns are columns that are not inference targets.

inference-column?

(inference-column? col)

inference-target-column-names

(inference-target-column-names ds)

Return the names of the columns that are inference targets.

inference-target-ds

(inference-target-ds dataset)

Given a dataset, return the reverse-mapped inference target columns, or nil if there are no inference targets.

inference-target-label-inverse-map

(inference-target-label-inverse-map dataset & [label-columns])

Given options generated during ETL operations and annotated with a :label-columns sequence containing one label column, generate a reverse map that maps from a dataset value back to the label that generated that value.

inference-target-label-map

(inference-target-label-map dataset & [label-columns])

k-fold-datasets

(k-fold-datasets dataset k options)
(k-fold-datasets dataset k)

Given one dataset, prepare K datasets using the k-fold algorithm. :randomize-dataset? defaults to true, which realizes the entire dataset, so use with care if you have large datasets.

Returns a sequence of maps with the keys :test-ds and :train-ds.

Options:

  • :randomize-dataset? - When true, shuffle the dataset. In that case :seed may be provided. Defaults to true.
  • :seed - when :randomize-dataset? is true then this can either be an implementation of java.util.Random or an integer seed which will be used to construct java.util.Random.
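A minimal usage sketch, assuming tech.ml.dataset is on the classpath. The 11-row dataset, its column names, and the seed value are invented for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical 11-row dataset.
(def small-ds
  (ds/->dataset {:x (range 11)
                 :y (map #(* 2 %) (range 11))}))

;; Three folds, shuffled with a fixed integer seed for repeatability.
(def folds (vec (ds-mod/k-fold-datasets small-ds 3 {:seed 42})))

;; Each fold pairs a held-out test set with the remaining rows for training.
(doseq [{:keys [train-ds test-ds]} folds]
  (println "train rows:" (ds/row-count train-ds)
           "test rows:" (ds/row-count test-ds)))
```

Passing {:randomize-dataset? false} instead should keep the original row order, which avoids the shuffle on large datasets.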

labels

(labels dataset)

Return the labels. The labels sequence is the reverse mapped inference column. This returns a single column of data or errors out.

model-type

(model-type dataset & [column-name-seq])

Check the label column after dataset processing. Returns either :regression or :classification.

num-inference-classes

(num-inference-classes dataset)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Errors if the dataset is not a classification dataset.

probability-distributions->label-column

(probability-distributions->label-column prob-ds dst-colname label-column-datatype)
(probability-distributions->label-column prob-ds dst-colname)

Given a dataset in which the column names describe labels and each row describes a probability distribution, create a label column by finding the maximum value in each row and assigning that row the name of the column that holds it.
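The row-wise selection can be illustrated in plain Clojure. This is a sketch of the idea only, not the library's implementation: rows are modelled as maps from label to probability, and row->label is a hypothetical helper.

```clojure
;; Each row is a map from label (column name) to probability.
(def prob-rows
  [{:cat 0.7 :dog 0.2 :bird 0.1}
   {:cat 0.1 :dog 0.3 :bird 0.6}])

(defn row->label
  "Return the key whose value is largest in the row map."
  [row]
  (key (apply max-key val row)))

(def predicted-labels (mapv row->label prob-rows))
;; => [:cat :bird]
```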

set-inference-target

(set-inference-target dataset target-name-or-target-name-seq)

Set the inference target on the column. This sets the :column-type member of the column metadata to :inference-target?.
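A minimal sketch of marking a target and then reading it back, assuming tech.ml.dataset is on the classpath. The dataset and column names are invented, and the exact metadata keys printed at the end may differ between library versions.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical dataset: two feature columns and one label column.
(def raw-ds
  (ds/->dataset {:x1 [1.0 2.0 3.0]
                 :x2 [0.5 0.1 0.9]
                 :y  [0 1 0]}))

;; Mark :y as the inference target; a new dataset is returned.
(def target-ds (ds-mod/set-inference-target raw-ds :y))

;; Names of the columns marked as targets.
(println (ds-mod/inference-target-column-names target-ds))

;; Count of the remaining (feature) columns.
(println (ds-mod/feature-ecount target-ds))

;; Column metadata, where the target marking is recorded.
(println (meta (ds/column target-ds :y)))
```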

train-test-split

(train-test-split dataset {:keys [train-fraction], :or {train-fraction 0.7}, :as options})
(train-test-split dataset)

Probabilistically split the dataset returning a map of {:train-ds :test-ds}.

Options:

  • :randomize-dataset? - When true, shuffle the dataset. In that case :seed may be provided. Defaults to true.
  • :seed - when :randomize-dataset? is true then this can either be an implementation of java.util.Random or an integer seed which will be used to construct java.util.Random.
  • :train-fraction - Fraction of the dataset to use as training set. Defaults to 0.7.
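A minimal usage sketch, assuming tech.ml.dataset is on the classpath. The 10-row dataset, the 0.8 fraction, and the seed are illustrative choices.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical 10-row dataset.
(def ten-row-ds
  (ds/->dataset {:x (range 10)
                 :y (map #(* 3 %) (range 10))}))

;; Shuffle with a fixed seed, then keep 80% of the rows for training.
(def split
  (ds-mod/train-test-split ten-row-ds {:train-fraction 0.8 :seed 42}))

(println "train rows:" (ds/row-count (:train-ds split))
         "test rows:" (ds/row-count (:test-ds split)))
```

Fixing :seed makes the split repeatable across runs, which helps when comparing models on the same partition.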