tech.v3.dataset.neanderthal

Conversion of a dataset to and from a neanderthal dense matrix, plus dataset transformations such as PCA and covariance and correlation matrices.

Please include these additional dependencies in your project:

  [uncomplicate/neanderthal "0.45.0"]

dataset->dense

(dataset->dense dataset neanderthal-layout datatype)
(dataset->dense dataset neanderthal-layout)
(dataset->dense dataset)

Convert a dataset into a dense neanderthal CPU matrix. If the matrix is column-major, copies from the dataset into neanderthal can potentially be accelerated.

  • neanderthal-layout - either :column for a column-major matrix or :row for a row-major matrix.
  • datatype - either :float64 or :float32
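A minimal usage sketch, assuming `tech.v3.dataset` is on the classpath and Neanderthal's native backend is installed. The aliases `ds`, `ds-nean` and `ncore` and the sample data are illustrative choices, not part of this namespace.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.neanderthal :as ds-nean]
         '[uncomplicate.neanderthal.core :as ncore])

;; Illustrative data: three rows, two numeric columns.
(def example-ds
  (ds/->dataset {:a [1.0 2.0 3.0]
                 :b [4.0 5.0 6.0]}))

;; Request a column-major, double-precision matrix.
(def dense-matrix (ds-nean/dataset->dense example-ds :column :float64))

;; Inspect the shape of the result (rows, then columns).
(println (ncore/mrows dense-matrix) (ncore/ncols dense-matrix))
```

Dataset rows map to matrix rows and dataset columns to matrix columns.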

dense->dataset

(dense->dataset matrix)

Given a neanderthal matrix, convert its columns into the columns of a tech.v3.dataset. The conversion is done in-place, without copying the matrix data. If you would like to copy the neanderthal matrix into JVM arrays, call dtype/clone on the result of this method.
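A hedged sketch of the conversion followed by a copy. It assumes `dtype` refers to `tech.v3.datatype` and builds the input with `dge` from `uncomplicate.neanderthal.native`. The var names are illustrative.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.neanderthal :as ds-nean]
         '[tech.v3.datatype :as dtype]
         '[uncomplicate.neanderthal.native :as nnative])

;; A 3x2 double matrix; with the default layout the values fill column by column.
(def matrix (nnative/dge 3 2 [1 2 3 4 5 6]))

;; Per the description above, this conversion is in-place.
(def view-ds (ds-nean/dense->dataset matrix))

;; Clone so the result is backed by JVM arrays instead.
(def copied-ds (dtype/clone view-ds))
```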

fit-pca

(fit-pca dataset {:keys [n-components variance-amount], :or {variance-amount 0.95}, :as options})
(fit-pca dataset)

Run PCA on the dataset. Dataset must not have missing values or non-numeric string columns.

Keep in mind that PCA can be strongly influenced by outliers in the dataset; a probabilistic method or some form of auto-encoder dimensionality reduction may be more effective for your problem.

Returns pca-info, a map of:

  • :means - vec of means
  • :eigenvalues - vec of eigenvalues
  • :eigenvectors - matrix of eigenvectors

Use transform-pca with a dataset and the returned value to perform PCA on a dataset.
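The fit-then-transform flow might look like the following sketch. The aliases and sample data are assumptions made for illustration, and the example requires Neanderthal's native backend to be installed.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.neanderthal :as ds-nean])

;; Illustrative data: five rows, three numeric columns, no missing values.
(def example-ds
  (ds/->dataset {:x [2.5 0.5 2.2 1.9 3.1]
                 :y [2.4 0.7 2.9 2.2 3.0]
                 :z [1.0 3.0 0.5 1.5 0.2]}))

;; Fit once, asking for two components.
(def pca-info (ds-nean/fit-pca example-ds {:n-components 2}))

;; Project the dataset using the fitted information.
(def projected (ds-nean/transform-pca example-ds pca-info))
```

The same pca-info can be reused to project other datasets that have the same columns.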

Options:

  • method - :svd or :cov. Selects the SVD-based or the covariance-based method. SVD is faster, but with the covariance method the post-projection variances are accurate. Both methods produce similar projection matrices. Defaults to :cov.
  • variance-amount - fractional amount of variance to keep. Defaults to 0.95.
  • n-components - integer. If provided, overrides variance-amount and directly sets the number of components to keep, and therefore the number of result columns.
  • covariance-bias? - When using :cov, divide by n-rows if true and by (dec n-rows) if false. Defaults to false.
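To make the variance-amount option concrete, here is a plain-Clojure sketch of one common selection rule: keep the smallest number of leading eigenvalues whose cumulative share of the total reaches the requested fraction. The helper `components-for-variance` is hypothetical and is not part of this namespace; the library's exact rule may differ.

```clojure
;; Hypothetical helper, for illustration only.
(defn components-for-variance
  "Smallest number of leading eigenvalues (assumed sorted descending) whose
  cumulative share of the total reaches variance-amount."
  [eigenvalues variance-amount]
  (let [total      (reduce + eigenvalues)
        cumulative (reductions + eigenvalues)
        ;; Count the prefixes that still fall short of the target.
        short      (count (take-while #(< (/ % total) variance-amount)
                                      cumulative))]
    ;; One more component is needed to reach the target; never exceed
    ;; the number of eigenvalues available.
    (min (count eigenvalues) (inc short))))

(components-for-variance [6.0 3.0 1.0] 0.95) ;; => 3
(components-for-variance [6.0 3.0 1.0] 0.5)  ;; => 1
```

With eigenvalues 6, 3 and 1 the cumulative shares are 0.6, 0.9 and 1.0, so a 0.95 target needs all three components while a 0.5 target needs only the first.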

fit-pca!

(fit-pca! tensor {:keys [method covariance-bias?], :or {method :cov, covariance-bias? false}, :as _options})
(fit-pca! tensor)

Run Principal Component Analysis on a tensor.

Keep in mind that PCA can be strongly influenced by outliers in the dataset; a probabilistic method or some form of auto-encoder dimensionality reduction may be more effective for your problem.

Returns a map of:

  • :means - vec of means
  • :eigenvalues - vec of eigenvalues. If :cov is used, these are the variances of the columns of the post-projected tensor. If :svd is used, they are only approximately so.
  • :eigenvectors - matrix of eigenvectors

Options:

  • method - :svd or :cov. Selects the SVD-based or the covariance-based method. SVD is faster, but with the covariance method the post-projection variances are accurate. Both methods produce an identical or extremely similar projection matrix. Defaults to :cov.
  • covariance-bias? - When using :cov, divide by n-rows if true and by (dec n-rows) if false. Defaults to false.
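The effect of covariance-bias? on the divisor can be shown on a single column in plain Clojure. The `variance` function below is a hypothetical illustration of the n versus (dec n) choice, not the library's implementation.

```clojure
;; Hypothetical helper, for illustration only.
(defn variance
  "Variance of xs, dividing by n when bias? is true and by (dec n) otherwise."
  [xs bias?]
  (let [n    (count xs)
        mean (/ (reduce + xs) n)
        ;; Sum of squared deviations from the mean.
        ss   (reduce + (map #(let [d (- % mean)] (* d d)) xs))]
    (/ ss (if bias? n (dec n)))))

(variance [1.0 2.0 3.0 4.0] true)  ;; => 1.25 (5.0 / 4)
(variance [1.0 2.0 3.0 4.0] false) ;; 5.0 / 3, roughly 1.667
```

The unbiased (dec n) form is the default described above; the two results converge as the number of rows grows.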

neanderthal-enabled?

(neanderthal-enabled?)

transform-pca

(transform-pca dataset {:keys [n-components result-datatype], :as pca-transform})

PCA-transform the dataset, returning a new dataset. The method used to generate the PCA information is indicated in the metadata of the dataset.

transform-pca!

(transform-pca! tensor pca-info n-components)

PCA-transform the tensor, returning a new tensor. Mean-centers the input tensor in-place.