tech.v3.dataset.neanderthal
Conversion of a dataset to/from a neanderthal dense matrix, as well as various dataset transformations such as PCA, covariance and correlation matrices.
Please include these additional dependencies in your project:
[uncomplicate/neanderthal "0.45.0"]
dataset->dense
(dataset->dense dataset neanderthal-layout datatype)
(dataset->dense dataset neanderthal-layout)
(dataset->dense dataset)
Convert a dataset into a dense neanderthal CPU matrix. If the matrix is column-major, then potentially you can get accelerated copies from the dataset into neanderthal.
- neanderthal-layout - either :column for a column-major matrix or :row for a row-major matrix.
- datatype - either :float64 or :float32
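As a rough illustration of why a column-major target can be cheaper to fill, the sketch below flattens a few columns in both layouts using plain Clojure vectors. It does not call neanderthal: `flatten-columns` and `example-columns` are hypothetical names, and the assumption is that each dataset column is stored as one contiguous block.

```clojure
(defn flatten-columns
  "Flatten a seq of equal-length columns into a single vector.
  :column keeps every column contiguous; :row interleaves the columns
  so that every row is contiguous."
  [columns layout]
  (case layout
    :column (vec (apply concat columns))
    :row    (vec (apply mapcat vector columns))))

;; Two columns of three rows each (hypothetical data).
(def example-columns [[1.0 2.0 3.0] [10.0 20.0 30.0]])

(flatten-columns example-columns :column)
;; => [1.0 2.0 3.0 10.0 20.0 30.0]   each column copied as one block

(flatten-columns example-columns :row)
;; => [1.0 10.0 2.0 20.0 3.0 30.0]   elements copied one at a time
```

In the :column case every source column lands in one contiguous run of the target, which is the situation where a bulk copy is possible.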
dense->dataset
(dense->dataset matrix)
Given a neanderthal matrix, convert its columns into the columns of a tech.v3.dataset. This does the conversion in-place. If you would like to copy the neanderthal matrix into JVM arrays, call dtype/clone on the result afterwards.
fit-pca
(fit-pca dataset {:keys [n-components variance-amount], :or {variance-amount 0.95}, :as options})
(fit-pca dataset)
Run PCA on the dataset. Dataset must not have missing values or non-numeric string columns.
Keep in mind that PCA may be highly influenced by outliers in the dataset, and a probabilistic or auto-encoder based dimensionality reduction may be more effective for your problem.
Returns pca-info, a map of:
- :means - vec of means
- :eigenvalues - vec of eigenvalues
- :eigenvectors - matrix of eigenvectors
Use transform-pca with a dataset and the returned value to project a dataset onto the principal components.
Options:
- method - :svd or :cov - Either use SVD or covariance based method. SVD is faster but covariance method means the post-projection variances are accurate. Defaults to :cov. Both methods produce similar projection matrices.
- variance-amount - fractional amount of variance to keep. Defaults to 0.95.
- n-components - If provided, overrides variance-amount and sets the number of components to keep. This controls the number of result columns directly as an integer.
- covariance-bias? - When using :cov, divide by n-rows if true and (dec n-rows) if false. Defaults to false.
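To make the relationship between variance-amount and n-components concrete, here is a minimal sketch in plain Clojure of one common way to turn a variance fraction into a component count. It assumes the eigenvalues are sorted in descending order; `components-for-variance` is a hypothetical helper, and the library's exact selection rule may differ.

```clojure
(defn components-for-variance
  "Smallest number of leading eigenvalues whose cumulative share of the
  total variance reaches `variance-amount`. Assumes `eigenvalues` is
  sorted from largest to smallest."
  [eigenvalues variance-amount]
  (let [total      (reduce + eigenvalues)
        cumulative (reductions + eigenvalues)]
    (->> cumulative
         ;; count the prefixes that still fall short of the target
         (take-while #(< (/ % total) variance-amount))
         count
         inc
         ;; never ask for more components than exist
         (min (count eigenvalues)))))

;; Cumulative shares for these eigenvalues are 0.6, 0.9 and 1.0.
(components-for-variance [6.0 3.0 1.0] 0.5)  ;; => 1
(components-for-variance [6.0 3.0 1.0] 0.8)  ;; => 2
(components-for-variance [6.0 3.0 1.0] 0.95) ;; => 3
```

Passing n-components skips this kind of calculation entirely and fixes the number of result columns.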
fit-pca!
(fit-pca! tensor {:keys [method covariance-bias?], :or {method :cov, covariance-bias? false}, :as _options})
(fit-pca! tensor)
Run Principal Component Analysis on a tensor.
Keep in mind that PCA may be highly influenced by outliers in the dataset, and a probabilistic or auto-encoder based dimensionality reduction may be more effective for your problem.
Returns a map of:
- :means - vec of means
- :eigenvalues - vec of eigenvalues. These are the variances of the columns of the post-projected tensor if :cov is used. They are only approximate if :svd is used.
- :eigenvectors - matrix of eigenvectors
Options:
- method - :svd or :cov - Either use SVD or covariance based method. SVD is faster but covariance method means the post-projection variances are accurate. Both methods produce an identical or extremely similar projection matrix. Defaults to :cov.
- covariance-bias? - When using :cov, divide by n-rows if true and (dec n-rows) if false. Defaults to false.
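The effect of covariance-bias? can be seen on a single column, where the covariance reduces to the variance. The sketch below is plain Clojure and does not call the library; `column-variance` is a hypothetical helper that only mirrors the divisor rule described above.

```clojure
(defn column-variance
  "Variance of `xs`, dividing the sum of squared deviations by n when
  `bias?` is true and by (dec n) when it is false."
  [xs bias?]
  (let [n      (count xs)
        mean   (/ (reduce + xs) n)
        sum-sq (reduce + (map #(let [d (- % mean)] (* d d)) xs))]
    (/ sum-sq (if bias? n (dec n)))))

;; Mean is 2.5 and the squared deviations sum to 5.0.
(column-variance [1.0 2.0 3.0 4.0] true)  ;; => 1.25   (5.0 / 4)
(column-variance [1.0 2.0 3.0 4.0] false) ;; 5.0 / 3, roughly 1.667
```

The two results differ by a factor of n / (n - 1), so the choice matters mostly for small row counts.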
transform-pca
(transform-pca dataset {:keys [n-components result-datatype], :as pca-transform})
PCA transform the dataset returning a new dataset. The method used to generate the pca information is indicated in the metadata of the dataset.
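A minimal usage sketch of fitting and then transforming, based only on the signatures listed in this namespace. It assumes tech.ml.dataset and the neanderthal dependency above (with its native libraries) are available on the classpath; the `ds-neand` alias, `example-ds` and its values are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.neanderthal :as ds-neand])

;; A small all-numeric dataset with no missing values (hypothetical data).
(def example-ds
  (ds/->dataset {:x [1.0 2.0 3.0 4.0 5.0]
                 :y [2.1 3.9 6.2 8.0 9.9]
                 :z [0.5 0.4 0.7 0.3 0.6]}))

;; Fit once, asking for two components directly instead of relying on
;; the default variance-amount.
(def pca-info (ds-neand/fit-pca example-ds {:n-components 2}))

;; Project the same dataset using the value returned from fit-pca.
(def projected (ds-neand/transform-pca example-ds pca-info))

(println (ds/column-count projected) "columns,"
         (ds/row-count projected) "rows")
```

Because the fitted map is an ordinary value, the same pca-info can be reused to project other datasets that have the same columns.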
transform-pca!
(transform-pca! tensor pca-info n-components)
PCA transform the tensor, returning a new tensor. Mean-centers the input tensor in place.
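Conceptually, the transform subtracts the column means and then takes a dot product of every row with each retained eigenvector. The sketch below shows that arithmetic on plain Clojure vectors. It is not the library's implementation: `center-rows` and `project-rows` are hypothetical helpers, it builds new vectors rather than mutating anything, and it assumes the eigenvectors are supplied as a seq of vectors, one per component.

```clojure
(defn center-rows
  "Subtract the per-column `means` from every row."
  [rows means]
  (mapv #(mapv - % means) rows))

(defn project-rows
  "Mean-center `rows`, then project each row onto the first
  `n-components` eigenvectors."
  [rows means eigenvectors n-components]
  (let [kept (take n-components eigenvectors)]
    (mapv (fn [row]
            (mapv (fn [ev] (reduce + (map * row ev))) kept))
          (center-rows rows means))))

;; Two rows, two columns, keeping one component aligned with the
;; first axis (hypothetical values).
(project-rows [[1.0 2.0] [3.0 4.0]]   ; rows
              [2.0 3.0]               ; column means
              [[1.0 0.0] [0.0 1.0]]   ; eigenvectors
              1)
;; => [[-1.0] [1.0]]
```

The result has one column per retained component, which matches the role of n-components in the signature above.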