TMD 7.035
A high-performance data processing system for Clojure.
Topics
- tech.ml.dataset Getting Started
- tech.ml.dataset Walkthrough
- tech.ml.dataset Quick Reference
- tech.ml.dataset Columns, Readers, and Datatypes
- tech.ml.dataset And nippy
- tech.ml.dataset Supported Datatypes
Namespaces
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating in memory datasets.
Public variables and functions:
- ->>dataset
- ->dataset
- add-column
- add-or-update-column
- all-descriptive-stats-names
- append-columns
- assoc-ds
- assoc-metadata
- bind->
- brief
- categorical->number
- categorical->one-hot
- column
- column->dataset
- column-cast
- column-count
- column-labeled-mapseq
- column-map
- column-map-m
- column-names
- columns
- columns-with-missing-seq
- columnwise-concat
- concat
- concat-copying
- concat-inplace
- data->dataset
- dataset->data
- dataset-name
- dataset-parser
- dataset?
- descriptive-stats
- drop-columns
- drop-missing
- drop-rows
- empty-dataset
- ensure-array-backed
- filter
- filter-column
- filter-dataset
- group-by
- group-by->indexes
- group-by-column
- group-by-column->indexes
- group-by-column-consumer
- has-column?
- head
- induction
- major-version
- mapseq-parser
- mapseq-reader
- mapseq-rf
- min-n-by-column
- missing
- new-column
- new-dataset
- order-column-names
- pmap-ds
- print-all
- rand-nth
- remove-column
- remove-columns
- remove-rows
- rename-columns
- replace-missing
- replace-missing-value
- reverse-rows
- row-at
- row-count
- row-map
- row-mapcat
- rows
- rowvec-at
- rowvecs
- sample
- select
- select-by-index
- select-columns
- select-columns-by-index
- select-missing
- select-rows
- set-dataset-name
- shape
- shuffle
- sort-by
- sort-by-column
- tail
- take-nth
- unique-by
- unique-by-column
- unordered-select
- unroll-column
- update
- update-column
- update-columns
- update-columnwise
- update-elemwise
- value-reader
- write!
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversion are supported: a straight value->integer map and one-hot encoding.
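As a rough illustration of the two conversions, the sketch below uses plain Clojure collections rather than this namespace's own functions. The helpers `value->index` and `one-hot` are hypothetical names chosen for this example, and the library's actual output layout may differ.

```clojure
;; Plain-Clojure sketch; these helpers are illustrative only and are not
;; part of tech.v3.dataset.categorical.

(defn value->index
  "Builds a map from each distinct value to an integer, in order of first
  appearance."
  [values]
  (zipmap (distinct values) (range)))

(defn one-hot
  "Returns one map per input value, holding 1 under that value's key and 0
  under every other distinct value."
  [values]
  (let [categories (distinct values)]
    (mapv (fn [v]
            (into {} (map (fn [c] [c (if (= c v) 1 0)])) categories))
          values)))

(def fruits ["apple" "pear" "apple" "fig"])

;; Straight value->integer conversion: look each value up in the map.
(def lookup (value->index fruits))
(def encoded (mapv lookup fruits))

;; One-hot conversion: one indicator entry per distinct value.
(def hot (one-hot fruits))
```

The integer map is compact but implies an ordering between categories, while the one-hot form avoids that at the cost of one indicator per distinct value.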
tech.v3.dataset.clipboard
Optional namespace that copies a dataset to the clipboard for pasting into applications such as Excel or Google Sheets.
Public variables and functions:
tech.v3.dataset.column-filters
Queries to select column subsets that have various properties, such as all numeric columns, all feature columns, or columns that have a specific datatype.
Public variables and functions:
tech.v3.dataset.io.datetime
Helpful and well-tested string->datetime pathways.
tech.v3.dataset.io.string-row-parser
Parsing functions based on raw data that is represented by a sequence of string arrays.
Public variables and functions:
tech.v3.dataset.io.univocity
Bindings to univocity. Transforms CSV and TSV files into sequences of string arrays that are then passed into tech.v3.dataset.io.string-row-parser methods.
Public variables and functions:
tech.v3.dataset.join
Implementation of join algorithms, both exact (hash-join) and near.
Public variables and functions:
tech.v3.dataset.math
Various mathematical transformations of datasets such as (inefficiently) building simple tables, pca, and normalizing columns to have mean of 0 and variance of 1. More in-depth transformations are found at tech.v3.dataset.neanderthal.
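To make the normalization concrete, here is a minimal plain-Clojure sketch that rescales one column of numbers to mean 0 and variance 1. The function `standardize` is a hypothetical helper written for this example and is not the namespace's API. It assumes the population variance (dividing by n), which may differ from what the library uses.

```clojure
;; Plain-Clojure sketch; `standardize` is illustrative only and is not part
;; of tech.v3.dataset.math.

(defn standardize
  "Subtracts the mean from each value and divides by the standard deviation.
  Assumes the population variance (divide by n) and a non-constant column."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + 0.0 xs) n)
        variance (/ (reduce + 0.0 (map (fn [x]
                                         (let [d (- x mean)] (* d d)))
                                       xs))
                    n)
        stddev   (Math/sqrt variance)]
    (mapv (fn [x] (/ (- x mean) stddev)) xs)))

(def column [2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0])

(def standardized (standardize column))
```

A column whose values are all equal has zero variance, so a real implementation needs a guard for that case.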
Public variables and functions:
tech.v3.dataset.metamorph
This is an auto-generated API system. It scans the namespaces and changes the first argument of each function to be metamorph-compliant, which means transforming an argument that is just a dataset into an argument that is a metamorph context: a map of {:metamorph/data ds}. The functions also return their result as a metamorph context.
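The wrapping idea can be sketched in plain Clojure. This is only an illustration of the dataset-to-context transformation, not the generated code itself: `lift-to-context`, `drop-column`, and the `:example/tag` key are hypothetical, a plain map stands in for a dataset, and the sketch assumes the common metamorph convention that a pipeline step is a function from context to context.

```clojure
;; Plain-Clojure sketch; `lift-to-context` is illustrative only and is not
;; part of tech.v3.dataset.metamorph.

(defn lift-to-context
  "Turns a dataset-first function (f ds & args) into a function of args that
  returns a context->context step. The dataset is read from, and the result
  written back to, the :metamorph/data key."
  [f]
  (fn [& args]
    (fn [ctx]
      (update ctx :metamorph/data (fn [ds] (apply f ds args))))))

;; Stand-in for a dataset: a map of column name -> vector of values.
(def ds {:a [1 2 3] :b [4 5 6]})

;; An ordinary function that takes the dataset as its first argument.
(defn drop-column
  [dataset colname]
  (dissoc dataset colname))

;; The lifted version takes the remaining arguments and yields a step.
(def ctx-drop-column (lift-to-context drop-column))

;; Other keys in the context pass through untouched.
(def result
  ((ctx-drop-column :b) {:metamorph/data ds :example/tag :kept}))
```

Because each step takes and returns the whole context, steps compose with ordinary function composition.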
Public variables and functions:
- add-column
- add-or-update-column
- append-columns
- assoc-ds
- assoc-metadata
- brief
- build-pipelined-function
- categorical->number
- categorical->one-hot
- column
- column->dataset
- column-cast
- column-count
- column-labeled-mapseq
- column-map
- column-names
- column-values->categorical
- columns
- columns-with-missing-seq
- columnwise-concat
- concat
- concat-copying
- concat-inplace
- data->dataset
- dataset->categorical-xforms
- dataset->data
- dataset-name
- dataset?
- descriptive-stats
- drop-columns
- drop-missing
- drop-rows
- empty-dataset
- ensure-array-backed
- feature-ecount
- filter
- filter-column
- filter-dataset
- group-by
- group-by->indexes
- group-by-column
- group-by-column->indexes
- group-by-column-consumer
- has-column?
- head
- induction
- inference-column?
- inference-target-column-names
- inference-target-ds
- inference-target-label-inverse-map
- inference-target-label-map
- k-fold-datasets
- labels
- mapseq-reader
- min-n-by-column
- missing
- model-type
- new-column
- new-dataset
- num-inference-classes
- order-column-names
- pmap-ds
- print-all
- probability-distributions->label-column
- rand-nth
- remove-column
- remove-columns
- remove-rows
- rename-columns
- replace-missing
- replace-missing-value
- reverse-rows
- row-at
- row-count
- row-map
- row-mapcat
- rows
- rowvec-at
- rowvecs
- sample
- select
- select-by-index
- select-columns
- select-columns-by-index
- select-missing
- select-rows
- set-dataset-name
- set-inference-target
- shape
- shuffle
- sort-by
- sort-by-column
- tail
- take-nth
- train-test-split
- unique-by
- unique-by-column
- unordered-select
- unroll-column
- update
- update-column
- update-columns
- update-columnwise
- update-elemwise
- value-reader
- write!
tech.v3.dataset.modelling
Methods related specifically to machine learning such as setting the inference target. This file integrates tightly with tech.v3.dataset.categorical which provides categorical -> number and one-hot transformation pathways.
Public variables and functions:
- column-values->categorical
- dataset->categorical-xforms
- feature-ecount
- inference-column?
- inference-target-column-names
- inference-target-ds
- inference-target-label-inverse-map
- inference-target-label-map
- k-fold-datasets
- labels
- model-type
- num-inference-classes
- probability-distributions->label-column
- set-inference-target
- train-test-split
tech.v3.dataset.reductions
Specific high-performance reductions intended to be performed over a sequence of datasets. This allows aggregations to be done in situations where the dataset is larger than what will fit in memory on a normal machine. For this reason, summation is implemented using the Kahan algorithm, and various statistical methods are done using statistical estimation techniques and thus are prefixed with prob-, which is short for probabilistic.
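For readers unfamiliar with Kahan (compensated) summation, the following plain-Clojure sketch shows the technique applied across a sequence of chunks, with nested sequences of numbers standing in for a sequence of datasets. The names `kahan-step` and `kahan-sum-chunks` are hypothetical. This is not the namespace's implementation, only the textbook algorithm.

```clojure
;; Plain-Clojure sketch of Kahan summation; these helpers are illustrative
;; only and are not part of tech.v3.dataset.reductions.

(defn kahan-step
  "Adds x to a running [sum compensation] pair and returns the new pair."
  [[sum c] x]
  (let [y      (- (double x) c)  ; apply the correction carried so far
        t      (+ sum y)         ; low-order bits of y can be lost here
        c-next (- (- t sum) y)]  ; recover the part that was lost
    [t c-next]))

(defn kahan-sum-chunks
  "Sums every number in a sequence of chunks, carrying the running sum and
  its compensation term from one chunk to the next."
  [chunks]
  (first (reduce (fn [acc chunk] (reduce kahan-step acc chunk))
                 [0.0 0.0]
                 chunks)))

;; Ten chunks of 100000 copies of 0.1; the exact total is close to 100000.
(def chunks (repeat 10 (repeat 100000 0.1)))

(def naive-total (reduce + 0.0 (apply concat chunks)))
(def compensated-total (kahan-sum-chunks chunks))
```

Only the two-element accumulator needs to survive between chunks, which is why the approach suits data that is processed one piece at a time.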
Public variables and functions:
tech.v3.dataset.reductions.apache-data-sketch
Reducers based on the Apache DataSketches family of algorithms.
Public variables and functions:
tech.v3.dataset.rolling
Implement a generalized rolling window including support for time-based variable width windows.
tech.v3.dataset.set
Extensions to datasets to do per-row bag-semantics set/union and intersection.
Public variables and functions:
tech.v3.dataset.tensor
Conversion mechanisms from dataset to tensor and back.
Public variables and functions:
tech.v3.dataset.zip
Load zip data. Zip files with a single file entry can be loaded with ->dataset. When a zip file has multiple entries you have to call zipfile->dataset-seq.
Public variables and functions:
tech.v3.libs.arrow
Support for reading/writing Apache Arrow datasets. Datasets may be memory mapped but default to being read via an input stream.
Public variables and functions:
tech.v3.libs.clj-transit
Transit bindings for the jvm version of tech.v3.dataset.
Public variables and functions:
tech.v3.libs.fastexcel
Parse a dataset in xlsx format. This namespace auto-registers a handler for the 'xlsx' file type so that when using ->dataset, xlsx will automatically map to (first (workbook->datasets)).
Public variables and functions:
tech.v3.libs.guava.cache
Use a Google Guava cache to memoize function results. The function must not return nil values. Exceptions propagate to the caller.
Public variables and functions:
tech.v3.libs.parquet
Support for reading Parquet files. You must require this namespace to enable parquet read/write support.
Public variables and functions:
tech.v3.libs.poi
Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for the xls file type so that when using ->dataset, xls will automatically map to (first (workbook->datasets)).
Public variables and functions:
tech.v3.libs.tribuo
Bindings to make working with tribuo more straightforward when using datasets.