TMD 7.035

A high-performance data processing system for Clojure.

Namespaces

tech.v3.dataset

Column-major dataset abstraction for efficiently manipulating in-memory datasets.
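
As a rough illustration of the column-major model, the sketch below builds a small dataset from row maps and then works with it by column name. The data and the var names (`people`, `over-40`) are made up for the example.

```clojure
(require '[tech.v3.dataset :as ds])

;; Build a dataset from a sequence of maps; each key becomes a column.
(def people
  (ds/->dataset [{:name "Ada"   :age 36}
                 {:name "Alan"  :age 41}
                 {:name "Grace" :age 85}]))

(ds/row-count people)    ; number of rows
(ds/column-names people) ; names of the columns

;; Keep only the rows whose :age value passes the predicate.
(def over-40 (ds/filter-column people :age #(> % 40)))

;; Read one column back out as a plain vector.
(vec (ds/column over-40 :name))
```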

tech.v3.dataset.categorical

Conversions of categorical values into numbers and back. Two forms of conversions are supported, a straight value->integer map and one-hot encoding.
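
A minimal sketch of the value->integer form, assuming the fit/transform/invert functions of this namespace behave as described; the `fruit` dataset is invented for the example. The integer assigned to each value is an implementation detail, so the sketch only checks the round trip.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.categorical :as ds-cat])

(def fruit (ds/->dataset {:fruit ["apple" "pear" "apple" "fig"]}))

;; Fit a value->integer lookup table for the :fruit column, then apply it.
(def fit     (ds-cat/fit-categorical-map fruit :fruit))
(def encoded (ds-cat/transform-categorical-map fruit fit))

;; Inverting with the same fit data should recover the original strings.
(def decoded (ds-cat/invert-categorical-map encoded fit))
(vec (ds/column decoded :fruit))
```

Keeping the fit data separate from the transformed dataset is what allows the same mapping to be reapplied to new data later.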

tech.v3.dataset.clipboard

Optional namespace that copies a dataset to the clipboard for pasting into applications such as Excel or Google Sheets.

tech.v3.dataset.column

tech.v3.dataset.column-filters

Queries to select column subsets that have various properties such as all numeric columns, all feature columns, or columns that have a specific datatype.
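
For instance, a filter can be applied to a mixed dataset to get back only the matching columns. This is a sketch with made-up data; it assumes the `numeric` and `string` filters of this namespace.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column-filters :as cf])

(def mixed
  (ds/->dataset {:id    [1 2 3]
                 :score [0.5 0.7 0.9]
                 :label ["a" "b" "c"]}))

;; Each filter returns a new dataset holding only the matching columns.
(ds/column-names (cf/numeric mixed))
(ds/column-names (cf/string mixed))
```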

tech.v3.dataset.io.csv

CSV parsing based on charred.api/read-csv.
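
The sketch below parses a small inline CSV. It assumes `csv->dataset` accepts an input stream and the same options map as `->dataset`; the inline text is wrapped in a stream on the assumption that a bare string would be read as a path or URL.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.io.csv :as ds-csv])

(def csv-text "a,b\n1,x\n3,y\n")

;; Assumption: string inputs are interpreted as paths, so wrap the text in a stream.
(def parsed
  (ds-csv/csv->dataset
   (java.io.ByteArrayInputStream. (.getBytes csv-text "UTF-8"))
   {:key-fn keyword})) ; turn header strings into keyword column names

(ds/row-count parsed)
(vec (ds/column parsed :a))
```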

tech.v3.dataset.io.datetime

Helpful and well-tested string->datetime pathways.

tech.v3.dataset.io.string-row-parser

Parsing functions based on raw data that is represented by a sequence of string arrays.

tech.v3.dataset.io.univocity

Bindings to univocity. Transforms CSV and TSV files into sequences of string arrays that are then passed into tech.v3.dataset.io.string-row-parser methods.

tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near.
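
A minimal sketch of an exact join on a shared column name, using invented `orders` and `customers` datasets. It assumes `inner-join` takes the column name followed by the left and right datasets.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

(def orders
  (ds/->dataset {:customer-id [1 2 2 4]
                 :total       [10.0 20.0 5.0 7.5]}))

(def customers
  (ds/->dataset {:customer-id [1 2 3]
                 :name        ["Ada" "Alan" "Grace"]}))

;; Only rows whose :customer-id appears on both sides survive an inner join.
(def joined (ds-join/inner-join :customer-id orders customers))

(ds/row-count joined)
```

Customer 4 has no match on the right and customer 3 has no match on the left, so only the three orders belonging to customers 1 and 2 remain.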

tech.v3.dataset.math

Various mathematical transformations of datasets such as (inefficiently) building simple tables, PCA, and normalizing columns to have a mean of 0 and a variance of 1. More in-depth transformations are found at tech.v3.dataset.neanderthal.

tech.v3.dataset.metamorph

This is an auto-generated API: it scans the dataset namespaces and changes the first argument of each function to be metamorph-compliant, which means transforming an argument that is just a dataset into an argument that is a metamorph context, a map of {:metamorph/data ds}. The functions also return their result as a metamorph context.
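
To make the context convention concrete, the sketch below wraps `select-columns` as a pipeline step. It assumes the generated function takes the remaining (non-dataset) arguments and returns a function of the context map; `data` and `keep-a` are names invented for the example.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.metamorph :as ds-mm])

(def data (ds/->dataset {:a [1 2 3] :b [4 5 6]}))

;; Calling the wrapped function with its remaining arguments yields a pipeline step.
(def keep-a (ds-mm/select-columns [:a]))

;; A step takes a context map and returns a context map.
(def ctx (keep-a {:metamorph/data data}))

(ds/column-names (:metamorph/data ctx))
```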

tech.v3.dataset.modelling

Methods related specifically to machine learning such as setting the inference target. This file integrates tightly with tech.v3.dataset.categorical which provides categorical -> number and one-hot transformation pathways.
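
As a small example of setting the inference target, the sketch below marks one column of an invented dataset as the target and reads the marking back.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Mark :y as the column a model should learn to predict.
(def training
  (-> (ds/->dataset {:x [1 2 3] :y [0 1 0]})
      (ds-mod/set-inference-target :y)))

;; The target is stored as column metadata and can be queried later.
(ds-mod/inference-target-column-names training)
```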

tech.v3.dataset.print

tech.v3.dataset.reductions

Specific high-performance reductions intended to be performed over a sequence of datasets. This allows aggregations to be done in situations where the dataset is larger than what will fit in memory on a normal machine. Because of this, summation is implemented using the Kahan algorithm, and various statistical methods are implemented using statistical estimation techniques and are thus prefixed with prob-, which is short for probabilistic.
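
The sketch below aggregates across two small batches that stand in for a larger-than-memory sequence of datasets. It assumes `group-by-column-agg` takes the grouping column, a map of result-column name to reducer, and the dataset sequence; the batch data is invented.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Two batches standing in for a long stream of datasets.
(def batches
  [(ds/->dataset {:k ["a" "b"] :v [1.0 2.0]})
   (ds/->dataset {:k ["a" "b"] :v [3.0 4.0]})])

;; One result row per distinct :k value, aggregated across every batch.
(def totals
  (ds-reduce/group-by-column-agg
   :k
   {:n   (ds-reduce/row-count)
    :sum (ds-reduce/sum :v)}
   batches))

(ds/row-count totals)
```

Row order in the result is not something the sketch relies on, since grouping is hash-based.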

tech.v3.dataset.reductions.apache-data-sketch

Reducers based on the Apache DataSketches family of algorithms.

tech.v3.dataset.rolling

Implement a generalized rolling window including support for time-based variable width windows.
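
A hedged sketch of a fixed-width window follows. The window option keys (`:window-type`, `:window-size`, `:relative-window-position`) and the `max` reducer are written from memory of this namespace and should be checked against its docstrings; values near the edges depend on the edge mode, so none are shown.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.rolling :as ds-roll])

(def series (ds/->dataset {:v [1.0 2.0 3.0 4.0 5.0]}))

;; Add a :v-max column holding the maximum of :v over a 3-row window.
(def windowed
  (ds-roll/rolling series
                   {:window-type :fixed
                    :window-size 3
                    :relative-window-position :left}
                   {:v-max (ds-roll/max :v)}))

(ds/column-names windowed)
```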

tech.v3.dataset.set

Extensions to datasets to do per-row bag-semantics set/union and intersection.

tech.v3.dataset.tensor

Conversion mechanisms from dataset to tensor and back.
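
A minimal round-trip sketch, assuming `dataset->tensor` produces a rows-by-columns tensor and that `tech.v3.datatype/shape` (from dtype-next) reports its dimensions; the `grid` dataset is invented.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.tensor :as ds-tens]
         '[tech.v3.datatype :as dtype])

(def grid (ds/->dataset {:a [1.0 2.0 3.0] :b [4.0 5.0 6.0]}))

;; Rows become the first tensor dimension, columns the second.
(def t (ds-tens/dataset->tensor grid))
(dtype/shape t)

;; Converting back yields a dataset with one column per tensor column.
(def back (ds-tens/tensor->dataset t))
(ds/row-count back)
```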

tech.v3.dataset.zip

Load zip data. Zip files with a single file entry can be loaded with ->dataset. When a zip file has multiple entries you have to call zipfile->dataset-seq.

tech.v3.libs.arrow

Support for reading/writing Apache Arrow datasets. Datasets may be memory mapped but default to being read via an input stream.

tech.v3.libs.clj-transit

Transit bindings for the JVM version of tech.v3.dataset.

tech.v3.libs.fastexcel

Parse a dataset in xlsx format. This namespace auto-registers a handler for the 'xlsx' file type so that when using ->dataset, xlsx will automatically map to (first (workbook->datasets)).

tech.v3.libs.guava.cache

Use a Google Guava cache to memoize function results. The function must not return nil values. Exceptions propagate to the caller.

tech.v3.libs.parquet

Support for reading and writing Parquet files. You must require this namespace to enable Parquet read/write support.

tech.v3.libs.poi

Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for the xls file type so that when using ->dataset, xls will automatically map to (first (workbook->datasets)).

tech.v3.libs.smile.data

Bindings to the smile DataFrame system.

tech.v3.libs.tribuo

Bindings to make working with tribuo more straightforward when using datasets.