public class TMD
extends java.lang.Object
tech.ml.dataset
is a high-performance library for processing columnar data, similar to pandas or R's data.table. Datasets are maps
of their columns, and columns derive from various Clojure interfaces such as Indexed and IFn to make accessing their data as easy as possible.
Columns have a conversion to a tech.v3.datatype.Buffer
object accessible via tech.v3.DType.toBuffer()
, so higher-performance non-boxing access is also available. Any bit of sequential data can be turned into a column. If the data is already in a primitive array or NIO buffer, the best approach is to use that as the column data: it will be used in place. It is also possible to directly instantiate a Buffer object in a read-only pathway to create a virtualized column:
println(head(assoc(colmapDs, kw("c"), new tech.v3.datatype.LongReader() {
    public long lsize() { return 10; }
    public long readLong(long idx) { return 2 * idx; }
})));
//testds [5 3]:
//| :b | :a | :c |
//|----:|---:|---:|
//| 9.0 | 0 | 0 |
//| 8.0 | 1 | 2 |
//| 7.0 | 2 | 4 |
//| 6.0 | 3 | 6 |
//| 5.0 | 4 | 8 |
Datasets implement a subset of java.util.Map as well as Clojure's persistent map interfaces. This means you can use various java.util.Map
methods, and you can also use clojure.core/assoc
, clojure.core/dissoc
, and clojure.core/merge
to add and remove columns from the dataset. These are exposed in tech.v3.Clj
as equivalently named functions. In combination with the fact that columns implement clojure.lang.Indexed (providing nth)
as well as the single-arity IFn invoke method, you can do a surprising amount of dataset processing without using bespoke TMD functions at all.
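For example, here is a small sketch of treating datasets and columns as plain Clojure data. The dataset and names are illustrative, and tech.v3.Clj and tech.v3.TMD are assumed to be statically imported:

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5)));
// Datasets are maps of their columns, so assoc/dissoc work directly.
Map withoutA = dissoc(ds, kw("a"));
// Columns implement clojure.lang.Indexed and clojure.lang.IFn,
// so element access needs no TMD-specific API.
Object colA = column(ds, kw("a"));
Object viaNth = ((clojure.lang.Indexed) colA).nth(2);
Object viaIFn = ((clojure.lang.IFn) colA).invoke(2);
```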
All of the functions in tech.v3.datatype.VecMath
will work with columns, although most of those functions require the columns to have no missing data. The recommendation is to do your missing-value processing first and then move into the various elemwise functions. Integer columns with missing values will upcast themselves to double columns for any math operation so the result stays consistent with respect to NaN behavior. Again, ideally missing values should be dealt with before doing operations in the VecMath
namespace.
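As an illustrative sketch only: assuming VecMath exposes static elementwise methods such as add (check the VecMath javadocs for the exact signatures), column math could look like this:

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5),
                             kw("b"), toDoubleArray(range(5))));
// Neither column has missing values, so elementwise math is safe here.
Object sum = tech.v3.datatype.VecMath.add(column(ds, kw("a")), column(ds, kw("b")));
// The result is sequential data, so it can be assoc'd back as a new column.
println(head(assoc(ds, kw("a+b"), sum)));
```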
Most of the dataset functions (filter, sort, groupBy) will auto-parallelize, but there are many times where the most efficient use of machine resources is to parallelize at the outermost level using pmapDs
. The parallelization primitives check for this and run in serial mode if the current thread is already in a parallelization pathway.
Modifier and Type | Method and Description |
---|---|
static java.lang.Object | column(java.lang.Object ds, java.lang.Object cname) - Return the column named cname else throw an exception. |
static long | columnCount(java.lang.Object ds) - Return the number of columns. |
static java.util.Map | columnDef(java.lang.Object name, java.lang.Object data) - Efficiently create a column definition explicitly specifying name and data. |
static java.util.Map | columnDef(java.lang.Object name, java.lang.Object data, java.lang.Object missing) - Efficiently create a column definition explicitly specifying name, data, and missing. |
static java.util.Map | columnDef(java.lang.Object name, java.lang.Object data, java.lang.Object missing, java.lang.Object metadata) - Efficiently create a column definition explicitly specifying name, data, missing, and metadata. |
static java.util.Map | columnMap(java.lang.Object ds, java.lang.Object resultCname, clojure.lang.IFn mapFn, java.lang.Object srcCnames) - Map a function across 1 or more columns to produce a new column. |
static java.util.Map | concatCopying(java.lang.Object datasets) - Concatenate an Iterable of datasets into one dataset via copying data into one dataset. |
static java.util.Map | concatInplace(java.lang.Object datasets) - Concatenate an Iterable of datasets into one dataset via creating virtual buffers that index into the previous datasets. |
static java.util.Map | descriptiveStats(java.lang.Object ds) - Create a dataset of the descriptive statistics of the input dataset. |
static java.util.Map | descriptiveStats(java.lang.Object ds, java.lang.Object options) - Create a dataset of the descriptive statistics of the input dataset. |
static java.util.Map | dropColumns(java.lang.Object ds, java.lang.Object columnNames) - Drop columns by name. |
static java.util.Map | dropRows(java.lang.Object ds, java.lang.Object rowIndexes) - Drop rows by index. |
static java.util.Map | filter(java.lang.Object ds, clojure.lang.IFn predicate) - Filter a dataset. |
static java.util.Map | filterColumn(java.lang.Object ds, java.lang.Object cname, clojure.lang.IFn predicate) - Filter a dataset. |
static java.util.Map | groupBy(java.lang.Object ds, clojure.lang.IFn groupFn) - Group a dataset, returning a Map of keys to datasets. |
static java.util.Map | groupByColumn(java.lang.Object ds, java.lang.Object cname) - Group a dataset by a specific column, returning a Map of keys to datasets. |
static java.util.Map | head(java.lang.Object ds) - Return the first 5 rows of the dataset. |
static java.util.Map | head(java.lang.Object ds, long nRows) - Return the first N rows of the dataset. |
static boolean | isDataset(java.lang.Object ds) - Returns true if this object is a dataset. |
static java.util.Map | join(java.util.Map leftDs, java.util.Map rightDs, java.util.Map options) - Perform a join operation between two datasets. |
static java.util.Map | leftJoinAsof(java.lang.Object colname, java.util.Map lhs, java.util.Map rhs) |
static java.util.Map | leftJoinAsof(java.lang.Object colname, java.util.Map lhs, java.util.Map rhs, java.lang.Object options) - Perform a left join but join on nearest value as opposed to matching value. |
static java.util.Map | makeDataset(java.lang.Object dsData) - Make a dataset. |
static java.util.Map | makeDataset(java.lang.Object dsData, java.util.Map options) - Basic pathway to take data and get back a dataset. |
static org.roaringbitmap.RoaringBitmap | missing(java.lang.Object dsOrColumn) - Return the missing set of a dataset or a column in the form of a RoaringBitmap. |
static java.util.Map | neanderthalToDataset(java.lang.Object denseMat) - Convert a neanderthal matrix to a dataset such that the columns of the matrix become the columns of the dataset. |
static java.lang.Object | pmapDS(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options) - Parallelize mapping a function from dataset->dataset across a dataset. |
static java.util.Map | renameColumns(java.lang.Object ds, java.util.Map renameMap) - Rename columns providing a map of oldname to newname. |
static java.util.Map | replaceMissing(java.lang.Object ds, java.lang.Object strategy) - Replace missing values. |
static java.util.Map | replaceMissing(java.lang.Object ds, java.lang.Object strategy, java.lang.Object columns) - Replace the missing values from a column or set of columns. |
static java.util.Map | reverseRows(java.lang.Object ds) - Reverse the rows of the dataset. |
static long | rowCount(java.lang.Object ds) - Return the number of rows. |
static java.util.Map | rowMap(java.lang.Object ds, clojure.lang.IFn mapFn) - Map a function across the rows of the dataset with each row in map form. |
static java.lang.Object | rowMap(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options) - Map a function across the rows of the dataset with each row in map form. |
static java.lang.Object | rowMapcat(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options) - Map a function across the rows of the dataset with each row in map form. |
static tech.v3.datatype.Buffer | rows(java.lang.Object ds) - Return the rows of the dataset in a flyweight map format. |
static tech.v3.datatype.Buffer | rowvecs(java.lang.Object ds) - Return the rows of the dataset where each row is just a flat Buffer of data. |
static tech.v3.datatype.Buffer | rowvecs(java.lang.Object ds, boolean copying) - Return the rows of the dataset where each row is just a flat Buffer of data. |
static java.util.Map | sample(java.lang.Object ds) - Return a random sampling of 5 rows without replacement of the data. |
static java.util.Map | sample(java.lang.Object ds, long nRows) - Return a random sampling of N rows without replacement of the data. |
static java.util.Map | sample(java.lang.Object ds, long nRows, java.util.Map options) - Return a random sampling of N rows of the data. |
static java.util.Map | select(java.lang.Object ds, java.lang.Object columnNames, java.lang.Object rows) - Select a sub-rect of the dataset. |
static java.util.Map | selectColumns(java.lang.Object ds, java.lang.Object columnNames) - Select columns by name. |
static java.util.Map | selectRows(java.lang.Object ds, java.lang.Object rowIndexes) - Select rows by index. |
static java.util.Map | shuffle(java.lang.Object ds) - Randomly shuffle the dataset rows. |
static java.util.Map | shuffle(java.lang.Object ds, java.util.Map options) - Randomly shuffle the dataset rows. |
static java.util.Map | sortBy(java.lang.Object ds, clojure.lang.IFn sortFn) - Sort a dataset. |
static java.util.Map | sortBy(java.lang.Object ds, clojure.lang.IFn sortFn, java.lang.Object compareFn) - Sort a dataset. |
static java.util.Map | sortBy(java.lang.Object ds, clojure.lang.IFn sortFn, java.lang.Object compareFn, java.lang.Object options) - Sort a dataset by first mapping sortFn over it and then sorting over the result. |
static java.util.Map | sortByColumn(java.lang.Object ds, java.lang.Object cname) - Sort a dataset by a specific column. |
static java.util.Map | sortByColumn(java.lang.Object ds, java.lang.Object cname, java.lang.Object compareFn) - Sort a dataset by a specific column. |
static java.util.Map | sortByColumn(java.lang.Object ds, java.lang.Object cname, java.lang.Object compareFn, java.lang.Object options) - Sort a dataset by using the values from column cname. |
static java.util.Map | tail(java.lang.Object ds) - Return the last 5 rows of the dataset. |
static java.util.Map | tail(java.lang.Object ds, long nRows) - Return the last N rows of the dataset. |
static java.util.Map | tensorToDataset(java.lang.Object tens) - Convert a tensor to a dataset such that the columns of the tensor become the columns of the dataset, named after their index. |
static java.lang.Object | toNeanderthal(java.lang.Object ds) - Convert a dataset to a neanderthal 2D matrix such that the columns of the dataset become the columns of the matrix. |
static java.lang.Object | toNeanderthal(java.lang.Object ds, clojure.lang.Keyword layout, clojure.lang.Keyword datatype) - Convert a dataset to a neanderthal 2D matrix such that the columns of the dataset become the columns of the matrix. |
static tech.v3.datatype.NDBuffer | toTensor(java.lang.Object ds) - Convert a dataset to a jvm-heap based 2D double (float64) tensor. |
static tech.v3.datatype.NDBuffer | toTensor(java.lang.Object ds, clojure.lang.Keyword datatype) - Convert a dataset to a jvm-heap based 2D tensor such that the columns of the dataset become the columns of the tensor. |
static java.util.Map | uniqueBy(java.lang.Object ds, clojure.lang.IFn uniqueFn) - Create a dataset with no duplicates by taking the first of duplicate values. |
static java.util.Map | uniqueByColumn(java.lang.Object ds, java.lang.Object cname) - Make a dataset unique using a particular column as the uniqueness criteria, taking the first value. |
static void | writeDataset(java.lang.Object ds, java.lang.String path) - Write a dataset to disc as csv, tsv, csv.gz, tsv.gz or nippy. |
static void | writeDataset(java.lang.Object ds, java.lang.String path, java.lang.Object options) - Write a dataset to disc as csv, tsv, csv.gz, tsv.gz, json, json.gz or nippy. |
public static java.util.Map makeDataset(java.lang.Object dsData, java.util.Map options)
Basic pathway to take data and get back a dataset. If dsData is a string, a built-in system can parse csv, tsv, csv.gz, tsv.gz, json, json.gz and nippy format files. Other formats such as xlsx, Apache Arrow, and Parquet are provided in other classes.
Aside from string data formats, you can explicitly provide either a sequence of maps or a map of columns, with the map of columns being by far the most efficient. In the map-of-columns approach, arrays of primitive numeric data and native buffers will be used in place.
The options for parsing a dataset are extensive and documented at ->dataset.
Example:
Map ds = makeDataset("https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv");
tech.v3.Clj.println(head(ds));
// https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]:
// | symbol | date | price |
// |--------|------------|------:|
// | MSFT | 2000-01-01 | 39.81 |
// | MSFT | 2000-02-01 | 36.35 |
// | MSFT | 2000-03-01 | 43.22 |
// | MSFT | 2000-04-01 | 28.37 |
// | MSFT | 2000-05-01 | 25.45 |
Map colmapDs = makeDataset(hashmap(kw("a"), range(10),
kw("b"), toDoubleArray(range(9,-1,-1))),
hashmap(kw("dataset-name"), "testds"));
println(colmapDs);
// testds [10 2]:
// | :b | :a |
// |----:|---:|
// | 9.0 | 0 |
// | 8.0 | 1 |
// | 7.0 | 2 |
// | 6.0 | 3 |
// | 5.0 | 4 |
// | 4.0 | 5 |
// | 3.0 | 6 |
// | 2.0 | 7 |
// | 1.0 | 8 |
// | 0.0 | 9 |
public static java.util.Map makeDataset(java.lang.Object dsData)
Make a dataset. See 2-arity form of function.
public static boolean isDataset(java.lang.Object ds)
Returns true if this object is a dataset.
public static long rowCount(java.lang.Object ds)
Return the number of rows.
public static long columnCount(java.lang.Object ds)
Return the number of columns.
public static java.lang.Object column(java.lang.Object ds, java.lang.Object cname)
Return the column named cname, else throw an exception.
public static java.util.Map columnDef(java.lang.Object name, java.lang.Object data)
Efficiently create a column definition explicitly specifying name and data. Typed data will be scanned for missing values and untyped data will be read element by element to discern datatype and missing information. The result can be assoc'd back into the dataset.
public static java.util.Map columnDef(java.lang.Object name, java.lang.Object data, java.lang.Object missing)
Efficiently create a column definition explicitly specifying name, data, and missing. The result can be assoc'd back into the dataset. Missing will be converted to a RoaringBitmap but can additionally be an integer array, a java set, or a sequence of integers.
public static java.util.Map columnDef(java.lang.Object name, java.lang.Object data, java.lang.Object missing, java.lang.Object metadata)
Efficiently create a column definition explicitly specifying name, data, missing, and metadata. The result can be assoc'd back into the dataset and saves the system the time required to scan for missing elements. Missing will be converted to a RoaringBitmap but can additionally be an integer array, a java set, or a sequence of integers.
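A short sketch of this assoc pathway (the dataset and column names are illustrative, with tech.v3.Clj and tech.v3.TMD statically imported):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(4)));
// Declare data and the missing set up front so no scan is needed;
// row index 1 is marked missing.
Map colDef = columnDef(kw("b"), toDoubleArray(range(4)), vector(1));
Map withB = assoc(ds, kw("b"), colDef);
```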
public static java.util.Map select(java.lang.Object ds, java.lang.Object columnNames, java.lang.Object rows)
Select a sub-rect of the dataset. columnNames is a sequence of column names that must exist in the dataset. rows is a sequence, list, array, or bitmap of integer row indexes to select. The returned dataset has columns in the order specified by columnNames.
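For instance (illustrative names, assuming static imports of tech.v3.Clj and tech.v3.TMD):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(10),
                             kw("b"), toDoubleArray(range(10))));
// Keep columns :b then :a (output column order follows columnNames)
// and only the first three rows.
Map sub = select(ds, vector(kw("b"), kw("a")), range(3));
```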
public static java.util.Map selectColumns(java.lang.Object ds, java.lang.Object columnNames)
Select columns by name. All names must exist in the dataset.
public static java.util.Map dropColumns(java.lang.Object ds, java.lang.Object columnNames)
Drop columns by name. All names must exist in the dataset. Another option is to use the Clojure function dissoc.
public static java.util.Map renameColumns(java.lang.Object ds, java.util.Map renameMap)
Rename columns providing a map of oldname to newname.
public static java.util.Map selectRows(java.lang.Object ds, java.lang.Object rowIndexes)
Select rows by index.
public static java.util.Map dropRows(java.lang.Object ds, java.lang.Object rowIndexes)
Drop rows by index.
public static org.roaringbitmap.RoaringBitmap missing(java.lang.Object dsOrColumn)
Return the missing set of a dataset or a column in the form of a RoaringBitmap.
public static java.util.Map replaceMissing(java.lang.Object ds, java.lang.Object strategy, java.lang.Object columns)
Replace the missing values from a column or set of columns. To replace across all columns use the keyword :all.
Strategy can be:
:up - take the next value.
:down - take the previous value.
:lerp - linearly interpolate across values. Datetime objects will have interpolation done in millisecond space.
vector(:value, val) - provide this value explicitly to replace entries.
:nearest - use the nearest value.
:midpoint - use the mean of the range.
:abb - impute missing values using approximate Bayesian bootstrap.
Further documentation is located at replace-missing.
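A hedged sketch of two of the strategies above (the dataset is illustrative; the missing set is declared explicitly via columnDef):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map base = makeDataset(hashmap(kw("idx"), range(4)));
// Column :a has a declared missing value at row index 1.
Map ds = assoc(base, kw("a"), columnDef(kw("a"), toDoubleArray(range(4)), vector(1)));

// Fill down: the missing entry takes the previous value.
Map filledDown = replaceMissing(ds, kw("down"), kw("a"));
// Constant fill using the vector(:value, val) strategy form.
Map filledConst = replaceMissing(ds, vector(kw("value"), -1.0), kw("a"));
```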
public static java.util.Map replaceMissing(java.lang.Object ds, java.lang.Object strategy)
Replace missing values. See 3-arity form of function for documentation.
public static tech.v3.datatype.Buffer rows(java.lang.Object ds)
Return the rows of the dataset in a flyweight map format. Maps share keys and read their data lazily from the base dataset.
public static tech.v3.datatype.Buffer rowvecs(java.lang.Object ds, boolean copying)
Return the rows of the dataset where each row is just a flat Buffer of data.
When copying is true, data is copied upon each access from the underlying dataset. This makes doing something like using each row as the key in a map more efficient.
public static tech.v3.datatype.Buffer rowvecs(java.lang.Object ds)
Return the rows of the dataset where each row is just a flat Buffer of data.
public static java.util.Map head(java.lang.Object ds)
Return the first 5 rows of the dataset
public static java.util.Map head(java.lang.Object ds, long nRows)
Return the first N rows of the dataset
public static java.util.Map tail(java.lang.Object ds)
Return the last 5 rows of the dataset
public static java.util.Map tail(java.lang.Object ds, long nRows)
Return the last N rows of the dataset
public static java.util.Map sample(java.lang.Object ds)
Return a random sampling of 5 rows without replacement of the data
public static java.util.Map sample(java.lang.Object ds, long nRows)
Return a random sampling of N rows without replacement of the data
public static java.util.Map sample(java.lang.Object ds, long nRows, java.util.Map options)
Return a random sampling of N rows of the data.
Options:
:replacement? - Do sampling with replacement. Defaults to false.
:seed - Either an integer or an implementation of java.util.Random.
public static java.util.Map shuffle(java.lang.Object ds)
Randomly shuffle the dataset rows.
public static java.util.Map shuffle(java.lang.Object ds, java.util.Map options)
Randomly shuffle the dataset rows.
Options:
:seed - Either an integer or an implementation of java.util.Random.
public static java.util.Map reverseRows(java.lang.Object ds)
Reverse the rows of the dataset
public static java.util.Map columnMap(java.lang.Object ds, java.lang.Object resultCname, clojure.lang.IFn mapFn, java.lang.Object srcCnames)
Map a function across 1 or more columns to produce a new column. The new column is serially scanned to detect datatype and its missing set.
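Since mapFn is a clojure.lang.IFn, one way to supply it from Java is to subclass clojure.lang.AFn. A sketch (names are illustrative):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5),
                             kw("b"), range(5)));
// mapFn receives one value from each source column per row.
Map withSum = columnMap(ds, kw("a+b"),
    new clojure.lang.AFn() {
      public Object invoke(Object a, Object b) {
        return ((Number) a).longValue() + ((Number) b).longValue();
      }
    },
    vector(kw("a"), kw("b")));
```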
public static java.util.Map rowMap(java.lang.Object ds, clojure.lang.IFn mapFn)
Map a function across the rows of the dataset with each row in map form. The function must return a new map for each row. The result is generated in parallel so, when used with a map factory, this is a surprisingly efficient strategy to create multiple columns at once from each row.
public static java.lang.Object rowMap(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options)
Map a function across the rows of the dataset with each row in map form. The function must return a new map for each row. The result is generated in parallel so, when used with a map factory, this is a surprisingly efficient strategy to create multiple columns at once from each row.
See options for pmapDs. Especially note :max-batch-size
and :result-type
. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the tech.v3.dataset.reductions namespace.
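The rowMap variants above take a clojure.lang.IFn; from Java, one sketch is an inline AFn subclass (the column names are illustrative):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5)));
// Each row arrives as a map; the returned map supplies the new column(s).
Map withSquare = rowMap(ds, new clojure.lang.AFn() {
    public Object invoke(Object row) {
      long a = ((Number) ((Map) row).get(kw("a"))).longValue();
      return hashmap(kw("a-squared"), a * a);
    }
  });
```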
public static java.lang.Object rowMapcat(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options)
Map a function across the rows of the dataset with each row in map form. Function must return either null or a sequence of maps and thus can produce many new rows for each input row. Function is called in a parallelized context. Maps returned must be an implementation of clojure’s IPersistentMap. See tech.v3.Clj.mapFactory for an efficient way to create those in bulk.
See options for pmapDs. Especially note :max-batch-size
and :result-type
. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the tech.v3.dataset.reductions namespace.
public static java.lang.Object pmapDS(java.lang.Object ds, clojure.lang.IFn mapFn, java.lang.Object options)
Parallelize mapping a function from dataset->dataset across a dataset. The function may return null. The original dataset is simply sliced into n-core pieces and mapFn is called n-core times, with the results either concatenated into a new dataset or returned as an Iterable.
Most of the dataset functions (filter, sort, groupBy) will auto-parallelize, but there are many times where the most efficient use of machine resources is to parallelize at the outermost level. The parallelization primitives check for this and run in serial mode if the current thread is already in a parallelization pathway.
mapFn - a function from dataset->dataset, although it may return null.
Options:
:max-batch-size - Defaults to 64000. This controls the size of each parallelized chunk.
:result-type - Either :as-seq, in which case the output of this function is a sequence of datasets, or :as-ds, in which case the output is a single dataset. The default is :as-ds.
public static java.util.Map sortBy(java.lang.Object ds, clojure.lang.IFn sortFn, java.lang.Object compareFn, java.lang.Object options)
Sort a dataset by first mapping sortFn over it and then sorting over the result. sortFn is passed each row in map form and the return value is used to sort the dataset.
sortFn - function taking a single argument, the row-map, and returning the value to sort on.
compareFn - Comparison operator or comparator. Some examples are the Clojure '<' or '>' operators (tech.v3.Clj.lessThanFn, tech.v3.Clj.greaterThanFn). The Clojure keywords :tech.numerics/< and :tech.numerics/> can be used for somewhat higher performance unboxed primitive comparisons, or the Clojure function compare (tech.v3.Clj.compareFn), which is similar to .compareTo except it works with null and the input must implement Comparable. Finally, you can instantiate an instance of java.util.Comparator.
Options:
:nan-strategy - General missing strategy. Options are :first, :last, and :exception.
:parallel? - Uses parallel quicksort when true and regular quicksort when false.
public static java.util.Map sortBy(java.lang.Object ds, clojure.lang.IFn sortFn, java.lang.Object compareFn)
Sort a dataset. See documentation of 4-arity version.
public static java.util.Map sortBy(java.lang.Object ds, clojure.lang.IFn sortFn)
Sort a dataset. See documentation of 4-arity version.
public static java.util.Map sortByColumn(java.lang.Object ds, java.lang.Object cname, java.lang.Object compareFn, java.lang.Object options)
Sort a dataset using the values from column cname to sort on.
compareFn - Comparison operator or comparator. Some examples are the Clojure '<' or '>' operators (tech.v3.Clj.lessThanFn, tech.v3.Clj.greaterThanFn). The Clojure keywords :tech.numerics/< and :tech.numerics/> can be used for somewhat higher performance unboxed primitive comparisons, or the Clojure function compare (tech.v3.Clj.compareFn), which is similar to .compareTo except it works with null and the input must implement Comparable. Finally, you can instantiate an instance of java.util.Comparator.
Options:
:nan-strategy - General missing strategy. Options are :first, :last, and :exception.
:parallel? - Uses parallel quicksort when true and regular quicksort when false.
public static java.util.Map sortByColumn(java.lang.Object ds, java.lang.Object cname, java.lang.Object compareFn)
Sort a dataset by a specific column. See documentation on 4-arity version.
public static java.util.Map sortByColumn(java.lang.Object ds, java.lang.Object cname)
Sort a dataset by a specific column. See documentation on 4-arity version.
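For instance, since a java.util.Comparator is accepted as compareFn (the dataset here is illustrative):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), vector(3, 1, 2)));
// Ascending sort on :a using the default comparison.
Map ascending = sortByColumn(ds, kw("a"));
// Descending sort by passing an explicit java.util.Comparator.
Map descending = sortByColumn(ds, kw("a"), java.util.Comparator.reverseOrder());
```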
public static java.util.Map filter(java.lang.Object ds, clojure.lang.IFn predicate)
Filter a dataset. The predicate gets passed each row and must return a truthy value.
public static java.util.Map filterColumn(java.lang.Object ds, java.lang.Object cname, clojure.lang.IFn predicate)
Filter a dataset. The predicate gets passed each value from column cname and must return a truthy value.
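A sketch of a column-value predicate supplied as an inline clojure.lang.AFn (dataset is illustrative):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(10)));
// Keep only the rows whose :a value is greater than 6.
Map big = filterColumn(ds, kw("a"), new clojure.lang.AFn() {
    public Object invoke(Object v) {
      return ((Number) v).longValue() > 6;
    }
  });
```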
public static java.util.Map groupBy(java.lang.Object ds, clojure.lang.IFn groupFn)
Group a dataset, returning a Map of keys to datasets.
groupFn - gets passed each row in map format and must return the desired key.
public static java.util.Map groupByColumn(java.lang.Object ds, java.lang.Object cname)
Group a dataset by a specific column returning a Map of keys to dataset.
public static java.util.Map concatCopying(java.lang.Object datasets)
Concatenate an Iterable of datasets into one dataset via copying data into one dataset. This generally results in higher performance than an in-place concatenation with the exception of small (< 3) numbers of datasets. Null datasets will be silently ignored.
public static java.util.Map concatInplace(java.lang.Object datasets)
Concatenate an Iterable of datasets into one dataset via creating virtual buffers that index into the previous datasets. This generally results in lower performance than a copying concatenation with the exception of small (< 3) numbers of datasets. Null datasets will be silently ignored.
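A minimal sketch of both concatenation flavors (the dataset is illustrative):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5)));
// Copying concat: usually faster for more than a couple of datasets.
Map copied = concatCopying(vector(ds, ds));
// In-place concat: builds virtual index buffers instead of copying.
Map inplace = concatInplace(vector(ds, ds));
```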
public static java.util.Map uniqueBy(java.lang.Object ds, clojure.lang.IFn uniqueFn)
Create a dataset with no duplicates by taking the first of duplicate values.
uniqueFn - is passed a row and must return the uniqueness criteria. An example uniqueFn is the identity function.
public static java.util.Map uniqueByColumn(java.lang.Object ds, java.lang.Object cname)
Make a dataset unique using a particular column as the uniqueness criteria and taking the first value.
public static java.util.Map descriptiveStats(java.lang.Object ds, java.lang.Object options)
Create a dataset of the descriptive statistics of the input dataset. This works with date-time columns, missing values, etc. and serves as a very fast way to get a feel for a dataset.
Options:
:stat-names - A set of desired stat names. Possible statistic operations are: [:col-name :datatype :n-valid :n-missing :min :quartile-1 :mean :mode :median :quartile-3 :max :standard-deviation :skew :n-values :values :histogram :first :last]
public static java.util.Map descriptiveStats(java.lang.Object ds)
Create a dataset of the descriptive statistics of the input dataset. This works with date-time columns, missing values, etc. and serves as a very fast way to get a feel for a dataset.
Options:
:stat-names - A set of desired stat names. Possible statistic operations are: [:col-name :datatype :n-valid :n-missing :min :quartile-1 :mean :mode :median :quartile-3 :max :standard-deviation :skew :n-values :values :histogram :first :last]
public static java.util.Map join(java.util.Map leftDs, java.util.Map rightDs, java.util.Map options)
Perform a join operation between two datasets.
Options:
:on - Column name or list of column names. Names must be found in both datasets.
:left-on - Column name or list of column names.
:right-on - Column name or list of column names.
:how - One of :left, :right, :inner, :outer, or :cross. If :cross, then it is an error to provide :on, :left-on, or :right-on. Defaults to :inner.
Examples:
Map dsa = makeDataset(hashmap("a", vector("a", "b", "b", "a", "c"),
"b", range(5),
"c", range(5)));
println(dsa);
//_unnamed [5 3]:
//| a | b | c |
//|---|--:|--:|
//| a | 0 | 0 |
//| b | 1 | 1 |
//| b | 2 | 2 |
//| a | 3 | 3 |
//| c | 4 | 4 |
Map dsb = makeDataset(hashmap("a", vector("a", "b", "a", "b", "d"),
"b", range(5),
"c", range(6,11)));
println(dsb);
//_unnamed [5 3]:
//| a | b | c |
//|---|--:|---:|
//| a | 0 | 6 |
//| b | 1 | 7 |
//| a | 2 | 8 |
//| b | 3 | 9 |
//| d | 4 | 10 |
//Join on the columns a,b. Default join mode is inner
println(join(dsa, dsb, hashmap(kw("on"), vector("a", "b"))));
//inner-join [2 4]:
//| a | b | c | right.c |
//|---|--:|--:|--------:|
//| a | 0 | 0 | 6 |
//| b | 1 | 1 | 7 |
//Outer join on same columns
println(join(dsa, dsb, hashmap(kw("on"), vector("a", "b"),
kw("how"), kw("outer"))));
//outer-join [8 4]:
//| a | b | c | right.c |
//|---|--:|--:|--------:|
//| a | 0 | 0 | 6 |
//| b | 1 | 1 | 7 |
//| b | 2 | 2 | |
//| a | 3 | 3 | |
//| c | 4 | 4 | |
//| a | 2 | | 8 |
//| b | 3 | | 9 |
//| d | 4 | | 10 |
public static java.util.Map leftJoinAsof(java.lang.Object colname, java.util.Map lhs, java.util.Map rhs, java.lang.Object options)
Perform a left join but join on nearest value as opposed to matching value. Both datasets must be sorted by the join column and the join column itself must be either a datetime column or a numeric column. When the join column is a datetime column the join happens in millisecond space.
Options:
:asof-op - One of the keywords [:< :<= :nearest :>= :>]. Defaults to :<=.
Examples:
println(head(googPrices, 200));
//GOOG [68 3]:
//| symbol | date | price |
//|--------|------------|-------:|
//| GOOG | 2004-08-01 | 102.37 |
//| GOOG | 2004-09-01 | 129.60 |
//| GOOG | 2005-03-01 | 180.51 |
//| GOOG | 2004-11-01 | 181.98 |
//| GOOG | 2005-02-01 | 187.99 |
//| GOOG | 2004-10-01 | 190.64 |
//| GOOG | 2004-12-01 | 192.79 |
//| GOOG | 2005-01-01 | 195.62 |
//| GOOG | 2005-04-01 | 220.00 |
//| GOOG | 2005-05-01 | 277.27 |
//| GOOG | 2005-08-01 | 286.00 |
//| GOOG | 2005-07-01 | 287.76 |
//| GOOG | 2008-11-01 | 292.96 |
//| GOOG | 2005-06-01 | 294.15 |
//| GOOG | 2008-12-01 | 307.65 |
//| GOOG | 2005-09-01 | 316.46 |
//| GOOG | 2009-02-01 | 337.99 |
//| GOOG | 2009-01-01 | 338.53 |
//| GOOG | 2009-03-01 | 348.06 |
//| GOOG | 2008-10-01 | 359.36 |
//| GOOG | 2006-02-01 | 362.62 |
//| GOOG | 2006-05-01 | 371.82 |
//| GOOG | 2005-10-01 | 372.14 |
//| GOOG | 2006-08-01 | 378.53 |
//| GOOG | 2006-07-01 | 386.60 |
//| GOOG | 2006-03-01 | 390.00 |
//| GOOG | 2009-04-01 | 395.97 |
//| GOOG | 2008-09-01 | 400.52 |
//| GOOG | 2006-09-01 | 401.90 |
//| GOOG | 2005-11-01 | 404.91 |
//| GOOG | 2005-12-01 | 414.86 |
//| GOOG | 2009-05-01 | 417.23 |
//| GOOG | 2006-04-01 | 417.94 |
//| GOOG | 2006-06-01 | 419.33 |
//| GOOG | 2009-06-01 | 421.59 |
//| GOOG | 2006-01-01 | 432.66 |
//| GOOG | 2008-03-01 | 440.47 |
//| GOOG | 2009-07-01 | 443.05 |
//| GOOG | 2007-02-01 | 449.45 |
//| GOOG | 2007-03-01 | 458.16 |
//| GOOG | 2006-12-01 | 460.48 |
//| GOOG | 2009-08-01 | 461.67 |
//| GOOG | 2008-08-01 | 463.29 |
//| GOOG | 2008-02-01 | 471.18 |
//| GOOG | 2007-04-01 | 471.38 |
//| GOOG | 2008-07-01 | 473.75 |
//| GOOG | 2006-10-01 | 476.39 |
//| GOOG | 2006-11-01 | 484.81 |
//| GOOG | 2009-09-01 | 495.85 |
//| GOOG | 2007-05-01 | 497.91 |
//| GOOG | 2007-01-01 | 501.50 |
//| GOOG | 2007-07-01 | 510.00 |
//| GOOG | 2007-08-01 | 515.25 |
//| GOOG | 2007-06-01 | 522.70 |
//| GOOG | 2008-06-01 | 526.42 |
//| GOOG | 2010-02-01 | 526.80 |
//| GOOG | 2010-01-01 | 529.94 |
//| GOOG | 2009-10-01 | 536.12 |
//| GOOG | 2010-03-01 | 560.19 |
//| GOOG | 2008-01-01 | 564.30 |
//| GOOG | 2007-09-01 | 567.27 |
//| GOOG | 2008-04-01 | 574.29 |
//| GOOG | 2009-11-01 | 583.00 |
//| GOOG | 2008-05-01 | 585.80 |
//| GOOG | 2009-12-01 | 619.98 |
//| GOOG | 2007-12-01 | 691.48 |
//| GOOG | 2007-11-01 | 693.00 |
//| GOOG | 2007-10-01 | 707.00 |
Map targetPrices = makeDataset(hashmap("price", new Double[] { 200.0, 300.0, 400.0 }));
println(leftJoinAsof("price", targetPrices, googPrices, hashmap(kw("asof-op"), kw("<="))));
//asof-<= [3 4]:
//| price | symbol | date | GOOG.price |
//|------:|--------|------------|-----------:|
//| 200.0 | GOOG | 2005-04-01 | 220.00 |
//| 300.0 | GOOG | 2008-12-01 | 307.65 |
//| 400.0 | GOOG | 2008-09-01 | 400.52 |
println(leftJoinAsof("price", targetPrices, googPrices, hashmap(kw("asof-op"), kw(">"))));
//asof-> [3 4]:
//| price | symbol | date | GOOG.price |
//|------:|--------|------------|-----------:|
//| 200.0 | GOOG | 2005-01-01 | 195.62 |
//| 300.0 | GOOG | 2005-06-01 | 294.15 |
//| 400.0 | GOOG | 2009-04-01 | 395.97 |
public static java.util.Map leftJoinAsof(java.lang.Object colname, java.util.Map lhs, java.util.Map rhs)
public static java.lang.Object toNeanderthal(java.lang.Object ds, clojure.lang.Keyword layout, clojure.lang.Keyword datatype)
Convert a dataset to a neanderthal 2D matrix such that the columns of the dataset become the columns of the matrix. This function dynamically loads the neanderthal MKL bindings so there may be some pause when first called. If you would like to have the pause somewhere else, call require("tech.v3.dataset.neanderthal"); at some previous point in the program. You must have an up-to-date version of neanderthal on your classpath, such as [uncomplicate/neanderthal "0.43.3"].
See the neanderthal documentation.
layout - One of :column or :row.
datatype - One of :float32 or :float64.
Note that you can get a tech tensor (tech.v3.datatype.NDBuffer) from a neanderthal matrix using tech.v3.DType.asTensor().
public static java.lang.Object toNeanderthal(java.lang.Object ds)
Convert a dataset to a neanderthal 2D matrix such that the columns of the dataset become the columns of the matrix. See documentation for 4-arity version of function. This function creates a column-major float64 (double) matrix.
public static java.util.Map neanderthalToDataset(java.lang.Object denseMat)
Convert a neanderthal matrix to a dataset such that the columns of the matrix become the columns of the dataset. Column names are the indexes of the columns.
public static tech.v3.datatype.NDBuffer toTensor(java.lang.Object ds, clojure.lang.Keyword datatype)
Convert a dataset to a jvm-heap based 2D tensor such that the columns of the dataset become the columns of the tensor.
datatype - Any numeric datatype: :int8, :uint8, :float32, :float64, etc.
public static tech.v3.datatype.NDBuffer toTensor(java.lang.Object ds)
Convert a dataset to a jvm-heap based 2D double (float64) tensor.
public static java.util.Map tensorToDataset(java.lang.Object tens)
Convert a tensor to a dataset such that the columns of the tensor become the columns of the dataset named after their index.
public static void writeDataset(java.lang.Object ds, java.lang.String path, java.lang.Object options)
Write a dataset to disc as csv, tsv, csv.gz, tsv.gz, json, json.gz or nippy.
Reading/writing to parquet or arrow is accessible via separate classes.
public static void writeDataset(java.lang.Object ds, java.lang.String path)
Write a dataset to disc as csv, tsv, csv.gz, tsv.gz or nippy.
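A round-trip sketch (the path is illustrative; the format is inferred from the file extension):

```java
import static tech.v3.Clj.*;
import static tech.v3.TMD.*;
import java.util.Map;

Map ds = makeDataset(hashmap(kw("a"), range(5)));
// .gz extensions write compressed output.
writeDataset(ds, "testds.csv.gz");
// makeDataset parses the same formats back in.
Map reread = makeDataset("testds.csv.gz");
```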