tech.v3.dataset.metamorph

This API is auto-generated: it scans the namespaces and changes the first argument of each function to be metamorph-compliant. That means an argument that was just a dataset becomes a metamorph context, a map of {:metamorph/data ds}. The functions also return their result as a metamorph context.
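As a rough illustration of that wrapping, the sketch below lifts a dataset -> dataset function into a context -> context step in plain Clojure. lift-to-context and first-rows are hypothetical helpers, not part of this namespace, and a vector of row maps stands in for a real dataset. It assumes the generated functions return such a step, which is what the signatures below (with the dataset argument omitted) suggest.

```clojure
;; Hypothetical helper, not part of tech.v3.dataset.metamorph.
;; Lifts (f dataset & args) into a function of a metamorph context.
(defn lift-to-context
  [f & args]
  (fn [ctx]
    ;; Read the dataset from :metamorph/data, apply f, store the result back.
    (update ctx :metamorph/data #(apply f % args))))

;; A vector of row maps stands in for a real dataset.
(def ctx {:metamorph/data [{:a 1} {:a 2} {:a 3}]})

;; Stand-in for a dataset function such as head: keep the first n rows.
(defn first-rows [rows n] (vec (take n rows)))

(def step (lift-to-context first-rows 2))

(step ctx)
;; => {:metamorph/data [{:a 1} {:a 2}]}
```

Any other keys in the context map pass through untouched, which is what lets several such steps be chained over one context.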

add-column

(add-column column)

Add a new column. Throws an error on a name collision.

add-or-update-column

(add-or-update-column colname column)
(add-or-update-column column)

If column exists, replace. Else append new column.

append-columns

(append-columns column-seq)

assoc-ds

(assoc-ds cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).

assoc-metadata

(assoc-metadata filter-fn-or-ds k v & args)

Set metadata across a set of columns.

brief

(brief options)
(brief)

Get a brief description of a dataset in mapseq form. A brief description is the mapseq form of descriptive stats.

build-pipelined-function

macro

(build-pipelined-function f m)

categorical->number

(categorical->number filter-fn-or-ds)
(categorical->number filter-fn-or-ds table-args)
(categorical->number filter-fn-or-ds table-args result-datatype)

Convert columns into a discrete, numeric representation. See tech.v3.dataset.categorical/fit-categorical-map.

categorical->one-hot

(categorical->one-hot filter-fn-or-ds)
(categorical->one-hot filter-fn-or-ds table-args)
(categorical->one-hot filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot.

column

(column colname)

column->dataset

(column->dataset colname transform-fn options)
(column->dataset colname transform-fn)

Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.

column-cast

(column-cast colname datatype)

Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.

colname may be a scalar or a tuple of [src-col dst-col].

datatype may be a datatype enumeration or a tuple of [datatype cast-fn], where cast-fn may return either a new value, :tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (exactly that set if no cast returns :tech.v3.dataset/missing or :tech.v3.dataset/parse-failure). A :tech.v3.dataset/parse-failure result means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.

If the existing datatype is string, then tech.v3.dataset.column/parse-column is called.

Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.v3.dataset.column/parse-column.
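The cast-fn contract described above can be sketched in plain Clojure. cast-values and parse-long-or-fail are hypothetical helpers, not library functions, and a vector stands in for a column. The sketch assumes that a parse failure is recorded as missing in addition to being logged under :unparsed-data and :unparsed-indexes; the library's actual bookkeeping may differ.

```clojure
;; Hypothetical helper: apply cast-fn to every value and track the outcome.
(defn cast-values
  [cast-fn values]
  (reduce-kv
   (fn [acc idx v]
     (let [r (cast-fn v)]
       (case r
         ;; cast-fn declared the value missing
         :tech.v3.dataset/missing
         (-> acc
             (update :data conj nil)
             (update :missing conj idx))
         ;; cast-fn could not parse the value (assumed to also count as missing)
         :tech.v3.dataset/parse-failure
         (-> acc
             (update :data conj nil)
             (update :missing conj idx)
             (update :unparsed-data conj v)
             (update :unparsed-indexes conj idx))
         ;; anything else is the new value
         (update acc :data conj r))))
   {:data [] :missing #{} :unparsed-data [] :unparsed-indexes []}
   (vec values)))

;; Hypothetical cast-fn from string to long.
(defn parse-long-or-fail
  [s]
  (cond
    (nil? s) :tech.v3.dataset/missing
    (re-matches #"-?\d+" s) (Long/parseLong s)
    :else :tech.v3.dataset/parse-failure))

(cast-values parse-long-or-fail ["1" nil "x" "4"])
;; => {:data [1 nil nil 4], :missing #{1 2},
;;     :unparsed-data ["x"], :unparsed-indexes [2]}
```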

column-count

(column-count)

column-labeled-mapseq

(column-labeled-mapseq value-colname-seq)

Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also columnwise-concat

Return a sequence of maps with

  {... - columns not in colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }

column-map

(column-map result-colname map-fn res-dtype-or-opts filter-fn-or-ds)
(column-map result-colname map-fn filter-fn-or-ds)
(column-map result-colname map-fn)

Produce a new (or updated) column as the result of mapping a fn over columns. This function is never lazy - all results are immediately calculated.

  • dataset - dataset.
  • result-colname - Name of new (or existing) column.
  • map-fn - function to map over columns. Same rules as tech.v3.datatype/emap.
  • res-dtype-or-opts - If not given result is scanned to infer missing and datatype. If using an option map, options are described below.
  • filter-fn-or-ds - A dataset, a sequence of columns, or a tech.v3.dataset.column-filters column filter function. Defaults to all the columns of the existing dataset.

Returns a new dataset with a new or updated column.

Options:

  • :datatype - Set the datatype of the result column. If not given, the result is scanned to infer the result datatype and missing set.
  • :missing-fn - If given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Otherwise the missing set is found by scanning the results during the inference process. See tech.v3.dataset.column/union-missing-sets and tech.v3.dataset.column/intersect-missing-sets for example functions to pass in here.

Examples:


  ;;From the tests --

  (let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
    ;;result scanned for both datatype and missing set
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
    ;;result scanned for missing set only.  Result used in-place.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %))
                               {:datatype :float64} [:b]))))
    ;;Nothing scanned at all.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(inc %)
                               {:datatype :float64
                                :missing-fn ds-col/union-missing-sets} [:b]))))
    ;;Missing set scanning causes NPE at inc.
    (is (thrown? Throwable
                 (ds/column-map testds :b2 #(inc %)
                                {:datatype :float64}
                                [:b]))))

  ;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |



user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}

column-names

(column-names)

In-order sequence of column names

column-values->categorical

(column-values->categorical src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values. In the case of one-hot mappings, src-column must be the original column name before the one-hot map.

columns

(columns)

Return sequence of all columns in dataset.

columns-with-missing-seq

(columns-with-missing-seq)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count
  }

or nil if no columns are missing data.

columnwise-concat

(columnwise-concat colnames options)
(columnwise-concat colnames)

Given a dataset and a list of columns, produce a new dataset with the columns concatenated to a new column with a :column column indicating which column the original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |

Options:

  • value-column-name - defaults to :value
  • colname-column-name - defaults to :column

concat

(concat & args)
(concat)

Concatenate datasets using a copying concatenation. See also concat-inplace, which may be more efficient for your use case if you have a small number (fewer than 3 or so) of datasets.

concat-copying

(concat-copying & args)
(concat-copying)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

concat-inplace

(concat-inplace & args)
(concat-inplace)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

data->dataset

(data->dataset)

Convert a data-ized dataset created via dataset->data back into a full dataset

dataset->categorical-xforms

(dataset->categorical-xforms)

Given a dataset, return a map of column-name->xform information.

dataset->data

(dataset->data)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.

dataset-name

(dataset-name)

dataset?

(dataset?)

descriptive-stats

(descriptive-stats)
(descriptive-stats options)

Get descriptive statistics across the columns of the dataset. In addition to the standard stats.

Options:

  • :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names))
  • :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.

drop-columns

(drop-columns colname-seq-or-fn)

Same as remove-columns. Remove columns indexed by column name seq or column filter function. For example:

(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)

drop-missing

(drop-missing)
(drop-missing colname)

Remove missing entries by simply selecting out the missing indexes.

drop-rows

(drop-rows row-indexes)

Drop rows from dataset or column

empty-dataset

(empty-dataset)

ensure-array-backed

(ensure-array-backed options)
(ensure-array-backed)

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are returned unchanged.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

Options:

  • :unpack? - unpack packed datetime types. Defaults to true

feature-ecount

(feature-ecount)

Number of feature columns. Feature columns are columns that are not inference targets.

filter

(filter predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.

filter-column

(filter-column colname predicate)
(filter-column colname)

Filter a given column by a predicate. Predicate is passed column values. If predicate is not an instance of IFn it is treated as a value and will be used as if the predicate is #(= value %).

The 2-arity form of this function reads the column as a boolean reader so for instance numeric 0 values are false in that case as are Double/NaN, Float/NaN. Objects are only false if nil?.

Returns a dataset.
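The predicate-or-value rule can be sketched in plain Clojure. ->predicate is a hypothetical helper, not a library function, and a vector stands in for the column; the sketch uses clojure.core/ifn? as the IFn test.

```clojure
;; Hypothetical helper: functions are used as-is, anything else
;; becomes an equality test against that value.
(defn ->predicate
  [pred-or-value]
  (if (ifn? pred-or-value)
    pred-or-value
    #(= pred-or-value %)))

(def prices [39.81 36.35 43.22])

(filterv (->predicate #(> % 37.0)) prices)
;; => [39.81 43.22]

(filterv (->predicate 36.35) prices)
;; => [36.35]

;; Keywords, sets and maps implement IFn, so they are called rather
;; than compared. A set is a convenient way to match keyword values.
(filterv (->predicate :a) [:a :b])
;; => []
(filterv (->predicate #{:a}) [:a :b])
;; => [:a]
```

Under this reading, matching a column of keywords against a single keyword needs a set or an explicit #(= :a %) predicate.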

filter-dataset

(filter-dataset filter-fn-or-ds)

Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • If filter-fn-or-ds is a dataset, it is returned.
  • If filter-fn-or-ds is sequential, then select-columns is called.
  • If filter-fn-or-ds is :all, all columns are returned
  • If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.

group-by

(group-by key-fn options)
(group-by key-fn)

Produce a map of key-fn-value->dataset. The argument to key-fn is a map of colname->column-value representing a row in dataset. Each dataset in the resulting map contains all and only rows that produce the same key-fn-value.

Options - options are passed into dtype arggroup:

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.
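The grouping and the :group-by-finalizer option can be sketched over plain row maps. group-rows is a hypothetical helper built on clojure.core/group-by, and vectors of maps stand in for datasets; it illustrates the described semantics, not the library's implementation.

```clojure
;; Hypothetical helper: group rows by key-fn, then run the finalizer
;; on each group as soon as it is built.
(defn group-rows
  [key-fn rows {:keys [group-by-finalizer]
                :or {group-by-finalizer identity}}]
  (reduce-kv (fn [acc k group]
               (assoc acc k (group-by-finalizer group)))
             {}
             (group-by key-fn rows)))

(def rows [{:g :x :v 1}
           {:g :y :v 2}
           {:g :x :v 3}])

;; Without a finalizer each value is the group of rows.
(group-rows :g rows {})
;; => {:x [{:g :x, :v 1} {:g :x, :v 3}], :y [{:g :y, :v 2}]}

;; With a finalizer each group is reduced immediately.
(group-rows :g rows {:group-by-finalizer #(reduce + (map :v %))})
;; => {:x 4, :y 2}
```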

group-by->indexes

(group-by->indexes key-fn options)
(group-by->indexes key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

group-by-column

(group-by-column colname options)
(group-by-column colname)

Return a map of column-value->dataset. Each dataset in the resulting map contains all and only rows with the same value in column.

Options:

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.

group-by-column->indexes

(group-by-column->indexes colname options)
(group-by-column->indexes colname)

(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

Options are passed into dtype's arggroup method.

group-by-column-consumer

(group-by-column-consumer cname)

has-column?

(has-column? column-name)

head

(head n)
(head)

Get the first n rows of a dataset. Equivalent to (select-rows ds (range n)). Arguments are reversed, however, so this can be used in ->> operators.

induction

(induction induct-fn & args)

Given a dataset and a function from dataset->row produce a new dataset. The produced row will be merged with the current row and then added to the dataset.

Options are same as the options used for ->dataset in order for the user to control the parsing of the return values of induct-fn. A new dataset is returned.

Example:

user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |

inference-column?

(inference-column?)

inference-target-column-names

(inference-target-column-names)

Return the names of the columns that are inference targets.

inference-target-ds

(inference-target-ds)

Given a dataset return reverse-mapped inference target columns or nil in the case where there are no inference targets.

inference-target-label-inverse-map

(inference-target-label-inverse-map & args)

Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.

inference-target-label-map

(inference-target-label-map & args)

k-fold-datasets

(k-fold-datasets k options)
(k-fold-datasets k)

Given 1 dataset, prepare K datasets using the k-fold algorithm. :randomize-dataset? defaults to true, which will realize the entire dataset, so use with care if you have large datasets.

Returns a sequence of {:test-ds :train-ds}

Options:

  • :randomize-dataset? - When true, shuffle the dataset. In that case 'seed' may be provided. Defaults to true.
  • :seed - when :randomize-dataset? is true then this can either be an implementation of java.util.Random or an integer seed which will be used to construct java.util.Random.
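The shape of the result can be sketched over a plain vector of rows. k-fold-sketch is a hypothetical helper, not the library function: it skips the shuffle step and assigns rows to folds round-robin by index, which is an assumption made for brevity; the library may partition rows differently.

```clojure
;; Hypothetical helper: build k {:test-ds :train-ds} pairs.
;; Row idx goes to the test set of fold (mod idx k) and to the
;; training set of every other fold.
(defn k-fold-sketch
  [k rows]
  (let [indexed (map-indexed vector rows)]
    (vec
     (for [fold (range k)]
       {:test-ds  (vec (for [[idx row] indexed
                             :when (= fold (mod idx k))]
                         row))
        :train-ds (vec (for [[idx row] indexed
                             :when (not= fold (mod idx k))]
                         row))}))))

(k-fold-sketch 3 (range 6))
;; => [{:test-ds [0 3], :train-ds [1 2 4 5]}
;;     {:test-ds [1 4], :train-ds [0 2 3 5]}
;;     {:test-ds [2 5], :train-ds [0 1 3 4]}]
```

Every row appears in exactly one test set, and each pair covers the whole input between its two members.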

labels

(labels)

Return the labels. The labels sequence is the reverse mapped inference column. This returns a single column of data or errors out.

mapseq-reader

(mapseq-reader options)
(mapseq-reader)

Return a reader that produces a map of column-name->column-value upon read.

min-n-by-column

(min-n-by-column cname N comparator options)
(min-n-by-column cname N comparator)
(min-n-by-column cname N)

Find the minimum N entries (unsorted) by column. Resulting data will be indexed in original order. If you want a sorted order then sort the result.

See options to sort-by-column.

Example:

user> (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:

| symbol |       date | price |
|--------|------------|------:|
|   AMZN | 2001-09-01 |  5.97 |
|   AMZN | 2001-10-01 |  6.98 |
|   AAPL | 2000-12-01 |  7.44 |
|   AAPL | 2002-08-01 |  7.38 |
|   AAPL | 2002-09-01 |  7.25 |
|   AAPL | 2002-12-01 |  7.16 |
|   AAPL | 2003-01-01 |  7.18 |
|   AAPL | 2003-02-01 |  7.51 |
|   AAPL | 2003-03-01 |  7.07 |
|   AAPL | 2003-04-01 |  7.11 |
user> (ds/min-n-by-column ds "price" 10 > nil)
test/data/stocks.csv [10 3]:

| symbol |       date |  price |
|--------|------------|-------:|
|   GOOG | 2007-09-01 | 567.27 |
|   GOOG | 2007-10-01 | 707.00 |
|   GOOG | 2007-11-01 | 693.00 |
|   GOOG | 2007-12-01 | 691.48 |
|   GOOG | 2008-01-01 | 564.30 |
|   GOOG | 2008-04-01 | 574.29 |
|   GOOG | 2008-05-01 | 585.80 |
|   GOOG | 2009-11-01 | 583.00 |
|   GOOG | 2009-12-01 | 619.98 |
|   GOOG | 2010-03-01 | 560.19 |

missing

(missing)

Given a dataset or a column, return the missing set as a roaring bitmap

model-type

(model-type & args)

Check the label column after dataset processing. Return either :regression or :classification.

new-column

(new-column data)
(new-column data metadata)
(new-column data metadata missing)
(new-column)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.

new-dataset

(new-dataset ds-metadata column-seq)
(new-dataset column-seq)
(new-dataset)

Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

Options map:

  • :dataset-name - Name of the dataset. Defaults to "_unnamed".
  • :key-fn - Key function used on all column names before insertion into dataset.

The return value fulfills the dataset protocols.

num-inference-classes

(num-inference-classes)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.

order-column-names

(order-column-names colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

pmap-ds

(pmap-ds ds-map-fn options)
(pmap-ds ds-map-fn)

Parallelize mapping a function from dataset->dataset across a single dataset. Results are coalesced back into a single dataset. The original dataset is simply sliced into n-core results and ds-map-fn is called n-core times. ds-map-fn must be a function from dataset->dataset although it may return nil.

Options:

  • :max-batch-size - this is a default for tech.v3.parallel.for/indexed-map-reduce. You can control how many rows are processed in a given batch - the default is 64000. If your mapping pathway produces a large expansion in the size of the dataset then it may be good to reduce the max batch size and use :as-seq to produce a sequence of datasets.
  • :result-type
    • :as-seq - Return a sequence of datasets, one for each batch.
    • :as-ds - Return a single dataset with all results in memory (default option).

print-all

(print-all)

Helper function equivalent to (tech.v3.dataset.print/print-range ... :all)

probability-distributions->label-column

(probability-distributions->label-column dst-colname label-column-datatype)
(probability-distributions->label-column dst-colname)

Given a dataset in which the column names describe labels and each row describes a probability distribution, create a label column by finding the max value in each row and assigning the name of that value's column to the row.
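The row-wise rule can be sketched over plain maps, where each map is one row of label -> probability. row->label is a hypothetical helper, not a library function, and the probabilities are made-up illustration values.

```clojure
;; Hypothetical helper: pick the label whose probability is largest.
(defn row->label
  [row]
  (key (apply max-key val row)))

(def distributions
  [{:cat 0.2 :dog 0.7 :bird 0.1}
   {:cat 0.6 :dog 0.3 :bird 0.1}])

(mapv row->label distributions)
;; => [:dog :cat]
```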

rand-nth

(rand-nth)

Return a random row from the dataset in map format

remove-column

(remove-column col-name)

Same as:

(dissoc dataset col-name)

remove-columns

(remove-columns colname-seq-or-fn)

Remove columns indexed by column name seq or column filter function. For example:

  (remove-columns DS [:A :B])
  (remove-columns DS cf/categorical)

remove-rows

(remove-rows row-indexes)

Same as drop-rows.

rename-columns

(rename-columns colnames)

Rename columns using a map or vector of column names.

Does not reorder columns; rename is in-place for maps and positional for vectors.

replace-missing

(replace-missing)
(replace-missing strategy)
(replace-missing columns-selector strategy)
(replace-missing columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be:

  • seq of any legal column names
  • or a column filter function, such as numeric and categorical

Strategies may be:

  • :down - take value from previous non-missing row if possible else use provided value.

  • :up - take value from next non-missing row if possible else use provided value.

  • :downup - take value from previous if possible else use next.

  • :updown - take value from next if possible else use previous.

  • :nearest - Use nearest of next or previous values. :mid is an alias for :nearest.

  • :midpoint - Use midpoint of averaged values between previous and next nonmissing rows.

  • :abb - Impute missing with approximate bayesian bootstrap. See r's ABB.

  • :lerp - Linearly interpolate values between previous and next nonmissing rows.

  • :value - Value will be provided - see below.

    Value may be provided which will then be used. Value may be a function, in which case it will be called on the column with missing values elided and the return value will be used as the filler.
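The :down and :up strategies can be sketched over a plain vector in which nil marks a missing entry. fill-down and fill-up are hypothetical helpers, not library functions; fallback plays the role of the provided value.

```clojure
;; Hypothetical helper for :down: carry the previous non-missing value
;; forward, using fallback while no previous value exists.
(defn fill-down
  [fallback values]
  (vec (rest (reductions (fn [prev v] (if (nil? v) prev v))
                         fallback
                         values))))

;; Hypothetical helper for :up: the same rule applied from the end.
(defn fill-up
  [fallback values]
  (vec (reverse (fill-down fallback (reverse values)))))

(fill-down 0 [nil 1 nil nil 4 nil])
;; => [0 1 1 1 4 4]

(fill-up 0 [nil 1 nil nil 4 nil])
;; => [1 1 4 4 4 0]
```

The fallback only shows up at the edge where no neighbor exists in the fill direction: the first entry for :down and the last entry for :up.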

replace-missing-value

(replace-missing-value filter-fn-or-ds scalar-value)
(replace-missing-value scalar-value)

reverse-rows

(reverse-rows)

Reverse the rows in the dataset or column.

row-at

(row-at idx)

Get the row at an individual index. If indexes are negative then the dataset is indexed from the end.

user> (ds/row-at stocks 1)
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
 "symbol" "MSFT",
 "price" 36.35}
user> (ds/row-at stocks -1)
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
 "symbol" "AAPL",
 "price" 223.02}

row-count

(row-count)

row-map

(row-map map-fn options)
(row-map map-fn)

Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the ->dataset function so you can control the resulting column types by the usual dataset parsing options described there.

Options:

See options for pmap-ds. In particular, note that you can produce a sequence of datasets as opposed to a single large dataset.

Speed demons should attempt both {:copying? false} and {:copying? true} in the options map as that changes rather drastically how data is read from the datasets. If you are going to read all the data in the dataset, {:copying? true} will most likely be the faster of the two.

Examples:

user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|------:|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/row-map stocks (fn [row]
                                    {"symbol" (keyword (row "symbol"))
                                     :price2 (* (row "price")(row "price"))})))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   :price2 |
|--------|------------|------:|----------:|
|  :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|  :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|  :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|  :MSFT | 2000-04-01 | 28.37 |  804.8569 |
|  :MSFT | 2000-05-01 | 25.45 |  647.7025 |

row-mapcat

(row-mapcat mapcat-fn options)
(row-mapcat mapcat-fn)

Map a function across the rows of the dataset. The function must produce a sequence of maps and the original dataset rows will be duplicated and then merged into the result of calling (->> (apply concat) (->dataset options)) on the result of mapcat-fn. Options are the same as ->dataset.

The smaller the maps returned from mapcat-fn the better, perhaps consider using records. In the case that a mapcat-fn result map has a key that overlaps a column name the column will be replaced with the output of mapcat-fn. The returned map will have the key :_row-id assoc'd onto it so for absolutely minimal gc usage include this as a member variable in your map.

Options:

  • See options for pmap-ds. Especially note :max-batch-size and :result-type. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the tech.v3.dataset.reductions namespace.

Example:

user> (def ds (ds/->dataset {:rid (range 10)
                             :data (repeatedly 10 #(rand-int 3))}))
#'user/ds
user> (ds/head ds)
_unnamed [5 2]:

| :rid | :data |
|-----:|------:|
|    0 |     0 |
|    1 |     2 |
|    2 |     0 |
|    3 |     1 |
|    4 |     2 |
user> (def mapcat-fn (fn [row]
                       (for [idx (range (row :data))]
                         {:idx idx})))
#'user/mapcat-fn
user> (mapcat mapcat-fn (ds/rows ds))
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
user> (ds/row-mapcat ds mapcat-fn)
_unnamed [9 3]:

| :rid | :data | :idx |
|-----:|------:|-----:|
|    1 |     2 |    0 |
|    1 |     2 |    1 |
|    3 |     1 |    0 |
|    4 |     2 |    0 |
|    4 |     2 |    1 |
|    6 |     2 |    0 |
|    6 |     2 |    1 |
|    8 |     2 |    0 |
|    8 |     2 |    1 |
user>

rows

(rows options)
(rows)

Get the rows of the dataset as a list of potentially flyweight maps.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map and it is inappropriate if you are only going to read a very small portion of the row map.
  • :nil-missing? - When true, maps returned have nil values for missing entries as opposed to eliding the missing keys entirely. It is legacy behavior and slightly faster to use :nil-missing? true.

user> (take 5 (ds/rows stocks))
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
  "symbol" "MSFT",
  "price" 39.81}
 {"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
  "symbol" "MSFT",
  "price" 36.35}
 {"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
  "symbol" "MSFT",
  "price" 43.22}
 {"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
  "symbol" "MSFT",
  "price" 28.37}
 {"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
  "symbol" "MSFT",
  "price" 25.45})


user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
[{:a 1, :b 2} {:a 2} {:b 3}]

user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]

rowvec-at

(rowvec-at idx)

Return a persistent-vector-like row at a given index. Negative indexes index from the end.

user> (ds/rowvec-at stocks 1)
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
user> (ds/rowvec-at stocks -1)
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]

rowvecs

(rowvecs options)
(rowvecs)

Return a randomly addressable list of rows in persistent vector-like form.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map and it is inappropriate if you are only going to read a given key for a given row once.

user> (take 5 (ds/rowvecs stocks))
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
 ["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
 ["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
 ["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
 ["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])

sample

(sample n options)
(sample n)
(sample)

Sample n-rows from a dataset. Defaults to sampling without replacement.

For the definition of seed, see the argshuffle documentation: https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle

The returned dataset's metadata is altered by merging in {:print-index-range (range n)} so you will always see the entire returned dataset. If this isn't desired, vary-meta is a good pathway.

Options:

  • :replacement? - Do sampling with replacement. Defaults to false.
  • :seed - Provide a seed as a number or provide a Random implementation.

select

(select colname-seq selection)

Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.

colname-seq - one of:

  • :all - all the columns
  • sequence of column names - those columns in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order; selected columns are subsequently named after the corresponding value in the map. Similar to rename-columns except this trims the result to be only the columns in the map.

selection - either keyword :all, a list of indexes to select, or a list of booleans where the index position of each true value indicates an index to select. When providing indices, duplicates will select the specified index position more than once.

select-by-index

(select-by-index col-index row-index)

Trim dataset according to this sequence of indexes. Returns a new dataset.

col-index and row-index - one of:

  • :all - all the columns (for col-index) or all the rows (for row-index)
  • list of indexes. May contain duplicates. Negative values will be counted from the end of the sequence.

select-columns

(select-columns colname-seq-or-fn)

Select columns from the dataset by:

  • seq of column names
  • column selector function
  • :all keyword

For example:

(select-columns DS [:A :B])
(select-columns DS cf/numeric)
(select-columns DS :all)

select-columns-by-index

(select-columns-by-index col-index)

Select columns from the dataset by a seq of indexes (negative values allowed) or :all.

See documentation for select-by-index.

select-missing

(select-missing)

Select only the rows that have missing entries, by simply selecting the missing indexes.

select-rows

(select-rows row-indexes options)
(select-rows row-indexes)

Select rows from the dataset or column.

set-dataset-name

(set-dataset-name ds-name)

set-inference-target

(set-inference-target target-name-or-target-name-seq)

Set the inference target on the column. This sets the :column-type member of the column metadata to :inference-target?.

shape

(shape)

Returns shape in column-major format of n-columns n-rows.

shuffle

(shuffle options)
(shuffle)

Shuffle the rows of the dataset optionally providing a seed. See https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle.

sort-by

(sort-by key-fn compare-fn & args)(sort-by key-fn)

Sort a dataset by a key-fn and compare-fn.

  • key-fn - function from map to sort value.
  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.

sort-by-column

(sort-by-column colname compare-fn & args)(sort-by-column colname)

Sort a dataset by a given column using the given compare fn.

  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.
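A short sketch of sorting in descending order with a plain Clojure operator as compare-fn. It assumes tech.v3.dataset is on the classpath; the aliases ds and ds-mm, the column names and the data are illustrative.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.metamorph :as ds-mm])

;; Illustrative sample data.
(def sort-ctx
  {:metamorph/data (ds/->dataset {:id    [1 2 3]
                                  :score [0.5 2.0 1.25]})})

;; clojure.core/> as compare-fn sorts :score from largest to smallest.
(def sorted ((ds-mm/sort-by-column :score >) sort-ctx))

;; Row order after sorting, read through the :id column.
(vec ((:metamorph/data sorted) :id))
```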

tail

(tail n)(tail)

Get the last n rows of a dataset. Equivalent to (select-rows ds (range ...)). Argument order is dataset-last, however, so this can be used in ->> operators.

take-nth

(take-nth n-val)

train-test-split

(train-test-split options)(train-test-split)

Probabilistically split the dataset returning a map of {:train-ds :test-ds}.

Options:

  • :randomize-dataset? - When true, shuffle the dataset, in which case :seed may be provided. Defaults to true.
  • :seed - when :randomize-dataset? is true then this can either be an implementation of java.util.Random or an integer seed which will be used to construct java.util.Random.
  • :train-fraction - Fraction of the dataset to use as training set. Defaults to 0.7.
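The options above could be combined as in the following sketch. It assumes tech.v3.dataset is on the classpath and that, as described in the namespace overview, the wrapped function's return value (here the map with :train-ds and :test-ds) is stored under :metamorph/data. The aliases and sample data are illustrative.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.metamorph :as ds-mm])

;; Illustrative ten-row dataset.
(def split-ctx
  {:metamorph/data (ds/->dataset {:x (vec (range 10))})})

;; A fixed integer seed keeps the shuffle reproducible.
(def split
  (:metamorph/data
   ((ds-mm/train-test-split {:seed 42 :train-fraction 0.7}) split-ctx)))

;; Every input row ends up in exactly one of the two datasets.
(+ (ds/row-count (:train-ds split))
   (ds/row-count (:test-ds split)))
```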

unique-by

(unique-by options map-fn)(unique-by map-fn)

Map-fn is passed a map for each row, and rows are grouped by its return value. Keep-fn is used to decide which index to keep for each group.

Options:

  • :keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).
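To make the grouping and keep-fn contract concrete, here is a plain-Clojure sketch over a vector of row maps. It is not the library's implementation; unique-by-sketch and the sample rows are hypothetical, and the surviving indexes are sorted only to make the output easy to read.

```clojure
;; Rows modelled as plain maps; no dataset library is needed for this sketch.
(def rows [{:k :a :v 1}
           {:k :b :v 2}
           {:k :a :v 3}])

(defn unique-by-sketch
  "Hypothetical helper: groups row indexes by (map-fn row), then lets
  keep-fn pick one index per group. Returns the kept indexes, sorted."
  [keep-fn map-fn rows]
  (->> (map-indexed vector rows)                 ; [idx row] pairs
       (group-by (fn [[_ row]] (map-fn row)))    ; key -> [[idx row] ...]
       (map (fn [[k pairs]] (keep-fn k (map first pairs))))
       sort
       vec))

;; Default-style keep-fn: keep the first index of each group.
(unique-by-sketch #(first %2) :k rows)   ;; => [0 1]

;; Alternative keep-fn: keep the last index of each group.
(unique-by-sketch #(last %2) :k rows)    ;; => [1 2]
```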

unique-by-column

(unique-by-column options colname)(unique-by-column colname)

Rows are grouped by their value in the column colname. Keep-fn is used to decide which index to keep for each group.

Options:

  • :keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).

unordered-select

(unordered-select colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

unroll-column

(unroll-column column-name)(unroll-column column-name options)

Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with the same columns, with the values of the other columns duplicated wherever a row was unrolled. The unrolled column then contains only scalar data.

Any missing indexes are dropped.

user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |

Options:

  • :datatype - datatype of the resulting column if one aside from :object is desired.
  • :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword) and the column will be named this.

update

(update filter-fn-or-ds update-fn & args)

Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into original dataset.

This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • filter-fn-or-ds is a generalized parameter. May be a function, a dataset or a sequence of column names.
  • update-fn must take the dataset as the first argument and must return a dataset.

For example:

(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))

update-column

(update-column col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.

update-columns

(update-columns column-name-seq-or-fn update-fn)

Update a sequence of columns selected by column name seq or column selector function.

For example:

(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))

update-columnwise

(update-columnwise filter-fn-or-ds cwise-update-fn & args)

Call cwise-update-fn on each selected column of the dataset. Returns the dataset. See the arguments to update.

update-elemwise

(update-elemwise filter-fn-or-ds map-fn)(update-elemwise map-fn)

Replace all elements in the selected columns by calling map-fn on each element. filter-fn-or-ds follows the same rules as in update; if it is given as a sequence, it must be a sequence of column names. Implicitly clears the missing set, so map-fn must deal with type-specific missing values correctly. Returns a new dataset.

value-reader

(value-reader options)(value-reader)

Return a reader that produces a reader of column values per index.

Options:

  • :copying? - Defaults to false. When true, row values are copied on read.

write!

(write! output-path options)(write! output-path)

Write a dataset out to a file. Supported forms are:

(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)

Options:

  • :max-chars-per-column - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
  • :max-num-columns - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of columns an exception will be thrown during serialization.
  • :quoted-columns - csv specific - sequence of column names that you would like to always have quoted.
  • :file-type - Manually specify the file type. This is usually inferred from the filename but if you pass in an output stream then you will need to specify the file type.
  • :headers? - Whether csv headers are written. Defaults to true.