tech.v3.dataset

Column-major dataset abstraction for efficiently manipulating in-memory datasets.

->>dataset

(->>dataset options dataset)(->>dataset dataset)

Please see documentation of ->dataset. Options are the same.

->dataset

(->dataset dataset options)(->dataset dataset)

Create a dataset from either csv/tsv or a sequence of maps.

  • A String will be interpreted as a file (or a gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv and will likewise attempt to detect the column datatypes; all of this can be overridden.

  • InputStreams have no file type and thus a file-type must be provided in the options.

  • A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

The parquet, xlsx, and xls formats require that you load the appropriate support namespaces: tech.v3.libs.parquet for parquet, tech.v3.libs.fastexcel for xlsx, and tech.v3.libs.poi for xls.

Arrow support is provided via the tech.v3.libs.arrow namespace rather than a file-type overload, as the Arrow project currently has 3 different file types and it is not clear what their final suffixes will be or which of the three file types a given suffix will indicate. Please see the documentation in the tech.v3.libs.arrow namespace for further information on Arrow file types.

Options:

  • :dataset-name - set the name of the dataset.

  • :file-type - Override the filetype discovery mechanism for strings or force a particular parser for an InputStream. Note that parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :parquet}.

  • :gzipped? - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.

  • :column-allowlist - either a sequence of string column names or a sequence of column indices of columns to allow. This is preferred over :column-whitelist.

  • :column-blocklist - either a sequence of string column names or a sequence of column indices of columns to block. This is preferred over :column-blacklist.

  • :num-rows - Number of rows to read

  • :header-row? - Defaults to true, indicates the first row is a header.

  • :key-fn - function to be applied to column names. Typical use is: :key-fn keyword.

  • :separator - Add a character separator to the list of separators to auto-detect.

  • :csv-parser - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This allows you to parse anything that univocity supports (flat files and such).

  • :bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows and in this case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.

  • :skip-bad-rows? - Legacy option. Use :bad-row-policy.

  • :disable-comment-skipping? - By default, the # character is recognised as a line comment when found at the beginning of a line in a CSV file, and that row is ignored. Set to true to disable this behavior.

  • :max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.

  • :max-num-columns - Defaults to 8192. CSV,TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301

  • :text-temp-dir - The temporary directory to use for file-backed text. Setting this value to boolean false turns off file-backed text, which is the default. If a tech.v3.resource stack context is opened, the file will be deleted when the context closes; else it will be deleted when the gc cleans up the dataset. A shutdown hook is added as a last resort to ensure the file is cleaned up.

  • :n-initial-skip-rows - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.

  • :parser-type - Default parser to use if no parser-fn is specified for that column. For csv files, the default parser type is :string which indicates a promotional string parser. For sequences of maps, the default parser type is :object. It can be useful in some contexts to use the :string parser with sequences of maps or maps of columns.

  • :parser-fn -

    • keyword? - all columns parsed to this datatype. For example: {:parser-fn :string}
    • map? - {column-name parse-method} parse each column with specified parse-method. The parse-method can be:
      • keyword? - parse the specified column to this datatype. For example: {:parser-fn {:answer :boolean :id :int32}}
      • tuple - pair of [datatype parse-data] in which case container of type [datatype] will be created. parse-data can be one of:
        • :relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.
        • fn? - a function from str -> one of :tech.v3.dataset/missing, :tech.v3.dataset/parse-failure, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to the missing set and the column's :unparsed-values and :unparsed-indexes being updated.
        • string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For :text you can specify the backing file to use.
        • DateTimeFormatter - use with the appropriate temporal parse static function to parse the value.
      • map? - the header-name-or-idx is used to look up the value. If not nil, then the value can be any of the above options. Else the default column parser is used.

Returns a new dataset
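
For example, a short sketch using options documented above (the csv path matches the stocks file used in examples throughout this document):

(require '[tech.v3.dataset :as ds])

;; keywordize column names and limit the read to 100 rows
(ds/->dataset "test/data/stocks.csv"
              {:key-fn keyword
               :dataset-name "stocks"
               :num-rows 100})

;; force every column to parse as :string
(ds/->dataset "test/data/stocks.csv" {:parser-fn :string})

;; datatypes for a sequence of maps are derived by scanning the first N maps
(ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}])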

add-column

(add-column dataset column)

Add a new column. Errors if there is a name collision.

add-or-update-column

(add-or-update-column dataset colname column)(add-or-update-column dataset column)

If column exists, replace. Else append new column.

all-descriptive-stats-names

(all-descriptive-stats-names)

Returns the names of all descriptive stats in the order they will be returned in the resulting dataset of descriptive stats. This allows easy filtering, for example:

(descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names)
                                        (remove #{:values :num-distinct-values}))})

append-columns

(append-columns dataset column-seq)

assoc-ds

(assoc-ds dataset cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).
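
For example, a minimal sketch:

(ds/assoc-ds nil :a [1 2 3] :b ["x" "y" "z"])
;; => a new 3-row dataset with columns :a and :b even though the input was nil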

assoc-metadata

(assoc-metadata dataset filter-fn-or-ds k v & args)

Set metadata across a set of columns.

bind->

macro

(bind-> expr name & args)

Threads like -> but binds name to expr like as->:

(ds/bind-> (ds/->dataset "test/data/stocks.csv") ds
           (assoc :logprice2 (dfn/log1p (ds "price")))
           (assoc :logp3 (dfn/* 2 (ds :logprice2)))
           (ds/select-columns ["price" :logprice2 :logp3])
           (ds-tens/dataset->tensor)
           (first))

brief

(brief ds options)(brief ds)

Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.

categorical->number

(categorical->number dataset filter-fn-or-ds)(categorical->number dataset filter-fn-or-ds table-args)(categorical->number dataset filter-fn-or-ds table-args result-datatype)

Convert columns into a discrete, numeric representation. See tech.v3.dataset.categorical/fit-categorical-map.
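
For example, a minimal sketch where a sequence of column names acts as the filter (the resulting mapping is stored in column metadata per tech.v3.dataset.categorical):

(def fruit-ds (ds/->dataset {:fruit ["apple" "pear" "apple"]}))
(ds/categorical->number fruit-ds [:fruit])
;; :fruit is now numeric; each distinct string maps to a distinct number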

categorical->one-hot

(categorical->one-hot dataset filter-fn-or-ds)(categorical->one-hot dataset filter-fn-or-ds table-args)(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot

column

(column dataset colname)

column->dataset

(column->dataset dataset colname transform-fn options)(column->dataset dataset colname transform-fn)

Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.

column-cast

(column-cast dataset colname datatype)

Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.

colname may be a scalar or a tuple of src-col dst-col.

datatype may be a datatype enumeration or a tuple of datatype cast-fn where cast-fn may return either a new value, :tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.

If the existing datatype is string, then tech.v3.datatype.column/parse-column is called.

Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.v3.dataset.column/parse-column.
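
For example, a sketch using both the tuple form of colname and the tuple form of datatype:

(def stocks (ds/->dataset "test/data/stocks.csv"))

;; cast with an explicit cast-fn
(ds/column-cast stocks "price" [:int32 #(Math/round (double %))])

;; src-col/dst-col tuple: keep "price" and add a string-typed "price-str"
(ds/column-cast stocks ["price" "price-str"] :string)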

column-count

(column-count dataset)

column-labeled-mapseq

(column-labeled-mapseq dataset value-colname-seq)

Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also columnwise-concat

Return a sequence of maps with

  {... - columns not in value-colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }

column-map

(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)(column-map dataset result-colname map-fn filter-fn-or-ds)(column-map dataset result-colname map-fn)

Produce a new (or updated) column as the result of mapping a fn over columns. This function is never lazy - all results are immediately calculated.

  • dataset - dataset.
  • result-colname - Name of new (or existing) column.
  • map-fn - function to map over columns. Same rules as tech.v3.datatype/emap.
  • res-dtype-or-opts - If not given, the result is scanned to infer the missing set and datatype. If using an options map, the options are described below.
  • filter-fn-or-ds - A dataset, a sequence of columns, or a tech.v3.dataset.column-filters column filter function. Defaults to all the columns of the existing dataset.

Returns a new dataset with a new or updated column.

Options:

  • :datatype - Set the datatype of the result column. If not given, the result is scanned to infer the result datatype and missing set.
  • :missing-fn - if given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Else the missing set is determined by scanning the results during the inference process. See tech.v3.dataset.column/union-missing-sets and tech.v3.dataset.column/intersect-missing-sets for example functions to pass in here.

Examples:


  ;;From the tests --

  (let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
    ;;result scanned for both datatype and missing set
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
    ;;result scanned for missing set only.  Result used in-place.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %))
                               {:datatype :float64} [:b]))))
    ;;Nothing scanned at all.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(inc %)
                               {:datatype :float64
                                :missing-fn ds-col/union-missing-sets} [:b]))))
    ;;Missing set scanning causes NPE at inc.
    (is (thrown? Throwable
                 (ds/column-map testds :b2 #(inc %)
                                {:datatype :float64}
                                [:b]))))

  ;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |



user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}

column-map-m

macro

(column-map-m ds result-colname src-colnames body)

Map a function across one or more columns via a macro. The function receives arguments in the order of src-colnames. Column names of the form right.id will be bound to variables named right-id.

Example:

user> (-> (ds/->dataset [{:a.a 1} {:b 2.0} {:a.a 2 :b 3.0}])
          (ds/column-map-m :a [:a.a :b]
                           (when (and a-a b)
                             (+ (double a-a) (double b)))))
_unnamed [3 3]:

|  :b | :a.a |  :a |
|----:|-----:|----:|
|     |    1 |     |
| 2.0 |      |     |
| 3.0 |    2 | 5.0 |

column-names

(column-names dataset)

In-order sequence of column names

columns

(columns dataset)

Return sequence of all columns in dataset.

columns-with-missing-seq

(columns-with-missing-seq dataset)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count
  }

or nil if no columns are missing data.

columnwise-concat

(columnwise-concat dataset colnames options)(columnwise-concat dataset colnames)

Given a dataset and a list of columns, produce a new dataset in which those columns are concatenated into a single value column, with an additional :column column indicating which column each original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |

Options:

  • :value-column-name - defaults to :value
  • :colname-column-name - defaults to :column

concat

(concat dataset & args)(concat)

Concatenate datasets using a copying concatenation. See also concat-inplace, which may be more efficient for your use case if you have a small number (less than 3, say) of datasets.

concat-copying

(concat-copying dataset & args)(concat-copying)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

concat-inplace

(concat-inplace dataset & args)(concat-inplace)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
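
For example, a minimal sketch of either concatenation pathway:

(def ds-a (ds/->dataset {:x [1 2]}))
(def ds-b (ds/->dataset {:x [3 4]}))
(ds/concat-copying ds-a ds-b) ;; 4-row dataset; data is copied
(ds/concat-inplace ds-a ds-b) ;; 4-row dataset; columns reference the originals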

data->dataset

(data->dataset input)

Convert a data-ized dataset created via dataset->data back into a full dataset

dataset->data

(dataset->data ds)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.
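
For example, a round-trip sketch:

(-> (ds/->dataset {:a [1 2] :b ["x" "y"]})
    (ds/dataset->data)
    (ds/data->dataset))
;; => a dataset equivalent to the original; the intermediate value is a
;;    pure-Clojure map of {:metadata ... :columns [...]}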

dataset-name

(dataset-name dataset)

dataset-parser

(dataset-parser options)(dataset-parser)

Implements protocols/PDatasetParser, Counted, Indexed, IReduceInit, and IDeref (returns the new dataset). See documentation for mapseq-parser.

dataset?

(dataset? ds)

descriptive-stats

(descriptive-stats dataset)(descriptive-stats dataset options)

Get descriptive statistics across the columns of the dataset.

Options:

  • :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names)).
  • :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.
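
For example, a sketch restricting the output to a few stat names (assuming these appear in (all-descriptive-stats-names)):

(ds/descriptive-stats stocks {:stat-names [:col-name :datatype :min :mean :max]})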

drop-columns

(drop-columns dataset colname-seq-or-fn)

Same as remove-columns. Remove columns indexed by column name seq or column filter function. For example:

(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)

drop-missing

(drop-missing dataset-or-col)(drop-missing ds colname)

Remove missing entries by simply selecting out the missing indexes.
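
For example:

(ds/drop-missing (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
;; => only the first row remains

;; the 2-arity form only considers missing values in one column
(ds/drop-missing (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) :b)
;; => rows 1 and 3 remain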

drop-rows

(drop-rows dataset-or-col row-indexes)

Drop rows from dataset or column

empty-dataset

(empty-dataset)

ensure-array-backed

(ensure-array-backed ds options)(ensure-array-backed ds)

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and are returned as-is.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

Options:

  • :unpack? - unpack packed datetime types. Defaults to true

filter

(filter dataset predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.
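
For example, a sketch assuming the stocks dataset used throughout this document:

(ds/filter stocks (fn [row] (> (row "price") 100.0)))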

filter-column

(filter-column dataset colname predicate)(filter-column dataset colname)

Filter a given column by a predicate. The predicate is passed column values. If the predicate is not an instance of IFn it is treated as a value and used as if the predicate were #(= value %).

The 2-arity form of this function reads the column as a boolean reader, so numeric 0 values are false in that case, as are Double/NaN and Float/NaN. Objects are false only when nil.

Returns a dataset.
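
For example:

(ds/filter-column stocks "price" #(> % 100.0))
;; a non-IFn predicate is treated as an equality filter:
(ds/filter-column stocks "symbol" "MSFT")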

filter-dataset

(filter-dataset dataset filter-fn-or-ds)

Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • If filter-fn-or-ds is a dataset, it is returned.
  • If filter-fn-or-ds is sequential, then select-columns is called.
  • If filter-fn-or-ds is :all, all columns are returned
  • If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.

group-by

(group-by dataset key-fn options)(group-by dataset key-fn)

Produce a map of key-fn-value->dataset. The argument to key-fn is a map of colname->column-value representing a row in dataset. Each dataset in the resulting map contains all and only rows that produce the same key-fn-value.

Options - options are passed into dtype arggroup:

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.
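
For example, a sketch grouping by a computed key and using :group-by-finalizer to reduce each group immediately:

(ds/group-by stocks (fn [row] (row "symbol")))
;; => map of symbol -> sub-dataset

(ds/group-by stocks (fn [row] (row "symbol"))
             {:group-by-finalizer ds/row-count})
;; => map of symbol -> row count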

group-by->indexes

(group-by->indexes dataset key-fn options)(group-by->indexes dataset key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

group-by-column

(group-by-column dataset colname options)(group-by-column dataset colname)

Return a map of column-value->dataset. Each dataset in the resulting map contains all and only rows with the same value in column.

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.

group-by-column->indexes

(group-by-column->indexes dataset colname options)(group-by-column->indexes dataset colname)

(Non-lazy) - Group a dataset by a column return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

Options are passed into dtype's arggroup method.

group-by-column-consumer

(group-by-column-consumer ds cname)

has-column?

(has-column? dataset column-name)

head

(head dataset n)(head dataset)

Get the first n rows of a dataset. Equivalent to (select-rows ds (range n)). The arguments may also be passed in reverse order (dataset last), so this can be used in ->> operators.

induction

(induction ds induct-fn & args)

Given a dataset and a function from dataset->row produce a new dataset. The produced row will be merged with the current row and then added to the dataset.

Options are same as the options used for ->dataset in order for the user to control the parsing of the return values of induct-fn. A new dataset is returned.

Example:

user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |

major-version

mapseq-parser

(mapseq-parser options)(mapseq-parser)

Return a clojure function that when called with one arg that arg must be the next map to add to the dataset. When called with no args returns the current dataset. This can be used to efficiently transform a stream of maps into a dataset while getting intermediate datasets during the parse operation.

Options are the same for ->dataset.

user> (require '[tech.v3.dataset :as ds])
nil
user> (def pfn (ds/mapseq-parser))
#'user/pfn
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 2 :c 3})
nil
user> (pfn)
_unnamed [3 3]:

| :a | :b | :c |
|---:|---:|---:|
|  1 |  2 |    |
|  1 |  2 |    |
|  2 |    |  3 |
user> (pfn {:a 3 :d 4})
nil
user> (pfn {:a 5 :c 6})
nil
user> (pfn)
_unnamed [5 4]:

| :a | :b | :c | :d |
|---:|---:|---:|---:|
|  1 |  2 |    |    |
|  1 |  2 |    |    |
|  2 |    |  3 |    |
|  3 |    |    |  4 |
|  5 |    |  6 |    |

mapseq-reader

(mapseq-reader dataset options)(mapseq-reader dataset)

Return a reader that produces a map of column-name->column-value upon read.

mapseq-rf

(mapseq-rf)(mapseq-rf options)

Create a transduce-compatible rf that reduces a sequence of maps into a dataset. Same options as ->dataset.

user> (transduce (map identity) (ds/mapseq-rf {:dataset-name :transduced}) [{:a 1 :b 2}])
:transduced [1 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |

min-n-by-column

(min-n-by-column dataset cname N comparator options)(min-n-by-column dataset cname N comparator)(min-n-by-column dataset cname N)

Find the minimum N entries (unsorted) by column. Resulting data will be indexed in original order. If you want a sorted order then sort the result.

See options to sort-by-column.

Example:

user> (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:

| symbol |       date | price |
|--------|------------|------:|
|   AMZN | 2001-09-01 |  5.97 |
|   AMZN | 2001-10-01 |  6.98 |
|   AAPL | 2000-12-01 |  7.44 |
|   AAPL | 2002-08-01 |  7.38 |
|   AAPL | 2002-09-01 |  7.25 |
|   AAPL | 2002-12-01 |  7.16 |
|   AAPL | 2003-01-01 |  7.18 |
|   AAPL | 2003-02-01 |  7.51 |
|   AAPL | 2003-03-01 |  7.07 |
|   AAPL | 2003-04-01 |  7.11 |
user> (ds/min-n-by-column ds "price" 10 > nil)
test/data/stocks.csv [10 3]:

| symbol |       date |  price |
|--------|------------|-------:|
|   GOOG | 2007-09-01 | 567.27 |
|   GOOG | 2007-10-01 | 707.00 |
|   GOOG | 2007-11-01 | 693.00 |
|   GOOG | 2007-12-01 | 691.48 |
|   GOOG | 2008-01-01 | 564.30 |
|   GOOG | 2008-04-01 | 574.29 |
|   GOOG | 2008-05-01 | 585.80 |
|   GOOG | 2009-11-01 | 583.00 |
|   GOOG | 2009-12-01 | 619.98 |
|   GOOG | 2010-03-01 | 560.19 |

missing

(missing dataset-or-col)

Given a dataset or a column, return the missing set as a roaring bitmap

new-column

(new-column name data)(new-column name data metadata)(new-column name data metadata missing)(new-column data-or-data-map)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.
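
For example:

(ds/new-column :a [1 nil 3])
;; the data is scanned, so index 1 lands in the missing set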

new-dataset

(new-dataset options ds-metadata column-seq)(new-dataset options column-seq)(new-dataset column-seq)

Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

Options:

  • :dataset-name - Name of the dataset. Defaults to "_unnamed".
  • :key-fn - Key function used on all column names before insertion into the dataset.

The return value fulfills the dataset protocols.
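
For example, a minimal sketch:

(ds/new-dataset {:dataset-name "demo"} [[1 2 3] [4 5 6]])
;; => a 3-row, 2-column dataset with ordinally-named columns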

order-column-names

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

pmap-ds

(pmap-ds ds ds-map-fn options)(pmap-ds ds ds-map-fn)

Parallelize mapping a function from dataset->dataset across a single dataset. Results are coalesced back into a single dataset. The original dataset is simply sliced into n-core sub-datasets and ds-map-fn is called n-core times. ds-map-fn must be a function from dataset->dataset, although it may return nil.

Options:

  • :max-batch-size - this is a default for tech.v3.parallel.for/indexed-map-reduce. You can control how many rows are processed in a given batch - the default is 64000. If your mapping pathway produces a large expansion in the size of the dataset then it may be good to reduce the max batch size and use :as-seq to produce a sequence of datasets.
  • :result-type
    • :as-seq - Return a sequence of datasets, one for each batch.
    • :as-ds - Return a single dataset with all results in memory (the default).
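
For example, a sketch assuming tech.v3.datatype.functional is aliased as dfn:

(require '[tech.v3.datatype.functional :as dfn])

(ds/pmap-ds stocks
            (fn [batch]
              (assoc batch "price2" (dfn/* 2.0 (batch "price"))))
            {:max-batch-size 10000})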

print-all

(print-all dataset)

Helper function equivalent to (tech.v3.dataset.print/print-range ... :all)

rand-nth

(rand-nth dataset)

Return a random row from the dataset in map format

remove-column

(remove-column dataset col-name)

Same as:

(dissoc dataset col-name)

remove-columns

(remove-columns dataset colname-seq-or-fn)

Remove columns indexed by column name seq or column filter function. For example:

  (remove-columns DS [:A :B])
  (remove-columns DS cf/categorical)

remove-rows

(remove-rows dataset-or-col row-indexes)

Same as drop-rows.

rename-columns

(rename-columns dataset colnames)

Rename columns using a map or vector of column names.

Does not reorder columns; rename is in-place for maps and positional for vectors.

replace-missing

(replace-missing ds)(replace-missing ds strategy)(replace-missing ds columns-selector strategy)(replace-missing ds columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be:

  • seq of any legal column names
  • or a column filter function, such as numeric and categorical

Strategies may be:

  • :down - take value from previous non-missing row if possible else use provided value.

  • :up - take value from next non-missing row if possible else use provided value.

  • :downup - take value from previous if possible else use next.

  • :updown - take value from next if possible else use previous.

  • :nearest - Use nearest of next or previous values. :mid is an alias for :nearest.

  • :midpoint - Use midpoint of averaged values between previous and next nonmissing rows.

  • :abb - Impute missing with approximate Bayesian bootstrap. See R's ABB.

  • :lerp - Linearly interpolate values between previous and next nonmissing rows.

  • :value - Value will be provided - see below.

    A value may be provided, which will then be used. The value may be a function, in which case it is called on the column with missing values elided and the return value is used as the filler.
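
For example (dfn here aliases tech.v3.datatype.functional):

(def dsm (ds/->dataset {:a [nil 2 nil 4]}))
(ds/replace-missing dsm [:a] :value 0)        ;; :a becomes [0 2 0 4]
(ds/replace-missing dsm [:a] :downup)         ;; :a becomes [2 2 2 4]
(ds/replace-missing dsm [:a] :value dfn/mean) ;; missing entries replaced by the
                                              ;; mean of the non-missing values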

replace-missing-value

(replace-missing-value dataset filter-fn-or-ds scalar-value)(replace-missing-value dataset scalar-value)

reverse-rows

(reverse-rows dataset-or-col)

Reverse the rows in the dataset or column.

row-at

(row-at ds idx)

Get the row at an individual index. If indexes are negative then the dataset is indexed from the end.

user> (ds/row-at stocks 1)
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
 "symbol" "MSFT",
 "price" 36.35}
user> (ds/row-at stocks -1)
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
 "symbol" "AAPL",
 "price" 223.02}

row-count

(row-count dataset-or-col)

row-map

(row-map ds map-fn options)(row-map ds map-fn)

Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the ->dataset function so you can control the resulting column types by the usual dataset parsing options described there.

Options:

See options for pmap-ds. In particular, note that you can produce a sequence of datasets as opposed to a single large dataset.

Speed demons should attempt both {:copying? false} and {:copying? true} in the options map as that changes rather drastically how data is read from the datasets. If you are going to read all the data in the dataset, {:copying? true} will most likely be the faster of the two.

Examples:

user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|------:|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/row-map stocks (fn [row]
                                    {"symbol" (keyword (row "symbol"))
                                     :price2 (* (row "price")(row "price"))})))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   :price2 |
|--------|------------|------:|----------:|
|  :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|  :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|  :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|  :MSFT | 2000-04-01 | 28.37 |  804.8569 |
|  :MSFT | 2000-05-01 | 25.45 |  647.7025 |

row-mapcat

(row-mapcat ds mapcat-fn options)(row-mapcat ds mapcat-fn)

Map a function across the rows of the dataset. The function must produce a sequence of maps; the original dataset rows will be duplicated and then merged into the result of calling (->> (apply concat) (->>dataset options)) on the result of mapcat-fn. Options are the same as ->dataset.

The smaller the maps returned from mapcat-fn, the better; perhaps consider using records. In the case that a mapcat-fn result map has a key that overlaps a column name, the column will be replaced with the output of mapcat-fn. The returned map will have the key :_row-id assoc'd onto it, so for absolutely minimal gc usage include this as a member variable in your map.

Options:

  • See options for pmap-ds. Especially note :max-batch-size and :result-type. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the tech.v3.dataset.reductions namespace.

Example:

user> (def ds (ds/->dataset {:rid (range 10)
                             :data (repeatedly 10 #(rand-int 3))}))
#'user/ds
user> (ds/head ds)
_unnamed [5 2]:

| :rid | :data |
|-----:|------:|
|    0 |     0 |
|    1 |     2 |
|    2 |     0 |
|    3 |     1 |
|    4 |     2 |
user> (def mapcat-fn (fn [row]
                       (for [idx (range (row :data))]
                         {:idx idx})))
#'user/mapcat-fn
user> (mapcat mapcat-fn (ds/rows ds))
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
user> (ds/row-mapcat ds mapcat-fn)
_unnamed [9 3]:

| :rid | :data | :idx |
|-----:|------:|-----:|
|    1 |     2 |    0 |
|    1 |     2 |    1 |
|    3 |     1 |    0 |
|    4 |     2 |    0 |
|    4 |     2 |    1 |
|    6 |     2 |    0 |
|    6 |     2 |    1 |
|    8 |     2 |    0 |
|    8 |     2 |    1 |
user>

rows

(rows ds options)(rows ds)

Get the rows of the dataset as a list of potentially flyweight maps.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map; it is inappropriate if you are only going to read a very small portion of the row map.
  • :nil-missing? - When true, the returned maps have nil values for missing entries as opposed to eliding the missing keys entirely. This is legacy behavior, and it is slightly faster to use :nil-missing? true.

user> (take 5 (ds/rows stocks))
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
  "symbol" "MSFT",
  "price" 39.81}
 {"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
  "symbol" "MSFT",
  "price" 36.35}
 {"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
  "symbol" "MSFT",
  "price" 43.22}
 {"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
  "symbol" "MSFT",
  "price" 28.37}
 {"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
  "symbol" "MSFT",
  "price" 25.45})


user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
[{:a 1, :b 2} {:a 2} {:b 3}]

user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]

rowvec-at

(rowvec-at ds idx)

Return a persistent-vector-like row at a given index. Negative indexes index from the end.

user> (ds/rowvec-at stocks 1)
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
user> (ds/rowvec-at stocks -1)
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]

rowvecs

(rowvecs ds options)(rowvecs ds)

Return a randomly addressable list of rows in persistent vector-like form.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map; it is inappropriate if you are only going to read a given key for a given row once.

user> (take 5 (ds/rowvecs stocks))
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
 ["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
 ["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
 ["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
 ["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])

sample

(sample dataset n options)(sample dataset n)(sample dataset)

Sample n rows from a dataset. Defaults to sampling without replacement.

For the definition of seed, see the argshuffle documentation: https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle

The returned dataset's metadata is altered by merging in {:print-index-range (range n)}, so you will always see the entire returned dataset when printing. If this isn't desired, vary-meta is a good pathway.

Options:

  • :replacement? - Do sampling with replacement. Defaults to false.
  • :seed - Provide a seed as a number or provide a Random implementation.
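
For example:

(ds/sample stocks 5 {:seed 42})           ;; reproducible sample without replacement
(ds/sample stocks 5 {:replacement? true})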

select

(select dataset colname-seq selection)

Reorder/trim dataset according to this sequence of indexes. Returns a new dataset. colname-seq - one of:

  • :all - all the columns
  • sequence of column names - those columns in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order and selected columns are subsequently named after the corresponding value in the map. Similar to rename-columns except this trims the result to only the columns in the map.

selection - either the keyword :all, a list of indexes to select, or a list of booleans where the index position of each true value indicates an index to select. When providing indices, duplicates will select the specified index position more than once.

select-by-index

(select-by-index dataset col-index row-index)

Trim dataset according to this sequence of indexes. Returns a new dataset.

col-index and row-index - one of:

  • :all - all the columns
  • list of indexes. May contain duplicates. Negative values will be counted from the end of the sequence.

select-columns

(select-columns dataset colname-seq-or-fn)

Select columns from the dataset by:

  • seq of column names
  • column selector function
  • :all keyword

For example:

(select-columns DS [:A :B])
(select-columns DS cf/numeric)
(select-columns DS :all)

select-columns-by-index

(select-columns-by-index dataset col-index)

Select columns from the dataset by a seq of indexes (negative values count from the end) or :all.

See documentation for select-by-index.

select-missing

(select-missing dataset-or-col)

Select the rows that have missing entries by simply selecting the missing indexes.

select-rows

(select-rows dataset-or-col row-indexes options)(select-rows dataset-or-col row-indexes)

Select rows from the dataset or column.

set-dataset-name

(set-dataset-name dataset ds-name)

shape

(shape dataset)

Returns shape in column-major format of n-columns n-rows.

shuffle

(shuffle dataset options)(shuffle dataset)

Shuffle the rows of the dataset optionally providing a seed. See https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle.

sort-by

(sort-by dataset key-fn compare-fn & args)(sort-by dataset key-fn)

Sort a dataset by a key-fn and compare-fn.

  • key-fn - function from map to sort value.
  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.

sort-by-column

(sort-by-column dataset colname compare-fn & args)(sort-by-column dataset colname)

Sort a dataset by a given column using the given compare fn.

  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.

tail

(tail dataset n)(tail dataset)

Get the last n rows of a dataset. Equivalent to (select-rows ds (range ...)). The dataset argument may also be passed last, so this can be used in ->> operators.

take-nth

(take-nth dataset n-val)

unique-by

(unique-by dataset options map-fn)(unique-by dataset map-fn)

The map-fn is passed a map for each row; rows are grouped by its return value. The keep-fn is used to decide which index to keep within each group.

:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
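
For example, a sketch keeping the last row seen per symbol instead of the first:

(ds/unique-by stocks {:keep-fn (fn [k idxs] (last idxs))} #(get % "symbol"))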

unique-by-column

(unique-by-column dataset options colname)(unique-by-column dataset colname)

Rows are grouped by the value of the given column. The keep-fn is used to decide which index to keep within each group.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

unordered-select

(unordered-select dataset colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

unroll-column

(unroll-column dataset column-name)(unroll-column dataset column-name options)

Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with same columns but with other columns duplicated where the unroll happened. Column now contains only scalar data.

Any missing indexes are dropped.

user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |

Options:

  • :datatype - datatype of the resulting column if one aside from :object is desired.
  • :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword) and the column will be named this.

update

(update lhs-ds filter-fn-or-ds update-fn & args)

Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into the original dataset.

This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • filter-fn-or-ds is a generalized parameter. May be a function, a dataset or a sequence of column names.
  • update-fn must take the dataset as the first argument and must return a dataset.

(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))

update-column

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.

update-columns

(update-columns dataset column-name-seq-or-fn update-fn)

Update a sequence of columns selected by column name seq or column selector function.

For example:

(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))

update-columnwise

(update-columnwise dataset filter-fn-or-ds cwise-update-fn & args)

Call cwise-update-fn on each column of the dataset. Returns the dataset. See arguments to update.

update-elemwise

(update-elemwise dataset filter-fn-or-ds map-fn)(update-elemwise dataset map-fn)

Replace all elements in selected columns by calling the selected function on each element. filter-fn-or-ds has the same rules as update. Implicitly clears the missing set, so the function must deal with type-specific missing values correctly. Returns a new dataset.

value-reader

(value-reader dataset options)(value-reader dataset)

Return a reader that produces a reader of column values per index.

Options:

  • :copying? - Defaults to false. When true, row values are copied on read.

write!

(write! dataset output-path options)(write! dataset output-path)

Write a dataset out to a file. Supported forms are:

(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)

Options:

  • :max-chars-per-column - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
  • :max-num-columns - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of columns an exception will be thrown during serialization.
  • :quoted-columns - csv specific - sequence of columns names that you would like to always have quoted.
  • :file-type - Manually specify the file type. This is usually inferred from the filename but if you pass in an output stream then you will need to specify the file type.
  • :headers? - whether csv headers are written; defaults to true.
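
For example, a sketch writing to an OutputStream, where :file-type must be given explicitly:

(with-open [os (java.io.FileOutputStream. "stocks.tsv")]
  (ds/write! stocks os {:file-type :tsv}))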