tech.v3.dataset

Column-major dataset abstraction for efficiently manipulating in-memory datasets.

->>dataset

(->>dataset options dataset)(->>dataset dataset)

Please see documentation of ->dataset. Options are the same.

->dataset

(->dataset dataset options)(->dataset dataset)

Create a dataset from either csv/tsv or a sequence of maps.

  • A String will be interpreted as a file (or a gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv and will likewise attempt to detect the column datatypes; all of this can be overridden.

  • InputStreams have no file type and thus a file-type must be provided in the options.

  • A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

The parquet, xlsx, and xls formats require that you load the appropriate support namespaces: tech.v3.libs.parquet for parquet, tech.v3.libs.fastexcel for xlsx, and tech.v3.libs.poi for xls.

Arrow support is provided via the tech.v3.libs.arrow namespace rather than a file-type overload, as the Arrow project currently has 3 different file types and it is not clear what their final suffixes will be or which of the three file types a given suffix will indicate. Please see the documentation in the tech.v3.libs.arrow namespace for further information on Arrow file types.

Options:

  • :dataset-name - set the name of the dataset.

  • :file-type - Override the filetype discovery mechanism for strings or force a particular parser for an InputStream. Note that parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :parquet}.

  • :gzipped? - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.

  • :column-allowlist - either a sequence of string column names or a sequence of column indices of columns to allow. This is preferred over :column-whitelist.

  • :column-blocklist - either a sequence of string column names or a sequence of column indices of columns to block. This is preferred over :column-blacklist.

  • :num-rows - Number of rows to read

  • :header-row? - Defaults to true, indicates the first row is a header.

  • :key-fn - function to be applied to column names. Typical use is: :key-fn keyword.

  • :separator - Add a character separator to the list of separators to auto-detect.

  • :csv-parser - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This allows you to parse anything that univocity supports (flat files and such).

  • :bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows and in this case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.

  • :skip-bad-rows? - Legacy option. Use :bad-row-policy.

  • :disable-comment-skipping? - By default, the # character is recognised as a line comment when found at the beginning of a line in a CSV file, and that row is ignored. Set to true to disable this behavior.

  • :max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.

  • :max-num-columns - Defaults to 8192. CSV,TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301

  • :text-temp-dir - The temporary directory to use for file-backed text. Setting this value to boolean false turns off file-backed text, which is the default. If a tech.v3.resource stack context is opened, the file will be deleted when the context closes; else it will be deleted when the gc cleans up the dataset. A shutdown hook is added as a last resort to ensure the file is cleaned up.

  • :n-initial-skip-rows - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.

  • :parser-type - Default parser to use if no parser-fn is specified for that column. For csv files, the default parser type is :string which indicates a promotional string parser. For sequences of maps, the default parser type is :object. It can be useful in some contexts to use the :string parser with sequences of maps or maps of columns.

  • :parser-fn -

    • keyword? - all columns parsed to this datatype. For example: {:parser-fn :string}
    • map? - {column-name parse-method} parse each column with specified parse-method. The parse-method can be:
      • keyword? - parse the specified column to this datatype. For example: {:parser-fn {:answer :boolean :id :int32}}
      • tuple - pair of [datatype parse-data] in which case container of type [datatype] will be created. parse-data can be one of:
        • :relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.
        • fn? - a function from str -> one of :tech.v3.dataset/missing, :tech.v3.dataset/parse-failure, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to the missing set and the column's :unparsed-values and :unparsed-indexes being updated.
        • string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For :text you can specify the backing file to use.
        • DateTimeFormatter - use with the appropriate temporal parse static function to parse the value.
      • map? - the header-name-or-idx is used to look up the value. If not nil, then the value can be any of the above options. Else the default column parser is used.

Returns a new dataset
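
For example, a short sketch using options documented above (the csv path matches the stocks file used in examples throughout this document):

(require '[tech.v3.dataset :as ds])

;; keywordize column names and limit the read to 100 rows
(ds/->dataset "test/data/stocks.csv"
              {:key-fn keyword
               :dataset-name "stocks"
               :num-rows 100})

;; force every column to parse as :string
(ds/->dataset "test/data/stocks.csv" {:parser-fn :string})

;; datatypes for a sequence of maps are derived by scanning the first N maps
(ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}])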

add-column

(add-column dataset column)

Add a new column. Errors if there is a name collision.

add-or-update-column

(add-or-update-column dataset colname column)(add-or-update-column dataset column)

If column exists, replace. Else append new column.

all-descriptive-stats-names

(all-descriptive-stats-names)

Returns the names of all descriptive stats in the order they will be returned in the resulting dataset of descriptive stats. This allows easy filtering, for example:

(descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names)
                                        (remove #{:values :num-distinct-values}))})

append-columns

(append-columns dataset column-seq)

assoc-ds

(assoc-ds dataset cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).
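
For example, a minimal sketch:

(ds/assoc-ds nil :a [1 2 3] :b ["x" "y" "z"])
;; => a new 3-row dataset with columns :a and :b even though the input was nil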

assoc-metadata

(assoc-metadata dataset filter-fn-or-ds k v & args)

Set metadata across a set of columns.

bind->

macro

(bind-> expr name & args)

Threads like -> but binds name to expr like as->:

(ds/bind-> (ds/->dataset "test/data/stocks.csv") ds
           (assoc :logprice2 (dfn/log1p (ds "price")))
           (assoc :logp3 (dfn/* 2 (ds :logprice2)))
           (ds/select-columns ["price" :logprice2 :logp3])
           (ds-tens/dataset->tensor)
           (first))

brief

(brief ds options)(brief ds)

Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.

categorical->number

(categorical->number dataset filter-fn-or-ds)(categorical->number dataset filter-fn-or-ds table-args)(categorical->number dataset filter-fn-or-ds table-args result-datatype)

Convert columns into a discrete, numeric representation. See tech.v3.dataset.categorical/fit-categorical-map.
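
For example, a minimal sketch where a sequence of column names acts as the filter (the resulting mapping is stored in column metadata per tech.v3.dataset.categorical):

(def fruit-ds (ds/->dataset {:fruit ["apple" "pear" "apple"]}))
(ds/categorical->number fruit-ds [:fruit])
;; :fruit is now numeric; each distinct string maps to a distinct number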

categorical->one-hot

(categorical->one-hot dataset filter-fn-or-ds)(categorical->one-hot dataset filter-fn-or-ds table-args)(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot

column

(column dataset colname)

column->dataset

(column->dataset dataset colname transform-fn options)(column->dataset dataset colname transform-fn)

Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.

column-cast

(column-cast dataset colname datatype)

Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.

colname may be a scalar or a tuple of src-col dst-col.

datatype may be a datatype enumeration or a tuple of datatype cast-fn where cast-fn may return either a new value, :tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.

If the existing datatype is string, then tech.v3.datatype.column/parse-column is called.

Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.v3.dataset.column/parse-column.
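
For example, a sketch using both the tuple form of colname and the tuple form of datatype:

(def stocks (ds/->dataset "test/data/stocks.csv"))

;; cast with an explicit cast-fn
(ds/column-cast stocks "price" [:int32 #(Math/round (double %))])

;; src-col/dst-col tuple: keep "price" and add a string-typed "price-str"
(ds/column-cast stocks ["price" "price-str"] :string)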

column-count

(column-count dataset)

column-labeled-mapseq

(column-labeled-mapseq dataset value-colname-seq)

Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also columnwise-concat

Return a sequence of maps with

  {... - columns not in value-colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }

column-map

(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)(column-map dataset result-colname map-fn filter-fn-or-ds)(column-map dataset result-colname map-fn)

Produce a new (or updated) column as the result of mapping a fn over columns. This function is never lazy - all results are immediately calculated.

  • dataset - dataset.
  • result-colname - Name of new (or existing) column.
  • map-fn - function to map over columns. Same rules as tech.v3.datatype/emap.
  • res-dtype-or-opts - If not given, the result is scanned to infer the missing set and datatype. If using an options map, the options are described below.
  • filter-fn-or-ds - A dataset, a sequence of columns, or a tech.v3.dataset.column-filters column filter function. Defaults to all the columns of the existing dataset.

Returns a new dataset with a new or updated column.

Options:

  • :datatype - Set the datatype of the result column. If not given, the result is scanned to infer the result datatype and missing set.
  • :missing-fn - if given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Else the missing set is determined by scanning the results during the inference process. See tech.v3.dataset.column/union-missing-sets and tech.v3.dataset.column/intersect-missing-sets for example functions to pass in here.

Examples:


  ;;From the tests --

  (let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
    ;;result scanned for both datatype and missing set
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
    ;;result scanned for missing set only.  Result used in-place.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %))
                               {:datatype :float64} [:b]))))
    ;;Nothing scanned at all.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(inc %)
                               {:datatype :float64
                                :missing-fn ds-col/union-missing-sets} [:b]))))
    ;;Missing set scanning causes NPE at inc.
    (is (thrown? Throwable
                 (ds/column-map testds :b2 #(inc %)
                                {:datatype :float64}
                                [:b]))))

  ;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |



user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}

column-map-m

macro

(column-map-m ds result-colname src-colnames body)

Map a function across one or more columns via a macro. The function receives arguments in the order of src-colnames. Column names of the form right.id will be bound to variables named right-id.

Example:

user> (-> (ds/->dataset [{:a.a 1} {:b 2.0} {:a.a 2 :b 3.0}])
          (ds/column-map-m :a [:a.a :b]
                           (when (and a-a b)
                             (+ (double a-a) (double b)))))
_unnamed [3 3]:

|  :b | :a.a |  :a |
|----:|-----:|----:|
|     |    1 |     |
| 2.0 |      |     |
| 3.0 |    2 | 5.0 |

column-names

(column-names dataset)

In-order sequence of column names

columns

(columns dataset)

Return sequence of all columns in dataset.

columns-with-missing-seq

(columns-with-missing-seq dataset)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count
  }

or nil if no columns are missing data.

columnwise-concat

(columnwise-concat dataset colnames options)(columnwise-concat dataset colnames)

Given a dataset and a list of columns, produce a new dataset in which those columns are concatenated into a single value column, with an additional :column column indicating which column each original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |

Options:

  • :value-column-name - defaults to :value
  • :colname-column-name - defaults to :column

concat

(concat dataset & args)(concat)

Concatenate datasets using a copying concatenation. See also concat-inplace, which may be more efficient for your use case if you have a small number (less than 3, say) of datasets.

concat-copying

(concat-copying dataset & args)(concat-copying)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

concat-inplace

(concat-inplace dataset & args)(concat-inplace)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
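
For example, a minimal sketch of either concatenation pathway:

(def ds-a (ds/->dataset {:x [1 2]}))
(def ds-b (ds/->dataset {:x [3 4]}))
(ds/concat-copying ds-a ds-b) ;; 4-row dataset; data is copied
(ds/concat-inplace ds-a ds-b) ;; 4-row dataset; columns reference the originals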

data->dataset

(data->dataset input)

Convert a data-ized dataset created via dataset->data back into a full dataset

dataset->data

(dataset->data ds)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.
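
For example, a round-trip sketch:

(-> (ds/->dataset {:a [1 2] :b ["x" "y"]})
    (ds/dataset->data)
    (ds/data->dataset))
;; => a dataset equivalent to the original; the intermediate value is a
;;    pure-Clojure map of {:metadata ... :columns [...]}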

dataset-name

(dataset-name dataset)

dataset-parser

(dataset-parser options)(dataset-parser)

Implements protocols/PDatasetParser, Counted, Indexed, IReduceInit, and IDeref (returns the new dataset). See documentation for mapseq-parser.

dataset?

(dataset? ds)

descriptive-stats

(descriptive-stats dataset)(descriptive-stats dataset options)

Get descriptive statistics across the columns of the dataset.

Options:

  • :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names)).
  • :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.
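
For example, a sketch restricting the output to a few stat names (assuming these appear in (all-descriptive-stats-names)):

(ds/descriptive-stats stocks {:stat-names [:col-name :datatype :min :mean :max]})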

drop-columns

(drop-columns dataset colname-seq-or-fn)

Same as remove-columns. Remove columns indexed by column name seq or column filter function. For example:

(drop-columns DS [:A :B])
(drop-columns DS cf/categorical)

drop-missing

(drop-missing dataset-or-col)(drop-missing ds colname)

Remove missing entries by simply selecting out the missing indexes.
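
For example:

(ds/drop-missing (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
;; => only the first row remains

;; the 2-arity form only considers missing values in one column
(ds/drop-missing (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) :b)
;; => rows 1 and 3 remain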

drop-rows

(drop-rows dataset-or-col row-indexes)

Drop rows from dataset or column

empty-dataset

(empty-dataset)

ensure-array-backed

(ensure-array-backed ds options)(ensure-array-backed ds)

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and are returned as-is.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

Options:

  • :unpack? - unpack packed datetime types. Defaults to true

filter

(filter dataset predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.
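
For example, a sketch assuming the stocks dataset used throughout this document:

(ds/filter stocks (fn [row] (> (row "price") 100.0)))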

filter-column

(filter-column dataset colname predicate)(filter-column dataset colname)

Filter a given column by a predicate. The predicate is passed column values. If the predicate is not an instance of IFn it is treated as a value and used as if the predicate were #(= value %).

The 2-arity form of this function reads the column as a boolean reader, so numeric 0 values are false in that case, as are Double/NaN and Float/NaN. Objects are false only when nil.

Returns a dataset.
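
For example:

(ds/filter-column stocks "price" #(> % 100.0))
;; a non-IFn predicate is treated as an equality filter:
(ds/filter-column stocks "symbol" "MSFT")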

filter-dataset

(filter-dataset dataset filter-fn-or-ds)

Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • If filter-fn-or-ds is a dataset, it is returned.
  • If filter-fn-or-ds is sequential, then select-columns is called.
  • If filter-fn-or-ds is :all, all columns are returned
  • If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.

group-by

(group-by dataset key-fn options)(group-by dataset key-fn)

Produce a map of key-fn-value->dataset. The argument to key-fn is a map of colname->column-value representing a row in dataset. Each dataset in the resulting map contains all and only rows that produce the same key-fn-value.

Options - options are passed into dtype arggroup:

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.
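
For example, a sketch grouping by a computed key and using :group-by-finalizer to reduce each group immediately:

(ds/group-by stocks (fn [row] (row "symbol")))
;; => map of symbol -> sub-dataset

(ds/group-by stocks (fn [row] (row "symbol"))
             {:group-by-finalizer ds/row-count})
;; => map of symbol -> row count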

group-by->indexes

(group-by->indexes dataset key-fn options)(group-by->indexes dataset key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

group-by-column

(group-by-column dataset colname options)(group-by-column dataset colname)

Return a map of column-value->dataset. Each dataset in the resulting map contains all and only rows with the same value in column.

  • :group-by-finalizer - when provided this is run on each dataset immediately after the rows are selected. This can be used to immediately perform a reduction on each new dataset which is faster than doing it in a separate run.

group-by-column->indexes

(group-by-column->indexes dataset colname options)(group-by-column->indexes dataset colname)

(Non-lazy) - Group a dataset by a column return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

Options are passed into dtype's arggroup method.

group-by-column-consumer

(group-by-column-consumer ds cname)

has-column?

(has-column? dataset column-name)

head

(head dataset n)(head dataset)

Get the first n rows of a dataset. Equivalent to (select-rows ds (range n)). The arguments may also be passed in reverse order (dataset last), so this can be used in ->> operators.

induction

(induction ds induct-fn & args)

Given a dataset and a function from dataset->row produce a new dataset. The produced row will be merged with the current row and then added to the dataset.

Options are same as the options used for ->dataset in order for the user to control the parsing of the return values of induct-fn. A new dataset is returned.

Example:

user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |

major-version

mapseq-parser

(mapseq-parser options)(mapseq-parser)

Return a clojure function that when called with one arg that arg must be the next map to add to the dataset. When called with no args returns the current dataset. This can be used to efficiently transform a stream of maps into a dataset while getting intermediate datasets during the parse operation.

Options are the same for ->dataset.

user> (require '[tech.v3.dataset :as ds])
nil
user> (def pfn (ds/mapseq-parser))
#'user/pfn
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 1 :b 2})
nil
user> (pfn {:a 2 :c 3})
nil
user> (pfn)
_unnamed [3 3]:

| :a | :b | :c |
|---:|---:|---:|
|  1 |  2 |    |
|  1 |  2 |    |
|  2 |    |  3 |
user> (pfn {:a 3 :d 4})
nil
user> (pfn {:a 5 :c 6})
nil
user> (pfn)
_unnamed [5 4]:

| :a | :b | :c | :d |
|---:|---:|---:|---:|
|  1 |  2 |    |    |
|  1 |  2 |    |    |
|  2 |    |  3 |    |
|  3 |    |    |  4 |
|  5 |    |  6 |    |

mapseq-reader

(mapseq-reader dataset options)(mapseq-reader dataset)

Return a reader that produces a map of column-name->column-value upon read.

mapseq-rf

(mapseq-rf)(mapseq-rf options)

Create a transduce-compatible rf that reduces a sequence of maps into a dataset. Same options as ->dataset.

user> (transduce (map identity) (ds/mapseq-rf {:dataset-name :transduced}) [{:a 1 :b 2}])
:transduced [1 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |

min-n-by-column

(min-n-by-column dataset cname N comparator options)(min-n-by-column dataset cname N comparator)(min-n-by-column dataset cname N)

Find the minimum N entries (unsorted) by column. Resulting data will be indexed in original order. If you want a sorted order then sort the result.

See options to sort-by-column.

Example:

user> (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:

| symbol |       date | price |
|--------|------------|------:|
|   AMZN | 2001-09-01 |  5.97 |
|   AMZN | 2001-10-01 |  6.98 |
|   AAPL | 2000-12-01 |  7.44 |
|   AAPL | 2002-08-01 |  7.38 |
|   AAPL | 2002-09-01 |  7.25 |
|   AAPL | 2002-12-01 |  7.16 |
|   AAPL | 2003-01-01 |  7.18 |
|   AAPL | 2003-02-01 |  7.51 |
|   AAPL | 2003-03-01 |  7.07 |
|   AAPL | 2003-04-01 |  7.11 |
user> (ds/min-n-by-column ds "price" 10 > nil)
test/data/stocks.csv [10 3]:

| symbol |       date |  price |
|--------|------------|-------:|
|   GOOG | 2007-09-01 | 567.27 |
|   GOOG | 2007-10-01 | 707.00 |
|   GOOG | 2007-11-01 | 693.00 |
|   GOOG | 2007-12-01 | 691.48 |
|   GOOG | 2008-01-01 | 564.30 |
|   GOOG | 2008-04-01 | 574.29 |
|   GOOG | 2008-05-01 | 585.80 |
|   GOOG | 2009-11-01 | 583.00 |
|   GOOG | 2009-12-01 | 619.98 |
|   GOOG | 2010-03-01 | 560.19 |

missing

(missing dataset-or-col)

Given a dataset or a column, return the missing set as a roaring bitmap

new-column

(new-column name data)(new-column name data metadata)(new-column name data metadata missing)(new-column data-or-data-map)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.
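
For example:

(ds/new-column :a [1 nil 3])
;; the data is scanned, so index 1 lands in the missing set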

new-dataset

(new-dataset options ds-metadata column-seq)(new-dataset options column-seq)(new-dataset column-seq)

Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

Options:

  • :dataset-name - Name of the dataset. Defaults to "_unnamed".
  • :key-fn - Key function used on all column names before insertion into the dataset.

The return value fulfills the dataset protocols.
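
For example, a minimal sketch:

(ds/new-dataset {:dataset-name "demo"} [[1 2 3] [4 5 6]])
;; => a 3-row, 2-column dataset with ordinally-named columns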

order-column-names

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

pmap-ds

(pmap-ds ds ds-map-fn options)(pmap-ds ds ds-map-fn)

Parallelize mapping a function from dataset->dataset across a single dataset. Results are coalesced back into a single dataset. The original dataset is simply sliced into n-core sub-datasets and ds-map-fn is called n-core times. ds-map-fn must be a function from dataset->dataset, although it may return nil.

Options:

  • :max-batch-size - this is a default for tech.v3.parallel.for/indexed-map-reduce. You can control how many rows are processed in a given batch - the default is 64000. If your mapping pathway produces a large expansion in the size of the dataset then it may be good to reduce the max batch size and use :as-seq to produce a sequence of datasets.
  • :result-type
    • :as-seq - Return a sequence of datasets, one for each batch.
    • :as-ds - Return a single dataset with all results in memory (the default).
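
For example, a sketch assuming tech.v3.datatype.functional is aliased as dfn:

(require '[tech.v3.datatype.functional :as dfn])

(ds/pmap-ds stocks
            (fn [batch]
              (assoc batch "price2" (dfn/* 2.0 (batch "price"))))
            {:max-batch-size 10000})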

print-all

(print-all dataset)

Helper function equivalent to (tech.v3.dataset.print/print-range ... :all)

rand-nth

(rand-nth dataset)

Return a random row from the dataset in map format

remove-column

(remove-column dataset col-name)

Same as:

(dissoc dataset col-name)

remove-columns

(remove-columns dataset colname-seq-or-fn)

Remove columns indexed by column name seq or column filter function. For example:

  (remove-columns DS [:A :B])
  (remove-columns DS cf/categorical)

remove-rows

(remove-rows dataset-or-col row-indexes)

Same as drop-rows.

rename-columns

(rename-columns dataset colnames)

Rename columns using a map or vector of column names.

Does not reorder columns; rename is in-place for maps and positional for vectors.

replace-missing

(replace-missing ds)(replace-missing ds strategy)(replace-missing ds columns-selector strategy)(replace-missing ds columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be:

  • seq of any legal column names
  • or a column filter function, such as numeric and categorical

Strategies may be:

  • :down - take value from previous non-missing row if possible else use provided value.

  • :up - take value from next non-missing row if possible else use provided value.

  • :downup - take value from previous if possible else use next.

  • :updown - take value from next if possible else use previous.

  • :nearest - Use nearest of next or previous values. :mid is an alias for :nearest.

  • :midpoint - Use midpoint of averaged values between previous and next nonmissing rows.

  • :abb - Impute missing with approximate Bayesian bootstrap. See R's ABB.

  • :lerp - Linearly interpolate values between previous and next nonmissing rows.

  • :value - Value will be provided - see below.

    A value may be provided, which will then be used. The value may be a function, in which case it is called on the column with missing values elided and the return value is used as the filler.
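
For example (dfn here aliases tech.v3.datatype.functional):

(def dsm (ds/->dataset {:a [nil 2 nil 4]}))
(ds/replace-missing dsm [:a] :value 0)        ;; :a becomes [0 2 0 4]
(ds/replace-missing dsm [:a] :downup)         ;; :a becomes [2 2 2 4]
(ds/replace-missing dsm [:a] :value dfn/mean) ;; missing entries replaced by the
                                              ;; mean of the non-missing values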

replace-missing-value

(replace-missing-value dataset filter-fn-or-ds scalar-value)(replace-missing-value dataset scalar-value)

reverse-rows

(reverse-rows dataset-or-col)

Reverse the rows in the dataset or column.

row-at

(row-at ds idx)

Get the row at an individual index. If indexes are negative then the dataset is indexed from the end.

user> (ds/row-at stocks 1)
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
 "symbol" "MSFT",
 "price" 36.35}
user> (ds/row-at stocks -1)
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
 "symbol" "AAPL",
 "price" 223.02}

row-count

(row-count dataset-or-col)

row-map

(row-map ds map-fn options)(row-map ds map-fn)

Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the ->dataset function so you can control the resulting column types by the usual dataset parsing options described there.

Options:

See options for pmap-ds. In particular, note that you can produce a sequence of datasets as opposed to a single large dataset.

Speed demons should attempt both {:copying? false} and {:copying? true} in the options map as that changes rather drastically how data is read from the datasets. If you are going to read all the data in the dataset, {:copying? true} will most likely be the faster of the two.

Examples:

user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|------:|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/row-map stocks (fn [row]
                                    {"symbol" (keyword (row "symbol"))
                                     :price2 (* (row "price")(row "price"))})))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   :price2 |
|--------|------------|------:|----------:|
|  :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|  :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|  :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|  :MSFT | 2000-04-01 | 28.37 |  804.8569 |
|  :MSFT | 2000-05-01 | 25.45 |  647.7025 |

row-mapcat

(row-mapcat ds mapcat-fn options)(row-mapcat ds mapcat-fn)

Map a function across the rows of the dataset. The function must produce a sequence of maps; the original dataset rows will be duplicated and then merged into the result of calling (->> (apply concat) (->>dataset options)) on the result of mapcat-fn. Options are the same as ->dataset.

The smaller the maps returned from mapcat-fn, the better; perhaps consider using records. In the case that a mapcat-fn result map has a key that overlaps a column name, the column will be replaced with the output of mapcat-fn. The returned map will have the key :_row-id assoc'd onto it, so for absolutely minimal gc usage include this as a member variable in your map.

Options:

  • See options for pmap-ds. Especially note :max-batch-size and :result-type. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the tech.v3.dataset.reductions namespace.

Example:

user> (def ds (ds/->dataset {:rid (range 10)
                             :data (repeatedly 10 #(rand-int 3))}))
#'user/ds
user> (ds/head ds)
_unnamed [5 2]:

| :rid | :data |
|-----:|------:|
|    0 |     0 |
|    1 |     2 |
|    2 |     0 |
|    3 |     1 |
|    4 |     2 |
user> (def mapcat-fn (fn [row]
                       (for [idx (range (row :data))]
                         {:idx idx})))
#'user/mapcat-fn
user> (mapcat mapcat-fn (ds/rows ds))
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
user> (ds/row-mapcat ds mapcat-fn)
_unnamed [9 3]:

| :rid | :data | :idx |
|-----:|------:|-----:|
|    1 |     2 |    0 |
|    1 |     2 |    1 |
|    3 |     1 |    0 |
|    4 |     2 |    0 |
|    4 |     2 |    1 |
|    6 |     2 |    0 |
|    6 |     2 |    1 |
|    8 |     2 |    0 |
|    8 |     2 |    1 |
user>

rows

(rows ds options)(rows ds)

Get the rows of the dataset as a list of potentially flyweight maps.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map; it is inappropriate if you are only going to read a very small portion of the row map.
  • :nil-missing? - When true, the returned maps have nil values for missing entries as opposed to eliding the missing keys entirely. This is legacy behavior, and it is slightly faster to use :nil-missing? true.

user> (take 5 (ds/rows stocks))
({"date" #object[java.time.LocalDate 0x6c433971 "2000-01-01"],
  "symbol" "MSFT",
  "price" 39.81}
 {"date" #object[java.time.LocalDate 0x28f96b14 "2000-02-01"],
  "symbol" "MSFT",
  "price" 36.35}
 {"date" #object[java.time.LocalDate 0x7bdbf0a "2000-03-01"],
  "symbol" "MSFT",
  "price" 43.22}
 {"date" #object[java.time.LocalDate 0x16d3871e "2000-04-01"],
  "symbol" "MSFT",
  "price" 28.37}
 {"date" #object[java.time.LocalDate 0x47094da0 "2000-05-01"],
  "symbol" "MSFT",
  "price" 25.45})


user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]))
[{:a 1, :b 2} {:a 2} {:b 3}]

user> (ds/rows (ds/->dataset [{:a 1 :b 2} {:a 2} {:b 3}]) {:nil-missing? true})
[{:a 1, :b 2} {:a 2, :b nil} {:a nil, :b 3}]

rowvec-at

(rowvec-at ds idx)

Return a persistent-vector-like row at a given index. Negative indexes index from the end.

user> (ds/rowvec-at stocks 1)
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
user> (ds/rowvec-at stocks -1)
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]

rowvecs

(rowvecs ds options)(rowvecs ds)

Return a randomly addressable list of rows in persistent vector-like form.

Options:

  • :copying? - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map; it is inappropriate if you are only going to read a given key for a given row once.

user> (take 5 (ds/rowvecs stocks))
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
 ["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
 ["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
 ["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
 ["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])

sample

(sample dataset n options)(sample dataset n)(sample dataset)

Sample n rows from a dataset. Defaults to sampling without replacement.

For the definition of seed, see the argshuffle documentation: https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle

The returned dataset's metadata is altered by merging in {:print-index-range (range n)}, so you will always see the entire returned dataset when printing. If this isn't desired, vary-meta is a good pathway.

Options:

  • :replacement? - Do sampling with replacement. Defaults to false.
  • :seed - Provide a seed as a number or provide a Random implementation.
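
For example:

(ds/sample stocks 5 {:seed 42})           ;; reproducible sample without replacement
(ds/sample stocks 5 {:replacement? true})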

select

(select dataset colname-seq selection)

Reorder/trim dataset according to this sequence of indexes. Returns a new dataset. colname-seq - one of:

  • :all - all the columns
  • sequence of column names - those columns in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order and selected columns are subsequently named after the corresponding value in the map. Similar to rename-columns except this trims the result to only the columns in the map.

selection - either the keyword :all, a list of indexes to select, or a list of booleans where the index position of each true value indicates an index to select. When providing indices, duplicates will select the specified index position more than once.

select-by-index

(select-by-index dataset col-index row-index)

Trim dataset according to this sequence of indexes. Returns a new dataset.

col-index and row-index - one of:

  • :all - all the columns
  • list of indexes. May contain duplicates. Negative values will be counted from the end of the sequence.

select-columns

(select-columns dataset colname-seq-or-fn)

Select columns from the dataset by:

  • seq of column names
  • column selector function
  • :all keyword

For example:

(select-columns DS [:A :B])
(select-columns DS cf/numeric)
(select-columns DS :all)

select-columns-by-index

(select-columns-by-index dataset col-index)

Select columns from the dataset by a seq of indexes (negative values count from the end) or :all.

See documentation for select-by-index.

select-missing

(select-missing dataset-or-col)

Select the rows that have missing entries by simply selecting the missing indexes.

select-rows

(select-rows dataset-or-col row-indexes options)(select-rows dataset-or-col row-indexes)

Select rows from the dataset or column.

set-dataset-name

(set-dataset-name dataset ds-name)

shape

(shape dataset)

Returns shape in column-major format of n-columns n-rows.

shuffle

(shuffle dataset options)(shuffle dataset)

Shuffle the rows of the dataset optionally providing a seed. See https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle.

sort-by

(sort-by dataset key-fn compare-fn & args)(sort-by dataset key-fn)

Sort a dataset by a key-fn and compare-fn.

  • key-fn - function from map to sort value.
  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.

sort-by-column

(sort-by-column dataset colname compare-fn & args)(sort-by-column dataset colname)

Sort a dataset by a given column using the given compare fn.

  • compare-fn may be one of:
    • a clojure operator like clojure.core/<
    • :tech.numerics/<, :tech.numerics/> for unboxing comparisons of primitive values.
    • clojure.core/compare
    • A custom java.util.Comparator instantiation.

Options:

  • :nan-strategy - General missing strategy. Options are :first, :last, and :exception.
  • :parallel? - Uses parallel quicksort when true and regular quicksort when false.

tail

(tail dataset n)(tail dataset)

Get the last n rows of a dataset. Equivalent to (select-rows ds (range ...)). The dataset argument may also be passed last, so this can be used in ->> operators.

take-nth

(take-nth dataset n-val)

unique-by

(unique-by dataset options map-fn)(unique-by dataset map-fn)

The map-fn is passed a map for each row; rows are grouped by its return value. The keep-fn is used to decide which index to keep within each group.

:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
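
For example, a sketch keeping the last row seen per symbol instead of the first:

(ds/unique-by stocks {:keep-fn (fn [k idxs] (last idxs))} #(get % "symbol"))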

unique-by-column

(unique-by-column dataset options colname)(unique-by-column dataset colname)

Rows are grouped by the value of the given column. The keep-fn is used to decide which index to keep within each group.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

unordered-select

(unordered-select dataset colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

unroll-column

(unroll-column dataset column-name)(unroll-column dataset column-name options)

Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with same columns but with other columns duplicated where the unroll happened. Column now contains only scalar data.

Any missing indexes are dropped.

user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |

Options:

  • :datatype - datatype of the resulting column if one aside from :object is desired.
  • :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword) and the column will be named this.

update

(update lhs-ds filter-fn-or-ds update-fn & args)

Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into the original dataset.

This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • filter-fn-or-ds is a generalized parameter. May be a function, a dataset or a sequence of column names.
  • update-fn must take the dataset as the first argument and must return a dataset.

(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))

update-column

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.

update-columns

(update-columns dataset column-name-seq-or-fn update-fn)

Update a sequence of columns selected by column name seq or column selector function.

For example:

(update-columns DS [:A :B] #(dfn/+ % 2))
(update-columns DS cf/numeric #(dfn// % 2))

update-columnwise

(update-columnwise dataset filter-fn-or-ds cwise-update-fn & args)

Call cwise-update-fn on each column of the dataset. Returns the dataset. See arguments to update.

update-elemwise

(update-elemwise dataset filter-fn-or-ds map-fn)(update-elemwise dataset map-fn)

Replace all elements in selected columns by calling the selected function on each element. filter-fn-or-ds has the same rules as update. Implicitly clears the missing set, so the function must deal with type-specific missing values correctly. Returns a new dataset.

value-reader

(value-reader dataset options)(value-reader dataset)

Return a reader that produces a reader of column values per index.

Options:

  • :copying? - Defaults to false. When true, row values are copied on read.

write!

(write! dataset output-path options)(write! dataset output-path)

Write a dataset out to a file. Supported forms are:

(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)

Options:

  • :max-chars-per-column - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
  • :max-num-columns - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of columns an exception will be thrown during serialization.
  • :quoted-columns - csv specific - sequence of columns names that you would like to always have quoted.
  • :file-type - Manually specify the file type. This is usually inferred from the filename but if you pass in an output stream then you will need to specify the file type.
  • :headers? - whether csv headers are written; defaults to true.
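
For example, a sketch writing to an OutputStream, where :file-type must be given explicitly:

(with-open [os (java.io.FileOutputStream. "stocks.tsv")]
  (ds/write! stocks os {:file-type :tsv}))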