tech.v3.dataset.io.csv

CSV parsing based on charred.api/read-csv.

csv->dataset

(csv->dataset input & [options])

Read a csv into a dataset. Same options as tech.v3.dataset/->dataset.
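
A minimal usage sketch, assuming the tech.ml.dataset library is on the classpath. The sample data and the names csv-text and example-ds are hypothetical. The input here is an in-memory InputStream, and :key-fn is one of the options shared with tech.v3.dataset/->dataset:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.io.csv :as csv-io])

;; Hypothetical sample data: a header row followed by two data rows.
(def csv-text "a,b\n1,x\n2,y\n")

;; Wrap the text in an InputStream and parse it into a dataset.
;; :key-fn converts the header strings into keyword column names.
(def example-ds
  (csv-io/csv->dataset
   (java.io.ByteArrayInputStream. (.getBytes ^String csv-text "UTF-8"))
   {:key-fn keyword}))

(ds/row-count example-ds)
(ds/column-names example-ds)
```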

csv->dataset-seq

(csv->dataset-seq input & [options])

Read a csv into a lazy sequence of datasets. All options of tech.v3.dataset/->dataset are supported except :n-initial-skip-rows. An additional option, :batch-size, defaults to 128000.

Options are passed through to charred.bulk/batch-csv-rows, renamed where necessary. This method defaults to using a load thread - see above method for more options. To disable the load thread use :csv-load-thread-name nil.

When using multithreaded loading, options are also passed through to ham-fisted.api/pmap-opts, so you can change the :n-lookahead the pmap operation uses when submitting jobs to the thread pool. By default this is set to 4 to reduce the chance of OOM situations.

The input will only be closed once the entire sequence is realized.

Options:

  • :load-tfn - dataset->x transformation function performed in the same thread context just after the dataset is loaded. When using multithreaded loading, doing some operations in this transform function can be considerably more efficient than only loading the dataset.
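
A hedged sketch of batched loading, assuming the options behave as described above. The sample data and the names csv-text and batch-row-counts are hypothetical, and the tiny :batch-size is only there to force several batches. The whole sequence is realized with mapv so the input gets closed:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.io.csv :as csv-io])

;; Hypothetical sample data: a header row plus ten data rows.
(def csv-text
  (apply str "id,value\n"
         (map #(str % "," (* 10 %) "\n") (range 10))))

(def batch-row-counts
  (->> (csv-io/csv->dataset-seq
        (java.io.ByteArrayInputStream. (.getBytes ^String csv-text "UTF-8"))
        {:batch-size 4                  ;; small batches, for illustration only
         :csv-load-thread-name nil      ;; disable the load thread
         ;; Runs right after each batch is loaded; here it keeps one column.
         :load-tfn #(ds/select-columns % ["id"])})
       ;; mapv is eager, so the entire sequence is realized here.
       (mapv ds/row-count)))

;; Total number of rows across all batches.
(reduce + batch-row-counts)
```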

rows->csv!

(rows->csv! output headers rows)
(rows->csv! output headers rows {:keys [separator], :or {separator \tab}, :as options})

Given something convertible to an output stream, an optional set of headers as string arrays, and a sequence of string arrays, write a CSV or a TSV file.

Options:

  • :separator - Defaults to \tab.
  • :quote - Defaults to \".
  • :quote? - A predicate function which determines if a string should be quoted. Defaults to quoting only when necessary. May also be the value 'true', in which case every field is quoted.
  • :newline - :lf (default) or :cr+lf.
  • :close-writer? - defaults to true. When true, close writer when finished.
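
A small sketch of writing rows to a file, under a few assumptions: a java.io.File is accepted as the output, the headers are passed as a single String array, and the options map is given as the fourth argument. The file and the data are hypothetical. Because the separator defaults to a tab, a comma is set explicitly to get CSV output:

```clojure
(require '[tech.v3.dataset.io.csv :as csv-io])

;; Hypothetical temporary output file.
(def out-file (java.io.File/createTempFile "rows" ".csv"))

(csv-io/rows->csv! out-file
                   (into-array String ["name" "score"])
                   [(into-array String ["ada" "10"])
                    (into-array String ["grace" "12"])]
                   {:separator \,})

;; :close-writer? defaults to true, so the file can be read back immediately.
(slurp out-file)
```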

rows->dataset-fn

(rows->dataset-fn {:keys [header-row?], :or {header-row? true}, :as options})

Create an efficiently callable function to parse row-batches into datasets. Returns a function from row-iter->dataset. Options passed in here are the same as for ->dataset.