tech.v3.libs.arrow

Support for reading/writing apache arrow datasets. Datasets may be memory mapped but default to being read via an input stream.

Supported datatypes:

All numeric types - :uint8, :int8, :uint16, :int16, :uint32, :int32, :uint64, :int64, :float32, :float64, :boolean.
String types - :string, :text. During write you have the option to always write data as text which can be more efficient in the memory-mapped read case as it doesnt' require the creation of string tables at load time.
Datetime Types - :local-date, :local-time, :instant. During read you have the option to keep these types in their source numeric format e.g. 32 bit :epoch-days for :local-date datatypes. This format can make some types of processing, such as set creation, more efficient.

When writing a dataset an arrow file with a single record set is created. When writing a sequence of datasets downstream schemas must be compatible with the schema of the initial dataset so for instance a conversion of int32 to double is fine but double to int32 is not.

mmap support on systems running JDK-17 requires the foreign or memory module to be loaded. Appropriate JVM arguments can be found here.

Example (with zstd compression):

  ;; Writing
  (arrow/dataset->stream! ds fname {:compression :zstd})
  ;; Reading
  (arrow/stream->dataset path)

Required Dependencies

In order to support both memory mapping and JDK-17, we only rely on the Arrow SDK's flatbuffer and schema definitions:

  ;; netty isn't required and will inevitably conflict with some more recent version
  [org.apache.arrow/arrow-vector "6.0.0":exclusions [netty/netty io.netty/netty-common]]
  [com.cnuernber/jarrow "1.000"]
  [org.apache.commons/commons-compress "1.21"]

  ;;Compression codecs
  [org.lz4/lz4-java "1.8.0"]
  ;;Required for decompressing lz4 streams with dependent blocks.
  [net.java.dev.jna/jna "5.10.0"]
  [com.github.luben/zstd-jni "1.5.4-1"]

The lz4 decompression system will fallback to lz4-java if liblz4 isn't installed or if jna isn't loaded. The lz4-java java library will fail for arrow files that have dependent block compression which are sometimes saved by python or R arrow implementations. On current ubuntu, in order to install the lz4 library you need to do:

  sudo apt install liblz4-1

Performance

Arrow has hands down highest performance of any of the formats although nippy comes very close when using any compression. The highest performance pathway is to save out data with :strings-as-text? true and zero compression then read them in using mmap - optionally with :text-as-strings? if you never want to see tech.v3.datatype.Text objects in your dataset. This avoids the creation of string dictionaries during deserialization as these have to be done greedily. It can dramatically increase many dataset sizes but when mmap is used the overall size is irrelevant aside from iteration which can be heavily parallelized.

Example:

  ;; Writing
  (arrow/dataset->stream! ds fname {:strings-as-text? true})
  ;; Reading
  (arrow/stream->dataset path {:text-as-strings? true :open-type :mmap})

col->buffers

(col->buffers col col-idx options)

view source

construct-column

(construct-column sparse? node field buffers col-data-fn)

view source

dataset->stream!

(dataset->stream! ds path options)(dataset->stream! ds path)

Write a dataset as an arrow file. File will contain one record set. See documentation for dataset-seq->stream!.

:strings-as-text? defaults to false.

view source

dataset-seq->stream!

(dataset-seq->stream! path options ds-seq)(dataset-seq->stream! path ds-seq)

Write a sequence of datasets as an arrow stream file. File will contain one record set per dataset. Datasets in the sequence must have matching schemas or downstream schema must be able to be safely widened to the first schema.

Options:

:strings-as-text? - defaults to true - Save out strings into arrow files without dictionaries. This works well if you want to load an arrow file in-place or if you know the strings in your dataset are either really large or should not be in string tables. Saving multiple datasets with {:strings-as-text false} requires arrow 7.0.0+ support from your python or R code due to Arrow issue 13467. - the conservative pathway for now is to set :strings-as-text? to true and only save text!!.
:format - one of [:file :ipc], defaults to :file.
- :file - arrow file format, compatible with pyarrow's open_file. The suggested suffix is .arrow.
- :ipc - arrow streaming format, compatible with pyarrow's open_ipc pathway. The suggested suffix is .arrows.
:compression - Either :zstd or :lz4, defaults to no compression (nil). Per-column compression of the data can result in some significant size savings (2x+) and thus some significant time savings when loading over the network. Using compression makes loading via mmap non-lazy - If you are going to use compression mmap probably doesn't make sense and most likely will result in slower loading times.
- :lz4 - Decent and very fast compression.
- :zstd - Good compression, somewhat slower than :lz4. Can also have a level parameter that ranges from 1-12 in which case compression is specified in map form: {:compression-type :zstd :level 5}.

view source

decimal-column-metadata

(decimal-column-metadata col)

view source

stream->dataset

(stream->dataset fname options)(stream->dataset fname)

Reads data non-lazily in arrow streaming format expecting to find a single dataset.

Options:

:open-type - Either :mmap or :input-stream defaulting to the slower but more robust :input-stream pathway. When using :mmap resources will be released when the resource system dictates - see documentation for tech.v3.resource. When using :input-stream the stream will be closed when the lazy sequence is either fully realized or an exception is thrown. Memory mapping is not supported on m-1 macs unless you are using JDK-17.
close-input-stream? - When using :input-stream :open-type, close the input stream upon exception or when stream is fully realized. Defaults to true.
:integer-datetime-types? - when true arrow columns in the appropriate packed datatypes will be represented as their integer types as opposed to their respective packed types. For example columns of type :epoch-days will be returned to the user as datatype :epoch-days as opposed to :packed-local-date. This means reading values will return integers as opposed to java.time.LocalDates.

view source

stream->dataset-iterable

(stream->dataset-iterable fname & [options])

Loads data up to and including the first data record. Returns the a lazy sequence of datasets. Datasets can be loaded using mmapped data and when that is true realizing the entire sequence is usually safe, even for datasets that are larger than available RAM. The default resourc management pathway for this is :auto but you can override this by explicity setting the option :resource-type. See documentation for tech.v3.datatype.mmap/mmap-file.

Options:

:open-type - Either :mmap or :input-stream defaulting to the slower but more robust :input-stream pathway. When using :mmap resources will be released when the resource system dictates - see documentation for tech.v3.resource. When using :input-stream the stream will be closed when the lazy sequence is either fully realized or an exception is thrown.
close-input-stream? - When using :input-stream :open-type, close the input stream upon exception or when stream is fully realized. Defaults to true.
:integer-datetime-types? - when true arrow columns in the appropriate packed datatypes will be represented as their integer types as opposed to their respective packed types. For example columns of type :epoch-days will be returned to the user as datatype :epoch-days as opposed to :packed-local-date. This means reading values will return integers as opposed to java.time.LocalDates.
:text-as-strings? - Return strings instead of Text objects. This breaks automatic round-tripping as it changes datatypes but can be useful when used with :strings-as-text? when writing data out. When used like this uncompressed mmap pathways typically have the highest performance - roughly 100x any other method.