public class Arrow
extends java.lang.Object
Bindings to save/load datasets in the Apache Arrow streaming format. These bindings support JDK 17, memory mapping, and per-column compression.
Required Dependencies:
[org.apache.arrow/arrow-vector "6.0.0"]
[org.lz4/lz4-java "1.8.0"]
[com.github.luben/zstd-jni "1.5.1-1"]
Modifier and Type | Method and Description
---|---
static void | datasetSeqToStream(java.lang.Iterable dsSeq, java.lang.Object pathOrInputStream, java.lang.Object options) - Save a sequence of datasets to a single stream file.
static void | datasetToStream(java.lang.Object ds, java.lang.Object pathOrInputStream, java.lang.Object options) - Save a dataset to the Apache Arrow stream format.
static java.util.Map | streamToDataset(java.lang.Object pathOrInputStream, java.lang.Object options) - Load an Apache Arrow streaming file, returning a single dataset.
static java.lang.Iterable | streamToDatasetSeq(java.lang.Object pathOrInputStream, java.lang.Object options) - Load an Apache Arrow streaming file, returning a sequence of datasets, one for each record batch.
public static void datasetToStream(java.lang.Object ds, java.lang.Object pathOrInputStream, java.lang.Object options)
Save a dataset to the Apache Arrow stream format.
Options:

:strings-as-text? - Defaults to false. Save strings into Arrow files without dictionaries. This works well if you want to load an Arrow file in place, or if you know the strings in your dataset are either very large or should not be placed in string tables.

:compression - Either :zstd or :lz4; defaults to no compression (nil). Per-column compression of the data can yield significant size savings (2x+) and thus significant time savings when transferring data over the network. Note that compression makes loading via mmap non-in-place, so if you use compression, mmap probably does not make sense on load and will most likely result in slower load times. :zstd can also be passed in map form with an additional parameter, :level, which defaults to 3.
// Slightly higher compression than the default :level of 3.
// Assumes the kw and hashmap helpers used throughout these docs are
// statically imported; they build Clojure keywords and option maps.
datasetToStream(ds, "data.arrow-ipc",
                hashmap(kw("compression"),
                        hashmap(kw("compression-type"), kw("zstd"),
                                kw("level"), 5)));
public static void datasetSeqToStream(java.lang.Iterable dsSeq, java.lang.Object pathOrInputStream, java.lang.Object options)
Save a sequence of datasets to a single stream file. The datasets must either have matching schemas, or the column datatypes of each downstream dataset must be widenable to the column datatypes of the initial dataset.
For options, see datasetToStream.
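A minimal sketch of writing two record batches into one file, assuming the same kw/hashmap helpers as above; dsJan and dsFeb are hypothetical datasets with matching schemas:

// Each dataset becomes one record batch in the output file;
// LZ4 compression is optional and applies per column.
datasetSeqToStream(java.util.Arrays.asList(dsJan, dsFeb),
                   "monthly.arrow-ipc",
                   hashmap(kw("compression"), kw("lz4")));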
public static java.util.Map streamToDataset(java.lang.Object pathOrInputStream, java.lang.Object options)
Load an Apache Arrow streaming file, returning a single dataset. The file must contain only a single record batch.
Options:

:open-type - Either :mmap or :input-stream, defaulting to the slower but more robust :input-stream pathway. When using :mmap, resources are released when the resource system dictates - see the documentation for tech.v3.DType.stackResourceContext. When using :input-stream, the stream is closed when the lazy sequence is either fully realized or an exception is thrown.

:close-input-stream? - When using the :input-stream :open-type, close the input stream upon exception or when the stream is fully realized. Defaults to true.

:integer-datetime-types? - When true, Arrow columns in the appropriate packed datatypes are represented as their integer types as opposed to their respective packed types. For example, columns of type :epoch-days are returned to the user with datatype :epoch-days as opposed to :packed-local-date. This means reading values returns integers rather than java.time.LocalDate instances.
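A sketch of the :mmap pathway described above. It assumes tech.v3.DType.stackResourceContext() returns an AutoCloseable scope (see its documentation for the actual resource semantics), and the file name is hypothetical:

// The memory backing the dataset is only valid while the stack
// resource context is open; copy anything you need to keep.
try (AutoCloseable ctx = tech.v3.DType.stackResourceContext()) {
    java.util.Map ds = streamToDataset("data.arrow-ipc",
                                       hashmap(kw("open-type"), kw("mmap")));
    // ... read from ds while the context is open ...
} catch (Exception e) {
    throw new RuntimeException(e);
}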
public static java.lang.Iterable streamToDatasetSeq(java.lang.Object pathOrInputStream, java.lang.Object options)
Load an Apache Arrow streaming file, returning a sequence of datasets, one for each record batch. For options, see streamToDataset.
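A sketch of iterating the lazy sequence with the default :input-stream pathway; the file name is hypothetical and each element is assumed to be a dataset map as returned by streamToDataset:

// The underlying input stream is closed once the sequence is fully
// realized or an exception is thrown mid-iteration.
for (Object batch : streamToDatasetSeq("monthly.arrow-ipc",
         hashmap(kw("integer-datetime-types?"), true))) {
    java.util.Map ds = (java.util.Map) batch;
    // process one record batch; datetime columns arrive as integers
}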