public class Arrow
extends java.lang.Object
Bindings to save/load datasets in the Apache Arrow streaming format. These bindings support JDK 17, memory mapping, and per-column compression.
Required Dependencies:
[org.apache.arrow/arrow-vector "6.0.0"]
[org.lz4/lz4-java "1.8.0"]
[com.github.luben/zstd-jni "1.5.1-1"]
Modifier and Type | Method and Description
---|---
static void | datasetSeqToStream(java.lang.Iterable dsSeq, java.lang.Object pathOrInputStream, java.lang.Object options) - Save a sequence of datasets to a single stream file.
static void | datasetToStream(java.lang.Object ds, java.lang.Object pathOrInputStream, java.lang.Object options) - Save a dataset to the Apache Arrow stream format.
static java.util.Map | streamToDataset(java.lang.Object pathOrInputStream, java.lang.Object options) - Load an Apache Arrow streaming file, returning a single dataset.
static java.lang.Iterable | streamToDatasetSeq(java.lang.Object pathOrInputStream, java.lang.Object options) - Load an Apache Arrow streaming file, returning a sequence of datasets, one for each record batch.
public static void datasetToStream(java.lang.Object ds, java.lang.Object pathOrInputStream, java.lang.Object options)
Save a dataset to the Apache Arrow stream format.
Options:

:strings-as-text? - Defaults to false. Save strings into Arrow files without dictionaries. This works well if you want to load an Arrow file in place, or if you know the strings in your dataset are either very large or should not be placed in string tables.

:compression - Either :zstd or :lz4; defaults to no compression (nil). Per-column compression of the data can yield significant size savings (2x+) and thus significant time savings when transferring data over the network. Note that compression makes loading via mmap non-in-place, so if you use compression, mmap probably does not make sense on load and will most likely result in slower load times. :zstd can also be passed in map form with an additional parameter, :level, which defaults to 3.
// Slightly higher compression than the default :level of 3.
// Assumes the kw and hashmap helpers used throughout these docs are
// statically imported; they build Clojure keywords and option maps.
datasetToStream(ds, "data.arrow-ipc",
                hashmap(kw("compression"),
                        hashmap(kw("compression-type"), kw("zstd"),
                                kw("level"), 5)));
public static void datasetSeqToStream(java.lang.Iterable dsSeq, java.lang.Object pathOrInputStream, java.lang.Object options)
Save a sequence of datasets to a single stream file. The datasets must either have matching schemas, or the column datatypes of each downstream dataset must be widenable to the column datatypes of the initial dataset.
For options, see datasetToStream.
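A minimal sketch of writing two record batches into one file, assuming the same kw/hashmap helpers as above; dsJan and dsFeb are hypothetical datasets with matching schemas:

// Each dataset becomes one record batch in the output file;
// LZ4 compression is optional and applies per column.
datasetSeqToStream(java.util.Arrays.asList(dsJan, dsFeb),
                   "monthly.arrow-ipc",
                   hashmap(kw("compression"), kw("lz4")));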
public static java.util.Map streamToDataset(java.lang.Object pathOrInputStream, java.lang.Object options)
Load an Apache Arrow streaming file, returning a single dataset. The file must contain only a single record batch.
Options:

:open-type - Either :mmap or :input-stream, defaulting to the slower but more robust :input-stream pathway. When using :mmap, resources are released when the resource system dictates - see the documentation for tech.v3.DType.stackResourceContext. When using :input-stream, the stream is closed when the lazy sequence is either fully realized or an exception is thrown.

:close-input-stream? - When using the :input-stream :open-type, close the input stream upon exception or when the stream is fully realized. Defaults to true.

:integer-datetime-types? - When true, Arrow columns in the appropriate packed datatypes are represented as their integer types as opposed to their respective packed types. For example, columns of type :epoch-days are returned to the user with datatype :epoch-days as opposed to :packed-local-date. This means reading values returns integers rather than java.time.LocalDate instances.
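A sketch of the :mmap pathway described above. It assumes tech.v3.DType.stackResourceContext() returns an AutoCloseable scope (see its documentation for the actual resource semantics), and the file name is hypothetical:

// The memory backing the dataset is only valid while the stack
// resource context is open; copy anything you need to keep.
try (AutoCloseable ctx = tech.v3.DType.stackResourceContext()) {
    java.util.Map ds = streamToDataset("data.arrow-ipc",
                                       hashmap(kw("open-type"), kw("mmap")));
    // ... read from ds while the context is open ...
} catch (Exception e) {
    throw new RuntimeException(e);
}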
public static java.lang.Iterable streamToDatasetSeq(java.lang.Object pathOrInputStream, java.lang.Object options)
Load an Apache Arrow streaming file, returning a sequence of datasets, one for each record batch. For options, see streamToDataset.
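A sketch of iterating the lazy sequence with the default :input-stream pathway; the file name is hypothetical and each element is assumed to be a dataset map as returned by streamToDataset:

// The underlying input stream is closed once the sequence is fully
// realized or an exception is thrown mid-iteration.
for (Object batch : streamToDatasetSeq("monthly.arrow-ipc",
         hashmap(kw("integer-datetime-types?"), true))) {
    java.util.Map ds = (java.util.Map) batch;
    // process one record batch; datetime columns arrive as integers
}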