tech.v3.libs.parquet
Support for reading Parquet files. You must require this namespace to enable parquet read/write support.
Supported datatypes:
- all numeric types
- strings
- java.time LocalDate, Instant
- UUIDs (get read/written as strings in accordance to R's write_parquet function)
Parsing parquet file options include more general io/->dataset options:
:key-fn
:column-allowlist
in preference to:column-whitelist
:column-blocklist
in preference to:column-blacklist
:parser-fn
Please include these dependencies in your project and be sure to read the notes later in this document. Note that these exclusions are carefully chosen to avoid the myriad of serious CVE issues associated with hadoop:
org.apache.parquet/parquet-hadoop {:mvn/version "1.12.0"
:exclusions [org.slf4j/slf4j-log4j12]}
org.apache.hadoop/hadoop-common {:mvn/version "3.3.0"
:exclusions [com.sun.jersey/jersey-core
com.sun.jersey/jersey-json
com.sun.jersey/jersey-server
com.sun.jersey/jersey-servlet
dnsjava/dnsjava
org.eclipse.jetty/jetty-server
org.eclipse.jetty/jetty-servlet
org.eclipse.jetty/jetty-util
org.eclipse.jetty/jetty-webapp
javax.activation/javax.activation-api
javax.servlet.jsp/jsp-api
javax.servlet/javax.servlet-api
io.netty/netty-codec
io.netty/netty-handler
io.netty/netty-transport
io.netty/netty-transport-native-epoll
org.codehaus.jettison/jettison
org.apache.zookeeper/zookeeper
org.apache.curator/curator-recipes
org.apache.curator/curator-client
org.apache.htrace/htrace-core4
org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7
org.apache.hadoop/hadoop-auth
org.apache.kerby/kerb-core
commons-cli/commons-cli
commons-net/commons-net
org.apache.commons/commons-lang3
org.apache.commons/commons-text
org.apache.commons/commons-configuration2
com.google.re2j/re2j
com.google.code.findbugs/jsr305
com.jcraft/jsch
log4j/log4j
org.slf4j/slf4j-log4j12]
}
;; We literally need this for 1 POJO formatting object.
org.apache.hadoop/hadoop-mapreduce-client-core {:mvn/version "3.3.0"
:exclusions [org.slf4j/slf4j-log4j12
org.apache.avro/avro
org.apache.hadoop/hadoop-yarn-client
org.apache.hadoop/hadoop-yarn-common
org.apache.hadoop/hadoop-annotations
org.apache.hadoop/hadoop-hdfs-client
io.netty/netty
com.google.inject.extensions/guice-servlet]}
;; M-1 mac support for snappy
org.xerial.snappy/snappy-java {:mvn/version "1.1.8.4"}
Logging
When writing parquet files you may notice a truly excessive amount of logging and/or
extremely slow write speeds. The solution to this, if you are using
the default tech.ml.dataset
implementation with logback-classic as the concrete
logger is to disable debug logging by placing a file named logback.xml
in the classpath where the root node has a log-level above debug. The logback.xml
file that 'tmd' uses by default during development is located in
dev-resources and is
enabled via a profile in project.clj.
Large-ish Datasets
The parquet writer will automatically split your dataset up into multiple parquet
records so it is possible that you can attempt to write one large dataset then when
you read it back you get a parquet file with multiple datasets. This is perhaps
confusing but it is a side effect of the hadoop architecture. The simplest solution
to this is to, when loading parquet files, use parquet->ds-seq and then a final
concat-copying operation to produce one final dataset. ->dataset
will do
this operation for you but it will emit a warning when doing so as this may
lead to OOM situations with some parquet files. To disable this warning use the
option :disable-parquet-warn-on-multiple-datasets
set to truthy.
->row-group-supplier
(->row-group-supplier path)
Recommended way of low-level reading the file. The metadata of the supplier contains a
:row-group
member that contains a vector of row group metadata.
The supplier implements java.util.Supplier java.util.Iterable and clojure.lang.IReduce.
Each time it is called it returns a tuple of ParquetFileReader, PageReadStore, row-group-metadata.
ds->parquet
(ds->parquet ds path options)
(ds->parquet ds path)
Write a dataset to a parquet file. Many parquet options are possible; these can also be passed in via ds/->write!
Options are the same as ds-seq->parquet.
ds-seq->parquet
(ds-seq->parquet path options ds-seq)
(ds-seq->parquet path ds-seq)
Write a sequence of datasets to a parquet file. Parquet will break the data stream up according to parquet file properties. Path may be a string path or a java.io.OutputStream.
Options:
:hadoop-configuration
- Either nil or an instance oforg.apache.hadoop.conf.Configuration
.:compression-codec
- keyword describing compression codec. Options are[:brotli :gzip :lz4 :lzo :snappy :uncompressed :zstd]
. Defaults to:snappy
.:block-size
- Defaults toParquetWriter/DEFAULT_BLOCK_SIZE
.:page-size
- Defaults toParquetWriter/DEFAULT_PAGE_SIZE
.:dictionary-page-size
- Defaults toParquetWriter/DEFAULT_PAGE_SIZE
.:dictionary-enabled?
- Defaults toParquetWriter/DEFAULT_IS_DICTIONARY_ENABLED
.:validating?
- Defaults toParquetWriter/DEFAULT_IS_VALIDATING_ENABLED
. parquet file.:writer-version
- Defaults toParquetWriter/DEFAULT_WRITER_VERSION
.
parquet->ds
(parquet->ds input options)
(parquet->ds input)
Load a parquet file. Input must be a file on disk.
Options are a subset of the options used for loading datasets -
specifically :column-allowlist
and :column-blocklist
can be
useful here. The parquet metadata ends up as metadata on the
datasets. :column-whitelist
and :column-blacklist
are available
but not preferred.
parquet->ds-seq
(parquet->ds-seq path options)
(parquet->ds-seq path)
Given a string, hadoop path, or a parquet InputFile, return a sequence of datasets.
Column will have parquet metadata merged into their normal metadata.
Reader will be closed upon termination of the sequence.
The return value can be efficiently reduced over and iterated without leaking memory.
See ham-fisted's lazy noncaching namespace for help.
parquet->metadata-seq
(parquet->metadata-seq path)
Given a local parquet file, return a sequence of metadata, one for each row-group. A row-group maps directly to a dataset.