tech.v3.libs.parquet

Support for reading and writing Parquet files. You must require this namespace to enable parquet read/write support.

Supported datatypes:

  • all numeric types
  • strings
  • java.time LocalDate, Instant
  • UUIDs (read/written as strings, in accordance with R's write_parquet function)

Options for parsing parquet files include the more general io/->dataset options (a short example follows this list):

  • :key-fn
  • :column-allowlist in preference to :column-whitelist
  • :column-blocklist in preference to :column-blacklist
  • :parser-fn
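
A minimal sketch of these options in use, assuming a local file named data.parquet (a placeholder path):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet]) ;; requiring this namespace enables parquet support

;; Keywordize column names while loading.
(ds/->dataset "data.parquet" {:key-fn keyword})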

Please include these dependencies in your project, and be sure to read the notes later in this document. The exclusions are carefully chosen to avoid the myriad serious CVE issues associated with hadoop:

org.apache.parquet/parquet-hadoop {:mvn/version "1.12.0"
                                   :exclusions [org.slf4j/slf4j-log4j12]}
org.apache.hadoop/hadoop-common {:mvn/version "3.3.0"
                                 :exclusions [com.sun.jersey/jersey-core
                                              com.sun.jersey/jersey-json
                                              com.sun.jersey/jersey-server
                                              com.sun.jersey/jersey-servlet

                                              dnsjava/dnsjava

                                              org.eclipse.jetty/jetty-server
                                              org.eclipse.jetty/jetty-servlet
                                              org.eclipse.jetty/jetty-util
                                              org.eclipse.jetty/jetty-webapp

                                              javax.activation/javax.activation-api
                                              javax.servlet.jsp/jsp-api
                                              javax.servlet/javax.servlet-api

                                              io.netty/netty-codec
                                              io.netty/netty-handler
                                              io.netty/netty-transport
                                              io.netty/netty-transport-native-epoll

                                              org.codehaus.jettison/jettison

                                              org.apache.zookeeper/zookeeper

                                              org.apache.curator/curator-recipes
                                              org.apache.curator/curator-client
                                              org.apache.htrace/htrace-core4

                                              org.apache.hadoop.thirdparty/hadoop-shaded-protobuf_3_7
                                              org.apache.hadoop/hadoop-auth

                                              org.apache.kerby/kerb-core

                                              commons-cli/commons-cli
                                              commons-net/commons-net
                                              org.apache.commons/commons-lang3
                                              org.apache.commons/commons-text
                                              org.apache.commons/commons-configuration2

                                              com.google.re2j/re2j
                                              com.google.code.findbugs/jsr305

                                              com.jcraft/jsch

                                              log4j/log4j
                                              org.slf4j/slf4j-log4j12]}
;; We literally need this for 1 POJO formatting object.
org.apache.hadoop/hadoop-mapreduce-client-core {:mvn/version "3.3.0"
                                                :exclusions  [org.slf4j/slf4j-log4j12
                                                              org.apache.avro/avro
                                                              org.apache.hadoop/hadoop-yarn-client
                                                              org.apache.hadoop/hadoop-yarn-common
                                                              org.apache.hadoop/hadoop-annotations
                                                              org.apache.hadoop/hadoop-hdfs-client
                                                              io.netty/netty
                                                              com.google.inject.extensions/guice-servlet]}
;; M1 Mac support for snappy
org.xerial.snappy/snappy-java {:mvn/version "1.1.8.4"}

Logging

When writing parquet files you may notice a truly excessive amount of logging and/or extremely slow write speeds. The solution, if you are using the default tech.ml.dataset implementation with logback-classic as the concrete logger, is to disable debug logging by placing a file named logback.xml on the classpath whose root node has a log level above debug. The logback.xml file that 'tmd' uses by default during development is located in dev-resources and is enabled via a profile in project.clj.
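
A minimal logback.xml along these lines (an illustrative sketch, not the exact file tmd ships) raises the root level above debug:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Any root level above debug (info, warn, error) silences the parquet debug output. -->
  <root level="info">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>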

Large-ish Datasets

The parquet writer will automatically split your dataset into multiple parquet records, so you may write one large dataset and, when reading it back, get a parquet file containing multiple datasets. This can be confusing, but it is a side effect of the hadoop architecture. The simplest solution when loading such files is to use parquet->ds-seq followed by a final concat-copying operation to produce one dataset. ->dataset will do this operation for you, but it will emit a warning because it may lead to OOM situations with some parquet files. To disable this warning, set the option :disable-parquet-warn-on-multiple-datasets to a truthy value.
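
A sketch of both approaches, assuming a placeholder path data.parquet:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Load each row group as its own dataset, then copy them into one.
(def full-ds
  (apply ds/concat-copying (parquet/parquet->ds-seq "data.parquet")))

;; Or let ->dataset concatenate for us, silencing the multi-dataset warning.
(def full-ds-2
  (ds/->dataset "data.parquet"
                {:disable-parquet-warn-on-multiple-datasets true}))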

->row-group-supplier

(->row-group-supplier path)

Recommended way to read the file at a low level. The supplier's metadata contains a :row-group member holding a vector of row-group metadata. The supplier implements java.util.function.Supplier, java.lang.Iterable, and clojure.lang.IReduce.
Each time it is called it returns a tuple of [ParquetFileReader, PageReadStore, row-group-metadata].
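
For example, a sketch (data.parquet is a placeholder path) that counts the rows in each row group by reducing over the supplier:

(require '[tech.v3.libs.parquet :as parquet])

;; Collect the row count of each row group without materializing datasets;
;; the tuple destructuring follows the shape described above.
(reduce (fn [acc [_reader page-read-store _rg-meta]]
          (conj acc (.getRowCount
                     ^org.apache.parquet.column.page.PageReadStore page-read-store)))
        []
        (parquet/->row-group-supplier "data.parquet"))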

ds->parquet

(ds->parquet ds path options)
(ds->parquet ds path)

Write a dataset to a parquet file. Many parquet options are possible; these can also be passed in via ds/->write!

Options are the same as ds-seq->parquet.
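
A minimal usage sketch (out.parquet is a placeholder path):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Write a small dataset with zstd compression.
(def example-ds (ds/->dataset {:x [1 2 3] :y ["a" "b" "c"]}))
(parquet/ds->parquet example-ds "out.parquet" {:compression-codec :zstd})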

ds-seq->parquet

(ds-seq->parquet path options ds-seq)
(ds-seq->parquet path ds-seq)

Write a sequence of datasets to a parquet file. Parquet will break the data stream up according to the parquet file properties. Path may be a string path or a java.io.OutputStream. A usage sketch follows the options list below.

Options:

  • :hadoop-configuration - Either nil or an instance of org.apache.hadoop.conf.Configuration.
  • :compression-codec - keyword describing compression codec. Options are [:brotli :gzip :lz4 :lzo :snappy :uncompressed :zstd]. Defaults to :snappy.
  • :block-size - Defaults to ParquetWriter/DEFAULT_BLOCK_SIZE.
  • :page-size - Defaults to ParquetWriter/DEFAULT_PAGE_SIZE.
  • :dictionary-page-size - Defaults to ParquetWriter/DEFAULT_PAGE_SIZE.
  • :dictionary-enabled? - Defaults to ParquetWriter/DEFAULT_IS_DICTIONARY_ENABLED.
  • :validating? - Defaults to ParquetWriter/DEFAULT_IS_VALIDATING_ENABLED.
  • :writer-version - Defaults to ParquetWriter/DEFAULT_WRITER_VERSION.
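
The usage sketch mentioned above (seq-out.parquet is a placeholder path):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Stream two datasets into one parquet file as separate row groups,
;; compressed with gzip.
(parquet/ds-seq->parquet "seq-out.parquet"
                         {:compression-codec :gzip}
                         [(ds/->dataset {:x (range 100)})
                          (ds/->dataset {:x (range 100 200)})])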

parquet->ds

(parquet->ds input options)
(parquet->ds input)

Load a parquet file. Input must be a file on disk.

Options are a subset of the options used for loading datasets; specifically, :column-allowlist and :column-blocklist can be useful here. The parquet metadata ends up as metadata on the datasets. :column-whitelist and :column-blacklist are available but not preferred.
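
A brief sketch (data.parquet, id, and value are placeholder names):

(require '[tech.v3.libs.parquet :as parquet])

;; Load only the columns we need.
(def ds (parquet/parquet->ds "data.parquet"
                             {:column-allowlist ["id" "value"]}))

;; The parquet file metadata is merged into the dataset's metadata.
(meta ds)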

parquet->ds-seq

(parquet->ds-seq path options)
(parquet->ds-seq path)

Given a string, hadoop path, or a parquet InputFile, return a sequence of datasets. Columns will have the parquet metadata merged into their normal metadata. The reader will be closed upon termination of the sequence. The return value can be efficiently reduced over and iterated without leaking memory.
See ham-fisted's lazy-noncaching namespace for help.
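
For example, a sketch (data.parquet and the value column are placeholders) that folds over the row groups one dataset at a time:

(require '[tech.v3.datatype.functional :as dfn]
         '[tech.v3.libs.parquet :as parquet])

;; Sum a column across all row groups without holding more than one
;; dataset in memory at a time.
(reduce (fn [acc d] (+ acc (dfn/sum (d "value"))))
        0.0
        (parquet/parquet->ds-seq "data.parquet"))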

parquet->metadata-seq

(parquet->metadata-seq path)

Given a local parquet file, return a sequence of metadata, one for each row-group. A row-group maps directly to a dataset.
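
A brief sketch (data.parquet is a placeholder path):

(require '[tech.v3.libs.parquet :as parquet])

;; Inspect per-row-group metadata, e.g. to decide which row groups to load.
(doseq [rg-meta (parquet/parquet->metadata-seq "data.parquet")]
  (println rg-meta))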