tech.v3.libs.parquet

Support for reading and writing Parquet files. You must require this namespace to enable parquet read/write support.

Supported datatypes:

  • all numeric types
  • strings
  • java.time LocalDate, Instant
  • UUIDs (read/written as strings, in accordance with R's write_parquet function)

Options for parsing parquet files include these more general io/->dataset options (a usage sketch follows the list):

  • :key-fn
  • :column-whitelist
  • :column-blacklist
  • :parser-fn
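For example, a minimal load through ->dataset using these options; the path and column names here are hypothetical:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet]) ;; requiring this enables parquet support

;; Keywordize column names and read only the listed columns.
(def ds (ds/->dataset "example.parquet"
                      {:key-fn keyword
                       :column-whitelist ["x" "y"]}))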

Please include these dependencies in your project and be sure to read the notes later in this document:

org.apache.parquet/parquet-hadoop {:mvn/version "1.12.0"
                                   :exclusions [org.slf4j/slf4j-log4j12]}
org.apache.hadoop/hadoop-common {:mvn/version "3.3.0"
                                 :exclusions [org.slf4j/slf4j-log4j12]}
;; We literally need this for 1 POJO formatting object.
org.apache.hadoop/hadoop-mapreduce-client-core {:mvn/version "3.3.0"
                                                :exclusions [org.slf4j/slf4j-log4j12]}

Logging

When writing parquet files you may notice a truly excessive amount of logging and/or extremely slow write speeds. If you are using the default tech.ml.dataset setup with logback-classic as the concrete logger, the solution is to disable debug logging by placing a file named logback.xml on the classpath in which the root node has a log level above debug. The logback.xml file that 'tmd' uses by default during development is located in dev-resources and is enabled via a profile in project.clj.
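A minimal logback.xml along these lines raises the root level above debug; this is a generic logback configuration sketch, not the exact file shipped in dev-resources:

<configuration>
  <!-- Log to the console and keep the root level above debug. -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="info">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>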

Large-ish Datasets

The parquet writer will automatically split your dataset into multiple parquet records (row groups), so you may write what you think of as one large dataset and, when reading it back, get a parquet file containing multiple datasets. This is perhaps confusing, but it is a side effect of the hadoop architecture. The simplest solution is, when loading parquet files, to use parquet->ds-seq followed by a final concat-copying operation to produce one dataset, as sketched below. ->dataset will do this operation for you, but it will emit a warning when doing so as this may lead to OOM situations with some parquet files. To disable this warning, set the option :disable-parquet-warn-on-multiple-datasets to a truthy value.
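A sketch of that load-and-concatenate approach, assuming tech.v3.dataset is aliased as ds and the file path is hypothetical:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Realize each row-group as a dataset, then copy them into one.
(def full-ds
  (apply ds/concat-copying
         (parquet/parquet->ds-seq "large.parquet")))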

ds->parquet

(ds->parquet ds path options)
(ds->parquet ds path)

Write a dataset to a parquet file. Many parquet options are possible; these can also be passed in via ds/->write!

Options are the same as ds-seq->parquet.
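A minimal write sketch, assuming this namespace is aliased as parquet and the path is hypothetical:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(def ds (ds/->dataset {:x [1 2 3] :y ["a" "b" "c"]}))

;; Write with gzip compression rather than the default snappy.
(parquet/ds->parquet ds "example.parquet" {:compression-codec :gzip})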

ds-seq->parquet

(ds-seq->parquet path options ds-seq)
(ds-seq->parquet path ds-seq)

Write a sequence of datasets to a parquet file. Parquet will break the data stream up according to parquet file properties. A usage sketch follows the options list below.

Options:

  • :hadoop-configuration - Either nil or an instance of org.apache.hadoop.conf.Configuration.
  • :compression-codec - keyword describing compression codec. Options are [:brotli :gzip :lz4 :lzo :snappy :uncompressed :zstd]. Defaults to :snappy.
  • :block-size - Defaults to ParquetWriter/DEFAULT_BLOCK_SIZE.
  • :page-size - Defaults to ParquetWriter/DEFAULT_PAGE_SIZE.
  • :dictionary-page-size - Defaults to ParquetWriter/DEFAULT_PAGE_SIZE.
  • :dictionary-enabled? - Defaults to ParquetWriter/DEFAULT_IS_DICTIONARY_ENABLED.
  • :validating? - Defaults to ParquetWriter/DEFAULT_IS_VALIDATING_ENABLED.
  • :writer-version - Defaults to ParquetWriter/DEFAULT_WRITER_VERSION.
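The sketch mentioned above, writing two small hypothetical datasets into one file:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(def chunks [(ds/->dataset {:x (range 100)})
             (ds/->dataset {:x (range 100 200)})])

;; Note the argument order: path, options, then the dataset sequence.
(parquet/ds-seq->parquet "chunks.parquet"
                         {:compression-codec :gzip}
                         chunks)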

parquet->ds

(parquet->ds input options)
(parquet->ds input)

Load a parquet file. Input must be a file on disk.

Options are a subset of the options used for loading datasets - specifically :column-whitelist and :column-blacklist can be useful here. The parquet metadata ends up as metadata on the returned dataset.
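For example, a sketch restricting the load to two hypothetical columns:

(require '[tech.v3.libs.parquet :as parquet])

(def ds (parquet/parquet->ds "example.parquet"
                             {:column-whitelist ["x" "y"]}))

;; The parquet file metadata is merged into the dataset's metadata.
(meta ds)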

parquet->ds-seq

(parquet->ds-seq path options)
(parquet->ds-seq path)

Given a string, Hadoop path, or a parquet InputFile, return a sequence of datasets. Columns will have parquet metadata merged into their normal metadata. The reader will be closed upon termination of the sequence.
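For example, a sketch that reduces over the row-group datasets without holding them all in memory; the path is hypothetical:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Sum row counts across row-groups; consuming the full sequence
;; also closes the underlying reader.
(transduce (map ds/row-count) + 0
           (parquet/parquet->ds-seq "large.parquet"))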

parquet->metadata-seq

(parquet->metadata-seq path)

Given a local parquet file, return a sequence of metadata, one for each row-group. A row-group maps directly to a dataset.
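A quick sketch with a hypothetical path:

(require '[tech.v3.libs.parquet :as parquet])

;; One metadata map per row-group, without reading any column data.
(parquet/parquet->metadata-seq "large.parquet")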