tech.ml.dataset Supported Datatypes

tech.ml.dataset supports a wide range of datatypes and has a system for expanding the supported datatype set, aliasing new names to existing datatypes, and packing object datatypes into primitive containers. Let's walk through each of these topics and finally see how they relate to actually getting data into and out of a dataset.

Typesystem Fundamentals

Base Concepts

There are two fundamental namespaces that describe the entire type system for dtype-next derived projects. The first is the casting namespace, tech.v3.datatype.casting: it registers the various datatypes and holds maps describing the current set of datatypes. dtype-next has a simple typesystem in order to support primitive unsigned types, which are otherwise completely unsupported on the JVM.

If we just load the casting namespace we see the base dtype-next datatypes:

user> (require '[tech.v3.datatype.casting :as casting])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :float32 :long :bool :int32 :int :object :float64 :string :uint64 :uint16 :boolean :short :double :char :keyword :uint8 :uuid :uint32 :int16 :float :int64}

Now if we load the dtype-next namespace we see quite a few more datatypes registered:

user> (require '[tech.v3.datatype :as dtype])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :float32 :char-array :int :object-array :float64 :list :uint64 :uint16 :char :int64-array :uint8 :int32-array :boolean-array :persistent-map :persistent-vector :persistent-set :float :long :bool :int32 :object :int16-array :string :boolean :short :float64-array :double :float32-array :keyword :uuid :int8-array :native-buffer :uint32 :array-buffer :int16 :int64}

Right away you can perhaps tell that there is a dynamic mechanism for registering more datatypes; we will get to that later. This set ties into the dtype-next datatype api:

user> (dtype/datatype (java.util.UUID/randomUUID))
:uuid
user> (dtype/datatype (int 10))
:int32
user> (dtype/datatype (float 10))
:float32
user> (dtype/datatype (double 10))
:float64
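
The datatype function works on non-numeric scalars as well; :string and :keyword appear in the registered set above. A quick sketch (the results shown are what I would expect, not captured from the session above):

(dtype/datatype "a string")
;; => :string   (expected)
(dtype/datatype :a-keyword)
;; => :keyword  (expected)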

If we have a container of data, an important question is what type of data is in the container. This is where the elemwise-datatype api comes in:

user> (dtype/elemwise-datatype (float-array 10))
:float32
user> (dtype/elemwise-datatype (int-array 10))
:int32

Given two (or more) numeric datatypes we can ask the typesystem which datatype a combined operation, such as +, should operate in:

user> (casting/widest-datatype :float32 :int32)
:float64
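
The same question can be asked of any pair of registered datatypes; two integer types should widen to the larger of the two, while mixing a numeric type with a non-numeric one should fall back to the :object root described below. A small sketch (the result is what I would expect, not captured output):

(casting/widest-datatype :int16 :int64)
;; => :int64  (expected)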

The root of our type system is the object datatype. All types can be represented by the object datatype, albeit at some cost. Generic containers such as persistent vectors or Java ArrayLists, and generic sequences produced by operations such as map, do not have any information about the type of data they contain and thus have a datatype of :object:

user> (dtype/elemwise-datatype (range 10))
:int64
user> (dtype/elemwise-datatype (vec (range 10)))
:object
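
As the paragraph above notes, the same holds for generic Java collections; for example (expected result, not captured from the session):

(dtype/elemwise-datatype (java.util.ArrayList. (range 10)))
;; => :object  (expected)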

If we include the dataset api then we see the typesystem is extended to include support for various datetime types:

user> (require '[tech.v3.dataset :as ds])
nil
user> @casting/valid-datatype-set
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}

Given a container of data with a specific datatype we can create a new read-only representation with the datatype we desire using make-reader:

user> (def generic-data (vec (range 10)))
#'user/generic-data
user> generic-data
[0 1 2 3 4 5 6 7 8 9]
user> (dtype/make-reader :float32 (count generic-data) (float (generic-data idx)))
[0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0]
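
make-reader produces a lazy, read-only view: the body is evaluated each time an element is read, with idx bound to the element index, and the reader advertises the datatype we asked for. A quick check (expected result, not captured output):

(dtype/elemwise-datatype
 (dtype/make-reader :float32 (count generic-data)
                    (float (generic-data idx))))
;; => :float32  (expected)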

The default definitions of the datetime datatypes are in datetime/base.clj.

Packing

The second fundamental concept in the typesystem is packing: storing a Java object in a primitive datatype. This allows us to use :int64 data to represent java.time.Instant objects and :int32 data to represent java.time.LocalDate objects. This compression has both speed and size benefits, especially when it comes to serializing the data. It also allows us to support the Parquet and Apache Arrow file formats more transparently because they represent, for example, LocalDate objects as epoch days. Currently only datetime objects are packed.

Packing has generic support in the underlying buffer system so that it works in an integrated fashion throughout the system.

user> (dtype/make-container :packed-local-date (repeat 10 (java.time.LocalDate/now)))
#array-buffer<packed-local-date>[10]
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
user> (def packed *1)
#'user/packed
user> (def unpacked (dtype/make-container :local-date (repeat 10 (java.time.LocalDate/now))))
#'user/unpacked
user> unpacked
#array-buffer<local-date>[10]
[2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15, 2021-12-15]
user> (.readLong (dtype/->reader packed) 0)
18976
user> (.readObject (dtype/->reader packed) 0)
#object[java.time.LocalDate 0x2d867250 "2021-12-15"]
user> (.toEpochDay *1)
18976
user> (.readLong (dtype/->reader unpacked) 0)
Execution error at tech.v3.datatype.NumericConversions/numberCast (NumericConversions.java:22).
Invalid argument
user> (.readObject (dtype/->reader unpacked) 0)
#object[java.time.LocalDate 0x2f13a18c "2021-12-15"]
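
If I remember its API correctly, the packing namespace discussed below also exposes pack and unpack functions for converting whole containers between the two representations. A hedged sketch continuing the session above (function names and results are from memory, so verify at your own REPL):

(require '[tech.v3.datatype.packing :as packing])

(dtype/elemwise-datatype (packing/unpack packed))
;; => :local-date         (expected)
(dtype/elemwise-datatype (packing/pack unpacked))
;; => :packed-local-date  (expected)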

Packing is defined in the namespace tech.v3.datatype.packing. We can add new packed datatypes, but I strongly suggest avoiding this in general. While it certainly works well, it is usually unnecessary and less clear than simply defining an alias and conversion methods to/from the alias.

The best example of using the packing system is the definition of the datetime packed datatypes.

Aliasing Datatypes

C and C++ have the concept of datatype aliasing in the typedef keyword. For our use cases it is useful, especially when dealing with datetime types, to alias some datatypes to integers of various sizes so you can have a container of :milliseconds and the like. You can see several examples in the aforementioned datetime/base.clj.

user> (casting/alias-datatype! :foobar :float32)
#{:byte :int8 :local-date-time :float32 :char-array :int :object-array :epoch-milliseconds :uint64 :char :packed-instant :uint8 :bitmap :int32-array :boolean-array :persistent-map :persistent-vector :days :tensor :persistent-set :seconds :long :microseconds :int32 :boolean :short :double :epoch-days :float32-array :instant :zoned-date-time :keyword :dataset :text :native-buffer :array-buffer :years :int64 :epoch-microseconds :milliseconds :float64 :list :uint16 :int64-array :nanoseconds :duration :packed-duration :float :bool :foobar :object :int16-array :string :hours :float64-array :epoch-seconds :packed-local-date :epoch-hours :uuid :weeks :local-date :int8-array :uint32 :int16}
user> (dtype/make-container :foobar (range 10))
#array-buffer<foobar>[10]
[0.000, 1.000, 2.000, 3.000, 4.000, 5.000, 6.000, 7.000, 8.000, 9.000]
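
The alias is reflected by the datatype apis as well: the container reports the aliased name even though, as the printed values suggest, the backing store is :float32 data. Expected result, not captured output:

(dtype/elemwise-datatype (dtype/make-container :foobar (range 10)))
;; => :foobar  (expected)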

In general, because this is all done at runtime, I ask that people refrain from aliasing new datatypes, defining new datatypes, and packing new datatypes. It is not an error if someone does, but every new datatype definition, packing definition, and alias definition slightly slows down the system.

Supported Meaningful Datatypes

For dataset processing, the currently supported meaningful datatypes are:

  • [:int8 :uint8 :int16 :uint16 :int32 :uint32 :int64 :uint64 :float32 :float64 :string :keyword :uuid :local-date :packed-local-date :instant :packed-instant :duration :packed-duration :local-date-time]

There are more datatypes but for general purpose dataset processing these are a reasonable subset.

When parsing data into the dataset system we can specify both the datatype of the resulting column and, optionally, a custom function used to parse each value:

user> (def data-maps (for [idx (range 10)]
                           {:a idx
                            :b (str (.plusDays (java.time.LocalDate/now) idx))}))
#'user/data-maps
user> data-maps
({:a 0, :b "2021-12-15"}
 {:a 1, :b "2021-12-16"}
 {:a 2, :b "2021-12-17"}
 {:a 3, :b "2021-12-18"}
 {:a 4, :b "2021-12-19"}
 {:a 5, :b "2021-12-20"}
 {:a 6, :b "2021-12-21"}
 {:a 7, :b "2021-12-22"}
 {:a 8, :b "2021-12-23"}
 {:a 9, :b "2021-12-24"})
user> (:b (ds/->dataset data-maps))
#tech.v3.dataset.column<string>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b :local-date}}))
#tech.v3.dataset.column<local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b :packed-local-date}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
user> (:b (ds/->dataset data-maps {:parser-fn {:b [:packed-local-date
                                                   (fn [data]
                                                     (java.time.LocalDate/parse (str data)))]}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-15, 2021-12-16, 2021-12-17, 2021-12-18, 2021-12-19, 2021-12-20, 2021-12-21, 2021-12-22, 2021-12-23, 2021-12-24]
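
A column's datatype can always be checked with the same elemwise-datatype api used earlier; for example (expected result, not captured output):

(dtype/elemwise-datatype
 (:b (ds/->dataset data-maps {:parser-fn {:b :packed-local-date}})))
;; => :packed-local-date  (expected)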

Extending the Datatype System

Let's say we have tons of data in which only year-months are relevant. We can have escalating levels of support depending on how much it really matters. The first level is to convert the data into a type the system already understands, in this case a LocalDate:

user> (:b (ds/->dataset
           data-maps
           {:parser-fn {:b [:packed-local-date
                            (fn [data]
                              (let [ym (java.time.YearMonth/from
                                        (java.time.LocalDate/parse (str data)))]
                                (java.time.LocalDate/of (.getYear ym) (.getMonth ym) 1)))]}}))
#tech.v3.dataset.column<packed-local-date>[10]
:b
[2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01, 2021-12-01]

The second is to parse to year-months and accept that our column type will just be :object:

user> (:b (ds/->dataset
           data-maps
           {:parser-fn {:b [:object
                            (fn [data]
                              (java.time.YearMonth/from
                               (java.time.LocalDate/parse (str data))))]}}))
#tech.v3.dataset.column<object>[10]
:b
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]

Third, we can extend the type system to support YearMonth as an object datatype. This is only slightly better than base object support, but it does allow us to ensure we can make containers holding only YearMonth or nil objects:

user> (casting/add-object-datatype! :year-month java.time.YearMonth)
:ok
user> (:b (ds/->dataset
           data-maps
           {:parser-fn {:b [:year-month
                            (fn [data]
                              (java.time.YearMonth/from
                               (java.time.LocalDate/parse (str data))))]}}))
#tech.v3.dataset.column<year-month>[10]
:b
[2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12, 2021-12]
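
With :year-month registered we can also build containers that hold only YearMonth (or nil) objects, which is the guarantee mentioned above. A small sketch (expected result, not captured output):

(dtype/elemwise-datatype
 (dtype/make-container :year-month
                       (repeat 5 (java.time.YearMonth/parse "2021-12"))))
;; => :year-month  (expected)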

And finally, we could implement packing for this type. This means we could store year-months as, for example, 32-bit integer epoch-months. We won't demonstrate this as it is tedious, but the example in datetime/packing.clj may be sufficient to show how to do it; if not, let us know on Zulip or drop me an email.