tech.v3.dataset.reductions.apache-data-sketch

Reducers based on the Apache DataSketches family of algorithms.

apache data sketches

Algorithms included here are:

Set Cardinality

Quantiles

doubles

Example:

user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (require '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-quantiles (ds-sketch/prob-quantiles :price [0.25 0.5 0.75])
        :price-cdfs (ds-sketch/prob-cdfs :price [25 50 75])}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:

| :symbol |      :price-quantiles |              :price-cdfs |
|---------|-----------------------|--------------------------|
|    AAPL | [11.03, 36.81, 105.1] | [0.4065, 0.5528, 0.6423] |
|     IBM | [77.26, 88.70, 102.4] |   [0.000, 0.000, 0.1382] |
|    AMZN | [30.12, 41.50, 67.00] | [0.2249, 0.6396, 0.8103] |
|    MSFT | [21.75, 24.11, 27.34] |   [0.5772, 1.000, 1.000] |
|    GOOG | [338.5, 421.6, 510.0] |    [0.000, 0.000, 0.000] |
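
To make the two output columns concrete: judging from the table above, `prob-quantiles` estimates the values found at the requested ranks, while `prob-cdfs` estimates the fraction of values falling below each split point. The following is a minimal sketch of the exact counterpart of the CDF calculation in plain Clojure, with no dependencies. The helper name `exact-cdfs` is hypothetical and not part of the library, and the strict `<` comparison is an assumption; the sketch implementation may treat the boundary differently.

```clojure
;; Hypothetical helper, not part of tech.v3.dataset.
;; Assumption: a value counts toward a split point only if it is strictly below it.
(defn exact-cdfs
  "Exact fraction of the values in xs that fall strictly below each split point."
  [xs splits]
  (let [n (double (count xs))]
    (mapv (fn [split]
            (/ (count (filter #(< % split) xs)) n))
          splits)))

(exact-cdfs [10.0 20.0 30.0 40.0] [25 50 75])
;; => [0.5 1.0 1.0]
```

A row such as GOOG, whose prices all lie above 75, yields zeros for every split point under this definition, which matches the last row of the table.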

doubles-sketch-reducer

(doubles-sketch-reducer k finalize-fn)

Return a doubles sketch updater. The sketch acts as a reservoir, and k is the reservoir size. Statistical quantities such as quantiles and CDFs can then be derived from the reservoir.

A k of 128 results in about 1.7% error in returned quantities.


hll-reducer

(hll-reducer {:keys [hll-lgk hll-type datatype], :or {hll-lgk 12, hll-type 8, datatype :float64}})

Return a hamf parallel reducer that produces a HyperLogLog-based estimate of set cardinality.

At any point you can get an estimate from the reduced value - call sketch-estimate.

Options:

:hll-lgk - defaults to 12, this is log-base2 of k, so k = 4096. lgK can be from 4 to 21.
:hll-type - One of #{4,6,8}, defaults to 8. The HLL_4, HLL_6 and HLL_8 represent different levels of compression of the final HLL array where the 4, 6 and 8 refer to the number of bits each bucket of the HLL array is compressed down to. The HLL_4 is the most compressed but generally slightly slower than the other two, especially during union operations.
:datatype - One of :float64, :int64, :string, defaults to :float64.
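
As a small illustration of the `:hll-lgk` option, the following sketch computes k from its base-2 logarithm and builds an options map in the shape shown in the signature above. The helper `lgk->k` is hypothetical and not part of the library; the commented-out call assumes the namespace is aliased as `ds-sketch`, as in the example at the top of this page.

```clojure
;; Hypothetical helper, not part of tech.v3.dataset.
(defn lgk->k
  "Number of HLL buckets k for a given log-base-2 value lgk (valid range 4 to 21)."
  [lgk]
  (assert (<= 4 lgk 21) "lgk must be between 4 and 21")
  (bit-shift-left 1 lgk))

(lgk->k 12)
;; => 4096

;; An options map using only the keys listed above.
(def hll-opts {:hll-lgk 14 :hll-type 8 :datatype :string})

;; With the library on the classpath, the map would be passed as:
;; (ds-sketch/hll-reducer hll-opts)
```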
