tech.v3.dataset.reductions.apache-data-sketch

Reducers based on the Apache DataSketches family of algorithms.

Algorithms included here are:

Set Cardinality

Quantiles

Example:

user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (require '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-quantiles (ds-sketch/prob-quantiles :price [0.25 0.5 0.75])
        :price-cdfs (ds-sketch/prob-cdfs :price [25 50 75])}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:

| :symbol |      :price-quantiles |              :price-cdfs |
|---------|-----------------------|--------------------------|
|    AAPL | [11.03, 36.81, 105.1] | [0.4065, 0.5528, 0.6423] |
|     IBM | [77.26, 88.70, 102.4] |   [0.000, 0.000, 0.1382] |
|    AMZN | [30.12, 41.50, 67.00] | [0.2249, 0.6396, 0.8103] |
|    MSFT | [21.75, 24.11, 27.34] |   [0.5772, 1.000, 1.000] |
|    GOOG | [338.5, 421.6, 510.0] |    [0.000, 0.000, 0.000] |

doubles-sketch-reducer

(doubles-sketch-reducer k finalize-fn)

Return a reducer that updates a doubles sketch. The sketch acts as the reservoir and k is the reservoir size. Various statistical quantities can then be derived from the reservoir.

A k of 128 results in a normalized rank error of about 1.7% in returned quantities.
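As a rough illustration of what a normalized rank error means: a result requested at quantile q corresponds to a true rank somewhere in roughly [q - e, q + e]. The helper below is hypothetical and not part of this namespace, and 0.017 is simply the approximate figure quoted above.

```clojure
;; Hypothetical helper, for illustration only: given a requested quantile q
;; and a normalized rank error e, return the interval of true ranks that the
;; returned value may correspond to, clamped to [0, 1].
(defn rank-bounds
  [q e]
  [(max 0.0 (- q e))
   (min 1.0 (+ q e))])

;; With k = 128 (e is about 0.017), a "median" may really sit anywhere
;; between roughly the 48.3rd and 51.7th percentile.
(rank-bounds 0.5 0.017)
```

The error is expressed in rank, not in value: in a region where the data is sparse, a small rank error can still correspond to a large difference in the returned value.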

hll-reducer

(hll-reducer {:keys [hll-lgk hll-type datatype], :or {hll-lgk 12, hll-type 8, datatype :float64}})

Return a hamf parallel reducer that produces a HyperLogLog-based estimate of set cardinality.

At any point you can get an estimate from the reduced value - call sketch-estimate.

Options:

  • :hll-lgk - defaults to 12. This is the base-2 logarithm of k, so the default k is 4096. :hll-lgk can range from 4 to 21.
  • :hll-type - One of #{4,6,8}, defaults to 8. The HLL_4, HLL_6 and HLL_8 represent different levels of compression of the final HLL array where the 4, 6 and 8 refer to the number of bits each bucket of the HLL array is compressed down to. The HLL_4 is the most compressed but generally slightly slower than the other two, especially during union operations.
  • :datatype - One of :float64, :int64, :string. Defaults to :float64.
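The relationship between :hll-lgk, :hll-type and sketch size can be sketched with a little arithmetic. hll-array-bytes is a hypothetical helper, not part of this namespace. It counts only the bucket array (k buckets times bits per bucket) and ignores the sketch header and any auxiliary storage, so a real sketch will be somewhat larger.

```clojure
;; Hypothetical helper, for illustration only: approximate size in bytes of
;; the HLL bucket array for a given lgk and bits-per-bucket (hll-type).
(defn hll-array-bytes
  [hll-lgk hll-type]
  (let [k (bit-shift-left 1 hll-lgk)]   ; k = 2^lgk buckets
    (quot (* k hll-type) 8)))           ; bits -> bytes

(hll-array-bytes 12 8) ;; => 4096  (the defaults)
(hll-array-bytes 12 6) ;; => 3072
(hll-array-bytes 12 4) ;; => 2048
```

Each increment of :hll-lgk doubles the number of buckets, which is the usual trade of memory for accuracy.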

prob-cdf

(prob-cdf colname cdf k)
(prob-cdf colname cdf)

Probabilistic cdf evaluated at the single value passed in. See DoublesSketch. See prob-quantiles for k.

prob-cdfs

(prob-cdfs colname cdfs k)
(prob-cdfs colname cdfs)

Probabilistic cdfs, one for each value passed in. See DoublesSketch. See prob-quantiles for k.
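To make the meaning of the returned numbers concrete, here is an exact (non-probabilistic) version of the same quantity on a tiny vector. exact-cdfs is a hypothetical helper and not how this namespace computes its result; the sketch returns an approximation of these fractions. Whether a value equal to a split point counts as below it depends on the underlying sketch, so the data here avoids ties.

```clojure
;; Hypothetical helper, for illustration only: for each split point, the
;; exact fraction of values that fall below it.
(defn exact-cdfs
  [values split-points]
  (let [n (double (count values))]
    (mapv (fn [s]
            (/ (count (filter #(< % s) values)) n))
          split-points)))

(exact-cdfs [10 20 30 40 60 80 90 100] [25 50 75])
;; => [0.25 0.5 0.625]
```

Read this way, a row such as [0.000, 0.000, 0.1382] for split points [25 50 75] says that no prices fell below 50 and about 14% fell below 75.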

prob-interquartile-range

(prob-interquartile-range colname k)
(prob-interquartile-range colname)

Probabilistic interquartile range - DoublesSketch.

prob-median

(prob-median colname k)
(prob-median colname)

Probabilistic median - DoublesSketch.
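A minimal usage sketch for prob-median and prob-interquartile-range, following the same group-by-column-agg pattern as the example at the top of this page. It assumes tech.ml.dataset and the Apache DataSketches library are on the classpath. The prices dataset and its values are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Made-up data: two symbols with four prices each.
(def prices
  (ds/->dataset {:symbol ["A" "A" "A" "A" "B" "B" "B" "B"]
                 :price  [10.0 20.0 30.0 40.0 100.0 200.0 300.0 400.0]}))

;; One output row per symbol, with an approximate median and
;; interquartile range of :price (default k).
(def summary
  (ds-reduce/group-by-column-agg
   :symbol
   {:symbol       (ds-reduce/first-value :symbol)
    :price-median (ds-sketch/prob-median :price)
    :price-iqr    (ds-sketch/prob-interquartile-range :price)}
   [prices]))
```

On inputs this small the results are likely to match the exact values closely; the approximation matters once the data no longer fits comfortably in the reservoir.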

prob-pmfs

(prob-pmfs colname pmfs k)
(prob-pmfs colname pmfs)

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values). See DoublesSketch.

See prob-quantiles for k.
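For intuition, the exact counterpart of a PMF over split points is the fraction of values landing in each interval between consecutive split points. exact-pmfs is a hypothetical helper, not this namespace's implementation. It follows the convention of the underlying DoublesSketch getPMF method, where m split points define m+1 intervals; whether prob-pmfs returns all m+1 masses is not stated here, and the treatment of values equal to a split point is an assumption, so the data avoids ties.

```clojure
;; Hypothetical helper, for illustration only: exact fraction of values in
;; each interval [lo, hi) defined by consecutive split points, with open
;; ends below the first and above the last split point.
(defn exact-pmfs
  [values split-points]
  (let [n      (double (count values))
        bounds (concat [Double/NEGATIVE_INFINITY]
                       split-points
                       [Double/POSITIVE_INFINITY])]
    (mapv (fn [[lo hi]]
            (/ (count (filter #(and (<= lo %) (< % hi)) values)) n))
          (partition 2 1 bounds))))

(exact-pmfs [10 20 30 40 60 80 90 100] [25 50 75])
;; => [0.25 0.25 0.125 0.375]
```

The masses sum to 1, and their running totals give the corresponding cdf values.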

prob-quantile

(prob-quantile colname quantile k)
(prob-quantile colname quantile)

Probabilistic quantile estimation - see DoublesSketch.

  • k - defaults to 128. This produces a normalized rank error of about 1.7%

prob-quantiles

(prob-quantiles colname quantiles k)
(prob-quantiles colname quantiles)

Probabilistic quantile estimation - see DoublesSketch.

  • quantiles - sequence of quantiles.
  • k - defaults to 128. This produces a normalized rank error of about 1.7%
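A minimal sketch of passing an explicit k, assuming tech.ml.dataset and the Apache DataSketches library are on the classpath. The readings dataset is made up for illustration. The value 256 is an arbitrary choice; in the underlying DataSketches quantiles sketch k is expected to be a power of 2, and a larger k should reduce the rank error at the cost of memory.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Made-up data: 1000 readings split across two sensors.
(def readings
  (ds/->dataset {:sensor  (vec (take 1000 (cycle ["s1" "s2"])))
                 :reading (mapv double (range 1000))}))

;; Approximate quartiles per sensor, using k = 256 instead of the default 128.
(def quartiles
  (ds-reduce/group-by-column-agg
   :sensor
   {:sensor    (ds-reduce/first-value :sensor)
    :quartiles (ds-sketch/prob-quantiles :reading [0.25 0.5 0.75] 256)}
   [readings]))
```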

prob-set-cardinality

(prob-set-cardinality colname options)
(prob-set-cardinality colname)

Get the probabilistic set cardinality using hyper-log-log. See hll-reducer.
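A minimal usage sketch, assuming tech.ml.dataset and the Apache DataSketches library are on the classpath and that the options map is passed through to hll-reducer as the cross-reference suggests. The trades dataset is made up for illustration. Because :venue holds strings, :datatype is set to :string rather than left at the :float64 default.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Made-up data: symbol "A" trades on two distinct venues, "B" on one.
(def trades
  (ds/->dataset {:symbol ["A" "A" "A" "B" "B"]
                 :venue  ["x" "y" "x" "z" "z"]}))

;; Approximate number of distinct venues per symbol.
(def venue-counts
  (ds-reduce/group-by-column-agg
   :symbol
   {:symbol   (ds-reduce/first-value :symbol)
    :n-venues (ds-sketch/prob-set-cardinality :venue {:datatype :string})}
   [trades]))
```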