tech.v3.dataset.reductions.apache-data-sketch
Reducers based on the Apache DataSketches family of algorithms.
Algorithms included here are:
- Set Cardinality
- Quantiles
Example:
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (require '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
  #'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-quantiles (ds-sketch/prob-quantiles :price [0.25 0.5 0.75])
        :price-cdfs (ds-sketch/prob-cdfs :price [25 50 75])}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:
| :symbol |      :price-quantiles |              :price-cdfs |
|---------|-----------------------|--------------------------|
|    AAPL | [11.03, 36.81, 105.1] | [0.4065, 0.5528, 0.6423] |
|     IBM | [77.26, 88.70, 102.4] |   [0.000, 0.000, 0.1382] |
|    AMZN | [30.12, 41.50, 67.00] | [0.2249, 0.6396, 0.8103] |
|    MSFT | [21.75, 24.11, 27.34] |   [0.5772, 1.000, 1.000] |
|    GOOG | [338.5, 421.6, 510.0] |    [0.000, 0.000, 0.000] |
doubles-sketch-reducer
(doubles-sketch-reducer k finalize-fn)
Return a doubles updater. The sketch acts as a reservoir and k is the reservoir size. Various statistical quantities can then be derived from the reservoir.
A k of 128 results in a normalized rank error of about 1.7% in returned quantities.
hll-reducer
(hll-reducer {:keys [hll-lgk hll-type datatype], :or {hll-lgk 12, hll-type 8, datatype :float64}})
Return a hamf parallel reducer that produces a HyperLogLog-based set cardinality estimate.
At any point you can get an estimate from the reduced value by calling sketch-estimate.
Options:
- :hll-lgk - defaults to 12. This is the log-base-2 of k, so the default gives k = 4096. Valid values are 4 to 21.
- :hll-type - One of #{4,6,8}, defaults to 8. HLL_4, HLL_6 and HLL_8 are different levels of compression of the final HLL array, where 4, 6 and 8 are the number of bits each bucket of the array is compressed down to. HLL_4 is the most compressed but generally slightly slower than the other two, especially during union operations.
- :datatype - One of :float64, :int64, :string. Defaults to :float64.
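The relationship between :hll-lgk and the bucket count can be checked with plain arithmetic. The sketch below uses only core Clojure. Both helper names are hypothetical, and the `1.04 / sqrt(k)` figure is the textbook HyperLogLog relative standard error, offered as a rough guide rather than the exact bound of the DataSketches estimator.

```clojure
;; Hypothetical helpers for illustration; not part of the library.
(defn lgk->k
  "Number of HLL buckets for a given log-base-2 bucket count."
  [lgk]
  (bit-shift-left 1 lgk))

(defn approx-relative-error
  "Textbook HyperLogLog relative standard error, 1.04 / sqrt(k).
  An approximation only; the DataSketches estimator may differ."
  [lgk]
  (/ 1.04 (Math/sqrt (double (lgk->k lgk)))))

(lgk->k 12)                ;; => 4096
(approx-relative-error 12) ;; => 0.01625, i.e. roughly 1.6%
```

Raising :hll-lgk by 2 quadruples the bucket count and, under this approximation, halves the expected relative error.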
prob-cdf
(prob-cdf colname cdf k)
(prob-cdf colname cdf)
Probabilistic CDF for the value passed in. See DoublesSketch. See prob-quantiles for k.
prob-cdfs
(prob-cdfs colname cdfs k)
(prob-cdfs colname cdfs)
Probabilistic CDFs, one for each value passed in. See DoublesSketch. See prob-quantiles for k.
prob-interquartile-range
(prob-interquartile-range colname k)
(prob-interquartile-range colname)
Probabilistic interquartile range - see DoublesSketch.
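As a usage sketch, the interquartile range reducer can be dropped into the same kind of aggregation map used in the namespace example. This assumes the library is on the classpath and that the `test/data/stocks.csv` file from that example is available. The column name `:price-iqr` and the var `iqr-by-symbol` are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Assumes the stocks.csv used in the namespace example exists at this path.
(def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))

;; One row per symbol with an estimated interquartile range of :price,
;; using the default k.
(def iqr-by-symbol
  (ds-reduce/group-by-column-agg
   :symbol
   {:symbol (ds-reduce/first-value :symbol)
    :price-iqr (ds-sketch/prob-interquartile-range :price)}
   [stocks]))
```

The interquartile range is by definition the distance between the 0.75 and 0.25 quantiles, so the estimates should land close to the gap between the outer two values that prob-quantiles reports for `[0.25 0.5 0.75]`.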
prob-pmfs
(prob-pmfs colname pmfs k)
(prob-pmfs colname pmfs)
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of split points (values). See DoublesSketch.
See prob-quantiles for k.
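A PMF over split points and a CDF at the same split points describe the same distribution: the mass in each interval is the difference between consecutive cumulative fractions. The sketch below illustrates that relationship with core Clojure only, using the AAPL :price-cdfs values from the namespace example. The helper name is hypothetical, and this is not the library's implementation; how prob-pmfs orders or trims its own result is not shown here.

```clojure
;; Hypothetical helper: turn cumulative fractions at ascending split points
;; into the mass that falls in each interval between them.
(defn cdf->interval-masses
  [cdf-values]
  (let [bounds (concat [0.0] cdf-values [1.0])]
    (mapv (fn [[lo hi]] (- hi lo))
          (partition 2 1 bounds))))

;; AAPL :price-cdfs from the example table, split points 25, 50 and 75.
(def aapl-masses (cdf->interval-masses [0.4065 0.5528 0.6423]))

;; Four masses: below 25, 25 to 50, 50 to 75, and 75 or above.
;; Approximately [0.4065 0.1463 0.0895 0.3577], up to floating-point rounding.
aapl-masses
```

Three split points yield four intervals, and the masses sum to 1.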
prob-quantile
(prob-quantile colname quantile k)
(prob-quantile colname quantile)
Probabilistic quantile estimation - see DoublesSketch.
- k - defaults to 128. This produces a normalized rank error of about 1.7%.
prob-quantiles
(prob-quantiles colname quantiles k)
(prob-quantiles colname quantiles)
Probabilistic quantile estimation - see DoublesSketch.
- quantiles - sequence of quantiles.
- k - defaults to 128. This produces a normalized rank error of about 1.7%.
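Normalized rank error is measured in rank, not in value: the value returned for a requested quantile is expected to sit at a true rank within roughly plus or minus the error of the requested one. The sketch below works through that arithmetic with core Clojure only. The helper name and the stream size of 10000 are hypothetical, chosen for illustration.

```clojure
;; Hypothetical helper illustrating what a normalized rank error means.
(defn rank-bounds
  "Range of true ranks, as item counts, that an estimate for quantile q
  may correspond to, given n items and normalized rank error eps."
  [n q eps]
  (let [lo (max 0.0 (- q eps))
        hi (min 1.0 (+ q eps))]
    [(Math/round (double (* n lo)))
     (Math/round (double (* n hi)))]))

;; Median of 10000 items with the 1.7% error quoted for k = 128.
(rank-bounds 10000 0.5 0.017) ;; => [4830 5170]
```

In other words, the estimated median of 10000 items is expected to be an item whose true position lies somewhere between the 4830th and the 5170th smallest. How far apart those items are in value depends on the data.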
prob-set-cardinality
(prob-set-cardinality colname options)
(prob-set-cardinality colname)
Get the probabilistic set cardinality using HyperLogLog. See hll-reducer.
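A minimal usage sketch, assuming the library is on the classpath and the `test/data/stocks.csv` file from the namespace example is available. `ds-reduce/aggregate` does not appear in this section; it is assumed here to be the whole-dataset counterpart of group-by-column-agg in `tech.v3.dataset.reductions`. The column name `:n-symbols` and the var `cardinality` are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Assumes the stocks.csv used in the namespace example exists at this path.
(def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))

;; Estimate the number of distinct symbols across the whole dataset.
;; :symbol holds strings, so :datatype is set to :string instead of
;; the :float64 default.
(def cardinality
  (ds-reduce/aggregate
   {:n-symbols (ds-sketch/prob-set-cardinality :symbol {:datatype :string})}
   [stocks]))
```

The namespace example shows five groups for this file, so the estimate should come out close to 5. Other hll-reducer options such as :hll-lgk can be passed in the same options map.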