tech.v3.dataset.reductions.apache-data-sketch
Reducers based on the Apache DataSketches family of algorithms.
Algorithms included here are:
- Set Cardinality
- Quantiles
Example:
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (require '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-quantiles (ds-sketch/prob-quantiles :price [0.25 0.5 0.75])
        :price-cdfs (ds-sketch/prob-cdfs :price [25 50 75])}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:
| :symbol | :price-quantiles | :price-cdfs |
|---------|-----------------------|--------------------------|
| AAPL | [11.03, 36.81, 105.1] | [0.4065, 0.5528, 0.6423] |
| IBM | [77.26, 88.70, 102.4] | [0.000, 0.000, 0.1382] |
| AMZN | [30.12, 41.50, 67.00] | [0.2249, 0.6396, 0.8103] |
| MSFT | [21.75, 24.11, 27.34] | [0.5772, 1.000, 1.000] |
| GOOG | [338.5, 421.6, 510.0] | [0.000, 0.000, 0.000] |
doubles-sketch-reducer
(doubles-sketch-reducer k finalize-fn)
Return a doubles updater. The underlying sketch acts as the reservoir, and k is the reservoir size. Various statistical quantities can then be derived from the reservoir.
A k of 128 results in a normalized rank error of about 1.7% in the returned quantities.
hll-reducer
(hll-reducer {:keys [hll-lgk hll-type datatype], :or {hll-lgk 12, hll-type 8, datatype :float64}})
Return a hamf parallel reducer that produces a HyperLogLog-based set cardinality estimate.
At any point you can get an estimate from the reduced value by calling sketch-estimate.
Options:
- :hll-lgk - defaults to 12. This is the log-base-2 of k, so k = 4096. lgK can be from 4 to 21.
- :hll-type - one of #{4,6,8}, defaults to 8. The HLL_4, HLL_6 and HLL_8 represent different levels of compression of the final HLL array, where the 4, 6 and 8 refer to the number of bits each bucket of the HLL array is compressed down to. The HLL_4 is the most compressed but generally slightly slower than the other two, especially during union operations.
- :datatype - one of :float64, :int64, :string.
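The relationship between :hll-lgk and k can be checked with a little arithmetic. The following is a minimal sketch in plain Clojure; lgk->k is a hypothetical helper name, not part of this namespace, and the range check simply mirrors the 4 to 21 limit stated above.

```clojure
;; Hypothetical helper, not part of tech.v3.dataset: converts the
;; :hll-lgk option (log-base-2 of k) into the bucket count k.
(defn lgk->k
  [lgk]
  ;; The documented range for lgK is 4 to 21 inclusive.
  (when-not (<= 4 lgk 21)
    (throw (ex-info "lgk must be between 4 and 21" {:lgk lgk})))
  (bit-shift-left 1 lgk))

(lgk->k 12) ;; => 4096, the default
(lgk->k 4)  ;; => 16
(lgk->k 21) ;; => 2097152
```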
prob-cdf
(prob-cdf colname cdf k)
(prob-cdf colname cdf)
Probabilistic cdf for the single value passed in. See DoublesSketch. See prob-quantiles for k.
prob-cdfs
(prob-cdfs colname cdfs k)
(prob-cdfs colname cdfs)
Probabilistic cdfs, one for each value passed in. See DoublesSketch. See prob-quantiles for k.
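To interpret the output, it can help to see the exact quantity that the sketch approximates. The following plain-Clojure sketch computes, for each split value, the fraction of input values strictly below it. The name exact-cdfs is hypothetical, and the strict comparison is an assumption about the boundary convention, which DoublesSketch may handle differently.

```clojure
;; Hypothetical helper: exact (non-probabilistic) CDF at each split value.
;; Assumes "fraction of values strictly below the split value".
(defn exact-cdfs
  [values split-values]
  (let [n (double (count values))]
    (mapv (fn [split]
            (/ (count (filter #(< % split) values)) n))
          split-values)))

(exact-cdfs [10 20 30 40 60 80 90 100] [25 50 75])
;; => [0.25 0.5 0.625]
```

Read this way, an entry such as 0.1382 for IBM at 75 in the example table means that roughly 14% of IBM prices fall below 75.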
prob-interquartile-range
(prob-interquartile-range colname k)
(prob-interquartile-range colname)
Probabilistic interquartile range - see DoublesSketch. See prob-quantiles for k.
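The interquartile range is the distance between the 0.75 and 0.25 quantiles. As a minimal sketch, assuming a quartile vector shaped like the :price-quantiles output of prob-quantiles with [0.25 0.5 0.75], it can be derived as follows. The name quartiles->iqr is a hypothetical helper, and whether prob-interquartile-range computes its result exactly this way internally is not stated here.

```clojure
;; Hypothetical helper: interquartile range from a [q25 median q75] vector.
(defn quartiles->iqr
  [[q25 _median q75]]
  (- q75 q25))

;; MSFT row from the example table: [21.75, 24.11, 27.34]
(quartiles->iqr [21.75 24.11 27.34]) ;; roughly 5.59
```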
prob-pmfs
(prob-pmfs colname pmfs k)
(prob-pmfs colname pmfs)
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values). See DoublesSketch.
See prob-quantiles for k.
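For intuition, the exact quantity being approximated can be written in a few lines of plain Clojure. This is only a sketch under assumptions: exact-pmf is a hypothetical name, each interval is taken to include its left split point and exclude its right one, and the number of entries that prob-pmfs itself returns is not documented here.

```clojure
;; Hypothetical helper: exact fraction of values falling in each interval
;; defined by the split points. m split points give m + 1 intervals:
;; (-inf, s1), [s1, s2), ..., [sm, +inf).
(defn exact-pmf
  [values split-points]
  (let [n (count values)
        ;; Interval index of a value = number of split points <= that value.
        bucket (fn [v] (count (filter #(<= % v) split-points)))
        counts (frequencies (map bucket values))]
    (mapv #(/ (double (get counts % 0)) n)
          (range (inc (count split-points))))))

(exact-pmf [10 20 30 40 60 80 90 100] [25 50 75])
;; => [0.25 0.25 0.125 0.375]
```

The entries sum to 1, since every value lands in exactly one interval.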
prob-quantile
(prob-quantile colname quantile k)
(prob-quantile colname quantile)
Probabilistic quantile estimation - see DoublesSketch.
- k - defaults to 128. This produces a normalized rank error of about 1.7%
prob-quantiles
(prob-quantiles colname quantiles k)
(prob-quantiles colname quantiles)
Probabilistic quantile estimation - see DoublesSketch.
- quantiles - sequence of quantiles.
- k - defaults to 128. This produces a normalized rank error of about 1.7%
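The error quoted for k applies to the normalized rank of the returned value, not to the value itself. As a rough illustration in plain Clojure (rank-bounds is a hypothetical helper, and 0.017 is the approximate figure quoted above for k = 128), a requested quantile q corresponds to a window of true ranks:

```clojure
;; Hypothetical helper: window of true normalized ranks that an estimate for
;; quantile q may fall into, given a normalized rank error eps.
(defn rank-bounds
  [q eps]
  [(max 0.0 (- q eps)) (min 1.0 (+ q eps))])

(rank-bounds 0.5 0.017)  ;; roughly [0.483 0.517]
(rank-bounds 0.99 0.017) ;; upper bound clamps to 1.0
```

In other words, a reported median would typically sit somewhere between about the 48.3rd and 51.7th percentile of the data.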
prob-set-cardinality
(prob-set-cardinality colname options)
(prob-set-cardinality colname)
Get the probabilistic set cardinality using HyperLogLog. See hll-reducer for the available options.
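A usage sketch, following the same pattern as the namespace example at the top. It assumes the same test/data/stocks.csv file is available and that :price is read as a floating-point column; the output column names :n-prices and :n-prices-small are arbitrary, and no output is shown because the estimates are approximate.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce]
         '[tech.v3.dataset.reductions.apache-data-sketch :as ds-sketch])

;; Same sample file as the namespace example; adjust the path as needed.
(def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))

(def distinct-prices
  (ds-reduce/group-by-column-agg
   :symbol
   {:symbol (ds-reduce/first-value :symbol)
    ;; One-argument arity: relies on the hll-reducer defaults
    ;; (:hll-lgk 12, :hll-type 8, :datatype :float64).
    :n-prices (ds-sketch/prob-set-cardinality :price)
    ;; Two-argument arity with an hll-reducer options map; a smaller
    ;; :hll-lgk means fewer buckets (k = 256 here).
    :n-prices-small (ds-sketch/prob-set-cardinality :price {:hll-lgk 8})}
   [stocks]))
```

Each row of the result holds an estimated count of distinct prices for one symbol.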