tech.v3.dataset.set
Extensions to datasets that perform per-row, bag-semantics set union and intersection.
intersection
(intersection a)
(intersection a b)
(intersection a b & args)
Intersect datasets, producing a new dataset with the intersection of tuples. Tuples present in all datasets are repeated in the final dataset at their minimum per-dataset repetition count.
reduce-intersection
(reduce-intersection options datasets)
(reduce-intersection datasets)
Given a sequence of datasets, intersect the rows such that tuples that exist in all datasets appear in the final dataset at their minimum repetition count. Can return either a dataset with duplicate tuples or a dataset with a :count column.
Options:
:count - Name of the count column. If nil, tuples are duplicated and the count is implicit.
user> (def ds-a (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}]))
#'user/ds-a
user> (def ds-b (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}]))
#'user/ds-b
user> (ds-set/reduce-intersection [ds-a ds-b])
_unnamed [2 2]:
| :a | :b |
|---:|---:|
| 1 | 2 |
| 1 | 2 |
user> (ds-set/reduce-intersection {:count :count} [ds-a ds-b])
_unnamed [1 3]:
| :a | :b | :count |
|---:|---:|-------:|
| 1 | 2 | 2 |
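The minimum-count rule above can be sketched in plain Clojure, independent of the library. This is only an illustration of the bag semantics, assuming rows are represented as ordinary maps; `bag-intersection-counts` is a hypothetical helper, not part of `tech.v3.dataset.set`, and says nothing about how the library implements the operation.

```clojure
;; Hypothetical helper (not part of tech.v3.dataset.set): bag intersection
;; over plain sequences of row maps.
(defn bag-intersection-counts
  "Returns a map of row -> minimum repetition count, keeping only rows
  that are present in every input collection."
  [row-colls]
  (let [freqs (map frequencies row-colls)]
    (reduce (fn [acc m]
              ;; Keep a row only if the next collection also contains it,
              ;; recording the smaller of the two counts.
              (into {}
                    (keep (fn [[row n]]
                            (when-let [other (get m row)]
                              [row (min n other)])))
                    acc))
            (first freqs)
            (rest freqs))))

;; Same rows as ds-a and ds-b in the REPL session above.
(def rows-a [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}])
(def rows-b [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}])

;; Explicit-count form, analogous to passing {:count :count}.
(def counted (bag-intersection-counts [rows-a rows-b]))
;; counted => {{:a 1, :b 2} 2}

;; Implicit-count form: expand each count back into duplicated rows.
(def expanded (mapcat (fn [[row n]] (repeat n row)) counted))
;; expanded => ({:a 1, :b 2} {:a 1, :b 2})
```

The two results mirror the two return shapes described above: one entry per distinct tuple with a count, or the tuple duplicated that many times.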
reduce-union
(reduce-union options datasets)
(reduce-union datasets)
Given a sequence of datasets, union the rows such that all tuples appear in the final dataset at their maximum repetition count. Can return either a dataset with duplicate tuples or a dataset with a :count column.
Options:
:count - Name of the count column. If nil, tuples are duplicated and the count is implicit.
user> (def ds-a (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}]))
#'user/ds-a
user> (def ds-b (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}]))
#'user/ds-b
user> (ds-set/reduce-union [ds-a ds-b])
_unnamed [4 2]:
| :a | :b |
|---:|---:|
| 2 | 3 |
| 3 | 3 |
| 1 | 2 |
| 1 | 2 |
user> (ds-set/reduce-union {:count :count} [ds-a ds-b])
_unnamed [3 3]:
| :a | :b | :count |
|---:|---:|-------:|
| 2 | 3 | 1 |
| 3 | 3 | 1 |
| 1 | 2 | 2 |
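The maximum-count rule can likewise be sketched in plain Clojure. As before, this assumes rows are ordinary maps and uses a hypothetical helper, `bag-union-counts`, that is not part of `tech.v3.dataset.set`; it illustrates the semantics only, not the library's implementation.

```clojure
;; Hypothetical helper (not part of tech.v3.dataset.set): bag union
;; over plain sequences of row maps.
(defn bag-union-counts
  "Returns a map of row -> maximum repetition count across all input
  collections."
  [row-colls]
  ;; frequencies counts the rows of each collection; merge-with max keeps
  ;; the largest count seen for each distinct row.
  (apply merge-with max (map frequencies row-colls)))

;; Same rows as ds-a and ds-b in the REPL session above.
(def rows-a [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}])
(def rows-b [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}])

(def counted (bag-union-counts [rows-a rows-b]))

;; Explicit-count form: attach the count to each distinct row.
;; Row order is unspecified because `counted` is a map.
(def with-count (map (fn [[row n]] (assoc row :count n)) counted))

;; Implicit-count form: duplicate each row by its count (4 rows in total).
(def expanded (mapcat (fn [[row n]] (repeat n row)) counted))
```

Note that `{:a 1 :b 2}` appears twice rather than four times: under bag semantics the union takes the maximum per-dataset count, not the sum.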
union
(union a)
(union a b)
(union a b & args)
Union datasets, producing a new dataset with the union of tuples. Repeated tuples are repeated in the final dataset at their maximum per-dataset repetition count.