tech.v3.dataset.set

Extensions to datasets for per-row set union and intersection with bag (multiset) semantics.

difference

(difference a)
(difference a b)

Remove tuples from a that also appear in b.
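
A minimal sketch of this behavior on plain row maps, reading the sentence above literally: every occurrence in a of a tuple that also appears in b is dropped. `rows-difference` is a hypothetical helper, not part of tech.v3.dataset.set, and whether the library removes all occurrences or subtracts repetition counts is not stated here, so the handling of repeats is an assumption.

```clojure
;; Hypothetical helper, not part of tech.v3.dataset.set.
;; Assumption: a row in rows-a is removed whenever an equal row
;; appears anywhere in rows-b, regardless of repetition counts.
(defn rows-difference
  [rows-a rows-b]
  (let [in-b? (set rows-b)]      ; set of rows doubles as a membership predicate
    (remove in-b? rows-a)))

(def rows-a [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}])
(def rows-b [{:a 1 :b 2} {:a 3 :b 3}])

(rows-difference rows-a rows-b)
;; => ({:a 2, :b 3})
```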

intersection

(intersection a)
(intersection a b)
(intersection a b & args)

Intersect two or more datasets, producing a new dataset with the intersection of their tuples. Tuples that appear in all datasets are repeated in the final dataset at their minimum per-dataset repetition count.
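
The minimum-count rule can be illustrated on plain row maps with `frequencies`. This is only a sketch of the bag semantics, not the library's implementation; `bag-intersection` is a hypothetical name introduced for illustration.

```clojure
;; Hypothetical helper: multiset intersection of two row sequences.
;; Each row is emitted min(count in rows-a, count in rows-b) times.
(defn bag-intersection
  [rows-a rows-b]
  (let [freq-a (frequencies rows-a)
        freq-b (frequencies rows-b)]
    (mapcat (fn [[row n]]
              ;; rows missing from rows-b have count 0 and vanish
              (repeat (min n (get freq-b row 0)) row))
            freq-a)))

(def rows-a [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}])
(def rows-b [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}])

(bag-intersection rows-a rows-b)
;; => ({:a 1, :b 2} {:a 1, :b 2})
```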

reduce-intersection

(reduce-intersection options datasets)
(reduce-intersection datasets)

Given a sequence of datasets, intersect the rows such that tuples that exist in all datasets appear in the final dataset at their minimum repetition count. Can return either a dataset with duplicate tuples or a dataset with a :count column.

Options:

  • :count - Name of the count column. If nil, tuples are duplicated and the count is implicit.
user> (def ds-a (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}]))
#'user/ds-a
user> (def ds-b (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}]))
#'user/ds-b
user> (ds-set/reduce-intersection [ds-a ds-b])
_unnamed [2 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |
|  1 |  2 |
user> (ds-set/reduce-intersection {:count :count} [ds-a ds-b])
_unnamed [1 3]:

| :a | :b | :count |
|---:|---:|-------:|
|  1 |  2 |      2 |
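
The two result shapes above carry the same information. As a rough sketch on plain row maps (the helper names `collapse-rows` and `expand-rows` are hypothetical and not part of the library), one form can be converted to the other:

```clojure
;; Hypothetical helper: duplicated rows -> one row per tuple plus a count key.
(defn collapse-rows
  [count-key rows]
  (map (fn [[row n]] (assoc row count-key n))
       (frequencies rows)))

;; Hypothetical helper: rows with a count key -> duplicated rows without it.
(defn expand-rows
  [count-key rows]
  (mapcat (fn [row]
            (repeat (get row count-key) (dissoc row count-key)))
          rows))

(def duplicated [{:a 1 :b 2} {:a 1 :b 2}])

(collapse-rows :count duplicated)
;; => ({:a 1, :b 2, :count 2})

(expand-rows :count (collapse-rows :count duplicated))
;; => ({:a 1, :b 2} {:a 1, :b 2})
```

The counted form is more compact when tuples repeat often; the duplicated form keeps the original column layout.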

reduce-union

(reduce-union options datasets)
(reduce-union datasets)

Given a sequence of datasets, union the rows such that all tuples appear in the final dataset at their maximum repetition amount. Can return either a dataset with duplicate tuples or a dataset with a :count column.

Options:

  • :count - Name of the count column. If nil, tuples are duplicated and the count is implicit.
user> (def ds-a (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}]))
#'user/ds-a
user> (def ds-b (ds/->dataset [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}]))
#'user/ds-b
user> (ds-set/reduce-union [ds-a ds-b])
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  2 |  3 |
|  3 |  3 |
|  1 |  2 |
|  1 |  2 |
user> (ds-set/reduce-union {:count :count} [ds-a ds-b])
_unnamed [3 3]:

| :a | :b | :count |
|---:|---:|-------:|
|  2 |  3 |      1 |
|  3 |  3 |      1 |
|  1 |  2 |      2 |
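
The maximum-count rule across any number of inputs can be sketched on plain row maps with `merge-with max`. This illustrates the semantics only and is not the library's implementation; `reduce-bag-union` is a hypothetical name.

```clojure
;; Hypothetical helper: map of row -> maximum per-sequence count
;; across all row sequences.
(defn reduce-bag-union
  [row-seqs]
  (apply merge-with max (map frequencies row-seqs)))

(def rows-a [{:a 1 :b 2} {:a 1 :b 2} {:a 2 :b 3}])
(def rows-b [{:a 1 :b 2} {:a 1 :b 2} {:a 3 :b 3}])

;; {:a 1 :b 2} occurs twice in each input, so its count stays 2
;; rather than summing to 4.
(= (reduce-bag-union [rows-a rows-b])
   {{:a 1 :b 2} 2, {:a 2 :b 3} 1, {:a 3 :b 3} 1})
;; => true
```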

union

(union a)
(union a b)
(union a b & args)

Union two or more datasets, producing a new dataset with the union of their tuples. Repeated tuples are repeated in the final dataset at their maximum per-dataset repetition count.