tech.v3.dataset.join
Implementation of join algorithms, both exact (hash-join) and near.
hash-join
(hash-join colname lhs rhs)
(hash-join colname lhs rhs {:keys [operation-space], :or {operation-space :int32}, :as options})
Join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it is destructured as:
(let [[lhs-colname rhs-colname] colname] ...)
An options map can be passed in with optional arguments:
:lhs-missing? - Calculate the missing lhs indexes and left outer join table.
:rhs-missing? - Calculate the missing rhs indexes and right outer join table.
:operation-space - either :int32 or :int64. Defaults to :int32.
Returns
{:join-table - joined-table
 :lhs-indexes - matched lhs indexes
 :rhs-indexes - matched rhs indexes
 ;; -- when rhs-missing? is true --
 :rhs-missing - missing indexes of rhs.
 :rhs-outer-join - rhs outer join table.
 ;; -- when lhs-missing? is true --
 :lhs-missing - missing indexes of lhs.
 :lhs-outer-join - lhs outer join table.
}
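As a minimal sketch of the return map described above, the following calls hash-join with :lhs-missing? set and looks at two of the documented keys. The datasets lhs and rhs and their columns (:id, :x, :y) are made up for illustration; only the function, option, and result keys come from the docstring.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical sample data: ids 2 and 3 appear on both sides.
(def lhs (ds/->dataset {:id [1 2 3] :x [10 20 30]}))
(def rhs (ds/->dataset {:id [2 3 4] :y [200 300 400]}))

;; Request the lhs bookkeeping in addition to the plain join table.
(def result (ds-join/hash-join :id lhs rhs {:lhs-missing? true}))

;; Rows whose :id was found on both sides.
(ds/row-count (:join-table result))

;; Every lhs row, whether or not it found a match in rhs.
(ds/row-count (:lhs-outer-join result))
```

With this data, the join table should hold the two matching ids, and the left outer join table should hold all three lhs rows.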
inner-join
(inner-join colname lhs rhs)
(inner-join colname lhs rhs options)
Inner join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it is destructured as:
(let [[lhs-colname rhs-colname] colname] ...)
An options map can be passed in with optional arguments:
:operation-space - either :int32 or :int64. Defaults to :int32.
Returns the joined table.
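The tuple form of colname is useful when the join column has a different name on each side. A small sketch, assuming hypothetical datasets whose key columns are named :id on the left and :key on the right:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: the join key is :id in lhs and :key in rhs.
(def lhs (ds/->dataset {:id [1 2 3] :x [10 20 30]}))
(def rhs (ds/->dataset {:key [2 3 4] :y [200 300 400]}))

;; [lhs-colname rhs-colname] names the join column for each side.
(def joined (ds-join/inner-join [:id :key] lhs rhs))

;; Only keys present on both sides (2 and 3) survive an inner join.
(ds/row-count joined)
```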
left-join
(left-join colname lhs rhs)
(left-join colname lhs rhs options)
Left join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it is destructured as:
(let [[lhs-colname rhs-colname] colname] ...)
An options map can be passed in with optional arguments:
:operation-space - either :int32 or :int64. Defaults to :int32.
Returns the joined table.
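To illustrate how a left join differs from an inner join, the sketch below joins the same pair of hypothetical datasets both ways and compares row counts. The data and column names are assumptions made for the example.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: only id 3 has a partner in rhs.
(def lhs (ds/->dataset {:id [1 2 3] :x [10 20 30]}))
(def rhs (ds/->dataset {:id [3] :y [300]}))

(def inner (ds-join/inner-join :id lhs rhs))
(def left  (ds-join/left-join :id lhs rhs))

;; The inner join keeps only the matched row; the left join keeps
;; every lhs row and leaves the rhs columns missing where unmatched.
[(ds/row-count inner) (ds/row-count left)]
```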
left-join-asof
(left-join-asof colname lhs rhs {:keys [asof-op], :or {asof-op :<=}})
(left-join-asof colname lhs rhs)
Perform a left join asof. Similar to left join except this will join on the nearest value. lhs and rhs must be sorted by the join column. Join columns must be either datetime columns, in which case the join happens in millisecond space, or numeric columns (integer or floating-point datatypes).
Options:
:asof-op - may be :<, :<=, :nearest, :>=, or :>; the type of join operation. Defaults to :<=.
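A rough sketch of calling left-join-asof on numeric columns, once with the default :asof-op and once with :nearest. The datasets trades and quotes are invented for illustration and are already sorted by the join column :t. The sketch deliberately does not state which rhs row each operator selects; it only relies on the result being a left join, so it is expected to carry one row per lhs row.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data; both sides are sorted by :t, as the docstring requires.
(def trades (ds/->dataset {:t [1 5 10] :qty [100 200 300]}))
(def quotes (ds/->dataset {:t [2 3 6] :price [1.5 1.6 1.7]}))

;; Default operator (:<=).
(def default-asof (ds-join/left-join-asof :t trades quotes))

;; Explicitly pick the closest rhs value in either direction.
(def nearest-asof (ds-join/left-join-asof :t trades quotes {:asof-op :nearest}))

;; Both results should have as many rows as trades.
[(ds/row-count default-asof) (ds/row-count nearest-asof)]
```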
pd-merge
(pd-merge left-ds right-ds options)
(pd-merge left-ds right-ds)
Pandas-style merge. This is similar to join except it can merge on multiple columns for the left and right sides.
Options:
:on - column name or list of column names. Names must be found in both datasets.
:left-on - column name or list of column names.
:right-on - column name or list of column names.
:how - :left, :right, :inner, :outer, or :cross. If :cross, then no :on, :left-on, or :right-on may be provided.
Examples:
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.join :as ds-join])
nil
user> (def ds-a (ds/->dataset {:a [:a :b :b :a :c]
                               :b (range 5)
                               :c (range 5)}))
#'user/ds-a
user> (def ds-b (ds/->dataset {:a [:a :b :a :b :d]
                               :b (range 5)
                               :c (range 6 11)}))
#'user/ds-b
user> ds-a
_unnamed [5 3]:
| :a | :b | :c |
|----|---:|---:|
| :a | 0 | 0 |
| :b | 1 | 1 |
| :b | 2 | 2 |
| :a | 3 | 3 |
| :c | 4 | 4 |
user> ds-b
_unnamed [5 3]:
| :a | :b | :c |
|----|---:|---:|
| :a | 0 | 6 |
| :b | 1 | 7 |
| :a | 2 | 8 |
| :b | 3 | 9 |
| :d | 4 | 10 |
user> (ds-join/pd-merge ds-a ds-b {:on [:a :b] :how :inner})
inner-join [2 4]:
| :a | :b | :c | :right.c |
|----|---:|---:|---------:|
| :a | 0 | 0 | 6 |
| :b | 1 | 1 | 7 |
user> (ds-join/pd-merge ds-a ds-b {:on [:a :b] :how :outer})
outer-join [8 4]:
| :a | :b | :c | :right.c |
|----|---:|---:|---------:|
| :a | 0 | 0 | 6 |
| :b | 1 | 1 | 7 |
| :b | 2 | 2 | |
| :a | 3 | 3 | |
| :c | 4 | 4 | |
| :a | 2 | | 8 |
| :b | 3 | | 9 |
| :d | 4 | | 10 |
right-join
(right-join colname lhs rhs)
(right-join colname lhs rhs options)
Right join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it is destructured as:
(let [[lhs-colname rhs-colname] colname] ...)
An options map can be passed in with optional arguments:
:operation-space - either :int32 or :int64. Defaults to :int32.
Returns the joined table.
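A short sketch of a right join on hypothetical data, where the smaller dataset is passed as lhs in line with the efficiency note above. The datasets and column names are assumptions for the example.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: lhs is the smaller side; only id 2 is shared.
(def lhs (ds/->dataset {:id [1 2] :x [10 20]}))
(def rhs (ds/->dataset {:id [2 3 4] :y [200 300 400]}))

(def joined (ds-join/right-join :id lhs rhs))

;; Every rhs row is kept, so three rows are expected here even
;; though only one id matches.
(ds/row-count joined)
```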