tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near (left-join-asof).

hash-join

(hash-join colname lhs rhs)
(hash-join colname lhs rhs {:keys [operation-space], :or {operation-space :int32}, :as options})

Join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, in which case it is destructured as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :lhs-missing? - Calculate the missing lhs indexes and the left outer join table.
  • :rhs-missing? - Calculate the missing rhs indexes and the right outer join table.
  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns:

{:join-table - joined table
 :lhs-indexes - matched lhs indexes
 :rhs-indexes - matched rhs indexes
 ;; -- when rhs-missing? is true --
 :rhs-missing - missing indexes of rhs.
 :rhs-outer-join - rhs outer join table.
 ;; -- when lhs-missing? is true --
 :lhs-missing - missing indexes of lhs.
 :lhs-outer-join - lhs outer join table.
}
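A minimal sketch of calling hash-join with both missing flags set, assuming the tech.ml.dataset library is on the classpath. The datasets and the names lhs, rhs, and result are made up for illustration. The concrete types of the index containers are not specified above, so the sketch only inspects the joined table.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical sample data: ids 2 and 3 appear on both sides.
(def lhs (ds/->dataset {:id [1 2 3] :x ["a" "b" "c"]}))
(def rhs (ds/->dataset {:id [2 3 4] :y [20 30 40]}))

;; Request the outer-join tables as well as the matched rows.
(def result
  (ds-join/hash-join :id lhs rhs {:lhs-missing? true
                                  :rhs-missing? true}))

;; The inner result holds one row per matching pair of ids.
(ds/row-count (:join-table result))

;; Per the docstring, these keys are present because both flags were set.
(select-keys result [:lhs-outer-join :rhs-outer-join])
```

The single-keyword colname form is used here because both datasets share the :id column name.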

inner-join

(inner-join colname lhs rhs)
(inner-join colname lhs rhs options)

Inner join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, in which case it is destructured as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
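The tuple form of colname is useful when the join column is named differently on each side. The following is a sketch under that assumption; the customers and orders datasets are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: the key is :cust-id on the left and :customer on the right.
(def customers (ds/->dataset {:cust-id [1 2 3]
                              :name    ["Ann" "Bob" "Cy"]}))
(def orders    (ds/->dataset {:customer [2 2 3 5]
                              :total    [10.0 12.5 7.0 3.0]}))

;; [lhs-colname rhs-colname] tuple; the smaller dataset is passed as lhs.
(def joined (ds-join/inner-join [:cust-id :customer] customers orders))

;; Only customers 2 and 3 have orders; customer 2 has two of them.
(ds/row-count joined)
```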

left-join

(left-join colname lhs rhs)
(left-join colname lhs rhs options)

Left join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, in which case it is destructured as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
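As an illustration, a left join keeps every lhs row and leaves the rhs columns missing where no match exists. This sketch uses made-up data with unique ids so the row count is easy to reason about.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: id 1 has no partner in rhs.
(def lhs (ds/->dataset {:id [1 2 3] :x ["a" "b" "c"]}))
(def rhs (ds/->dataset {:id [2 3 4] :y [20 30 40]}))

(def joined (ds-join/left-join :id lhs rhs))

;; Every lhs row survives, so the result has as many rows as lhs.
(ds/row-count joined)
```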

left-join-asof

(left-join-asof colname lhs rhs {:keys [asof-op], :or {asof-op :<=}})
(left-join-asof colname lhs rhs)

Perform a left join asof. Similar to a left join, except that rows are joined on the nearest value rather than on an exact match. lhs and rhs must be sorted by the join column. Join columns must be either datetime columns, in which case the join happens in millisecond space, or numeric columns of an integer or floating point datatype.

Options:

  • :asof-op - type of join operation; may be one of :<, :<=, :nearest, :>=, :>. Defaults to :<=.
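A hedged sketch of an asof join on a numeric column. The trades and quotes datasets are invented for illustration, and both are already sorted by :t as required. Which neighbouring rhs row is picked depends on :asof-op, so no expected table is shown here.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data, both sorted ascending by the join column :t.
(def trades (ds/->dataset {:t     [1 5 10]
                           :price [100.0 101.0 102.0]}))
(def quotes (ds/->dataset {:t   [0 4 9 12]
                           :bid [99.0 100.5 101.5 103.0]}))

;; Pass the option explicitly; :<= is also the default from the signature.
(def joined (ds-join/left-join-asof :t trades quotes {:asof-op :<=}))

;; As with a left join, the result is expected to keep one row per lhs row.
(ds/row-count joined)
```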

pd-merge

(pd-merge left-ds right-ds options)
(pd-merge left-ds right-ds)

Pandas-style merge. This is similar to join except it can merge on multiple columns for the left and right sides.

Options:

  • :on - Column name or list of column names. Names must be found in both datasets.
  • :left-on - Column name or list of column names
  • :right-on - Column name or list of column names
  • :how - :left, :right, :inner, :outer, or :cross. If :cross, then none of :on, :left-on, or :right-on may be provided.

Examples:

user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.join :as ds-join])
nil
user> (def ds-a (ds/->dataset {:a [:a :b :b :a :c]
                            :b (range 5)
                            :c (range 5)}))
#'user/ds-a
user> (def ds-b (ds/->dataset {:a [:a :b :a :b :d]
                            :b (range 5)
                            :c (range 6 11)}))
#'user/ds-b
user> ds-a
_unnamed [5 3]:

| :a | :b | :c |
|----|---:|---:|
| :a |  0 |  0 |
| :b |  1 |  1 |
| :b |  2 |  2 |
| :a |  3 |  3 |
| :c |  4 |  4 |
user> ds-b
_unnamed [5 3]:

| :a | :b | :c |
|----|---:|---:|
| :a |  0 |  6 |
| :b |  1 |  7 |
| :a |  2 |  8 |
| :b |  3 |  9 |
| :d |  4 | 10 |
user> (ds-join/pd-merge ds-a ds-b {:on [:a :b] :how :inner})
inner-join [2 4]:

| :a | :b | :c | :right.c |
|----|---:|---:|---------:|
| :a |  0 |  0 |        6 |
| :b |  1 |  1 |        7 |
user> (ds-join/pd-merge ds-a ds-b {:on [:a :b] :how :outer})
outer-join [8 4]:

| :a | :b | :c | :right.c |
|----|---:|---:|---------:|
| :a |  0 |  0 |        6 |
| :b |  1 |  1 |        7 |
| :b |  2 |  2 |          |
| :a |  3 |  3 |          |
| :c |  4 |  4 |          |
| :a |  2 |    |        8 |
| :b |  3 |    |        9 |
| :d |  4 |    |       10 |
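The :cross option is not covered by the transcript above. A minimal sketch, assuming it produces the Cartesian product of the two datasets; the sizes and colors datasets are hypothetical and use distinct column names to avoid any renaming of clashing columns.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data with no shared column names.
(def sizes  (ds/->dataset {:size  ["S" "M" "L"]}))
(def colors (ds/->dataset {:color ["red" "blue"]}))

;; No :on, :left-on, or :right-on is given, as required for :cross.
(def combos (ds-join/pd-merge sizes colors {:how :cross}))

;; A Cartesian product pairs every left row with every right row.
(ds/row-count combos)
```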

right-join

(right-join colname lhs rhs)
(right-join colname lhs rhs options)

Right join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, in which case it is destructured as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
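For comparison with left-join, a right join keeps every rhs row instead. This sketch uses made-up data with unique ids.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical data: id 4 has no partner in lhs.
(def lhs (ds/->dataset {:id [1 2 3] :x ["a" "b" "c"]}))
(def rhs (ds/->dataset {:id [2 3 4] :y [20 30 40]}))

(def joined (ds-join/right-join :id lhs rhs))

;; Every rhs row survives, so the result has as many rows as rhs.
(ds/row-count joined)
```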