tech.v3.dataset.sql

Pathways to transform datasets to and from SQL databases. Built directly on java.sql interfaces. Generic SQL supports only a modest set of datatypes, but an effort has been made to enable database-specific support for these base datatypes:

  • Integer Types - :int16, :int32, :int64
  • Float Types - :float32, :float64
  • String, Text
  • LocalDate (date), Instant (database specific, defaults to timestamp), LocalTime (time).
  • UUID (uuid in postgres, uniqueidentifier in sql server)

Users can serialize result sets into one dataset or a sequence of datasets depending on their downstream use case.

Column Metadata For Writing To The Database

For creating tables and uploading data to the database, column metadata may be used to extend how the system writes data; a short sketch follows the list below.

  • :sql-datatype - Change the sql datatype used during create-table. This string will be written verbatim into the create-table sql.
  • :column->sql - a pair of an integer sql type and a function that takes a column and a row index and must return data in the SQL space of :sql-datatype. To find the integer sql type, see the tech.v3.dataset.sql.datatypes namespace or use database-type-table to find the type table for your particular database.
  • :insert-sql - When creating the prepared statement the default placeholder is ?, but users can override this to set up specialized types in the prepared statement. For example, the postgres UUID type requires ?::UUID in order to serialize correctly.
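
For example, a minimal sketch (not from the library docs; my-ds and the "notes" column are hypothetical) of attaching :sql-datatype to one column before creating and populating a table:

(require '[tech.v3.dataset :as ds])

;; Columns support Clojure metadata, so vary-meta attaches the hint.
(def annotated-ds
  (ds/update-column my-ds "notes"
                    #(vary-meta % assoc :sql-datatype "TEXT")))

;; create-sql/create-table! will now emit TEXT verbatim for the "notes" column.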

Metadata For Reading A ResultSet

When reading a result set the normal dataset options such as :key-fn and :dataset-name may be used. In addition, similar to parsing datasets via ->dataset, users can pass a map as :parser-fn, in which case the keys must match the post-key-fn result column names. Entries in :parser-fn must be tuples of [datatype result-set-read-fn], where result-set-read-fn takes a result set and a column index and returns data of the appropriate datatype.

If you are unsure of the datatype, you can pass nil for the datatype and the system will infer the datatype from the return values of result-set-read-fn. The drawback is that columns whose values are all missing will become :boolean columns (the dataset default), which may break round-tripping.
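
A sketch of :parser-fn (the "events" table and "created_at" column are hypothetical), reading a timestamp column back as :instant:

(require '[tech.v3.dataset.sql :as sql])

;; conn is an open java.sql.Connection; the reader receives the result set
;; and a column index and returns a java.time.Instant.
(sql/sql->dataset
 conn "select * from events"
 {:key-fn keyword
  :parser-fn {:created_at
              [:instant
               (fn [^java.sql.ResultSet rs col-idx]
                 (when-let [ts (.getTimestamp rs col-idx)]
                   (.toInstant ts)))]}})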


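The walkthrough below assumes an existing java.sql.Connection named dev-conn. A minimal sketch of obtaining one over plain JDBC (host, database, and credentials are placeholders):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.sql :as sql])

;; Hypothetical connection details -- any java.sql.Connection works here.
(def dev-conn
  (java.sql.DriverManager/getConnection
   "jdbc:postgresql://localhost:5432/dev" "devuser" "devpass"))
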
user> (def stocks (-> (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv" {:key-fn keyword})
                      (vary-meta assoc
                                 :name "stocks"
                                 :primary-keys ["date" "symbol"])))

#'user/stocks
user> (ds/head stocks)
stocks [5 3]:

| :symbol |      :date | :price |
|---------|------------|-------:|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |
user> (sql/table-exists? dev-conn stocks)
false
user> (sql/create-table! dev-conn stocks)
nil
user> (sql/table-exists? dev-conn stocks)
true
user> (sql/insert-dataset! dev-conn stocks)
nil
user> ;;Note the column names are now strings.
user> (def sql-stocks (sql/sql->dataset dev-conn "Select * from stocks"))
#'user/sql-stocks
user> (ds/head sql-stocks)
_unnamed [5 3]:

| symbol |       date | price |
|--------|------------|------:|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (ds/head (sql/sql->dataset dev-conn "Select * from stocks" {:key-fn keyword}))
_unnamed [5 3]:

| :symbol |      :date | :price |
|---------|------------|-------:|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |

create-sql

(create-sql conn-or-db-name dataset options)(create-sql conn-or-db-name dataset)

Return a create-table statement for this dataset. Datatype-appropriate columns will be created based on each column's metadata.

  • conn-or-db-name - Either a Connection object or a string database name used to map database-specific datatypes, such as UUIDs, to the appropriate column datatype.
  • dataset - A dataset or a sequence of column metadata maps, each of which must contain :name. The dataset's metadata itself must contain a :name key.

Options:

All options may be provided as metadata on the dataset itself.

  • :table-name - table name to use.
  • :primary-key - Either a string or a sequence of strings indicating the primary key clause.

Column Metadata:

Each column metadata map must have :name. It may also have :sql-datatype, in which case that string is used unchanged, or :datatype, which will be matched against the database name to decide the specific sql datatype to use.
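
For example, a sketch reusing the stocks dataset and dev-conn from the walkthrough above; the exact SQL produced depends on the target database:

;; Returns the CREATE TABLE string without executing it -- useful for
;; inspection or for running through separate migration tooling.
(sql/create-sql dev-conn stocks {:table-name "stocks"
                                 :primary-key ["date" "symbol"]})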

create-table!

(create-table! conn dataset options)(create-table! conn dataset)

Create a table. Exception upon failure to create the table.

  • conn - java.sql.Connection
  • dataset - dataset to use. The dataset-name will be used as the table-name and the column names and datatypes will be used for the sql names and datatypes.

The dataset's dataset-name will be used for the table and if the dataset's metadata has a :primary-key member this will be interpreted as a sequence of column names to be used as the primary key.

Options:

  • :table-name - set the name of the table to use. Overrides the dataset metadata.
  • :primary-key - Array of column names to use as the primary key of this table. Overrides the dataset metadata.

database-name

(database-name conn)

Return the database name for this connection.

database-type-table

(database-type-table conn)

Return a table mapping sql datatype name to the sql datatype integer for all datatypes this database supports.

drop-table!

(drop-table! conn dataset)

Drop a table. Exception upon failure to drop the table.

  • conn - java.sql.Connection
  • dataset - string, keyword, symbol, or dataset

drop-table-when-exists!

(drop-table-when-exists! conn dataset)

Drop the table indicated by this dataset if it exists.

ensure-table!

(ensure-table! conn dataset options)(ensure-table! conn dataset)

Create a table if it does not exist. See documentation for create-table!

execute-prepared-statement-batches!

(execute-prepared-statement-batches! conn stmt-or-sql dataset options)

Internal method that executes prepared statement batches drawn from the rows of a dataset. For this to work correctly, the connection needs autoCommit set to false and the sql statement's parameter datatypes need to match the dataset's datatypes.

See namespace documentation about column metadata and various options that may be associated on a per-column basis to change the way data is uploaded.

  • conn - java.sql.Connection
  • stmt-or-sql - Either a prepared statement or a string in which case the connection's .prepareStatement method will be called.
  • dataset - dataset

Options:

  • :batch-size - integer, defaults to 1024
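
A sketch under those constraints, assuming dev-conn and the stocks dataset/table from the walkthrough above, with the ? parameters ordered to match the dataset's column order (an assumption of this sketch):

;; autoCommit must be false for the batching to behave as described.
(.setAutoCommit dev-conn false)
(sql/execute-prepared-statement-batches!
 dev-conn
 "insert into stocks (symbol, date, price) values (?, ?, ?)"
 stocks
 {:batch-size 512})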

execute-update!

(execute-update! conn sql)

Executes the given SQL statement, which may be an INSERT, UPDATE, or DELETE statement or an SQL statement that returns nothing, such as an SQL DDL statement. This statement will be immediately committed.
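
For example (a sketch; the statement and index name are arbitrary and assume the stocks table from the walkthrough above):

;; DDL is executed and committed immediately.
(sql/execute-update! dev-conn "CREATE INDEX stocks_symbol_idx ON stocks (symbol)")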

generic-column->sql

(generic-column->sql col idx)

Generic function that reads a value from a column.

generic-sql->column

(generic-sql->column rs sql-col-idx)

Generic function that reads a column from a result set.

insert-dataset!

(insert-dataset! conn dataset options)(insert-dataset! conn dataset)

Insert a dataset into a table. Column datatypes will be inferred from the dataset datatypes or may be provided as metadata on each column. See namespace documentation for :column->sql.

insert-sql

(insert-sql conn-or-db-name dataset options)(insert-sql conn-or-db-name dataset)

make-convert-column->sql

(make-convert-column->sql convert-fn)

make-convert-sql->column

(make-convert-sql->column convert-fn)

postgre-sql-connect-str

(postgre-sql-connect-str hoststr database user pwd)

postgres-upsert-sql

(postgres-upsert-sql conn-or-db-name dataset options)(postgres-upsert-sql conn-or-db-name dataset)

result-set->dataset

(result-set->dataset conn-or-db-name results options)(result-set->dataset conn-or-db-name results)

Given a result set, return a single dataset serializing the entire result set. See the options for result-set->dataset-seq.

result-set->dataset-seq

(result-set->dataset-seq conn-or-db-name results options)(result-set->dataset-seq conn-or-db-name results)

Given a result set, return a sequence of datasets, each at most :batch-size rows long. Note that the statement stays open until the lazy sequence is fully realized; if you do not realize the sequence, you will leave a hanging statement open. There is currently no public API to end the dataset sequence early and close the sql statement.

Options:

  • :batch-size - defaults to 64000. Returned datasets will be at most this number of rows long.
  • :key-fn - Apply this to the column names -- an example would be keyword.
  • :parser-fn - Map of post-key-fn column-name to tuple of [datatype result-set-read-fn] where result-set-read-fn takes a result-set and a column-index and must return data of the appropriate datatype. If the datatype is unknown then nil can be provided and the dataset will use the widest datatype that will fit the returned data or :boolean if all values are missing.
  • :statement - The sql statement to close when finished.
  • :close? - Close the result set when finished. Defaults to true.
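
A sketch using the convenience wrapper sql->dataset-seq (documented below), reducing over the batches so the sequence is fully realized and the statement closes; it assumes dev-conn and the stocks table from the walkthrough above:

;; Stream the table in 10000-row datasets and count the total rows.
(->> (sql/sql->dataset-seq dev-conn "select * from stocks"
                           {:batch-size 10000 :key-fn keyword})
     (map ds/row-count)
     (reduce + 0))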

result-set-metadata->data

(result-set-metadata->data metadata col-idx)

Given result set metadata and a column index, return the metadata for that column. Note that the result set metadata for a particular column is saved as :result-set-metadata in the column's metadata.

set-datatype-mapping!

(set-datatype-mapping! database-name datatype sql-datatype sql-type-index result-set-read-fn col-read-fn)

Add a database specific datatype mapping.

  • database-name - name of database
  • datatype - dtype-next datatype
  • sql-datatype - sql datatype to map to.
  • sql-type-index - sql type index used when setting missing values. See database-type-table for a way to find the sql type index from the sql datatype.
  • result-set-read-fn - function that takes a result set and a column index and returns the data in dtype-next space, e.g. a java.time.LocalTime object as opposed to a java.sql.Time object.
  • col-read-fn - function that takes a column and a row idx and returns data as an sql datatype, e.g. a java.sql.Time object as opposed to a java.time.LocalTime object.

Your type should be a proper tech.v3.datatype datatype, meaning it is either an existing datatype or one added via tech.v3.datatype.casting/add-object-datatype!.

Example:

For our example, let's say we would like to support PostgreSQL's jsonb datatype. We first create a record type to denote json data and register it with dtype-next, then build a dataset with a column of this datatype.

Next we register a way to write this data to and read it back from sql. After that, columns of the new datatype will be automatically marshalled into and out of sql.

user> (def conn (tech.v3.dataset.sql-test-utils/connect :postgre-sql))
#'user/conn
user> conn
#object[org.postgresql.jdbc.PgConnection 0x421726aa "org.postgresql.jdbc.PgConnection@421726aa"]
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.datatype.casting :as casting])
nil
user> (require '[tech.v3.dataset.sql :as sql])
nil
user> (require '[clojure.data.json :as json])
nil
user> (defrecord JSONData [json-data])
JSONData
user> (casting/add-object-datatype! :json-data JSONData)
:ok
user> (def json-type-index (-> (sql/database-type-table conn)
                              (ds/filter-column "TYPE_NAME" #(= % "jsonb"))
                              (ds/row-at -1)
                              (get "DATA_TYPE")))
#'user/json-type-index
user> json-type-index
1111
user> (sql/database-name conn)
"PostgreSQL"
user> (sql/set-datatype-mapping!
       "PostgreSQL" :json-data "jsonb" json-type-index
       (fn [^java.sql.ResultSet rs col-idx]
         (when-let [json-obj (.getObject rs col-idx)]
           (-> json-obj
               (str)
               (json/read-str :key-fn keyword)
               (JSONData.))))
       (fn [col idx]
         (when-let [col-obj (col idx)]
           (-> col-obj
               (:json-data)
               (json/write-str)))))
:ok
user> (def table-name "jsontable")
#'user/table-name
user> (def test-ds (ds/->dataset {:jsoncol [(JSONData. {:a 1 :b 2})
                                            (JSONData. {:c 2 :d 3})]}
                                    {:dataset-name table-name}))
#'user/test-ds
user> test-ds
jsontable [2 1]:

|                  :jsoncol |
|---------------------------|
| {:json-data {:a 1, :b 2}} |
| {:json-data {:c 2, :d 3}} |
user> (test-ds :jsoncol)
#tech.v3.dataset.column<json-data>[2]
:jsoncol
[user.JSONData@db1bf962, user.JSONData@db1bf861]
user> (sql/create-table! conn test-ds)
nil
user> (sql/insert-dataset! conn test-ds)
nil
user> (sql/sql->dataset conn (str "select * from " table-name)
                        {:key-fn keyword})
_unnamed [2 1]:

|                  :jsoncol |
|---------------------------|
| {:json-data {:a 1, :b 2}} |
| {:json-data {:c 2, :d 3}} |

sql->dataset

(sql->dataset conn sql options)(sql->dataset conn sql)

Given a connection and an sql statement return a single dataset. See options for result-set->dataset-seq.

sql->dataset-seq

(sql->dataset-seq conn sql options)(sql->dataset-seq conn sql)

Given a connection and an sql statement return a sequence of datasets. See options for result-set->dataset-seq.

sql-server-connect-str

(sql-server-connect-str hoststr database user pwd)

table-exists?

(table-exists? conn dataset & [options])

Test if a table exists.

  • conn - java.sql.Connection
  • dataset - string, keyword, symbol, or dataset

Example:

  user> (sql/table-exists? dev-conn stocks)
  true