tvm-clj.application.kmeans

High performance implementation of the KMeans algorithm using kmeans++ initialization and Lloyd’s algorithm for convergence.

kmeans++

(kmeans++ dataset n-centroids & [{:keys [n-iters rand-seed minimal-improvement-threshold], :or {minimal-improvement-threshold 0.01}, :as options}])

Find K cluster centroids via kmeans++ center initialization followed by Lloyd’s algorithm. Dataset must be a matrix (2d tensor).

  • dataset - 2d matrix of numeric datatype.
  • n-centroids - How many centroids to find.
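The kmeans++ center initialization named above can be sketched in plain Clojure as follows. This is only an illustration of the general seeding technique over vectors of row vectors, not the library's tensor implementation; the helper names sq-dist, weighted-index, and seed-centroids are hypothetical.

```clojure
(defn sq-dist
  "Squared Euclidean distance between two equal-length numeric vectors."
  [a b]
  (reduce + (map (fn [x y]
                   (let [d (- (double x) (double y))]
                     (* d d)))
                 a b)))

(defn weighted-index
  "Pick an index with probability proportional to its weight.
  Falls back to a uniform pick when every weight is zero."
  [^java.util.Random rng weights]
  (let [total (reduce + weights)]
    (if (zero? total)
      (.nextInt rng (count weights))
      (let [target (* total (.nextDouble rng))]
        (loop [idx 0, acc 0.0]
          (let [acc (+ acc (nth weights idx))]
            (if (or (< target acc) (= idx (dec (count weights))))
              idx
              (recur (inc idx) acc))))))))

(defn seed-centroids
  "kmeans++-style seeding (hypothetical helper, illustrative only).
  The first center is chosen uniformly; each later center is chosen with
  probability proportional to its squared distance from the nearest
  center picked so far."
  [rows n-centroids rand-seed]
  (let [rng (java.util.Random. (long rand-seed))
        rows (vec rows)]
    (loop [centers [(rows (.nextInt rng (count rows)))]]
      (if (>= (count centers) n-centroids)
        centers
        (let [weights (mapv (fn [row]
                              (apply min (map #(sq-dist row %) centers)))
                            rows)]
          (recur (conj centers (rows (weighted-index rng weights)))))))))
```

Rows that are already centers have weight zero, so they are effectively never picked again; Lloyd’s algorithm would then refine the seeded centers.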

Returns map of:

  • :centroids - 2d tensor of double centroids
  • :centroid-indexes - 1d integer vector of assigned center indexes.
  • :iteration-scores - n-iters+1 length array of mean squared error scores containing the scores from the initial centroid assignment up to the score when the algorithm terminates.

Options:

  • :minimal-improvement-threshold - defaults to 0.01 - algorithm terminates if (1.0 - error(n-1)/error(n-2)) < minimal-improvement-threshold. When zero, the algorithm always trains to n-iters.
  • :n-iters - defaults to 100 - Max number of iterations; algorithm terminates if (>= iter-idx n-iters).
  • :rand-seed - integer or implementation of java.util.Random.
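The two termination conditions described in the options can be written out as a small predicate. This is a sketch under the assumption that the scores are kept in a vector with the most recent score last; improvement and stop? are hypothetical names, not part of the library.

```clojure
(defn improvement
  "Relative improvement between the two most recent scores:
  1.0 - (latest / previous)."
  [scores]
  (let [n (count scores)
        latest (double (nth scores (- n 1)))
        previous (double (nth scores (- n 2)))]
    (- 1.0 (/ latest previous))))

(defn stop?
  "True when the iteration budget is exhausted or the relative improvement
  falls below the threshold (hypothetical helper)."
  [scores iter-idx n-iters minimal-improvement-threshold]
  (or (>= iter-idx n-iters)
      (and (>= (count scores) 2)
           (< (improvement scores) minimal-improvement-threshold))))
```

With a threshold of zero the second condition only fires if the error increases, which is why a zero threshold effectively runs until n-iters.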

order-data-labels

(order-data-labels data labels)

Order the dataset and labels such that labels are monotonically increasing. Returns a tuple of [dataset labels].
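The reordering can be pictured with plain Clojure vectors. The sketch below is an assumption about the behavior described above, not the library code (which operates on tensors); order-rows-by-label is a hypothetical name.

```clojure
(defn order-rows-by-label
  "Stable sort of rows by their integer label; returns [rows labels].
  Illustrative only."
  [rows labels]
  (let [order (sort-by #(nth labels %) (range (count labels)))]
    [(mapv #(nth rows %) order)
     (mapv #(nth labels %) order)]))
```

Because sort-by is stable, rows that share a label keep their original relative order.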

predict-per-label

(predict-per-label data model)

Return both a probability distribution per row across each label and a 1d tensor of assigned label indexes.

Returns:

  • :probability-distribution - each row sums to one; the index of the maximum probability is the label picked.
  • :label-indexes - int32 assigned indexes for each row in the dataset.
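The relationship between the two return values can be sketched as a per-row argmax. How the library derives the probabilities is not stated here, so the example starts from an already-computed distribution; argmax and label-indexes are hypothetical helpers.

```clojure
(defn argmax
  "Index of the largest value in a non-empty sequence of numbers."
  [row]
  (first (apply max-key second (map-indexed vector row))))

(defn label-indexes
  "One assigned label index per row of a probability distribution
  (illustrative only)."
  [probability-distribution]
  (mapv argmax probability-distribution))
```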

quantize-image

(quantize-image src-path dst-path n-quantization & [{:keys [n-iters seed], :or {n-iters 5}}])

Quantize an image using kmeans. Copies data into a new image and, if dst-path is provided, saves the image.

Returns:

  • :centroids - result of the quantization.
  • :result - resulting BufferedImage.
  • :scores - Scores after each iteration including initialization.
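Conceptually, quantization replaces every pixel with its nearest centroid. The sketch below shows that step on plain vectors of channel values, assuming the centroids have already been found; it does not reproduce the library's BufferedImage handling, and sq-dist, nearest-centroid, and quantize-pixels are hypothetical names.

```clojure
(defn sq-dist
  "Squared Euclidean distance between two equal-length numeric vectors."
  [a b]
  (reduce + (map (fn [x y]
                   (let [d (- (double x) (double y))]
                     (* d d)))
                 a b)))

(defn nearest-centroid
  "The centroid closest to the given pixel."
  [centroids pixel]
  (apply min-key #(sq-dist pixel %) centroids))

(defn quantize-pixels
  "Replace each pixel with its nearest centroid (illustrative only)."
  [centroids pixels]
  (mapv #(nearest-centroid centroids %) pixels))
```

The output uses at most as many distinct colors as there are centroids, which is what n-quantization controls.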

train-per-label

(train-per-label data labels n-per-label & [{:keys [input-ordered?], :as options}])

Given a dataset along with per-row integer labels, train N per-label kmeans centroids, returning a model which you can use with predict-per-label.
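The per-label idea can be sketched as grouping rows by label and training on each group independently. The shape of the real model is not documented here, so the map returned below is an assumption; train-per-label-sketch is a hypothetical name and train-fn stands in for whatever clustering routine is applied to each group.

```clojure
(defn train-per-label-sketch
  "Group rows by label and call (train-fn group-rows n-per-label) on each
  group. Returns a sorted map of label -> result. Illustrative only."
  [rows labels n-per-label train-fn]
  (let [groups (group-by first (map vector labels rows))]
    (into (sorted-map)
          (map (fn [[label pairs]]
                 [label (train-fn (mapv second pairs) n-per-label)]))
          groups)))
```

For example, passing (fn [rows n] (vec (take n rows))) as train-fn simply keeps the first n rows of each label, which is enough to see the grouping at work.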