K-Means Clustering | Ignite Documentation

K-Means Clustering

K-Means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters.

Model

K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.

The model holds a vector of k centers and one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan.

A trained model assigns a cluster label to an input vector as follows:

KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);


double clusterLabel = mdl.predict(inputVector);
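
The nearest-center rule behind such a prediction can be sketched in plain Java. This is a minimal illustration, not Ignite's implementation: `NearestCenterSketch` and its methods are hypothetical names, and vectors are plain `double[]` arrays rather than Ignite vector objects.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper: assigns a point to the nearest of k centers. Not part of Ignite. */
final class NearestCenterSketch {
    /** Euclidean distance between two vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance between two vectors of equal length. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Returns the index of the center closest to the point under the given distance. */
    static int predict(double[][] centers, double[] point,
                       ToDoubleBiFunction<double[], double[]> distance) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = distance.applyAsDouble(centers[c], point);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] point = {8.0, 9.0};
        System.out.println(predict(centers, point, NearestCenterSketch::euclidean));
        System.out.println(predict(centers, point, NearestCenterSketch::manhattan));
    }
}
```

In this sketch the point (8, 9) is assigned to the center at index 1 under both metrics, because it lies much closer to (10, 10) than to (0, 0).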

Trainer

KMeans is an unsupervised learning algorithm. It solves a clustering task: grouping a set of objects so that objects in the same group (called a cluster) are more similar (in some sense) to each other than to objects in other groups (clusters).

KMeans is a parameterized iterative algorithm that, on each iteration, recalculates each cluster's mean as the centroid of the observations assigned to that cluster.
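
One iteration of this update can be sketched as follows. It is a simplified, single-JVM illustration under stated assumptions: `KMeansStepSketch` and `step` are hypothetical names, the distance is squared Euclidean, and an empty cluster simply keeps its previous center. It is not meant to mirror how Ignite's distributed trainer is implemented.

```java
/** Hypothetical single-node sketch of one K-Means iteration. Not Ignite code. */
final class KMeansStepSketch {
    /** Squared Euclidean distance; sufficient for comparing which center is closer. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Assigns every point to its nearest center, then returns the per-cluster means. */
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best]))
                    best = c;
            counts[best]++;
            for (int i = 0; i < dim; i++)
                sums[best][i] += p[i];
        }

        double[][] updated = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                updated[c] = centers[c].clone(); // Assumption: an empty cluster keeps its old center.
                continue;
            }
            updated[c] = new double[dim];
            for (int i = 0; i < dim; i++)
                updated[c][i] = sums[c][i] / counts[c];
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {1.0, 3.0}, {9.0, 9.0}, {11.0, 9.0}};
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        // Prints [[1.0, 2.0], [10.0, 9.0]]
        System.out.println(java.util.Arrays.deepToString(step(points, centers)));
    }
}
```

Repeating `step` until the centers stop moving, or until an iteration limit is reached, gives the full iterative procedure.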

Presently, Ignite supports the following parameters for the KMeans clustering algorithm:

  • k - the number of clusters to find

  • maxIterations - the maximum number of iterations; one of the two stopping criteria (the other is epsilon)

  • epsilon - the convergence threshold (the delta between the old and new centroid values)

  • distance - one of the distance metrics provided by the ML framework, such as Euclidean, Hamming, or Manhattan

  • seed - an initialization parameter that makes models reproducible (the trainer picks the first centroids in a random initialization step)
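
How these parameters interact can be sketched with a simplified one-dimensional loop. This is an illustration under stated assumptions, not Ignite's trainer: `KMeansLoopSketch` and `fit` are hypothetical names, the initial centers are sampled from the data with `java.util.Random` (the same point may be picked twice), the distance is the absolute difference, and convergence is checked against the largest center shift.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical 1-D sketch showing the roles of k, maxIterations, epsilon and seed. */
final class KMeansLoopSketch {
    static double[] fit(double[] data, int k, int maxIterations, double epsilon, long seed) {
        // seed: fixes the random choice of initial centers, so repeated runs match.
        Random rnd = new Random(seed);
        double[] centers = new double[k];
        for (int c = 0; c < k; c++)
            centers[c] = data[rnd.nextInt(data.length)];

        // maxIterations: hard upper bound on the number of update rounds.
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[k];
            int[] counts = new int[k];
            for (double x : data) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(x - centers[c]) < Math.abs(x - centers[best]))
                        best = c;
                sums[best] += x;
                counts[best]++;
            }

            // epsilon: stop early once no center moved by epsilon or more.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                double updated = counts[c] == 0 ? centers[c] : sums[c] / counts[c];
                maxShift = Math.max(maxShift, Math.abs(updated - centers[c]));
                centers[c] = updated;
            }
            if (maxShift < epsilon)
                break;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.0, 9.5, 10.5};
        double[] first = fit(data, 2, 10, 1e-4, 42L);
        double[] second = fit(data, 2, 10, 1e-4, 42L);
        System.out.println(Arrays.toString(first));
        System.out.println(Arrays.equals(first, second)); // Same seed, same centers.
    }
}
```

Whichever of the two stopping criteria is met first ends the loop, and reusing the same seed reproduces the same centers on the same data.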

// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
   .withDistance(new EuclideanDistance())
   .withK(AMOUNT_OF_CLUSTERS)
   .withMaxIterations(MAX_ITERATIONS)
   .withEpsilon(PRECISION);

// Build the model
KMeansModel mdl = trainer.fit(
    ignite,
    dataCache,
    vectorizer
);

Example

To see how K-Means clustering can be used in practice, try this example that is available on GitHub and delivered with every Apache Ignite distribution.

The training dataset is a subset of the Iris dataset (the classes with labels 1 and 2, which form a linearly separable two-class dataset), which can be loaded from the UCI Machine Learning Repository.