K Means Tutorial

The purpose of this tutorial is to walk through a K-Means analysis beginning to end. By the end of this tutorial the user should know how to specify, run, and interpret a K-means model in H2O.

Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.

Interested users can find details on the math behind K Means at: K-Means.

Quick Start Video

Getting Started

This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds

The data are composed of 210 observations, 7 attributes, and an priori grouping assignment. All data are positively valued and continuous. Before modeling, parse data into H2O as follows:

  1. From the drop-down Data menu, select Upload and use the helper to upload data.
  2. On the “Request Parse” page that displays, select whether the first row of the data set is a header or not. All other settings can be left in default. Press Submit.
  3. Parsing data into H2O generates a .hex key (“data name.hex”).
../_images/KMparse.png

Building a Model

  1. Once data are parsed, a horizontal menu appears at the top of the screen that displays “Build model using ... ”. Select K Means here, or go to the drop-down Model menu and select K-Means.
  2. In the Source Key field, enter the .hex key associated with the data set.
  3. Choose K. For this dataset, K is chosen to be 3.
  4. Data can be normalized, though it is not done for this example.
  5. Specify Initialization. Plus Plus initialization chooses one initial center and random, and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen. Furthest initialization chooses one initial center at random, and then chooses the next center to be the point furthest away in terms of Euclidean distance. The default results in K initial centers being chosen independently at random.
  6. Specify Max Iter (short for maximum iterations), which allows the user to specify the maximum number of iterations the algorithm processes.
  7. Cols is a list of the columns of attributes that should be used in defining the clusters. Here we select all but column 7 (the a priori known clusters for this particular set).
  8. Press submit.
../_images/KMrequest.png

K-Means Output

Output is a matrix of the cluster assignments, and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly. K-Means randomly chooses starting points and converges on optimal centroids. The cluster number is arbitrary, and should be thought of as a factor.

../_images/KMinspect.png

K-means Next Steps

For further information on the model, select K-Means from the drop down Score menu. Specify the K-Means model key and the .hex key for the data set originally used.

When you click Submit, the output that displays is the number of rows assigned to each cluster and the squared error per cluster.

../_images/KMscore.png

K-means Apply

To generate a prediction (assign the observations in a data set to a cluster), select K-means Apply from the drop-down Score menu. Specify the model to be applied and the .hex for the data you would like to apply it to, and press submit.

Here cluster assignments have been generated for the original data. Because the data have been sufficiently well researched, the ideal cluster assignments were known in advance. Comparing known cluster with predicted cluster demonstrated that this K-Means model classifies with a less than 10% error rate.

../_images/KMapply.png

THE END.