The purpose of this tutorial is to walk the new user through a K-Means analysis from beginning to end. By the end of this tutorial, the user should know how to specify, run, and interpret a K-Means model.
Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.
This tutorial uses the publicly available Seeds data set, which can be found at
http://archive.ics.uci.edu/ml/datasets/seeds
The data are composed of 210 observations, 7 attributes, and an a priori grouping assignment. All data are positively valued and continuous. Before modeling, parse data into H2O as follows:
The output is a matrix of the cluster assignments, together with the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly, because K-Means chooses its starting points at random and converges on centroids that are optimal only relative to those starting points. The cluster number is an arbitrary label and should be thought of as a factor, not as a numeric value.
The exact model can be reproduced by entering the same seed as was originally used. This seed is
-220049831882026147
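The role of the seed can be illustrated outside H2O with a minimal sketch of the standard K-Means iteration (Lloyd's algorithm) in plain Python. This is not H2O's implementation: the function name `kmeans`, the toy points, and the seed value 42 are assumptions made for illustration only. The sketch shows that two runs started from the same seed pick the same starting points and therefore return the same centers and assignments.

```python
import random

def kmeans(points, k, seed, max_iter=100):
    """Illustrative K-Means (Lloyd's algorithm); not H2O's implementation."""
    rng = random.Random(seed)  # a seeded generator makes the run repeatable
    # Choose k distinct observations at random as the starting centers
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data (not taken from the Seeds data set)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
run_a = kmeans(points, k=2, seed=42)
run_b = kmeans(points, k=2, seed=42)
print(run_a == run_b)  # the same seed gives the same starting points and result
```

Changing the seed changes the starting points, which can change the order in which clusters are numbered and, on less well-separated data, the centers themselves.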
For further information on the model, select K-Means from the Score drop-down menu. Specify the K-Means model key and the .hex key for the data set originally used.
When submit is pressed, the output is the number of rows assigned to each cluster and the squared error for each cluster.
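To make these two quantities concrete, here is a small sketch that computes them from a set of assignments. It assumes that "squared error per cluster" means the sum of squared Euclidean distances from each row to its assigned center; H2O may scale or normalize the figure differently. The function name `score_clusters` and the toy values are hypothetical.

```python
from collections import Counter

def score_clusters(points, labels, centers):
    """Return (rows per cluster, summed squared error per cluster)."""
    counts = Counter(labels)
    sq_error = [0.0] * len(centers)
    for p, l in zip(points, labels):
        # Squared Euclidean distance from the row to its assigned center
        sq_error[l] += sum((a - b) ** 2 for a, b in zip(p, centers[l]))
    sizes = [counts.get(c, 0) for c in range(len(centers))]
    return sizes, sq_error

# Hypothetical toy values (not taken from the Seeds data set)
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centers = [(1.0, 0.0), (10.0, 0.0)]
sizes, sq_error = score_clusters(points, labels, centers)
print(sizes, sq_error)
```

A cluster with many rows and a small squared error is tight; a large squared error relative to the row count suggests a diffuse cluster.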
To generate a prediction (that is, to assign the observations in a data set to a cluster), select K-means Apply from the Score drop-down menu. Specify the model to be applied and the .hex key for the data you would like to apply it to, and press submit.
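Conceptually, applying a K-Means model means assigning each observation to its nearest cluster center. The sketch below shows that rule in plain Python, assuming Euclidean distance; the function name `assign_clusters`, the centers, and the new observations are made up for illustration and do not come from H2O or the Seeds data.

```python
def assign_clusters(points, centers):
    """Assign every point to the index of its nearest center."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]

# Hypothetical centers from a fitted model and hypothetical new observations
centers = [(1.0, 1.0), (5.0, 5.0)]
new_points = [(0.9, 1.2), (4.7, 5.3), (3.1, 2.8)]
assignments = assign_clusters(new_points, centers)
print(assignments)
```

The returned integers are cluster labels, so they carry no ordering or magnitude.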
Here, cluster assignments have been generated for the original data. Because the data have been well researched, the ideal cluster assignments were known in advance. Comparing the known cluster with the predicted cluster shows that this K-Means model classifies with an error rate of less than 10%.
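Because cluster numbers are arbitrary, a comparison of this kind first has to match each predicted cluster to a known group. One way to do this, sketched below, is to try every one-to-one mapping and keep the one with the most agreements. This is an illustrative approach, not necessarily how the figure above was obtained; the function name `clustering_error_rate` and the toy labels are hypothetical, and the sketch assumes there are no more predicted clusters than known groups.

```python
from itertools import permutations

def clustering_error_rate(known, predicted):
    """Error rate under the best one-to-one mapping of clusters to groups."""
    known_ids = sorted(set(known))
    pred_ids = sorted(set(predicted))
    best_correct = 0
    # Try every way of giving each predicted cluster a distinct known group
    for perm in permutations(known_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(mapping[p] == k for k, p in zip(known, predicted))
        best_correct = max(best_correct, correct)
    return (len(known) - best_correct) / len(known)

# Hypothetical labels (not taken from the Seeds data set)
known = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
predicted = [2, 2, 2, 0, 0, 1, 1, 1, 1, 1]
rate = clustering_error_rate(known, predicted)
print(rate)
```

Trying all mappings is practical only for a small number of clusters, since the number of mappings grows factorially.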
THE END.