K-Means Grid Tutorial

The purpose of this tutorial is to walk the new user through a K-Means Grid analysis.

Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.

Getting Started

This tutorial uses the publicly available Seeds data set, which can be found at

http://archive.ics.uci.edu/ml/datasets/seeds

The data are composed of 210 observations, 7 attributes, and an a priori grouping assignment. All data are positively valued and continuous. Before modeling, parse the data into H2O as follows:

  1. Under the drop-down menu Data, select Upload and use the helper to upload the data.
  2. The user is redirected to a page with the header “Request Parse”. Select whether the first row of the data set is a header. All other settings can be left at their defaults. Press Submit.
  3. Parsing data into H2O generates a .hex key (“data name.hex”).

Building a Model

  1. Once the data are parsed, a horizontal menu appears at the top of the screen reading “Build model using ... ”. Select K-Means Grid there, or go to the drop-down menu Model and select K-Means Grid.
  2. In the field for Source Key enter the .hex key associated with the data set.
  3. Choose K. There is no hard and fast rule for how to do this. K-Means Grid is designed to allow users to examine many different values of K simultaneously. To consider K = {2, 5, 10} for this analysis, enter 2|5|10 into the K field.
  4. Data can optionally be normalized, though that is not done for this analysis.
  5. Max Iter lets the user specify the maximum number of iterations the algorithm runs. To consider Max Iter = {50, 100, 200} for this analysis, enter 50|100|200 into the Max Iter field.
  6. Press Submit.
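
Step 4 mentions normalization without describing it. As a rough illustration only, the sketch below rescales each column to mean 0 and standard deviation 1 (z-scores). This is an assumption about what "normalized" could mean; the exact transformation H2O applies is not stated in this tutorial, and the function name and sample rows are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores).

    Assumption: z-score scaling; H2O's own normalization may differ.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Made-up rows with two columns on very different scales (not the Seeds data).
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize_columns(rows)

for row in scaled:
    print(row)
```

After scaling, both columns contribute comparably to Euclidean distance, which is why normalization can change which clusters K-Means finds when attributes are measured in different units.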

The separator between alternative levels of K and Max Iter is the vertical bar, or pipe (|). On a US keyboard it is typed with Shift + \ (backslash); the key is found below the delete key on the right-hand side of the keyboard.
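
To make the pipe-separated entries concrete, here is a minimal Python sketch of how fields such as 2|5|10 and 50|100|200 expand into a grid of settings. It only illustrates the idea; it is not H2O code, and the helper name parse_levels is hypothetical.

```python
from itertools import product


def parse_levels(field):
    """Split a pipe-separated field such as "2|5|10" into a list of ints."""
    return [int(part) for part in field.split("|") if part.strip()]


k_levels = parse_levels("2|5|10")
max_iter_levels = parse_levels("50|100|200")

# Every (K, Max Iter) pair corresponds to one model in the grid.
grid = list(product(k_levels, max_iter_levels))

for k, max_iter in grid:
    print(f"K={k}, max_iter={max_iter}")

print(len(grid), "models in total")
```

With three levels in each field, the Cartesian product contains nine combinations.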

K-Means Grid Output

Output is a table displaying basic information about each model generated. Because 3 levels of K and 3 levels of Max Iter were requested, the output presents 3 × 3 = 9 models in total. To examine any one of these models, click the link in the Model Key field of the table.

Each of the models was produced with the same seed, and each can be reproduced by entering that same seed.
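
The role of the seed can be illustrated with a small, self-contained K-Means sketch in Python. It assumes Lloyd's algorithm with randomly chosen initial centers, which is a common textbook formulation and not necessarily how H2O implements K-Means; the function, the sample points, and the seed value 42 are all hypothetical.

```python
import random


def kmeans(points, k, max_iter, seed):
    """Textbook Lloyd's algorithm with seeded random initial centers."""
    rng = random.Random(seed)        # all randomness flows from the seed
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:   # converged before reaching max_iter
            break
        centers = new_centers
    return centers


# Made-up 2-D points, not the Seeds data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

run_a = kmeans(points, k=2, max_iter=50, seed=42)
run_b = kmeans(points, k=2, max_iter=50, seed=42)
print(run_a == run_b)  # same seed, same data, same settings: identical centers
```

Because the initial centers are the only random ingredient in this sketch, fixing the seed fixes the whole run, which is the property that makes a model reproducible.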
