The purpose of this tutorial is to walk a new user through a GLM analysis from beginning to end. The objective is to learn how to specify, run, and interpret a GLM.
Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.
PREDICTION: The variable of interest relates to predictions or inferences about a rate, an event, or a continuous measurement. Questions are about how a set of environmental conditions influences the dependent variable.
Here are some examples:
“What attributes determine which customers will purchase, and which will not?”
“Given a set of specific manufacturing conditions, how many units produced will fail?”
“How many customers will contact support in a given time frame?”
CLASSIFICATION: The variable of interest is a binomial outcome, that is, a variable that can be expressed as 0 or 1 (a success or a failure).
This tutorial uses a publicly available data set that can be found at:
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
The original data are the Abalone data set made available by the UCI Machine Learning Repository. They comprise 4177 observations and 7 attributes, and have been split into 90/10 train/test sets through random assignment. All attributes are real-valued and continuous, except for Sex and Rings. Sex is categorical with 3 levels (male, female, and infant), and Rings is discrete.
Before modeling, parse data into H2O as follows:
For this tutorial, two data sets need to be parsed: the training set and the testing set. Split your data appropriately and parse both now.
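If the data have not been split yet, a random 90/10 split can be sketched in a few lines of plain Python. This is only an illustration and not part of H2O: the function name `split_train_test`, the seed, and the use of row indices as a stand-in for the 4177 Abalone observations are all assumptions made for the example.

```python
import random

def split_train_test(rows, train_fraction=0.9, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 4177 observations: row indices only (hypothetical data).
rows = list(range(4177))
train, test = split_train_test(rows)
print(len(train), len(test))  # 3759 418
```

Each of the two resulting sets would then be saved to its own file and parsed separately.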
After parsing:
Additional specification detail
GLM output includes coefficients (as well as normalized coefficients when standardization is requested). Also reported are AIC and error rate. A specification of the model is printed across the top of the GLM results page in red.
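To see how a normalized coefficient relates to a raw one, consider the simplest case: a single predictor fit by ordinary least squares, which is what an unregularized Gaussian GLM with an identity link reduces to. The sketch below is not H2O code. The data are made up, the helper `fit_line` is hypothetical, and the use of the population standard deviation is an assumption, since the text above does not say which one H2O applies.

```python
from statistics import mean, pstdev

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up illustrative data, not taken from the Abalone set.
x = [0.2, 0.35, 0.4, 0.5, 0.55, 0.7]
y = [4.0, 7.0, 8.0, 10.0, 9.0, 14.0]

_, raw_slope = fit_line(x, y)

# Standardize the predictor: subtract the mean, divide by the standard deviation.
mx, sx = mean(x), pstdev(x)
z = [(xi - mx) / sx for xi in x]
_, std_slope = fit_line(z, y)

# The coefficient on the standardized predictor equals the raw one times sd(x).
print(raw_slope, std_slope, raw_slope * sx)
```

The two slopes describe the same fitted line on different scales. The normalized coefficient is the change in the response per standard deviation of the predictor, which makes coefficients comparable across predictors measured in different units.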
To replicate results between H2O and R, either turn off standardization and cross-validation in H2O, or specify them in R.
Validation results report model statistics like those originally generated when the model was built. They give users an idea of how well the model predicts.
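As a rough illustration of one such statistic, the error rate of a binomial model on held-out data can be computed by hand. The sketch below is not H2O's implementation: the function `error_rate`, the 0.5 decision threshold, and the labels and probabilities are all assumptions made for the example.

```python
def error_rate(actual, predicted_prob, threshold=0.5):
    """Fraction of observations whose thresholded prediction misses the label."""
    if len(actual) != len(predicted_prob):
        raise ValueError("actual and predicted_prob must have the same length")
    wrong = sum(
        1
        for a, p in zip(actual, predicted_prob)
        if (p >= threshold) != (a == 1)  # predicted class differs from label
    )
    return wrong / len(actual)

# Hypothetical held-out labels and predicted probabilities of the 1 class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

print(error_rate(actual, probs))  # 0.25 (2 of 8 misclassified)
```

Comparing this value on the test set with the one reported for the training set gives a quick sense of whether the model generalizes or has overfit.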
THE END.