The purpose of this tutorial is to walk a new user through a principal components analysis (PCA) from beginning to end.
Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.
PCA is used to reduce dimensionality and to address multicollinearity in high-dimensional data.
This tutorial uses a publicly available data set that can be found at: http://archive.ics.uci.edu/ml/datasets/Arrhythmia
The original data are the Arrhythmia data set made available by the UCI Machine Learning Repository. The data set comprises 452 observations and 279 attributes.
Before modeling, parse data into H2O as follows:
After parsing:
Additional specification detail
The PCA output is a table of components. The number of components displayed is set by whichever criterion is more restrictive: the maximum number of components requested or the tolerance. In this example, a maximum of 5 components was requested with the tolerance set to 3, so the first 5 components were returned.
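The "more restrictive criterion" logic can be sketched in a few lines of Python. This is an illustration only, not H2O's implementation: the function name `select_components` is hypothetical, the standard deviations are made up, and the tolerance rule is an assumption borrowed from the `tol` argument of R's `prcomp` (drop components whose standard deviation is at or below the tolerance times the standard deviation of the first component). H2O's exact tolerance rule is not stated in this tutorial and may differ.

```python
def select_components(std_devs, max_components, tolerance):
    """Return how many leading components survive both criteria.

    std_devs must be sorted in descending order, as PCA reports them.
    """
    if not std_devs:
        return 0
    # ASSUMPTION: tolerance works like R's prcomp `tol`, i.e. it is a cutoff
    # relative to the standard deviation of the first component.
    cutoff = tolerance * std_devs[0]
    passing = sum(1 for sd in std_devs if sd > cutoff)
    # The more restrictive of the two criteria decides the final count.
    return min(max_components, passing)


# Made-up component standard deviations, for illustration only.
std_devs = [4.0, 3.0, 2.0, 1.0, 0.5, 0.1]

# With no tolerance cutoff, the cap of 5 components is the binding criterion.
kept_by_cap = select_components(std_devs, max_components=5, tolerance=0.0)

# With a cutoff of 0.2 * 4.0 = 0.8, only four components pass, so the
# tolerance is the binding criterion.
kept_by_tol = select_components(std_devs, max_components=5, tolerance=0.2)

print(kept_by_cap, kept_by_tol)
```

In the first call the maximum-components setting limits the output; in the second the tolerance does.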
Scree and cumulative variance plots for the components are returned as well. Users can access them by clicking on the black button labeled “Scree and Variance Plots” at the top left of the results page. A scree plot shows the variance of each component, while the cumulative variance plot shows the running total of variance accounted for as each component is added.
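The quantities behind the two plots can be computed directly from the component standard deviations. The sketch below uses made-up values and a hypothetical helper name, `variance_summary`; it assumes the list holds every component, so that the proportions sum to 1.

```python
def variance_summary(std_devs):
    """Compute scree-plot and cumulative-variance values.

    Assumes std_devs covers all components; with only a subset, divide by
    the total variance of the full decomposition instead.
    """
    # Scree plot: the variance of each component is its squared std dev.
    variances = [sd ** 2 for sd in std_devs]
    total = sum(variances)
    proportions = [v / total for v in variances]

    # Cumulative variance plot: running total of the proportions.
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return variances, proportions, cumulative


# Made-up standard deviations, for illustration only.
variances, proportions, cumulative = variance_summary([2.0, 1.0, 1.0])

for i, (v, c) in enumerate(zip(variances, cumulative), start=1):
    print(f"PC{i}: variance={v:.2f}, cumulative proportion={c:.3f}")
```

A sharp drop in the scree values followed by a flat tail suggests that the later components add little, which is the usual reason for inspecting these plots before choosing how many components to keep.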
To replicate results between H2O and R, either turn off standardization and cross-validation in H2O, or specify them in R.
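Standardization matters for replication because it changes the variances that PCA decomposes. The sketch below uses made-up columns and hypothetical helper names, and it shows only the effect on column variances, not a full PCA. After standardizing, every column has unit variance, so a tool that standardizes and one that does not (R's `prcomp` defaults to `scale. = FALSE`) are working from different inputs.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a column to mean 0 and scale it to sample std dev 1."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


def sample_variance(column):
    """Sample variance with the usual n - 1 denominator."""
    m = mean(column)
    return sum((x - m) ** 2 for x in column) / (len(column) - 1)


# Two made-up columns on very different scales.
small_scale = [1.0, 2.0, 3.0, 4.0]
large_scale = [100.0, 300.0, 200.0, 400.0]

# Without standardization, the large-scale column dominates total variance.
raw_variances = (sample_variance(small_scale), sample_variance(large_scale))

# With standardization, both columns contribute equally.
std_variances = (
    sample_variance(standardize(small_scale)),
    sample_variance(standardize(large_scale)),
)

print("raw:", raw_variances)
print("standardized:", std_variances)
```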