Introduction to H2O AutoDoc

What is H2O AutoDoc

H2O AutoDoc is a Python package for creating automatic reports from H2O-3 or Scikit-Learn supervised learning models.

Supported H2O-3 Models

Deep Learning
Distributed Random Forest (including Extremely Randomized Forest)
Generalized Linear Model
Gradient Boosting Machine
Stacked Ensembles
XGBoost

Supported Scikit-Learn Models

The AutoDoc supports most of Scikit-Learn’s supervised learning models for classification and regression. It does not, however, support the multi-task algorithms.

H2O-3

AutoDoc for Binary vs. MOJO Models

The AutoDoc supports H2O-3 version 3.24.0.1 and higher. Version 3.24.0.3 of H2O-3 added the ability to import MOJO models into an H2O-3 cluster. MOJO import provides a way to use external, pre-trained models in H2O, primarily for scoring. Depending on the external model type, select metrics and model information can also be obtained. The AutoDoc uses this MOJO import functionality to document MOJO models. Check the H2O-3 user guide to see whether it supports MOJO import for your model type.

This table shows which H2O-3 versions AutoDoc supports for binary and MOJO models.

H2O-3 Version	Binary Model	MOJO Model
3.24.0.1 - 3.24.0.2	x
3.24.0.3 and greater	x	x

The following table shows when MOJO support was added for each model type. The AutoDoc includes all features for the supported H2O-3 binary models.

Note: H2O-3 started preserving the original model name in 3.30.0.1. In older versions, the model name appears as “Import MOJO Model”.

MOJO Model Type	Supported H2O Version
GBM	3.24.0.3 +
XGBoost	3.28.0.1 +
GLM	3.24.0.3 +
DRF	3.24.0.3 +
XRT	3.24.0.3 +
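
The version rules in the two tables above can be sketched as a small lookup. This is an illustrative helper, not part of the AutoDoc API: the function names (`parse_version`, `supports_binary`, `supports_mojo`, `shows_original_model_name`) are hypothetical, and the minimum versions are copied from the tables. Versions are compared as integer tuples because plain string comparison would order "3.9" after "3.24".

```python
from typing import Dict, Tuple

Version = Tuple[int, ...]

# Minimum H2O-3 versions, copied from the tables above.
MIN_BINARY_VERSION: Version = (3, 24, 0, 1)
MIN_MOJO_VERSION: Dict[str, Version] = {
    "GBM": (3, 24, 0, 3),
    "XGBoost": (3, 28, 0, 1),
    "GLM": (3, 24, 0, 3),
    "DRF": (3, 24, 0, 3),
    "XRT": (3, 24, 0, 3),
}
# From this version on, an imported MOJO keeps its original model name.
MIN_ORIGINAL_NAME_VERSION: Version = (3, 30, 0, 1)


def parse_version(version: str) -> Version:
    """Turn '3.24.0.3' into (3, 24, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supports_binary(h2o_version: str) -> bool:
    """True if AutoDoc can document binary models for this H2O-3 version."""
    return parse_version(h2o_version) >= MIN_BINARY_VERSION


def supports_mojo(h2o_version: str, model_type: str) -> bool:
    """True if AutoDoc can document a MOJO of this model type."""
    minimum = MIN_MOJO_VERSION.get(model_type)
    if minimum is None:
        return False  # model type is not listed in the MOJO table
    return parse_version(h2o_version) >= minimum


def shows_original_model_name(h2o_version: str) -> bool:
    """False means the report shows 'Import MOJO Model' as the model name."""
    return parse_version(h2o_version) >= MIN_ORIGINAL_NAME_VERSION
```

For example, `supports_mojo("3.26.0.1", "XGBoost")` is false under these rules, while the same version returns true for `"GBM"`.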

AutoDoc Sections by Problem-Type

Depending on your use case (i.e., classification or regression) the AutoDoc will exclude certain sections.

Binary Classification - Includes all sections.
Regression - Includes all sections except the classification-specific sections:
- Quantile Response Rate Plot
- Actual vs Predicted Probabilities and Actual vs Predicted Log Odds
Multiclass Classification - Includes all sections except:
- Partial Dependence plots, which are not currently supported for multiclass problems
- Population Stability Index
- Prediction Statistics Table
- Binary classification specific plots:
  ROC curve
  Cumulative Lift and Gains Charts
  Quantile Response Rate Plot
  Actual vs Predicted Probabilities and Actual vs Predicted Log Odds
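
The exclusion rules above can be written down as a plain mapping, which may be handy when checking a generated report against expectations. This is a sketch only: the `excluded_sections` function and the problem-type keys (`"binary"`, `"regression"`, `"multiclass"`) are hypothetical names chosen for illustration, and the section names are copied from the lists above.

```python
from typing import Dict, FrozenSet

# Plots that only make sense for binary classification (from the list above).
BINARY_ONLY_PLOTS: FrozenSet[str] = frozenset({
    "ROC Curve",
    "Cumulative Lift and Gains Charts",
    "Quantile Response Rate Plot",
    "Actual vs Predicted Probabilities and Actual vs Predicted Log Odds",
})

# Sections the report leaves out, keyed by problem type.
EXCLUDED_SECTIONS: Dict[str, FrozenSet[str]] = {
    "binary": frozenset(),
    "regression": frozenset({
        "Quantile Response Rate Plot",
        "Actual vs Predicted Probabilities and Actual vs Predicted Log Odds",
    }),
    "multiclass": BINARY_ONLY_PLOTS | frozenset({
        "Partial Dependence Plots",
        "Population Stability Index",
        "Prediction Statistics Table",
    }),
}


def excluded_sections(problem_type: str) -> FrozenSet[str]:
    """Return the sections omitted from the report for a problem type."""
    try:
        return EXCLUDED_SECTIONS[problem_type]
    except KeyError:
        raise ValueError(f"unknown problem type: {problem_type!r}") from None
```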

Scikit-Learn

The Scikit-learn AutoDoc is designed to be very similar to the H2O-3 AutoDoc. In general, disparities between the AutoDocs are due to differences in H2O-3’s and Scikit-Learn’s algorithms.

Here we highlight some notable differences:

Feature Importance is only provided for Scikit-learn algorithms that support feature importance.
- For Scikit-Learn 0.22 and higher, AutoDoc shows permutation importance for models that don’t have built-in feature importance.
Validation Strategy section is only included in the H2O-3 AutoDoc.
- H2O-3 estimators include hold-out data and cross-validation information, while Scikit-learn estimators do not.
Prediction Probability Information is limited to Scikit-learn classifiers that provide prediction probabilities (accessible via the predict_proba() method). As a result, AutoDoc for classifiers without predict_proba() will exclude sections that require predicted probabilities:
- Quantile Response Rates
- Actual vs. Predicted
- Prediction Statistics
- Population Stability Index
- Partial Dependence Plots

Dataset Dependent Sections

The AutoDoc can render reports for models (Python object, binary, MOJO, etc.) with and without access to hold-out datasets. While H2O-3’s AutoDoc can render model-only reports, the Scikit-Learn AutoDoc requires that you pass in training data.

The following table summarizes which sections are available in AutoDocs generated with training data and/or hold-out datasets:

Section Topic	Model + Train	Model + Train + Hold-Out
Shift Detection	X
Population Stability Index	X
Hold-Out Data Metrics		X

If no datasets are provided, however, the report will exclude additional sections. The table below details the limitations of an H2O-3 model-only AutoDoc.

Section Name	Model and Data	Model
Experiment Overview	X	X
Data Overview	X
Validation Strategy	X
Feature Importance	X	Limited
Final Model	X	Limited
Alternative Models	X	X
Partial Dependence Plots	X
Appendix	X	Limited
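
One way to read the table above is as a per-section coverage level for a model-only report. The sketch below encodes that reading. The names `MODEL_ONLY_COVERAGE` and `section_coverage` are hypothetical, and a blank cell in the Model column is interpreted as the section being absent.

```python
from typing import Dict

FULL, LIMITED, ABSENT = "full", "limited", "absent"

# Coverage of each section when only the H2O-3 model is provided
# (the "Model" column of the table above; blank cells read as absent).
MODEL_ONLY_COVERAGE: Dict[str, str] = {
    "Experiment Overview": FULL,
    "Data Overview": ABSENT,
    "Validation Strategy": ABSENT,
    "Feature Importance": LIMITED,
    "Final Model": LIMITED,
    "Alternative Models": FULL,
    "Partial Dependence Plots": ABSENT,
    "Appendix": LIMITED,
}


def section_coverage(section: str, has_data: bool) -> str:
    """Return 'full', 'limited', or 'absent' for a report section."""
    if section not in MODEL_ONLY_COVERAGE:
        raise KeyError(f"unknown section: {section!r}")
    # With data, every section in the table is marked as included.
    return FULL if has_data else MODEL_ONLY_COVERAGE[section]


def model_only_sections() -> Dict[str, str]:
    """Sections that still appear, fully or partly, in a model-only report."""
    return {
        name: level
        for name, level in MODEL_ONLY_COVERAGE.items()
        if level != ABSENT
    }
```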

Data Limitation Details by Section

Additional limitations can occur depending on the machine learning software platform (i.e., H2O-3, Scikit-Learn) and the software platform’s version number. See the corresponding software sections for further details.

Feature Importance:
- Excludes:
  Shapley Feature Importance
Final Model Section:
- Excludes:
  Population Stability Index
  Prediction Statistics
  Quantile Response Plots
  Actual vs. Predicted Probabilities and Actual vs. Predicted Log Odds
- Can Include:
  Confusion Matrix
  H2O-3 Performance Charts: Scoring History, ROC Curve, Cumulative Lift and Cumulative Gains
Appendix:
- Excludes:
  Quantile Plot Calculation Tables (these tables only appear if quantile-based tables and plots are created)

Supported File Types

H2O AutoDoc can generate the following file types:

Microsoft Word documents (.docx)
Markdown files (.md)

Both of these file types can easily be exported to HTML or PDF.
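
If you build report file names programmatically, a small pre-check can catch an unsupported extension before a report is generated. This is a hypothetical convenience helper (`output_format` is not an AutoDoc function), and it only assumes the two file types listed above.

```python
from pathlib import Path

# File types listed above, keyed by extension.
SUPPORTED_EXTENSIONS = {
    ".docx": "Microsoft Word",
    ".md": "Markdown",
}


def output_format(output_path: str) -> str:
    """Return the report format implied by the file extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return SUPPORTED_EXTENSIONS[suffix]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
        raise ValueError(
            f"unsupported extension {suffix!r}; expected one of: {supported}"
        ) from None
```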