.. _RFmath:

Random Forest
====================

Random Forest (RF) is a powerful classification tool. When given a
set of data, RF generates a forest of classification trees, rather
than a single classification tree. Each of these trees produces a
classification for a given set of attributes. The classification from
each H2O tree can be thought of as a vote; the class with the most
votes is the forest's final classification.
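
The voting step can be pictured with a short sketch (illustrative
Python, not H2O's implementation; the per-tree ``predict`` method is a
hypothetical interface):

.. code-block:: python

  from collections import Counter

  def forest_predict(trees, row):
      """Classify one row by majority vote over the trees in the forest."""
      votes = [tree.predict(row) for tree in trees]  # one vote per tree
      return Counter(votes).most_common(1)[0][0]     # most-voted class wins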

"""""" 

When to use RF
""""""""""""""

RF is a good choice when your objective is classification. 

""""""

Defining a Model
""""""""""""""""""

**Destination key:**

  A user-defined name for the model.


**Source:**

  The parsed data set to be used in training a model. 


**Response:**
   
  The dependent variable to be modeled.


**Ignored Columns:**

  The set of features from the source data set to omit from
  training a model. 

**Validation:** 

  A .hex key associated with data to be used in validation of the
  model built using the data specified in **Source**.

**N folds:**

  The number of folds for cross-validation (if no validation data is
  specified). To disable cross-validation, enter 0. If the number of
  folds is more than 1, **Validation** must remain empty.

**Holdout fraction:**

  The fraction of the training data (taken from the end of the data
  set) to hold out for validation (if no validation data is
  specified).

**Keep cross validation dataset splits:**

  Preserve the cross-validation dataset splits.


**N Trees:** 
  
  The number of trees to generate for classification.

**Mtries:**

  At each iteration, a randomly chosen subset of the features in the training data is selected and evaluated to define the optimal split of that subset. Mtries specifies the number of features to be selected from the whole set. When set to -1, the number of features is the square root of the total number of features, rounded to the nearest integer. 
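
For example, the -1 default can be computed as follows (a minimal
sketch of the rule described above; ``n_features`` is a placeholder
name):

.. code-block:: python

  import math

  def default_mtries(n_features):
      # Mtries = -1 means: round(sqrt(total number of features))
      return int(round(math.sqrt(n_features)))

  default_mtries(16)  # -> 4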

 

**Max Depth:** 

  A user-defined tuning parameter for controlling model complexity
  (by number of edges). Depth is the longest path from root to the
  furthest leaf. Maximum depth also specifies the maximum number of
  interactions that can be accounted for by the model.

**Min Rows:**

  (BigData only) The minimum number of observations to include in each
  terminal node. The maximum possible number of observations in a
  terminal node is N-1, where N is the total number of observations;
  the minimum is 1, which is the case where each observation falls in
  its own terminal node.

**Select stat type:**

  (DRF only) Specify the stat type (entropy, Gini, or twoing).
  
**Sampling strategy:**

  (DRF only) Specify the sampling strategy (random).


**Sample rate:**

  The sampling rate at each split.
  
**Score pojo:**

  Create a POJO (plain old Java object) for scoring.
  
**Balance classes:**

  For imbalanced data, balance training data class counts via
  over/under-sampling to improve predictive accuracy.

**Class sampling factors:**

  (BigData only) Specify the over/under-sampling ratios per class (in
  lexicographic order).


**Checkpoint:**

  (BigData only) A model key associated with a previously trained
  model. Use this option to build a new model as a continuation of a
  previously generated model (e.g., by a grid search).

**Overwrite checkpoint:**

  (BigData only) Overwrite the checkpoint.
  


**Max After Balance Size:**

  When classes are balanced, limit the resulting data set size to the
  specified multiple of the original data set size.


**Oobee:**

  Generate an out-of-bag error estimate (OOBEE).

**Importance:** 

  Returns information about the importance of each
  feature in the overall model. 



**N Bins:**

  A user-defined tuning parameter for controlling model complexity.
  N Bins sets the number of groups into which the original data can
  be split.

**Score each iteration:**

  (BigData only) Returns an error rate for each iteration of the
  training process; this is useful when models are especially complex.
  By producing a score at each iteration, users can stop the training
  process when the cost of continuing to model outweighs the marginal
  gain in predictive power.


**Seed:**

  A large number that allows an analysis to be recreated exactly, by
  specifying a fixed starting point for processes that would otherwise
  begin at a randomly chosen place within the data.

**Verbose:**

  Display tree splits and extra statistics in the output.

**Build tree one node:**

  (BigData only) Train the model using one node instead of a
  distributed cluster with multiple nodes. Selecting this option can
  significantly increase the time it takes to train a model, and is
  recommended only for small data sets.
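
Many of the options above have direct counterparts in the later H2O-3
Python API. The following is a minimal sketch under that assumption
(this page documents the classic web UI; the file path and the
``class`` response column below are placeholders):

.. code-block:: python

  import h2o
  from h2o.estimators.random_forest import H2ORandomForestEstimator

  h2o.init()
  train = h2o.import_file("data.csv")  # the parsed "Source" data set

  model = H2ORandomForestEstimator(
      ntrees=50,          # N Trees
      mtries=-1,          # Mtries: -1 -> sqrt(number of features)
      max_depth=20,       # Max Depth
      min_rows=1,         # Min Rows
      nbins=20,           # N Bins
      sample_rate=0.632,  # Sample rate
      nfolds=5,           # N folds (leave Validation empty when > 1)
      seed=42,            # Seed, for reproducibility
  )
  model.train(y="class", training_frame=train)  # "class" = Response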
 
""""

Interpreting Results
""""""""""""""""""""

RF results consist of a model key and a confusion matrix. The model
key specifies the full forest of trees to be used for predicting
classifications.


An example of a confusion matrix is given below:

The highlighted fields across the diagonal indicate the number of
true members of each class that were correctly predicted. In this
case, 44 members of class F were correctly identified as F, while 80
were incorrectly classified as M or I, yielding a class error rate of
0.645 (80 of 124 actual members of class F).

Reading down the column for class F, 11 members of class I and 56
members of class M were incorrectly classified as F, so a total of
111 observations in the set were predicted to be F.

The overall error rate is shown in the bottom right field. It reflects
the total number of incorrect predictions divided by the total number
of rows. 

.. image:: RFtable.png
   :width: 90%
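
The error-rate arithmetic can be reproduced from any confusion matrix.
In the sketch below, the first column matches the counts quoted above
(44, 56, and 11 observations predicted as F); the remaining cells are
placeholder values, not the figures from the table:

.. code-block:: python

  # rows = actual class (F, M, I); columns = predicted class (F, M, I)
  cm = [
      [44, 50, 30],  # actual F: 44 correct, 80 misclassified
      [56, 60, 20],  # actual M
      [11, 15, 70],  # actual I
  ]

  # per-class error: misclassified members of a class / class row total
  for i, (label, row) in enumerate(zip("FMI", cm)):
      wrong = sum(row) - row[i]
      print(f"class {label}: error = {wrong / sum(row):.3f}")

  # overall error: all off-diagonal counts / total number of rows
  total = sum(sum(row) for row in cm)
  correct = sum(cm[i][i] for i in range(3))
  print(f"overall error = {(total - correct) / total:.3f}")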
   
   
""""""   

RF Error Rates
""""""""""""""

H2O's Random Forest algorithm produces a dynamic confusion matrix. As
each tree is built and the OOBE (out-of-bag error estimate) is
recalculated, the error rate may rise before it falls; this is a
natural outcome of Random Forest's learning process. When only a few
trees have been built, each on a random subset of the data, the error
rate is expected to be relatively high. As more trees are added, and
therefore more trees "vote" for the correct classification of the OOB
data, the error rate should decrease.
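
The same trend can be observed with any RF implementation that reports
an OOB estimate; here is an illustrative sketch using scikit-learn
(not H2O) to track the estimate as trees are added:

.. code-block:: python

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)
  rf = RandomForestClassifier(oob_score=True, warm_start=True,
                              random_state=42)

  # Grow the forest a few trees at a time and watch the OOB error settle.
  for n in (5, 10, 25, 50, 100):
      rf.set_params(n_estimators=n)
      rf.fit(X, y)
      print(f"{n:3d} trees: OOB error = {1 - rf.oob_score_:.3f}")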

""""

Random Forest Data Science
--------------------------
   

.. raw:: html

    <iframe src="http://www.slideshare.net/slideshow/embed_code/20546878" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013" title="Jan vitek distributedrandomforest_5-2-2013" target="_blank">Jan vitek distributedrandomforest_5-2-2013</a> </strong> from <strong><a href="http://www.slideshare.net/0xdata" target="_blank">0xdata</a></strong> </div>


""""""



 
 



