Features Settings

Feature Engineering Effort

Specify a value from 0 to 10 for the Driverless AI feature engineering effort. Higher values generally lead to more time (and memory) spent in feature engineering. This value defaults to 5.

  • 0: Keep only numeric features. Only model tuning during evolution.

  • 1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.

  • 2: Similar to 1, but the only restriction is that text features are excluded. Some feature tuning before evolution.

  • 3: Similar to 5 but only tuning during evolution. Mixed tuning of features and model parameters.

  • 4: Similar to 5 but slightly more focused on model tuning.

  • 5: Balanced feature-model tuning. (Default)

  • 6-7: Similar to 5 but slightly more focused on feature engineering.

  • 8: Similar to 6-7 but even more focused on feature engineering with high feature generation rate and no feature dropping even if high interpretability.

  • 9-10: Similar to 8 but no model tuning during feature evolution.

Data Distribution Shift Detection

Specify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided). Currently, this information is only presented to the user and not acted upon.

Data Distribution Shift Detection Drop of Features

Specify whether to drop high-shift features. This defaults to Auto. Note that Auto for time series experiments turns this feature off.

Max Allowed Feature Shift (AUC) Before Dropping Feature

Specify the maximum allowed AUC value for a feature before dropping the feature.

When the train and test datasets (or train/valid or valid/test) differ in the distribution of their data, a model can be built that predicts, for each row, whether the row belongs to train or test. This model is scored with an AUC value. If the AUC, GINI, or Spearman correlation of this model is above the specified threshold, Driverless AI considers the shift strong enough to drop those features.

The default AUC threshold is 0.999.
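
The idea can be sketched in a few lines of Python. This is a simplified illustration, not the Driverless AI implementation: instead of fitting a model, it uses each raw feature value as the score for telling train rows from test rows, and the helper names (shift_auc, features_to_drop) and the sample data are hypothetical.

```python
def shift_auc(train_values, test_values):
    """AUC for separating test rows from train rows using one feature as the score."""
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5  # ties count as half a win
    auc = wins / (len(train_values) * len(test_values))
    # A feature that separates the two sets in either direction counts as shifted.
    return max(auc, 1.0 - auc)


def features_to_drop(train, test, threshold=0.999):
    """Return the names of features whose shift AUC is above the threshold."""
    return [name for name in train if shift_auc(train[name], test[name]) > threshold]


# Hypothetical data: "age" has the same distribution in both sets, "year" does not.
train = {"age": [25, 31, 47, 52], "year": [2018, 2018, 2019, 2019]}
test = {"age": [29, 33, 45, 50], "year": [2021, 2021, 2022, 2022]}
print(features_to_drop(train, test))
```

A feature whose values overlap in both datasets scores close to 0.5, while a feature that perfectly identifies the dataset scores 1.0 and exceeds the 0.999 threshold.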

Data Leakage Detection

Specify whether to check for data leakage for each feature. Some features may have excessive predictive power for the target column, which may hurt model generalization. Note that this is always disabled if a fold column is specified and if the experiment is a time series experiment. This is set to Auto by default.

The equivalent config.toml parameter is check_leakage.

Data Leakage Detection Dropping AUC/R2 Threshold

If Leakage Detection is enabled, specify the threshold for dropping features. When the AUC (or R2 for regression), GINI, or Spearman correlation is above this value, the feature is dropped. This value defaults to 0.999.

The equivalent config.toml parameter is drop_features_leakage_threshold_auc.

Max Rows Times Columns for Leakage

Specify the maximum number of rows times the number of columns to trigger sampling for leakage checks. This value defaults to 10,000,000.
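
As a rough illustration of the arithmetic, the sketch below checks whether a dataset exceeds the limit. The text does not say how rows are sampled once the limit is exceeded, so the sketch assumes the row count is capped so that rows times columns stays within the limit; the function name is hypothetical.

```python
def leakage_sample_rows(n_rows, n_cols, max_rows_times_cols=10_000_000):
    """Return how many rows to use for leakage checks (assumed capping rule)."""
    if n_rows * n_cols <= max_rows_times_cols:
        return n_rows  # small enough: no sampling needed
    # Assumption: keep as many rows as fit under the rows-times-columns limit.
    return max(1, max_rows_times_cols // n_cols)


print(leakage_sample_rows(1_000_000, 5))   # 5,000,000 cells: under the limit
print(leakage_sample_rows(1_000_000, 50))  # 50,000,000 cells: sampling triggered
```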

Report Permutation Importance on Original Features

Specify whether Driverless AI reports permutation importance on original features. This is disabled by default.

Maximum Number of Rows to Perform Permutation-Based Feature Selection

Specify the maximum number of rows to use when performing permutation feature importance. This value defaults to 500,000.

Max Number of Original Features Used

Specify the maximum number of columns to be selected from an existing set of columns using feature selection. This value defaults to 10,000.

Max Number of Original Non-Numeric Features

Specify the maximum number of non-numeric columns to be selected. Feature selection is performed on all features when this value is exceeded. This value defaults to 300.

Max Number of Original Features Used for FS Individual

Specify the maximum number of features you want to be selected in an experiment. This value defaults to 500. If the number of columns exceeds the specified value, a special individual with a reduced set of original columns is added.

Number of Original Numeric Features to Trigger Feature Selection Model Type

The maximum number of original numeric columns, above which Driverless AI will do feature selection. Note that this is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features. This value defaults to 500.

Number of Original Non-Numeric Features to Trigger Feature Selection Model Type

The maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all features. Note that this is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features. This value defaults to 200.

Max Allowed Fraction of Uniques for Integer and Categorical Columns

Specify the maximum fraction of unique values for integer and categorical columns. If the column has a larger fraction of unique values than that, it will be considered an ID column and ignored. This value defaults to 0.95.
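
The check amounts to comparing the fraction of distinct values with the threshold. The following is a minimal sketch under that reading; the function name and sample columns are hypothetical, and the actual detection in Driverless AI may involve additional rules.

```python
def looks_like_id(values, max_unique_fraction=0.95):
    """Return True if the fraction of unique values is above the threshold."""
    if not values:
        return False
    return len(set(values)) / len(values) > max_unique_fraction


customer_id = list(range(100))      # every value is unique: fraction 1.0
store_number = [1, 1, 2, 2, 3, 3]   # repeated values: fraction 0.5
print(looks_like_id(customer_id), looks_like_id(store_number))
```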

Allow Treating Numerical as Categorical

Specify whether to allow some numerical features to be treated as categorical features. This is enabled by default.

Max Number of Unique Values for Int/Float to be Categoricals

Specify the number of unique values for integer or real columns to be treated as categoricals. This value defaults to 50.

Max Number of Engineered Features

Specify the maximum number of features to include in the final model’s feature engineering pipeline. If -1 is specified (default), then Driverless AI will automatically determine the number of features.

Max Number of Genes

Specify the maximum number of genes (transformer instances) kept per model (and per each model within the final model for ensembles). This limit is applied before features are scored, so Driverless AI randomly samples genes if pruning occurs. If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric features. A value of -1 means no restrictions except internally determined memory and interpretability restrictions.

Limit Features by Interpretability

Specify whether to limit feature counts with the Interpretability training setting as specified by the features_allowed_by_interpretability config.toml setting.

Threshold for Interpretability Above Which to Enable Automatic Monotonicity Constraints for Tree Models

Specify the Interpretability setting value at or above which automatic monotonicity constraints are used in XGBoostGBM, LightGBM, or Decision Tree models. This value defaults to 7.

See monotonicity_constraints_interpretability_switch in config.toml

Correlation Beyond Which to Trigger Monotonicity Constraints (if enabled)

Specify the threshold for the Pearson product-moment correlation coefficient between a numerical or encoded transformed feature and the target. When the correlation is above this threshold, a positive monotonicity constraint is used; when it is below the negative of this threshold, a negative constraint is used. This applies to XGBoostGBM, LightGBM, and Decision Tree models. This value defaults to 0.1.

Note: This setting is only enabled when Interpretability is greater than or equal to the value specified by the Threshold for Interpretability Above Which to Enable Automatic Monotonicity Constraints for Tree Models setting and when the Manual Override for Monotonicity Constraints setting is not specified.

See monotonicity_constraints_correlation_threshold in config.toml
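
The rule described above can be sketched as follows. This is an illustration of the thresholding logic only, with hypothetical function names and data; it is not the Driverless AI code.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def monotonicity_constraint(feature, target, threshold=0.1):
    """Return 1 (positive), -1 (negative), or 0 (no constraint)."""
    r = pearson(feature, target)
    if r > threshold:
        return 1
    if r < -threshold:
        return -1
    return 0


target = [0, 0, 1, 1, 1]
print(monotonicity_constraint([1, 2, 3, 4, 5], target))   # rises with the target
print(monotonicity_constraint([5, 4, 3, 2, 1], target))   # falls as the target rises
print(monotonicity_constraint([1, -1, 0, 1, -1], target)) # no linear relationship
```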

Control amount of logging when calculating automatic monotonicity constraints (if enabled)

For models that support monotonicity constraints (and if they are enabled), specify how much detail to log about the automatically determined monotonicity constraint for each feature going into the model, based on its correlation with the target. ‘low’ shows only the monotonicity constraint direction. ‘medium’ shows the correlation of positively and negatively constrained features. ‘high’ shows all correlation values.

See monotonicity_constraints_log_level in config.toml

Whether to drop features that have no monotonicity constraint applied (e.g., due to low correlation with target)

If enabled, only monotonic features with +1/-1 constraints will be passed to the model(s), and features without monotonicity constraints (0) will be dropped. Otherwise all features will be in the model. Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided.

See monotonicity_constraints_drop_low_correlation_features in config.toml

Manual Override for Monotonicity Constraints

Specify a list of features for which monotonicity constraints are applied. Original numeric features are mapped to the desired constraint:

  • 1: Positive constraint

  • -1: Negative constraint

  • 0: Constraint disabled

Constraint is automatically disabled (set to 0) for features that are not in this list.

The following is an example of how this list can be specified:

"{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}"

Note: If a list is not provided, then the automatic correlation-based method is used when monotonicity constraints are enabled at high enough interpretability settings.

See monotonicity_constraints_dict in config.toml
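
Because the example value is a Python-style dictionary literal inside a string, it can be checked before use with a short script. The sketch below is an assumption about how one might validate such a value on the user side; the helper names are hypothetical, and LIMIT_BAL is a made-up feature that is not in the list.

```python
import ast


def parse_monotonicity_dict(text):
    """Parse the dictionary string and check that every value is -1, 0, or 1."""
    constraints = ast.literal_eval(text)  # safely evaluates literals only
    if not isinstance(constraints, dict):
        raise ValueError("expected a dictionary of feature name to constraint")
    for name, value in constraints.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"invalid constraint {value!r} for feature {name!r}")
    return constraints


def constraint_for(feature, constraints):
    """Features that are not in the list get 0 (constraint disabled)."""
    return constraints.get(feature, 0)


text = "{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}"
constraints = parse_monotonicity_dict(text)
print(constraint_for("BILL_AMT1", constraints))  # listed feature
print(constraint_for("LIMIT_BAL", constraints))  # hypothetical unlisted feature
```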

Max Feature Interaction Depth

Specify the maximum number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates.

Exploring feature interactions can be important in gaining better predictive performance. The interaction can take multiple forms (e.g., feature1 + feature2 or feature1 * feature2 + … featureN). Although certain machine learning algorithms (like tree-based methods) can capture these interactions well as part of their training process, generating them explicitly may still help them (or other algorithms) yield better performance.

The depth of the interaction level (as in “up to” how many features may be combined at once to create one single feature) can be specified to control the complexity of the feature engineering process. Higher values may produce more predictive models at the expense of time. This value defaults to 8.

To disable feature interactions, set Max Feature Interaction Depth to 1 (max_feature_interaction_depth=1).
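
To see why depth affects run time, the sketch below counts how many distinct feature groups exist up to a given depth. It enumerates every group only to illustrate the growth; it makes no claim about how Driverless AI actually chooses groups, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def candidate_groups(features, max_depth):
    """List every group of 1 to max_depth features."""
    groups = []
    for depth in range(1, max_depth + 1):
        groups.extend(combinations(features, depth))
    return groups


def count_candidate_groups(n_features, max_depth):
    """Count the groups without listing them."""
    return sum(comb(n_features, depth) for depth in range(1, max_depth + 1))


features = ["a", "b", "c"]
print(candidate_groups(features, 1))  # depth 1: single features only, no interactions
print(candidate_groups(features, 2))  # depth 2: singles plus all pairs
print(count_candidate_groups(20, 8))  # grows quickly with depth
```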

Fixed Feature Interaction Depth

Specify a fixed non-zero number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates. To use all features for each transformer, set this to be equal to the number of columns. To do a 50/50 sample and a fixed feature interaction depth of n features, set this to -n.

Enable Target Encoding

Specify whether to use Target Encoding when building the model. Target encoding refers to several different feature transformations (primarily focused on categorical data) that aim to represent the feature using information from the actual target variable. A simple example is to replace each unique category of a categorical feature with the mean of the target for that category. These types of features can be very predictive but are prone to overfitting and require more memory, as they need to store mappings of the unique categories and the target values.
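
The simple mean-based example mentioned above can be sketched as follows. This is an illustration only: the function names and data are hypothetical, and falling back to the global target mean for unseen categories is an assumption rather than documented behavior. The stored mapping is what accounts for the extra memory.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Replace each category with its learned mean (assumed fallback: global mean)."""
    return [mapping.get(category, global_mean) for category in categories]


city = ["a", "a", "b", "b", "b"]
target = [1, 0, 1, 1, 1]
mapping, global_mean = fit_target_encoding(city, target)
print(apply_target_encoding(["a", "b", "c"], mapping, global_mean))
```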

Enable Outer CV for Target Encoding

For target encoding, specify whether an outer level of cross-fold validation is performed in cases where GINI is detected to flip sign or have an inconsistent sign for weight of evidence between fit_transform (on training data) and transform (on training and validation data). The degree to which GINI is inaccurate is also used to perform fold-averaging of look-up tables instead of using global look-up tables. This is enabled by default.

Enable Lexicographical Label Encoding

Specify whether to enable lexicographical label encoding. This is disabled by default.

Enable Isolation Forest Anomaly Score Encoding

Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that selected feature. The number of splits needed to isolate a point is that point's path length. Random partitioning produces noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

This option lets you specify whether to return the anomaly score of each sample. This is disabled by default.
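
The path-length idea can be demonstrated with a toy, one-dimensional version of the algorithm. This sketch is not the Driverless AI transformer: it uses a single feature, uses every row in every tree, and reports the raw mean path length rather than a normalized anomaly score. All names and data are hypothetical, and n_trees=200 simply mirrors the default number of estimators.

```python
import random


def isolation_depth(point, values, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` within `values`."""
    low, high = min(values), max(values)
    if len(values) <= 1 or low == high or depth >= max_depth:
        return depth
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [v for v in values if v < split]
    else:
        side = [v for v in values if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, values, n_trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, values, rng) for _ in range(n_trees))
    return total / n_trees


data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 500]
print(mean_path_length(500, data))  # the outlier: isolated after very few splits
print(mean_path_length(14, data))   # a typical point: needs more splits
```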

Enable One-Hot Encoding

Specify whether one-hot encoding is enabled. The default Auto setting is only applicable for small datasets and GLMs.

Number of Estimators for Isolation Forest Encoding

Specify the number of estimators for Isolation Forest encoding. This value defaults to 200.

Drop Constant Columns

Specify whether to drop columns with constant values. This is enabled by default.

Drop ID Columns

Specify whether to drop columns that appear to be an ID. This is enabled by default.

Don’t Drop Any Columns

Specify whether to avoid dropping any columns (original or derived). This is disabled by default.

Features to Drop

Specify which features to drop. This setting allows you to select many features at once by copying and pasting a list of column names (in quotes) separated by commas.

Features to Group By

Specify which features to group columns by. When this field is left empty (default), Driverless AI automatically searches all columns (either at random or based on which columns have high variable importance).

Sample from Features to Group By

Specify whether to sample from given features to group by or to always group all features. This is disabled by default.

Aggregation Functions (Non-Time-Series) for Group By Operations

Specify which aggregation functions to use for group by operations. Choose from the following (all are selected by default):

  • mean

  • sd

  • min

  • max

  • count
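
As an illustration of what these five aggregations produce, the sketch below groups one numeric column by one key column. The function name and data are hypothetical; using the sample standard deviation and reporting 0.0 for single-row groups are assumptions, not documented behavior.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_aggregates(keys, values):
    """Compute mean, sd, min, max, and count of `values` for each key."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {
        key: {
            "mean": mean(vals),
            # Assumption: sample standard deviation, 0.0 when a group has one row.
            "sd": stdev(vals) if len(vals) > 1 else 0.0,
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for key, vals in groups.items()
    }


store = ["a", "a", "b"]
sales = [10, 20, 5]
aggregates = group_aggregates(store, sales)
print(aggregates["a"])
print(aggregates["b"])
```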

Number of Folds to Obtain Aggregation When Grouping

Specify the number of folds to obtain aggregation when grouping. Out-of-fold aggregations will result in less overfitting, but they analyze less data in each fold.
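
The trade-off can be seen in a small sketch of an out-of-fold group mean: each row's aggregate is computed only from rows in the other folds, so a row never contributes to its own value, but each aggregate is based on fewer rows. Assigning folds by row index is an assumption made for simplicity, and the function name is hypothetical.

```python
def out_of_fold_group_mean(keys, values, n_folds):
    """For each row, return the mean of its group computed from the other folds."""
    n = len(keys)
    fold_of = [i % n_folds for i in range(n)]  # assumption: folds by row index
    result = [None] * n
    for fold in range(n_folds):
        sums, counts = {}, {}
        # Aggregate over every row that is NOT in the held-out fold.
        for i in range(n):
            if fold_of[i] != fold:
                sums[keys[i]] = sums.get(keys[i], 0.0) + values[i]
                counts[keys[i]] = counts.get(keys[i], 0) + 1
        # Assign the aggregate to the rows of the held-out fold.
        for i in range(n):
            if fold_of[i] == fold and keys[i] in counts:
                result[i] = sums[keys[i]] / counts[keys[i]]
    return result


keys = ["a", "a", "a", "a"]
values = [1, 2, 3, 4]
print(out_of_fold_group_mean(keys, values, n_folds=2))
```

Rows whose group does not appear in any other fold are left as None in this sketch.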

Type of Mutation Strategy

Specify which strategy to apply when performing mutations on transformers. Select from the following:

  • sample: Sample transformer parameters (Default)

  • batched: Perform multiple types of the same transformation together

  • full: Perform more types of the same transformation together than the above strategy

Enable Detailed Scored Features Info

Specify whether to dump every scored individual’s variable importance (both derived and original) to a csv/tabulated/json file. If enabled, Driverless AI produces files such as “individual_scored_id%d.iter%d*features*”. This is disabled by default.

Enable Detailed Logs for Timing and Types of Features Produced

Specify whether to dump every scored fold’s timing and feature info to a timings.txt file. This is disabled by default.

Compute Correlation Matrix

Specify whether to compute training, validation, and test correlation matrices. When enabled, this setting creates table and heatmap PDF files that are saved to disk. Note that this setting currently runs as a single-threaded process, which may be slow for experiments with many columns. This is disabled by default.

Required GINI Relative Improvement for Interactions

Specify the required GINI relative improvement value for the InteractionTransformer. If the GINI coefficient is not better than the specified relative improvement value in comparison to the original features considered in the interaction, then the interaction is not returned. If the data is noisy and there is no clear signal in interactions, this value can be decreased to return interactions. This value defaults to 0.5.

Number of Transformed Interactions to Make

Specify the number of transformed interactions to make from generated trial interactions. (The best transformed interactions are selected from the group of generated trial interactions.) This value defaults to 5.

Whether to enable RAPIDS cuML GPU transformers (no mojo)

Specify whether to enable GPU-based RAPIDS cuML transformers. Note that no MOJO support for deployment is available for this selection at this time, but Python scoring is supported. This feature is in beta testing status.

The equivalent config.toml parameter is enable_rapids_transformers and the default value is False.