Experiment configuration¶

`max_runtime_minutes`¶

`max_runtime_minutes_until_abort`¶

`time_abort`¶

`time_abort_format`¶

`time_abort_timezone`¶

`delete_model_dirs_and_files`¶

`recipe`¶

Pipeline Building Recipe (String)

Default value 'auto'

Recipe type. Recipes override any GUI settings.

‘auto’: all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort

‘compliant’: like ‘auto’ except:
- interpretability=10 (to avoid complexity; overrides the interpretability chosen in the GUI or Python client)
- enable_glm=‘on’ (rest ‘off’, to avoid complexity and stay compatible with algorithms supported by MLI)
- fixed_ensemble_level=0: Don’t use any ensemble
- feature_brain_level=0: No feature brain used (to ensure every restart is identical)
- max_feature_interaction_depth=1: interaction depth is set to 1 (no multi-feature interactions, to avoid complexity)
- target_transformer=‘identity’: for regression (to avoid complexity)
- check_distribution_shift_drop=‘off’: Don’t use distribution shift between train, valid, and test to drop features (a bit risky without fine-tuning)

‘monotonic_gbm’: like ‘auto’ except:
- monotonicity_constraints_interpretability_switch=1: enable monotonicity constraints
- monotonicity_constraints_correlation_threshold=0.01: see below
- monotonicity_constraints_drop_low_correlation_features=true: drop features that aren’t correlated with the target by at least 0.01 (specified by the parameter above)
- fixed_ensemble_level=0: Don’t use any ensemble (to avoid complexity)
- included_models=[‘LightGBMModel’]
- included_transformers=[‘OriginalTransformer’]: only original (numeric) features will be used
- feature_brain_level=0: No feature brain used (to ensure every restart is identical)
- monotonicity_constraints_log_level=‘high’
- autodoc_pd_max_runtime=-1: no timeout for PDP creation in AutoDoc

‘kaggle’: like ‘auto’ except:
- external validation set is concatenated with train set, with target marked as missing
- test set is concatenated with train set, with target marked as missing
- transformers that do not use the target are allowed to fit_transform across entire train + validation + test
- several config toml expert options open up limits (e.g. more numerics are treated as categoricals)
- Note: If memory is plentiful, you can:
  - choose kaggle mode and then change fixed_feature_interaction_depth to a large negative number; otherwise the number of features given to a transformer is limited to 50 by default
  - choose mutation_mode = “full”, so even more types of transformations are done at once per transformer

‘nlp_model’: Only enables NLP models that process pure text

‘nlp_transformer’: Only enables NLP transformers that process pure text, while any model type is allowed

‘image_model’: Only enables Image models that process pure images

‘image_transformer’: Only enables Image transformers that process pure images, while any model type is allowed

‘unsupervised’: Only enables unsupervised transformers, models and scorers

‘gpus_max’: Maximize use of GPUs (e.g. use XGBoost, rapids, Optuna hyperparameter search, etc.)

‘more_overfit_protection’: Potentially reduce overfitting, especially for small data, by disabling target encoding and making the GA behave like the final model for tree counts and learning rate

Each pipeline building recipe mode can be chosen and then fine-tuned using the individual expert settings. Changing the pipeline building recipe resets all pipeline building recipe options back to their defaults and then re-applies the specific rules for the new mode, which undoes any fine-tuning of expert options that are part of the pipeline building recipe rules.

If you choose to run a new, continued, refitted, or retrained experiment from a parent experiment, the recipe rules are not re-applied and any fine-tuning is preserved. To reset recipe behavior, switch between ‘auto’ and the desired mode. This way the new child experiment will use the default settings for the chosen recipe.
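The reset-then-reapply behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not Driverless AI code: `switch_recipe`, `RECIPE_RULES`, and the `DEFAULT` placeholder are hypothetical names, the real default values are not known here, and only the ‘compliant’ overrides listed above are filled in.

```python
from typing import Any, Dict

# Placeholder standing in for each option's real default (unknown to this sketch).
DEFAULT = "<default>"

# Overrides copied from the 'compliant' list in this section.
COMPLIANT_RULES: Dict[str, Any] = {
    "interpretability": 10,
    "enable_glm": "on",
    "fixed_ensemble_level": 0,
    "feature_brain_level": 0,
    "max_feature_interaction_depth": 1,
    "target_transformer": "identity",
    "check_distribution_shift_drop": "off",
}

# Hypothetical rule table; 'auto' applies no overrides in this sketch.
RECIPE_RULES: Dict[str, Dict[str, Any]] = {
    "auto": {},
    "compliant": COMPLIANT_RULES,
}


def switch_recipe(settings: Dict[str, Any], new_recipe: str) -> Dict[str, Any]:
    """Model a recipe change: reset recipe-controlled options, then apply the new rules."""
    # Every option that appears in any recipe's rules counts as recipe-controlled.
    controlled = set().union(*(rules.keys() for rules in RECIPE_RULES.values()))
    result = dict(settings)  # leave the caller's settings untouched
    for key in controlled:
        result[key] = DEFAULT  # fine-tuning of recipe options is undone here
    result.update(RECIPE_RULES[new_recipe])
    result["recipe"] = new_recipe
    return result


# A 'compliant' experiment whose ensemble level was fine-tuned by hand.
tuned = {"recipe": "compliant", "fixed_ensemble_level": 2, "seed": 1234}

# Switching to 'auto' and back restores the recipe's own value.
restored = switch_recipe(switch_recipe(tuned, "auto"), "compliant")
print(restored["fixed_ensemble_level"], restored["seed"])
```

In this model, options that no recipe rule touches (such as `seed`) survive the switch, which is an assumption drawn from the wording that only options "part of pipeline building recipe rules" are reset.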

`enable_genetic_algorithm`¶

`make_python_scoring_pipeline`¶

`make_mojo_scoring_pipeline`¶

`mojo_for_predictions_benchmark`¶

`mojo_for_predictions_benchmark_slower_than_python_threshold`¶

`mojo_for_predictions_benchmark_slower_than_python_min_rows`¶

`mojo_for_predictions_benchmark_slower_than_python_min_seconds`¶

`inject_mojo_for_predictions`¶

`mojo_for_predictions`¶

`mojo_for_predictions_max_rows`¶

`mojo_for_predictions_batch_size`¶

`mojo_acceptance_test_rtol`¶

`mojo_acceptance_test_atol`¶

`reduce_mojo_size`¶

`make_pipeline_visualization`¶

`make_python_pipeline_visualization`¶

`max_cols_make_autoreport_automatically`¶

`max_cols_make_pipeline_visualization_automatically`¶

`pass_env_to_deprecated_python_scoring`¶

`transformer_description_line_length`¶

`benchmark_mojo_latency`¶

`benchmark_mojo_latency_auto_size_limit`¶

`mojo_building_timeout`¶

`mojo_building_parallelism`¶

`mojo_building_parallelism_base_model_size_limit`¶

`show_pipeline_sizes`¶

`max_workers`¶

`max_cores_dai`¶

`stall_subprocess_submission_dai_fork_threshold_count`¶

`stall_subprocess_submission_mem_threshold_pct`¶

`max_cores_by_physical`¶

`max_cores_limit`¶

`assumed_simultaneous_dt_forks_stats_openblas`¶

`max_max_dt_threads_stats_openblas`¶

`max_dt_threads_do_timeseries_split_suggestion`¶

`kaggle_username`¶

`kaggle_key`¶

`kaggle_timeout`¶

`kaggle_keep_submission`¶

`kaggle_competitions`¶

`ping_period`¶

`ping_autodl`¶

`disk_limit_gb`¶

`stall_disk_limit_gb`¶

`memory_limit_gb`¶

`min_num_rows`¶

`min_rows_per_class`¶

`min_rows_per_split`¶

`reproducibility_level`¶

`seed`¶

`missing_values`¶

`glm_nan_impute_training_data`¶

`glm_nan_impute_validation_data`¶

`glm_nan_impute_prediction_data`¶

`tf_nan_impute_value`¶

`statistical_threshold_data_size_small`¶

`statistical_threshold_data_size_large`¶

`aux_threshold_data_size_large`¶

`set_method_sampling_row_limit`¶

`performance_threshold_data_size_small`¶

`performance_threshold_data_size_large`¶

`max_relative_cols_mismatch_allowed`¶

`max_cols`¶

`max_rows_col_stats`¶

`max_rows_cv_in_cv_gini`¶

`max_rows_constant_model`¶

`max_rows_final_ensemble_base_model_fold_scores`¶

`max_rows_final_blender`¶

`max_rows_final_train_score`¶

`max_rows_final_roccmconf`¶

`max_rows_final_holdout_score`¶

`max_rows_final_holdout_bootstrap_score`¶

`max_rows_leak`¶

`max_workers_fs`¶

`max_workers_shift_leak`¶

`num_folds`¶

`fold_balancing_repeats_times_rows`¶

`max_fold_balancing_repeats`¶

`fixed_split_seed`¶

`show_fold_stats`¶

`allow_different_classes_across_fold_splits`¶

`full_cv_accuracy_switch`¶

`ensemble_accuracy_switch`¶

`num_ensemble_folds`¶

`save_validation_splits`¶

`fold_reps`¶

`max_num_classes_hard_limit`¶

`max_num_classes`¶

`max_num_classes_compute_roc`¶

`max_num_classes_client_and_gui`¶

`roc_reduce_type`¶

`min_roc_sample_size`¶

`max_rows_cm_ga`¶

`num_actuals_vs_predicted`¶

`use_feature_brain_new_experiments`¶

`resume_data_schema`¶

`resume_data_schema_old_logic`¶

`feature_brain_level`¶

Model/Feature Brain Level (0..10) (Number)

Default value 2

Whether to show (or use) results from the H2O.ai brain: the local caching and smart re-use of prior experiments, in order to generate more useful features and models for new experiments. See use_feature_brain_new_experiments for how new experiments by default do not use the brain cache. It can also be used to control checkpointing for experiments that have been paused or interrupted.

DAI will use the H2O.ai brain cache if the cache file:
- has any matching column names and types for a similar experiment type
- exactly matches classes
- exactly matches class labels
- matches basic time series choices
- has an interpretability that is equal or lower
- has a main model (booster) that is allowed by the new experiment

Level of brain to use (for the chosen level, higher levels also perform all lower-level operations automatically):
- -1 = Don’t use any brain cache and don’t write any cache
- 0 = Don’t use any brain cache but still write cache. Use case: Want to save the model for later use, but want the current model to be built without any brain models
- 1 = smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model, but the match can be loose, so this needs caution
- 2 = smart checkpoint from the H2O.ai brain cache of individual best models. Use case: DAI scans through the H2O.ai brain cache for the best models to restart from
- 3 = smart checkpoint like level #1, but for the entire population. Tune only if the brain population is of insufficient size (will re-score the entire population in a single iteration, so the first iteration appears to take longer to complete)
- 4 = smart checkpoint like level #2, but for the entire population. Tune only if the brain population is of insufficient size (will re-score the entire population in a single iteration, so the first iteration appears to take longer to complete)
- 5 = like #4, but will scan over the entire brain cache of populations to get the best scored individuals (can be slower due to brain cache scanning if the cache is big)
- 1000 + feature_brain_level (for the positive values above) = use resumed_experiment_id and the actual feature_brain_level, to use another specific experiment as the base for individuals or population, instead of sampling from any old experiments

The GUI provides these options and corresponding settings:
1) New Experiment: Uses the feature brain level default of 2
2) New Experiment With Same Settings: Re-uses the same feature brain level as the parent experiment
3) Restart From Last Checkpoint: Resets the feature brain level to 1003 and sets the experiment ID to resume from (continued genetic algorithm iterations)
4) Retrain Final Pipeline: Like Restart, but also sets time=0 so it skips any tuning and heads straight to the final model (assumes the parent experiment had at least one tuning iteration)

Other use cases:
a) Restart on different data: Use the same column names and fewer or more rows (applicable to levels 1 - 5)
b) Re-fit only the final pipeline: Like (a), but choose time=1 and feature_brain_level=3 - 5
c) Restart with more columns: Add columns, so the model builds upon the old model built from the old column names (levels 1 - 5)
d) Restart with focus on model tuning: Restart, then select feature_engineering_effort = 3 in expert settings
e) Retrain the final model but ignore any original features except those in the final pipeline: normal retrain, but set brain_add_features_for_new_columns=false

Notes:
1) In all cases, the resumed experiment ID is checked first if given, and then the brain cache
2) For Restart cases, you may want to set min_dai_iterations to a non-zero value to force delayed early stopping; otherwise there may not be enough iterations to find a better model
3) A “New Experiment With Same Settings” of a Restart will use feature_brain_level=1003, the default for Restart mode (revert to 2, or even 0, if you want to start a fresh experiment instead)
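The "1000 + feature_brain_level" convention can be read as a base level plus a flag saying whether resumed_experiment_id is used. The helper below is a minimal sketch of that reading; `decode_feature_brain_level` and `BrainLevel` are hypothetical names, not part of Driverless AI, and treating only values above 1000 as the resumed form is an assumption based on the description covering positive levels only.

```python
from typing import NamedTuple


class BrainLevel(NamedTuple):
    level: int                      # base level, e.g. -1 through 5
    uses_resumed_experiment: bool   # True when resumed_experiment_id is the base


def decode_feature_brain_level(value: int) -> BrainLevel:
    """Split a feature_brain_level setting into its base level and resume flag."""
    # Assumption: only 1000 + a positive level marks the resumed form.
    if value > 1000:
        return BrainLevel(level=value - 1000, uses_resumed_experiment=True)
    return BrainLevel(level=value, uses_resumed_experiment=False)


# 1003 is the value the text gives for "Restart From Last Checkpoint".
print(decode_feature_brain_level(1003))
# 2 is the documented default for a new experiment.
print(decode_feature_brain_level(2))
```

Read this way, 1003 is level 3 applied to the specific experiment named by resumed_experiment_id, while the default of 2 samples from the general brain cache.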

`feature_brain_reset_score`¶

`enable_strict_confict_key_check_for_brain`¶

`allow_change_layer_count_brain`¶

`brain_maximum_diff_score`¶

`max_num_brain_indivs`¶

`feature_brain_save_every_iteration`¶

`which_iteration_brain`¶

`refit_same_best_individual`¶

`restart_refit_redo_origfs_shift_leak`¶

`brain_rel_dir`¶

`brain_max_size_GB`¶

`brain_add_features_for_new_columns`¶

`force_model_restart_to_defaults`¶

`early_stopping`¶

`early_stopping_per_individual`¶

`min_dai_iterations`¶

`tensorflow_nlp_have_gpus_in_production`¶

`bert_migration_timeout_secs`¶

`enable_bert_transformer_acceptance_test`¶

`enable_bert_model_acceptance_test`¶

`string_col_as_text_min_relative_cardinality`¶

`string_col_as_text_min_absolute_cardinality`¶

`supported_image_types`¶

`image_paths_absolute`¶

`text_dl_token_pad_percentile`¶

`text_dl_token_pad_max`¶

`tune_parameters_accuracy_switch`¶

`tune_target_transform_accuracy_switch`¶

`target_transformer`¶

`target_transformer_tuning_choices`¶

`tournament_style`¶

`tournament_uniform_style_interpretability_switch`¶

`tournament_uniform_style_accuracy_switch`¶

`tournament_model_style_accuracy_switch`¶

`tournament_feature_style_accuracy_switch`¶

`tournament_fullstack_style_accuracy_switch`¶

`tournament_use_feature_penalized_score`¶

`num_individuals`¶

`fixed_fold_reps`¶

`sanitize_natural_sort_limit`¶

`excluded_transformers`¶

`excluded_genes`¶

`excluded_models`¶

`excluded_pretransformers`¶

`include_all_as_pretransformers_if_none_selected`¶

`force_include_all_as_pretransformers_if_none_selected`¶

`excluded_datas`¶

`excluded_individuals`¶

`excluded_scorers`¶

`enable_glm_rapids`¶

`use_dask_for_1_gpu`¶

`dask_retrials_allreduce_empty_issue`¶

`optuna_pruner_kwargs`¶

`optuna_sampler_kwargs`¶

`use_xgboost_xgbfi`¶

`drop_constant_model_final_ensemble`¶

`xgboost_rf_exact_threshold_num_rows_x_cols`¶

`lossguide_drop_factor`¶

`lossguide_max_depth_extend_factor`¶

`params_tune_grow_policy_simple_trees`¶

`default_max_bin`¶

`default_lightgbm_max_bin`¶

`min_max_bin`¶

`scale_mem_for_max_bin`¶

`factor_rf`¶

`tensorflow_use_all_cores`¶

`tensorflow_use_all_cores_even_if_reproducible_true`¶

`tensorflow_disable_memory_optimization`¶

`tensorflow_cores`¶

`validate_meta_learner`¶

`validate_meta_learner_extra`¶

`fixed_num_folds_evolution`¶

`fixed_num_folds`¶

`fixed_only_first_fold_model`¶

`num_fold_ids_show`¶

`fold_scores_instability_warning_threshold`¶

`feature_evolution_data_size`¶

`final_pipeline_data_size`¶

`max_validation_to_training_size_ratio_for_final_ensemble`¶

`force_stratified_splits_for_imbalanced_threshold_binary`¶

`force_stratified_splits_for_binary_max_rows`¶

`stratify_for_regression`¶

`imbalance_ratio_multiclass_threshold`¶

`heavy_imbalance_ratio_multiclass_threshold`¶

`imbalance_sampling_rank_averaging`¶

`imbalance_ratio_notification_threshold`¶

`nbins_ftrl_list`¶

`te_bin_list`¶

`woe_bin_list`¶

`ohe_bin_list`¶

`cols_to_drop_sanitized`¶

`cols_to_group_by_sanitized`¶

`leaderboard_mode`¶

`default_knob_offset_accuracy`¶

`default_knob_offset_time`¶

`default_knob_offset_interpretability`¶

`shift_check_text`¶

`use_rf_for_shift_if_have_lgbm`¶

`shift_key_features_varimp`¶

`shift_check_reduced_features`¶

`shift_trees`¶

`shift_max_bin`¶

`shift_min_max_depth`¶

`shift_max_max_depth`¶

`detect_features_distribution_shift_threshold_auc`¶

`drop_features_distribution_shift_min_features`¶

`shift_high_notification_level`¶

`leakage_check_text`¶

`leakage_key_features_varimp`¶

`leakage_key_features_varimp_if_no_early_stopping`¶

`leakage_check_reduced_features`¶

`use_rf_for_leakage_if_have_lgbm`¶

`leakage_trees`¶

`leakage_max_bin`¶

`leakage_min_max_depth`¶

`leakage_max_max_depth`¶

`drop_features_leakage_min_features`¶

`leakage_train_test_split`¶

`check_system`¶

`abs_tol_for_perfect_score`¶

`data_ingest_timeout`¶

`gpu_locking_trust_pool_submission`¶

`gpu_locking_free_dead`¶

`tensorflow_allow_cpu_only`¶

`check_pred_contribs_sum`¶

`debug_daimodel_level`¶

`debug_debug_xgboost_splits`¶

`log_predict_info`¶

`log_fit_info`¶

`stalled_time_kill_ref`¶

`long_time_psdump`¶

`do_psdump`¶

`livelock_signal`¶

`num_cpu_sockets_override`¶

`num_gpus_override`¶

`show_gpu_usage_only_if_locked`¶

`show_inapplicable_models_preview`¶

`show_inapplicable_transformers_preview`¶

`show_warnings_preview`¶

`show_warnings_preview_unused_map_features`¶

`max_cols_show_unused_features`¶

`max_cols_show_feature_transformer_mapping`¶

`warning_unused_feature_show_max`¶

`interaction_finder_max_rows_x_cols`¶

`interaction_finder_corr_threshold`¶

`min_bootstrap_samples`¶

`max_bootstrap_samples`¶

`min_bootstrap_sample_size_factor`¶

`max_bootstrap_sample_size_factor`¶

`bootstrap_final_seed`¶

`benford_mad_threshold_int`¶

`benford_mad_threshold_real`¶

`stabilize_features`¶

`fraction_std_bootstrap_ladder_factor`¶

`bootstrap_ladder_samples_limit`¶

`rdelta_percent_score_penalty_per_feature_by_interpretability`¶

`drop_low_meta_weights`¶

`meta_weight_allowed_by_interpretability`¶

`meta_weight_allowed_for_reference`¶

`show_full_pipeline_details`¶

`num_transformed_features_per_pipeline_show`¶

`fs_data_vary_for_interpretability`¶

`fs_data_frac`¶

`many_columns_count`¶

`columns_count_interpretable`¶

`round_up_indivs_for_busy_gpus`¶

`check_timeout_per_gpu`¶

`gpu_exit_if_fails`¶

`require_graphviz`¶

`fast_approx_max_num_trees_ever`¶

`fast_approx_num_trees`¶

`fast_approx_do_one_fold`¶

`fast_approx_do_one_model`¶

`fast_approx_contribs_num_trees`¶

`fast_approx_contribs_do_one_fold`¶

`fast_approx_contribs_do_one_model`¶

`use_187_prob_logic`¶

`enable_ohe_linear`¶

`max_absolute_feature_expansion`¶

`booster_for_fs_permute`¶

`model_class_name_for_fs_permute`¶

`switch_from_tree_to_lgbm_if_can`¶

`default_model_class_name`¶

`textlin_num_classes_switch`¶

`text_gene_dim_reduction_choices`¶

`text_gene_max_ngram`¶

`number_of_texts_to_cache_in_bert_transformer`¶

`gbm_early_stopping_rounds_min`¶

`gbm_early_stopping_rounds_max`¶

`max_varimp_to_save`¶

`max_num_varimp_to_log`¶

`max_num_varimp_shift_to_log`¶

`can_skip_final_upper_layer_failures`¶

`config_overrides`¶

`dump_modelparams_every_scored_indiv_feature_count`¶

`dump_modelparams_every_scored_indiv_mutation_count`¶

`dump_modelparams_separate_files`¶

`delete_preview_trans_timings`¶

`use_random_text_file`¶

`runtime_estimation_train_frame`¶

`enable_bad_scorer`¶

`debug_col_dict_prefix`¶

`return_early_debug_col_dict_prefix`¶

`return_early_debug_preview`¶

`autoviz_enable_recommendations`¶

`autoviz_recommended_transformation`¶

`last_recipe`¶

`make_mojo_scoring_pipeline_for_features_only`¶

`mojo_replace_target_encoding_with_grouped_input_cols`¶

`time_series_causal_split_recipe`¶

`use_lags_if_causal_recipe`¶

`min_ymd_timestamp`¶

`max_ymd_timestamp`¶

`max_rows_datetime_format_detection`¶

`disallowed_datetime_formats`¶

`use_datetime_cache`¶

`datetime_cache_min_rows`¶

`holiday_country`¶

`max_time_series_properties_sample_size`¶

`max_lag_sizes`¶

`min_lag_autocorrelation`¶

`max_signal_lag_sizes`¶

`single_model_vs_cv_score_reldiff`¶

`single_model_vs_cv_score_reldiff2`¶

`blend_in_link_space`¶

`tgc_via_ui_max_ncols`¶

`tgc_dup_tolerance`¶
