System Settings¶

Exclusive level of access to node resources¶

There are three levels of access:

safe: this level assumes that another experiment might also be running on the same node.

moderate: this level assumes that there are no other experiments or tasks running on the same node, but still only uses physical core counts.

max: this level assumes that there is absolutely nothing else running on the node except the experiment.

The default level is “safe” and the equivalent config.toml parameter is exclusive_mode. If multinode is enabled, this option has no effect.

Number of Cores to Use¶

Specify the number of cores to use for the experiment. Note that if you specify 0, all available cores will be used. Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0.
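The "0 means all available cores" convention can be sketched as a small resolver. This is only an illustration, not Driverless AI's internal logic: the function name `resolve_core_limit` is hypothetical, and capping a positive value at the number of available cores is an assumption.

```python
import os
from typing import Optional


def resolve_core_limit(setting: int, available: Optional[int] = None) -> int:
    """Return the effective core count for a '0 means all cores' setting.

    Hypothetical helper: capping at `available` is an assumption, not
    documented behavior.
    """
    if available is None:
        # os.cpu_count() reports logical CPUs and may return None.
        available = os.cpu_count() or 1
    if setting < 0:
        raise ValueError("core limit must be 0 or a positive integer")
    if setting == 0:
        return available  # 0 = use every available core
    return min(setting, available)
```

For example, `resolve_core_limit(0, 16)` yields 16, while `resolve_core_limit(10, 16)` yields 10.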

Maximum Number of Cores to Use for Model Fit¶

Specify the maximum number of cores to use for a model’s fit call. Note that if you specify 0, all available cores will be used. This value defaults to 10.

If full dask cluster is enabled, use full cluster¶

Specify whether to use the full multinode distributed cluster (True) or single-node dask (False). In some cases, using the entire cluster can be inefficient. For example, with medium-sized data, several DGX nodes can be more efficient when used one node at a time. The equivalent config.toml parameter is use_dask_cluster.

Maximum Number of Cores to Use for Model Predict¶

Specify the maximum number of cores to use for a model’s predict call. Note that if you specify 0, all available cores will be used. This value defaults to 0.

Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, AutoDoc¶

Specify the maximum number of cores to use for a model’s transform and predict call when doing operations in the Driverless AI MLI GUI and the Driverless AI R and Python clients. Note that if you specify 0, all available cores will be used. This value defaults to 4.

Tuning Workers per Batch for CPU¶

Specify the number of workers used in CPU mode for tuning. A value of 0 uses the socket count, a value of -1 uses all physical cores, and a value greater than or equal to 1 uses that count. This value defaults to 0.

Number of Workers for CPU Training¶

Specify the number of workers used in CPU mode for training:

0: Use socket count (Default)
-1: Use all physical cores
>= 1: Use that count
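The worker-count values can be read as a three-way rule. The sketch below assumes that the phrase ">= 1 that count" means a positive value is used directly as the worker count. The function name and its `socket_count` and `physical_cores` arguments are hypothetical; in practice these numbers would come from the host's hardware topology.

```python
def resolve_cpu_workers(setting: int, socket_count: int, physical_cores: int) -> int:
    """Map a CPU worker setting to an effective worker count (illustrative)."""
    if setting == 0:
        return socket_count      # 0 = one worker per CPU socket (default)
    if setting == -1:
        return physical_cores    # -1 = one worker per physical core
    if setting >= 1:
        return setting           # assumed: positive values are taken as given
    raise ValueError("setting must be 0, -1, or a positive integer")
```

On a hypothetical two-socket machine with 32 physical cores, the default of 0 would give 2 workers and -1 would give 32.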

#GPUs/Experiment¶

Specify the number of GPUs to use per experiment. A value of -1 (default) uses all available GPUs. This value must be at least as large as the number of GPUs to use per model (or -1). In a multinode context when using dask, this refers to the per-node value.

Num Cores/GPU¶

Specify the number of CPU cores per GPU. In order to have a sufficient number of cores per GPU, this setting limits the number of GPUs used. This value defaults to 4.
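One way to picture how a cores-per-GPU requirement limits GPU usage is integer division of the core count by the requirement. This formula is an assumption made for illustration; the source does not state the exact rule, and the names below are hypothetical.

```python
def gpus_allowed_by_cores(available_gpus: int, cpu_cores: int,
                          min_cores_per_gpu: int = 4) -> int:
    """Estimate how many GPUs can be used given a cores-per-GPU requirement.

    Assumed rule: each GPU in use needs `min_cores_per_gpu` CPU cores.
    """
    if min_cores_per_gpu < 1:
        raise ValueError("min_cores_per_gpu must be at least 1")
    return min(available_gpus, cpu_cores // min_cores_per_gpu)
```

Under this assumption, a node with 8 GPUs but only 16 CPU cores would use 4 GPUs at the default of 4 cores per GPU.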

#GPUs/Model¶

Specify the number of GPUs to use per model. The equivalent config.toml parameter is num_gpus_per_model and the default value is 1. Currently, setting num_gpus_per_model to a value other than 1 disables GPU locking, so this is only recommended for single experiments and single users. Setting this parameter to -1 uses all GPUs per model. In all cases, XGBoost tree and linear models use the number of GPUs specified per model, while LightGBM and TensorFlow revert to using 1 GPU per model and run multiple models on multiple GPUs. FTRL does not use GPUs. RuleFit uses GPUs for the parts that obtain the tree using LightGBM. In a multinode context when using dask, this parameter refers to the per-node value.
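The constraint that the per-experiment GPU count must be at least the per-model count, with -1 meaning "all GPUs", can be checked before launching an experiment. The sketch below is a hypothetical pre-flight check, not part of Driverless AI; `check_gpu_settings` and `available_gpus` are illustrative names.

```python
from typing import Tuple


def check_gpu_settings(num_gpus_per_experiment: int, num_gpus_per_model: int,
                       available_gpus: int) -> Tuple[int, int]:
    """Resolve -1 to 'all GPUs' and verify per-experiment >= per-model."""
    def resolve(value: int) -> int:
        if value == -1:
            return available_gpus  # -1 = use all available GPUs
        if value < 0:
            raise ValueError("GPU counts must be -1 or non-negative")
        return value

    per_experiment = resolve(num_gpus_per_experiment)
    per_model = resolve(num_gpus_per_model)
    if per_experiment < per_model:
        raise ValueError(
            "num_gpus_per_experiment must be at least num_gpus_per_model"
        )
    return per_experiment, per_model
```

With 8 GPUs available, the defaults (-1 per experiment, 1 per model) resolve to `(8, 1)`, while 2 per experiment with 4 per model is rejected.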

Num. of GPUs for Isolated Prediction/Transform¶

Specify the number of GPUs to use for predict for models and transform for transformers when running outside of fit/fit_transform. If predict or transform are called in the same process as fit/fit_transform, the number of GPUs will match. New processes will use this count for applicable models and transformers. Note that enabling tensorflow_nlp_have_gpus_in_production will override this setting for relevant TensorFlow NLP transformers. The equivalent config.toml parameter is num_gpus_for_prediction and the default value is “0”.

Note: When GPUs are used, TensorFlow, PyTorch models and transformers, and RAPIDS always predict on GPU. RAPIDS also requires the Driverless AI Python scoring package to be used on GPUs. In a multinode context when using dask, this refers to the per-node value.

Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training¶

Specify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per process basis):

0 = Use all threads
-1 = Automatically select number of threads (Default)

Max Number of Threads to Use for datatable Read and Write of Files¶

Specify the maximum number of threads to use for datatable during data reading and writing (applied on a per process basis):

0 = Use all threads
-1 = Automatically select number of threads (Default)

Max Number of Threads to Use for datatable Stats and OpenBLAS¶

Specify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):

0 = Use all threads
-1 = Automatically select number of threads (Default)

GPU Starting ID¶

Specify which gpu_id to start with. If using CUDA_VISIBLE_DEVICES=… to control GPUs (preferred method), gpu_id=0 is the first in that restricted list of devices. For example, if CUDA_VISIBLE_DEVICES='4,5' then gpu_id_start=0 will refer to device #4.
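The mapping from a gpu_id to a physical device under CUDA_VISIBLE_DEVICES amounts to indexing into the restricted list. The helper below is a hypothetical illustration that assumes the variable holds a comma-separated list of integer device indices (CUDA also accepts device UUIDs, which this sketch does not handle).

```python
def physical_device(cuda_visible_devices: str, gpu_id: int) -> int:
    """Return the physical device index that a logical gpu_id refers to."""
    devices = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    if not 0 <= gpu_id < len(devices):
        raise IndexError("gpu_id is outside the list of visible devices")
    return devices[gpu_id]
```

With `CUDA_VISIBLE_DEVICES='4,5'`, `physical_device("4,5", 0)` gives 4, matching the example above.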

In expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs:

Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1

In expert mode, to run 2 experiments, each on a distinct set of 4 GPUs out of 8 GPUs:

Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4

To use all 4 of an experiment's GPUs for each model:

Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4

If num_gpus_per_model!=1, global GPU locking is disabled. This is because the underlying algorithms do not support arbitrary gpu ids, only sequential ids, so be sure to set this value correctly to avoid overlap across all experiments by all users.

More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation

Note that GPU selection does not wrap, so gpu_id_start + num_gpus_per_model must not exceed the number of visible GPUs.
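Because GPU locking is disabled when num_gpus_per_model is not 1, it can help to check planned assignments for overlap by hand. The sketch below is a hypothetical planning aid, not a Driverless AI API. It assumes each experiment occupies the sequential IDs starting at gpu_id_start, and it treats gpu_id_start plus the GPU count being equal to the number of visible GPUs as valid, since the 8-GPU example above uses 4 + 4 = 8.

```python
from typing import Set


def gpu_id_range(gpu_id_start: int, num_gpus: int, visible_gpus: int) -> Set[int]:
    """Return the sequential GPU IDs an experiment would occupy (illustrative)."""
    if gpu_id_start < 0 or num_gpus < 1:
        raise ValueError("gpu_id_start must be >= 0 and num_gpus >= 1")
    # Selection does not wrap around, so the range must fit in the visible list.
    if gpu_id_start + num_gpus > visible_gpus:
        raise ValueError("GPU range extends past the visible GPUs")
    return set(range(gpu_id_start, gpu_id_start + num_gpus))


def overlapping_gpus(first: Set[int], second: Set[int]) -> Set[int]:
    """Return the GPU IDs claimed by both experiments (empty if none)."""
    return first & second


# The two 4-GPU experiments from the 8-GPU example above.
experiment_1 = gpu_id_range(gpu_id_start=0, num_gpus=4, visible_gpus=8)
experiment_2 = gpu_id_range(gpu_id_start=4, num_gpus=4, visible_gpus=8)
```

Here `overlapping_gpus(experiment_1, experiment_2)` is empty, so the two experiments do not share a GPU. Starting the second experiment at gpu_id_start=2 instead would report GPUs 2 and 3 as shared.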

Enable Detailed Traces¶

Specify whether to enable detailed tracing in the Driverless AI trace when running an experiment. This is disabled by default.

Enable Debug Log Level¶

If enabled, the log files will also include debug logs. This is disabled by default.

Enable Logging of System Information for Each Experiment¶

Specify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log. Note that this information is already included in system logs. This is enabled by default.