NLP Settings
tensorflow_max_epochs_nlp
Max TensorFlow Epochs for NLP
When building TensorFlow NLP features (for text data), specify the maximum number of epochs used to train the feature engineering models. Training might stop earlier. The higher the number of epochs, the longer the run time. This value defaults to 2 and is ignored if TensorFlow models are disabled.
enable_tensorflow_nlp_accuracy_switch
Accuracy Above Enable TensorFlow NLP by Default for All Models
Specify the accuracy threshold. Values at or above this threshold add all enabled TensorFlow NLP models at the start of the experiment for text-dominated problems when the following NLP expert settings are set to Auto:
Enable word-based CNN TensorFlow models for NLP
Enable word-based BiGRU TensorFlow models for NLP
Enable character-based CNN TensorFlow models for NLP
If the above transformations are set to ON, this parameter is ignored.
At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5.
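The rule described above can be sketched as a small decision function. This is only an illustration of the documented behavior, not Driverless AI's internal code: the function name, the return labels, and the handling of problems that are not text-dominated are assumptions.

```python
def tensorflow_nlp_mode(accuracy, setting="auto", threshold=5, text_dominated=True):
    """Return an illustrative label for how a TensorFlow NLP model type is used."""
    setting = setting.lower()
    if setting not in ("auto", "on", "off"):
        raise ValueError(f"Unexpected setting: {setting!r}")
    if setting == "off":
        return "disabled"
    if setting == "on":
        # Explicitly enabled: the accuracy threshold is ignored.
        return "enabled"
    # setting == "auto": the accuracy threshold decides.
    if text_dominated and accuracy >= threshold:
        return "added_at_start"
    # Assumption: every other Auto case falls back to mutation-only use.
    return "mutation_only"


print(tensorflow_nlp_mode(accuracy=7))  # added_at_start
print(tensorflow_nlp_mode(accuracy=3))  # mutation_only
```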
enable_tensorflow_textcnn
Enable Word-Based CNN TensorFlow Models for NLP
Specify whether to use word-based CNN TensorFlow models for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
enable_tensorflow_textbigru
Enable Word-Based BiGRU TensorFlow Models for NLP
Specify whether to use word-based BiGRU TensorFlow models for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
enable_tensorflow_charcnn
Enable Character-Based CNN TensorFlow Models for NLP
Specify whether to use character-based CNN TensorFlow models for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
enable_pytorch_nlp
Enable PyTorch Models for NLP (Experimental)
Specify whether to enable pretrained PyTorch models and fine-tune them for NLP tasks. This is set to Auto by default. Set this to On to use PyTorch models such as BERT for feature engineering or for modeling. We recommend using GPUs to speed up execution when this option is enabled.
Notes:
This setting requires an Internet connection.
Some PyTorch NLP models may only use one text column.
pytorch_nlp_pretrained_models
Select Which Pretrained PyTorch NLP Models to Use
Specify one or more pretrained PyTorch NLP models to use. Select from the following:
bert-base-uncased (Default)
distilbert-base-uncased (Default)
xlnet-base-cased
xlm-mlm-enfr-1024
roberta-base
albert-base-v2
camembert-base
xlm-roberta-base
Notes:
This setting requires an Internet connection.
Models that are not selected by default may not have MOJO support.
Using BERT-like models may result in a longer experiment completion time.
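As a minimal sketch, a selection can be checked against the list above before it is passed to an experiment. The helper name `check_model_selection` is hypothetical and not part of Driverless AI. It only encodes the model names and defaults listed here, and it flags non-default models because those may not have MOJO support.

```python
# Model names as listed above; the first two are the documented defaults.
AVAILABLE_MODELS = [
    "bert-base-uncased",
    "distilbert-base-uncased",
    "xlnet-base-cased",
    "xlm-mlm-enfr-1024",
    "roberta-base",
    "albert-base-v2",
    "camembert-base",
    "xlm-roberta-base",
]
DEFAULT_MODELS = {"bert-base-uncased", "distilbert-base-uncased"}


def check_model_selection(selected):
    """Validate a selection and return the names that may lack MOJO support."""
    unknown = [name for name in selected if name not in AVAILABLE_MODELS]
    if unknown:
        raise ValueError(f"Unknown pretrained model(s): {unknown}")
    return [name for name in selected if name not in DEFAULT_MODELS]


maybe_no_mojo = check_model_selection(["bert-base-uncased", "roberta-base"])
print(maybe_no_mojo)  # ['roberta-base']
```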
pytorch_nlp_fine_tuning_num_epochs
Number of Epochs for Fine-Tuning of PyTorch NLP Models
Specify the number of epochs used when fine-tuning PyTorch NLP models. This value defaults to 2.
pytorch_nlp_fine_tuning_batch_size
Batch Size for PyTorch NLP Models
Specify the batch size for PyTorch NLP models. This value defaults to 10.
Note: Larger models and larger batch sizes require more memory.
pytorch_nlp_fine_tuning_padding_length
Maximum Sequence Length for PyTorch NLP Models
Specify the maximum sequence length (padding length) for PyTorch NLP models. This value defaults to 100.
Note: Larger models and longer padding lengths require more memory.
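To illustrate what a fixed sequence length means in practice, the sketch below pads or truncates a list of token IDs to exactly 100 entries and counts the token positions in one batch as a rough proxy for memory use. The function names and the padding ID of 0 are assumptions made for illustration. Real tokenizers choose their own padding token and may also add special tokens.

```python
def pad_or_truncate(token_ids, max_length=100, pad_id=0):
    """Force a token-ID sequence to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def tokens_per_batch(batch_size=10, max_length=100):
    """Number of token positions processed in one batch."""
    return batch_size * max_length


short = pad_or_truncate([101, 2023, 102])   # padded up to 100 entries
long = pad_or_truncate(list(range(250)))    # truncated down to 100 entries
print(len(short), len(long))       # 100 100
print(tokens_per_batch())          # 1000
print(tokens_per_batch(32, 256))   # 8192
```

With the documented defaults (batch size 10, sequence length 100), each batch holds 1000 token positions. Raising either value increases that count proportionally.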
pytorch_nlp_pretrained_models_dir
Path to Pretrained PyTorch NLP Models
Specify a path to pretrained PyTorch NLP models. To get all available models, download http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/bert_models.zip, then extract the folder and store it in a directory on the instance where Driverless AI is installed:
pytorch_nlp_pretrained_models_dir = /path/on/server/to/bert_models_folder
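The extraction step can be sketched with the Python standard library, assuming the archive has already been downloaded to the server. The function name `extract_pretrained_models` is hypothetical, and the layout inside the archive is not specified here, so the function simply reports the top-level entries it finds after extraction.

```python
import zipfile
from pathlib import Path


def extract_pretrained_models(zip_path, target_dir):
    """Extract a downloaded archive and return the top-level entries created."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # The extracted folder is what pytorch_nlp_pretrained_models_dir should point to.
    return sorted(entry.name for entry in target.iterdir())


# Example call (paths are placeholders):
# extract_pretrained_models("bert_models.zip", "/path/on/server/to")
```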
tensorflow_nlp_pretrained_embeddings_file_path
Path to Pretrained Embeddings for TensorFlow NLP Models
Specify a path to pretrained embeddings that will be used for the TensorFlow NLP models, for example, /path/on/server/to/file.txt.
You can download pretrained GloVe embeddings and specify the local path in this box.
You can download pretrained fastText embeddings and specify the local path in this box.
You can also train your own custom embeddings and pass the resulting file to this option.
If this field is left empty, embeddings will be trained from scratch.
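As a rough sketch of what a custom embeddings file can look like, the code below writes vectors in the plain-text layout used by GloVe, with one token per line followed by its space-separated vector components. That Driverless AI expects exactly this layout is an assumption. The helper names and the toy vectors are made up for illustration.

```python
def write_embeddings(vectors, path):
    """Write {token: [floats]} as one 'token v1 v2 ...' line per token."""
    dims = {len(values) for values in vectors.values()}
    if len(dims) != 1:
        raise ValueError("All vectors must share one dimension")
    with open(path, "w", encoding="utf-8") as handle:
        for token, values in vectors.items():
            handle.write(token + " " + " ".join(f"{x:.6f}" for x in values) + "\n")


def read_embeddings(path):
    """Read the file back into a {token: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            token, *values = line.rstrip("\n").split(" ")
            vectors[token] = [float(x) for x in values]
    return vectors


# Example call (toy two-dimensional vectors, placeholder file name):
# write_embeddings({"good": [0.1, 0.2], "bad": [-0.1, 0.3]}, "file.txt")
```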
tensorflow_nlp_pretrained_embeddings_trainable
For TensorFlow NLP, Allow Training of Unfrozen Pretrained Embeddings
Specify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If this is disabled, the embedding layer will be frozen. All other weights, however, will still be fine-tuned. This is disabled by default.
text_fraction_for_text_dominated_problem
Fraction of Text Columns Out of All Features to be Considered a Text-Dominated Problem
Specify the fraction of text columns, out of all features, required for a problem to be considered text-dominated. This value defaults to 0.3.
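The check can be pictured as a simple ratio. The sketch below is illustrative only: the function names and column-type labels are hypothetical, and the inclusive comparison (`>=`) is an assumption, since the exact comparison is not stated here.

```python
def text_fraction(column_types):
    """Fraction of columns whose detected type is 'text'."""
    if not column_types:
        return 0.0
    n_text = sum(1 for kind in column_types.values() if kind == "text")
    return n_text / len(column_types)


def is_text_dominated(column_types, threshold=0.3):
    # Assumption: the comparison is inclusive (>=).
    return text_fraction(column_types) >= threshold


columns = {
    "review": "text",
    "title": "text",
    "price": "numeric",
    "store": "categorical",
}
print(text_fraction(columns))      # 0.5
print(is_text_dominated(columns))  # True
```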
text_transformer_fraction_for_text_dominated_problem
Fraction of Text per All Transformers to Trigger That Text Dominated
Specify the fraction of text transformers, out of all transformers, above which a problem is considered text-dominated. This value defaults to 0.3.
string_col_as_text_threshold
Threshold for String Columns to be Treated as Text
Specify the threshold value (from 0 to 1) that determines whether a string column is treated as text (for an NLP problem) or as a standard categorical variable (0.0 - text; 1.0 - string). Higher values favor treating string columns as categoricals, while lower values favor treating them as text. This value defaults to 0.3.
text_transformers_max_vocabulary_size
Max Size of the Vocabulary for Text Transformers
Specify the maximum number of tokens created when fitting TF-IDF-based and count-based text transformers. If multiple values are provided, the first value is used for initial models and the remaining values are used during parameter tuning and feature evolution. The default value is [1000, 5000]. Values smaller than 10000 are recommended for speed.
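A vocabulary cap can be sketched as keeping only the most frequent tokens. The code below is a simplified stand-in, not the transformer's actual implementation: the function names are hypothetical, and both the whitespace tokenization and the frequency-based truncation are assumptions.

```python
from collections import Counter


def build_vocabulary(documents, max_size):
    """Keep the max_size most frequent whitespace-separated tokens."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [token for token, _ in counts.most_common(max_size)]


def split_vocabulary_sizes(sizes):
    """First value for initial models, the rest for tuning and evolution."""
    return sizes[0], sizes[1:]


docs = ["good phone good price", "bad battery", "good battery life"]
initial, tuning = split_vocabulary_sizes([1000, 5000])
print(initial, tuning)             # 1000 [5000]
print(build_vocabulary(docs, 2))   # ['good', 'battery']
```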