NLP configuration
enable_tensorflow_textcnn
Enable word-based CNN TensorFlow transformers for NLP (String)
Default value 'auto'
Whether to use out-of-fold predictions of word-based CNN TensorFlow models as transformers for NLP if TensorFlow is enabled.
enable_tensorflow_textbigru
Enable word-based BiGRU TensorFlow transformers for NLP (String)
Default value 'auto'
Whether to use out-of-fold predictions of word-based Bi-GRU TensorFlow models as transformers for NLP if TensorFlow is enabled.
enable_tensorflow_charcnn
Enable character-based CNN TensorFlow transformers for NLP (String)
Default value 'auto'
Whether to use out-of-fold predictions of character-level CNN TensorFlow models as transformers for NLP if TensorFlow is enabled.
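These three transformers can be toggled individually. A minimal config.toml-style sketch, assuming the usual ‘auto’/‘on’/‘off’ values (the particular choices below are illustrative, not recommendations):
enable_tensorflow_textcnn = "on"        # word-based CNN transformers
enable_tensorflow_textbigru = "auto"    # word-based BiGRU transformers
enable_tensorflow_charcnn = "off"       # character-based CNN transformers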
enable_pytorch_nlp_transformer
Enable PyTorch transformers for NLP (String)
Default value 'auto'
Whether to use pretrained PyTorch models as transformers for NLP tasks. Fits a linear model on top of pretrained embeddings. Requires internet connection. Default of ‘auto’ means disabled. To enable, set to ‘on’. GPU(s) are highly recommended.
pytorch_nlp_transformer_max_rows_linear_model
Max number of rows to use for fitting the linear model on top of the pretrained embeddings. (Number)
Default value 50000
More rows can slow down the fitting process. Recommended values are less than 100000.
enable_pytorch_nlp_model
Enable PyTorch models for NLP (String)
Default value 'auto'
Whether to use pretrained PyTorch models and fine-tune them for NLP tasks. Requires internet connection. Default of ‘auto’ means disabled. To enable, set to ‘on’. These models use only the first text column and can be slow to train. GPU(s) are highly recommended.
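As a sketch, enabling both flavors of pretrained PyTorch NLP (transformers fitted on top of embeddings, and fully fine-tuned models), which otherwise stay disabled under ‘auto’, could look like this in config.toml (values are illustrative):
enable_pytorch_nlp_transformer = "on"                   # linear model fitted on top of pretrained embeddings
pytorch_nlp_transformer_max_rows_linear_model = 50000   # cap on rows used for that linear model
enable_pytorch_nlp_model = "on"                         # fine-tune pretrained models (uses only the first text column)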
pytorch_nlp_pretrained_models
Select which pretrained PyTorch NLP model(s) to use. (List)
Default value ['bert-base-uncased', 'distilbert-base-uncased', 'bert-base-multilingual-cased']
Select which pretrained PyTorch NLP model(s) to use. Non-default ones might have no MOJO support. Requires internet connection. Used only if PyTorch models or transformers for NLP are set to ‘on’.
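For example, to restrict the experiment to a single pretrained model (only consulted when the PyTorch NLP models or transformers above are set to ‘on’; non-default choices might lack MOJO support):
pytorch_nlp_pretrained_models = ["bert-base-uncased"]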
tensorflow_max_epochs_nlp
Max. TensorFlow epochs for NLP (Number)
Default value 2
Max. number of epochs for TensorFlow models for making NLP features
enable_tensorflow_nlp_accuracy_switch
Accuracy above which to enable TensorFlow NLP by default for all models (Number)
Default value 5
Accuracy setting at or above which all enabled TensorFlow NLP models are added at the start of the experiment for text-dominated problems when the TensorFlow NLP transformers are set to ‘auto’. If they are set to ‘on’, this parameter is ignored. At lower accuracy settings, TensorFlow NLP transformations will only be created as a mutation.
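A brief sketch combining the two TensorFlow NLP controls above, with illustrative values:
tensorflow_max_epochs_nlp = 2                # cap on epochs for TensorFlow NLP feature models
enable_tensorflow_nlp_accuracy_switch = 5    # at accuracy 5 or higher, enabled TensorFlow NLP models are added at experiment start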
tensorflow_nlp_pretrained_embeddings_file_path
Path to pretrained embeddings for TensorFlow NLP models. If empty, will train from scratch. (String)
Default value ''
Path to pretrained embeddings for TensorFlow NLP models; can be a path in the local file system or an S3 location (s3://). For example, download and unzip https://nlp.stanford.edu/data/glove.6B.zip and set
tensorflow_nlp_pretrained_embeddings_file_path = /path/on/server/to/glove.6B.300d.txt
tensorflow_nlp_pretrained_s3_access_key_id
S3 access key ID to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. (String)
Default value ''
tensorflow_nlp_pretrained_s3_secret_access_key
S3 secret access key to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. (String)
Default value ''
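Putting the three embedding-related settings together, a sketch using an S3 location; the bucket, object key, and credentials below are placeholders:
tensorflow_nlp_pretrained_embeddings_file_path = "s3://my-bucket/embeddings/glove.6B.300d.txt"  # placeholder bucket/path
tensorflow_nlp_pretrained_s3_access_key_id = "AKIA..."                                          # placeholder key ID
tensorflow_nlp_pretrained_s3_secret_access_key = "..."                                          # placeholder secret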
tensorflow_nlp_pretrained_embeddings_trainable
For TensorFlow NLP, allow training of unfrozen pretrained embeddings (in addition to fine-tuning of the rest of the graph) (Boolean)
Default value False
Allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If disabled, then the embedding layer is frozen, but all other weights are still fine-tuned.
pytorch_tokenizer_parallel
pytorch_tokenizer_parallel (Boolean)
Default value True
Whether to parallelize tokenization for BERT Models/Transformers.
pytorch_nlp_fine_tuning_num_epochs
Number of epochs for fine-tuning of PyTorch NLP models. (Number)
Default value -1
Number of epochs for fine-tuning of PyTorch NLP models. Larger values can increase accuracy but take longer to train.
pytorch_nlp_fine_tuning_batch_size
Batch size for PyTorch NLP models. -1 for automatic. (Number)
Default value -1
Batch size for PyTorch NLP models. Larger models and larger batch sizes will use more memory.
pytorch_nlp_fine_tuning_padding_length
Maximum sequence length (padding length) for PyTorch NLP models. -1 for automatic. (Number)
Default value -1
Maximum sequence length (padding length) for PyTorch NLP models. Larger models and larger padding lengths will use more memory.
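The three fine-tuning knobs above default to -1 (automatic). An illustrative override for tighter control of runtime and GPU memory (the specific numbers are assumptions, not recommendations):
pytorch_nlp_fine_tuning_num_epochs = 3        # more epochs: potentially higher accuracy, longer training
pytorch_nlp_fine_tuning_batch_size = 16       # smaller batches use less memory
pytorch_nlp_fine_tuning_padding_length = 128  # shorter sequences use less memory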
pytorch_nlp_pretrained_models_dir
Path to pretrained PyTorch NLP models. If empty, will get models from S3 (String)
Default value ''
Path to pretrained PyTorch NLP models. Note that this can be either a path in the local file system (/path/on/server/to/bert_models_folder), a URL, or an S3 location (s3://).
To get all models, download http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/bert_models.zip, unzip it, and store it in a directory on the instance where DAI is installed. Then set
pytorch_nlp_pretrained_models_dir = /path/on/server/to/bert_models_folder
pytorch_nlp_pretrained_s3_access_key_id
S3 access key ID to use when pytorch_nlp_pretrained_models_dir is set to an S3 location. (String)
Default value ''
pytorch_nlp_pretrained_s3_secret_access_key
S3 secret access key to use when pytorch_nlp_pretrained_models_dir is set to an S3 location. (String)
Default value ''
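Analogous to the TensorFlow embeddings settings above, the model directory can also point at S3; a sketch with placeholder bucket and credentials (the local-path form is shown in the description above):
pytorch_nlp_pretrained_models_dir = "s3://my-bucket/pretrained/bert_models_folder"  # placeholder bucket/path
pytorch_nlp_pretrained_s3_access_key_id = "AKIA..."                                 # placeholder key ID
pytorch_nlp_pretrained_s3_secret_access_key = "..."                                 # placeholder secret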
text_fraction_for_text_dominated_problem
Fraction of text columns out of all features to be considered a text-dominated problem (Float)
Default value 0.3
Fraction of text columns out of all features to be considered a text-dominated problem
text_transformer_fraction_for_text_dominated_problem
Fraction of text transformers out of all transformers to trigger a text-dominated problem (Float)
Default value 0.3
Fraction of text transformers out of all transformers above which the problem is treated as text-dominated.
string_col_as_text_threshold
Threshold for string columns to be treated as text (0.0 - text, 1.0 - string) (Float)
Default value 0.3
Threshold for the average string-is-text score, as determined by internal heuristics. It decides whether a string column will be treated as text (for an NLP problem) or just as a standard categorical variable. Higher values will favor treating string columns as categoricals, lower values will favor treating string columns as text.
string_col_as_text_threshold_preview
string_col_as_text_threshold_preview (Float)
Default value 0.1
Threshold for string columns to be treated as text during preview. Should be less than string_col_as_text_threshold to allow data whose first 20 rows don’t look like text to still work for text-only transformers (0.0 - text, 1.0 - string).
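A sketch of the text-detection heuristics above, biased toward treating string columns as text and classifying the problem as text-dominated (all values illustrative):
text_fraction_for_text_dominated_problem = 0.2
text_transformer_fraction_for_text_dominated_problem = 0.2
string_col_as_text_threshold = 0.2          # lower values favor treating string columns as text
string_col_as_text_threshold_preview = 0.1  # keep below string_col_as_text_threshold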
tokenize_single_chars
Tokenize single characters. (Boolean)
Default value True
If disabled, 2 or more alphanumeric characters are required for a token in Text (Count and TF/IDF) transformers; otherwise, tokens are also created out of single alphanumeric characters. True means that ‘Street 3’ is tokenized into ‘Street’ and ‘3’, while False means that it’s tokenized into ‘Street’ only.
text_transformers_max_vocabulary_size
Max size of the vocabulary for text transformers. (List)
Default value [1000, 5000]
Max size (in tokens) of the vocabulary created during fitting of TF-IDF/Count-based text transformers (not CNN/BERT). If multiple values are provided, the first one is used for initial models, and the remaining values are used during parameter tuning and feature evolution. Values smaller than 10000 are recommended for speed.
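For example, to try a small vocabulary for the initial models and larger ones during tuning and evolution (illustrative values, kept below 10000 as recommended for speed):
text_transformers_max_vocabulary_size = [1000, 5000, 9000]  # first value for initial models, remaining values explored during tuning/evolution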