Google BigQuery

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use them without authentication.

In order to enable the GBQ data connector with authentication, you must:

Obtain a JSON authentication file from GCP.
Mount the JSON file to the Docker instance.
Specify the path to the mounted JSON file (for example, /json_auth_file.json) in the gcs_path_to_service_account_json setting.
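Before mounting the file, it can help to confirm that it looks like a service account key. The following is a minimal sketch, not part of Driverless AI: the function names are hypothetical, and the list of expected fields is an assumption based on fields commonly found in Google Cloud service account key files, not a statement of what Driverless AI itself requires.

```python
import json

# Fields commonly present in a Google Cloud service account key file.
# Assumption: this set is illustrative; Driverless AI may rely on others.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(json_text):
    """Return the expected fields that are absent or empty in the key file text."""
    data = json.loads(json_text)
    return [name for name in EXPECTED_FIELDS if not data.get(name)]


def check_key_file(path):
    """Read the key file at `path` and return its missing fields."""
    with open(path, encoding="utf-8") as handle:
        return missing_key_fields(handle.read())
```

An empty list from check_key_file("/json_auth_file.json") suggests the file is at least structurally plausible. It does not confirm that the account has BigQuery or Cloud Storage permissions.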

Notes:

The service account JSON contains the credentials granted by your system administrator. The file you receive may grant access to both Google Cloud Storage and Google BigQuery, to only one of them, or to neither.
The Google BigQuery APIs limit the amount of data that can be exported to a single file to 1 GB. Any query whose results exceed this limit will fail.

Google BigQuery with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. It assumes that the JSON file contains Google BigQuery credentials.

Export the path to the Driverless AI config.toml file as an environment variable, or add the export to ~/.bashrc. For example:

export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"

Edit the following settings in the config.toml file.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
enabled_file_systems = "file, gbq"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"

Save the changes when you are done, then restart Driverless AI.
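A quick sanity check of the two values before restarting can catch common mistakes, such as a missing gbq entry or a path that does not exist inside the container. The sketch below is illustrative only: the function names are hypothetical, and treating enabled_file_systems as a comma-separated list with optional spaces is an assumption drawn from the example value "file, gbq".

```python
import os


def parse_enabled_file_systems(value):
    """Split a comma-separated connector list such as "file, gbq" into names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def gbq_config_problems(enabled_file_systems, key_path):
    """Return a list of human-readable problems with the two GBQ settings."""
    problems = []
    if "gbq" not in parse_enabled_file_systems(enabled_file_systems):
        problems.append("gbq is not listed in enabled_file_systems")
    if not os.path.isfile(key_path):
        problems.append("service account JSON not found at " + key_path)
    return problems
```

For example, gbq_config_problems("file, gbq", "/service_account_json.json") returns an empty list only when the file exists at that path in the environment where the check runs.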

After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or Drag and Drop) drop-down menu.

Specify the following information to add your dataset.

Enter BigQuery Dataset ID to query: Enter a dataset from your user-owned Google Cloud project. Users need read/write access to this dataset. BigQuery uses this dataset as the location for the new table generated by the query.
Enter Google Storage destination path: Specify a destination path in Google Cloud Storage to store the dataset. Users need read/write access to the Google Storage bucket. BigQuery exports the new table created by the query to this path. This should be a full path, including the filename and file type extension (for example, gs://mybucket/myfile.csv).
Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute. For example: SELECT * FROM <my_dataset>.<my_table>.
When you are finished, select the Click to Make Query button to add the dataset.
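The three inputs above can be checked before submitting the form. The following is a rough sketch under stated assumptions: the function names are hypothetical, and the pattern only encodes the rule described here (a gs:// path with a bucket, a filename, and a file type extension). It does not reproduce any validation that Driverless AI or BigQuery performs.

```python
import re

# gs://<bucket>/<optional folders>/<name>.<extension>
_GCS_FILE_PATTERN = re.compile(r"^gs://[^/]+/(?:[^/]+/)*[^/]+\.[A-Za-z0-9]+$")


def is_full_gcs_file_path(path):
    """True if `path` names a file with an extension inside a gs:// bucket."""
    return bool(_GCS_FILE_PATTERN.match(path))


def query_form_problems(dataset_id, destination_path, query):
    """Return a list of problems with the three Add Dataset inputs."""
    problems = []
    if not dataset_id.strip():
        problems.append("dataset ID is empty")
    if not is_full_gcs_file_path(destination_path):
        problems.append("destination must look like gs://bucket/file.ext")
    if not query.strip():
        problems.append("query is empty")
    return problems
```

With this sketch, gs://mybucket/myfile.csv passes, while gs://mybucket/ and gs://mybucket/myfile are rejected because they lack a filename or an extension.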