Hive 设置¶

Driverless AI 让您能从 Driverless AI 应用程序内搜索 Hive 数据源。本节介绍如何配置 Driverless AI 与 Hive 配合使用。

请注意：根据您所安装的 Docker 版本，在启动 Driverless AI Docker 映像时，使用 docker run --runtime=nvidia (>= Docker 19.03) 或 nvidia-docker (< Docker 19.03) 命令。使用 docker version 检查所使用的 Docker 版本。

配置属性说明¶

enabled_file_systems: 您要启用的文件系统。为使数据连接器正常运行，必须进行此项配置。
hive_app_configs: Hive 连接器配置。输入与配置 HDFS 连接器相似。重要密钥包括：
- hive_conf_path: Hive 配置路径。这可以有多个文件（例如，hive-site.xml、hdfs-site.xml 等。）
- auth_type: 将``noauth``、keytab``或 ``keytabimpersonation 中的一种指定用于 Keberos 身份验证。
- keytab_path: 指定用于身份验证的 Kerberos 密钥表的路径（如果使用 auth_type='noauth'，则可以为 “”）
- Principal_user: 指定 Kerberos 应用程序主用户（使用 auth_type='keytab' 或 auth_type='keytabimpersonation 时需要指定）

此配置应为具有多个密钥的 JSON/字典字符串。例如：

"""{
  "hive_connection_1": {
   "hive_conf_path": "/path/to/hive/conf",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename>.keytab",
   "principal_user": "hive/LOCALHOST@H2O.AI",
  },
  "hive_connection_2": {
   "hive_conf_path": "/path/to/hive/conf_2",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename_2>.keytab",
   "principal_user": "my_user/LOCALHOST@H2O.AI",
  }
}"""
请注意：hive_app_configs``的预期输入为 `JSON 字符串 <https://docs.python.org/3/library/json.html>`__. 必须使用双引号 (“…”) 来表示 JSON 字典 *内* 的密钥和值，*外* 引号必须为”””、’’’`` 或 ``’``的格式。根据配置值的应用方式，可能需要不同形式的外引号。以下示例展示了两种应用外引号的独特方法。
通过 config.toml 文件应用的配置值：
hive_app_configs = """{"my_json_string": "value", "json_key_2": "value2"}"""
通过环境变量应用的配置值：
DRIVERLESS_AI_HIVE_APP_CONFIGS='{"my_json_string": "value", "json_key_2": "value2"}'

Hive_app_jvm_args: 在需要 JAAS 的情况下，指定用于 Hive 连接器的其他 Java 虚拟机 (JVM) 参数。每个参数必须用空格隔开。以下示例展示如何指定此 config.toml 选项。

hive_app_jvm_args = "-Xmx20g -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dai/jaas.conf"
请注意：
Kerberos 身份验证和模拟登录要求设置 -Djavax.security.auth.useSubjectCredsOnly=false 默认参数。

需要设置 -Djava.security.auth.login.config=/etc/dai/jaas.conf 默认参数，以允许底层的连接器进程采用 /etc/dai/jaas.conf 中定义的 Kerberos 登录属性。您必须创建 jaas.conf 文件并将其放置于指定目录中。以下示例展示如何指定 jaas.conf 文件。
com.sun.security.jgss.initiate {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
principal="super-qa/mr-0xg9.0xdata.loc@H2OAI.LOC" [Replace this line]
doNotPrompt=true
keyTab="/etc/dai/super-qa.keytab" [Replace this line]
debug=true;
};

启用有身份验证的 Hive¶

本节介绍如何在于 Docker 中启动 Driverless AI 时启用 Hive。通过在 nvidia-docker run 命令中指定每个环境变量，或通过在 config.toml 文件中编辑此配置选项，然后在 nvidia-docker run 命令中指定此文件即可。

启动 Driverless AI Docker 映像。

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,hive" \
  -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{"hive_connection_2: {"hive_conf_path":"/etc/hadoop/conf",
                                                           "auth_type":"keytabimpersonation",
                                                           "keytab_path":"/etc/dai/steam.keytab",
                                                           "principal_user":"steam/mr-0xg9.0xdata.loc@H2OAI.LOC"}}' \
  -p 12345:12345 \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
  -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
  -u $(id -u):${id -g) \
  h2oai/dai-centos7-x86_64:1.9.2.1-cuda10.0.xx

本示例展示如何在 config.toml 文件中配置 Hive 选项并在于 Docker 中启动 Driverless AI 时指定此文件。

在 Driverless AI config.toml 文件中启用并配置 Hive 连接器。Hive 连接器配置必须为带有多个密钥的 JSON/字典字符串。

enabled_file_systems = "file, hdfs, s3, hive"
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/Downloads/hive.keytab",
                                  "hive_conf_path": "/path/to/Downloads/hive-resources",
                                  "principal_user": "hive/localhost@H2O.AI"}}"""

将 config.toml 文件挂载至 Docker 容器。

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro /
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
  -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:1.9.2.1-cuda10.0.xx

此操作可启用 Hive 连接器。

导出 Driverless AI config.toml或将其添加至 ~/.bashrc。

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

在 config.toml 文件中指定以下配置选项。

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs, s3, hive"


# Configuration for Hive Connector
# Note that inputs are similar to configuring HDFS connectivity
# Important keys:
# * hive_conf_path - path to hive configuration, may have multiple files. Typically: hive-site.xml, hdfs-site.xml, etc
# * auth_type - one of `noauth`, `keytab`, `keytabimpersonation` for kerberos authentication
# * keytab_path - path to the kerberos keytab to use for authentication, can be "" if using `noauth` auth_type
# * principal_user = Kerberos app principal user. Required when using auth_type `keytab` or `keytabimpersonation`
# JSON/Dictionary String with multiple keys. Example:
# """{
# "hive_connection_1": {
# "hive_conf_path": "/path/to/hive/conf",
# "auth_type": "one of ['noauth', 'keytab', 'keytabimpersonation']",
# "keytab_path": "/path/to/<filename>.keytab",
# principal_user": "hive/LOCALHOST@H2O.AI",
# }
# }"""
#
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/Downloads/hive.keytab",
                                  "hive_conf_path": "/path/to/Downloads/hive-resources",
                                  "principal_user": "hive/localhost@H2O.AI"}}"""

完成后，保存更改，然后停止/重启 Driverless AI。

使用 Hive 添加数据集¶

启用 Hive 连接器后，您可以通过在 添加数据集（或拖放） 下拉菜单中选择 Hive 来添加数据集。

选择您想要使用的 Hive 配置。

指定以下信息以添加数据集。

Hive 数据库：指定您要查询的 Hive 数据库名称。

Hadoop 配置路径：指定 Hive 配置文件的路径。

Hive Kerberos 密钥表路径：指定 Hive Kerberos 密钥表的路径。

Hive Kerberos 主体：指定 Hive Kerberos 主体。如果 Hive 身份验证类型为 keytabimpersonation，则需要指定。

Hive 身份验证类型：指定身份验证类型。类型可以为 noauth、密钥表或 keytabimpersonation。

输入要保存为的数据集名称：可选择为您所上传的数据集指定新名称。

SQL Query：指定您想要执行的 Hive 查询。