Import Dataset with Hive Connector

First, we’ll initialize a client with our server credentials and store it in the variable dai.

[22]:
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username="py", password="py")

We can check that the Hive connector has been enabled on the Driverless AI server.

[23]:
dai.connectors.list()
[23]:
['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url', 'hive']
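
If later cells should fail early when the connector is missing, a small guard can wrap this check. This is a minimal sketch: require_connector is a hypothetical helper, not part of the driverlessai client, and the list is copied from the output above so the example runs without a server.

```python
def require_connector(enabled, name):
    """Return `name` if it is among the enabled connectors, else raise."""
    if name not in enabled:
        raise RuntimeError(
            f"Connector {name!r} is not enabled on the server; "
            f"enabled connectors: {sorted(enabled)}"
        )
    return name


# Connector list copied from the notebook output (assumption: no live server here).
enabled = ['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url', 'hive']
source = require_connector(enabled, 'hive')
# With a live client this would be: require_connector(dai.connectors.list(), 'hive')
```

Raising an error here, rather than letting the import call fail later, points directly at the server configuration as the cause.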

The Hive connector is considered an advanced connector, so the create methods require a data_source_config argument when it is used.

User Defined Hive Configuration

Here we manually specify the Hive configurations directory, authentication type, keytab file path, and Kerberos principal.
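
As a minimal sketch, the four settings named here can be gathered and checked by a small helper before they are passed as data_source_config. The helper user_defined_hive_config is hypothetical and not part of the driverlessai client; the keys and sample values are the ones used in this notebook, and the check only catches empty values, not wrong paths.

```python
def user_defined_hive_config(conf_path, keytab_path, principal):
    """Build a keytab-based Hive config dict (hypothetical helper)."""
    settings = {
        "hive_conf_path": conf_path,       # Hive configurations directory
        "hive_auth_type": "keytab",        # authentication type used in this notebook
        "hive_keytab_path": keytab_path,   # keytab file path
        "hive_principal_user": principal,  # Kerberos principal
    }
    # Reject empty values before anything is sent to the server.
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing Hive settings: {', '.join(missing)}")
    return settings


config = user_defined_hive_config(
    conf_path="/opt/hive/current/conf",
    keytab_path="/opt/hive/current/hive.keytab",
    principal="kadmin/admin@KDC.LOCAL",
)
# `config` can then be passed as data_source_config to dai.datasets.create.
```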

[24]:
dataset_from_hive = dai.datasets.create(
    data_source="hive",
    name="From Hive user defined config",
    data="SELECT * FROM AirlinesTest WHERE distance > 0",
    data_source_config=dict(
        hive_conf_path="/opt/hive/current/conf",
        hive_auth_type="keytab",
        hive_keytab_path="/opt/hive/current/hive.keytab",
        hive_principal_user="kadmin/admin@KDC.LOCAL",
    ),
    force=True,
)

dataset_from_hive.head()
Complete 100.00% - [4/4] Computed stats for column isdepdelayed_rec
[24]:
fyear    fmonth  fdayofmonth  fdayofweek  deptime  arrtime  uniquecarrier  origin  dest   distance  isdepdelayed  isdepdelayed_rec
"f1987"  "f10"   "f15"        "f4"        729      903      "PS"           "SAN"   "SFO"  447       "NO"          -1
"f1987"  "f10"   "f17"        "f6"        741      918      "PS"           "SAN"   "SFO"  447       "YES"         1
"f1987"  "f10"   "f22"        "f4"        728      852      "PS"           "SAN"   "SFO"  447       "NO"          -1
"f1987"  "f10"   "f24"        "f6"        929      1052     "PS"           "SFO"   "RNO"  192       "YES"         1
"f1987"  "f10"   "f6"         "f2"        1505     1607     "PS"           "BUR"   "OAK"  325       "NO"          -1

Predefined Hive Configuration

Here we use a predefined configuration that was set up on the Driverless AI server. We only need to specify the configuration name and the authentication type.
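
The predefined variant can be sketched as a function that assembles the keyword arguments for the create call. The helper predefined_hive_request is hypothetical and not part of the driverlessai client; it only builds a dict and does not contact the server. The configuration name "kerberized" is the one used in this notebook and is assumed to exist on the server.

```python
def predefined_hive_request(name, query, config_name, auth_type="keytab"):
    """Assemble keyword arguments for dai.datasets.create (hypothetical helper)."""
    if not config_name:
        raise ValueError("config_name must name a configuration defined on the server")
    return dict(
        data_source="hive",
        name=name,
        data=query,
        data_source_config=dict(
            hive_default_config=config_name,  # name of the server-side configuration
            hive_auth_type=auth_type,         # authentication type, as in this notebook
        ),
        force=True,
    )


request = predefined_hive_request(
    name="From Hive pre-defined config",
    query="SELECT * FROM AirlinesTest WHERE distance > 0",
    config_name="kerberized",
)
# With a live client: dataset_from_hive = dai.datasets.create(**request)
```

Keeping paths and principals on the server means the notebook carries only the configuration name, so the same cell can run against servers with different Hive installations.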

[25]:
dataset_from_hive = dai.datasets.create(
    data_source="hive",
    name="From Hive pre-defined config",
    data="SELECT * FROM AirlinesTest WHERE distance > 0",
    data_source_config=dict(
        hive_default_config="kerberized",
        hive_auth_type="keytab",
    ),
    force=True,
)

dataset_from_hive.head()
Complete 100.00% - [4/4] Computed stats for column isdepdelayed_rec
[25]:
fyear    fmonth  fdayofmonth  fdayofweek  deptime  arrtime  uniquecarrier  origin  dest   distance  isdepdelayed  isdepdelayed_rec
"f1987"  "f10"   "f15"        "f4"        729      903      "PS"           "SAN"   "SFO"  447       "NO"          -1
"f1987"  "f10"   "f17"        "f6"        741      918      "PS"           "SAN"   "SFO"  447       "YES"         1
"f1987"  "f10"   "f22"        "f4"        728      852      "PS"           "SAN"   "SFO"  447       "NO"          -1
"f1987"  "f10"   "f24"        "f6"        929      1052     "PS"           "SFO"   "RNO"  192       "YES"         1
"f1987"  "f10"   "f6"         "f2"        1505     1607     "PS"           "BUR"   "OAK"  325       "NO"          -1