Using Enterprise Steam with Python
----------------------------------

This section describes how to use Enterprise Steam from Python. Note that each Python request results in a warning message; these warnings can be ignored.

Downloading and Installing
~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Go to https://s3.amazonaws.com/steam-release/enterprise-steam/latest-stable.html to retrieve the latest version of Enterprise Steam.

2. On the Steam API tab, download the Python package.

3. Open a Terminal window, and navigate to the location where the Python ``.whl`` file was downloaded. For example:

   ::

       cd ~/Downloads

4. Install Enterprise Steam for Python using ``pip install``. For example:

   ::

       pip install h2osteam-1.4.4-py2.py3-none-any.whl

``login``
~~~~~~~~~

In Python, use the ``login`` function to log in to your Enterprise Steam web server. Note that you must already have a username and a password. The web server URL, your username, and your password are provided to you by your Enterprise Steam Admin.

::

    $ python
    >>> import h2osteam
    >>> conn = h2osteam.login(url="https://steam.0xdata.loc", verify_ssl=False, username="jsmith", password="jsmith")

``start_h2o_cluster``
~~~~~~~~~~~~~~~~~~~~~

Use the ``start_h2o_cluster`` function to create a new cluster. This function takes the following parameters:

- ``cluster_name``: Specify a name for this cluster.
- ``profile_name``: Specify the profile to use for this cluster.
- ``num_nodes``: Specify the number of nodes for the cluster.
- ``node_memory``: Specify the amount of memory that should be available on each node.
- ``v_cores``: Specify the number of virtual cores.
- ``n_threads``: Specify the number of threads (CPUs) to use in the cluster. Specify 0 to use all available threads.
- ``max_idle_time``: Specify the maximum number of hours that the cluster can be idle before gracefully shutting down. Specify 0 to turn off this setting and allow the cluster to remain idle for an unlimited amount of time.
- ``max_uptime``: Specify the maximum number of hours that the cluster can be running. Specify 0 to turn off this setting and allow the cluster to remain up for an unlimited amount of time.
- ``extramempercent``: Specify the amount of extra memory for internal JVM use outside of the Java heap. This is a percentage of memory per node. The default (and recommended) value is 10%.
- ``h2o_version``: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
- ``yarn_queue``: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN queue name cannot contain spaces.
- ``callback_ip``: Optionally specify the IP address for callback messages from the mapper to the driver (driverif).

::

    >>> cluster_config = conn.start_h2o_cluster(cluster_name='first-cluster-from-Python',
    ...                                         profile_name='default',
    ...                                         num_nodes=2,
    ...                                         node_memory='30g',
    ...                                         h2o_version="3.22.0.1",
    ...                                         max_idle_time=1,
    ...                                         max_uptime=1)

    # Call the cluster to retrieve its ID and configuration params.
    >>> cluster_config
    {'id': 107, 'connect_params': {'cookies': [u'first-cluster-from-Python=YW5nZWxhOmdrZm53aGJsdWY='], 'ip': 'steam.0xdata.loc', 'context_path': u'jsmith_first-cluster-from-Python', 'verify_ssl_certificates': False, 'https': True, 'port': 9999}}
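
The remaining optional parameters can be passed in the same call. The following is a minimal sketch only; the thread count, extra memory percentage, and YARN queue name below are illustrative placeholders, so substitute values that exist in your environment.

::

    >>> cluster_config = conn.start_h2o_cluster(cluster_name='tuned-cluster-from-Python',
    ...                                         profile_name='default',
    ...                                         num_nodes=2,
    ...                                         node_memory='30g',
    ...                                         n_threads=0,           # 0 = use all available threads
    ...                                         extramempercent=10,    # default extra JVM memory per node
    ...                                         yarn_queue='default',  # placeholder queue name
    ...                                         h2o_version="3.22.0.1",
    ...                                         max_idle_time=1,
    ...                                         max_uptime=1)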

Note that after you create a cluster, you can immediately connect to that cluster and begin using H2O. Refer to the following for a complete Python example.

::

    >>> import h2o
    >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
    >>> h2o.connect(config=cluster_config)

    # import the cars dataset:
    # this dataset is used to classify whether or not a car is economical based on
    # the car's displacement, power, weight, acceleration, and the year it was made
    >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

    # convert the response column to a factor
    >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

    # set the predictor names and the response column name
    >>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
    >>> response = "economy_20mpg"

    # split into train and validation sets
    >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)

    # initialize your estimator
    >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)

    # train your model, specifying your 'x' predictors, your 'y' response column,
    # the training_frame, and the validation_frame
    >>> cars_gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

    # print the AUC for the validation data
    >>> cars_gbm.auc(valid=True)

``get_h2o_cluster``
~~~~~~~~~~~~~~~~~~~

Use the ``get_h2o_cluster`` function to retrieve information about a specific cluster by its name.

::

    >>> conn.get_h2o_cluster('first-cluster-from-Python')
    {'id': 108, 'connect_params': {'cookies': [u'first-cluster-from-Python=YW5nZWxhOnA1bHRreHN5amo='], 'ip': 'steam.0xdata.loc', 'context_path': u'jsmith_first-cluster-from-Python', 'verify_ssl_certificates': False, 'https': True, 'port': 9999}}

``get_h2o_clusters``
~~~~~~~~~~~~~~~~~~~~

Use the ``get_h2o_clusters`` function to retrieve all running H2O clusters accessible to the current user.

::

    >>> conn.get_h2o_clusters()

``stop_h2o_cluster``
~~~~~~~~~~~~~~~~~~~~

Use the ``stop_h2o_cluster`` function to stop a cluster.

::

    >>> conn.stop_h2o_cluster(cluster_config)

``show_profiles``
~~~~~~~~~~~~~~~~~

Use the ``show_profiles`` function to show the available profiles.

::

    >>> conn.show_profiles()
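
The cluster functions above can be combined to look up a running cluster by name and shut it down. The sketch below assumes that the object returned by ``get_h2o_cluster`` can be passed to ``stop_h2o_cluster``, since it has the same structure as the object returned by ``start_h2o_cluster``.

::

    >>> cluster = conn.get_h2o_cluster('first-cluster-from-Python')   # look the cluster up by name
    >>> conn.stop_h2o_cluster(cluster)                                # stop it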

``start_internal_sparkling_cluster``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``start_internal_sparkling_cluster`` function to create a new Sparkling Water cluster using the internal backend. This function takes the following parameters:

- ``cluster_name``: Specify a name for this cluster.
- ``profile_name``: Specify the profile to use for this cluster.
- ``h2o_version``: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
- ``driver_cores``: Specify the number of Spark driver cores.
- ``driver_memory_gb``: Specify the amount of Spark driver memory in GB.
- ``num_executors``: Specify the number of Spark executors.
- ``executor_cores``: Specify the number of Spark executor cores.
- ``executor_memory_gb``: Specify the amount of Spark executor memory in GB.
- ``h2o_node_threads``: Specify the number of threads (CPUs) to use per node. Specify 0 to use all available threads.
- ``start_timeout_sec``: Specify the start timeout in seconds.
- ``yarn_queue``: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN queue name cannot contain spaces.
- ``spark_properties``: Specify additional Spark properties as a Python dictionary.

::

    >>> cluster = conn.start_internal_sparkling_cluster(cluster_name="test",
    ...                                                 profile_name="default-sparkling-internal",
    ...                                                 h2o_version="3.22.0.1",
    ...                                                 driver_cores=1,
    ...                                                 driver_memory_gb=1,
    ...                                                 num_executors=1,
    ...                                                 executor_cores=1,
    ...                                                 executor_memory_gb=1,
    ...                                                 h2o_node_threads=0,
    ...                                                 start_timeout_sec=90,
    ...                                                 yarn_queue=None,
    ...                                                 spark_properties={'spark.python.worker.reuse': 'true', 'key': 'val'})

``start_external_sparkling_cluster``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``start_external_sparkling_cluster`` function to create a new Sparkling Water cluster using the external backend. This function takes the following parameters:

- ``cluster_name``: Specify a name for this cluster.
- ``profile_name``: Specify the profile to use for this cluster.
- ``h2o_version``: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
- ``driver_cores``: Specify the number of Spark driver cores.
- ``driver_memory_gb``: Specify the amount of Spark driver memory in GB.
- ``num_executors``: Specify the number of Spark executors.
- ``executor_cores``: Specify the number of Spark executor cores.
- ``executor_memory_gb``: Specify the amount of Spark executor memory in GB.
- ``h2o_nodes``: Specify the number of H2O nodes for the cluster.
- ``h2o_node_memory_gb``: Specify the amount of memory that should be available on each H2O node.
- ``h2o_node_threads``: Specify the number of threads (CPUs) to use per node. Specify 0 to use all available threads.
- ``start_timeout_sec``: Specify the start timeout in seconds.
- ``yarn_queue``: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN queue name cannot contain spaces.
- ``spark_properties``: Specify additional Spark properties as a Python dictionary.

::

    >>> cluster = conn.start_external_sparkling_cluster(cluster_name="test",
    ...                                                 profile_name="default-sparkling-external",
    ...                                                 h2o_version="3.22.0.1",
    ...                                                 driver_cores=1,
    ...                                                 driver_memory_gb=1,
    ...                                                 num_executors=1,
    ...                                                 executor_cores=1,
    ...                                                 executor_memory_gb=1,
    ...                                                 h2o_nodes=1,
    ...                                                 h2o_node_memory_gb=1,
    ...                                                 h2o_node_threads=0,
    ...                                                 start_timeout_sec=90,
    ...                                                 yarn_queue=None,
    ...                                                 spark_properties={'spark.python.worker.reuse': 'true', 'key': 'val'})

``sparkling_cluster.session``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``session`` function of a Sparkling Water cluster to connect to the remote Spark session and issue commands.

::

    >>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
    >>> sparkling_cluster.session()

``sparkling_cluster.send_statement``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``send_statement`` function of a Sparkling Water cluster to send a single statement to the remote Spark session.

::

    >>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
    >>> sparkling_cluster.send_statement("f_crimes = h2o.import_file(path='../data/chicagoCrimes10k.csv', col_types=column_type)")

``sparkling_cluster.detail``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``detail`` function of a Sparkling Water cluster to get information about that Sparkling Water cluster.

::

    >>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
    >>> sparkling_cluster.detail()

``sparkling_cluster.stop``
~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``stop`` function of a Sparkling Water cluster to stop the cluster.

::

    >>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
    >>> sparkling_cluster.stop()
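
These Sparkling Water functions can be chained into a single workflow: start a cluster, send a statement to the remote Spark session, inspect the cluster, and stop it. The sketch below reuses the internal-backend example from above and assumes a profile named ``default-sparkling-internal`` exists in your installation; the statement sent is only a trivial placeholder.

::

    >>> sparkling_cluster = conn.start_internal_sparkling_cluster(cluster_name="sketch-cluster",
    ...                                                           profile_name="default-sparkling-internal",
    ...                                                           h2o_version="3.22.0.1",
    ...                                                           driver_cores=1,
    ...                                                           driver_memory_gb=1,
    ...                                                           num_executors=1,
    ...                                                           executor_cores=1,
    ...                                                           executor_memory_gb=1,
    ...                                                           h2o_node_threads=0,
    ...                                                           start_timeout_sec=90,
    ...                                                           yarn_queue=None,
    ...                                                           spark_properties={'spark.python.worker.reuse': 'true'})
    >>> sparkling_cluster.send_statement("1 + 1")   # run a single statement on the remote Spark session
    >>> sparkling_cluster.detail()                  # inspect the running cluster
    >>> sparkling_cluster.stop()                    # shut the cluster down when finished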

``upload_engine``
~~~~~~~~~~~~~~~~~

Use the ``upload_engine`` function to upload an H2O engine to Steam.

::

    >>> conn.upload_engine("~/Downloads/h2o-3.22.0.1-hdp2.4.zip")

``upload_sparkling_engine``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``upload_sparkling_engine`` function to upload a Sparkling Water engine to Steam.

::

    >>> conn.upload_sparkling_engine("~/Downloads/sparkling-water-2.3.17.zip")
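
Once an engine has been uploaded, its version can be referenced when starting a new cluster. This is a minimal sketch that assumes the uploaded archive above corresponds to H2O version 3.22.0.1; the versions actually available are determined by the engines that have been added to your Enterprise Steam installation.

::

    >>> conn.upload_engine("~/Downloads/h2o-3.22.0.1-hdp2.4.zip")
    >>> cluster_config = conn.start_h2o_cluster(cluster_name='cluster-on-uploaded-engine',
    ...                                         profile_name='default',
    ...                                         num_nodes=2,
    ...                                         node_memory='30g',
    ...                                         h2o_version="3.22.0.1",   # version of the uploaded engine
    ...                                         max_idle_time=1,
    ...                                         max_uptime=1)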