Running H₂O on Hadoop

This tutorial walks through downloading or building H₂O and explains the parameters used to launch H₂O on Hadoop from the command line.

Download the latest H₂O release:

$ wget http://h2o-release.s3.amazonaws.com/h2o/rel-mandelbrot/1/h2o-2.8.0.1.zip

Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory that contains H₂O’s driver jar files for Hadoop.

$ unzip h2o-2.8.0.1.zip
$ cd h2o-2.8.0.1/hadoop

To launch H₂O nodes and form a cluster on the Hadoop cluster, run:

$ hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output hdfsOutputDirName

To monitor your job, point your web browser at your standard JobTracker web UI. To access H₂O’s web UI, point your browser at the address of one of the launched H₂O instances. If you are unsure which address and port an instance is using, review the output of the launch command.

Parameters

h2o_driver_jar_file : Each major release of each Hadoop distribution has its own driver jar file, which you must use to launch H₂O. The driver jar files currently included in each H₂O build are h2odriver_cdh5.jar, h2odriver_hdp2.1.jar, and mapr2.1.3.jar.

jobtracker:port: This argument is optional. Without it, the JobTracker is typically assumed to be available at the default port for your distribution.

mapperXmx : The mapper size, that is, the amount of memory allocated to each H₂O node (1g in the command above).

nodes : The number of nodes requested to form the cluster.

output: The name of the output directory created for the mapper tasks (hdfsOutputDirName in the command above). The name must be unique to each instance of H₂O, because an existing directory cannot be overwritten.
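
Because the output directory name must differ for every launch, it can help to generate it rather than type it by hand. The following is a minimal sketch, not part of H₂O itself: the variable names (DRIVER_JAR, NODES, MAPPER_XMX_GB, OUTPUT_DIR, H2O_CMD) and the h2o_out_ prefix are illustrative assumptions, and the script only prints the launch command so it can be reviewed before it is run. It also prints the combined memory requested, taken here as the number of nodes multiplied by the per-node mapperXmx value.

```shell
#!/bin/sh
# Sketch: assemble an H2O launch command with a unique -output directory.
# All variable names below are illustrative; they are not defined by H2O.

DRIVER_JAR="h2odriver_hdp2.1.jar"   # choose the driver jar for your distribution
NODES=1                             # value passed to -nodes
MAPPER_XMX_GB=1                     # per-node memory in gigabytes, passed to -mapperXmx

# A timestamp plus the shell's process ID keeps the name distinct between launches.
OUTPUT_DIR="h2o_out_$(date +%Y%m%d_%H%M%S)_$$"

# Combined memory requested: number of nodes times memory per node.
TOTAL_GB=$((NODES * MAPPER_XMX_GB))

# Same argument order as the launch command shown earlier in this tutorial.
H2O_CMD="hadoop jar $DRIVER_JAR water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx ${MAPPER_XMX_GB}g -nodes $NODES -output $OUTPUT_DIR"

echo "Requesting $NODES node(s), ${TOTAL_GB}g of mapper memory in total"
echo "$H2O_CMD"
```

Run the script from the directory that holds the driver jar files, check the printed command, and then paste it into the shell to start the cluster.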
