AWS Setup Guide¶
This topic describes how to set up Puddle on Amazon AWS. It is one sample recipe for installing Puddle, not the only possible one. VPN ingress/egress rules might need to be customized depending on the user’s needs.
Recipe for Deploying on Public and Private Subnets¶
This recipe describes how to configure and deploy Puddle on a public subnet with both Redis and a database on private networks.
For quick navigation on AWS, we recommend that you pin the following to your toolbar:
VPC Configuration¶
Create two elastic IPs: one to be used for the NAT gateway and another one for management in case you need to log in to one of the EC2 machines in your network.
- Go to the VPC Dashboard and click Launch VPC Wizard.
- Select the following configuration:
- In the configuration menu, enter names for the public/private subnets. Note that you can keep the default values for the IPv4 CIDR blocks and use one of the created Elastic IPs as an “Elastic IP Allocation ID”.
Note: The VPC wizard will only allow you to create one public and one private subnet. Once you create a VPC with the wizard, add a new subnet in a different Availability Zone from the one created previously. This is needed later when configuring Amazon RDS, because an RDS DB subnet group must span at least two Availability Zones.
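When adding the extra subnet, its CIDR block must sit inside the VPC range without overlapping the subnets the wizard created. The following is a minimal sketch of how such a block could be picked, using only Python's standard `ipaddress` module. The CIDR values are assumptions for illustration (they are not taken from this guide), so substitute the ranges shown in your own VPC console. The function name `next_free_subnet` is hypothetical.

```python
import ipaddress


def next_free_subnet(vpc_cidr, used_cidrs, new_prefix=24):
    """Return the first block of size new_prefix inside vpc_cidr that
    overlaps none of the blocks in used_cidrs."""
    vpc = ipaddress.ip_network(vpc_cidr)
    used = [ipaddress.ip_network(c) for c in used_cidrs]
    for candidate in vpc.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(u) for u in used):
            return str(candidate)
    raise ValueError("no free subnet left in " + vpc_cidr)


if __name__ == "__main__":
    # Assumed example values: one VPC range and two existing subnets.
    print(next_free_subnet("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24"]))
```

The returned block can then be entered as the IPv4 CIDR of the new subnet, with a different Availability Zone selected.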
RDS for the Postgres Database¶
- Click on RDS on the navigation toolbar menu and select Databases from the left panel.
- Click Create Database and enter the following details:
- Type: PostgreSQL
- Version: 9.6.15-R1
- Credentials: Specify a user and a master password. These credentials will be configured in Puddle’s config.yaml.
- Connectivity: Select the VPC ID created in the previous section, and set Publicly Accessible to No.
- Expand the Additional configuration section and set the Initial database name to a value as shown in the image below:
Other settings are optional and will depend on your specific system configuration. Note that there might be fees associated with the type of RDS configuration you choose.
Set Up an Inbound Rule to Enable a Connection from the Public Subnet to the Database¶
After the RDS database is up and running, go to RDS and then select Databases from the left menu. Click on the newly created database.
From here, you can select the security group associated with the database and modify it to include an inbound rule matching the public subnet where the backend server machine will live. (See VPC Security Groups in the image below.)
ElastiCache for the Redis Cache¶
- Click on ElastiCache on the navigation toolbar menu and select Redis from the left panel.
- Click on CREATE and enter the details as required. Refer to the image below:
- Create a new subnet and select the private networks from the previously created VPC on the Advanced Redis settings section as shown in the following image:
Backend Server Host for Puddle¶
Spin up a new EC2 instance for an Ubuntu box and run the following commands to get dependencies installed:
Linux Packages¶
The following Linux packages will be used to check connectivity to PostgreSQL and Redis.
# Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y wget unzip redis-tools postgresql-client
# RHEL / CentOS (alternative to the apt-get commands above)
yum install postgresql redis
Confirm and Test the RDS and ElastiCache Installations¶
- Look up the primary endpoint for ElastiCache and make sure the following command succeeds and connects to Redis. (Do not use the read-only endpoint.)
redis-cli -h xxxxxxxxxxx.cac1.cache.amazonaws.com
- Look up the database endpoint from the RDS menu and run the following:
psql -U puddle -h xxxxxxxxxxx.ca-central-1.rds.amazonaws.com -p 5432 puddle
- Once connected to PostgreSQL using the previously defined user and password, install the “uuid-ossp” extension as follows:
CREATE EXTENSION "uuid-ossp";
- Validate that the extension is created using:
select * from pg_extension;
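If `redis-cli` or `psql` hang or fail, it can help to first confirm that the endpoints are reachable at the network level. The following is a minimal sketch that only tests whether a TCP connection can be opened. It does not authenticate or speak either protocol. The helper name `can_connect` is hypothetical. Port 5432 comes from the `psql` command above, and 6379 is assumed as the Redis port because it is the Redis default.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Usage: python check_endpoints.py <redis-endpoint>:6379 <rds-endpoint>:5432
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        state = "reachable" if can_connect(host, int(port)) else "NOT reachable"
        print(target, state)
```

A "NOT reachable" result usually points at security group rules or subnet routing rather than at Redis or PostgreSQL themselves.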
Configure Puddle AWS Authentication¶
This section describes a typical configuration of AWS Cognito to authenticate to Puddle.
- Click on Cognito from the navigation toolbar menu.
- Click on Manage User Pools.
- Click on the Create a user pool button at the top right of the page.
- Give your new user pool a name, e.g. “puddle-users”.
For this example, we will be using the default settings and will create a new app. Please note that you might need to modify these default settings further depending on your needs.
- Create a new app client by editing the App clients sections as shown in the image below:
- Go back to continue editing the user pool and then click on the Create pool button at the bottom of the page.
- On the newly created user pool, select Users and Groups on the left pane menu and create two groups as follows:
Note that the two configured group names above will be used for the config parameters “adminsGroup” and “usersGroup” to denote whether users will be treated as admins or non admins on the Puddle system.
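The configuration reference later in this guide describes how the `adminsGroup`, `usersGroup`, and `implicitGrant` settings decide a user's role. The sketch below restates those documented rules as a small function so they are easier to reason about. It is an illustration only, not Puddle's actual implementation, and the function name `puddle_role` is hypothetical.

```python
def puddle_role(user_groups, admins_group, users_group, implicit_grant=False):
    """Map a user's group memberships to a role name.

    Returns "Administrator", "User", or None when access is denied.
    """
    groups = set(user_groups)
    if admins_group in groups:
        # Membership in the admins group wins, even if the user is
        # also a member of the users group.
        return "Administrator"
    if users_group in groups or implicit_grant:
        # With implicit grant enabled, users outside both groups
        # still get access with the user role.
        return "User"
    return None


if __name__ == "__main__":
    print(puddle_role(["Puddle-Users"], "Puddle-Administrators", "Puddle-Users"))
```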
Configure AWS Provider¶
- Create a new security group to be used for the DAI and H2O-3 instances that Puddle creates with the following inbound rules:
| Port Number | Description |
|-------------|-------------|
| 80          | http        |
| 443         | https       |
| 22          | ssh         |
| 8888        | Jupyter     |
| 9999        | REST API    |
| 12345       | DAI web app |
| 54321       | H2O-3       |
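If you prefer to script the security group rather than click through the console, the ports listed above can be expressed as data. The sketch below builds a list in the shape of the `IpPermissions` argument accepted by the boto3 EC2 client's `authorize_security_group_ingress` call. It does not call AWS itself. The source CIDR is an assumption, because this guide does not say which source ranges to allow, and the names `PUDDLE_PORTS` and `ingress_permissions` are hypothetical.

```python
# Ports and descriptions from the inbound rules listed in this guide.
PUDDLE_PORTS = {
    80: "http",
    443: "https",
    22: "ssh",
    8888: "Jupyter",
    9999: "REST API",
    12345: "DAI web app",
    54321: "H2O-3",
}


def ingress_permissions(source_cidr, ports=PUDDLE_PORTS):
    """Build one TCP ingress rule per port, all allowing source_cidr."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": source_cidr, "Description": description}],
        }
        for port, description in sorted(ports.items())
    ]


if __name__ == "__main__":
    # Assumed source range; restrict it to the networks that need access.
    for rule in ingress_permissions("10.0.0.0/24"):
        print(rule["FromPort"], rule["IpRanges"][0]["Description"])
    # With boto3 installed and credentials configured, the list could be passed as:
    #   ec2.authorize_security_group_ingress(GroupId="<SG_ID>", IpPermissions=rules)
```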
- Capture the subnet ID from where you want the DAI and H2O-3 instances to be installed. Create a new one if needed.
- Create a new ssh key pair or use an existing pair. This key pair will be used for the ssh connection to the DAI and H2O-3 instances. Make sure to have this key available on the file system where the Puddle backend is running.
Enable Public IPv4 Addressing for the Public Subnet¶
- Open the Amazon VPC console and choose Subnets from the navigation pane.
- Select the public subnet and choose Subnet Actions > Modify auto-assign IP settings.
- Select the Enable auto-assign public IPv4 address check box. When selected, it requests a public IPv4 address for all DAI instances launched into the selected subnet.
Create the License File¶
- ssh into the Virtual Machine.
- Create a file /opt/h2oai/puddle/license.sig containing the license. A different path can be used, but this is the default.
Configuring Puddle¶
Now we will need to fill in the config.yaml file, which is located at /etc/puddle/config.yaml. The config.yaml should contain the following:
redis:
  connection:
    protocol: tcp
    address:
    password:
    tls: true
db:
  connection:
    drivername: postgres
    host:
    port: 5432
    user:
    dbname: puddle
    sslmode: require
    password:
tls:
  certFile:
  keyFile:
license:
  file: /opt/h2oai/puddle/license.sig
ssh:
  publicKey: /opt/h2oai/puddle/ssh/id_rsa.pub
  privateKey: /opt/h2oai/puddle/ssh/id_rsa
auth:
  token:
    secret:
  activeDirectory:
    enabled: false
    server:
    port: 389
    baseDN:
    security: tls
    objectGUIDAttr: objectGUID
    displayNameAttr: displayName
    administratorsGroup: Puddle-Administrators
    usersGroup: Puddle-Users
    implicitGrant: false
  azureAD:
    enabled: false
    useAADLoginExtension: true
  awsCognito:
    enabled: false
    userPoolId:
    userPoolWebClientId:
    domain:
    redirectSignIn:
    redirectSignOut:
    adminsGroup: Puddle-Administrators
    usersGroup: Puddle-Users
    implicitGrant: false
  ldap:
    enabled: false
    host:
    port: 389
    baseDN:
    baseDNGroup:
    bindDN:
    bindPassword:
    implicitGrant: false
    adminsGroup: Puddle-Administrators
    usersGroup: Puddle-Users
packer:
  path: /opt/h2oai/puddle/deps/packer
  usePublicIP: true
  buildTimeoutHours: 1
terraform:
  path: /opt/h2oai/puddle/deps/terraform
  usePublicIP: true
backend:
  baseUrl:
  connections:
    usePublicIP: true
webclient:
  usePublicIP: true
providers:
  azure:
    enabled: false
    authority:
    location:
    rg:
    vnetrg:
    vnet:
    sg:
    subnet:
    enterpriseApplicationObjectId:
    adminRoleId:
    publicIpEnabled: true
    packerInstanceType:
  aws:
    enabled: false
    owner:
    vpcId:
    sgId:
    subnetId:
    publicIpEnabled: true
    packerInstanceType:
products:
  dai:
    configTomlTemplatePath: "/opt/h2oai/puddle/configs/dai/config.toml"
    license:
logs:
  dir: /opt/h2oai/puddle/logs
  maxSize: 1000
  maxBackups: 15
  maxAge: 60
  compress: true
mailing:
  enabled: true
  server:
  username:
  password:
  fromAddress: puddle@h2o.ai
  fromName: Puddle
  recipients:
  offsetHours: 24
- ssh into the Virtual machine.
- Fill in the fields in the config.yaml.
- Values for redis.connection.* can be found in the following way:
  - Microsoft Azure:
    - Search for Azure Cache for Redis.
    - Select the newly created Redis instance.
    - Select Access keys.
  - Amazon AWS:
    - Go to the ElastiCache Dashboard.
    - Select Redis.
    - Select the cluster used by Puddle.
    - Select the Description tab.
- Values for db.connection.* can be found in the following way:
  - Microsoft Azure:
    - Search for Azure Database for PostgreSQL servers.
    - Select the newly created PostgreSQL instance.
    - Select Connection strings.
    - Use the password that was provided when creating the PostgreSQL database.
  - Amazon AWS:
    - Go to Amazon RDS.
    - Select Databases.
    - Select the database used by Puddle.
- tls.certFile should point to the PEM encoded certificate file if you want to use HTTPS. If you don’t want to use HTTPS, leave this property empty. If you set this property, then tls.keyFile must be set as well.
- tls.keyFile should point to the PEM encoded private key file if you want to use HTTPS. The private key must not be encrypted with a password. If you don’t want to use HTTPS, leave this property empty. If you set this property, then tls.certFile must be set as well.
- license.file should be the path to the file containing the license (created in the previous step).
- ssh.publicKey should be the path to the ssh public key (for example /opt/h2oai/puddle/ssh/id_rsa.pub), which will be used by Puddle to talk to the Systems. If this ssh key is changed, Puddle won’t be able to talk to the Systems created with the old key, and these will have to be destroyed.
- ssh.privateKey should be the path to the ssh private key (for example /opt/h2oai/puddle/ssh/id_rsa), which will be used by Puddle to talk to the Systems. If this ssh key is changed, Puddle won’t be able to talk to the Systems created with the old key, and these will have to be destroyed.
- auth.token.secret should be a random string. It is used to encrypt the tokens between the backend and frontend. For example, the following could be used to generate the secret:
tr -cd '[:alnum:]' < /dev/urandom | fold -w32 | head -n1
- auth.activeDirectory.enabled should be true/false and is false by default. If true, then authentication using ActiveDirectory is enabled.
- auth.activeDirectory.server should be the hostname of the ActiveDirectory server, for example puddle-ad.h2o.ai.
- auth.activeDirectory.port should be the port where ActiveDirectory is accessible, defaults to 389.
- auth.activeDirectory.baseDN should be the BaseDN used for search.
- auth.activeDirectory.security should be the security level used in communication with the AD server. Could be none, start_tls, or tls; defaults to tls.
- auth.activeDirectory.objectGUIDAttr should be the name of the attribute used as ID of the user, defaults to objectGUID.
- auth.activeDirectory.displayNameAttr should be the name of the attribute used to determine groups where user is member, defaults to memberOf.
- auth.activeDirectory.administratorsGroup should be the name of the Administrators group. Users in this group are assigned the Administrator role in Puddle. Users in both the Administrators group and the Users group are considered Administrators.
- auth.activeDirectory.usersGroup should be the name of the Users group. Users in this group are assigned the User role in Puddle. Users in both the Administrators group and the Users group are considered Administrators.
- auth.activeDirectory.implicitGrant should be true/false and is false by default. If true, then users are allowed access to Puddle (using the user role) even if they are members of neither the Administrators nor the Users group. If false, then users must be members of at least one group to be allowed access to Puddle.
- auth.azureAD.enabled should be true/false and is false by default. If true, then authentication using Azure Active Directory is enabled.
- auth.azureAD.useAADLoginExtension should be true/false and is false by default. If true, then ssh access to provisioned Virtual machines will use the Azure AD for authentication. Check https://docs.microsoft.com/en-us/azure/virtual-machines/linux/login-using-aad for more information. Cannot be enabled if using a proxy for egress.
- auth.awsCognito.enabled should be true/false and is false by default. If true, then authentication using AWS Cognito is enabled.
- auth.awsCognito.userPoolId should be the Pool Id, for example us-east-1_SlxxxxML1.
- auth.awsCognito.userPoolWebClientId should be the App client id.
  - The App client id can be found in the following way:
    - Go to the AWS Cognito User Pool used by Puddle.
    - Select the App client settings.
    - Use the value under ID.
- auth.awsCognito.domain should be the domain of the AWS Cognito User Pool, for example https://puddle.auth.<REGION>.amazoncognito.com.
  - The domain can be found in the following way:
    - Go to the AWS Cognito User Pool used by Puddle.
    - Select the Domain name.
- auth.awsCognito.redirectSignIn should be https://<SERVER_ADDRESS>/aws-cognito-callback. Replace <SERVER_ADDRESS> with the hostname where Puddle is running.
- auth.awsCognito.redirectSignOut should be https://<SERVER_ADDRESS>/logout. Replace <SERVER_ADDRESS> with the hostname where Puddle is running.
- auth.awsCognito.adminsGroup should be the name of a group in the AWS Cognito User Pool. If users are members of this group, they are assigned the Administrator role in Puddle.
- auth.awsCognito.usersGroup should be the name of a group in the AWS Cognito User Pool. If users are members of this group, they are assigned the User role in Puddle.
- auth.awsCognito.implicitGrant should be true/false and is false by default. If true, then users are allowed access to Puddle (using the user role) even if they are members of neither the Administrators nor the Users group. If false, then users must be members of at least one group to be allowed access to Puddle.
- auth.ldap.enabled should be true/false and is false by default. If true, then authentication using LDAP is enabled.
- auth.ldap.host should be the LDAP server hostname.
- auth.ldap.port should be the port where LDAP is accessible, defaults to 389.
- auth.ldap.baseDN should be the BaseDN where the authentication search will start.
- auth.ldap.baseDNGroup should be the BaseDN where the search for the user’s group will start.
- auth.ldap.bindDN should be the BindDN used by Puddle to query LDAP.
- auth.ldap.bindPassword should be the password of the user used by Puddle to query LDAP.
- auth.ldap.implicitGrant should be true/false and is false by default. If true, then users are allowed access to Puddle (using the user role) even if they are members of neither the Administrators nor the Users group. If false, then users must be members of at least one group to be allowed access to Puddle.
- auth.ldap.adminsGroup should be the name of the Administrators group. Users in this group are assigned the Administrator role in Puddle. Users in both the Administrators group and the Users group are considered Administrators.
- auth.ldap.usersGroup should be the name of the Users group. Users in this group are assigned the User role in Puddle. Users in both the Administrators group and the Users group are considered Administrators.
- packer.path should point to the packer binary. Defaults to /opt/h2oai/puddle/deps/packer.
- packer.usePublicIP should be true/false and is true by default. If true, then packer will use the public IP to communicate with the provisioned Virtual machines; otherwise the private IP will be used.
- packer.buildTimeoutHours should be the number of hours after which the packer build times out. Default is 1 hour.
- terraform.path should point to the terraform binary. Defaults to /opt/h2oai/puddle/deps/terraform.
- terraform.usePublicIP should be true/false and is true by default. If true, then terraform will use the public IP to communicate with the provisioned Virtual machines; otherwise the private IP will be used.
- backend.baseUrl should be the URL where Puddle is running, for example https://puddle.h2o.ai.
- backend.connections.usePublicIp should be true/false and is true by default. If true, then the backend will use the public IP to communicate with the provisioned Virtual machines; otherwise the private IP will be used.
- webclient.usePublicIp should be true/false and is true by default. If true, then the public IP is shown in the UI; otherwise the private IP is displayed.
- providers.azure.enabled should be true/false and is false by default. If true, then Microsoft Azure is enabled as a provider in Puddle. All variables under providers.azure must be set if enabled.
- providers.azure.authority should be set to https://login.microsoftonline.com/<Azure ActiveDirectory Name>.onmicrosoft.com.
  - The Azure Active Directory name can be found in the following way:
    - Go to the Azure Active Directory blade.
    - Select Overview.
- providers.azure.location should be set to the same value that was specified for the Resource group, for example eastus.
- providers.azure.rg should be set to the name of the newly created Resource group.
- providers.azure.vnetrg should be set to the name of the Resource group where the VNET and Subnet are present.
- providers.azure.vnet should be set to the id of the newly created Virtual network.
- providers.azure.sg should be set to the id of the newly created Network security group.
- providers.azure.subnet should be set to the id of the newly created Subnet.
- providers.azure.enterpriseApplicationObjectId should be the Object ID of the Enterprise Application.
  - The Enterprise Application Object ID can be found in the following way:
    - Go to the Azure Active Directory blade.
    - Select Enterprise Applications.
    - Select the newly created Enterprise Application.
    - Use the Object ID.
- providers.azure.adminRoleId should be set to the ID of the newly created Administrator Role in the Application Registration Manifest.
  - The Administrator Role ID can be found in the following way:
    - Go to the Azure Active Directory blade.
    - Select App registrations (preview).
    - Select the newly created App registration.
    - Select Manifest.
    - Search for the Administrator role under appRoles and use the ID of this role.
- providers.azure.publicIpEnabled should be true/false and is true by default. A public IP is created if and only if this is set to true. Must be set to true if at least one of packer, terraform, backend or webclient uses the public IP.
- providers.azure.packerInstanceType should be the instance type used by Packer to build images. Defaults to Standard_DS2_v2.
- providers.aws.enabled should be true/false and is false by default. If true, then Amazon AWS is enabled as a provider in Puddle. All variables under providers.aws must be set if enabled.
- providers.aws.owner should be the owner of the newly created resources.
- providers.aws.vpcId should be the ID of the VPC where Virtual machines will be launched.
- providers.aws.sgId should be the ID of the Security Group applied to provisioned Virtual machines.
- providers.aws.subnetId should be the ID of the Subnet where Virtual machines will be placed.
- providers.aws.publicIpEnabled should be true/false and is false by default. If false, then no public IP will be assigned. Must be set to true if at least one of packer, terraform, backend or webclient uses the public IP.
- providers.aws.packerInstanceType should be the instance type used by packer to build images, defaults to m5.large.
- products.dai.configTomlTemplatePath should be the path to a custom config.toml file, which will be used as the default configuration for all new Driverless AI Systems. If not set, the default file is used.
- products.dai.license should be the path to the Driverless AI license file. If set, then this license will be automatically installed on all provisioned systems.
- logs.dir should be set to a directory where logs should be placed.
- logs.maxSize should be the max size of a log file, in MB, defaults to 1000.
- logs.maxBackups should be the number of old files retained, defaults to 15.
- logs.maxAge should be the max age of retained files, in days, defaults to 60. Older files are always deleted.
- logs.compress should be true/false and is true by default. If true, then the files will be compressed when rotating.
- mailing.enabled should be true/false. If true, then mailing is enabled. All fields under mailing are mandatory if this is set to true.
- mailing.server should be the hostname and port of the SMTP server, for example smtp.example.com:587.
- mailing.username should be the client username.
- mailing.password should be the client password.
- mailing.fromAddress should be the email address used as FROM, for example in case of an address ‘<Puddle> puddle@h2o.ai’ this field should be set to puddle@h2o.ai.
- mailing.fromName should be the name used as FROM, defaults to Puddle, for example in case of an address ‘<Puddle> puddle@h2o.ai’ this field should be set to Puddle.
- mailing.recipients should be the space-separated list of recipients.
- mailing.offsetHours should be the number of hours between repeated email notifications, defaults to 24. It does not apply to FAILED system notifications.
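Several of the settings above depend on each other: the two tls fields must be set together, a provider's publicIpEnabled must be true when any component uses the public IP, and every mailing field is required once mailing is enabled. The following is a minimal sketch of a pre-flight check for those three rules. It works on an already parsed configuration (a nested dict, such as a YAML parser would return) and is not part of Puddle. The names `dig` and `check_config` are hypothetical, the key spelling usePublicIP follows the sample config.yaml, and missing keys fall back to the defaults stated in this guide.

```python
def dig(cfg, dotted_key, default=None):
    """Look up a dotted key such as 'backend.connections.usePublicIP'."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or node.get(part) is None:
            return default
        node = node[part]
    return node


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Rule 1: certFile and keyFile are either both set or both empty.
    if bool(dig(cfg, "tls.certFile")) != bool(dig(cfg, "tls.keyFile")):
        problems.append("tls.certFile and tls.keyFile must be set together")
    # Rule 2: public IPs must be enabled if any component relies on them.
    components = ("packer", "terraform", "backend.connections", "webclient")
    uses_public_ip = [c for c in components if dig(cfg, c + ".usePublicIP", True)]
    documented_defaults = {"azure": True, "aws": False}
    for provider, default in documented_defaults.items():
        prefix = "providers." + provider
        if (dig(cfg, prefix + ".enabled", False) and uses_public_ip
                and not dig(cfg, prefix + ".publicIpEnabled", default)):
            problems.append(prefix + ".publicIpEnabled must be true")
    # Rule 3: all mailing fields are mandatory when mailing is enabled.
    if dig(cfg, "mailing.enabled", False):
        for field in ("server", "username", "password", "fromAddress",
                      "fromName", "recipients", "offsetHours"):
            if dig(cfg, "mailing." + field) is None:
                problems.append("mailing." + field + " is required")
    return problems


if __name__ == "__main__":
    sample = {"tls": {"certFile": "/etc/puddle/cert.pem", "keyFile": None}}
    print(check_config(sample))
```

Running such a check before `systemctl start puddle` is one way to catch inconsistent settings early. The rules it encodes are only the ones spelled out in this section.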
Configuring Environment Variables¶
The next step is to fill in the variables in the EnvironmentFile, which is located at /etc/puddle/EnvironmentFile. The EnvironmentFile should contain the following:
# Should point to dir with config.yaml
PUDDLE_CONFIG_DIR='/etc/puddle/'
# AzureRM Provider should skip registering the Resource Providers
ARM_SKIP_PROVIDER_REGISTRATION=true
# Azure related environment variables, please fill-in all values if you use Azure as provider
# AZURE_SUBSCRIPTION_ID='YOUR-SUBSCRIPTION-ID'
# AZURE_TENANT_ID='YOUR-TENANT-ID'
# AZURE_CLIENT_ID='YOUR-CLIENT-ID'
# AZURE_CLIENT_SECRET='YOUR-CLIENT-SECRET'
# AWS related environment variables, please fill-in all values if you use AWS as provider
# AWS_ACCESS_KEY_ID='YOUR-AWS-ACCESS-KEY-ID'
# AWS_SECRET_ACCESS_KEY='YOUR-AWS-SECRET-ACCESS-KEY'
# AWS_REGION='AWS-REGION'
# General variables, delete those which are not necessary
# http_proxy=http://10.0.0.100:3128
# https_proxy=http://10.0.0.100:3128
# no_proxy=localhost,127.0.0.1
- PUDDLE_CONFIG_DIR is the directory where the config.yaml file is present.
- ARM_SKIP_PROVIDER_REGISTRATION tells the AzureRM Provider to skip registering the Resource Providers. This should be left as true.
- AZURE_SUBSCRIPTION_ID is the ID of the subscription that should be used.
  - This value can be found in the following way:
    - Search for Subscriptions.
    - Use the SUBSCRIPTION ID of the subscription you want to use.
- AZURE_TENANT_ID is the ID of the tenant that should be used.
  - This value can be found in the following way:
    - Select the Azure Active Directory blade.
    - Select App registrations (preview).
    - Select the newly created App registration.
    - Use Directory (tenant) ID.
- AZURE_CLIENT_ID is the Application ID that should be used.
  - This value can be found in the following way:
    - Select the Azure Active Directory blade.
    - Select App registrations (preview).
    - Select the newly created App registration.
    - Use Application (client) ID.
- AZURE_CLIENT_SECRET is the client secret that should be used.
  - This value can be found in the following way:
    - Select the Azure Active Directory blade.
    - Select App registrations (preview).
    - Select the newly created App registration.
    - Select Certificates & Secrets.
    - Click New client secret.
    - Fill in the form and click Add.
    - The secret value should be visible. Copy it because after refreshing the page, this value is gone and cannot be restored.
- AWS_ACCESS_KEY_ID is the AWS Access Key Id used by Puddle to access the AWS services.
- AWS_SECRET_ACCESS_KEY is the AWS Secret Access Key used by Puddle to access the AWS services.
- AWS_REGION is the AWS Region used by Puddle to access the AWS services.
- http_proxy is the URL of the proxy server to be used (if required), for example http://10.0.0.3:3128.
- https_proxy is the URL of the proxy server to be used (if required), for example http://10.0.0.3:3128.
- no_proxy is the comma-separated list of hosts that should be excluded from proxying, for example localhost,127.0.0.1.
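Because the AWS variables ship commented out in the template, it is easy to start Puddle with one of them still unset. The following is a minimal sketch that reads the text of an EnvironmentFile and reports which of the three AWS variables are missing. It handles only simple KEY=value lines with optional quotes, which is a simplification of the full systemd EnvironmentFile syntax. The function names are hypothetical.

```python
AWS_REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def parse_env_file(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding single or double quotes around the value.
        env[key.strip()] = value.strip().strip("'\"")
    return env


def missing_aws_vars(text):
    """Return the required AWS variable names that are unset or empty."""
    env = parse_env_file(text)
    return [name for name in AWS_REQUIRED if not env.get(name)]


if __name__ == "__main__":
    with open("/etc/puddle/EnvironmentFile") as handle:
        print(missing_aws_vars(handle.read()))
```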
Running Puddle¶
After all of the previous steps are successfully completed, we can now start Puddle. Execute the following command to start the server and web UI:
systemctl start puddle
Puddle is accessible on port 443 if HTTPS is enabled, or on port 80 if HTTP is being used.
First Steps¶
First, perform the following initialization steps:
- Log in to Puddle as the Administrator.
- Go to Administration > Check Updates.
- Either use the update plan from the default URL location, or specify a custom update plan file.
- Click Submit.
- Review the plan and click Apply.
- Go to Administration > Images.
- Build all the images you want to use. Please be aware this can take up to 1 hour.
Once the images are built, your Puddle instance is ready.
Stats Board (Optional)¶
The stats board is an optional component. It’s distributed as a Python wheel, and it requires Python 3.6. It’s recommended (although not necessary) to run the board inside a virtual environment.
Use the following to install the required dependencies:
# Ubuntu / Debian
apt install gcc libpq-dev python3.6-dev python-virtualenv
# RHEL / CentOS
yum install epel-release
yum install gcc postgresql-devel python36-devel python-virtualenv
Use the following to create the virtualenv:
mkdir -p /opt/h2oai/puddle/envs
cd /opt/h2oai/puddle/envs
virtualenv -p python3.6 puddle-stats-env
Please make sure that the virtualenv uses the same name and is available at the same path as in this provided snippet. Otherwise the systemd script used to manage Stats Board will not work.
Use the following to install the stats board. Please note that this command will install dependencies as well:
source /opt/h2oai/puddle/envs/puddle-stats-env/bin/activate
pip install puddle_stats_board-<VERSION>-py3-none-any.whl
Use the following to run the stats board:
systemctl start puddle-dashboard
The stats board is running on port 8050 and is accessible from Puddle UI at http://<PUDDLE_SERVER_ADDRESS>/board. There is a link in the Administration menu as well.