Driverless AI Health API

The following sections describe the Driverless AI Health API.

Overview
Using the DAI Health API
Attribute Definitions

Overview

The Driverless AI Health API is a publicly available API that exposes basic system metrics and statistics. Its primary purpose is to provide information for resource monitoring and auto-scaling of Driverless AI multinode clusters. The API returns a set of metrics in JSON format so that they can be consumed by tools such as KEDA or K8S Autoscaler.

Notes:

The Health API is only available in multinode or singlenode mode. For more information, refer to the worker_mode config.toml option.
For security purposes, the Health API endpoint can be disabled by setting the enable_health_api config.toml option to false. This setting is enabled by default.
The Health API is designed to give users the information they need to write their own autoscaling logic for multinode Driverless AI. It can also be used in tandem with services like Enterprise Puddle to skip the authentication step and instead retrieve the needed information directly.

Using the DAI Health API

To retrieve Driverless AI’s health status, send a GET request:

GET http://{driverless-ai-instance-address}/apis/health/v1

This returns the following JSON response:

{
  "api_version": "1.0",
  "server_version": "1.10",
  "timestamp": "ISO 8601 Datetime",
  "last_system_interaction": "ISO 8601 Datetime",
  "is_idle": true,

  "resources": {
    "cpu_cores": 150,
    "gpus": 12,
    "nodes": 5
  },

  "tasks": {
    "running": 45,
    "scheduled": 123
  },

  "utilization": {
    "cpu": 0.12,
    "gpu": 0.45,
    "memory": 0.56
  }
}
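
As a minimal sketch, the request above could be issued from Python using only the standard library. The function names `health_url` and `fetch_health`, the `timeout` value, and the injectable `opener` parameter are assumptions made for illustration; only the `/apis/health/v1` path comes from this page.

```python
import json
import urllib.request


def health_url(base_url: str) -> str:
    """Build the Health API URL from a Driverless AI base address."""
    return base_url.rstrip("/") + "/apis/health/v1"


def fetch_health(base_url: str, timeout: float = 5.0,
                 opener=urllib.request.urlopen) -> dict:
    """Send a GET request to the health endpoint and decode the JSON body.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so
    the function can be exercised without a live server (an assumption of
    this sketch, not something Driverless AI requires).
    """
    with opener(health_url(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (hypothetical address, replace with your own instance):
#   health = fetch_health("http://dai.example:12345")
#   print(health["is_idle"], health["tasks"]["running"])
```

If the endpoint has been disabled through enable_health_api, a call like this would be expected to fail, so production code would likely want to catch `urllib.error.URLError` around it.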

Attribute Definitions

The following is a list of relevant JSON attribute definitions.

api_version (string): API version
server_version (string): Driverless AI server version
timestamp (string): Current server time in ISO 8601 format
last_system_interaction (string): Timestamp in ISO 8601 format of the last interaction with the Driverless AI server. The following are considered system interactions:

Incoming RPC request from client

Login/Logout of user

A system event, such as a _sync_ message from a running or finished experiment

Initialization of dataset upload

Custom recipe upload

is_idle (boolean): System is considered idle when there is no task running or scheduled
resources.nodes (int): Number of nodes in Driverless AI cluster
resources.gpus (int): Total number of GPUs in Driverless AI cluster
resources.cpu_cores (int): Total number of CPU cores in Driverless AI cluster
tasks.running (int): Total number of jobs running in the system
tasks.scheduled (int): Total number of jobs waiting for execution in scheduling queue
utilization.cpu (float [0, 1]): CPU utilization as a fraction between 0 and 1, aggregated across all nodes
utilization.gpu (float [0, 1]): GPU utilization as a fraction between 0 and 1, aggregated across all nodes
utilization.memory (float [0, 1]): Memory utilization as a fraction between 0 and 1, aggregated across all nodes
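
To illustrate how these attributes might feed custom autoscaling logic, the sketch below derives two signals from a parsed response: the time elapsed since the last system interaction, and a suggested node count. The function names, the `tasks_per_node` and `max_nodes` parameters, and the scaling rule itself are hypothetical and not part of Driverless AI. The sketch also assumes that both timestamps arrive in a form that `datetime.fromisoformat` can parse.

```python
import math
from datetime import datetime


def _parse_iso(value: str) -> datetime:
    # Assumption: timestamps are ISO 8601 strings. A trailing "Z" is rewritten
    # because datetime.fromisoformat rejects it before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_since_interaction(health: dict) -> float:
    """Seconds between last_system_interaction and the server's timestamp."""
    now = _parse_iso(health["timestamp"])
    last = _parse_iso(health["last_system_interaction"])
    return (now - last).total_seconds()


def suggested_nodes(health: dict, tasks_per_node: int = 4,
                    max_nodes: int = 10) -> int:
    """Hypothetical rule: enough nodes for all running and scheduled tasks.

    Returns 0 when the API reports the system as idle; otherwise the result
    is clamped to the range [1, max_nodes].
    """
    if health["is_idle"]:
        return 0
    pending = health["tasks"]["running"] + health["tasks"]["scheduled"]
    wanted = math.ceil(pending / tasks_per_node)
    return max(1, min(max_nodes, wanted))
```

Comparing against the server's own timestamp rather than the local clock avoids clock-skew issues between the monitoring host and the Driverless AI server. A real policy would likely also weigh the utilization values before scaling down.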