Driverless AI Health API

The following sections describe the Driverless AI Health API.

Overview

The Driverless AI Health API is a publicly available API that exposes basic system metrics and statistics. Its primary purpose is to provide information for resource monitoring and auto-scaling of Driverless AI multinode clusters. The API returns a set of metrics in JSON format so that tools such as KEDA or a Kubernetes (K8S) autoscaler can consume them.

Notes:

  • The Health API is only available in multinode or singlenode mode. For more information, refer to the worker_mode config.toml option.

  • For security purposes, the Health API endpoint can be disabled by setting the enable_health_api config.toml option to false. The Health API is enabled by default.

  • The Health API is designed to provide the information that users need to write their own autoscaling logic for multinode Driverless AI. It can also be used in tandem with services like Enterprise Puddle to skip the authentication step and retrieve the needed information directly.

Using the DAI Health API

To retrieve Driverless AI’s health status, send a GET request:

GET http://{driverless-ai-instance-address}/apis/health/v1

This returns the following JSON response:

{
  "api_version": "1.0",
  "server_version": "1.10",
  "timestamp": "ISO 8601 Datetime",
  "last_system_interaction": "ISO 8601 Datetime",
  "is_idle": true,

  "resources": {
    "cpu_cores": 150,
    "gpus": 12,
    "nodes": 5
  },

  "tasks": {
    "running": 45,
    "scheduled": 123
  },

  "utilization": {
    "cpu": 0.12,
    "gpu": 0.45,
    "memory": 0.56
  }
}
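
As a minimal sketch, the request above could be issued from Python using only the standard library. The function name fetch_health, the timeout value, and the injectable opener parameter are illustrative assumptions, not part of the Driverless AI API, and the base URL is a placeholder for your own instance address.

```python
import json
from urllib.request import urlopen

HEALTH_PATH = "/apis/health/v1"  # endpoint path documented above


def fetch_health(base_url, opener=urlopen, timeout=5.0):
    """Return the Health API response as a dict.

    base_url is a placeholder for the instance address (assumption),
    for example "http://driverless-ai.example.com".
    opener is injectable so the function can be exercised without a live server.
    """
    url = base_url.rstrip("/") + HEALTH_PATH
    with opener(url, timeout=timeout) as response:
        # The endpoint returns a JSON document; decode it into a dict.
        return json.loads(response.read().decode("utf-8"))
```

A call such as fetch_health("http://{driverless-ai-instance-address}") would then return a dictionary whose keys match the attributes shown in the response above, for example health["tasks"]["scheduled"].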

Attribute Definitions

The following is a list of relevant JSON attribute definitions.

  • api_version (string): API version

  • server_version (string): Driverless AI server version

  • timestamp (string): Current server time in ISO 8601 format

  • last_system_interaction (string): Timestamp, in ISO 8601 format, of the last interaction with the Driverless AI server. The following are considered system interactions:

  1. Incoming RPC request from client

  2. Login/Logout of user

  3. A system event, such as a _sync_ message from a running or finished experiment

  4. Initialization of dataset upload

  5. Custom recipe upload

  • is_idle (boolean): The system is considered idle when no tasks are running or scheduled

  • resources.nodes (int): Number of nodes in the Driverless AI cluster

  • resources.gpus (int): Total number of GPUs in the Driverless AI cluster

  • resources.cpu_cores (int): Total number of CPU cores in the Driverless AI cluster

  • tasks.running (int): Total number of jobs running in the system

  • tasks.scheduled (int): Total number of jobs waiting for execution in the scheduling queue

  • utilization.cpu (float [0, 1]): CPU utilization aggregated across all nodes, expressed as a fraction between 0 and 1

  • utilization.gpu (float [0, 1]): GPU utilization aggregated across all nodes, expressed as a fraction between 0 and 1

  • utilization.memory (float [0, 1]): Memory utilization aggregated across all nodes, expressed as a fraction between 0 and 1
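
To illustrate how these attributes might feed custom autoscaling logic, the following sketch derives an idle duration from timestamp and last_system_interaction and suggests a node count. The function names, the thresholds (target_utilization, idle_after, min_nodes), and the scaling policy itself are hypothetical choices made for illustration; none of them are defined by Driverless AI. The sketch also assumes that both timestamps use the same ISO 8601 form and that datetime.fromisoformat can parse it (a trailing "Z" is normalized first).

```python
import math
from datetime import datetime, timedelta


def parse_iso8601(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten as '+00:00'."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def idle_duration(health):
    """Time between the last system interaction and the server's reported time."""
    now = parse_iso8601(health["timestamp"])
    last = parse_iso8601(health["last_system_interaction"])
    return now - last


def suggest_node_count(health, target_utilization=0.7,
                       idle_after=timedelta(minutes=30), min_nodes=1):
    """Hypothetical scaling policy; thresholds are illustrative assumptions."""
    nodes = health["resources"]["nodes"]

    # Shrink to the minimum only when nothing is running or queued AND
    # nobody has interacted with the server for a while.
    if health["is_idle"] and idle_duration(health) >= idle_after:
        return min_nodes

    # Jobs waiting in the scheduling queue: add one node at a time.
    if health["tasks"]["scheduled"] > 0:
        return nodes + 1

    # Otherwise size the cluster so the busiest resource sits near the target.
    busiest = max(health["utilization"].values())
    # Rounding keeps float noise from pushing an exact ratio up by one node.
    wanted = math.ceil(round(nodes * busiest / target_utilization, 6))
    return max(min_nodes, wanted)
```

The proportional step mirrors the common "current size times observed load divided by target load" rule used by many autoscalers; a production policy would likely add cooldown periods and an upper bound on the node count.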