feature request: Add support for GPU VM on CNCF self-runners #115

@jaiakash

Description

Currently, the CNCF automation scripts only support CPU-based runners/VMs. Some CI workflows (e.g., ML/AI workloads, GPU-enabled tests, and benchmarks) require access to NVIDIA GPUs and appropriate drivers/tooling on the runner.

To enable these workflows, the scripts need to support GPU-capable VM shapes.

Requested VM shapes:

  • VM.GPU.A10.1
  • VM.GPU.A10.2

Reference: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu

Additional Requirements

  • Install NVIDIA drivers [Required]
  • Install nvidia-smi [Optional but recommended]
  • Install nvkind for GPU discovery/use in Kubernetes-based workflows [Optional]

File to Update: https://github.com/cncf/automation/blob/main/ci/gha-runner-image/Dockerfile
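As a rough illustration of how a runner could confirm that the driver and nvidia-smi are usable after the image is built, here is a minimal Go sketch. It assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output. The `GPU` type and the `parseGPUs` helper are hypothetical names for this example, not part of the cncf/automation repository.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// GPU holds one device row reported by nvidia-smi (hypothetical type for this sketch).
type GPU struct {
	Name          string
	DriverVersion string
}

// parseGPUs parses the output of
//
//	nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
//
// where each non-empty line has the form "<name>, <driver version>".
func parseGPUs(out string) ([]GPU, error) {
	var gpus []GPU
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // skip blank lines, e.g. the trailing newline
		}
		fields := strings.SplitN(line, ",", 2)
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected nvidia-smi line: %q", line)
		}
		gpus = append(gpus, GPU{
			Name:          strings.TrimSpace(fields[0]),
			DriverVersion: strings.TrimSpace(fields[1]),
		})
	}
	return gpus, nil
}

func main() {
	// Assumption: nvidia-smi is installed and on the PATH of the runner.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=name,driver_version", "--format=csv,noheader").Output()
	if err != nil {
		fmt.Println("nvidia-smi not available:", err)
		return
	}
	gpus, err := parseGPUs(string(out))
	if err != nil {
		fmt.Println("could not parse nvidia-smi output:", err)
		return
	}
	for i, g := range gpus {
		fmt.Printf("GPU %d: %s (driver %s)\n", i, g.Name, g.DriverVersion)
	}
}
```

A check like this could run as an early workflow step so that GPU jobs fail fast when the driver is missing.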

Boot Volume Size

The default ephemeral OCI VM boot volume is 50GB, which may be insufficient for GPU workloads (CUDA, drivers, ML frameworks, build artifacts, etc.).

Request:

  • Increase default boot volume size to 256GB, or
  • Provide selectable options: 50GB, 256GB, 500GB

File to Update: https://github.com/cncf/automation/blob/main/ci/cloudrunners/oci/main.go
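One possible way to combine the two options above is sketched below in Go: keep 50GB as the default for CPU shapes, default to 256GB for GPU shapes, and accept an explicit size only if it is one of the listed options. The function names, the `VM.GPU.` prefix check, and the per-shape default are assumptions for illustration, not the existing logic in main.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Sizes taken from the request above; the per-shape default is an assumption.
const (
	defaultBootVolumeGB    int64 = 50
	defaultGPUBootVolumeGB int64 = 256
)

var allowedBootVolumeGB = []int64{50, 256, 500}

// isGPUShape reports whether an OCI shape name looks like a GPU VM shape,
// e.g. "VM.GPU.A10.1". Matching on the prefix is an assumption of this sketch.
func isGPUShape(shape string) bool {
	return strings.HasPrefix(shape, "VM.GPU.")
}

// bootVolumeSizeGB returns the boot volume size to request.
// requested == 0 means "not specified": fall back to the per-shape default.
func bootVolumeSizeGB(shape string, requested int64) (int64, error) {
	if requested == 0 {
		if isGPUShape(shape) {
			return defaultGPUBootVolumeGB, nil
		}
		return defaultBootVolumeGB, nil
	}
	for _, size := range allowedBootVolumeGB {
		if requested == size {
			return requested, nil
		}
	}
	return 0, fmt.Errorf("unsupported boot volume size %dGB (allowed: %v)",
		requested, allowedBootVolumeGB)
}

func main() {
	for _, shape := range []string{"VM.GPU.A10.1", "VM.Standard.E4.Flex"} {
		size, err := bootVolumeSizeGB(shape, 0)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// In the OCI Go SDK this value would likely be set as the
		// BootVolumeSizeInGBs field of the instance source details
		// (to be confirmed against the SDK version used in main.go).
		fmt.Printf("%s -> %dGB boot volume\n", shape, size)
	}
}
```

Validating against a fixed list keeps the options explicit and rejects typos before a VM launch is attempted.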

Would you like to help?

Yes, I would love to help.
I worked on something similar at kubeflow/trainer#2689 and https://github.com/jaiakash/automation/tree/gpu-runner-kubeflow

Labels

help wanted, needs-kind, needs-triage
