Description
Currently, the CNCF automation scripts only support CPU-based runners/VMs. Some CI workflows (e.g., ML/AI workloads, GPU-enabled tests, and benchmarks) require access to NVIDIA GPUs and appropriate drivers/tooling on the runner.
To enable these workflows, support for GPU-capable VM images must be added.
Requested VM shapes:
- VM.GPU.A10.1
- VM.GPU.A10.2
Reference: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu
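As a rough illustration of how the runner code could recognize the two requested shapes, here is a minimal Go sketch. The names `gpuShapes` and `gpuCount` are hypothetical and do not come from the cncf/automation repository. The GPU counts (1 and 2) are taken from the trailing digit of the shape names and should be checked against the OCI reference above.

```go
package main

import "fmt"

// gpuShapes lists the two shapes requested in this issue and the number of
// A10 GPUs each is assumed to expose. Hypothetical table, for illustration only.
var gpuShapes = map[string]int{
	"VM.GPU.A10.1": 1,
	"VM.GPU.A10.2": 2,
}

// gpuCount returns the assumed GPU count for a shape and reports whether the
// shape is one of the GPU shapes requested here.
func gpuCount(shape string) (int, bool) {
	n, ok := gpuShapes[shape]
	return n, ok
}

func main() {
	for _, s := range []string{"VM.GPU.A10.1", "VM.GPU.A10.2", "VM.Standard.E4.Flex"} {
		if n, ok := gpuCount(s); ok {
			fmt.Printf("%s: %d GPU(s)\n", s, n)
		} else {
			fmt.Printf("%s: not a requested GPU shape\n", s)
		}
	}
}
```

A lookup table like this would let the runner decide per shape whether GPU-specific setup (drivers, larger boot volume) applies, without affecting existing CPU-only shapes.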
Additional Requirements
- Add installation for NVIDIA drivers [Required]
- Add installation for nvidia-smi [Optional but recommended]
- nvkind for GPU discovery/use in Kubernetes-based workflows [Optional]
File to Update: https://github.com/cncf/automation/blob/main/ci/gha-runner-image/Dockerfile
Boot Volume Size
The default ephemeral OCI VM boot volume is 50GB, which may be insufficient for GPU workloads (CUDA, drivers, ML frameworks, build artifacts, etc.).
Request:
- Increase default boot volume size to 256GB, or
- Provide selectable options: 50GB, 256GB, 500GB
File to Update: https://github.com/cncf/automation/blob/main/ci/cloudrunners/oci/main.go
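The selectable-size option could be handled with a small validation step before the instance is launched. The sketch below is only an illustration under stated assumptions: `defaultBootVolumeGB`, `allowedBootVolumeGB`, and `resolveBootVolumeGB` are hypothetical names, not existing code in `main.go`, and it treats a requested size of 0 as "use the current 50GB default".

```go
package main

import (
	"fmt"
	"os"
)

// defaultBootVolumeGB mirrors the current 50GB default described above.
const defaultBootVolumeGB int64 = 50

// allowedBootVolumeGB holds the selectable sizes proposed in this issue.
var allowedBootVolumeGB = []int64{50, 256, 500}

// resolveBootVolumeGB returns the boot volume size to use. A request of 0
// means "not specified" and falls back to the default; any other value must
// be one of the allowed sizes.
func resolveBootVolumeGB(requested int64) (int64, error) {
	if requested == 0 {
		return defaultBootVolumeGB, nil
	}
	for _, size := range allowedBootVolumeGB {
		if requested == size {
			return requested, nil
		}
	}
	return 0, fmt.Errorf("unsupported boot volume size %d GB; allowed: %v",
		requested, allowedBootVolumeGB)
}

func main() {
	size, err := resolveBootVolumeGB(256)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("boot volume: %d GB\n", size)
}
```

The resolved value would then be passed to whatever launch call `main.go` already makes. How the size is requested (flag, environment variable, or runner label) is left open here.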
Would you like to help?
Yes, I would love to help.
I worked on something similar at kubeflow/trainer#2689 and https://github.com/jaiakash/automation/tree/gpu-runner-kubeflow