@mollyheamazon mollyheamazon commented Dec 3, 2025

What's changing and why?

This PR adds six new command-line arguments for elastic training with HyperPodTrainingOperator. Elastic training dynamically scales a distributed machine learning job, which addresses the resource inefficiency and poor fault tolerance often seen in traditional distributed training jobs.

Before/After UX

Before:

hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1

After:

(Case 1)
hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1 \
    --elastic-replica-increment-step 1 \
    --max-node-count 2 \
    --elastic-graceful-shutdown-timeout-in-seconds 180 \
    --elastic-scaling-timeout-in-seconds 60 \
    --elastic-scale-up-snooze-time-in-seconds 60

(Case 2)
hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1 \
    --max-node-count 8 \
    --elastic-graceful-shutdown-timeout-in-seconds 180 \
    --elastic-scaling-timeout-in-seconds 60 \
    --elastic-scale-up-snooze-time-in-seconds 60 \
    --elastic-replica-discrete-values '[2, 4, 8]'
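
The two cases describe the permitted node counts in two different ways: Case 1 uses a fixed increment step between --node-count and --max-node-count, while Case 2 lists explicit values. The sketch below illustrates one plausible reading of those arguments. It is not the operator's actual logic: the helper name allowed_replica_counts is hypothetical, and both the mutual exclusivity of the two options and the range check against --max-node-count are assumptions made for illustration.

```python
import json
from typing import List, Optional


def allowed_replica_counts(
    node_count: int,
    max_node_count: int,
    increment_step: Optional[int] = None,
    discrete_values: Optional[str] = None,
) -> List[int]:
    """Hypothetical helper: list the node counts an elastic job could scale to."""
    if node_count < 1 or max_node_count < node_count:
        raise ValueError("expected 1 <= node_count <= max_node_count")

    # Assumption: the step and the discrete list are alternatives,
    # as Case 1 and Case 2 above suggest.
    if increment_step is not None and discrete_values is not None:
        raise ValueError("pass an increment step or discrete values, not both")

    if discrete_values is not None:
        # The CLI value is a JSON-style list in a string, e.g. '[2, 4, 8]'.
        values = json.loads(discrete_values)
        if not isinstance(values, list) or not all(type(v) is int for v in values):
            raise ValueError("discrete values must be a list of integers")
        # Assumption: every listed value must fit under max_node_count.
        if any(v < 1 or v > max_node_count for v in values):
            raise ValueError("discrete values must lie in 1..max_node_count")
        return sorted(set(values))

    step = 1 if increment_step is None else increment_step
    if step < 1:
        raise ValueError("increment step must be at least 1")
    return list(range(node_count, max_node_count + 1, step))


if __name__ == "__main__":
    print(allowed_replica_counts(1, 2, increment_step=1))             # Case 1
    print(allowed_replica_counts(1, 8, discrete_values="[2, 4, 8]"))  # Case 2
```

Under these assumptions, Case 1 yields the node counts 1 and 2, and Case 2 yields 2, 4, and 8.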

How was this change tested?

  • Manual testing: Installed the local changes with pip install . in both the template and project root folders, then ran the hyp create hyp-pytorch-job command with the newly added command-line arguments.
    • Confirmed the message Successfully submitted HyperPodPytorchJob '<job_name>'!
    • Ran kubectl get hyperpodpytorchjob <job_name> -o yaml and checked the output YAML
  • Ran the unit tests and integration tests
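
The manual check of the job resource could be scripted along the following lines. This is a rough sketch, not part of the PR: it asks kubectl for JSON instead of YAML so that only the Python standard library is needed, and it makes no assumption about the CRD's exact field names, simply collecting every field whose key contains "elastic". The helper names find_fields and elastic_fields_for_job are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, Iterator, Tuple


def find_fields(obj: Any, needle: str, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every key containing `needle` (case-insensitive)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else key
            if needle.lower() in key.lower():
                # Report the matching field as a whole; do not descend further.
                yield child, value
            else:
                yield from find_fields(value, needle, child)
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_fields(item, needle, f"{path}[{index}]")


def elastic_fields_for_job(job_name: str) -> Dict[str, Any]:
    """Fetch the job with kubectl and return its elastic-related fields."""
    result = subprocess.run(
        ["kubectl", "get", "hyperpodpytorchjob", job_name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return dict(find_fields(json.loads(result.stdout), "elastic"))
```

Calling elastic_fields_for_job("<job_name>") against a cluster where the job was submitted should return the elastic settings to compare with the values passed on the command line.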

Are unit tests added?

Yes

Are integration tests added?

Yes

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

shantanutrip and others added 5 commits December 3, 2025 07:17
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* feat: Implement elastic training cli arguments (#273)

* feat: Implement elastic training cli arguments

* Add elastic training unified config and unit test

* Add graceful shutdown and scaling timeout to cli args

* Revert "feat: Implement elastic training cli arguments (#273)"

This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.

* feat: Implement elastic training cli arguments (#295)

* feat: implement elastic training cli args

* Rename args name to match crd for elastic training

* Add unit test for replica discrete values

* Add integ test for elastic training cli

---------

Co-authored-by: Sophia <yungwenh@amazon.com>
Co-authored-by: Molly He <mollyhe@amazon.com>
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
@mollyheamazon mollyheamazon requested a review from a team as a code owner December 3, 2025 15:27
@mollyheamazon mollyheamazon merged commit c64811d into main Dec 3, 2025
9 checks passed