Reinvent Keynote3 Elastic Training Feature Support #341

mollyheamazon · 2025-12-03T15:27:37Z

What's changing and why?

We are adding six new command line arguments for elastic training with HyperPodTrainingOperator. Elastic training dynamically scales distributed machine learning jobs, which helps address the resource inefficiency and poor fault tolerance often seen in traditional distributed training jobs.

Before/After UX

Before:

hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1

After:

(Case 1)
hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1 \
    --elastic-replica-increment-step 1 \
    --max-node-count 2 \
    --elastic-graceful-shutdown-timeout-in-seconds 180 \
    --elastic-scaling-timeout-in-seconds 60 \
    --elastic-scale-up-snooze-time-in-seconds 60

(Case 2)
hyp create hyp-pytorch-job \
    --job-name <job_name> \
    --image <image_name> \
    --node-count 1 \
    --max-node-count 8 \
    --elastic-graceful-shutdown-timeout-in-seconds 180 \
    --elastic-scaling-timeout-in-seconds 60 \
    --elastic-scale-up-snooze-time-in-seconds 60 \
    --elastic-replica-discrete-values '[2, 4, 8]'
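
The two cases differ in how the scaling targets are described: Case 1 uses an increment step, Case 2 an explicit list of values. The sketch below is one way to reason about which node counts each form would permit. It is illustrative only: `allowed_replica_counts` is a hypothetical helper, not part of the CLI or the operator, and it assumes (this PR does not confirm it) that the step and the discrete list are alternatives and that `--max-node-count` caps the result.

```python
from typing import List, Optional


def allowed_replica_counts(
    node_count: int,
    max_node_count: int,
    increment_step: Optional[int] = None,
    discrete_values: Optional[List[int]] = None,
) -> List[int]:
    """Hypothetical helper: list the node counts an elastic job might scale to.

    Assumptions (not confirmed by the PR): the increment step and the
    discrete list are mutually exclusive, and max_node_count is an upper bound.
    """
    if node_count < 1 or max_node_count < node_count:
        raise ValueError("need 1 <= node_count <= max_node_count")
    if increment_step is not None and discrete_values is not None:
        raise ValueError("pass either an increment step or discrete values, not both")

    if discrete_values is not None:
        # Keep only the distinct listed values that fit under the cap.
        return sorted(v for v in set(discrete_values) if 1 <= v <= max_node_count)

    # Assumed default of 1 when no step is given.
    step = 1 if increment_step is None else increment_step
    if step < 1:
        raise ValueError("increment step must be positive")
    # Walk from the starting node count up to the cap in fixed steps.
    return list(range(node_count, max_node_count + 1, step))


# Case 1: --node-count 1 --max-node-count 2 --elastic-replica-increment-step 1
print(allowed_replica_counts(1, 2, increment_step=1))            # [1, 2]
# Case 2: --node-count 1 --max-node-count 8 --elastic-replica-discrete-values '[2, 4, 8]'
print(allowed_replica_counts(1, 8, discrete_values=[2, 4, 8]))   # [2, 4, 8]
```

The operator's actual scaling rules may differ; treat the output as a reading aid for the two examples rather than a statement of behavior.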

How was this change tested?

Manual testing: Tested the local changes by running pip install . in both the template and project root folders, then running the hyp create hyp-pytorch-job command with the newly added command line arguments.
- Got the message Successfully submitted HyperPodPytorchJob '<job_name>'!
- Ran the command kubectl get hyperpodpytorchjob <job_name> -o yaml and checked the output YAML
Ran the unit tests and integration tests.
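
The manual check of the kubectl output could be scripted along these lines. This is a rough sketch under stated assumptions: it asks kubectl for JSON instead of YAML so that only the standard library is needed, and because the CRD field names are not listed in this PR, it simply collects every value whose path contains "elastic". `kubectl_get_job_cmd`, `find_elastic_fields`, and `fetch_elastic_fields` are hypothetical helper names.

```python
import json
import subprocess
from typing import Any, Dict, List


def kubectl_get_job_cmd(job_name: str) -> List[str]:
    """Build the kubectl command from the test steps, requesting JSON output."""
    return ["kubectl", "get", "hyperpodpytorchjob", job_name, "-o", "json"]


def find_elastic_fields(obj: Any, path: str = "") -> Dict[str, Any]:
    """Return {dotted_path: value} for every leaf whose path mentions 'elastic'.

    No CRD field names are assumed; matching is on the path text only.
    """
    found: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else key
            found.update(find_elastic_fields(value, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            found.update(find_elastic_fields(value, f"{path}[{index}]"))
    elif "elastic" in path.lower():
        found[path] = obj
    return found


def fetch_elastic_fields(job_name: str) -> Dict[str, Any]:
    """Run kubectl (requires cluster access) and extract elastic-related fields."""
    result = subprocess.run(
        kubectl_get_job_cmd(job_name), check=True, capture_output=True, text=True
    )
    return find_elastic_fields(json.loads(result.stdout))
```

Calling `fetch_elastic_fields("<job_name>")` after a successful submission and comparing the returned values with the arguments passed to hyp create would replace the visual inspection of the YAML.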

Are unit tests added?

Yes

Are integration tests added?

Yes

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

- All automated PR checks pass
- Failed tests include local run results/screenshots proving they work
- Changes are documentation-only

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

* feat: Implement elastic training cli arguments (#273)
* feat: Implement elastic training cli arguments
* Add elastic training unified config and unit test
* Add graceful shutdown and scaling timeout to cli args
* Revert "feat: Implement elastic training cli arguments (#273)"
  This reverts commit 18428ef2b1c0562bf51a9a4b4aa2914eed441259.
* feat: Implement elastic training cli arguments (#295)
* feat: implement elastic training cli args
* Rename args name to match crd for elastic training
* Add unit test for replica discrete values
* Add integ test for elastic training cli

---------

Co-authored-by: Sophia <yungwenh@amazon.com>
Co-authored-by: Molly He <mollyhe@amazon.com>
Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

shantanutrip and others added 5 commits December 3, 2025 07:17

Upgrade Inference Operator Version (#327)

99dbe7b

pyproj version update (#328)

4db9168

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

version change (#329)

e37b011

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

version update for v3.5.0

682303b

mollyheamazon requested a review from a team as a code owner December 3, 2025 15:27

mollyheamazon temporarily deployed to auto-approve December 3, 2025 15:27 — with GitHub Actions Inactive

resolve merge conflict

7099725

mollyheamazon deployed to auto-approve December 3, 2025 15:42 — with GitHub Actions Active

mohamedzeidan2021 approved these changes Dec 3, 2025

aviruthen approved these changes Dec 3, 2025

nargokul approved these changes Dec 3, 2025

mollyheamazon merged commit c64811d into main Dec 3, 2025
9 checks passed
