Enable ROCm CI support. #1260
Hi @akashveramd! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
ebaafdc to ce60b15
No ciflow labels are configured for this repo. |
        TEST_WITH_ROCM
        and test_flavor.test_name in skip_for_rocm_test_list
    ):
        continue
This logic makes sense to me, but if we really want to use the test setting in integration_tests_h100.py, we should move this logic to that file (and of course rename it to be more agnostic).
@jithunnair-amd: All tests in integration_tests_h100.py pass on ROCm, so we don't need the TEST_WITH_ROCM check in integration_tests_h100.py. However, we should discuss renaming integration_tests_h100.py, since we also run it on ROCm runners.
cc: @tianyu-l @fegin
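For readers following the skip logic discussed above, here is a minimal, self-contained sketch of how such a filter might look. Only `TEST_WITH_ROCM`, `skip_for_rocm_test_list`, and `test_flavor.test_name` come from the diff; the `Flavor` class, the `select_tests` helper, and the test names are hypothetical stand-ins, and this is not the actual torchtitan implementation.

```python
from dataclasses import dataclass

# Assumed here to be a plain boolean; how the real flag is set is not shown in the diff.
TEST_WITH_ROCM = True

# Hypothetical test name used only for illustration.
skip_for_rocm_test_list = ["example_unsupported_test"]


@dataclass
class Flavor:
    """Hypothetical stand-in for the object exposing `test_name`."""
    test_name: str


def select_tests(flavors, test_with_rocm, skip_list):
    """Return the flavors to run, dropping skip-listed ones only on ROCm."""
    selected = []
    for test_flavor in flavors:
        if (
            test_with_rocm
            and test_flavor.test_name in skip_list
        ):
            continue
        selected.append(test_flavor)
    return selected
```

Keeping the skip list next to the test definitions, as suggested in the review, would let a single filter like this serve every test file rather than only one runner script.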
d527f27 to 18025ad
@huydhn Need your help in creating a new docker repo for torchtitan ROCm docker image: https://github.com/pytorch/torchtitan/actions/runs/16042425274/job/45266420732?pr=1260#step:7:1436 |
3f3551b to efd11a8
bc0314b to 4e81fd9
…g ubuntu folder for cuda Dockerfile.
…Fixed error in integration_tests.py. Fixed lint errors.
…_job_v2.yml for integration_test_8gpu.yaml.
4e81fd9 to c23e65b
…ily available to run the workflow.
From what I see, I suspect the error happens because your PR comes from a fork, https://github.com/akashveramd/torchtitan, rather than from https://github.com/pytorch/torchtitan. I'm testing that in #1782. If this is the case, we need to implement |
It is confirmed ( https://github.com/pytorch/torchtitan/actions/runs/18179897492/job/51753630586?pr=1782 ) that authentication works on a non-forked PR. This makes sense because the grant is only to |
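As a rough illustration of the fork distinction described above, a workflow step could inspect the GitHub event payload to tell whether a pull request comes from a fork. This is a sketch under assumptions, not the fix adopted here: the helper names `is_forked_pr` and `load_event` are hypothetical, and it relies only on the `pull_request.head.repo.full_name` and `repository.full_name` fields of the `pull_request` event payload and on the `GITHUB_EVENT_PATH` environment variable that GitHub Actions sets.

```python
import json
import os


def is_forked_pr(event: dict) -> bool:
    """Return True when the PR head lives in a different repository than the base."""
    pr = event.get("pull_request")
    if pr is None:
        # Not a pull_request event (e.g. a push), so there is no fork to consider.
        return False
    head_repo = pr.get("head", {}).get("repo")
    if head_repo is None:
        # The head repository can be null if the fork has been deleted.
        return True
    return head_repo["full_name"] != event["repository"]["full_name"]


def load_event() -> dict:
    """Load the event payload that GitHub Actions writes to GITHUB_EVENT_PATH."""
    with open(os.environ["GITHUB_EVENT_PATH"], encoding="utf-8") as f:
        return json.load(f)
```

Inside a workflow, `is_forked_pr(load_event())` could then gate steps that need repository-scoped credentials.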
…n in linux_job_v2.yml.
… and move_aws_steps_inside_setup_rocm branch.
Closing in favor of PR: #1786 |
This PR is based on the original PR #1260. The original PR was created from a different fork and had issues setting up AWS inside the workflow, because the workflow was running from a forked PR.
---------
Co-authored-by: Huy Do <huydhn@gmail.com>
In this PR-