fix(checkpoint): write consolidated safetensors without append#2627
fix(checkpoint): write consolidated safetensors without append#2627huahuajhu wants to merge 11 commits into
Conversation
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the HuggingFace safetensors consolidation path to avoid reopening output shards in append mode by writing the safetensors header and payload in a single wb stream, improving compatibility with filesystems that do not support append.
Changes:
- Refactors safetensors consolidation to compute header metadata/offsets and write header+tensor bytes in one
wbpass per output shard (noabreopen). - Changes HF storage writer consolidation to only use staging when
staging_diris explicitly provided (direct consolidation by default). - Updates unit tests and Databricks guide examples to reflect staging being optional.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
tests/unit_tests/checkpoint/test_consolidate_safetensors.py |
Adds regression tests for “no append-mode opens” and verifies direct consolidation defaults. |
nemo_automodel/components/checkpoint/config.py |
Clarifies staging_dir semantics in the checkpointing config comments. |
nemo_automodel/components/checkpoint/_backports/hf_storage.py |
Makes staging opt-in based on staging_dir presence for consolidation. |
nemo_automodel/components/checkpoint/_backports/consolidate_hf_safetensors.py |
Implements single-stream (wb) header+payload writing and removes append-mode usage. |
docs/guides/llm/databricks.mdx |
Removes staging_dir from example invocations and describes it as optional for consolidation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: jingxin yu <huahuajhu@gmail.com>
eab01b7 to
9b4b32d
Compare
|
Hi @huahuajhu , thank you for making this! Since we don't have databricks on our CI, i want to ask you if you have tested this on databricks and what's the difference in perf (before and after). I'll try to find someone to review, but that will be next week probably. Thank you. |
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: jingxin yu <huahuajhu@gmail.com>
…ttps://github.com/huahuajhu/Automodel into huahuajhu/fix/issue-1092-single-pass-consolidation
|
Thanks for the question. I tested the checkpoint consolidation path on Databricks using a Unity Catalog volume, since that is the filesystem behavior this PR changes. Test environment:
Test scope:
Before:
Failure occurred inside: After: |
What does this PR do ?
Fixes consolidated HF safetensors export to write each output shard in a single
wbpass instead of writing metadata first and reopening the file in append mode.Changelog
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information