fix(checkpoint): write consolidated safetensors without append#2627

Open
huahuajhu wants to merge 11 commits into
NVIDIA-NeMo:mainfrom
huahuajhu:huahuajhu/fix/issue-1092-single-pass-consolidation

@huahuajhu huahuajhu commented Jun 18, 2026

What does this PR do?

Fixes consolidated HF safetensors export to write each output shard in a single wb pass instead of writing metadata first and reopening the file in append mode.
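
For readers unfamiliar with the layout: a safetensors file is an 8-byte little-endian header length, a JSON header whose `data_offsets` are relative to the start of the tensor data, and then the raw tensor bytes. Because every offset can be computed before any byte is written, the whole file can go out through one `wb` handle with no seek or append. The following is a minimal sketch of that idea, not the code in this PR; `write_safetensors_single_pass` and its `(dtype, shape, raw_bytes)` input layout are hypothetical names chosen for illustration.

```python
import json
import os
import struct
import tempfile


def write_safetensors_single_pass(path, tensors):
    """Write a safetensors file with one sequential pass over a "wb" handle.

    `tensors` maps name -> (dtype_str, shape, raw_bytes). This helper and its
    input layout are illustrative assumptions, not the PR's actual API.
    Returns the total number of bytes written.
    """
    # Compute every offset up front so the header is final before writing.
    header = {}
    offset = 0
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + len(data)],
        }
        offset += len(data)

    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Pad the header with spaces so the tensor data starts 8-byte aligned.
    header_bytes += b" " * (-len(header_bytes) % 8)

    # Single open, strictly sequential writes: length, header, payload.
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        for _dtype, _shape, data in tensors.values():
            f.write(data)
    return 8 + len(header_bytes) + offset


# Small demonstration with two made-up tensors.
demo_tensors = {
    "a": ("F32", (2,), struct.pack("<2f", 1.0, 2.0)),
    "b": ("U8", (3,), bytes([1, 2, 3])),
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model.safetensors")
    expected_size = write_safetensors_single_pass(demo_path, demo_tensors)
    actual_size = os.path.getsize(demo_path)
    with open(demo_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        parsed_header = json.loads(f.read(header_len))
```

The sketch holds all tensor bytes in memory for brevity; a real consolidation path would presumably stream each tensor from its source shard in offset order.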

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and follow Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Signed-off-by: Hua Hua <huahuajhu@gmail.com>
copy-pr-bot Bot commented Jun 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Hua Hua <huahuajhu@gmail.com>
@huahuajhu huahuajhu marked this pull request as ready for review June 18, 2026 05:14
@huahuajhu huahuajhu requested review from a team and jgerh as code owners June 18, 2026 05:14
Copilot AI review requested due to automatic review settings June 18, 2026 05:14

Copilot AI left a comment

Pull request overview

This PR updates the HuggingFace safetensors consolidation path to avoid reopening output shards in append mode by writing the safetensors header and payload in a single wb stream, improving compatibility with filesystems that do not support append.

Changes:

  • Refactors safetensors consolidation to compute header metadata/offsets and write header+tensor bytes in one wb pass per output shard (no ab reopen).
  • Changes HF storage writer consolidation to only use staging when staging_dir is explicitly provided (direct consolidation by default).
  • Updates unit tests and Databricks guide examples to reflect staging being optional.
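
The opt-in staging behavior described in the second bullet can be pictured with a short sketch. This is an assumption-laden illustration rather than the writer's real logic: `consolidate_with_optional_staging` and `write_fn` are hypothetical names, and the only behavior taken from the PR is that staging happens when `staging_dir` is provided and is skipped otherwise.

```python
import os
import shutil
import tempfile


def consolidate_with_optional_staging(write_fn, output_dir, staging_dir=None):
    """Run `write_fn(target_dir)`; stage only when `staging_dir` is given.

    Hypothetical helper for illustration. Returns True if staging was used.
    """
    os.makedirs(output_dir, exist_ok=True)
    if staging_dir is None:
        # Default: write straight into the final location.
        write_fn(output_dir)
        return False

    # Opt-in: write to the staging directory first, then copy the files over.
    os.makedirs(staging_dir, exist_ok=True)
    write_fn(staging_dir)
    for name in sorted(os.listdir(staging_dir)):
        shutil.copyfile(os.path.join(staging_dir, name), os.path.join(output_dir, name))
    return True


def fake_writer(target_dir):
    # Stand-in for the consolidation step; writes one placeholder shard.
    with open(os.path.join(target_dir, "model-00001-of-00001.safetensors"), "wb") as f:
        f.write(b"\x00" * 16)


with tempfile.TemporaryDirectory() as tmp:
    direct_used_staging = consolidate_with_optional_staging(
        fake_writer, os.path.join(tmp, "out_direct")
    )
    staged_used_staging = consolidate_with_optional_staging(
        fake_writer, os.path.join(tmp, "out_staged"), staging_dir=os.path.join(tmp, "stage")
    )
    direct_files = os.listdir(os.path.join(tmp, "out_direct"))
    staged_files = os.listdir(os.path.join(tmp, "out_staged"))
```

Either path ends with the same files in the output directory; the direct path is only safe when the writer itself avoids operations the target filesystem rejects.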

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit_tests/checkpoint/test_consolidate_safetensors.py | Adds regression tests for "no append-mode opens" and verifies direct consolidation defaults. |
| nemo_automodel/components/checkpoint/config.py | Clarifies staging_dir semantics in the checkpointing config comments. |
| nemo_automodel/components/checkpoint/_backports/hf_storage.py | Makes staging opt-in based on staging_dir presence for consolidation. |
| nemo_automodel/components/checkpoint/_backports/consolidate_hf_safetensors.py | Implements single-stream (wb) header+payload writing and removes append-mode usage. |
| docs/guides/llm/databricks.mdx | Removes staging_dir from example invocations and describes it as optional for consolidation. |
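
A "no append-mode opens" regression check, as mentioned for the test file, could look roughly like the sketch below. It is not the PR's test: `record_open_modes` and the two toy writers are hypothetical, and the spy only sees calls that go through `builtins.open` (calls made via `os.open`, for example, would not be recorded).

```python
import builtins
import os
import tempfile
from unittest import mock


def record_open_modes(fn, *args, **kwargs):
    """Call `fn` and return the list of modes passed to builtins.open."""
    real_open = builtins.open
    modes = []

    def spy(file, mode="r", *open_args, **open_kwargs):
        modes.append(mode)
        return real_open(file, mode, *open_args, **open_kwargs)

    with mock.patch("builtins.open", new=spy):
        fn(*args, **kwargs)
    return modes


def single_pass_writer(path):
    # Header and payload go through the same "wb" handle.
    with open(path, "wb") as f:
        f.write(b"header")
        f.write(b"payload")


def append_style_writer(path):
    # The pattern the PR removes: write the header, then reopen with "ab".
    with open(path, "wb") as f:
        f.write(b"header")
    with open(path, "ab") as f:
        f.write(b"payload")


with tempfile.TemporaryDirectory() as tmp:
    single_modes = record_open_modes(single_pass_writer, os.path.join(tmp, "a.bin"))
    append_modes = record_open_modes(append_style_writer, os.path.join(tmp, "b.bin"))

# A regression test would assert that no recorded mode contains "a".
single_has_append = any("a" in m for m in single_modes)
append_has_append = any("a" in m for m in append_modes)
```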

Comment thread tests/unit_tests/checkpoint/test_consolidate_safetensors.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: jingxin yu <huahuajhu@gmail.com>
@huahuajhu huahuajhu force-pushed the huahuajhu/fix/issue-1092-single-pass-consolidation branch from eab01b7 to 9b4b32d Compare June 18, 2026 05:32
akoumpa commented Jun 18, 2026

Contributor

Hi @huahuajhu, thank you for making this!

Since we don't have Databricks on our CI, I want to ask whether you have tested this on Databricks and what the performance difference is (before and after). I'll try to find someone to review, but that will probably be next week.

Thank you.

huahuajhu and others added 4 commits June 18, 2026 16:08
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
Signed-off-by: Hua Hua <huahuajhu@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: jingxin yu <huahuajhu@gmail.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 18, 2026
@huahuajhu

Author

Thanks for the question. I tested the checkpoint consolidation path on Databricks using a Unity Catalog volume, since that is the filesystem behavior this PR changes.

Test environment:

  • Databricks workspace with Unity Catalog enabled
  • Catalogs available: samples, system, workspace
  • Test volume path: /Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627
  • Created:
    • schema: workspace.automodel_pr2627
    • volume: workspace.automodel_pr2627.checkpoints
  • Verified the volume path exists from Python.

Test scope:

  • This was a CPU-only Databricks UC-volume smoke test for the safetensors consolidation writer.
  • It directly exercises consolidate_safetensors_files(..., use_staging=False) on a UC volume.
  • It does not measure end-to-end GPU training throughput yet.

Before:

  • Code: NVIDIA-NeMo/Automodel@main
  • Resolved commit: 83e4aad1ed49068c22f8ce527742e727215c0323
  • Test: write sharded safetensors input, then consolidate to the UC volume with use_staging=False
  • Result: failed
  • Error:
    OSError: [Errno 29] Illegal seek
  • Failure occurred inside:
    nemo_automodel/components/checkpoint/_backports/consolidate_hf_safetensors.py
    consolidate_safetensors_files(...)
    _consolidate_safetensors_files(...)
    _write_data(...)

After:

  • Code: this PR branch
  • Resolved commit: c2158e7
  • Same UC volume path and same input tensors
  • Test wrote:
    /Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627/cpu_smoke_output/model-00001-of-00001.safetensors
    /Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627/cpu_smoke_output/model.safetensors.index.json
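
The report above lists the files that were written. One way to sanity-check such a shard without loading any tensors is to read its header and confirm that the declared byte ranges account for the whole file. The sketch below uses only the standard library; `check_safetensors_layout` is a hypothetical helper, it assumes tensor byte ranges are contiguous from offset 0 (which the format's validation expects), and it is not a description of how the test in this thread was verified.

```python
import json
import os
import struct
import tempfile


def check_safetensors_layout(path):
    """Return (sorted tensor names, True if header offsets match file size)."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry

    # Walk the byte ranges in order and require them to be gap-free.
    cursor = 0
    for begin, end in sorted(entry["data_offsets"] for entry in header.values()):
        if begin != cursor or end < begin:
            return sorted(header), False
        cursor = end
    return sorted(header), 8 + header_len + cursor == file_size


# Demonstration on a tiny hand-built file, then on a truncated copy of it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tiny.safetensors")
    demo_header = json.dumps(
        {"w": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]}}
    ).encode("utf-8")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(demo_header)) + demo_header + b"\x01\x02\x03\x04")
    names, layout_ok = check_safetensors_layout(demo_path)

    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(demo_header)) + demo_header + b"\x01\x02")
    _, truncated_ok = check_safetensors_layout(demo_path)
```

On a UC volume the same function could be pointed at the written shard path; a mismatch would suggest a truncated or partially written file.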

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 19, 2026
Development

Successfully merging this pull request may close these issues.

staging-free consolidation for databricks
