Optimize merge_file_prefixes for merging bin/idx files #1979

@sarahyurick

Description

While testing #1727, I noticed that https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/utils/merge_file_prefixes.py is a significant bottleneck. From some light investigation, there appear to be a few places where we could add parallelism to improve the runtime.
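
As a rough illustration of the kind of parallelism meant here, below is a minimal sketch. It is not the actual `merge_file_prefixes` code: `read_shard` and `merge_shards` are hypothetical names, and plain byte concatenation is an assumption standing in for the real bin/idx merge logic. The idea is to read the input shards concurrently while still appending them to the output in their original order.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_shard(path: Path) -> bytes:
    """Read one shard fully into memory (hypothetical helper)."""
    return path.read_bytes()


def merge_shards(shard_paths: list[Path], output_path: Path, max_workers: int = 4) -> int:
    """Concatenate shards into output_path, reading them concurrently.

    Returns the total number of bytes written.
    """
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool, output_path.open("wb") as out:
        # Executor.map yields results in input order, so the merged layout
        # does not depend on which read happens to finish first.
        for data in pool.map(read_shard, shard_paths):
            out.write(data)
            total += len(data)
    return total
```

Note that `Executor.map` submits all reads up front, so for very large shards this could hold many of them in memory at once; a bounded queue or chunked reads would likely be needed in practice. A real merge would presumably also have to rebuild the `.idx` side, which this sketch ignores.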

Labels

enhancement (New feature or request)
