While testing #1727, I noticed that https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/utils/merge_file_prefixes.py is a large bottleneck. From some light investigation, it looks like there are a few places where we could add parallelism to improve the runtime.