Optimize `merge_file_prefixes` for merging bin/idx files

While testing https://github.com/NVIDIA-NeMo/Curator/pull/1727, I noticed that https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/utils/merge_file_prefixes.py is a large bottleneck. From some light investigation, there appear to be a few places where we can add parallelism to improve the runtime.