Skip to content

Parallelize COLMAP image downsampling in dataset loader#928

Closed
GabrielDeml wants to merge 1 commit into
nerfstudio-project:mainfrom
GabrielDeml:colmap-threaded-resize
Closed

Parallelize COLMAP image downsampling in dataset loader#928
GabrielDeml wants to merge 1 commit into
nerfstudio-project:mainfrom
GabrielDeml:colmap-threaded-resize

Conversation

@GabrielDeml

Copy link
Copy Markdown

Multithreaded COLMAP image downsampling. About 4.8x speed improvement for 240 images (360_v2 counter dataset) on an 8 core machine.

"""Resize image folder."""
print(f"Downscaling images by {factor}x from {image_dir} to {resized_dir}.")
if num_workers <= 0:
num_workers = os.cpu_count()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.cpu_count() is documented to return None if the count is indeterminable (e.g., constrained sandboxes, some container runtimes). In that case num_workers becomes None, which makes the log message read (None threads). and relies on ThreadPoolExecutor's own fallback. Consider num_workers = os.cpu_count() or 1 (or a small explicit default like 4) so the log line and downstream behavior stay well-defined.


Generated by a review agent (code review)


with ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = [executor.submit(_resize_image, *t) for t in tasks]
for future in tqdm(as_completed(futures), total=len(futures)):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one resize task raises, future.result() re-raises and exits the as_completed loop, leaving the remaining futures' exceptions unobserved (they will be silently consumed when the with block awaits shutdown). For a dataset preprocessing step this is usually fine, but a single corrupt JPEG will currently mask any other failures and produce an opaque traceback. Consider catching per-task and aggregating (or at least logging image_path on failure) so users can identify the offending file.


Generated by a review agent (code review)

int(round(image.shape[0] / factor)),
)
resized_image = np.array(Image.fromarray(image).resize(resized_size, Image.BICUBIC))
imageio.imwrite(resized_path, resized_image)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new _resize_image helper is the unit of work executed concurrently by ThreadPoolExecutor, and it carries three behaviors that are easy to silently break: (1) early-return when the destination already exists (used as a poor-man's resume), (2) recursive os.makedirs for nested output paths, and (3) bicubic downscaling to (round(W/factor), round(H/factor)) via PIL. None of these are covered by an existing test — there are no tests anywhere under tests/ that import from examples/datasets/colmap.py. A regression here (e.g. swapping width/height in resized_size, or accidentally overwriting cached outputs) would corrupt training inputs without surfacing in CI.

Suggested test (in tests/test_colmap_dataset.py):

import os
import sys
from pathlib import Path

import imageio.v2 as imageio
import numpy as np
import pytest

# examples/ is not on sys.path by default
EXAMPLES_DIR = Path(__file__).resolve().parents[1] / "examples"
sys.path.insert(0, str(EXAMPLES_DIR))
from datasets.colmap import _resize_image  # noqa: E402


def _write_image(path: Path, shape=(40, 60, 3)) -> np.ndarray:
    path.parent.mkdir(parents=True, exist_ok=True)
    img = (np.random.RandomState(0).rand(*shape) * 255).astype(np.uint8)
    imageio.imwrite(path, img)
    return img


def test_resize_image_writes_downscaled_png(tmp_path):
    src = tmp_path / "src.jpg"
    dst = tmp_path / "out" / "dst.png"
    _write_image(src, shape=(40, 60, 3))

    _resize_image(str(src), str(dst), factor=2)

    assert dst.is_file(), "destination file was not written"
    out = imageio.imread(dst)
    # PIL.resize takes (W, H) so output array shape is (H/factor, W/factor, 3)
    assert out.shape == (20, 30, 3)


def test_resize_image_skips_when_destination_exists(tmp_path):
    src = tmp_path / "src.jpg"
    dst = tmp_path / "out" / "dst.png"
    _write_image(src, shape=(40, 60, 3))
    dst.parent.mkdir(parents=True, exist_ok=True)
    sentinel = (np.ones((5, 5, 3), dtype=np.uint8) * 42)
    imageio.imwrite(dst, sentinel)
    sentinel_bytes = dst.read_bytes()

    _resize_image(str(src), str(dst), factor=2)

    # Existing file must be left untouched (cache-resume semantics).
    assert dst.read_bytes() == sentinel_bytes


def test_resize_image_creates_missing_parent_directories(tmp_path):
    src = tmp_path / "src.jpg"
    dst = tmp_path / "a" / "b" / "c" / "dst.png"
    _write_image(src, shape=(20, 30, 3))
    assert not dst.parent.exists()

    _resize_image(str(src), str(dst), factor=1)

    assert dst.is_file()
    out = imageio.imread(dst)
    assert out.shape == (20, 30, 3)


def test_resize_image_rounds_dimensions(tmp_path):
    # 25 / 3 = 8.33 -> rounds to 8; 31 / 3 = 10.33 -> rounds to 10
    src = tmp_path / "src.jpg"
    dst = tmp_path / "dst.png"
    _write_image(src, shape=(25, 31, 3))

    _resize_image(str(src), str(dst), factor=3)

    out = imageio.imread(dst)
    assert out.shape == (8, 10, 3)

Generated by a review agent (test coverage)

@vince-brisebois

Copy link
Copy Markdown
Collaborator

Thanks for the contribution @GabrielDeml. We're trying to up our test coverage, so if you agree with the proposed test we'd appreciate if you could add it, or otherwise test the change. The other finds were minor, and we can merge once resolved. Cheers!

@gsplat-engage-bot

Copy link
Copy Markdown

Thanks for your contribution! We haven't seen activity here in a while. If you're still interested, please let us know within 30 days — otherwise we'll close this out as we see fit. Thank you!

@GabrielDeml GabrielDeml closed this by deleting the head repository Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants