Parallelize COLMAP image downsampling in dataset loader#928
Conversation
| """Resize image folder.""" | ||
| print(f"Downscaling images by {factor}x from {image_dir} to {resized_dir}.") | ||
| if num_workers <= 0: | ||
| num_workers = os.cpu_count() |
There was a problem hiding this comment.
os.cpu_count() is documented to return None if the count is indeterminable (e.g., constrained sandboxes, some container runtimes). In that case num_workers becomes None, which makes the log message read (None threads). and relies on ThreadPoolExecutor's own fallback. Consider num_workers = os.cpu_count() or 1 (or a small explicit default like 4) so the log line and downstream behavior stay well-defined.
Generated by a review agent (code review)
|
|
||
| with ThreadPoolExecutor(max_workers=num_workers) as executor: | ||
| futures = [executor.submit(_resize_image, *t) for t in tasks] | ||
| for future in tqdm(as_completed(futures), total=len(futures)): |
There was a problem hiding this comment.
If one resize task raises, future.result() re-raises and exits the as_completed loop, leaving the remaining futures' exceptions unobserved (they will be silently consumed when the with block awaits shutdown). For a dataset preprocessing step this is usually fine, but a single corrupt JPEG will currently mask any other failures and produce an opaque traceback. Consider catching per-task and aggregating (or at least logging image_path on failure) so users can identify the offending file.
Generated by a review agent (code review)
| int(round(image.shape[0] / factor)), | ||
| ) | ||
| resized_image = np.array(Image.fromarray(image).resize(resized_size, Image.BICUBIC)) | ||
| imageio.imwrite(resized_path, resized_image) |
There was a problem hiding this comment.
The new _resize_image helper is the unit of work executed concurrently by ThreadPoolExecutor, and it carries three behaviors that are easy to silently break: (1) early-return when the destination already exists (used as a poor-man's resume), (2) recursive os.makedirs for nested output paths, and (3) bicubic downscaling to (round(W/factor), round(H/factor)) via PIL. None of these are covered by an existing test — there are no tests anywhere under tests/ that import from examples/datasets/colmap.py. A regression here (e.g. swapping width/height in resized_size, or accidentally overwriting cached outputs) would corrupt training inputs without surfacing in CI.
Suggested test (in tests/test_colmap_dataset.py):
import os
import sys
from pathlib import Path
import imageio.v2 as imageio
import numpy as np
import pytest
# examples/ is not on sys.path by default
EXAMPLES_DIR = Path(__file__).resolve().parents[1] / "examples"
sys.path.insert(0, str(EXAMPLES_DIR))
from datasets.colmap import _resize_image # noqa: E402
def _write_image(path: Path, shape=(40, 60, 3)) -> np.ndarray:
path.parent.mkdir(parents=True, exist_ok=True)
img = (np.random.RandomState(0).rand(*shape) * 255).astype(np.uint8)
imageio.imwrite(path, img)
return img
def test_resize_image_writes_downscaled_png(tmp_path):
src = tmp_path / "src.jpg"
dst = tmp_path / "out" / "dst.png"
_write_image(src, shape=(40, 60, 3))
_resize_image(str(src), str(dst), factor=2)
assert dst.is_file(), "destination file was not written"
out = imageio.imread(dst)
# PIL.resize takes (W, H) so output array shape is (H/factor, W/factor, 3)
assert out.shape == (20, 30, 3)
def test_resize_image_skips_when_destination_exists(tmp_path):
src = tmp_path / "src.jpg"
dst = tmp_path / "out" / "dst.png"
_write_image(src, shape=(40, 60, 3))
dst.parent.mkdir(parents=True, exist_ok=True)
sentinel = (np.ones((5, 5, 3), dtype=np.uint8) * 42)
imageio.imwrite(dst, sentinel)
sentinel_bytes = dst.read_bytes()
_resize_image(str(src), str(dst), factor=2)
# Existing file must be left untouched (cache-resume semantics).
assert dst.read_bytes() == sentinel_bytes
def test_resize_image_creates_missing_parent_directories(tmp_path):
src = tmp_path / "src.jpg"
dst = tmp_path / "a" / "b" / "c" / "dst.png"
_write_image(src, shape=(20, 30, 3))
assert not dst.parent.exists()
_resize_image(str(src), str(dst), factor=1)
assert dst.is_file()
out = imageio.imread(dst)
assert out.shape == (20, 30, 3)
def test_resize_image_rounds_dimensions(tmp_path):
# 25 / 3 = 8.33 -> rounds to 8; 31 / 3 = 10.33 -> rounds to 10
src = tmp_path / "src.jpg"
dst = tmp_path / "dst.png"
_write_image(src, shape=(25, 31, 3))
_resize_image(str(src), str(dst), factor=3)
out = imageio.imread(dst)
assert out.shape == (8, 10, 3)Generated by a review agent (test coverage)
|
Thanks for the contribution @GabrielDeml. We're trying to up our test coverage, so if you agree with the proposed test we'd appreciate if you could add it, or otherwise test the change. The other finds were minor, and we can merge once resolved. Cheers! |
|
Thanks for your contribution! We haven't seen activity here in a while. If you're still interested, please let us know within 30 days — otherwise we'll close this out as we see fit. Thank you! |
Multithreaded COLMAP image downsampling. About 4.8x speed improvement for 240 images (360_v2 counter dataset) on an 8 core machine.