Skip to content

extsort: fix CRLF off-by-one by switching from line() to record()#3790

Merged
jqnatividad merged 3 commits intomasterfrom
extsort-position-delta-fix
Apr 29, 2026
Merged

extsort: fix CRLF off-by-one by switching from line() to record()#3790
jqnatividad merged 3 commits intomasterfrom
extsort-position-delta-fix

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

  • Fixes an off-by-one bug in extsort CSV mode where, for CRLF-terminated input, the last data record was dropped on the floor and the first record was duplicated in the output. LF-terminated input was unaffected.
  • Root cause: the first pass wrote csv::Position::line() to the temp file and the second pass did idxfile.seek(position - position_delta) with position_delta = 2 (headers) or 1 (no headers). That assumed line() is the absolute source line — true for LF input, but the csv crate produced different line() values for CRLF input (e.g. 1..9 rather than 2..10 for a 9-row synthetic file with headers). The first-record breakage was masked by saturating_sub; the last-record breakage produced duplicate output.
  • Switch the temp-file payload to idx_position.record() (a 1-indexed count of records read so far, set by add_record after each parse and independent of line terminator) and use a constant position_delta = 1. record() - 1 always yields the 0-indexed record index that Indexed::seek expects.
  • The existing extsort_csvmode golden fixture (adur-public-toilets-extsorted-csvmode.csv) is LF-terminated, so the old code was correct for it and the fixture is unchanged.

Verified manually: a 9-row CRLF input that previously dropped val1 and duplicated val9 now sorts correctly. Adur (LF) and the extsort_issue_2391 fixture are byte-identical to the previous output.

Test plan

  • cargo test extsort -F all_features — all three tests (extsort_linemode, extsort_csvmode, extsort_issue_2391) pass.
  • cargo test -F all_features — full suite passes (2731 passed, 29 ignored).
  • cargo clippy --bin qsv -F all_features -- -D warnings — clean.
  • Manual repro: qsv extsort -s val on a 9-row CRLF file outputs all 9 rows in correct sorted order (previously last row missing, first duplicated).

🤖 Generated with Claude Code

The previous code wrote `csv::Position::line()` to the temp file and
then subtracted `position_delta` (2 with headers, 1 without) when
seeking the index. That worked for LF-terminated input where
`Position::line()` is the absolute source line — header on line 1,
first data record on line 2, last on line N+1, so `seek(line-2)`
mapped 2..N+1 → 0..N-1 correctly.

For CRLF-terminated input the csv crate produced different
`line()` values (1..N for the first..last data record in a 9-row
synthetic file), so `seek(line-2)` clamped via `saturating_sub`
on the first record (masking the bug) and dropped the last record
on the floor — producing output where the last record was missing
and the first was duplicated.

`Position::record()` is a 1-indexed count of records read so far
(it is incremented by `add_record` after each parse and is
independent of line terminator), so it is consistent across LF and
CRLF input. Switch the temp-file payload to `record()` and use a
constant `position_delta = 1` — `record()-1` always yields the
0-indexed record index that `Indexed::seek` expects.

The existing `extsort_csvmode` golden fixture
(`adur-public-toilets-extsorted-csvmode.csv`) is LF-terminated, so
the old code was correct for it and the fixture is unchanged. All
three extsort tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 29, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Roborev review of the prior commit (job 1821) flagged that the
hard-coded `position_delta = 1` regressed the no-headers path: in
`--no-headers` mode `csv::Reader::byte_records()` re-emits the row
buffered by `byte_headers()` with its original `Position` attached
(`record() = 0`), so subsequent records have `record() = 1..N-1`.
With a constant `position_delta = 1` the first record was duplicated
(saturating_sub clamped 0→0) and the last was dropped — the same
symptom the previous commit set out to fix.

Restore the conditional: `position_delta = 0` for `--no-headers`,
`1` otherwise. `record() - position_delta` then yields the 0-indexed
seek key in both modes. Trim the in-line comment to the user-visible
invariant and drop the implementation-detail justification.

Also add two new tests covering the previously uncovered branch:
- `extsort_csvmode_no_headers` (LF + --no-headers)
- `extsort_csvmode_crlf_no_headers` (CRLF + --no-headers)

Both build an index, run extsort with `--no-headers`, and assert
every input row appears exactly once in the output. They fail
against the buggy `position_delta = 1` constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a CRLF-specific off-by-one bug in qsv extsort CSV mode by switching the temp-file “position” payload from csv::Position::line() to csv::Position::record(), which is stable across LF/CRLF inputs, and adjusts the seek offset logic accordingly. Adds targeted regression tests to ensure no-headers + (LF/CRLF) paths preserve all rows exactly once.

Changes:

  • Write idx_position.record() (instead of line()) into the sort-key temp file for CSV-mode extsort.
  • Update the second-pass seek offset calculation to match record() semantics and header presence.
  • Add regression tests covering CSV-mode --no-headers for both LF and CRLF inputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/cmd/extsort.rs Switches position tracking from line() to record() and updates the index seek offset logic for CSV-mode extsort.
tests/test_extsort.rs Adds regression tests for CSV-mode --no-headers with LF and CRLF inputs to prevent row duplication/dropping regressions.

Comment thread src/cmd/extsort.rs Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jqnatividad jqnatividad merged commit c657f70 into master Apr 29, 2026
12 of 13 checks passed
@jqnatividad jqnatividad deleted the extsort-position-delta-fix branch April 29, 2026 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants