extsort: fix CRLF off-by-one by switching from line() to record()#3790
Merged
jqnatividad merged 3 commits intomasterfrom Apr 29, 2026
Merged
extsort: fix CRLF off-by-one by switching from line() to record()#3790jqnatividad merged 3 commits intomasterfrom
jqnatividad merged 3 commits intomasterfrom
Conversation
The previous code wrote `csv::Position::line()` to the temp file and then subtracted `position_delta` (2 with headers, 1 without) when seeking the index. That worked for LF-terminated input where `Position::line()` is the absolute source line — header on line 1, first data record on line 2, last on line N+1, so `seek(line-2)` mapped 2..N+1 → 0..N-1 correctly. For CRLF-terminated input the csv crate produced different `line()` values (1..N for the first..last data record in a 9-row synthetic file), so `seek(line-2)` clamped via `saturating_sub` on the first record (masking the bug) and dropped the last record on the floor — producing output where the last record was missing and the first was duplicated. `Position::record()` is a 1-indexed count of records read so far (it is incremented by `add_record` after each parse and is independent of line terminator), so it is consistent across LF and CRLF input. Switch the temp-file payload to `record()` and use a constant `position_delta = 1` — `record()-1` always yields the 0-indexed record index that `Indexed::seek` expects. The existing `extsort_csvmode` golden fixture (`adur-public-toilets-extsorted-csvmode.csv`) is LF-terminated, so the old code was correct for it and the fixture is unchanged. All three extsort tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Roborev review of the prior commit (job 1821) flagged that the hard-coded `position_delta = 1` regressed the no-headers path: in `--no-headers` mode `csv::Reader::byte_records()` re-emits the row buffered by `byte_headers()` with its original `Position` attached (`record() = 0`), so subsequent records have `record() = 1..N-1`. With a constant `position_delta = 1` the first record was duplicated (saturating_sub clamped 0→0) and the last was dropped — the same symptom the previous commit set out to fix. Restore the conditional: `position_delta = 0` for `--no-headers`, `1` otherwise. `record() - position_delta` then yields the 0-indexed seek key in both modes. Trim the in-line comment to the user-visible invariant and drop the implementation-detail justification. Also add two new tests covering the previously uncovered branch: - `extsort_csvmode_no_headers` (LF + --no-headers) - `extsort_csvmode_crlf_no_headers` (CRLF + --no-headers) Both build an index, run extsort with `--no-headers`, and assert every input row appears exactly once in the output. They fail against the buggy `position_delta = 1` constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a CRLF-specific off-by-one bug in qsv extsort CSV mode by switching the temp-file “position” payload from csv::Position::line() to csv::Position::record(), which is stable across LF/CRLF inputs, and adjusts the seek offset logic accordingly. Adds targeted regression tests to ensure no-headers + (LF/CRLF) paths preserve all rows exactly once.
Changes:
- Write
idx_position.record()(instead ofline()) into the sort-key temp file for CSV-mode extsort. - Update the second-pass seek offset calculation to match
record()semantics and header presence. - Add regression tests covering CSV-mode
--no-headersfor both LF and CRLF inputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/cmd/extsort.rs |
Switches position tracking from line() to record() and updates the index seek offset logic for CSV-mode extsort. |
tests/test_extsort.rs |
Adds regression tests for CSV-mode --no-headers with LF and CRLF inputs to prevent row duplication/dropping regressions. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
extsortCSV mode where, for CRLF-terminated input, the last data record was dropped on the floor and the first record was duplicated in the output. LF-terminated input was unaffected.csv::Position::line()to the temp file and the second pass dididxfile.seek(position - position_delta)withposition_delta = 2(headers) or1(no headers). That assumedline()is the absolute source line — true for LF input, but the csv crate produced differentline()values for CRLF input (e.g.1..9rather than2..10for a 9-row synthetic file with headers). The first-record breakage was masked bysaturating_sub; the last-record breakage produced duplicate output.idx_position.record()(a 1-indexed count of records read so far, set byadd_recordafter each parse and independent of line terminator) and use a constantposition_delta = 1.record() - 1always yields the 0-indexed record index thatIndexed::seekexpects.extsort_csvmodegolden fixture (adur-public-toilets-extsorted-csvmode.csv) is LF-terminated, so the old code was correct for it and the fixture is unchanged.Verified manually: a 9-row CRLF input that previously dropped
val1and duplicatedval9now sorts correctly. Adur (LF) and theextsort_issue_2391fixture are byte-identical to the previous output.Test plan
cargo test extsort -F all_features— all three tests (extsort_linemode,extsort_csvmode,extsort_issue_2391) pass.cargo test -F all_features— full suite passes (2731 passed, 29 ignored).cargo clippy --bin qsv -F all_features -- -D warnings— clean.qsv extsort -s valon a 9-row CRLF file outputs all 9 rows in correct sorted order (previously last row missing, first duplicated).🤖 Generated with Claude Code