feat(qa): return start/end character offsets for question answering (#1245)#1671

Open
Suh0161 wants to merge 1 commit into
huggingface:mainfrom
Suh0161:feat/qa-start-end-1245

@Suh0161 Suh0161 commented May 1, 2026

Summary

Adds HF-style start / end on the question-answering pipeline output: half-open character indices [start, end) into context, aligned with Hugging Face transformers behavior (issue #1245).
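As a minimal illustration of the half-open convention described above: the result object below is a hand-written, hypothetical example (not real pipeline output), and its `start` / `end` are derived with `indexOf` purely for demonstration.

```javascript
// Hypothetical QA result; the values are illustrative, not real model output.
const qaContext = 'My name is Sarah and I live in London.';
const qaAnswer = 'London';

// Half-open range [start, end): `start` is inclusive, `end` is exclusive.
const qaStart = qaContext.indexOf(qaAnswer);
const qaResult = {
  answer: qaAnswer,
  score: 0.98, // made-up score
  start: qaStart,
  end: qaStart + qaAnswer.length,
};

// Because the range is half-open, slice() recovers the answer directly,
// and end - start equals the answer length.
const recovered = qaContext.slice(qaResult.start, qaResult.end);
console.log(recovered, qaResult.start, qaResult.end);
```

With this convention no `+ 1` / `- 1` adjustment is needed when slicing.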

Changes

  • Thread return_offsets_mapping through tokenizer _call / _encode_plus and expose offset_mapping when requested.
  • For text_pair, compute offsets in the correct segment (question vs context; BERT-style [SEP] / token_type_ids handling).
  • QA pipeline maps predicted token spans to character offsets in context; pad offset_mapping consistently with input_ids.
  • When return_tensor: true, keep offset_mapping as nested JS arrays (not a tensor), with the same batch unwrap behavior the pipeline expects.
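The span-to-offset step and the padding step listed above can be sketched roughly as follows. This is not the PR's code: `tokenSpanToCharSpan` and `padOffsetMapping` are hypothetical helper names, and the sketch assumes BERT-style `token_type_ids` (0 for question, 1 for context), context-relative offsets for context tokens, and `[0, 0]` for special and padding tokens.

```javascript
/**
 * Map a predicted token span to half-open character offsets in the context.
 * Returns null if the span does not lie entirely inside the context segment.
 * (Hypothetical helper, for illustration only.)
 */
function tokenSpanToCharSpan(offsetMapping, tokenTypeIds, startToken, endToken) {
  if (startToken > endToken) return null;
  // Assumption: segment id 1 marks context tokens (BERT-style pairs).
  if (tokenTypeIds[startToken] !== 1 || tokenTypeIds[endToken] !== 1) return null;
  const start = offsetMapping[startToken][0];
  const end = offsetMapping[endToken][1];
  // Assumption: special/padding tokens carry [0, 0], so an empty span is invalid.
  if (start >= end) return null;
  return { start, end };
}

/** Pad offset_mapping with [0, 0] so it stays the same length as input_ids. */
function padOffsetMapping(offsetMapping, targetLength) {
  const padded = offsetMapping.slice();
  while (padded.length < targetLength) padded.push([0, 0]);
  return padded;
}

// Hand-written example: [CLS] where ? [SEP] i live in london . [SEP]
const exampleContext = 'I live in London.';
const exampleTypeIds = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1];
const exampleOffsets = [
  [0, 0], [0, 5], [5, 6], [0, 0],                   // question segment
  [0, 1], [2, 6], [7, 9], [10, 16], [16, 17], [0, 0], // context segment
];

const span = tokenSpanToCharSpan(exampleOffsets, exampleTypeIds, 7, 7);
console.log(span, exampleContext.slice(span.start, span.end));
```

Rejecting spans that touch the question segment keeps a question-relative offset from being reported as if it indexed into `context`.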

Tests

  • Tokenizer Offset mapping cases (BERT uncased, GPT-2 ByteLevel, batching/padding, text_pair).
  • QA pipeline expectations updated for start / end.

Closes #1245

Thread return_offsets_mapping through tokenizer _call/_encode_plus; pair-aware offset_mapping for text/text_pair. Question answering maps predicted spans to half-open indices in context; pad offset_mapping alongside input_ids. Add tokenizer offset tests (BERT/GPT-2 batching, text_pair) and update QA expectations.

Closes huggingface#1245

Co-authored-by: Cursor <cursoragent@cursor.com>

@nico-martin nico-martin left a comment

Collaborator

Hi @Suh0161,

I think that's a useful feature, and it also aligns with the transformers library: https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/pipelines/question_answering.py#L243-L250

* @property {number} score The probability associated to the answer.
* @property {number} [start] The character start index of the answer (in the tokenized version of the input).
* @property {number} [end] The character end index of the answer (in the tokenized version of the input).
* @property {number} start The answer start offset (character index **in `context`**; slice with `context.slice(start, end)`).

The source JSDoc now makes start and end required and context-relative, but the checked-in declarations still have the old optional fields and old wording in packages/transformers/types/pipelines/question-answering.d.ts. The tokenizer declarations also do not expose return_offsets_mapping or offset_mapping in packages/transformers/types/tokenization_utils.d.ts.

TypeScript users will not see the new API correctly. Please regenerate the type declarations and include them in the PR.

// Ġ (U+0120) is used by GPT-2's ByteLevel pre-tokenizer.
// ▁ (U+2581) is used by SentencePiece (LLaMA, Mistral, T5, …).
const byteLevelSpacePrefix = token.startsWith('\u0120');
const clean = token.replace(/^[\u0120\u2581]+/, '');

The offset reconstruction strips ByteLevel/SentencePiece prefixes, but not WordPiece continuation prefixes:

BERT-style tokenizers commonly emit continuation tokens like ##ing or ##s. Those strings are not present in the original text, so indexOf(clean, pos) fails and the PR records [0, 0]. That means QA answers whose start or end token is a continuation token can return incorrect start / end values, often 0.

Please handle WordPiece continuation tokens, or preferably use tokenizer-provided offsets if the underlying tokenizer exposes them. At minimum, add a test where the expected answer starts or ends on a ##... token.
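One possible shape for the minimal fix, as a rough sketch rather than the PR's actual implementation: `reconstructOffsets` is a hypothetical name, the function assumes an uncased vocabulary (it lowercases the text, which can shift indices for the few characters whose lowercase form has a different length), and it would mis-handle a token that literally begins with `##` in the source text. Tokenizer-provided offsets, where available, avoid these caveats.

```javascript
/**
 * Reconstruct [start, end) character offsets by scanning the text left to right.
 * Strips ByteLevel (U+0120) / SentencePiece (U+2581) markers and the
 * WordPiece "##" continuation prefix before searching.
 * (Hypothetical helper, for illustration only.)
 */
function reconstructOffsets(text, tokens) {
  const haystack = text.toLowerCase(); // assumption: uncased vocabulary
  const offsets = [];
  let pos = 0;
  for (const token of tokens) {
    const clean = token.replace(/^(?:##|[\u0120\u2581]+)/, '').toLowerCase();
    const idx = clean.length > 0 ? haystack.indexOf(clean, pos) : -1;
    if (idx === -1) {
      // Special tokens such as [CLS] / [SEP] are not present in the text.
      offsets.push([0, 0]);
      continue;
    }
    offsets.push([idx, idx + clean.length]);
    pos = idx + clean.length; // never search behind an already matched token
  }
  return offsets;
}

// The answer "playing" ends on the continuation token "##ing".
const wpText = 'He is playing chess';
const wpTokens = ['[CLS]', 'he', 'is', 'play', '##ing', 'chess', '[SEP]'];
const wpOffsets = reconstructOffsets(wpText, wpTokens);
console.log(wpOffsets);
console.log(wpText.slice(wpOffsets[3][0], wpOffsets[4][1]));
```

A test along these lines, where the expected answer ends on `##ing`, would cover the case described above.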

Development

Successfully merging this pull request may close these issues.

QuestionAnsweringOutput does not return start/end index

2 participants