feat(qa): return start/end character offsets for question answering (#1245) by Suh0161 · Pull Request #1671 · huggingface/transformers.js

Suh0161 · 2026-05-01T19:02:48Z

Summary

Adds HF-style start / end on the question-answering pipeline output: half-open character indices [start, end) into context, aligned with Hugging Face transformers behavior (issue #1245).
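To illustrate the half-open convention described above, here is a minimal sketch that slices an answer out of `context` using `start` and `end`. The helper `sliceAnswer` is hypothetical, and the result object is hand-written for the example rather than taken from a real pipeline run.

```javascript
// Hypothetical helper (not part of transformers.js): extracts the answer text
// from `context` using half-open character indices [start, end).
function sliceAnswer(context, result) {
  const { start, end } = result;
  const valid =
    Number.isInteger(start) &&
    Number.isInteger(end) &&
    start >= 0 &&
    end >= start &&
    end <= context.length;
  if (!valid) {
    throw new RangeError(
      `Invalid span [${start}, ${end}) for context of length ${context.length}`,
    );
  }
  // `end` is exclusive, so it can be passed to slice() unchanged.
  return context.slice(start, end);
}

const context = "My name is Wolfgang and I live in Berlin.";
// Hand-written example result; a real pipeline call would produce its own values.
const exampleResult = { answer: "Berlin", score: 0.99, start: 34, end: 40 };

console.log(sliceAnswer(context, exampleResult) === exampleResult.answer);
```

Because `end` is exclusive, `end - start` equals the answer length and no `+ 1` adjustment is needed when slicing.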

Changes

Thread return_offsets_mapping through tokenizer _call / _encode_plus and expose offset_mapping when requested.
For text_pair, compute offsets in the correct segment (question vs context; BERT-style [SEP] / token_type_ids handling).
QA pipeline maps predicted token spans to character offsets in context; pad offset_mapping consistently with input_ids.
When return_tensor: true, keep offset_mapping as nested JS arrays (not a tensor), with the same batch unwrap behavior the pipeline expects.
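The span-mapping and padding steps listed above could look roughly like the following sketch. It is not the PR's actual code: the function names are hypothetical, and it assumes BERT-style `token_type_ids` (1 for context tokens) and that special or padded positions carry `[0, 0]`, as the Python fast tokenizers do.

```javascript
// Hypothetical sketch: map a predicted token span to half-open character
// offsets in `context`.
// offsetMapping: array of [charStart, charEnd) pairs, one per token.
// tokenTypeIds: 0 for question tokens, 1 for context tokens (assumed BERT-style).
function tokenSpanToCharSpan(offsetMapping, tokenTypeIds, startToken, endToken) {
  // Only tokens in the context segment have offsets relative to `context`.
  if (tokenTypeIds[startToken] !== 1 || tokenTypeIds[endToken] !== 1) {
    return null;
  }
  const start = offsetMapping[startToken][0];
  const end = offsetMapping[endToken][1];
  // Special tokens are assumed to be [0, 0], which yields an empty span.
  if (end <= start) {
    return null;
  }
  return { start, end };
}

// Hypothetical sketch: pad (or truncate) offsets to the padded input_ids length.
function padOffsets(offsetMapping, length) {
  const padded = offsetMapping.slice(0, length);
  while (padded.length < length) {
    padded.push([0, 0]);
  }
  return padded;
}

// Hand-written encoding of question "Where?" and context "I live in Berlin."
// Tokens: [CLS] where ? [SEP] i live in berlin . [SEP]
const exampleContext = "I live in Berlin.";
const offsets = [
  [0, 0], [0, 5], [5, 6], [0, 0],
  [0, 1], [2, 6], [7, 9], [10, 16], [16, 17], [0, 0],
];
const typeIds = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1];

const span = tokenSpanToCharSpan(offsets, typeIds, 7, 7);
console.log(span, exampleContext.slice(span.start, span.end));
```

Returning `null` for spans that touch the question segment or a special token keeps the caller from reporting offsets that do not index into `context`.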

Tests

Tokenizer Offset mapping cases (BERT uncased, GPT-2 ByteLevel, batching/padding, text_pair).
QA pipeline expectations updated for start / end.

Closes #1245

Thread return_offsets_mapping through tokenizer _call/_encode_plus; pair-aware offset_mapping for text/text_pair.
Question answering maps predicted spans to half-open indices in context; pad offset_mapping alongside input_ids.
Add tokenizer offset tests (BERT/GPT-2 batching, text_pair) and update QA expectations.

Closes huggingface#1245

Co-authored-by: Cursor <cursoragent@cursor.com>

nico-martin

Hi @Suh0161,

I think that's a useful feature that also aligns with the transformers library: https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/pipelines/question_answering.py#L243-L250

nico-martin · 2026-06-08T08:41:04Z

 * @property {number} score The probability associated to the answer.
- * @property {number} [start] The character start index of the answer (in the tokenized version of the input).
- * @property {number} [end] The character end index of the answer (in the tokenized version of the input).
+ * @property {number} start The answer start offset (character index **in `context`**; slice with `context.slice(start, end)`).


The source JSDoc now makes start and end required and context-relative, but the checked-in declarations still have the old optional fields and old wording in packages/transformers/types/pipelines/question-answering.d.ts. The tokenizer declarations also do not expose return_offsets_mapping or offset_mapping in packages/transformers/types/tokenization_utils.d.ts.

TypeScript users will not see the new API correctly. Please regenerate the type declarations and include them in the PR.

nico-martin · 2026-06-08T08:43:27Z

+        // Ġ (U+0120) is used by GPT-2's ByteLevel pre-tokenizer.
+        // ▁ (U+2581) is used by SentencePiece (LLaMA, Mistral, T5, …).
+        const byteLevelSpacePrefix = token.startsWith('\u0120');
+        const clean = token.replace(/^[\u0120\u2581]+/, '');


The offset reconstruction strips ByteLevel/SentencePiece prefixes, but not WordPiece continuation prefixes.

BERT-style tokenizers commonly emit continuation tokens like ##ing or ##s. Those strings are not present in the original text, so indexOf(clean, pos) fails and the PR records [0, 0]. That means QA answers whose start or end token is a continuation token can return incorrect start / end values, often 0.

Please handle WordPiece continuation tokens, or preferably use tokenizer-provided offsets if the underlying tokenizer exposes them. At minimum, add a test where the expected answer starts or ends on a ##... token.
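One way the requested WordPiece handling might look is sketched below. This is an illustration under stated assumptions, not the PR's code: `reconstructOffsets` is a hypothetical name, the tokenizer is assumed to be uncased, and the search is the same `indexOf`-based scan the review describes, extended to strip a leading `##`.

```javascript
// Hypothetical sketch: rebuild [start, end) character offsets by scanning
// `text` for each token in order.
function reconstructOffsets(text, tokens) {
  // Assumes an uncased tokenizer. Note that lowercasing can change the
  // string length for a few Unicode characters, which this sketch ignores.
  const haystack = text.toLowerCase();
  const offsets = [];
  let pos = 0;
  for (const token of tokens) {
    // Strip ByteLevel (U+0120), SentencePiece (U+2581), or WordPiece (##) markers.
    const clean = token.replace(/^(?:[\u0120\u2581]+|##)/, "");
    const idx = clean.length > 0 ? haystack.indexOf(clean, pos) : -1;
    if (idx === -1) {
      // Token text not found (for example a special token): record [0, 0].
      offsets.push([0, 0]);
      continue;
    }
    offsets.push([idx, idx + clean.length]);
    pos = idx + clean.length;
  }
  return offsets;
}

const sampleText = "She was running fast";
const sampleTokens = ["[CLS]", "she", "was", "run", "##ning", "fast", "[SEP]"];
const sampleOffsets = reconstructOffsets(sampleText, sampleTokens);
console.log(sampleOffsets);
```

A string scan like this remains fragile: text that literally begins a token with `##`, or normalization that changes characters, will still produce wrong offsets. That is why tokenizer-provided offsets, where available, are the preferable source.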

Suh0161 mentioned this pull request May 4, 2026

Auto device fallback on provider failure and tokenizer_options passthrough in Text2TextGenerationPipeline #1670

Closed

nico-martin self-assigned this Jun 8, 2026

nico-martin requested changes Jun 8, 2026

Suh0161 wants to merge 1 commit into huggingface:main from Suh0161:feat/qa-start-end-1245
