feat(qa): return start/end character offsets for question answering (#1245) #1671
Suh0161 wants to merge 1 commit
Thread return_offsets_mapping through tokenizer _call/_encode_plus; pair-aware offset_mapping for text/text_pair. Question answering maps predicted spans to half-open indices in context; pad offset_mapping alongside input_ids. Add tokenizer offset tests (BERT/GPT-2 batching, text_pair) and update QA expectations.

Closes huggingface#1245

Co-authored-by: Cursor <cursoragent@cursor.com>
nico-martin left a comment
Hi @Suh0161,
I think that's a useful feature that also aligns with the transformers library: https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/pipelines/question_answering.py#L243-L250
 * @property {number} score The probability associated to the answer.
 * @property {number} [start] The character start index of the answer (in the tokenized version of the input).
 * @property {number} [end] The character end index of the answer (in the tokenized version of the input).
 * @property {number} start The answer start offset (character index **in `context`**; slice with `context.slice(start, end)`).
The source JSDoc now makes start and end required and context-relative, but the checked-in declarations still have the old optional fields and old wording in packages/transformers/types/pipelines/question-answering.d.ts. The tokenizer declarations also do not expose return_offsets_mapping or offset_mapping in packages/transformers/types/tokenization_utils.d.ts.
TypeScript users will not see the new API correctly. Please regenerate the type declarations and include them in the PR.
// Ġ (U+0120) is used by GPT-2's ByteLevel pre-tokenizer.
// ▁ (U+2581) is used by SentencePiece (LLaMA, Mistral, T5, …).
const byteLevelSpacePrefix = token.startsWith('\u0120');
const clean = token.replace(/^[\u0120\u2581]+/, '');
The offset reconstruction strips ByteLevel/SentencePiece prefixes, but not WordPiece continuation prefixes:
BERT-style tokenizers commonly emit continuation tokens like ##ing or ##s. Those strings are not present in the original text, so indexOf(clean, pos) fails and the PR records [0, 0]. That means QA answers whose start or end token is a continuation token can return incorrect start / end values, often 0.
Please handle WordPiece continuation tokens, or preferably use tokenizer-provided offsets if the underlying tokenizer exposes them. At minimum, add a test where the expected answer starts or ends on a ##... token.
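To make the request concrete, here is a minimal sketch of a left-to-right offset reconstruction that also strips the WordPiece `##` prefix. The function name `reconstructOffsets` and the lowercase matching are assumptions made for illustration and are not taken from the PR. Tokenizer-provided offsets, where available, would still be preferable to this kind of string search.

```javascript
// Hypothetical helper, not the PR's actual code: rebuilds [start, end) character
// offsets by scanning the text left to right for each token's surface form.
function reconstructOffsets(text, tokens) {
    // Assumption: uncased matching, as with a BERT uncased vocabulary.
    // Caveat: toLowerCase() can change string length for a few Unicode characters.
    const haystack = text.toLowerCase();
    const offsets = [];
    let pos = 0;
    for (const token of tokens) {
        // Strip markers that are not part of the raw text:
        // "##" (WordPiece continuation), Ġ (U+0120, ByteLevel), ▁ (U+2581, SentencePiece).
        const clean = token.replace(/^(?:##|[\u0120\u2581]+)/, '').toLowerCase();
        const start = clean.length > 0 ? haystack.indexOf(clean, pos) : -1;
        if (start === -1) {
            // Special tokens such as [CLS] or [SEP] have no span in the text.
            offsets.push([0, 0]);
            continue;
        }
        offsets.push([start, start + clean.length]);
        pos = start + clean.length;
    }
    return offsets;
}

const text = 'She was playing outside';
const tokens = ['[CLS]', 'she', 'was', 'play', '##ing', 'outside', '[SEP]'];
const offsets = reconstructOffsets(text, tokens);
console.log(offsets[4]); // [ 12, 15 ], the span of "ing"
```

With the `##` prefix stripped, the continuation token `##ing` resolves to `[12, 15]` instead of falling back to `[0, 0]`. A test along these lines, with an answer that starts or ends on a `##...` token, would cover the case described above.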
Summary
Adds HF-style `start`/`end` on the question-answering pipeline output: half-open character indices `[start, end)` into `context`, aligned with Hugging Face `transformers` behavior (issue #1245).

Changes
- Thread `return_offsets_mapping` through tokenizer `_call`/`_encode_plus` and expose `offset_mapping` when requested.
- For `text_pair`, compute offsets in the correct segment (question vs context; BERT-style `[SEP]`/`token_type_ids` handling).
- Map predicted spans to half-open indices in `context`; pad `offset_mapping` consistently with `input_ids`.
- With `return_tensor: true`, keep `offset_mapping` as nested JS arrays (not a tensor), with the same batch unwrap behavior the pipeline expects.

Tests
- Add tokenizer `Offset mapping` cases (BERT uncased, GPT-2 ByteLevel, batching/padding, `text_pair`).
- Update QA expectations for `start`/`end`.

Closes #1245
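For readers unfamiliar with half-open offsets, the span mapping and padding described in this summary can be sketched roughly as follows. The helper names `padOffsets` and `spanToCharRange`, the right-padding choice, and the hand-written offsets are assumptions for illustration, not the PR's implementation.

```javascript
// Hypothetical helpers for illustration; names are not taken from the PR.

// Pad a per-token offset list to `length` with [0, 0], mirroring how input_ids
// would be padded (assumption: right padding).
function padOffsets(offsetMapping, length) {
    const padded = offsetMapping.map(([s, e]) => [s, e]);
    while (padded.length < length) padded.push([0, 0]);
    return padded;
}

// Convert an inclusive token span into a half-open character range [start, end).
function spanToCharRange(offsetMapping, startToken, endToken) {
    return {
        start: offsetMapping[startToken][0],
        end: offsetMapping[endToken][1],
    };
}

const context = 'The capital of France is Paris.';
// Hand-written offsets for: [CLS] the capital of france is paris . [SEP]
const offsetMapping = [
    [0, 0], [0, 3], [4, 11], [12, 14], [15, 21], [22, 24], [25, 30], [30, 31], [0, 0],
];
const padded = padOffsets(offsetMapping, 12);

// Suppose the model predicts token 6 ("paris") as both start and end of the answer.
const { start, end } = spanToCharRange(padded, 6, 6);
const answer = context.slice(start, end);
console.log(start, end, answer); // 25 30 Paris
```

Because `end` is exclusive, `context.slice(start, end)` returns the answer text directly, with no `+ 1` adjustment.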