Add return_offset_mapping support to PreTrainedTokenizer#1702
Add return_offset_mapping support to PreTrainedTokenizer#1702anidoesdev wants to merge 4 commits into
Conversation
nico-martin
left a comment
There was a problem hiding this comment.
Hi @anidoesdev,
Thanks for looking into this.
First, I have one question about the API design. You chose return_offset_mapping: boolean, but in the Python transformers library, this is exposed as return_offsets_mapping with an s. Since the PR summary says this matches Python behavior, I suggest renaming the option to return_offsets_mapping.
Overall, I think this is a good starting point, but I do not think it is ready to ship as a general public API yet. Correct offset mapping depends on the tokenizer's native normalization, pre-tokenization, byte-level handling, and post-processing behavior. Reconstructing spans from token strings works for simple examples, but can silently produce wrong character spans for common inputs such as accents, Unicode text, byte-level BPE tokenizers, and paired sequences.
Do you think you could look into whether we can use native offsets from the tokenizer backend? Happy to review again after that.
|
For sure, I will look into it and get back to you as soon as possible |
|
Hi @nico-martin, I looked into the native offset approach and wanted to align on direction before implementing. The current backend is a pure JS reimplementation that doesn't track character offsets internally so there's no offset data to expose from encode() today. I found a few paths forward:
My instinct is option 1 is the right call, but I wanted to check: is contributing to @huggingface/tokenizers in scope for this PR, or would you prefer a different approach? Happy to go whichever direction makes sense. |
Considering this is available inside the rust tokenizers library, I think it's well within scope to add to So, I'd say we can migrate this PR to there? 🤗 |
|
I agree. The core logic should be in https://github.com/huggingface/tokenizers.js |
|
Thanks for the direction! I'll migrate the implementation to https://github.com/huggingface/tokenizers.js and open a PR there. I'll link back here once it's up. |
|
Hi @nico-martin and @xenova, I have opened the PR here: [huggingface/tokenizers.js#30] This implements native offset tracking directly in @huggingface/tokenizers across all four tokenizer families (BERT, GPT-2, T5, RoBERTa). Once that PR is merged and the package version is bumped here, computeOffsetMapping can be deleted entirely and _encode_plus will read offsets directly from the encode result. |
Summary
Closes #425
Adds
return_offset_mappingparameter toPreTrainedTokenizer, matching the behaviour of the Pythontransformerslibrary. When enabled, the tokenizer output includes anoffset_mappingfield containing[start, end]character pairs for each token, where the positions refer to the original (pre-tokenization) string.Changes
src/tokenization_utils.jscomputeOffsetMapping(text, tokens, special_tokens_set)— a private helper that reconstructs character offsets from the token strings returned by@huggingface/tokenizers. It handles the three common subword prefix conventions:##word— BERT-style WordPiece continuation (no space before)Ġword— GPT-2 / RoBERTa BPE (space-preceded word)▁word— SentencePiece word boundary[CLS],[SEP],[PAD], etc.) always map to[0, 0]return_offset_mappingoption toTokenizerCallOptionstypedef and_call/_encode_plusmethodspadHelpercall to padoffset_mappingwith[0, 0](not0)offset_mappingis excluded from Tensor conversion and always returned as a plain JS array, consistent with the Python librarytests/tokenizers.test.jsreturn_tensor=trueUsage