Add return_offset_mapping support to PreTrainedTokenizer#1702

Open
anidoesdev wants to merge 4 commits into
huggingface:mainfrom
anidoesdev:main

Conversation

@anidoesdev

Summary

Closes #425

Adds return_offset_mapping parameter to PreTrainedTokenizer, matching the behaviour of the Python transformers library. When enabled, the tokenizer output includes an offset_mapping field containing [start, end] character pairs for each token, where the positions refer to the original (pre-tokenization) string.

Changes

src/tokenization_utils.js

  • Added computeOffsetMapping(text, tokens, special_tokens_set) — a private helper that reconstructs character offsets from the token strings returned by @huggingface/tokenizers. It handles the three common subword prefix conventions:
    • ##word — BERT-style WordPiece continuation (no space before)
    • Ġword — GPT-2 / RoBERTa BPE (space-preceded word)
    • ▁word — SentencePiece word boundary
    • Special tokens ([CLS], [SEP], [PAD], etc.) always map to [0, 0]
  • Added return_offset_mapping option to TokenizerCallOptions typedef and _call / _encode_plus methods
  • Updated padHelper call to pad offset_mapping with [0, 0] (not 0)
  • offset_mapping is excluded from Tensor conversion and always returned as a plain JS array, consistent with the Python library
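As a rough illustration of the reconstruction approach described above, here is a minimal sketch. It is not the code from this PR: the function name `sketchOffsetMapping`, the hand-written token lists, and the lowercasing step (an assumption for an uncased model) are all hypothetical.

```javascript
// Hypothetical sketch of rebuilding offsets from token strings; not the PR's code.
const CONTINUATION_PREFIX = "##"; // WordPiece continuation marker
const WORD_START_MARKERS = ["Ġ", "▁"]; // byte-level BPE and SentencePiece markers

function sketchOffsetMapping(text, tokens, specialTokens) {
  // Assumption: an uncased model, and lowercasing does not change string length
  // (not true for every character, e.g. "İ").
  const haystack = text.toLowerCase();
  const offsets = [];
  let cursor = 0; // search position in the original text

  for (const token of tokens) {
    if (specialTokens.has(token)) {
      offsets.push([0, 0]); // special tokens have no span in the input
      continue;
    }
    // Strip the subword marker to recover the surface form.
    let surface = token;
    if (surface.startsWith(CONTINUATION_PREFIX)) {
      surface = surface.slice(CONTINUATION_PREFIX.length);
    } else if (WORD_START_MARKERS.includes(surface[0])) {
      surface = surface.slice(1);
    }
    const start = surface.length > 0 ? haystack.indexOf(surface.toLowerCase(), cursor) : -1;
    if (start === -1) {
      offsets.push([cursor, cursor]); // could not align; emit an empty span
      continue;
    }
    const end = start + surface.length;
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

const special = new Set(["[CLS]", "[SEP]", "[PAD]"]);
const text = "Hello, I am John";
// Token strings written by hand for illustration.
const offsets = sketchOffsetMapping(
  text,
  ["[CLS]", "hello", ",", "i", "am", "john", "[SEP]"],
  special,
);
console.log(JSON.stringify(offsets));
// [[0,0],[0,5],[5,6],[7,8],[9,11],[12,16],[0,0]]

const wordpiece = sketchOffsetMapping("tokenization", ["token", "##ization"], special);
console.log(JSON.stringify(wordpiece)); // [[0,5],[5,12]]

const bpe = sketchOffsetMapping("Hello world", ["Hello", "Ġworld"], special);
console.log(JSON.stringify(bpe)); // [[0,5],[6,11]]
```

The sketch only searches forward for each token's surface form, so it relies on the token text appearing verbatim (up to case) in the input.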

tests/tokenizers.test.js

  • Added 7 tests covering: default behaviour (opt-in), single string, subword tokens, batched input, padding, truncation, and return_tensor=true
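For the padding case, the idea of padding with `[0, 0]` pairs rather than a scalar `0` can be sketched as follows. The helper name `padOffsetMappings` and its signature are hypothetical and are not the `padHelper` from the codebase.

```javascript
// Hypothetical helper: pad every offset list in a batch to the same length
// using [0, 0] pairs, so padded positions look like special tokens.
function padOffsetMappings(batch, padLength, side = "right") {
  return batch.map((offsets) => {
    const missing = Math.max(0, padLength - offsets.length);
    // Create a fresh [0, 0] array per slot so entries are not shared references.
    const padding = Array.from({ length: missing }, () => [0, 0]);
    return side === "right" ? [...offsets, ...padding] : [...padding, ...offsets];
  });
}

const padded = padOffsetMappings(
  [
    [[0, 0], [0, 5], [0, 0]],
    [[0, 0], [0, 5], [6, 11], [0, 0]],
  ],
  4,
);
console.log(JSON.stringify(padded));
// [[[0,0],[0,5],[0,0],[0,0]],[[0,0],[0,5],[6,11],[0,0]]]
```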

Usage

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/bert-base-uncased");

const text = "Hello, I am John";

const output = tokenizer(text, {
  return_tensor: false,
  return_offset_mapping: true,
});

console.log(output.offset_mapping);
// [[0,0], [0,5], [5,6], [7,8], [9,11], [12,16], [0,0]]
//   CLS   Hello    ,     I      am      John     SEP

// Map each token (and, for example, its predicted NER tag) back to its span in the original string:
output.offset_mapping.forEach(([start, end], i) => {
  if (start === 0 && end === 0) return; // skip special tokens, which map to [0, 0]
  console.log(`Token ${i}: "${text.slice(start, end)}" → [${start}, ${end}]`);
});

@nico-martin nico-martin self-assigned this Jun 8, 2026

@nico-martin nico-martin left a comment

Collaborator

Hi @anidoesdev,

Thanks for looking into this.

First, I have one question about the API design. You chose return_offset_mapping: boolean, but in the Python transformers library, this is exposed as return_offsets_mapping with an s. Since the PR summary says this matches Python behavior, I suggest renaming the option to return_offsets_mapping.

Overall, I think this is a good starting point, but I do not think it is ready to ship as a general public API yet. Correct offset mapping depends on the tokenizer's native normalization, pre-tokenization, byte-level handling, and post-processing behavior. Reconstructing spans from token strings works for simple examples, but can silently produce wrong character spans for common inputs such as accents, Unicode text, byte-level BPE tokenizers, and paired sequences.
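To make the concern concrete, here is a small self-contained illustration of how normalization breaks string-based reconstruction. The input string is made up for the example, and the normalization forms are only stand-ins for what a real tokenizer's normalizer might do.

```javascript
// Illustration only: offsets computed on a normalized string do not
// necessarily index the original string.
const original = "\uFB01ne caf\u00E9"; // "ﬁne café", starting with the one-character ligature U+FB01

// NFKC expands the ligature into "fi", so every later character shifts by one.
const normalized = original.normalize("NFKC");
console.log(original.length, normalized.length); // 8 9

const word = "caf\u00E9";
const start = normalized.indexOf(word);
const end = start + word.length;
console.log(JSON.stringify(normalized.slice(start, end))); // "café"
console.log(JSON.stringify(original.slice(start, end))); // "afé" (wrong span)

// Accent stripping is worse for a string search: the token text no longer
// occurs in the original at all.
const strippedToken = "cafe";
console.log(original.indexOf(strippedToken)); // -1
```

In both cases nothing throws, which is why such errors can go unnoticed.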

Do you think you could look into whether we can use native offsets from the tokenizer backend? Happy to review again after that.

@anidoesdev

Author

For sure, I will look into it and get back to you as soon as possible.

@anidoesdev

anidoesdev commented Jun 8, 2026

Author

Hi @nico-martin,

I looked into the native offset approach and wanted to align on direction before implementing.

The current backend is a pure JS reimplementation that doesn't track character offsets internally, so there is no offset data to expose from encode() today. I found a few paths forward:

  1. Add offset tracking to @huggingface/tokenizers - thread a char-position cursor through encode_text across the normalization and pre-tokenization stages, the same way the Rust library does. This is the correct universal fix (works in browser + Node.js) but requires a PR to that repo first.

  2. Use the NAPI tokenizers package - the official Rust binding exposes getOffsets() natively, but it only works in Node.js (no browser support), so it'd be a partial solution.

  3. Build a WASM wrapper - compile the Rust tokenizers crate to WASM to get native offsets universally. Correct, but significant standalone build infrastructure work.
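The cursor-threading idea in option 1 can be sketched roughly as below. This is a simplified illustration, not the Rust library's implementation and not code from @huggingface/tokenizers: the function names are hypothetical, and the normalizer (NFKD, accent stripping, lowercasing) and whitespace pre-tokenizer are assumptions chosen to keep the example short.

```javascript
// Hypothetical sketch: normalize while recording, for each UTF-16 unit of the
// normalized string, the [start, end) range of the original it came from.
function normalizeWithAlignment(text) {
  let normalized = "";
  const alignment = []; // alignment[i] = original range of normalized[i]
  let pos = 0;
  for (const ch of text) { // iterates by code point
    const start = pos;
    const end = pos + ch.length; // 1 or 2 UTF-16 units
    pos = end;
    // Assumed normalizer: compatibility decomposition, strip combining marks, lowercase.
    const mapped = ch.normalize("NFKD").replace(/\p{Mn}/gu, "").toLowerCase();
    for (let i = 0; i < mapped.length; i++) {
      normalized += mapped[i];
      alignment.push([start, end]);
    }
  }
  return { normalized, alignment };
}

// Assumed pre-tokenizer: split on whitespace, then map each word's span in
// the normalized string back to the original through the alignment table.
function wordsWithOffsets(text) {
  const { normalized, alignment } = normalizeWithAlignment(text);
  const words = [];
  for (const match of normalized.matchAll(/\S+/g)) {
    const first = match.index;
    const last = match.index + match[0].length - 1;
    words.push({ word: match[0], offsets: [alignment[first][0], alignment[last][1]] });
  }
  return words;
}

const input = "\uFB01ne Caf\u00E9"; // "ﬁne Café"
const words = wordsWithOffsets(input);
console.log(JSON.stringify(words));
// [{"word":"fine","offsets":[0,3]},{"word":"cafe","offsets":[4,8]}]
console.log(words.map(({ offsets: [s, e] }) => input.slice(s, e)));
// [ 'ﬁne', 'Café' ]
```

Because the offsets come from the alignment table rather than a string search, they stay valid even though the normalized words no longer appear in the input.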

My instinct is option 1 is the right call, but I wanted to check: is contributing to @huggingface/tokenizers in scope for this PR, or would you prefer a different approach? Happy to go whichever direction makes sense.

@xenova

xenova commented Jun 10, 2026

Collaborator

> Use the NAPI tokenizers package - the official Rust binding exposes getOffsets() natively, but it only works in Node.js (no browser support), so it'd be a partial solution.

Considering this is available inside the rust tokenizers library, I think it's well within scope to add to @huggingface/tokenizers. The repo is at https://github.com/huggingface/tokenizers.js

So, I'd say we can migrate this PR to there? 🤗

@nico-martin

Collaborator

I agree. The core logic should be in https://github.com/huggingface/tokenizers.js

@anidoesdev

Author

Thanks for the direction! I'll migrate the implementation to https://github.com/huggingface/tokenizers.js and open a PR there. I'll link back here once it's up.

@anidoesdev

anidoesdev commented Jun 11, 2026

Author

Hi @nico-martin and @xenova,

I have opened the PR here: [huggingface/tokenizers.js#30]

This implements native offset tracking directly in @huggingface/tokenizers across all four tokenizer families (BERT, GPT-2, T5, RoBERTa). Once that PR is merged and the package version is bumped here, computeOffsetMapping can be deleted entirely and _encode_plus will read offsets directly from the encode result.

Development

Successfully merging this pull request may close these issues.

[Feature request] Return offset mapping using tokenizer

3 participants