Add return_offset_mapping support to PreTrainedTokenizer by anidoesdev · Pull Request #1702 · huggingface/transformers.js

anidoesdev · 2026-06-08T00:01:00Z

Summary

Closes #425

Adds a return_offset_mapping parameter to PreTrainedTokenizer, matching the behaviour of the Python transformers library. When enabled, the tokenizer output includes an offset_mapping field containing one [start, end] pair of character positions per token (end exclusive), where the positions index into the original (pre-tokenization) string.

Changes

src/tokenization_utils.js

Added computeOffsetMapping(text, tokens, special_tokens_set), a private helper that reconstructs character offsets from the token strings returned by @huggingface/tokenizers. It handles the three common subword prefix conventions:
- ##word: BERT-style WordPiece continuation (no space before)
- Ġword: GPT-2 / RoBERTa BPE (space-preceded word)
- ▁word: SentencePiece word boundary

Special tokens ([CLS], [SEP], [PAD], etc.) always map to [0, 0].

Other changes in this file:
- Added the return_offset_mapping option to the TokenizerCallOptions typedef and to the _call / _encode_plus methods
- Updated the padHelper call to pad offset_mapping with [0, 0] (not 0)
- offset_mapping is excluded from Tensor conversion and always returned as a plain JS array, consistent with the Python library
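
To make the reconstruction approach concrete, here is a minimal sketch of aligning token strings back to the input. It is not the PR's computeOffsetMapping: the function name sketchOffsetMapping, the lowercasing step (which assumes an uncased tokenizer), and the fallback for unmatched tokens are all assumptions made for illustration, and the token list is typed in by hand rather than produced by a tokenizer.

```javascript
// Hypothetical sketch, NOT the helper added in this PR.
// Walks the tokens left to right, strips the subword marker, and searches
// for the remaining surface form in the (lowercased) input from a cursor.
function sketchOffsetMapping(text, tokens, specialTokens = new Set()) {
  // Assumption: the tokenizer is uncased and lowercasing keeps string length.
  const haystack = text.toLowerCase();
  const offsets = [];
  let cursor = 0;

  for (const token of tokens) {
    if (specialTokens.has(token)) {
      offsets.push([0, 0]); // special tokens have no span in the input
      continue;
    }
    // Remove a leading "##", "Ġ" or "▁" marker to recover the surface form.
    const surface = token.replace(/^(##|Ġ|▁)/, "").toLowerCase();
    const start = surface.length > 0 ? haystack.indexOf(surface, cursor) : -1;
    if (start === -1) {
      // The token cannot be found verbatim (e.g. normalization changed it).
      // This sketch emits an empty span at the cursor instead of guessing.
      offsets.push([cursor, cursor]);
      continue;
    }
    const end = start + surface.length;
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

// Hand-written token list, assumed to match an uncased WordPiece vocabulary.
const text = "Hello, I am John";
const tokens = ["[CLS]", "hello", ",", "i", "am", "john", "[SEP]"];
const special = new Set(["[CLS]", "[SEP]", "[PAD]"]);
console.log(JSON.stringify(sketchOffsetMapping(text, tokens, special)));
// [[0,0],[0,5],[5,6],[7,8],[9,11],[12,16],[0,0]]
```

The cursor keeps repeated words from all mapping to the first occurrence, but the search only succeeds when the token text appears verbatim in the input.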

tests/tokenizers.test.js

Added 7 tests covering: default behaviour (opt-in), single string, subword tokens, batched input, padding, truncation, and return_tensor=true

Usage

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/bert-base-uncased");

const text = "Hello, I am John";

const output = tokenizer(text, {
  return_tensor: false,
  return_offset_mapping: true,
});

console.log(output.offset_mapping);
// [[0,0], [0,5], [5,6], [7,8], [9,11], [12,16], [0,0]]
//   CLS   Hello    ,     I      am      John     SEP

// Map each predicted NER tag back to its span in the original string:
output.offset_mapping.forEach(([start, end], i) => {
  if (start === 0 && end === 0) return; // skip special tokens
  console.log(`Token ${i}: "${text.slice(start, end)}" → [${start}, ${end}]`);
});
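
As a follow-up to the NER use case mentioned in the usage snippet, the sketch below merges consecutive tokens that share a label into one span of the original string. The function name extractEntities, the label scheme ("O" for no entity), and the hard-coded offsets and labels are assumptions for illustration; no model or tokenizer is called here.

```javascript
// Hypothetical helper: groups adjacent tokens with the same label into
// one entity and slices its text out of the original string.
function extractEntities(text, offsets, labels) {
  const entities = [];
  let current = null;

  offsets.forEach(([start, end], i) => {
    const label = labels[i];
    const isSpecial = start === 0 && end === 0; // same convention as above
    if (isSpecial || label === "O") {
      current = null; // close any open entity
      return;
    }
    if (current && current.label === label) {
      current.end = end; // extend the open entity over this token
    } else {
      current = { label, start, end };
      entities.push(current);
    }
  });

  return entities.map((e) => ({ ...e, text: text.slice(e.start, e.end) }));
}

// Hand-written offsets and labels standing in for tokenizer/model output.
const sentence = "Hello, I am John Smith";
const offsets = [[0, 0], [0, 5], [5, 6], [7, 8], [9, 11], [12, 16], [17, 22], [0, 0]];
const labels = ["O", "O", "O", "O", "O", "PER", "PER", "O"];
console.log(extractEntities(sentence, offsets, labels));
// [ { label: 'PER', start: 12, end: 22, text: 'John Smith' } ]
```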

nico-martin

Hi @anidoesdev,

Thanks for looking into this.

First, I have one question about the API design. You chose return_offset_mapping: boolean, but in the Python transformers library, this is exposed as return_offsets_mapping with an s. Since the PR summary says this matches Python behavior, I suggest renaming the option to return_offsets_mapping.

Overall, I think this is a good starting point, but I do not think it is ready to ship as a general public API yet. Correct offset mapping depends on the tokenizer's native normalization, pre-tokenization, byte-level handling, and post-processing behavior. Reconstructing spans from token strings works for simple examples, but can silently produce wrong character spans for common inputs such as accents, Unicode text, byte-level BPE tokenizers, and paired sequences.

Do you think you could look into whether we can use native offsets from the tokenizer backend? Happy to review again after that.
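
The failure mode described in this review can be reproduced without any tokenizer. The sketch below assumes an uncased, accent-stripping normalizer (as used by uncased BERT models) and a hypothetical naiveSpan helper that looks for the token text verbatim in the input. Once normalization has changed the characters, the search finds nothing.

```javascript
// Hypothetical helper: find a token's surface form in the lowercased input.
function naiveSpan(text, token, from = 0) {
  const start = text.toLowerCase().indexOf(token, from);
  return start === -1 ? null : [start, start + token.length];
}

// Plain ASCII input aligns as expected.
console.log(naiveSpan("Cafe au lait", "cafe")); // [ 0, 4 ]

// With a precomposed "é" (U+00E9), an accent-stripping normalizer would
// still produce the token "cafe", but that string never occurs in the input.
const accented = "Caf\u00e9 au lait";
console.log(naiveSpan(accented, "cafe")); // null

// What such a normalizer roughly does: decompose, drop combining marks, lowercase.
const normalized = "Caf\u00e9".normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
console.log(normalized); // cafe

// Byte-level BPE works on UTF-8 bytes, where "é" is two symbols, not one,
// so token string lengths no longer match character counts either.
console.log(new TextEncoder().encode("\u00e9").length); // 2
```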

anidoesdev · 2026-06-08T12:13:22Z

For sure, I will look into it and get back to you as soon as possible

anidoesdev · 2026-06-08T17:39:16Z

Hi @nico-martin,

I looked into the native offset approach and wanted to align on direction before implementing.

The current backend is a pure JS reimplementation that doesn't track character offsets internally, so there's no offset data to expose from encode() today. I found a few paths forward:

1. Add offset tracking to @huggingface/tokenizers - thread a char-position cursor through encode_text across the normalization and pre-tokenization stages, the same way the Rust library does. This is the correct universal fix (works in browser + Node.js) but requires a PR to that repo first.
2. Use the NAPI tokenizers package - the official Rust binding exposes getOffsets() natively, but it only works in Node.js (no browser support), so it'd be a partial solution.
3. Build a WASM wrapper - compile the Rust tokenizers crate to WASM to get native offsets universally. Correct, but significant standalone build infrastructure work.

My instinct is option 1 is the right call, but I wanted to check: is contributing to @huggingface/tokenizers in scope for this PR, or would you prefer a different approach? Happy to go whichever direction makes sense.
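
The idea behind option 1 can be sketched in a few lines: normalize the text while recording, for every normalized character, which span of the original it came from, then map pre-tokenized pieces back through that alignment. This is only an illustration under stated assumptions. The names normalizeWithAlignment and encodeWithOffsets are hypothetical, the normalizer (NFD, strip combining marks, lowercase) and the regex pre-tokenizer are simplified stand-ins, and nothing here reflects the actual internals of @huggingface/tokenizers or the Rust library.

```javascript
// Hypothetical sketch of offset tracking through normalization.
// For every UTF-16 code unit of the normalized string, alignment[i] holds
// the [start, end) span of the original string that produced it.
function normalizeWithAlignment(text) {
  let normalized = "";
  const alignment = [];
  let pos = 0;
  for (const ch of text) {
    // Iterating a string yields code points; ch.length is 1 or 2 code units.
    const end = pos + ch.length;
    const out = ch.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
    for (let i = 0; i < out.length; i++) alignment.push([pos, end]);
    normalized += out;
    pos = end;
  }
  return { normalized, alignment };
}

// Simplified pre-tokenizer: runs of letters/digits, or single other symbols.
function encodeWithOffsets(text) {
  const { normalized, alignment } = normalizeWithAlignment(text);
  const pieces = [];
  for (const m of normalized.matchAll(/[\p{L}\p{N}]+|[^\s\p{L}\p{N}]/gu)) {
    const first = alignment[m.index];
    const last = alignment[m.index + m[0].length - 1];
    // Offsets refer to the ORIGINAL string, not the normalized one.
    pieces.push({ token: m[0], offsets: [first[0], last[1]] });
  }
  return pieces;
}

const input = "Caf\u00e9, I am John";
for (const { token, offsets } of encodeWithOffsets(input)) {
  console.log(token, offsets, JSON.stringify(input.slice(...offsets)));
}
// cafe [ 0, 4 ] "Café"
// , [ 4, 5 ] ","
// i [ 6, 7 ] "I"
// am [ 8, 10 ] "am"
// john [ 11, 15 ] "John"
```

Because the alignment is built during normalization rather than reconstructed afterwards, the accented input maps back to the right span even though the token text no longer matches it.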

xenova · 2026-06-10T04:26:00Z

> Use the NAPI tokenizers package - the official Rust binding exposes getOffsets() natively, but it only works in Node.js (no browser support), so it'd be a partial solution.

Considering this is available inside the rust tokenizers library, I think it's well within scope to add to @huggingface/tokenizers. The repo is at https://github.com/huggingface/tokenizers.js

So, I'd say we can migrate this PR to there? 🤗

nico-martin · 2026-06-10T06:47:20Z

I agree. The core logic should be in https://github.com/huggingface/tokenizers.js

anidoesdev · 2026-06-10T08:15:51Z

Thanks for the direction! I'll migrate the implementation to https://github.com/huggingface/tokenizers.js and open a PR there. I'll link back here once it's up.

anidoesdev · 2026-06-11T01:11:14Z

Hi @nico-martin and @xenova,

I have opened the PR here: [huggingface/tokenizers.js#30]

This implements native offset tracking directly in @huggingface/tokenizers across all four tokenizer families (BERT, GPT-2, T5, RoBERTa). Once that PR is merged and the package version is bumped here, computeOffsetMapping can be deleted entirely and _encode_plus will read offsets directly from the encode result.

anidoesdev added 3 commits June 8, 2026 05:01

added return_offset_mapping support to PreTrainedTokenizer

fa94faa

Add JSDoc limitations note to return_offset_mapping option

d9db4ec

auto-fix formatting

24ffdfe

nico-martin self-assigned this Jun 8, 2026

nico-martin requested changes Jun 8, 2026


spell_error

15146b9

anidoesdev mentioned this pull request Jun 11, 2026

Add offset tracking to Encoding huggingface/tokenizers.js#30

Open

anidoesdev wants to merge 4 commits into huggingface:main from anidoesdev:main
