[Bug]: Docstring does not align with behaviour #531

@RiccardoPazzi

What happened?

I was using the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model to embed chunks of text and noticed a mismatch: the docstring in the model list says inputs are truncated at 512 tokens, but the tokenizer at embedder.model.tokenizer is configured to truncate at 128 tokens.
We tested the embedding procedure, and the embedder does appear to truncate at 128 tokens, which matches what the paraphrase model documentation specifies.

What is the expected behaviour?

The docstring and documentation should align with the actual token truncation behaviour.

A minimal reproducible example

from fastembed import TextEmbedding

embedder = TextEmbedding(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Shows the actual truncation settings (including max_length) of the tokenizer
print(embedder.model.tokenizer.truncation)

# Truncation can be changed with enable_truncation; the length argument is max_length
embedder.model.tokenizer.enable_truncation(max_length=512)
# However, this is not the default setting for paraphrase-multilingual
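
To make the mismatch easy to check, here is a minimal sketch of a helper that compares the documented limit with what the tokenizer is actually configured to do. It assumes the tokenizer is a Hugging Face `tokenizers.Tokenizer`, whose `truncation` property is either `None` or a dict containing a `max_length` key. The names `effective_max_length` and `matches_documented_limit` are hypothetical and not part of FastEmbed.

```python
from typing import Any, Optional


def effective_max_length(tokenizer: Any) -> Optional[int]:
    """Return the configured truncation length, or None if truncation is off.

    Assumes a `tokenizers.Tokenizer`-like object whose `truncation`
    attribute is None or a dict with a "max_length" entry.
    """
    params = tokenizer.truncation
    if params is None:
        return None
    return params["max_length"]


def matches_documented_limit(tokenizer: Any, documented: int) -> bool:
    """True only when the tokenizer truncates at exactly the documented length."""
    return effective_max_length(tokenizer) == documented


# Intended usage (requires fastembed and a model download, so left as comments):
# from fastembed import TextEmbedding
# embedder = TextEmbedding(
#     model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# )
# print(effective_max_length(embedder.model.tokenizer))
# print(matches_documented_limit(embedder.model.tokenizer, documented=512))
```

If the behaviour described in this issue holds, the first call would report 128 and the second would report a mismatch against the documented 512.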

What Python version are you on? e.g. python --version

python3.12

FastEmbed version

v0.6.0

What os are you seeing the problem on?

Linux

Relevant stack traces and/or logs
