What happened?
I was using the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model to embed chunks of text and noticed that, while the docstring in the model list specifies truncation at 512 input tokens, the tokenizer at embedder.model.tokenizer actually truncates at 128 tokens.
We tested the embedding procedure, and the embedder does appear to truncate at 128 tokens, as specified in the paraphrase model documentation.
What is the expected behaviour?
The docstring and documentation should align with the actual token truncation behaviour.
A minimal reproducible example
from fastembed import TextEmbedding
embedder = TextEmbedding(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Shows the actual token truncation window
print(embedder.model.tokenizer.truncation)
# Truncation can be changed with
embedder.model.tokenizer.enable_truncation(max_length=512)
# However, this is not the default setting for paraphrase-multilingual
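For anyone who wants to confirm the effective limit empirically rather than read it from the tokenizer settings, here is a rough sketch of the kind of check we ran. It is not part of FastEmbed: `is_truncated_at` and `effective_word_limit` are hypothetical helper names, and `embed` is assumed to be any callable that maps a string to a sequence of floats. The idea is that once a text already fills the truncation window, appending another word should leave the embedding unchanged.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: any function mapping a text to its embedding vector.
EmbedFn = Callable[[str], Sequence[float]]


def is_truncated_at(embed: EmbedFn, n_words: int,
                    word: str = "hello", tol: float = 1e-6) -> bool:
    """Return True if appending one extra word to an n_words-long text
    leaves the embedding unchanged (the extra word was cut off)."""
    base = " ".join([word] * n_words)
    longer = base + " different"
    a, b = embed(base), embed(longer)
    return max(abs(x - y) for x, y in zip(a, b)) <= tol


def effective_word_limit(embed: EmbedFn, max_words: int = 1024) -> Optional[int]:
    """Binary-search the smallest word count at which extra input is ignored.

    Returns None if no truncation is observed up to max_words.
    """
    if not is_truncated_at(embed, max_words):
        return None
    lo, hi = 1, max_words
    while lo < hi:
        mid = (lo + hi) // 2
        if is_truncated_at(embed, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

With FastEmbed, `embed` could be something like `lambda text: next(iter(embedder.embed([text])))`, assuming `embed()` yields one vector per input text. Note that the result is a word count, not a token count: special tokens count against the window and a word may split into several tokens, so the number found this way can be somewhat below the tokenizer's configured maximum length.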
What Python version are you on? e.g. python --version
python3.12
FastEmbed version
v0.6.0
What os are you seeing the problem on?
Linux
Relevant stack traces and/or logs