Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
Updated Nov 17, 2025 - Python
A ridiculously fast Python BPE (Byte Pair Encoding) implementation written in Rust
An educational Python project for learning tokenization step by step by building character-level, byte-level, and BPE tokenizers from scratch.
Implemented GPT from scratch
LLM inference engine built from scratch in C++. No PyTorch, no frameworks.
GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
R-BPE: Improving BPE-Tokenizers with Token Reuse
BPE tokenizer for LLMs in Pure Zig
Teaching transformer-based architectures
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
A custom byte-level BPE tokenizer, trained to try to replicate LLaMA 2's massive token vocabulary
High-Performance Tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports encoding/decoding with UTF-8.
🐍 A fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train a new tokenizer to build tree models for score prediction.
Experimental local transformer framework for training small language models, chat assistants, and code-focused AI systems from scratch. Built for learning, research, and rapid LLM experimentation.
Unofficial Rust implementation of "SuperBPE: Space Travel for Language Models"
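Several of the entries above describe byte-level BPE tokenizers that learn merges from raw text and encode/decode UTF-8. As a rough illustration of that idea, here is a minimal Python sketch. It is not taken from any of the listed repositories; the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) and the choice to assign new token ids starting at 256 are assumptions made for this example only.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # adjacent-pair frequencies
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so no merge is worth learning
            break
        merges.append(pair)
        ids = merge_pair(ids, pair, 256 + step)  # hypothetical id scheme
    return merges


def encode(text, merges):
    """Encode `text` by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge_pair(ids, pair, 256 + rank)
    return ids


def decode(ids, merges):
    """Expand token ids back to bytes, then decode the bytes as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        table[256 + rank] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")
```

Because the base vocabulary is raw bytes, non-Latin text such as አማርኛ round-trips without any special handling. Production tokenizers typically add pre-tokenization, special tokens, and much faster pair counting, none of which this sketch attempts.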