Tmob edited this page Jan 28, 2026 · 2 revisions

Detector Training Guide

Kiri OCR supports training custom text detectors using either CRAFT (Character Region Awareness for Text Detection) or DB (Differentiable Binarization). While the pre-trained detector works well for general documents, training a custom detector is recommended for specific layouts, novel fonts, or challenging backgrounds.

1. Generate Training Data

Training a detector requires a dataset of images with ground truth bounding boxes. Kiri OCR provides a synthetic data generator to create this data from text files.

Prerequisites

  • Text Corpus: A data.txt file containing sample text lines (one per line).
  • Fonts: A directory of .ttf or .otf font files.
  • Backgrounds (Optional): A directory of background images to paste text onto.
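
The generator only assumes one sample per line in data.txt, so a raw corpus usually needs light cleanup first. A minimal sketch (the paths, function name, and minimum-length threshold are illustrative, not part of kiri-ocr):

```python
# Sketch: normalize a raw corpus into the one-line-per-sample format
# expected by --text-file. Strips whitespace and drops blank or
# very short lines. The min_chars threshold is an arbitrary choice.
from pathlib import Path

def build_corpus(src: str, dst: str, min_chars: int = 3) -> int:
    """Write cleaned lines from src to dst; return how many were kept."""
    lines = [
        line.strip()
        for line in Path(src).read_text(encoding="utf-8").splitlines()
    ]
    kept = [line for line in lines if len(line) >= min_chars]
    Path(dst).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)
```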

Command

kiri-ocr generate-detector \
    --text-file data.txt \
    --fonts-dir fonts/ \
    --output my_detector_data \
    --num-train 1000 \
    --num-val 200 \
    --min-lines 5 \
    --max-lines 15

Parameters:

  • --text-file: Source text file (one sample line per line).
  • --fonts-dir: Directory of .ttf/.otf font files to render text with.
  • --output: Output directory for the generated dataset.
  • --num-train: Number of training images to generate.
  • --num-val: Number of validation images.
  • --min-lines / --max-lines: Number of text lines per image. Randomly chosen between min and max.
  • --image-height: Height of the generated images (default 512). Width is calculated to maintain aspect ratio.

Output Structure

The command creates a my_detector_data directory laid out in YOLO style (the format the DB trainer consumes internally):

my_detector_data/
├── images/
│   ├── train/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── val/
│       ├── img_00001.jpg
│       └── ...
├── labels/
│   ├── train/
│   │   ├── img_00001.txt
│   │   └── ...
│   └── val/
│       ├── img_00001.txt
│       └── ...
└── data.yaml  # Configuration file pointing to paths
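
In the standard YOLO convention, each label file holds one box per line as `class x_center y_center width height`, with all coordinates normalized to [0, 1]. Assuming the generator follows that convention (verify against your own labels/ files), a sketch for decoding a label file back into pixel-space boxes:

```python
# Sketch: decode YOLO-format labels into pixel-space (x1, y1, x2, y2)
# corner boxes. Assumes the standard "class xc yc w h" normalized
# layout; check one of the generated labels/*.txt files to confirm.

def yolo_to_pixels(label_text: str, img_w: int, img_h: int):
    """Convert YOLO label lines to (class, x1, y1, x2, y2) pixel boxes."""
    boxes = []
    for line in label_text.strip().splitlines():
        cls, xc, yc, w, h = line.split()
        xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
        x1 = (xc - w / 2) * img_w
        y1 = (yc - h / 2) * img_h
        x2 = (xc + w / 2) * img_w
        y2 = (yc + h / 2) * img_h
        boxes.append((int(cls), round(x1), round(y1), round(x2), round(y2)))
    return boxes
```

This is handy for spot-checking the generated data, e.g. overlaying decoded boxes on a few training images before committing to a long training run.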

2. Train the Detector

Once the data is generated, use the train-detector command. Currently, this trains a DBNet model.

kiri-ocr train-detector \
    --data-yaml my_detector_data/data.yaml \
    --epochs 100 \
    --batch-size 8 \
    --image-size 640 \
    --name my_custom_detector

Parameters:

  • --data-yaml: Path to the data.yaml generated in step 1.
  • --epochs: Number of training epochs.
  • --batch-size: Batch size (reduce if you run out of GPU memory).
  • --image-size: Input image size for training. Must be a multiple of 32.
  • --model-size: Size of the backbone. Options: n (nano), s (small), m (medium), l (large). Default is n.
  • --name: Name of the training run. Checkpoints will be saved in runs/detect/{name}/.
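
Since --image-size must be a multiple of 32, it can help to snap an arbitrary target resolution to the nearest valid value before launching a run. A small helper (illustrative, not part of the CLI):

```python
# Sketch: round an arbitrary target size to the nearest multiple of
# the detector's stride (32), so it is valid for --image-size.

def snap_to_stride(size: int, stride: int = 32) -> int:
    """Round size to the nearest multiple of stride (minimum one stride)."""
    return max(stride, round(size / stride) * stride)
```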

3. Monitor Training

Training progress is printed to the console. You will see metrics for:

  • Box Loss: How accurately the bounding boxes are predicted.
  • Class Loss: Classification loss (usually 0 for text detection, since there is only one class).
  • DFL Loss: Distribution Focal Loss.
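
Per-batch loss values printed to the console can be noisy. One generic way to watch the trend (this is a plain smoothing trick applied after the fact, not something the trainer prints itself) is an exponential moving average:

```python
# Sketch: exponentially smooth a sequence of per-batch loss values so
# the overall trend is visible through batch-to-batch noise.

def ema(values, alpha: float = 0.1):
    """Return the exponential moving average of a sequence of losses."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed
```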

4. Use Your Custom Detector

After training, the best model weights are saved to runs/detect/my_custom_detector/weights/best.pt.

To use this model in your application:

from kiri_ocr import OCR

# Initialize OCR with your custom detector
ocr = OCR(
    det_model_path="runs/detect/my_custom_detector/weights/best.pt",
    det_method="db"
)

# Run prediction
results = ocr.process_document("test_image.jpg")

Tips for Better Detection

  1. Diverse Backgrounds: If your real-world data has complex backgrounds (receipts, street scenes), ensure your training data reflects this. You can modify the generator to use background images.
  2. Image Size: Use a larger --image-size (e.g., 1024 or 1280) if you need to detect very small text.
  3. Augmentation: The trainer applies standard augmentations (flip, scale, color jitter). Ensure these are appropriate for your text (e.g., vertical flip might not be good for text).
