An end-to-end computer-vision + OCR pipeline that classifies, deskews, cleans, and reads Arabic fields from real-world Tunisian official documents — the CIN national ID card (recto + verso) and the carte grise vehicle registration — then serializes the extracted fields to JSON and auto-fills a contract template.
The problem. Tunisian administrative documents carry their key information in Arabic — a cursive, right-to-left script that off-the-shelf OCR handles far worse than Latin text. In practice these documents arrive as phone photos: skewed, glare-affected, unevenly lit, low-resolution, and cluttered with printed labels next to the fields you actually want. Reading those fields reliably enough to populate a paper contract is impossible without a full image-preparation chain before any OCR runs.
The pipeline. This project is a 6-stage vision pipeline, each stage prototyped in its own notebook and then assembled end-to-end in integration_notebook.ipynb:
[1] CLASSIFY          [2] ORIENT / DESKEW         [3] LOCALIZE & CLEAN
CNN: CIN vs.     →    projection-profile     →    face anchor + contour crop
carte grise           score sweep                 + Haar + K-channel threshold
                                                  + glare inpaint + label mask
                                                            │
                                                            ▼
[6] CONTRACT AUTOFILL   ←   [5] SERIALIZE    ←    [4] ARABIC OCR + PARSE
Pillow draw + bidi          fields → JSON         OCR.space → ArabicOCR → EasyOCR
                                                  + reshape/bidi + regex fields
Two document types, six output fields. The pipeline targets CIN recto (cin, nom), CIN verso (profession, adresse), and carte grise (serie_type, dpmc), all of which are written into the final contract.
Each stage is justified by a concrete failure mode of the stage after it. The guiding principle is to front-load every transformation that OCR is sensitive to, so the OCR call sees the cleanest possible crop.
A small convolutional network decides carte d'identité vs. carte grise before any document-specific logic runs, because the two types need completely different localization (a CIN is anchored on the face; a carte grise is anchored on a red flag).
- Architecture (`data_augmentation.ipynb`): input `100×100×3` → `Conv2D(16, 3×3, same, ReLU)` → `MaxPool` → `Conv2D(32)` → `MaxPool` → `Conv2D(64)` → `MaxPool` → `Dropout(0.2)` → `Flatten` → `Dense(128, ReLU)` → `Dense(2)`. Compiled with Adam + BinaryCrossentropy, trained for 30 epochs on a `train_test_split` from scikit-learn.
- Why heavy augmentation: the dataset is small and locally collected, so the model is fronted by a `keras.Sequential` augmentation stack (~9 `RandomZoom` levels, ~9 `RandomContrast` levels, and `RandomFlip("horizontal_and_vertical")`) to manufacture invariance to the scale, contrast, and orientation noise of phone photos.
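To make the size of this classifier concrete, the following sketch traces tensor shapes and parameter counts through the layer list above in plain Python. It assumes that all three convolutions use 3×3 kernels with `same` padding (the source only states this for the first one) and that each `MaxPool` is the Keras default 2×2 pool with stride 2. The helper names are illustrative and do not come from the notebook.

```python
def conv_same(shape, filters):
    """A 'same'-padded, stride-1 convolution keeps height/width and sets the channel count."""
    h, w, _ = shape
    return (h, w, filters)


def max_pool_2x2(shape):
    """A 2x2 pool with stride 2 and no padding halves height/width (floor division)."""
    h, w, c = shape
    return (h // 2, w // 2, c)


def conv_params(in_channels, filters, kernel=3):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Fully connected weights plus one bias per output unit."""
    return in_units * out_units + out_units


def trace_classifier(input_shape=(100, 100, 3)):
    """Return (final feature-map shape, flattened size, total trainable parameters)."""
    shape = input_shape
    total = 0
    for filters in (16, 32, 64):  # the three Conv2D -> MaxPool blocks
        total += conv_params(shape[2], filters)
        shape = max_pool_2x2(conv_same(shape, filters))
    flat = shape[0] * shape[1] * shape[2]  # Dropout and Flatten add no parameters
    total += dense_params(flat, 128) + dense_params(128, 2)
    return shape, flat, total


final_shape, flat_units, n_params = trace_classifier()
print(final_shape, flat_units, n_params)
```

Under these assumptions, roughly 98% of the weights sit in the `Dense(128)` layer, which is one reason a small, locally collected dataset benefits from the augmentation stack.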
Arabic OCR is orientation-sensitive: a few degrees of skew collapses recognition of a cursive script. Two complementary methods are implemented:
- Projection-profile score sweep (`correct_orientation` in the integration notebook): Otsu-binarize, then rotate the image across −90°…+90° in 0.1° steps, scoring each angle by the summed squared difference between adjacent bins of the row-sum histogram (`Σ(hist[i+1]−hist[i])²`). The angle with the sharpest horizontal banding wins, because that is the angle at which text lines are horizontal.
- Geometry-based fallbacks: `Orientation.py` derives the skew from `minAreaRect` on the Otsu mask; `orientation_quelque_soit_angle.ipynb` estimates the dominant axis with PCA (`cv2.PCACompute2`) on the contour point cloud for arbitrary input angles; `rotation.py` is a brute-force search that rotates across angles and keeps the orientation that yields the longest OCR string (a recognition-driven tiebreaker).
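The projection-profile sweep can be sketched with NumPy alone. This is not the notebook's `correct_orientation`; it is a minimal illustration that assumes an already binarized image (text pixels non-zero) and uses a hypothetical nearest-neighbour `rotate_nn` helper in place of the rotation routine the project actually uses. The score is the `Σ(hist[i+1]−hist[i])²` quantity described above.

```python
import numpy as np


def rotate_nn(img, angle_deg):
    """Rotate a 2-D array about its centre (nearest neighbour, same canvas, zero fill)."""
    h, w = img.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    # Inverse mapping: for every output pixel, look up the source pixel it came from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out


def profile_score(binary):
    """Sum of squared differences between adjacent row sums; peaks when lines are horizontal."""
    hist = binary.sum(axis=1).astype(np.float64)
    return float(np.sum((hist[1:] - hist[:-1]) ** 2))


def estimate_correction(binary, angles):
    """Return the candidate rotation angle (degrees) whose row profile is sharpest."""
    scores = [profile_score(rotate_nn(binary, a)) for a in angles]
    return float(angles[int(np.argmax(scores))])
```

A coarse sweep such as `np.arange(-15.0, 16.0, 1.0)` is enough to see the behaviour on a synthetic striped image; the 0.1° step over ±90° described above follows the same pattern at a higher cost, since every candidate angle needs one full rotation.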
The OCR must see only the fields, not the surrounding card art and labels.
- CIN recto: a Haar cascade (`haarcascade`) detects the face; the field block is then cropped relative to the face box and the OCR text-overlay geometry (`maxTop`/`Left` of detected words), so the crop adapts to each photo instead of using fixed coordinates.
- Carte grise: a dual-range red HSV mask (Hue `0–10` and `160–179`) finds the small red flag; candidate contours are filtered by aspect ratio and size, and a K-means dominant-color check (`n_colors=7`, red channel ≥ 1.8× green) confirms the flag before using it as an anchor to crop the registration block.
- Glare / flash removal (`projet.py`): bright specular regions are masked with a `220` threshold and reconstructed with `cv2.inpaint` (`TELEA`); printed labels are inpainted out (`INPAINT_NS`) after an EasyOCR pass locates them (`noNeeded_text_mask.ipynb`).
- Adaptive brightness: mean luminance is measured with `PIL.ImageStat` and the V-channel is boosted on a 16-step ladder, so dark photos are lifted without blowing out already-bright ones.
- K-channel binarization: fields are isolated via the printing-style K (key/black) channel (`1 − max(R,G,B)`) thresholded at ~140–150, which separates dark Arabic ink from a colored card background better than a plain grayscale threshold.
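As an illustration of the K-channel idea in the last bullet, the sketch below compares a K-channel threshold with a plain grayscale threshold on a synthetic card. It assumes 8-bit RGB input, so the channel is computed as `255 − max(R,G,B)`, and it assumes that pixels above the threshold are treated as ink; the source does not state the comparison direction. The function names and the grayscale cutoff of 110 are hypothetical.

```python
import numpy as np


def k_channel(rgb):
    """K (black) channel of an HxWx3 uint8 RGB image: 255 - max(R, G, B) per pixel."""
    return 255 - rgb.max(axis=2)


def ink_mask_from_k(rgb, threshold=145):
    """Assumed rule: a pixel is ink when its K value exceeds the threshold."""
    return k_channel(rgb) > threshold


def ink_mask_from_gray(rgb, threshold=110):
    """Baseline for comparison: a pixel is ink when its luma falls below the threshold."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])
    return gray < threshold


# Synthetic card: saturated red background with a small patch of dark ink.
card = np.empty((40, 60, 3), dtype=np.uint8)
card[:] = (200, 30, 30)
card[10:20, 15:45] = (20, 20, 20)

k_mask = ink_mask_from_k(card)
gray_mask = ink_mask_from_gray(card)
print(int(k_mask.sum()), int(gray_mask.sum()))
```

On this synthetic card the red background has a low luma but also a low K value, so the grayscale rule marks the whole card as ink while the K-channel rule keeps only the dark patch.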
OCR runs on the cleaned crop through a layered fallback chain, because no single Arabic OCR engine is reliable on official-document imagery:
- OCR.space API (`language=ara`, `isOverlayRequired=true`): primary; its `TextOverlay` line/word geometry is reused upstream to drive the crop.
- ArabicOCR (`ArabicOcr.arabicocr.arabic_ocr`): fallback when the API returns no `ParsedResults`.
- EasyOCR (`Reader(['ar','en'])`): used for the carte grise, where a regex date pattern extracts the `DPMC` and an alphanumeric-length filter (`>9` chars, mixed letters+digits) extracts the serial/type number.
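The fallback order and the carte grise parsing rules can be sketched without calling any real OCR engine. In the example below, `first_successful` takes engine callables supplied by the caller, so nothing about the OCR.space, ArabicOCR, or EasyOCR APIs is assumed. The date format (`dd/mm/yyyy` with `/`, `.` or `-` separators) and the restriction to ASCII letters and digits are assumptions made for illustration; the notebooks may use a different pattern.

```python
import re

# Assumed date layout for the DPMC field; the real pattern may differ.
DATE_RE = re.compile(r"\b\d{2}[/.-]\d{2}[/.-]\d{4}\b")
ALNUM_RE = re.compile(r"[A-Za-z0-9]+")


def first_successful(engines, image):
    """Try (name, callable) pairs in order; return the first non-empty text."""
    for name, run in engines:
        try:
            text = run(image)
        except Exception:
            continue  # a failing engine simply hands over to the next one
        if text and text.strip():
            return name, text
    return None, ""


def looks_like_serial(token):
    """More than 9 characters, alphanumeric only, containing both letters and digits."""
    return (
        len(token) > 9
        and ALNUM_RE.fullmatch(token) is not None
        and re.search(r"[A-Za-z]", token) is not None
        and re.search(r"\d", token) is not None
    )


def parse_carte_grise(tokens):
    """Pick the first date-like token as dpmc and the first serial-like token as serie_type."""
    fields = {"dpmc": None, "serie_type": None}
    for raw in tokens:
        token = raw.strip()
        if fields["dpmc"] is None:
            match = DATE_RE.search(token)
            if match:
                fields["dpmc"] = match.group(0)
        if fields["serie_type"] is None and looks_like_serial(token):
            fields["serie_type"] = token
    return fields
```

Passing the engines as plain callables keeps the chain testable with stubs, and it makes the order (API first, local engines after) a one-line configuration choice.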
Arabic text is then reshaped (`arabic_reshaper`) so letters take their correct contextual forms, and reordered with `python-bidi` (`get_display`) so the right-to-left string renders correctly when drawn left-to-right. `python-Levenshtein` supports fuzzy field matching/cleanup.
The six fields are assembled into a dict and written as UTF-8 JSON (`json.dumps(..., indent=4)`), which gives a clean, inspectable hand-off boundary between the vision pipeline and the document-generation step.
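A minimal sketch of this hand-off, using only the standard library, is shown below. The six key names come from the field list earlier in this README; the `ensure_ascii=False` flag is an assumption added here so that Arabic values stay readable in the file instead of being written as `\uXXXX` escapes. The function names are illustrative.

```python
import json

FIELD_NAMES = ("cin", "nom", "profession", "adresse", "serie_type", "dpmc")


def serialize_fields(fields):
    """Return the six extracted fields as an indented JSON string in a fixed key order."""
    missing = [name for name in FIELD_NAMES if name not in fields]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    ordered = {name: fields[name] for name in FIELD_NAMES}
    # ensure_ascii=False is an assumption: it keeps Arabic text human-readable.
    return json.dumps(ordered, ensure_ascii=False, indent=4)


def write_fields(fields, path):
    """Write the JSON to disk, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(serialize_fields(fields))
```

Failing early on a missing key keeps a partially extracted document from silently producing a contract with blank fields.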
The JSON is rendered onto a contract image template with Pillow (`ImageDraw.text` at fixed field coordinates, `arial.ttf`). Arabic values are passed through `bidi.get_display` again so they appear correctly on the rendered contract.
This is not "wrap Tesseract in a loop." The difficulty is structural:
- Arabic is the hard case for OCR. It is cursive (letters change shape by position), right-to-left, and diacritic-bearing. Mainstream OCR tooling is tuned for Latin scripts; Arabic accuracy degrades sharply, and correct display requires an explicit reshape + bidi step that Latin pipelines never need.
- Real official documents, not scans. Inputs are phone photos with glare, skew, motion blur, low resolution, and dense decorative backgrounds — exactly the conditions that off-the-shelf OCR fails on. Most of the engineering here is the preprocessing that makes the OCR call viable at all.
- Localization is document-specific and self-anchoring. Rather than fixed bounding boxes, the pipeline anchors on physical document landmarks (the CIN portrait via Haar cascade, the carte-grise red flag via HSV + dominant-color check) so it tolerates translation and scale variation across photos.
- Layered OCR with graceful degradation. Three engines in a fallback chain, with OCR geometry feeding back into the crop, is a deliberate response to the fact that no single Arabic OCR engine is dependable on this imagery.
Together this is a genuinely under-served niche: Tunisian-document Arabic field extraction with end-to-end contract generation, built from open CV/OCR components rather than a paid document-AI service.
| File | Role |
|---|---|
| `integration_notebook.ipynb` | End-to-end pipeline. Orientation → Haar face anchor → field localization → K-channel threshold → layered Arabic OCR for CIN recto/verso + carte grise → JSON serialization → contract auto-fill. |
| `data_augmentation.ipynb` | CNN document-type classifier. Loads `carte_identite` / `carte_grise` folders, builds the augmentation stack + 3-block CNN, trains (30 epochs) and predicts the card type. |
| `conour_extraction.ipynb` | Contour detection & extraction (best on dark backgrounds): thresholding, `findContours`, bounding-box cropping. |
| `noNeeded_text_mask.ipynb` | Locates printed labels with EasyOCR and inpaints them out, keeping only the useful fields after a fixed resize. |
| `zoom_in.ipynb` | Computes a zoom factor from the dominant contour and rescales to enlarge the region of interest. |
| `orientation_quelque_soit_angle.ipynb` | Orientation estimation for any input angle via PCA (`PCACompute2`) and `minAreaRect`. |
| `projet.py` | Standalone cleanup stage: glare inpaint + 16-step adaptive brightness + label masking. |
| `crop.py` | Carte-grise crop driven by the red-flag HSV mask + K-means dominant-color anchor. |
| `Orientation.py` | CLI skew correction via Otsu + `minAreaRect` (`--image` argument). |
| `rotation.py` | Brute-force rotation search keeping the angle that maximizes OCR text length. |
| `orientaation.png` | Reference illustration for the orientation step. |
- Language / environment: Python, Jupyter Notebook
- Computer vision: OpenCV (`opencv-python`), NumPy, SciPy (`scipy.ndimage`), `imutils`, Pillow (`PIL.ImageStat` / `ImageDraw`), Matplotlib
- Deep learning: TensorFlow / Keras (CNN + preprocessing-layer augmentation), scikit-learn (`train_test_split`)
- OCR (Arabic): EasyOCR, ArabicOCR, the OCR.space HTTP API
- Arabic text handling: `arabic-reshaper`, `python-bidi`, `python-Levenshtein`
- HTTP: `requests`
# 1. Clone
git clone https://github.com/MuhamedHabib/DataScienceProject.git
cd DataScienceProject
# 2. (Recommended) create a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
# 3. Install the libraries used across the notebooks
pip install opencv-python numpy scipy imutils pillow matplotlib \
tensorflow scikit-learn easyocr ArabicOcr \
arabic-reshaper python-bidi python-Levenshtein requests
# 4. Launch Jupyter
jupyter notebook
Open `integration_notebook.ipynb` to run the full pipeline, or run any single notebook to explore that stage in isolation.
# Standalone scripts run directly, e.g.:
python Orientation.py --image path/to/document.jpg
This is a working academic / research prototype (ESPRIT PI project). It demonstrates a complete preprocessing-and-OCR pipeline, with honest boundaries:
- No committed evaluation set, and therefore no accuracy figures. The repository contains no benchmark, no labeled test split, and no reported field-level accuracy, and this README deliberately reports none. The classifier's `model.evaluate` runs against an in-notebook split, but those numbers are not persisted or committed. A real quantitative eval, meaning per-field OCR accuracy (e.g. CER / exact-match per field) on a held-out, hand-labeled set of CIN and carte-grise photos, is the most important missing piece. Without it, robustness claims are anecdotal.
- Hard-coded local paths. The notebooks reference absolute paths and named images (`ali1.jpg`, `verso.jpg`, `cg12.jpg`, `grise12.png`, etc.) from the author's machine. Update the paths and bring your own document images before running.
- Bring your own OCR.space key. The pipeline calls the OCR.space API; supply your own API key via an environment variable and never commit it. (Use a placeholder such as `OCR_SPACE_API_KEY` in code.)
- No requirements lockfile. The `pip install` above lists the libraries actually imported by the notebooks; pin versions to match your environment.
- Heuristic thresholds. Glare thresholds, the brightness ladder, the HSV red ranges, and the crop offsets are tuned to the sample images. A natural next step is to make localization and cleanup parameters data-driven (or replace the layered OCR + heuristics with a single fine-tuned Arabic document model) and to add an automated regression harness over a labeled image corpus.
- Language note. The notebooks were authored with French headings/comments; this README summarizes them in English. Class names like `carte_identite` (ID card) and `carte_grise` (vehicle registration) keep the original French naming.
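The per-field evaluation named in the first bullet could start from something as small as the sketch below. It is not part of the repository: it computes character error rate (edit distance divided by reference length) and exact match per field in pure Python, and the `score_fields` helper and its dict layout are hypothetical. A real harness would also need the hand-labeled image set that the project does not yet have.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (char_a != char_b),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def score_fields(truth, predicted):
    """Per-field CER and exact match; a field missing from the prediction counts as empty."""
    report = {}
    for name, reference in truth.items():
        hypothesis = predicted.get(name, "")
        report[name] = {
            "cer": cer(reference, hypothesis),
            "exact": hypothesis == reference,
        }
    return report
```

Reporting both numbers per field is useful because an ID number is only usable on exact match, while a long address can still be reviewed quickly at a low CER.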
Built by Mohamed Habib Khattat — GitHub (@MuhamedHabib) · LinkedIn