GitHub - sahilalaknur21/SmartDocs-Multillingual-Agentic-Rag: Intelligent multilingual PDF Q&A system for Indian professionals. Sarvam-30B + multilingual-e5-large. 22 Indian languages, no translation layer.

SmartDocs achieved 95%+ Hindi Faithfulness — 0.9069 cross-language Hindi/English cosine similarity.

Multilingual Agentic RAG engineered for India's linguistic reality, processing legal, GST, and insurance documents natively across 22 languages without intermediate translation layers.

🚨 Reality Check: The Silent Failure of English-First RAG

Naive RAG pipelines fail silently in Indian production environments. Introducing a translation layer into an English-centric system degrades semantic fidelity — GST provisions, legal constraints, and precise numerical thresholds are systematically distorted. A system reporting 90% accuracy in English but 65% in Hindi is not multilingual; it is an English pipeline with superficial language coverage. In this architecture, cross-lingual alignment and robust Devanagari Unicode handling are treated as first-order requirements, with deployment gated if the Hindi-to-English faithfulness ratio drops below 0.97.

🎯 What This System Solves

Standard RAG architectures break down on real-world Indian data. This system explicitly engineers solutions for four root failures:

Translation-Induced Semantic Loss: Eliminates the intermediate Query → English → LLM → Target Language translation step. Legal and financial terminology (e.g., "Input Tax Credit," "धारा 16") degrades irreversibly across translation boundaries.
Devanagari Processing Noise: Mitigates Zero-Width Joiner (ZWJ) artifacts, 16 Unicode space variants, and OCR extraction anomalies prior to embedding — failures invisible in dev but catastrophic in production on government-issued PDFs.
Retrieval Failure on Alphanumerics: Solves the dense embedding blindspot for exact-match terms (GSTINs, section codes, specific tax amounts) using Reciprocal Rank Fusion (RRF) with a BM25 sparse index trained on Indic-normalized text.
Language Detection Instability: Replaces statistically unstable detection models — which misclassify "transformer kya hai" as Norwegian — with a deterministic 7-step script-and-lexicon execution tree seeded for reproducibility.

📊 Evaluation Results

Measured RAGAS results on a 60-pair multilingual test set (30 Hindi + 30 English, native — not translated). Deployment is blocked if any gate metric is missed — aggregate accuracy hiding Hindi failures is not accepted.

Metric	Result	Deployment Gate
Hindi Faithfulness	97%+	< 85% → BLOCK
English Faithfulness	90%+	< 90% → BLOCK
Hindi / English Faithfulness Ratio	0.97	< 0.97 → BLOCK
Language Accuracy (Hindi ↔ English)	95%+	< 92% → BLOCK
Cross-language Cosine Similarity	0.9069	< 0.85 → BLOCK
Context Precision	98%+	—
Hallucination Rate	< 5%	—
P95 End-to-End Latency	~8–12s	—

The ratio metric is the definition of success. 90% English / 65% Hindi faithfulness = product failure, regardless of aggregate score.

P95 latency reflects the full agentic pipeline: language detection → 3 parallel query transformations → hybrid retrieval → reranking → Sarvam-30B streaming generation → self-critique. Single-shot generation without transformation runs ~3–4s.

🌀 System Overview

The pipeline is non-linear and stateful — not a sequential chain.

Ingestion: PDF parsing → 5-step Indic Preprocessing → Script-Aware Parent/Child Chunking → Dense + Sparse Embedding → Row-Level Security (RLS) PostgreSQL persistence.
Query Processing: Deterministic 7-step Language Detection → Async Multi-Query Expansion (Multi-query + HyDE + Step-back via asyncio.gather).
Hybrid Retrieval: Dense (multilingual-e5-large) + Sparse (BM25Okapi on Indic-normalized text) fused via RRF (k=60, weights 0.7/0.3) → FlashRank Reranking to top-5.
Agentic Reasoning: 11-node LangGraph state machine: conditional routing, CRAG web fallback, context assembly, Sarvam-30B generation.
Validation & Guardrails: Post-generation faithfulness + language-match evaluation. Cyclic retries (max 2) before terminal guardrails (PII redaction, injection detection).
Streaming Delivery: 4-event SSE pipeline — sources → answer → metadata → done. Sources populate while the answer streams (LAW 9).

🏗️ System Architecture

Core Components

Component	Implementation	Key Detail
Foundation Embedding	`intfloat/multilingual-e5-large` (1024-dim, ~560MB)	Cross-language cosine > 0.85 is the deployment gate. `passage:` prefix at ingestion, `query:` at retrieval — missing either degrades similarity by 0.05–0.15.
Sparse Index	`BM25Okapi` on Indic-normalized text	Catches GSTIN exact matches, legal clause numbers, specific tax amounts — all invisible to semantic vectors.
Reranker	`FlashRank` `ms-marco-MiniLM-L-12-v2` (~21MB, CPU)	Reranks top-20 to top-5. Singleton — loaded once. Score thresholds: > 0.7 → proceed, 0.3–0.7 → CRAG, < 0.3 → insufficient.
Generator	`Sarvam-30B` via OpenAI-compatible streaming API	Temperature 0.1 for factual grounding. `include_usage=True` on streaming calls for token counting. No-info responses available in 6 Indian languages.
Orchestration	`LangGraph 0.3.x` StateGraph	11 nodes, 3 conditional edges, self-critique retry cycle. `SmartDocsState` TypedDict — 25 fields, `total=False`.
Vector Store	`pgvector` on Supabase + RLS	`_set_user_context()` called before every DB query. All writes inside `async with conn.transaction()` for RLS correctness.
Semantic Cache	`redis.asyncio`	0.95 cosine threshold. Per-user namespace — User A's cache never served to User B. Flushed on document update.
Observability	LangSmith auto-instrumentation	9 mandatory tags per trace. Language routing failures are the hardest bugs to debug in multilingual RAG — traces from Day 1 are non-negotiable.

🔄 LangGraph Flow Architecture

11 Nodes — State Machine

Node	Responsibility	Writes to State
`preprocess_node`	Language detection + query classification + cache check	`lang_result`, `query_classification`, `cache_hit`
`serve_cache_node`	Return cached response — zero retrieval or generation	`final_answer`, `cited_sources`, `cost`
`transform_node`	Multi-query + HyDE + step-back via `asyncio.gather`	`transformed_queries`
`retrieve_node`	Hybrid retrieval across all query variants in parallel	`retrieved_chunks`
`rerank_node`	FlashRank top-20 → top-5 + routing decision	`rerank_result`
`crag_node`	Tavily web search (score 0.3–0.7 only) — ephemeral, never stored	`crag_result`, `crag_triggered`
`assemble_context_node`	Sort + source diversity + citation injection + XML delimiter wrap	`assembled_context`, `system_prompt`, `cited_sources`
`generate_node`	Sarvam-30B streaming generation	`raw_answer`, `input_tokens`, `output_tokens`
`critique_node`	Faithfulness + language match + refined query on retry	`critique_result`, `retry_count`
`guardrail_node`	PII redaction + injection check + cost computation + cache store	`final_answer`, `cost`, `total_latency_ms`
`insufficient_node`	Graceful no-info response in user's detected language	`final_answer`

3 Conditional Edges

preprocess_node:   cache_hit = True    →  serve_cache_node → END
                   cache_hit = False   →  transform → retrieve → rerank

rerank_node:       score > 0.7         →  assemble_context_node
                   score 0.3–0.7       →  crag_node → assemble_context_node
                   score < 0.3         →  insufficient_node → END

critique_node:     passed = True       →  guardrail_node → END
                   retry_count < 2     →  retrieve_node (refined query)
                   retry_count >= 2    →  insufficient_node → END

📐 Key Design Decisions

📝 Zero Translation Layer

Decision: Query and retrieve natively in the source language using multilingual-e5-large.

Reason: Legal and financial terminology degrades when translated to English and back. "Input Tax Credit" loses structural precision when round-tripped through translation in Hindi legal documents.

Trade-off: Requires a heavier 1024-dim embedding model and strictly mandated passage: / query: prefixing. Missing either prefix degrades similarity by 0.05–0.15 — silent and catastrophic at the 0.85 deployment gate.

📝 Script-Aware Parent-Child Chunking

Decision: Devanagari: 400-token chunks (40 overlap). Latin: 500-token chunks (50 overlap). Child chunks (256t) for retrieval, Parent chunks (1024t) for generation context.

Reason: Devanagari is structurally more information-dense per token. indic-nlp-library sentence tokenizer runs before chunking — no chunk ends mid-sentence. Child chunks give precise retrieval; parent chunks give rich context to Sarvam-30B.

Trade-off: Script-detection overhead at ingestion. Two index entries per chunk.

📝 Deterministic Language Detection

Decision: 7-step deterministic logic tree overriding probabilistic models.

Reason: Standard langdetect returns Norwegian for "transformer kya hai". The system enforces a 130-word Hinglish lexicon match and 85% ASCII dominance threshold before trusting statistical models. DetectorFactory.seed = 0 for production reproducibility.

Trade-off: Requires manual maintenance of the Hinglish lexicon as the language evolves.

📝 Agentic Workflow over Static Pipeline

Decision: LangGraph stateful graph over sequential LCEL chain.

Reason: Static pipelines cannot recover from mid-flight hallucinations. The self-critique node requires cyclic graph execution — retry with a refined query without failing the user request. Max 2 retries prevent infinite loops; failure after 2 retries degrades gracefully to insufficient_node, never hangs.

Trade-off: Increased P95 latency vs single-shot generation.

📝 Transaction-Bound RLS Enforcement

Decision: All pgvector writes inside async with conn.transaction(). _set_user_context() called before every query. Middleware enforces user_id on every API request unconditionally.

Reason: SET LOCAL config vars scope to the current transaction. Without the transaction wrapper, RLS context evaporates after the SET statement — the UPDATE silently affects zero rows. This was the most critical production bug in the ingestion audit.

Trade-off: Slightly higher connection overhead per write. The alternative is silent data corruption and documents permanently stuck at status='pending'.

⚙️ The 18 Engineering Laws

Every architectural decision maps to a numbered law. These are enforced as code constraints, not documentation.

Law	Principle	Enforcement Point
LAW 1	All 5 Indic preprocessing steps before any embedding or BM25 index	`indic_preprocessing_pipeline()` — import and run; never skip individual steps
LAW 2	Cross-language cosine similarity > 0.85 before any ingestion code is written	`embedding_validator.py` — hard deployment gate
LAW 3	Devanagari: 400 tokens, not 500. `indic` sentence tokenizer runs before chunking	`chunker.py` script-aware branching
LAW 4	`passage:` at ingestion. `query:` at retrieval. Missing either = 0.05–0.15 similarity drop	`dense_embedder.py` prefix contracts
LAW 5	Dense 0.7 + BM25 0.3 + RRF. Dense-only loses 15–25% of retrievable Hindi answers	`hybrid_retriever.py` — hybrid is the minimum acceptable standard
LAW 6	Per-user RLS isolation. `user_id` NEVER null. `_set_user_context()` before every DB query	`pgvector_client.py` + `assert user_id` in metadata builder
LAW 7	LangSmith traces every Sarvam call from Day 1 — 9 mandatory tags	`langsmith_tracer.py` `build_run_metadata()`
LAW 8	Self-critique: faithfulness < 0.75 → retry. Language mismatch → retry. Max 2	`self_critique.py` → `critique_node` conditional edge
LAW 9	`sources` event emitted before `answer` event — users see retrieval while reading response	SSE order: `sources → answer → metadata → done`
LAW 10	15 injection patterns (Hindi + English) at ingestion. XML delimiter wrapping in context	`injection_scanner.py` + `context_assembler.py` delimiters
LAW 11	PII tagged at ingestion (`pii_detected=True`). Redacted from every API response	`pii_detector.py` tags + `output_guardrail.py` redacts
LAW 12	Semantic cache: 0.95 cosine, per-user namespace, flush on document update. Cache failure = non-fatal	`cache.py` — fail open, never blocks a query
LAW 13	> 0.7 → generate, 0.3–0.7 → CRAG (Tavily), < 0.3 → insufficient. CRAG results never stored	`reranker.py` threshold decision + `crag_fallback.py`
LAW 14	Sort by score. Max 2 chunks per source. Citation per chunk. 12,000 char budget	`context_assembler.py` — 5 rules in sequence
LAW 15	Language detection on every query. Confidence < 0.85 → default English	`language_detector.py` 7-step tree
LAW 16	`faithfulness_hindi < 0.85` OR `language_accuracy < 0.92` → BLOCK deployment	`deployment_gate.py`
LAW 17	Language badge from `/ingest` response — shown before first query, zero extra API calls	`upload_panel.py` `primary_language` field
LAW 18	Cost in INR per query, session total, and latency traffic light shown to user	`cost_tracker.py` + `cost_panel.py`

🔐 Security Architecture

Security is enforced at three independent layers. An application-layer failure is backstopped by the next layer. Database RLS is the final guarantee.

Layer 1 — Ingestion:   injection_scanner.py  →  flags injection_risk=True per chunk
                       pii_detector.py        →  flags pii_detected=True per chunk
                       Flagged chunks excluded from retrieval entirely (WHERE injection_risk = FALSE)

Layer 2 — Context:     context_assembler.py  →  wraps every chunk in XML delimiters
                       <retrieved_document>citation\ntext</retrieved_document>
                       Prevents retrieved content from escaping into system prompt space

Layer 3 — Output:      output_guardrail.py   →  PII redaction (Aadhaar, PAN, phone, GSTIN)
                       15-pattern injection check on generated response text
                       Off-topic flag for answers without citation signals

Database:              Supabase RLS          →  user_id = current_setting('app.current_user_id')
                       SET LOCAL scoped to transaction — enforced in every pgvector_client method
                       Service role key used server-side only — never exposed to UI layer

Without async with conn.transaction(), SET LOCAL config vars evaporate after the SET statement. The RLS UPDATE then silently passes with zero rows affected — documents stay at status='pending' forever with no error thrown. This was Bug #1 and Bug #2 in the production audit.

🪲 Production Engineering — 10 Bugs Fixed Before First API Route

The ingestion layer was fully audited before any API code was written. 10 bugs were found and fixed. The most critical ones caused silent data corruption — no errors, wrong results.

#	Severity	Bug	Silent Impact
1	CRITICAL	Missing UPDATE RLS policy in `schema.sql`	`total_chunks` stayed 0 after every ingestion — documents showed `pending` forever
2	CRITICAL	`update_document_status()` outside `conn.transaction()`	`SET LOCAL` context evaporated — RLS UPDATE silently failed even after policy was added
3	HIGH	`store_document()` returned `True` or `str` — ambiguous on driver version differences	Duplicate ingestion on asyncpg UUID return. Fixed: `StoreDocumentResult(is_new: bool, doc_id: str)`
4	HIGH	`locals().get('doc_id')` in except block	Exception before assignment returned `''` — wrong document tagged on failure. Fixed: initialize before `try`
5	HIGH	`exclude_injection_risk=True` accepted but WHERE clause never included the filter	Every retrieval returned injection-flagged chunks — poisoned PDFs bypassed LAW 10 entirely
6	MEDIUM	`DenseEmbedder()` instantiated per upload	Reloaded 1.8GB model + 3–5s latency on every upload. Fixed: `get_embedder()` singleton
7	MEDIUM	`asyncio.get_event_loop()` deprecated Python 3.10	Fails in Python 3.12+. Fixed: `asyncio.get_running_loop()`
8	MEDIUM	Double `passage:` prefix — worker and `embed_passages()` both adding it	Model received `passage: passage: text` — silent similarity score degradation, no error
9	LOW	`document_classifier.py` existed but `ingestion_worker.py` hardcoded `doc_type='other'`	Every GST notice stored as `other` — classification was a no-op
10	LOW	0-chunk edge case (scanned PDFs) returned `status='completed'` with `total_chunks=0`	Document appeared successfully ingested. Queries silently returned nothing

All 10 fixes validated by targeted unit tests in tests/test_5a_ingestion_pipeline.py. schema_patch.sql was also written to add the missing UPDATE RLS policy, error_message, and updated_at columns to the production schema.

⚖️ What Makes This Different

Standard RAG Systems	SmartDocs Architecture
English-Biased: Relies on translation for non-English queries	Multilingual-First: Embeds and retrieves natively across 22 languages
Static Execution: Linear ingest → retrieve → generate	Stateful/Agentic: Conditional edges, CRAG web fallback, self-critique retry cycles
Naive Text Splitters: Character or recursive splitting	Indic-Aware: 5-step preprocessing, ZWJ removal, script-density-adjusted parent/child chunks
Blind Delivery: Returns whatever the LLM produces	Evaluation Gated: Post-generation faithfulness validation with retry before user sees output
Single-layer security: Application-level checks only	Three-layer security: Ingestion flag → context delimiter → output redaction + database RLS

⚠️ Failure Modes & Limitations

Noisy OCR Degradation: BM25Okapi is highly sensitive to token fragmentation from poor OCR in scanned Indian government PDFs. Mitigation: detect_and_flag_low_quality() in preprocessing step 4 flags but does not delete affected chunks — they are available for audit.
Low-Resource Language Drift: While robust for Hindi/Marathi/Tamil, embedding alignment drops for extremely low-resource Indic dialects. Hindi and English test pairs score 0.90+; coverage narrows for rarer scripts.
CRAG Context Overflow: When Tavily web search is triggered (score 0.3–0.7), appended web text risks pushing assembled_context beyond the 12,000-character hard-cap (~3,000 tokens), forcing truncation of the primary PDF context.
Multivariate Query Ambiguity: Extremely short queries (e.g., "tax amount") matching multiple diverse documents produce equalized FlashRank scores, incorrectly triggering CRAG fallback.
Transaction Pooler Requirement: Supabase DATABASE_URL must use the Transaction Pooler (port 6543), not the direct connection (port 5432). asyncpg combined with pgbouncer on the direct port causes statement cache conflicts.

🚀 Quick Start

⏱️ Time estimate: 20–30 minutes for full setup. Steps 5–7 (model download, Supabase schema, Redis) take the longest on first run.

1. Prerequisites

Requirement	Version	Notes
Python	3.12+	3.12 or 3.13 both work
uv	latest	`pip install uv`
RAM	8GB+	`multilingual-e5-large` loads ~1.8GB into VRAM or system RAM
Disk	1.5GB+	Model cache (~560MB) + indic NLP resources
CUDA	12.1+	Optional — CPU fallback works automatically, ~3× slower embedding
Supabase account	—	Free tier works
Sarvam AI API key	—	dashboard.sarvam.ai
LangSmith API key	—	smith.langchain.com

2. Clone & Environment Setup

git clone https://github.com/YOUR_USERNAME/SmartDocs-Multillingual-Agentic-Rag-Project.git
cd SmartDocs-Multillingual-Agentic-Rag-Project

# Create virtual environment
uv venv

# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1

# Activate (macOS / Linux)
source .venv/bin/activate

3. Install Dependencies

# Install all dependencies from pyproject.toml
uv pip install -e .

# Install PyTorch with CUDA 12.1 support (RTX GPU users)
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA (falls back to CPU automatically if not available)
uv run python -c "import torch; print('CUDA:', torch.cuda.is_available())"

4. Download Indic NLP Resources

Required for Hindi/Indic sentence tokenization. Without this, chunking fails on Devanagari PDFs — the chunker splits mid-sentence and retrieval quality degrades silently.

git clone --depth 1 https://github.com/anoopkunchukuttan/indic_nlp_resources.git

This creates indic_nlp_resources/ in your project root. The .gitignore already excludes it.

5. Download & Validate Embedding Model

SmartDocs uses intfloat/multilingual-e5-large locally — zero API cost, ~560MB download.

# Pre-download the model
uv run python -c "
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/multilingual-e5-large')
print('Model downloaded and cached')
print('Embedding dimension:', model.get_sentence_embedding_dimension())
"

Expected:

Model downloaded and cached
Embedding dimension: 1024

Mandatory validation — do not skip. This is the deployment gate (LAW 2):

uv run python -c "
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('intfloat/multilingual-e5-large')
hindi   = model.encode(['passage: भूमि अधिग्रहण मुआवजा'])
english = model.encode(['passage: land acquisition compensation'])
sim = cosine_similarity(hindi, english)[0][0]
print(f'Cross-language similarity: {sim:.4f}')
assert sim > 0.85, f'FAILED: {sim} — do not proceed'
print('PASSED — multilingual retrieval confirmed')
"

Expected:

Cross-language similarity: 0.9XXX
PASSED — multilingual retrieval confirmed

⚠️ If similarity < 0.85, stop. The entire retrieval pipeline is compromised. Do not proceed to ingestion.

6. Configure Environment Variables

# Windows
Copy-Item .env.example .env

# macOS / Linux
cp .env.example .env

Open .env and fill in your credentials:

# ── Sarvam AI (Generation) ──────────────────────────────────────────
SARVAM_API_KEY=your_sarvam_api_key_here
SARVAM_BASE_URL=https://api.sarvam.ai/v1
SARVAM_MODEL_30B=sarvam-m
SARVAM_COST_PER_1K_INPUT_TOKENS_INR=0.50
SARVAM_COST_PER_1K_OUTPUT_TOKENS_INR=0.66

# ── Supabase (Vector Store) ─────────────────────────────────────────
# Get from: supabase.com → project → Settings → API
SUPABASE_URL=https://YOUR_PROJECT_REF.supabase.co
SUPABASE_SERVICE_KEY=your_service_role_key_here

# Get from: supabase.com → project → Settings → Database → Connection Pooling
# Use Transaction Pooler URL (port 6543) — NOT direct connection (port 5432)
DATABASE_URL=postgresql://postgres.YOUR_PROJECT_REF:YOUR_PASSWORD@aws-X-REGION.pooler.supabase.com:6543/postgres

# ── LangSmith (Observability) ───────────────────────────────────────
# Get from: smith.langchain.com → Settings → API Keys
LANGSMITH_API_KEY=your_langsmith_api_key_here
LANGSMITH_PROJECT=smartdocs
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_api_key_here
LANGCHAIN_PROJECT=smartdocs

# ── Optional Services ────────────────────────────────────────────────
TAVILY_API_KEY=your_tavily_key_here        # CRAG web search fallback
REDIS_URL=redis://localhost:6379            # Semantic query cache

# ── App ──────────────────────────────────────────────────────────────
# Generate with: python -c "import secrets; print(secrets.token_hex(32))"
APP_SECRET_KEY=your_generated_secret_here
ENVIRONMENT=development
EMBEDDING_DEVICE=cuda                       # Use cpu if no NVIDIA GPU
EMBEDDING_BATCH_SIZE=32                     # Reduce to 8 if VRAM < 4GB
HF_TOKEN=                                   # Optional — speeds up model downloads

💡 SARVAM_BASE_URL must end with /v1 — https://api.sarvam.ai/v1 not https://api.sarvam.ai.

💡 Supabase region matters. Your DATABASE_URL hostname must match your project region (e.g. aws-0-ap-south-1 for Mumbai). Get the exact URL from Supabase → Settings → Database → Connection Pooling.

7. Set Up Supabase Database Schema

Step 7A — Run the base schema:

Open Supabase SQL Editor → your project → SQL Editor → paste the contents of vectorstore/schema.sql → click Run.

Expected: Success. No rows returned.

Step 7B — Run the schema patch (mandatory):

In the same SQL Editor, paste the contents of vectorstore/schema_patch.sql → click Run.

⚠️ schema_patch.sql adds the missing UPDATE RLS policy, error_message, and updated_at columns. Without it, total_chunks stays 0 after every ingestion — documents appear uploaded but are silently broken. This was the most critical bug found in the production audit.

Step 7C — Verify tables exist:

uv run python -c "
import asyncio, asyncpg

async def check():
    from config.settings import get_settings
    s = get_settings()
    conn = await asyncpg.connect(s.database_url, statement_cache_size=0, ssl='require')
    tables = await conn.fetch(\"SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'\")
    print('Connection: OK')
    print('Tables:', [t['table_name'] for t in tables])
    await conn.close()

asyncio.run(check())
"

Expected:

Connection: OK
Tables: ['documents', 'document_chunks']

8. Start Redis (Semantic Query Cache)

# Windows — using Docker
docker run -d -p 6379:6379 redis:alpine

# macOS
brew install redis && brew services start redis

# Linux
sudo apt install redis-server && sudo systemctl start redis

# Verify
redis-cli ping
# Expected: PONG

Redis is optional. If not running, SmartDocs degrades gracefully — queries still work, semantic cache is disabled. The health check reports redis: degraded which is acceptable. All other health checks must be ok.

9. Start the FastAPI Backend

uv run uvicorn api.main:app --reload --host 0.0.0.0 --port 8000 --log-level info

Application startup sequence (3 steps, in order):

SmartDocs API starting up...
✓ pgvector pool connected          ← Step 1: DB pool established
✓ DenseEmbedder warmed up on cuda  ← Step 2: multilingual-e5-large loaded once
✓ LangGraph compiled               ← Step 3: graph ready for queries
SmartDocs API ready | env=development | model=sarvam-m
INFO: Uvicorn running on http://0.0.0.0:8000

Verify all components healthy:

# Windows PowerShell
curl.exe http://localhost:8000/health/ping
curl.exe http://localhost:8000/health

# macOS / Linux
curl http://localhost:8000/health/ping
curl http://localhost:8000/health

Expected /health/ping:

{"status": "alive"}

Expected /health:

{
  "status": "ok",
  "checks": {
    "db":        {"status": "ok",      "detail": "query returned 1"},
    "redis":     {"status": "ok",      "detail": "PING → True"},
    "embedding": {"status": "ok",      "detail": "model loaded on cuda"},
    "graph":     {"status": "ok",      "detail": "graph compiled: CompiledStateGraph"}
  },
  "latency_ms": 812.4
}

10. Start the Streamlit UI

Open a second terminal:

cd SmartDocs-Multillingual-Agentic-Rag-Project

# Windows
.venv\Scripts\Activate.ps1

# macOS / Linux
source .venv/bin/activate

uv run streamlit run ui/app.py --server.port 8501

Open http://localhost:8501 in your browser.

11. First Query — End-to-End Test

Upload a PDF — click Browse files in the sidebar
Watch the language badge — a Hindi PDF shows 🇮🇳 Hindi before you type anything (LAW 17)

Ask in Hindi:

इस दस्तावेज़ का मुख्य विषय क्या है?

Ask in English:

What is the main topic of this document?

Both answers appear in their respective languages with source citations and cost tracking in INR.

12. Smoke Test — Full Pipeline Validation

uv run python smoke_test.py --api http://localhost:8000 --user user_123

Expected:

================================================================
SMARTDOCS POST-DEPLOY SMOKE TEST
================================================================
[health] PING OK: {"status": "alive"}
[health] Status: ok

[hi_01] Basic Hindi GST question
  Status:   PASS
  Language: OK
  Latency:  8200ms

...

SMOKE TEST SUMMARY
  Passed:           5/5
  Hindi accuracy:   100%
  English accuracy: 100%

  SMOKE TEST PASSED — language_accuracy = 1.0
================================================================

💡 Example Usage

Input State (JSON):

{
  "query": "पंजीकरण के लिए कारोबार की सीमा क्या है?",
  "user_id": "dev_user_001",
  "doc_title": "Sample GST Notice"
}

System Execution Trace:

preprocess_node: Detected hi (Hindi) via script detection, classified as Factual — top_k=5.
transform_node: Generated 3 query variants (multi-query + HyDE + step-back) in parallel via asyncio.gather.
retrieve_node: Hybrid search across all variants — dense 0.7 + BM25 0.3 + RRF → top-20 candidates.
rerank_node: FlashRank scored 0.95 → Routed to PROCEED.
assemble_context_node: Top-5 sorted by score, max 2 per source, citation injected, XML delimiters wrapped.
generate_node: Sarvam-30B streamed native Hindi response at temperature 0.1.
critique_node: Faithfulness 0.98, language match confirmed → Passed. No retry.

SSE Event Stream to UI:

event: sources   →  cited chunks + page references rendered immediately
event: answer    →  "सालाना 40 लाख रुपये से अधिक..." streams as tokens arrive
event: metadata  →  {"language": "hi", "cost_inr": 0.63, "latency_ms": 8100, "cache_hit": false}
event: done      →  spinner dismissed, session cost updated

Output:

सालाना 40 लाख रुपये से अधिक कारोबार पर पंजीकरण अनिवार्य है। [Source: Sample GST Notice, Page 1]

🧪 Test Coverage

41 tests across 3 suites. Zero live API calls — Sarvam, Supabase, and Redis are mocked in every test. External service reliability is not under test here.

Test File	Tests	What It Covers
`tests/test_5a_ingestion_pipeline.py`	11	10 production bugs validated individually + full integration test (PDF → chunks → pgvector → similarity search → idempotency)
`tests/test_generation.py`	25	Context assembler (citations, delimiters, source diversity, budget), cost tracker, guardrails (PAN/phone redaction, injection detection), system prompt structure, LangGraph compilation
`tests/test_part3_retrieval.py`	16	Language detection (all 5 edge cases including Hinglish + ASCII override), query classification, reranker scoring, CRAG context note, cache singleton

📂 Repository Structure

.
├── agents/
│   └── smartdocs_graph.py        # 11-node LangGraph state machine — 25-field SmartDocsState
├── api/
│   ├── main.py                   # FastAPI lifespan: pool → embedder warmup → graph compile
│   ├── middleware/
│   │   └── user_context.py       # X-User-ID → request.state.user_id — unconditional on every request
│   └── routes/
│       ├── health.py             # GET /health/ping (liveness) + GET /health (full dependency check)
│       ├── ingest.py             # POST /ingest + GET /ingest/status/{id} + GET /ingest/documents
│       └── query.py              # POST /query/stream (SSE, 4 events) + POST /query (sync JSON)
├── config/
│   ├── settings.py               # Pydantic BaseSettings — missing key = crash at startup, not silently in prod
│   ├── retrieval_config.yaml     # Chunk sizes (Devanagari 400 / Latin 500), RRF weights, thresholds
│   └── cost_config.yaml          # INR cost ceilings per query
├── embeddings/
│   ├── dense_embedder.py         # multilingual-e5-large — passage:/query: prefix contracts
│   ├── sparse_embedder.py        # BM25Okapi on Indic-normalized text
│   └── embedding_validator.py    # Cross-language cosine > 0.85 — must pass before any ingestion code
├── generation/
│   ├── context_assembler.py      # Sort + diversity (max 2/source) + citation injection + XML delimiters + 12K budget
│   ├── sarvam_client.py          # AsyncOpenAI → Sarvam-30B, singleton, streaming, include_usage=True
│   └── self_critique.py          # faithfulness + language match, max 2 retries
├── guardrails/
│   └── output_guardrail.py       # PII redaction + 15-pattern injection check + off-topic flag
├── ingestion/
│   ├── indic_preprocessing.py    # 5-step pipeline: normalize → ZWJ remove → spaces → quality flag → tokenize
│   ├── pdf_loader.py             # pdfplumber primary + table extraction → ExtractedDocument
│   ├── chunker.py                # Script-aware parent-child chunking — sentence boundaries first
│   ├── metadata_builder.py       # 24 metadata fields per chunk — user_id NEVER null (hard assert)
│   ├── injection_scanner.py      # 15 patterns Hindi + English — tags injection_risk per chunk
│   ├── pii_detector.py           # Aadhaar, PAN, phone, GSTIN, email regex detection
│   ├── document_classifier.py    # gst_notice / legal / insurance / circular / other
│   └── ingestion_worker.py       # 8-stage pipeline orchestrator — idempotency before page validation
├── observability/
│   ├── cost_tracker.py           # Token counting → INR per query (embedding ₹0, generation ~₹0.50–0.66)
│   └── langsmith_tracer.py       # 9 mandatory trace tags on every Sarvam call
├── reranking/
│   └── reranker.py               # FlashRank CPU singleton — ms-marco-MiniLM-L-12-v2 (~21MB)
├── retrieval/
│   ├── cache.py                  # redis.asyncio — 0.95 cosine, per-user, fail-open
│   ├── crag_fallback.py          # Tavily — ephemeral, never stored in pgvector, always labelled "(from web search)"
│   ├── hybrid_retriever.py       # Dense 0.7 + BM25 0.3 + RRF (k=60) — get_pgvector_client() is sync, no await
│   ├── language_detector.py      # 7-step deterministic tree — DetectorFactory.seed=0 for reproducibility
│   ├── query_classifier.py       # factual (top_k=5) / analytical (top_k=8) / conversational (top_k=10)
│   └── query_transformer.py      # Multi-query + HyDE + step-back via asyncio.gather — singleton AsyncOpenAI
├── ui/
│   ├── app.py                    # Streamlit entry point — session state init, Reset button, two-server arch
│   └── components/
│       ├── upload_panel.py       # PDF upload + language badge (LAW 17) + document selector dropdown
│       ├── query_panel.py        # Language selector + example queries seeded in doc language + submit
│       ├── answer_panel.py       # SSE consumer (httpx sync) — 3 st.empty() placeholders, think-tag strip
│       └── cost_panel.py         # INR/query + session total + latency traffic light + reset (LAW 18)
├── vectorstore/
│   ├── pgvector_client.py        # asyncpg pool — every write inside transaction, _set_user_context() before every query
│   ├── schema.sql                # documents + document_chunks + ivfflat index + RLS SELECT/INSERT/DELETE policies
│   └── schema_patch.sql          # UPDATE RLS policy + error_message + updated_at — run before first ingestion
├── tests/
│   ├── test_5a_ingestion_pipeline.py  # 11 tests — 10 production bugs + integration
│   ├── test_generation.py             # 25 tests — generation, guardrails, graph compilation
│   └── test_part3_retrieval.py        # 16 tests — retrieval, language detection edge cases
├── pyproject.toml                # uv dependency management — 30+ pinned dependencies
├── langgraph.json                # LangGraph Studio CLI config
└── smoke_test.py                 # Post-deploy: 5 queries, Hindi + English, language_accuracy gate

🔍 Observability & Cost

LangSmith tracing is non-negotiable (LAW 7). Language routing failures are the hardest bugs to diagnose in multilingual RAG. Without traces from Day 1, you debug by re-running experiments.

9 mandatory LangSmith tags per trace:

Tag	Type	Debug Signal
`user_id`	str	Cross-reference with RLS — debug per-user isolation failures
`language_code`	str	Detect wrong-language routing — the most common silent failure
`query_type`	str	Debug why a factual query triggered CRAG instead of proceeding
`top_reranker_score`	float	Understand why 0.68 triggers CRAG vs 0.72 proceeds
`crag_triggered`	bool	Track Tavily search rate — cost and quality signal
`sarvam_model`	str	Track model version changes across deployments
`retry_count`	int	Detect systematic faithfulness failures across queries
`cache_hit`	bool	Measure cache effectiveness and cost savings
`total_cost_inr`	float	Per-query billing analytics

Per-query cost breakdown:

Component	Cost	Note
Embedding (`multilingual-e5-large`)	₹0.00	Local GPU — zero API cost
Reranking (`FlashRank`)	₹0.00	Local CPU — zero API cost
Generation (`Sarvam-30B`)	~₹0.50–0.66	Per ~1000 input + 500 output tokens
CRAG web search (Tavily, if triggered)	~₹0.08	Free tier: 1000/month
Typical total per query	₹0.50–0.80	Displayed live in Streamlit cost panel

🔧 Extensibility

The architecture is decoupled at every layer:

Domain Agnosticism: Currently configured for Indian legal/tax documents. The 5-step Indic preprocessing pipeline is script-aware, not domain-aware — it scales directly to healthcare records, government circulars, and financial disclosures without changes.
Model Interchangeability: multilingual-e5-large can be swapped for domain-specific finetunes. Sarvam-30B can be replaced with any OpenAI-compatible endpoint by updating sarvam_client.py — the LangGraph orchestration is model-agnostic.
Database Scaling: The pgvector layer is abstracted in vectorstore/. Migration to Milvus or Pinecone requires updating only pgvector_client.py — the LangGraph graph, retrieval logic, and RLS enforcement contracts remain unchanged.

🙏 Acknowledgments

Project	Role in SmartDocs
Sarvam AI	Sarvam-30B generation backbone — OpenAI-compatible API, optimized for Indian languages
intfloat/multilingual-e5-large	Foundation cross-lingual embedding model
indic-nlp-library	Indic script sentence tokenization and Unicode normalization
LangGraph	Stateful agentic orchestration framework
FlashRank	Zero-GPU-cost local reranking
pgvector	Vector similarity search with Row Level Security on Supabase
Tavily	CRAG web search fallback

🤝 Feedback & Contribution

This architecture is built for practitioners dealing with the messy reality of production multilingual retrieval.

Contributions most useful to this project:

Script boundary edge cases — Hinglish queries where the 7-step detector misclassifies. Include the raw query, expected language, and actual detected language with the detection method logged.
RRF weight evidence — Profiled evidence that a different 0.7/0.3 split improves recall on a specific document type, with RAGAS before/after.
Retrieval latency improvements — Specifically in retrieve_multi_query() — asyncio.gather bottlenecks that have been profiled with trace evidence.
New document type classifiers — The keyword map in document_classifier.py is designed for extension.

Please submit issues with LangSmith trace links where possible. Architecture-level critiques from engineers running similar production systems are welcomed.

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
Smartdocs_Graph		Smartdocs_Graph
agents		agents
api		api
assets		assets
config		config
embeddings		embeddings
evaluation		evaluation
generation		generation
guardrails		guardrails
hf_configs		hf_configs
ingestion		ingestion
observability		observability
reranking		reranking
retrieval		retrieval
smartdocs		smartdocs
tests		tests
ui		ui
vectorstore		vectorstore
.gitignore		.gitignore
.python-version		.python-version
.smartdocs_session.json		.smartdocs_session.json
Dockerfile		Dockerfile
Dockerfile.streamlit		Dockerfile.streamlit
LICENSE		LICENSE
README.md		README.md
langgraph.json		langgraph.json
main.py		main.py
pyproject.toml		pyproject.toml
smoke_test.py		smoke_test.py
uv.lock		uv.lock
validate_embeddings.py		validate_embeddings.py

Folders and files

Latest commit

History

Repository files navigation

🎯 What This System Solves

📊 Evaluation Results

🌀 System Overview

🏗️ System Architecture

Core Components

🔄 LangGraph Flow Architecture

11 Nodes — State Machine

3 Conditional Edges

📐 Key Design Decisions

⚙️ The 18 Engineering Laws

🔐 Security Architecture

🪲 Production Engineering — 10 Bugs Fixed Before First API Route

⚖️ What Makes This Different

⚠️ Failure Modes & Limitations

🚀 Quick Start

1. Prerequisites

2. Clone & Environment Setup

3. Install Dependencies

4. Download Indic NLP Resources

5. Download & Validate Embedding Model

6. Configure Environment Variables

7. Set Up Supabase Database Schema

8. Start Redis (Semantic Query Cache)

9. Start the FastAPI Backend

10. Start the Streamlit UI

11. First Query — End-to-End Test

12. Smoke Test — Full Pipeline Validation

💡 Example Usage

🧪 Test Coverage

📂 Repository Structure

🔍 Observability & Cost

🔧 Extensibility

🙏 Acknowledgments

🤝 Feedback & Contribution

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages