Distill is a context intelligence layer for LLM agents. It gives agents persistent, deduplicated memory that survives across sessions, deduplicates semantically similar context chunks, compresses verbose content, and re-ranks for diversity. It also detects conflicting information, classifies sensitive content, and manages token-budgeted context windows. Total overhead is ~12ms. No LLM calls.
Fetching fewer results risks missing relevant information. The better approach is to over-fetch (retrieve 20-50 results) and then intelligently deduplicate. This casts a wide net for recall, then optimizes for precision and diversity.
No. Exact dedup is trivial (hash comparison). Distill does semantic dedup - it identifies chunks that convey the same information in different words. Two paragraphs explaining "how JWT auth works" with different wording will be clustered together, and only the best one is kept.
LLMs are non-deterministic. The same input can produce different compressed outputs across runs. Distill uses deterministic algorithms (cosine distance, agglomerative clustering, MMR) so the same input always produces the same output. It's also ~40x faster (~12ms vs ~500ms) and ~100x cheaper per call.
Persistent memory that accumulates knowledge across agent sessions. Store context once, recall it later by semantic similarity + recency. Memories are deduplicated on write, compressed over time through hierarchical decay (full text → summary → keywords → evicted), and automatically classified for sensitivity (PII, credentials, internal IPs). On store, conflicting memories (cosine distance 0.15–0.35) are flagged. On recall, results can be boosted by tags and task context. Enable with --memory on the api or mcp commands.
Token-budgeted context windows for long-running agent tasks. Push context incrementally as the agent works - Distill deduplicates entries, compresses aging ones, and evicts when the budget is exceeded. The preserve_recent setting keeps the N most recent entries at full fidelity. Enable with --session on the api or mcp commands.
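The eviction side of a token-budgeted window can be sketched in a few lines. This is a simplified illustration, not Distill's implementation: the function name `enforce_budget` and the `(text, token_count)` entry shape are hypothetical, and the compression of aging entries described above is left out.

```python
def enforce_budget(entries, max_tokens, preserve_recent=2):
    """Evict oldest entries until the total fits within max_tokens.

    entries is a list of (text, token_count) tuples ordered oldest to newest.
    The newest preserve_recent entries are never evicted (assumed semantics).
    """
    kept = list(entries)
    while sum(tokens for _, tokens in kept) > max_tokens and len(kept) > preserve_recent:
        kept.pop(0)  # drop the oldest evictable entry
    return kept


window = [("a", 50), ("b", 40), ("c", 30), ("d", 20)]
print(enforce_budget(window, max_tokens=60))
```

In this sketch the protected entries take priority over the budget, so a window can stay over budget if the most recent entries alone exceed it.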
Memory is cross-session: knowledge persists after a session ends and can be recalled in future sessions. Sessions are within-task: a bounded context window that tracks what the agent has seen during a single task, enforcing a token budget. Use memory for long-term knowledge, sessions for working context.
When storing a memory, Distill checks existing entries by cosine distance. Entries below 0.15 are duplicates (skipped). Entries between 0.15 and 0.35 are flagged as conflicts — semantically related but different enough to be contradictory. The conflicts are returned in the store response so the agent can decide which version to keep, or supersede the old one.
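The two thresholds can be read as a three-way decision on each store. The sketch below is an assumption-laden illustration: `classify_store` and its return shape are hypothetical, and how Distill treats the exact boundary values (0.15 and 0.35) is not stated above, so the comparisons here are a guess.

```python
import math

DUPLICATE_MAX = 0.15  # distance below this: treated as a duplicate
CONFLICT_MAX = 0.35   # distance from 0.15 up to this: flagged as a conflict


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_store(new_vec, existing):
    """Classify a new memory against existing (id, vector) pairs.

    Returns ("duplicate", id), ("conflict", [ids]) or ("new", []).
    """
    conflicts = []
    for mem_id, vec in existing:
        d = cosine_distance(new_vec, vec)
        if d < DUPLICATE_MAX:
            return "duplicate", mem_id  # skip the write entirely
        if d <= CONFLICT_MAX:
            conflicts.append(mem_id)    # related but different enough to flag
    return ("conflict", conflicts) if conflicts else ("new", [])
```

A caller would inspect the returned conflict IDs and decide whether to keep both entries or supersede the older one.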
Distill can automatically scan memory content for PII (emails, phone numbers, SSNs), credentials (API keys, tokens, passwords), and internal infrastructure (private IPs, internal domains). Enable with auto_classify: true on store. Recall results include max_sensitivity and a list of sensitive_chunks so agents can handle sensitive data appropriately.
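Rule-based sensitivity scanning of this kind is commonly built on pattern matching. The sketch below only illustrates the idea: the category labels, the regular expressions, and the `classify` function are hypothetical and much narrower than a production classifier would need to be.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "pii": [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email address
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US SSN layout
    ],
    "credential": [
        re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),   # API-key-like token
    ],
    "internal": [
        # Private IPv4 in 10.0.0.0/8 or 192.168.0.0/16 (172.16.0.0/12 omitted)
        re.compile(r"\b(?:10\.\d{1,3}|192\.168)\.\d{1,3}\.\d{1,3}\b"),
    ],
}


def classify(text):
    """Return the sorted list of categories whose patterns match the text."""
    return sorted(
        category
        for category, patterns in PATTERNS.items()
        if any(p.search(text) for p in patterns)
    )


print(classify("Contact alice@example.com, host 10.0.0.5"))
```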
Expire soft-deletes a memory — it stays in the database but is excluded from recall by default. Useful for marking outdated information without losing it. Supersede links an old memory to its replacement — the old entry is expired and tagged with the new entry's ID. This preserves the audit trail while ensuring only current information is recalled.
K-Means requires specifying K upfront and assumes spherical clusters. Agglomerative clustering adapts to the data - it stops merging when the distance between the closest clusters exceeds the threshold. If your 20 chunks have 8 natural groups, you get 8 clusters. If they have 15, you get 15. There is no cluster count to tune; the only parameter is the distance threshold.
Cosine distance of 0.15 means cosine similarity of 0.85. Two chunks with a cosine similarity of 0.85 or higher are considered "saying the same thing." For code, use 0.10 (stricter - code is more precise). For prose, use 0.20 (looser - natural language has more variation).
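The stopping rule is easiest to see in a small implementation. This is a minimal sketch over a precomputed distance matrix, not Distill's code: the function name is hypothetical, and average linkage is an assumption, since the linkage criterion is not stated above.

```python
def agglomerative(dist, threshold):
    """Merge clusters until the closest pair is farther apart than threshold.

    dist is a symmetric N x N matrix of pairwise distances.
    Average linkage is used here as an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # closest clusters are too far apart: stop merging
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Two tight pairs that are far from each other
dist = [
    [0.00, 0.05, 0.60, 0.60],
    [0.05, 0.00, 0.60, 0.60],
    [0.60, 0.60, 0.00, 0.10],
    [0.60, 0.60, 0.10, 0.00],
]
print(agglomerative(dist, threshold=0.15))
```

The number of clusters falls out of the data and the threshold: the same function returns four singleton clusters if the threshold is lowered below the smallest pairwise distance.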
MMR greedily selects chunks that balance relevance and diversity:
MMR(chunk) = λ × relevance - (1-λ) × max_similarity(chunk, already_selected)
- λ = 1.0 - pure relevance (top-K by score)
- λ = 0.5 - balanced (default)
- λ = 0.0 - pure diversity (maximize distance from selected chunks)
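The greedy loop behind the MMR formula can be sketched as follows. The function name `mmr_select` and the input shapes (a relevance list and a pairwise similarity matrix) are assumptions made for illustration, not Distill's API.

```python
def mmr_select(relevance, similarity, k, lam=0.5):
    """Greedy MMR: pick up to k item indices balancing relevance and diversity.

    relevance[i] is the query relevance of item i; similarity[i][j] is the
    pairwise similarity between items i and j.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:

        def score(i):
            # Penalty is the similarity to the closest already-selected item
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(relevance, similarity, k=2, lam=0.5))  # [0, 2]
print(mmr_select(relevance, similarity, k=2, lam=1.0))  # [0, 1]
```

With λ = 0.5 the second pick skips item 1, which is nearly identical to item 0, in favor of the less relevant but distinct item 2. With λ = 1.0 the selection reduces to plain top-K by relevance.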
Distance matrix computation is O(N² × D) where N = number of chunks and D = embedding dimension. The merge loop is O(N³) worst case. For typical RAG inputs (N=20-50, D=1536), the full pipeline completes in ~12ms. For larger inputs (N=1000+), the K-Means path with parallel workers is available.
Three rule-based strategies, chainable via a pipeline:
- Extractive - Scores sentences by position, length, and keyword signals. Keeps the top sentences within a token budget.
- Placeholder - Detects JSON, XML, and tables. Replaces them with structural summaries (e.g., [JSON object with 12 keys: id, name, ...]).
- Pruner - Removes filler phrases ("as mentioned earlier", "basically", "it is important to note that") and intensifiers.
No API calls needed.
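Two of the strategies above, Pruner and Placeholder, are simple enough to sketch with the standard library. The function names, the filler list, and the `max_keys` parameter are hypothetical; the output format follows the example summary quoted above.

```python
import json
import re

FILLERS = ["as mentioned earlier", "it is important to note that", "basically"]


def prune(text):
    """Remove filler phrases, then collapse any leftover runs of whitespace."""
    for phrase in FILLERS:
        # Also consume a trailing comma and spaces left behind by the phrase
        text = re.sub(re.escape(phrase) + r",?\s*", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()


def json_placeholder(raw, max_keys=2):
    """Replace a JSON object (not an array) with a one-line structural summary."""
    keys = list(json.loads(raw))
    shown = ", ".join(keys[:max_keys])
    suffix = ", ..." if len(keys) > max_keys else ""
    return f"[JSON object with {len(keys)} keys: {shown}{suffix}]"


print(prune("Basically, the token expires. As mentioned earlier, refresh it."))
print(json_placeholder('{"id": 1, "name": "a", "email": "x"}'))
```

Both functions are deterministic string transforms, which is what allows this kind of compression to run without any model call.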
Three integration paths, from simplest to deepest:
1. MCP (works today): Distill ships an MCP server (distill mcp). LangChain supports MCP via langchain-mcp-adapters. Distill's tools (deduplicate_chunks, retrieve_deduplicated, analyze_redundancy) become LangChain tools automatically.

    from langchain_mcp_adapters.client import MultiServerMCPClient
    from langchain.agents import create_agent

    # Launch `distill mcp` as a subprocess and talk to it over stdio
    client = MultiServerMCPClient({
        "distill": {
            "command": "distill",
            "args": ["mcp"],
            "transport": "stdio",
        }
    })

    # `await` requires an async context (an async function or a notebook cell)
    tools = await client.get_tools()
    agent = create_agent("openai:gpt-4.1", tools)

2. HTTP API (works today): Call POST /v1/dedupe as a post-processing step on retrieval results.

    import httpx

    def deduplicate(docs, threshold=0.15):
        """Drop semantically redundant documents via the /v1/dedupe endpoint."""
        chunks = [{"id": str(i), "text": doc.page_content} for i, doc in enumerate(docs)]
        resp = httpx.post("https://distill-api-4u92.onrender.com/v1/dedupe", json={
            "chunks": chunks, "threshold": threshold
        })
        resp.raise_for_status()  # Surface HTTP errors instead of parsing an error body
        kept = {c["id"] for c in resp.json()["chunks"]}
        return [doc for i, doc in enumerate(docs) if str(i) in kept]

    # `retriever` is any existing LangChain retriever
    raw_docs = retriever.invoke("query")  # Over-fetch 20 results
    clean_docs = deduplicate(raw_docs)    # -> ~8 unique results

3. Python SDK (planned - #5): A DistillRetriever that wraps any LangChain retriever with automatic dedup.
Yes. The HTTP API is framework-agnostic. MCP works with any MCP-compatible client. The planned Python SDK (#5) will include a LlamaIndex NodePostprocessor.
LangChain's search_type="mmr" applies MMR at the vector DB level - a single re-ranking step. Distill runs a multi-stage pipeline: cache lookup, agglomerative clustering (groups similar chunks), representative selection (picks the best from each group), compression (reduces token count), then MMR (diversity re-ranking). The clustering step is the key difference - it understands group structure, not just pairwise similarity.
The base MCP server exposes deduplicate_context and analyze_redundancy. With --memory, it adds store_memory, recall_memory, forget_memory, memory_expire, memory_supersede, memory_stats. With --session, it adds create_session, push_session, session_context, delete_session. Enable both with distill mcp --memory --session.
Yes. The dedup pipeline itself doesn't call any LLM - it's pure math (cosine distance, clustering). For embeddings, Distill supports OpenAI, Ollama, and Cohere via --embedding-provider:

    # Use Ollama locally (no API key needed)
    distill api --embedding-provider ollama --embedding-base-url http://localhost:11434

    # Use Cohere
    distill api --embedding-provider cohere

You can also send chunks with pre-computed embeddings to skip embedding generation entirely.
~12ms total for the pipeline: distance matrix ~2ms, clustering ~6ms, selection <1ms, MMR ~3ms. Embedding generation adds more if needed (depends on OpenAI API latency, typically 100-300ms for a batch). If embeddings are pre-computed, it's just the 12ms.
If chunks already have embeddings (from your vector DB): $0. If text-only chunks are sent, Distill uses text-embedding-3-small at $0.02 per 1M tokens. A typical 20-chunk request with ~100 tokens each = 2,000 tokens = $0.00004.
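The arithmetic generalizes to any request size. A small helper, using only the price quoted above (the function name is hypothetical):

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above


def embedding_cost(num_chunks, tokens_per_chunk):
    """Estimated USD cost of embedding text-only chunks."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


print(f"${embedding_cost(20, 100):.5f}")  # $0.00004
```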
The agglomerative clustering is O(N²) for the distance matrix. For N=50, this is trivial (~2ms). For N=1,000, it's still fast (~100ms). For N=10,000+, the K-Means path (pkg/dedup/) with parallel workers is available. A batch API is planned in #11.
If you send text-only chunks to the API, Distill generates embeddings on the fly using the configured provider (OpenAI by default, or Ollama/Cohere via --embedding-provider). If you send chunks with pre-computed embeddings (e.g., from your vector DB retrieval), no embedding call is needed.
Three options:

    # Binary
    distill api --port 8080

    # Docker
    docker run -p 8080:8080 -e OPENAI_API_KEY=xxx ghcr.io/siddhant-k-code/distill

    # Build from source
    go build -o distill . && ./distill api

Set DISTILL_API_KEYS with comma-separated API keys. Clients must include Authorization: Bearer <key> in requests.

    export DISTILL_API_KEYS="key1,key2,key3"
    distill api --port 8080

- Prometheus metrics at /metrics - request counts, latency histograms, chunk reduction ratios, cluster counts
- OpenTelemetry tracing - per-stage spans (embedding, clustering, selection, MMR) with W3C Trace Context propagation
- Grafana dashboard - pre-built template in grafana/
Larger context windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. This wastes tokens, increases latency, and can confuse the model. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.
Yes, MIT. The full pipeline, CLI, API server, MCP server, and all algorithms are open source. Free to use, modify, and distribute — including in commercial and closed-source products.
Shipped:
- Context Memory - persistent deduplicated memory with hierarchical decay (#29)
- Session Management - token-budgeted context windows with compression and eviction (#31)
- Memory Intelligence (v0.9.0) - conflict detection (#77), task-relevance ranking (#78), expiry/supersession (#79), sensitivity classification (#82)
- Multi-provider embeddings - OpenAI, Ollama, Cohere via --embedding-provider (#25, #33)
- OpenAPI spec & Swagger UI (v0.9.1) - interactive docs at /docs (#23)
- v2.0 Documentation - guides, API reference, examples (#8)
- Code Intelligence - dependency graphs, blast radius, semantic commit analysis (#30, #32)
- Batch API - async job queue with progress polling (#11)
Upcoming: