Case Study
Document Theme Identifier Chatbot
A full-stack AI research assistant that ingests 75+ documents at once, runs OCR + semantic indexing, and lets teams query their corpus with citation-backed responses. Perfect for audits, academic reviews, legal diligence, and enterprise knowledge search.
- Role
- Full-stack ML Engineer
- Timeline
- Jan–Apr 2025
- Stack highlights
- FastAPI, LangChain, pgvector
- Impact
- 4x faster audit synthesis
Why it mattered
Research teams were spending days skimming PDFs and screenshots just to build a first-pass summary. The goal was to centralize ingestion, automate OCR, and give analysts a conversational interface that cites every statement. That meant reliable processing at scale, zero hallucinations, and governance-grade traceability.
- Upload 75+ PDFs, scans, or mixed-format files in one batch
- Automatic OCR for image-based docs with quality heuristics
- Semantic clustering to surface recurring themes
- LLM answers with inline citations back to sources
- Audit-friendly chat history with exportable traces
Architecture snapshot
Documents flow through an ingestion service that shards workloads across Celery workers. Extracted text lands in S3 with metadata before embeddings are written to pgvector. A Next.js UI handles uploads, progress, and chat, while FastAPI exposes RAG endpoints with rate limiting, audit logging, and observability hooks; a minimal sketch of the worker wiring follows the stack list.
- Python
- FastAPI
- LangChain
- pgvector
- Celery
- Redis
- Tesseract OCR
- Next.js frontend
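To make the fan-out concrete, here is a minimal sketch of how the Celery wiring could look, assuming a Redis broker and illustrative queue names (`ocr`, `embed`) rather than the production configuration.

```python
# celery_app.py -- illustrative worker wiring, not the production configuration
from celery import Celery

app = Celery(
    "ingestion",
    broker="redis://localhost:6379/0",   # Redis broker, per the stack list
    backend="redis://localhost:6379/1",  # result backend used for upload progress
)

# Route CPU-heavy OCR and write-heavy embedding tasks to separate queues
# so each worker pool can be scaled independently.
app.conf.task_routes = {
    "tasks.ocr_document": {"queue": "ocr"},
    "tasks.embed_document": {"queue": "embed"},
}
app.conf.task_acks_late = True           # redeliver a task if a worker dies mid-document
app.conf.worker_prefetch_multiplier = 1  # keep long OCR jobs from hoarding the queue
```

Workers can then be scaled per queue, e.g. `celery -A celery_app worker -Q ocr`, so OCR throughput and embedding throughput are tuned independently.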
Citation-first UX
Every chat turn highlights the supporting snippets, page ranges, and filenames. Analysts can expand a citation to preview the original paragraph without leaving the thread, or export a report with footnotes—ideal for audits and research compliance.
Throughput: 75 docs / 8 min ingest window
Answer latency: ~1.7 s median
Citation coverage: 100% of responses include ≥2 references
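For illustration, the citation payload behind each chat turn can be modeled roughly like this; the field names are assumptions, not the exact production schema.

```python
# models.py -- illustrative shape of the citation payload attached to a chat turn
from pydantic import BaseModel


class Citation(BaseModel):
    doc_id: str        # stable identifier of the source document
    filename: str      # shown inline in the chat thread
    page_start: int    # first page of the cited range
    page_end: int      # last page of the cited range
    snippet: str       # supporting paragraph, expandable in place
    score: float       # retrieval relevance, kept for audit review


class ChatTurn(BaseModel):
    question: str
    answer: str
    citations: list[Citation]  # every answer ships with its references
```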
Workflow + guardrails
The RAG pipeline favors transparency: we log embeddings, prompt templates, and responses for replay. Automated evaluations run nightly with synthetic questions to ensure the LLM stays grounded in source material.
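As a rough sketch of what one of those nightly groundedness checks could look like, assuming a sentence-transformers scorer; the model choice, threshold, and helper name are illustrative, not our exact evaluator.

```python
# eval_groundedness.py -- simplified nightly check; model choice and threshold are illustrative
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def is_grounded(answer: str, retrieved_chunks: list[str], threshold: float = 0.6) -> bool:
    """Return False if any sentence of the answer has no sufficiently similar source chunk."""
    chunk_embs = model.encode(retrieved_chunks, convert_to_tensor=True)
    for sentence in (s.strip() for s in answer.split(".") if s.strip()):
        sent_emb = model.encode(sentence, convert_to_tensor=True)
        # Best cosine similarity between this sentence and any retrieved chunk.
        best = util.cos_sim(sent_emb, chunk_embs).max().item()
        if best < threshold:
            return False  # claim lacks a supporting chunk -> flag for review
    return True
```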
Ingestion + OCR
Handles PDF, scanned PDF, PNG, and TIFF files. A Celery worker runs Tesseract + layout repair, then stores clean text and metadata in S3.
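A simplified version of that worker, assuming pdf2image + pytesseract; the bucket name is a placeholder and the layout-repair step is omitted.

```python
# tasks.py -- illustrative OCR task; bucket name is a placeholder and layout repair is omitted
import json

import boto3
import pytesseract
from celery import shared_task
from pdf2image import convert_from_path

s3 = boto3.client("s3")


@shared_task(name="tasks.ocr_document")
def ocr_document(local_path: str, doc_id: str) -> str:
    # Rasterize each page of the scanned PDF, then run Tesseract per page.
    pages = convert_from_path(local_path, dpi=300)
    text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)

    # Land clean text plus minimal metadata in S3 for the embedding step.
    s3.put_object(Bucket="doc-chatbot-extracted", Key=f"{doc_id}/text.txt", Body=text.encode())
    s3.put_object(
        Bucket="doc-chatbot-extracted",
        Key=f"{doc_id}/meta.json",
        Body=json.dumps({"doc_id": doc_id, "pages": len(pages)}).encode(),
    )
    return doc_id
```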
Embedding + store
Chunks documents with adaptive windowing, generates transformer embeddings, and writes them to a Postgres + pgvector store for hybrid search.
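In outline, the chunk-and-store step might look like the following; a fixed sliding window stands in for the adaptive windowing, and the table schema, embedding model, and connection string are assumptions.

```python
# embed_store.py -- simplified chunk-and-store step; schema, model, and DSN are assumptions
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the production embedding model


def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Fixed sliding window; the real pipeline adapts the window to document layout."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def store(doc_id: str, text: str) -> None:
    conn = psycopg2.connect("dbname=docbot")  # illustrative DSN
    register_vector(conn)                     # lets psycopg2 adapt numpy arrays to vector columns
    with conn, conn.cursor() as cur:
        for idx, piece in enumerate(chunk(text)):
            cur.execute(
                "INSERT INTO chunks (doc_id, chunk_no, content, embedding) VALUES (%s, %s, %s, %s)",
                (doc_id, idx, piece, model.encode(piece)),
            )
    conn.close()
```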
Theme engine
DBSCAN and Top2Vec cluster embeddings to label emergent themes, feeding a summary graph that powers instant overviews.
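A pared-down version of the clustering step using scikit-learn's DBSCAN; the Top2Vec pass and summary graph are omitted, and `eps` / `min_samples` are illustrative.

```python
# themes.py -- pared-down theme grouping; eps and min_samples are illustrative
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_themes(embeddings: np.ndarray, eps: float = 0.35, min_samples: int = 5) -> dict[int, list[int]]:
    """Group chunk embeddings into themes; label -1 collects outliers/noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
    themes: dict[int, list[int]] = {}
    for idx, label in enumerate(labels):
        themes.setdefault(int(label), []).append(idx)
    return themes
```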
Chat orchestration
A FastAPI layer retrieves the most relevant passages, injects citation metadata, and calls the LLM with guardrails for source-grounded responses.
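Sketched as a FastAPI route, with `retrieve()` and `call_llm()` as hypothetical helpers and an illustrative relevance threshold for the guardrail:

```python
# chat_api.py -- sketch of the RAG endpoint; retrieve() and call_llm() are hypothetical helpers
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from rag import call_llm, retrieve  # hypothetical module wrapping retrieval and the LLM call

app = FastAPI()


class ChatRequest(BaseModel):
    question: str
    top_k: int = 6


@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # Hybrid retrieval over pgvector; each hit carries filename, page range, snippet, score.
    hits = retrieve(req.question, k=req.top_k)
    supported = [h for h in hits if h["score"] >= 0.45]  # illustrative relevance threshold

    # Guardrail: refuse rather than answer without supporting chunks.
    if len(supported) < 2:
        raise HTTPException(status_code=422, detail="Not enough supporting sources to answer.")

    context = "\n\n".join(
        f"[{h['filename']} p.{h['page_start']}-{h['page_end']}]\n{h['snippet']}" for h in supported
    )
    answer = call_llm(question=req.question, context=context)
    return {"answer": answer, "citations": supported}
```

Refusing when fewer than two chunks clear the threshold is the mechanism behind the zero-hallucination policy listed under outcomes.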
Outcomes
- 4x faster audit briefing because analysts receive citation-ready answers instantly.
- Zero-hallucination policy backed by guardrails that block responses lacking supporting chunks.
- Enterprise-ready governance with full chain-of-custody logs for every query.
Bring it to your corpus
I can tailor the ingestion adapters, embedding models, and prompt flows to your compliance requirements or private cloud. Let's explore data residency, on-prem vector stores, or custom evaluators together.