Anish Dahiya

Case Study

Document Theme Identifier Chatbot

A full-stack AI research assistant that ingests 75+ documents at once, runs OCR + semantic indexing, and lets teams query their corpus with citation-backed responses. Perfect for audits, academic reviews, legal diligence, and enterprise knowledge search.

Role
Full-stack ML Engineer
Timeline
Jan–Apr 2025
Stack highlights
FastAPI, LangChain, pgvector
Impact
4x faster audit synthesis

Why it mattered

Research teams were spending days skimming PDFs and screenshots just to build a first-pass summary. The goal was to centralize ingestion, automate OCR, and give analysts a conversational interface that cites every statement. That meant reliable processing at scale, zero hallucinations, and governance-grade traceability.

  • Upload 75+ PDFs, scans, or mixed-format files in one batch

  • Automatic OCR for image-based docs with quality heuristics

  • Semantic clustering to surface recurring themes

  • LLM answers with inline citations back to sources

  • Audit-friendly chat history with exportable traces

Architecture snapshot

Documents flow through an ingestion service that shards workloads across Celery workers. Extracted text lands in S3 with metadata before embeddings are written to pgvector. A Next.js UI controls uploads, progress, and chat, while FastAPI exposes RAG endpoints with rate limiting, audit logging, and observability hooks.
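
A minimal sketch of how that flow can be wired, assuming a Redis broker and stubbed stage bodies (the task names and payload shape are illustrative, not the production code); the real pipeline adds retries, rate limiting, and the audit logging mentioned above:

    # Illustrative pipeline wiring: each stage is a Celery task and documents
    # fan out across workers. Stage bodies are stubs; names are assumptions.
    from celery import Celery, chain

    app = Celery("pipeline", broker="redis://localhost:6379/0")

    @app.task
    def extract_text(doc_id: str) -> dict:
        # OCR / PDF text extraction, then upload clean text + metadata to S3.
        return {"doc_id": doc_id, "s3_key": f"extracted/{doc_id}.json"}

    @app.task
    def embed_chunks(payload: dict) -> dict:
        # Chunk the stored text and write embeddings to pgvector.
        return payload

    @app.task
    def index_themes(payload: dict) -> dict:
        # Refresh clustering so new documents appear in theme overviews.
        return payload

    def enqueue(doc_id: str):
        # One chain per uploaded document; Celery spreads the tasks across workers.
        return chain(extract_text.s(doc_id), embed_chunks.s(), index_themes.s()).apply_async()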

  • Python
  • FastAPI
  • LangChain
  • pgvector
  • Celery
  • Redis
  • Tesseract OCR
  • Next.js frontend

Citation-first UX

Every chat turn highlights the supporting snippets, page ranges, and filenames. Analysts can expand a citation to preview the original paragraph without leaving the thread, or export a report with footnotes—ideal for audits and research compliance.
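
One way to model that citation payload, sketched as a Pydantic schema with assumed field names (the real response shape may differ):

    # Illustrative citation schema: each answer carries the snippets that back it.
    from pydantic import BaseModel

    class Citation(BaseModel):
        filename: str          # source document shown in the chat turn
        page_start: int        # page range of the supporting passage
        page_end: int
        snippet: str           # paragraph previewed when the citation is expanded

    class ChatAnswer(BaseModel):
        answer: str
        citations: list[Citation]   # exported as footnotes in audit reports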

Throughput: 75 docs / 8 min ingest window

Answer latency: ~1.7 s median

Citation coverage: 100% of responses include ≥2 references

Workflow + guardrails

The RAG pipeline favors transparency: we log embeddings, prompt templates, and responses for replay. Automated evaluations run nightly with synthetic questions to ensure the LLM stays grounded in source material.
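
A hedged sketch of what a nightly groundedness check might look like, assuming a list of synthetic questions and an answer_with_citations callable standing in for the real RAG endpoint:

    # Illustrative nightly evaluation: ask synthetic questions and flag any
    # answer that comes back without supporting chunks. Names are assumptions.
    def run_nightly_eval(synthetic_questions, answer_with_citations, min_citations=2):
        failures = []
        for question in synthetic_questions:
            result = answer_with_citations(question)   # {"answer": ..., "citations": [...]}
            if len(result.get("citations", [])) < min_citations:
                failures.append(question)
        # Surface ungrounded answers so prompts or retrieval can be tuned.
        return {"total": len(synthetic_questions), "ungrounded": failures}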

Ingestion + OCR

Handles PDF, scanned PDF, PNG, and TIFF files. A Celery worker runs Tesseract + layout repair, then stores clean text and metadata in S3.
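
A minimal OCR step, assuming pytesseract, Pillow, and boto3; layout repair and the quality heuristics are omitted here:

    # Illustrative OCR pass for image-based documents (PNG/TIFF or rasterized PDF pages).
    import json
    import boto3
    import pytesseract
    from PIL import Image

    s3 = boto3.client("s3")

    def ocr_to_s3(image_path: str, doc_id: str, bucket: str) -> str:
        text = pytesseract.image_to_string(Image.open(image_path))
        key = f"extracted/{doc_id}.json"
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps({"doc_id": doc_id, "text": text}))
        return key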

Embedding + store

Chunks documents with adaptive windowing, generates transformer embeddings, and writes them to a Postgres + pgvector store for hybrid search.
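
A sketch of the chunk-and-store step, assuming a sentence-transformers model and a chunks table with a pgvector embedding column; the adaptive windowing is simplified to fixed overlapping windows:

    # Illustrative embed-and-store step: fixed overlapping windows stand in for
    # the adaptive chunker, and psycopg writes vectors into a pgvector column.
    import psycopg
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text: str, size: int = 800, overlap: int = 200):
        # Fixed overlapping windows; the real pipeline sizes windows adaptively.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def store_chunks(doc_id: str, text: str, dsn: str):
        pieces = chunk(text)
        vectors = model.encode(pieces)                 # one embedding per chunk
        with psycopg.connect(dsn) as conn:
            register_vector(conn)                      # lets psycopg send numpy vectors
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
                    [(doc_id, p, v) for p, v in zip(pieces, vectors)],
                )
            conn.commit()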

Theme engine

DBSCAN and Top2Vec cluster embeddings to label emergent themes, feeding a summary graph that powers instant overviews.
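
The DBSCAN half of that step, sketched with scikit-learn over the stored embeddings; eps and min_samples are illustrative values and the Top2Vec labeling pass is omitted:

    # Illustrative theme clustering: DBSCAN over chunk embeddings with cosine
    # distance, so dense regions of similar chunks become candidate themes.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_themes(embeddings: np.ndarray, eps: float = 0.35, min_samples: int = 5):
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
        themes = {}
        for idx, label in enumerate(labels):
            if label == -1:            # noise points stay unthemed
                continue
            themes.setdefault(int(label), []).append(idx)
        return themes                  # cluster id -> indices of member chunks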

Chat orchestration

A FastAPI layer retrieves the most relevant passages, injects citation metadata, and calls the LLM with guardrails for source-grounded responses.
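
A simplified sketch of that endpoint with hypothetical retrieve_chunks and call_llm helpers standing in for the pgvector search and the LangChain call; the guardrail refuses to answer when retrieval returns too few supporting chunks:

    # Illustrative RAG endpoint: retrieve passages, enforce a minimum-evidence
    # guardrail, then call the LLM with citation metadata. Helpers are stubs.
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        question: str

    def retrieve_chunks(question: str) -> list[dict]:
        # Placeholder for the pgvector hybrid search; returns scored chunks.
        return []

    def call_llm(question: str, chunks: list[dict]) -> str:
        # Placeholder for the LLM call with the citation-aware prompt.
        return ""

    @app.post("/chat")
    def chat(query: Query):
        chunks = retrieve_chunks(query.question)
        if len(chunks) < 2:
            # Guardrail: refuse rather than answer without supporting sources.
            raise HTTPException(status_code=422, detail="Not enough supporting context")
        answer = call_llm(query.question, chunks)
        citations = [{"filename": c["filename"], "pages": c["pages"]} for c in chunks]
        return {"answer": answer, "citations": citations}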

Outcomes

  • 4x faster audit briefings because analysts receive citation-ready answers in seconds instead of days of manual synthesis.
  • Zero hallucination policy backed by guardrails that block responses without supporting chunks.
  • Enterprise-ready governance with full chain-of-custody logs for every query.

Bring it to your corpus

I can adapt the ingestion adapters, embedding models, and prompt flows to your compliance requirements or private cloud. Let's explore data residency, on-prem vector stores, or custom evaluators together.