Anish Dahiya

Case Study

Document Theme Identifier Chatbot

A full-stack AI research assistant that ingests 75+ documents at once, runs OCR + semantic indexing, and lets teams query their corpus with citation-backed responses. Perfect for audits, academic reviews, legal diligence, and enterprise knowledge search.

Role
Full-stack ML Engineer
Timeline
Jan–Apr 2025
Stack highlights
FastAPI, LangChain, pgvector
Impact
4x faster audit synthesis

Why it mattered

Research teams were spending days skimming PDFs and screenshots just to build a first-pass summary. The goal was to centralize ingestion, automate OCR, and give analysts a conversational interface that cites every statement. That meant reliable processing at scale, zero hallucinations, and governance-grade traceability.

  • Upload 75+ PDFs, scans, or mixed-format files in one batch

  • Automatic OCR for image-based docs with quality heuristics

  • Semantic clustering to surface recurring themes

  • LLM answers with inline citations back to sources

  • Audit-friendly chat history with exportable traces

Architecture snapshot

Documents flow through an ingestion service that shards workloads across Celery workers. Extracted text lands in S3 with metadata before embeddings are written to pgvector. A Next.js UI controls uploads, progress, and chat, while FastAPI exposes RAG endpoints with rate limiting, audit logging, and observability hooks.
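
A minimal sketch of how that flow can be wired, assuming a Redis broker and stubbed stage bodies (the task names and payload shape are illustrative, not the production code); the real pipeline adds retries, rate limiting, and the audit logging mentioned above:

    # Illustrative pipeline wiring: each stage is a Celery task and documents
    # fan out across workers. Stage bodies are stubs; names are assumptions.
    from celery import Celery, chain

    app = Celery("pipeline", broker="redis://localhost:6379/0")

    @app.task
    def extract_text(doc_id: str) -> dict:
        # OCR / PDF text extraction, then upload clean text + metadata to S3.
        return {"doc_id": doc_id, "s3_key": f"extracted/{doc_id}.json"}

    @app.task
    def embed_chunks(payload: dict) -> dict:
        # Chunk the stored text and write embeddings to pgvector.
        return payload

    @app.task
    def index_themes(payload: dict) -> dict:
        # Refresh clustering so new documents appear in theme overviews.
        return payload

    def enqueue(doc_id: str):
        # One chain per uploaded document; Celery spreads the tasks across workers.
        return chain(extract_text.s(doc_id), embed_chunks.s(), index_themes.s()).apply_async()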

  • Python
  • FastAPI
  • LangChain
  • pgvector
  • Celery
  • Redis
  • Tesseract OCR
  • Next.js frontend

Citation-first UX

Every chat turn highlights the supporting snippets, page ranges, and filenames. Analysts can expand a citation to preview the original paragraph without leaving the thread, or export a report with footnotes—ideal for audits and research compliance.
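
One way to model that citation payload, sketched as a Pydantic schema with assumed field names (the real response shape may differ):

    # Illustrative citation schema: each answer carries the snippets that back it.
    from pydantic import BaseModel

    class Citation(BaseModel):
        filename: str          # source document shown in the chat turn
        page_start: int        # page range of the supporting passage
        page_end: int
        snippet: str           # paragraph previewed when the citation is expanded

    class ChatAnswer(BaseModel):
        answer: str
        citations: list[Citation]   # exported as footnotes in audit reports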

Throughput: 75 docs / 8 min ingest window

Answer latency: ~1.7 s median

Citation coverage: 100% of responses include ≥2 references

Workflow + guardrails

The RAG pipeline favors transparency: we log embeddings, prompt templates, and responses for replay. Automated evaluations run nightly with synthetic questions to ensure the LLM stays grounded in source material.
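
A hedged sketch of what a nightly groundedness check might look like, assuming a list of synthetic questions and an answer_with_citations callable standing in for the real RAG endpoint:

    # Illustrative nightly evaluation: ask synthetic questions and flag any
    # answer that comes back without supporting chunks. Names are assumptions.
    def run_nightly_eval(synthetic_questions, answer_with_citations, min_citations=2):
        failures = []
        for question in synthetic_questions:
            result = answer_with_citations(question)   # {"answer": ..., "citations": [...]}
            if len(result.get("citations", [])) < min_citations:
                failures.append(question)
        # Surface ungrounded answers so prompts or retrieval can be tuned.
        return {"total": len(synthetic_questions), "ungrounded": failures}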

Ingestion + OCR

Handles PDF, scanned PDF, PNG, and TIFF files. A Celery worker runs Tesseract + layout repair, then stores clean text and metadata in S3.
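
A minimal OCR step, assuming pytesseract, Pillow, and boto3; layout repair and the quality heuristics are omitted here:

    # Illustrative OCR pass for image-based documents (PNG/TIFF or rasterized PDF pages).
    import json
    import boto3
    import pytesseract
    from PIL import Image

    s3 = boto3.client("s3")

    def ocr_to_s3(image_path: str, doc_id: str, bucket: str) -> str:
        text = pytesseract.image_to_string(Image.open(image_path))
        key = f"extracted/{doc_id}.json"
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps({"doc_id": doc_id, "text": text}))
        return key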

Embedding + store

Chunks documents with adaptive windowing, generates transformer embeddings, and writes them to a Postgres + pgvector store for hybrid search.
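
A sketch of the chunk-and-store step, assuming a sentence-transformers model and a chunks table with a pgvector embedding column; the adaptive windowing is simplified to fixed overlapping windows:

    # Illustrative embed-and-store step: fixed overlapping windows stand in for
    # the adaptive chunker, and psycopg writes vectors into a pgvector column.
    import psycopg
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text: str, size: int = 800, overlap: int = 200):
        # Fixed overlapping windows; the real pipeline sizes windows adaptively.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def store_chunks(doc_id: str, text: str, dsn: str):
        pieces = chunk(text)
        vectors = model.encode(pieces)                 # one embedding per chunk
        with psycopg.connect(dsn) as conn:
            register_vector(conn)                      # lets psycopg send numpy vectors
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
                    [(doc_id, p, v) for p, v in zip(pieces, vectors)],
                )
            conn.commit()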

Theme engine

DBSCAN and Top2Vec cluster embeddings to label emergent themes, feeding a summary graph that powers instant overviews.
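
The DBSCAN half of that step, sketched with scikit-learn over the stored embeddings; eps and min_samples are illustrative values and the Top2Vec labeling pass is omitted:

    # Illustrative theme clustering: DBSCAN over chunk embeddings with cosine
    # distance, so dense regions of similar chunks become candidate themes.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_themes(embeddings: np.ndarray, eps: float = 0.35, min_samples: int = 5):
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
        themes = {}
        for idx, label in enumerate(labels):
            if label == -1:            # noise points stay unthemed
                continue
            themes.setdefault(int(label), []).append(idx)
        return themes                  # cluster id -> indices of member chunks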

Chat orchestration

A FastAPI layer retrieves the most relevant passages, injects citation metadata, and calls the LLM with guardrails for source-grounded responses.
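
A simplified sketch of that endpoint with hypothetical retrieve_chunks and call_llm helpers standing in for the pgvector search and the LangChain call; the guardrail refuses to answer when retrieval returns too few supporting chunks:

    # Illustrative RAG endpoint: retrieve passages, enforce a minimum-evidence
    # guardrail, then call the LLM with citation metadata. Helpers are stubs.
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        question: str

    def retrieve_chunks(question: str) -> list[dict]:
        # Placeholder for the pgvector hybrid search; returns scored chunks.
        return []

    def call_llm(question: str, chunks: list[dict]) -> str:
        # Placeholder for the LLM call with the citation-aware prompt.
        return ""

    @app.post("/chat")
    def chat(query: Query):
        chunks = retrieve_chunks(query.question)
        if len(chunks) < 2:
            # Guardrail: refuse rather than answer without supporting sources.
            raise HTTPException(status_code=422, detail="Not enough supporting context")
        answer = call_llm(query.question, chunks)
        citations = [{"filename": c["filename"], "pages": c["pages"]} for c in chunks]
        return {"answer": answer, "citations": citations}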

Outcomes

  • 4x faster audit briefings because analysts receive citation-ready answers in seconds instead of days of manual synthesis.
  • Zero hallucination policy backed by guardrails that block responses without supporting chunks.
  • Enterprise-ready governance with full chain-of-custody logs for every query.

Bring it to your corpus

I can adapt the ingestion adapters, embedding models, and prompt flows to your compliance requirements or private cloud. Let's explore data residency, on-prem vector stores, or custom evaluators together.