Case Study
Wafer Fault Detection
Built an end-to-end workflow that ingests 590-sensor wafer batches, clusters abnormal signatures, and automatically promotes the best Random Forest or XGBoost model per cluster—cutting false scrap calls by 37% for a Tier-1 OSAT.
- Role: Lead ML Engineer
- Timeline: Feb–Jun 2024
- Stack highlights: AWS, FastAPI, MLflow
- Impact: 37% fewer false rejects
Why it mattered
Incoming wafer lots were regularly quarantined because the legacy SPC rules could not account for cross-sensor correlations. Operators spent hours on manual triage, slowing down the burn-in line and inflating scrap cost. We needed a classifier that treated wafers as multivariate time series, gave confidence bands per sensor cluster, and plugged into the existing MES APIs without slowing production.
- 590 sensors × 178 wafers/lot: raw telemetry streamed from Advantest V93000 testers into S3 via Kinesis Firehose every 7 minutes.
- 92% precision floor: the business threshold for keeping false alarms from swamping operators.
- p99 latency < 2.5 s: required for real-time gating before wafers hit burn-in.
Architecture snapshot
Streaming telemetry lands in S3, kicks off Step Functions that run validation, clustering, model selection, and metric logging. Inference lives inside a FastAPI service containerized via Docker, deployed on Elastic Beanstalk with autoscaling rules tied to lot volume.
- FastAPI scoring service
- Docker + GitHub Actions
- AWS S3, SQS, Step Functions
- Elastic Beanstalk (Blue/Green)
- Grafana + Prometheus
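A minimal sketch of what the scoring endpoint described above could look like, assuming per-cluster models are loaded into memory at startup; the payload schema, the `assign_cluster` helper, and the 0.5 gating threshold are illustrative, not the production values.

```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

class WaferPayload(BaseModel):
    lot_id: str
    wafer_id: str
    features: list[float]  # flattened 590-sensor feature vector (assumed shape)

# Hypothetical registry: cluster id -> fitted Random Forest / XGBoost model
MODELS: dict[int, object] = {}

def assign_cluster(x: np.ndarray) -> int:
    """Placeholder for the cluster-assignment step that runs before scoring."""
    return 0

@app.post("/score")
def score(payload: WaferPayload) -> dict:
    x = np.asarray(payload.features, dtype=float).reshape(1, -1)
    cluster = assign_cluster(x)
    proba = float(MODELS[cluster].predict_proba(x)[0, 1])
    # Gate decision: block the wafer if the fault probability clears the threshold
    return {"lot_id": payload.lot_id, "wafer_id": payload.wafer_id,
            "cluster": cluster, "fault_proba": proba, "block": proba >= 0.5}
```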
Ops automation
GitHub Actions handles build-test-lint, launches smoke tests against the Beanstalk staging env, and rolls traffic with a blue/green swap. Model metadata and experiment lineage live in MLflow so the quality team can audit every promotion.
- CI runtime: 6m 20s
- Promotion cadence: weekly or on drift alert
- Rollback budget: < 30s via blue/green swap
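A hedged sketch of how a weekly promotion could be recorded against the MLflow model registry; the model name and the use of registry stages (rather than aliases) are assumptions about the setup, not the team's exact configuration.

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_champion(run_id: str, model_name: str = "wafer-fault-champion") -> None:
    # Register the model artifact logged by the winning experiment run ...
    version = mlflow.register_model(f"runs:/{run_id}/model", model_name)
    # ... and mark it as the production champion, archiving the previous one
    client.transition_model_version_stage(
        name=model_name,
        version=version.version,
        stage="Production",
        archive_existing_versions=True,
    )
```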
Pipeline, modeling, and guardrails
Each wafer is processed as a multivariate sequence. LightGBM was evaluated but not adopted: Random Forest and XGBoost with class weighting already met the latency budget without GPU spend, so the two-candidate bake-off stayed. Guardrails keep the system safe even when a fab recalibrates sensors mid-shift.
Sensor integrity gate
Validates 590-sensor payloads in less than 800 ms with schema checks, missing value heuristics, and 3-sigma drift rules before persisting to S3.
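A rough sketch of the kinds of checks the gate runs, assuming each batch arrives as a wafers-by-sensors DataFrame; the 2% missingness cutoff and the baseline arrays are illustrative, not the production thresholds.

```python
import numpy as np
import pandas as pd

N_SENSORS = 590

def validate_batch(df: pd.DataFrame,
                   baseline_mean: np.ndarray,
                   baseline_std: np.ndarray) -> bool:
    # Schema check: every wafer row must carry all 590 sensor columns
    if df.shape[1] != N_SENSORS:
        return False
    # Missing-value heuristic: reject batches with more than 2% nulls in any sensor
    if (df.isna().mean() > 0.02).any():
        return False
    # 3-sigma drift rule against the stored per-sensor baseline
    batch_mean = df.mean().to_numpy()
    drifted = np.abs(batch_mean - baseline_mean) > 3 * baseline_std
    return not drifted.any()
```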
Feature alignment
Interpolates out-of-phase signals and standardizes per-sensor distributions to keep the downstream clustering step stable across fabs.
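A simplified sketch of the alignment step, assuming each wafer's raw telemetry is a time-indexed DataFrame; the 1-second resampling grid and the fitted `StandardScaler` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def align_wafer(raw: pd.DataFrame, freq: str = "1s") -> pd.DataFrame:
    # Put out-of-phase signals on a common time grid, then interpolate the gaps
    return raw.resample(freq).mean().interpolate(method="time")

def standardize(features: pd.DataFrame, scaler: StandardScaler) -> pd.DataFrame:
    # Per-sensor standardization keeps the downstream clustering stable across fabs
    return pd.DataFrame(scaler.transform(features),
                        columns=features.columns, index=features.index)
```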
Adaptive clustering
Uses a silhouette-tuned DBSCAN job (or K-Means fallback) to group wafers with similar fault signatures, enabling per-cluster model selection.
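Roughly how a silhouette-tuned DBSCAN sweep with a K-Means fallback can be wired up with scikit-learn; the eps grid, `min_samples`, and fallback cluster count are placeholders rather than the tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

def cluster_wafers(X: np.ndarray,
                   eps_grid=(0.5, 1.0, 1.5, 2.0),
                   k_fallback: int = 4) -> np.ndarray:
    best_labels, best_score = None, -1.0
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue  # degenerate partition, try the next eps
        mask = labels != -1  # score only the non-noise points
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_labels, best_score = labels, score
    if best_labels is None:
        # Fallback: K-Means when no eps yields a usable partition
        best_labels = KMeans(n_clusters=k_fallback, n_init=10).fit_predict(X)
    return best_labels
```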
Model selector
Tests Random Forest vs. XGBoost per cluster each week; promotes the champion to the inference stack with shadow validation in SageMaker notebooks.
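A condensed sketch of the per-cluster bake-off; the hyperparameters and the simple stratified hold-out are illustrative (the production pipeline relies on the time-based splits described under lot-level backtesting below).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

PRECISION_FLOOR = 0.92

def select_champion(X, y):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
        "xgboost": XGBClassifier(n_estimators=300, scale_pos_weight=10, eval_metric="logloss"),
    }
    best_name, best_model, best_prec = None, None, PRECISION_FLOOR
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        prec = precision_score(y_val, model.predict(X_val))
        # Only models that clear the precision floor are eligible for promotion
        if prec >= best_prec:
            best_name, best_model, best_prec = name, model, prec
    return best_name, best_model, best_prec
```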
Lot-level backtesting
Replays the last 90 days of production lots nightly with time-based splits to ensure no leakage in metric reporting.
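The leakage guard reduces to a strict time cutoff; a minimal sketch, assuming each lot row carries a completion timestamp (the `lot_end_time` column name is hypothetical).

```python
import pandas as pd

def time_split(lots: pd.DataFrame, cutoff: pd.Timestamp):
    # Train on everything completed before the cutoff, evaluate on lots after it;
    # no shuffling, so later lots can never leak into the training window.
    train = lots[lots["lot_end_time"] < cutoff]
    test = lots[lots["lot_end_time"] >= cutoff]
    return train, test
```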
Fail-fast canaries
Routes 2% of wafers through a shadow stack whenever a new model is promoted; aborts automatically if precision drops below 92%.
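A minimal sketch of the abort rule, assuming canary wafers eventually receive ground-truth labels from downstream test; the 200-sample warm-up is an assumed minimum, while the 92% floor matches the precision requirement above.

```python
from sklearn.metrics import precision_score

PRECISION_FLOOR = 0.92

def canary_ok(y_true, y_shadow_pred, min_samples: int = 200) -> bool:
    # Wait for enough labeled canary wafers before judging the new model
    if len(y_true) < min_samples:
        return True
    # Abort the rollout if shadow-stack precision dips below the floor
    return precision_score(y_true, y_shadow_pred) >= PRECISION_FLOOR
```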
Sensor drift monitors
Publishes a drift score per sensor to Grafana; alerts trigger when the KL divergence vs. baseline exceeds 0.08 for 15 mins.
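One way to compute the per-sensor score, using a histogram-based KL divergence against the stored baseline; the bin count and smoothing constant are assumptions, while the 0.08 threshold matches the alert rule above.

```python
import numpy as np
from scipy.stats import entropy

def drift_score(baseline: np.ndarray, current: np.ndarray, bins: int = 50) -> float:
    lo, hi = baseline.min(), baseline.max()
    p, _ = np.histogram(baseline, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(current, bins=bins, range=(lo, hi), density=True)
    # Smooth to avoid divide-by-zero when a bin is empty
    p, q = p + 1e-9, q + 1e-9
    return float(entropy(p, q))  # KL(baseline || current)

def drifted(baseline: np.ndarray, current: np.ndarray, threshold: float = 0.08) -> bool:
    return drift_score(baseline, current) > threshold
```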
Outcomes
- 37% fewer false rejects after replacing the SPC playbook with model-driven gating.
- $410K/month recovered capacity thanks to faster triage and automated lot release.
- Sub-2s blocking decision kept the burn-in line utilization above 94%.
Want the playbook?
I packaged environment templates, IaC modules, and validation notebooks so other fabs can bootstrap the same stack. Happy to walk through the trade-offs, cost controls, and SOPs we set up for the manufacturing team.