Case Study
Wafer Fault Detection
Built an end-to-end workflow that ingests 590-sensor wafer batches, clusters abnormal signatures, and automatically promotes the best Random Forest or XGBoost model per cluster—cutting false scrap calls by 37% for a Tier-1 OSAT.
- Role: Lead ML Engineer
- Timeline: Feb–Jun 2024
- Stack highlights: AWS, FastAPI, MLflow
- Impact: 37% fewer false rejects
Why it mattered
Incoming wafer lots were regularly quarantined because the legacy SPC rules could not account for cross-sensor correlations. Operators spent hours on manual triage, slowing down the burn-in line and inflating scrap cost. We needed a classifier that treated wafers as multivariate time series, gave confidence bands per sensor cluster, and plugged into the existing MES APIs without slowing production.
- 590 sensors × 178 wafers/lot: raw telemetry streamed from Advantest V93000 testers into S3 via Kinesis Firehose every 7 minutes.
- 92% precision floor: the business threshold for keeping false alarms from swamping operators.
- p99 latency < 2.5 s: required for real-time gating before wafers hit burn-in.
Architecture snapshot
Streaming telemetry lands in S3, kicks off Step Functions that run validation, clustering, model selection, and metric logging. Inference lives inside a FastAPI service containerized via Docker, deployed on Elastic Beanstalk with autoscaling rules tied to lot volume.
- FastAPI scoring service
- Docker + GitHub Actions
- AWS S3, SQS, Step Functions
- Elastic Beanstalk (Blue/Green)
- Grafana + Prometheus
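A minimal sketch of what the scoring endpoint described above could look like, assuming per-cluster models are loaded into memory at startup; the payload schema, the `assign_cluster` helper, and the 0.5 gating threshold are illustrative, not the production values.

```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np

app = FastAPI()

class WaferPayload(BaseModel):
    lot_id: str
    wafer_id: str
    features: list[float]  # flattened 590-sensor feature vector (assumed shape)

# Hypothetical registry: cluster id -> fitted Random Forest / XGBoost model
MODELS: dict[int, object] = {}

def assign_cluster(x: np.ndarray) -> int:
    """Placeholder for the cluster-assignment step that runs before scoring."""
    return 0

@app.post("/score")
def score(payload: WaferPayload) -> dict:
    x = np.asarray(payload.features, dtype=float).reshape(1, -1)
    cluster = assign_cluster(x)
    proba = float(MODELS[cluster].predict_proba(x)[0, 1])
    # Gate decision: block the wafer if the fault probability clears the threshold
    return {"lot_id": payload.lot_id, "wafer_id": payload.wafer_id,
            "cluster": cluster, "fault_proba": proba, "block": proba >= 0.5}
```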
Ops automation
GitHub Actions handles build-test-lint, launches smoke tests against the Beanstalk staging env, and rolls traffic with a blue/green swap. Model metadata and experiment lineage live in MLflow so the quality team can audit every promotion.
- CI runtime: 6m 20s
- Promotion cadence: weekly or on drift alert
- Rollback budget: < 30s via blue/green swap
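A hedged sketch of how a weekly promotion could be recorded against the MLflow model registry; the model name and the use of registry stages (rather than aliases) are assumptions about the setup, not the team's exact configuration.

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_champion(run_id: str, model_name: str = "wafer-fault-champion") -> None:
    # Register the model artifact logged by the winning experiment run ...
    version = mlflow.register_model(f"runs:/{run_id}/model", model_name)
    # ... and mark it as the production champion, archiving the previous one
    client.transition_model_version_stage(
        name=model_name,
        version=version.version,
        stage="Production",
        archive_existing_versions=True,
    )
```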
Pipeline, modeling, and guardrails
Each wafer is processed as a multivariate sequence. LightGBM was evaluated but not adopted: Random Forest and XGBoost with class weighting already met the latency budget without GPU spend, so the two-candidate bake-off stayed. Guardrails keep the system safe even when a fab recalibrates sensors mid-shift.
Sensor integrity gate
Validates 590-sensor payloads in less than 800 ms with schema checks, missing value heuristics, and 3-sigma drift rules before persisting to S3.
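A rough sketch of the kinds of checks the gate runs, assuming each batch arrives as a wafers-by-sensors DataFrame; the 2% missingness cutoff and the baseline arrays are illustrative, not the production thresholds.

```python
import numpy as np
import pandas as pd

N_SENSORS = 590

def validate_batch(df: pd.DataFrame,
                   baseline_mean: np.ndarray,
                   baseline_std: np.ndarray) -> bool:
    # Schema check: every wafer row must carry all 590 sensor columns
    if df.shape[1] != N_SENSORS:
        return False
    # Missing-value heuristic: reject batches with more than 2% nulls in any sensor
    if (df.isna().mean() > 0.02).any():
        return False
    # 3-sigma drift rule against the stored per-sensor baseline
    batch_mean = df.mean().to_numpy()
    drifted = np.abs(batch_mean - baseline_mean) > 3 * baseline_std
    return not drifted.any()
```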
Feature alignment
Interpolates out-of-phase signals and standardizes per-sensor distributions to keep the downstream clustering step stable across fabs.
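A simplified sketch of the alignment step, assuming each wafer's raw telemetry is a time-indexed DataFrame; the 1-second resampling grid and the fitted `StandardScaler` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def align_wafer(raw: pd.DataFrame, freq: str = "1s") -> pd.DataFrame:
    # Put out-of-phase signals on a common time grid, then interpolate the gaps
    return raw.resample(freq).mean().interpolate(method="time")

def standardize(features: pd.DataFrame, scaler: StandardScaler) -> pd.DataFrame:
    # Per-sensor standardization keeps the downstream clustering stable across fabs
    return pd.DataFrame(scaler.transform(features),
                        columns=features.columns, index=features.index)
```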
Adaptive clustering
Uses a silhouette-tuned DBSCAN job (or K-Means fallback) to group wafers with similar fault signatures, enabling per-cluster model selection.
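Roughly how a silhouette-tuned DBSCAN sweep with a K-Means fallback can be wired up with scikit-learn; the eps grid, `min_samples`, and fallback cluster count are placeholders rather than the tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

def cluster_wafers(X: np.ndarray,
                   eps_grid=(0.5, 1.0, 1.5, 2.0),
                   k_fallback: int = 4) -> np.ndarray:
    best_labels, best_score = None, -1.0
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue  # degenerate partition, try the next eps
        mask = labels != -1  # score only the non-noise points
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_labels, best_score = labels, score
    if best_labels is None:
        # Fallback: K-Means when no eps yields a usable partition
        best_labels = KMeans(n_clusters=k_fallback, n_init=10).fit_predict(X)
    return best_labels
```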
Model selector
Tests Random Forest vs. XGBoost per cluster each week; promotes the champion to the inference stack with shadow validation in SageMaker notebooks.
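A condensed sketch of the per-cluster bake-off; the hyperparameters and the simple stratified hold-out are illustrative (the production pipeline relies on the time-based splits described under lot-level backtesting below).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

PRECISION_FLOOR = 0.92

def select_champion(X, y):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=300, class_weight="balanced"),
        "xgboost": XGBClassifier(n_estimators=300, scale_pos_weight=10, eval_metric="logloss"),
    }
    best_name, best_model, best_prec = None, None, PRECISION_FLOOR
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        prec = precision_score(y_val, model.predict(X_val))
        # Only models that clear the precision floor are eligible for promotion
        if prec >= best_prec:
            best_name, best_model, best_prec = name, model, prec
    return best_name, best_model, best_prec
```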
Lot-level backtesting
Replays the last 90 days of production lots nightly with time-based splits to ensure no leakage in metric reporting.
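The leakage guard reduces to a strict time cutoff; a minimal sketch, assuming each lot row carries a completion timestamp (the `lot_end_time` column name is hypothetical).

```python
import pandas as pd

def time_split(lots: pd.DataFrame, cutoff: pd.Timestamp):
    # Train on everything completed before the cutoff, evaluate on lots after it;
    # no shuffling, so later lots can never leak into the training window.
    train = lots[lots["lot_end_time"] < cutoff]
    test = lots[lots["lot_end_time"] >= cutoff]
    return train, test
```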
Fail-fast canaries
Routes 2% of wafers through a shadow stack whenever a new model is promoted; aborts automatically if precision drops below 92%.
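A minimal sketch of the abort rule, assuming canary wafers eventually receive ground-truth labels from downstream test; the 200-sample warm-up is an assumed minimum, while the 92% floor matches the precision requirement above.

```python
from sklearn.metrics import precision_score

PRECISION_FLOOR = 0.92

def canary_ok(y_true, y_shadow_pred, min_samples: int = 200) -> bool:
    # Wait for enough labeled canary wafers before judging the new model
    if len(y_true) < min_samples:
        return True
    # Abort the rollout if shadow-stack precision dips below the floor
    return precision_score(y_true, y_shadow_pred) >= PRECISION_FLOOR
```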
Sensor drift monitors
Publishes a drift score per sensor to Grafana; alerts trigger when the KL divergence vs. baseline exceeds 0.08 for 15 mins.
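One way to compute the per-sensor score, using a histogram-based KL divergence against the stored baseline; the bin count and smoothing constant are assumptions, while the 0.08 threshold matches the alert rule above.

```python
import numpy as np
from scipy.stats import entropy

def drift_score(baseline: np.ndarray, current: np.ndarray, bins: int = 50) -> float:
    lo, hi = baseline.min(), baseline.max()
    p, _ = np.histogram(baseline, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(current, bins=bins, range=(lo, hi), density=True)
    # Smooth to avoid divide-by-zero when a bin is empty
    p, q = p + 1e-9, q + 1e-9
    return float(entropy(p, q))  # KL(baseline || current)

def drifted(baseline: np.ndarray, current: np.ndarray, threshold: float = 0.08) -> bool:
    return drift_score(baseline, current) > threshold
```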
Outcomes
- 37% fewer false rejects after replacing the SPC playbook with model-driven gating.
- $410K/month recovered capacity thanks to faster triage and automated lot release.
- Sub-2s blocking decision kept the burn-in line utilization above 94%.
Want the playbook?
I packaged environment templates, IaC modules, and validation notebooks so other fabs can bootstrap the same stack. Happy to walk through the trade-offs, cost controls, and SOPs we set up for the manufacturing team.