Case Study
Parkinson's Disease Prediction
A biomedical voice analytics pipeline that detects early Parkinsonian signatures. The project standardizes acoustic biomarkers, compares supervised learning models, and surfaces interpretable outputs clinicians can trust for screening support.
- Role: ML Researcher
- Timeline: Jan–Apr 2022
- Stack highlights: scikit-learn, pandas, Seaborn
- Impact: 86% accuracy baseline
Why voice?
Voice deviations appear before overt motor decline, making acoustic biomarkers a promising early-warning channel. By quantifying jitter, shimmer, and spectral noise, the model offers clinicians an inexpensive triage aid that complements in-person exams. Key acoustic features include:
- MDVP:Fo(Hz) – average vocal fundamental frequency
- Jitter / Jitter(%) – cycle-to-cycle micro-variations in pitch
- Shimmer / Shimmer(dB) – cycle-to-cycle amplitude deviations
- NHR / HNR – noise-to-harmonics and harmonics-to-noise ratios
- RPDE + DFA – nonlinear complexity measures (recurrence period density entropy, detrended fluctuation analysis)
- PPE – pitch period entropy, a nonlinear measure of fundamental frequency variation
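These feature names match the UCI Parkinson's voice dataset. A minimal loading sketch, assuming a local `parkinsons.csv` with those columns and a binary `status` label (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; the UCI Parkinson's voice dataset ships these
# columns, with `status` = 1 for Parkinson's and 0 for healthy recordings.
df = pd.read_csv("parkinsons.csv")

biomarkers = [
    "MDVP:Fo(Hz)", "MDVP:Jitter(%)", "MDVP:Shimmer(dB)",
    "NHR", "HNR", "RPDE", "DFA", "PPE",
]
X = df[biomarkers]
y = df["status"]
print(X.describe().T[["mean", "std", "min", "max"]])
```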
Architecture snapshot
A modular notebook-to-library workflow: preprocessing utilities clean and scale inputs, training scripts benchmark models with cross-validation, and a reporting module exports metrics, plots, and coefficients for clinicians. A minimal sketch of this split follows the stack list below.
- Python
- NumPy
- pandas
- scikit-learn
- Matplotlib
- Seaborn
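The sketch below illustrates the workflow under the assumptions above; `build_pipeline` and `benchmark` are illustrative names, not the repo's actual API:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline():
    # Scaling lives inside the pipeline so it is refit on each CV fold,
    # preventing train/test leakage.
    return Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])

def benchmark(X, y, cv=5):
    # Training-script side: one summary number per model for comparison.
    return cross_val_score(build_pipeline(), X, y, cv=cv).mean()
```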
Model lineup
Logistic Regression
≈86% accuracy. Champion model: interpretable weights and reliable calibration for screening workflows.
SVM (RBF)
High accuracy. Strong precision, but the kernel decision boundary is harder to explain to clinicians.
Random Forest
Strong F1. Captures feature interactions; used as a robustness check.
KNN
Lagging. Too sensitive to feature scaling and noisy neighbors; kept as a documented baseline.
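For concreteness, the four candidates could be declared like this for benchmarking; hyperparameters shown are illustrative defaults, not the study's tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative defaults, not the study's tuned hyperparameters.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf", probability=True),
    "rf": RandomForestClassifier(n_estimators=300, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
```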
Workflow and validation
Every experiment logs confusion matrices, ROC curves, and class-wise recall. Clinical usability is preserved by favoring interpretable coefficients and by flagging predictions that fall within an uncertainty margin.
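One way the uncertainty-margin flag could look; `flag_uncertain` and the 0.1 margin are hypothetical, a sketch rather than the project's exact implementation:

```python
import numpy as np

def flag_uncertain(fitted_pipeline, X, margin=0.1):
    # Flag cases whose P(Parkinson's) falls inside 0.5 +/- margin so a
    # clinician reviews them; the 0.1 margin is an illustrative value.
    proba = fitted_pipeline.predict_proba(X)[:, 1]
    return np.abs(proba - 0.5) < margin
```

Flagged rows can be routed to manual review instead of being reported as hard labels.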
Data curation
Aggregated biomedical voice samples from Parkinson's and healthy cohorts, tracked provenance, and tagged recordings with session metadata.
Preprocessing + EDA
Imputed missing values, standardized every feature, and ran correlation heatmaps plus SHAP-style importance analyses to surface the dominant drivers.
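A compact EDA sketch of this step, assuming the biomarker frame `X` from the loading snippet above; the median-imputation strategy and colormap are illustrative choices:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# `X` is the biomarker DataFrame from the loading snippet above.
X_clean = SimpleImputer(strategy="median").fit_transform(X)
X_std = StandardScaler().fit_transform(X_clean)

# Correlation heatmap across the standardized biomarkers.
corr = pd.DataFrame(X_std, columns=X.columns).corr()
sns.heatmap(corr, cmap="vlag", center=0)
plt.tight_layout()
plt.show()
```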
Model benchmarking
Cross-validated Logistic Regression, SVM (RBF), Random Forest, and KNN classifiers with precision/recall + ROC monitoring.
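A possible benchmarking loop, reusing the hypothetical `candidates` dict from the model lineup sketch; every model shares the same scaling step for a fair comparison:

```python
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scoring = ["precision", "recall", "roc_auc"]
for name, model in candidates.items():
    # Each candidate gets the same scaling so KNN/SVM are not penalized
    # by raw feature ranges.
    cv = cross_validate(make_pipeline(StandardScaler(), model),
                        X, y, cv=5, scoring=scoring)
    print(name, {m: round(cv[f"test_{m}"].mean(), 3) for m in scoring})
```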
Explainability + reporting
Generated coefficient tables, confusion matrices, and clinician-friendly insights that highlight dominant biomarkers.
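A sketch of the coefficient export, assuming the `build_pipeline` helper from the architecture sketch; because features are standardized, coefficient magnitudes are directly comparable across biomarkers:

```python
import pandas as pd

# Fit the champion pipeline, then map the standardized coefficients
# back to biomarker names for a clinician-friendly table.
clf = build_pipeline().fit(X, y)
coefs = (pd.Series(clf.named_steps["model"].coef_[0], index=X.columns)
           .sort_values(key=abs, ascending=False))
print(coefs.to_frame("standardized coefficient"))
```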
Outcomes
- 86% accuracy baseline with clean coefficient reporting for clinicians.
- ROC-AUC of 0.91, demonstrating strong separation between Parkinson's and healthy cases.
- Feature transparency thanks to per-biomarker importance exports.
Extend it
The repo includes preprocessing scripts, model configs, and evaluation templates so labs can swap in new biomarkers or multimodal signals. I'm happy to adapt the workflow to your research study or clinical trial.