Anish Dahiya

Case Study

Parkinson's Disease Prediction

A biomedical voice analytics pipeline that detects early Parkinsonian signatures. The project standardizes acoustic biomarkers, compares supervised learning models, and surfaces interpretable outputs clinicians can trust for screening support.

Role
ML Researcher
Timeline
Jan–Apr 2022
Stack highlights
scikit-learn, pandas, Seaborn
Impact
86% accuracy baseline

Why voice?

Voice deviations appear before overt motor decline, making acoustic biomarkers a promising early-warning channel. By quantifying jitter, shimmer, and spectral noise, the model offers clinicians an inexpensive triage aid that complements in-person exams.

  • MDVP:Fo(Hz) – fundamental frequency

  • Jitter / Jitter(%) – micro-variations in pitch

  • Shimmer / Shimmer(dB) – amplitude deviations

  • NHR / HNR – noise-to-harmonics and harmonics-to-noise ratios

  • RPDE + DFA – nonlinear complexity measures

  • PPE – pitch period entropy, a nonlinear measure of fundamental-frequency variation
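
These biomarkers form the columns of the feature matrix. A minimal sketch, using the UCI Parkinson's dataset naming convention with synthetic values in place of real recordings:

```python
import numpy as np
import pandas as pd

# Column names follow the UCI Parkinson's voice dataset convention;
# the values below are synthetic, for illustration only.
rng = np.random.default_rng(0)
features = ["MDVP:Fo(Hz)", "MDVP:Jitter(%)", "MDVP:Shimmer(dB)",
            "NHR", "HNR", "RPDE", "DFA", "PPE"]
df = pd.DataFrame(rng.normal(size=(10, len(features))), columns=features)
df["status"] = rng.integers(0, 2, size=10)  # 1 = Parkinson's, 0 = healthy

X = df[features]   # acoustic biomarker matrix
y = df["status"]   # screening label
print(X.shape)
```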

Architecture snapshot

A modular notebook-to-library workflow: preprocessing utilities clean and scale inputs, training scripts benchmark models with cross-validation, and a reporting module exports metrics, plots, and coefficients for clinicians.
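
The preprocessing-then-train flow can be sketched as a single scikit-learn Pipeline, so that scaling statistics are refit per fold and never leak from test data (a sketch on synthetic data, not the project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice-feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Scaler and classifier live in one Pipeline so cross-validation
# refits the scaler inside each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(f"training accuracy: {pipe.score(X, y):.2f}")
```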

  • Python
  • NumPy
  • pandas
  • scikit-learn
  • Matplotlib
  • Seaborn

Model lineup

  • Logistic Regression

    ≈86% accuracy – champion model, with interpretable weights and reliable calibration for screening workflows.

  • SVM (RBF)

    High accuracy – great precision, but harder to explain to clinicians.

  • Random Forest

    Strong F1 – captures feature interactions; used as a robustness check.

  • KNN

    Lagging – too sensitive to feature scaling and noisy neighbors; documented as a baseline.
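
The four-model comparison reduces to a short cross-validation loop. A minimal sketch on synthetic data, with default hyperparameters standing in for the tuned ones:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice-feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # Scaling inside the pipeline keeps KNN and SVM on fair footing.
    pipe = make_pipeline(StandardScaler(), model)
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```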

Workflow and validation

Every experiment logs confusion matrices, ROC curves, and class-wise recall. Clinical usability is preserved by favoring interpretable coefficients and by flagging predictions that fall within an uncertainty margin.
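
The uncertainty-margin flag can be implemented off the predicted probabilities: anything close to the 0.5 decision boundary is routed to manual review. A sketch on synthetic data; the margin width is a hypothetical value, not the project's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print(confusion_matrix(y_te, clf.predict(X_te)))
auc = roc_auc_score(y_te, proba)
print(f"ROC-AUC: {auc:.3f}")

# Flag predictions inside an uncertainty margin around 0.5 for
# manual review rather than automated triage.
margin = 0.15  # hypothetical width; tune per deployment
uncertain = np.abs(proba - 0.5) < margin
print(int(uncertain.sum()), "cases flagged for review")
```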

Data curation

Aggregated biomedical voice samples from Parkinson's and healthy cohorts, tracked provenance, and tagged recordings with session metadata.

Preprocessing + EDA

Imputed missing values, standardized every feature, and ran correlation heatmaps plus SHAP-style importance to understand drivers.
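
The imputation and standardization step can be sketched as follows (toy frame with illustrative column names; the heatmap call is shown commented since it renders in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frame with one missing value; column names are illustrative.
df = pd.DataFrame({"jitter": [0.3, np.nan, 0.5, 0.4],
                   "shimmer": [1.1, 0.9, 1.4, 1.2],
                   "hnr": [21.0, 19.5, 18.2, 20.1]})

# Median imputation, then zero-mean/unit-variance scaling.
X = SimpleImputer(strategy="median").fit_transform(df)
X = StandardScaler().fit_transform(X)

corr = pd.DataFrame(X, columns=df.columns).corr()
# sns.heatmap(corr, annot=True)  # correlation heatmap, in the notebook
print(corr.round(2))
```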

Model benchmarking

Cross-validated Logistic Regression, SVM (RBF), Random Forest, and KNN classifiers with precision/recall + ROC monitoring.

Explainability + reporting

Generated coefficient tables, confusion matrices, and clinician-friendly insights that highlight dominant biomarkers.
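
A coefficient table for the logistic model reduces to sorting standardized weights by magnitude. A sketch with hypothetical biomarker names and synthetic data in place of the real feature set:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical biomarker names stand in for the real columns.
names = ["jitter", "shimmer", "nhr", "hnr", "rpde", "dfa", "ppe"]
X, y = make_classification(n_samples=200, n_features=7, random_state=1)
X = StandardScaler().fit_transform(X)  # scaling makes weights comparable

clf = LogisticRegression(max_iter=1000).fit(X, y)
coef_table = (pd.DataFrame({"biomarker": names, "coefficient": clf.coef_[0]})
              .assign(abs_weight=lambda d: d["coefficient"].abs())
              .sort_values("abs_weight", ascending=False))
print(coef_table)
# coef_table.to_csv("coefficients.csv", index=False)  # export for the report
```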

Outcomes

  • 86% accuracy baseline with clean coefficient reporting for clinicians.
  • ROC-AUC 0.91 demonstrating strong separation of positive cases.
  • Feature transparency thanks to per-biomarker importance exports.

Extend it

The repo includes preprocessing scripts, model configs, and evaluation templates so labs can swap in new biomarkers or multimodal signals. I'm happy to adapt the workflow to your research study or clinical trial.