Anish Dahiya

Case Study

Diamond Price Prediction

An open-source workflow that helps gem traders, jewelers, and hobbyists price diamonds with confidence. The system ingests raw datasets, engineers features, benchmarks multiple regressors, and wraps the winning model in a polished Flask app so anyone can estimate prices from a browser.

Role
Data Scientist
Timeline
May–Aug 2023
Stack highlights
Flask, scikit-learn, Docker
Impact
R² of 0.98 on holdout

Why it mattered

Appraisers often rely on spreadsheets or manual heuristics to price stones. This project shows how a transparent ML workflow can turn raw data into an interactive estimator that educates end users about the levers that influence price while keeping the experience approachable.

  • Guided ingestion that validates CSV schemas and catches missing attributes before training

  • Transformation service that scales numeric fields and one-hot encodes cut, color, and clarity

  • Model registry comparing Linear, Lasso, Ridge, and Decision Tree regressors

  • Flask UI with a responsive form and instant predictions

  • Background job that logs each prediction for drift analysis
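The drift-logging bullet above can be sketched as a small append-only CSV logger; the file name and field layout here are illustrative, not the repo's actual schema.

```python
import csv
import datetime
import pathlib

def log_prediction(features: dict, price: float, path: str = "predictions_log.csv") -> None:
    """Append one prediction row with a UTC timestamp; write a header on first use."""
    file = pathlib.Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["timestamp", *features.keys(), "price"])
        writer.writerow(
            [datetime.datetime.now(datetime.timezone.utc).isoformat(), *features.values(), price]
        )
```

A periodic job can then reload this file and compare recent input distributions against the training data.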

Architecture snapshot

Training pipelines run inside Docker so feature engineering and evaluation are identical locally and in CI. The Flask app consumes the packaged model via a lightweight service that performs validation, prediction, and response formatting in under 200 ms.
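The validate-predict-format path of that lightweight service might look like the sketch below; the field set matches the dataset section, but `DummyModel` and the response shape are assumptions standing in for the trained pipeline.

```python
import time

# Fields the form must supply, per the dataset section.
REQUIRED_FIELDS = {"carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"}

class DummyModel:
    """Stand-in for the packaged scikit-learn model (illustrative only)."""
    def predict(self, rows):
        return [3500.0 for _ in rows]  # placeholder price

def predict_price(payload: dict, model) -> dict:
    """Validate the payload, run the model, and format a response with latency."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return {"error": f"missing fields: {sorted(missing)}"}
    start = time.perf_counter()
    price = model.predict([payload])[0]
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"price": round(price, 2), "latency_ms": round(elapsed_ms, 2)}
```

Keeping validation, inference, and formatting in one function makes the sub-200 ms budget easy to measure end to end.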

  • Python
  • Pandas
  • scikit-learn
  • Flask
  • HTML/CSS
  • Docker

Dataset essentials

Users can inspect the exact attributes the model expects. Each field is documented in both the repo and the UI so collectors know why predictions move when they tweak inputs.

  • carat: Gemstone weight captured to two decimal places
  • cut: Categorical grade capturing light performance
  • color: Graded scale from D to J representing hue
  • clarity: Visibility of inclusions under 10x magnification
  • depth: Height from culet to table expressed as a percentage
  • table: Top facet width relative to the girdle
  • x: Length in millimeters
  • y: Width in millimeters
  • z: Depth in millimeters

Workflow and guardrails

The end-to-end flow mirrors production data science teams: ingest, engineer features, model, evaluate, deploy, monitor. Each layer surfaces artifacts so future contributors can extend the stack with gradient boosting or cloud hosting.

Data ingestion

Batch jobs pull raw Kaggle exports, validate schema, and persist clean parquet files so training stays reproducible.
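A minimal sketch of that ingestion guardrail, assuming pandas with a parquet engine available; the column list mirrors the dataset section, and the function name is illustrative.

```python
import pandas as pd

# Columns the raw export must contain before training proceeds.
EXPECTED = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z", "price"]

def validate_and_persist(df: pd.DataFrame, out_path=None) -> pd.DataFrame:
    """Fail fast on schema drift, drop incomplete rows, optionally persist parquet."""
    missing = [c for c in EXPECTED if c not in df.columns]
    if missing:
        raise ValueError(f"schema mismatch, missing columns: {missing}")
    clean = df.dropna(subset=EXPECTED).reset_index(drop=True)
    if out_path:
        clean.to_parquet(out_path)  # requires pyarrow or fastparquet
    return clean
```

Failing loudly on a missing column is what keeps the downstream steps reproducible: a bad export never reaches the training job.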

Feature engineering

Numeric fields are scaled with RobustScaler, categorical grades become one-hot vectors, and interaction terms capture non-linear relationships.
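The scaling and encoding steps can be sketched with a scikit-learn `ColumnTransformer`; the column groupings come from the dataset section, while the interaction terms mentioned above are omitted here for brevity.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

NUMERIC = ["carat", "depth", "table", "x", "y", "z"]
CATEGORICAL = ["cut", "color", "clarity"]

# RobustScaler tames carat/price outliers; unseen grades at serving
# time are ignored rather than crashing the request.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ]
)
```

Bundling this into a `Pipeline` with the regressor means the exact same transforms apply at training and serving time.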

Model comparison

Grid search tunes Linear, Lasso, Ridge, and Decision Tree regressors. Metrics, coefficients, and feature importances are logged for review.
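The comparison loop might look like the sketch below; the four estimators match the registry above, but the hyperparameter grids are illustrative rather than the repo's tuned values.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate regressors with small, illustrative grids.
CANDIDATES = {
    "linear": (LinearRegression(), {}),
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=42), {"max_depth": [4, 8, 16]}),
}

def compare_models(X, y, cv=3) -> dict:
    """Cross-validate each candidate and record its best R² and params."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="r2")
        search.fit(X, y)
        results[name] = {"score": search.best_score_, "params": search.best_params_}
    return results
```

Logging the full `results` dict per run is what makes the champion selection auditable later.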

Serving layer

The champion model is serialized to pickle, loaded inside a Flask Blueprint, and fronted by a user-friendly HTML form with validation.

Outcomes

  • R² of 0.98 / RMSE of 492 on a held-out validation split.
  • Two-minute onboarding since the web form explains every field inline.
  • Deployment ready with Dockerfile, requirements lock, and CI hooks.

Extend it

Want to swap in gradient boosting, add SHAP explanations, or host on Render? The repo includes templates and docs to accelerate your fork. Happy to pair on custom pricing tools for retail or wholesale operations.