Case Study
Diamond Price Prediction
An open-source workflow that helps gem traders, jewelers, and hobbyists price diamonds with confidence. The system ingests raw datasets, engineers features, benchmarks multiple regressors, and wraps the winning model in a polished Flask app so anyone can estimate prices from a browser.
- Role: Data Scientist
- Timeline: May–Aug 2023
- Stack highlights: Flask, scikit-learn, Docker
- Impact: R² of 0.98 on holdout
Why it mattered
Appraisers often rely on spreadsheets or manual heuristics to price stones. This project shows how a transparent ML workflow can turn raw data into an interactive estimator that educates end users about the levers that influence price while keeping the experience approachable.
- Guided ingestion that validates CSV schemas and catches missing attributes before training
- Transformation service that scales numeric fields and one-hot encodes cut, color, and clarity
- Model registry comparing Linear, Lasso, Ridge, and Decision Tree regressors
- Flask UI with a responsive form and instant predictions
- Background job that logs each prediction for drift analysis (sketched below)
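The drift log is the one piece not elaborated in the workflow section below, so here is a minimal sketch of what an append-only prediction log could look like. The `logs/predictions.csv` path and `log_prediction` helper are assumptions for illustration, not the repo's actual code; in the app this would be handed off to a worker thread so it never blocks a request.

```python
import csv
import threading
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("logs/predictions.csv")  # assumed location of the drift log
_lock = threading.Lock()

def log_prediction(features: dict, predicted_price: float) -> None:
    """Append one record per prediction so inputs and outputs can be
    compared against the training distribution later."""
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    row = {"ts": datetime.now(timezone.utc).isoformat(),
           **features, "predicted_price": predicted_price}
    with _lock:
        write_header = not LOG_PATH.exists()
        with LOG_PATH.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()
            writer.writerow(row)
```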
Architecture snapshot
Training pipelines run inside Docker so feature engineering and evaluation are identical locally and in CI. The Flask app consumes the packaged model via a lightweight service that performs validation, prediction, and response formatting in under 200 ms.
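The repo's exact service code isn't reproduced here, but a minimal sketch of that validate/predict/format boundary might look like the following; the `model/model.pkl` path and `PredictionService` name are illustrative assumptions.

```python
import pickle
import time

import pandas as pd

REQUIRED = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"]

class PredictionService:
    """Validates a raw payload, runs the model, and formats the response."""

    def __init__(self, model_path: str = "model/model.pkl"):  # assumed path
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)  # full preprocessing + regressor pipeline

    def predict(self, payload: dict) -> dict:
        start = time.perf_counter()
        missing = [field for field in REQUIRED if field not in payload]
        if missing:
            return {"error": f"missing fields: {missing}"}
        frame = pd.DataFrame([payload], columns=REQUIRED)
        price = float(self.model.predict(frame)[0])
        return {"price": round(price, 2),
                "latency_ms": round((time.perf_counter() - start) * 1000, 1)}
```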
- Python
- Pandas
- scikit-learn
- Flask
- HTML/CSS
- Docker
Dataset essentials
Users can inspect the exact attributes the model expects. Each field is documented in both the repo and the UI so collectors know why predictions move when they tweak inputs.
- carat: Gemstone weight captured to two decimal places
- cut: Categorical grade capturing light performance
- color: Graded scale from D (colorless) to J representing hue
- clarity: Visibility of inclusions under 10× magnification
- depth: Height from culet to table, expressed as a percentage of the average girdle diameter
- table: Top facet width relative to the girdle
- x: Length in millimeters
- y: Width in millimeters
- z: Depth in millimeters
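For readers who prefer code to prose, here are the same nine inputs as a typed record. The field names and meanings come from the list above; the dataclass itself is illustrative, and the example grades are the usual values in the public diamonds dataset.

```python
from dataclasses import dataclass

@dataclass
class DiamondFeatures:
    """The nine attributes the model expects, mirroring the list above."""
    carat: float   # weight, to two decimal places
    cut: str       # light-performance grade, e.g. "Fair" through "Ideal"
    color: str     # hue grade from "D" (colorless) to "J"
    clarity: str   # inclusion grade, e.g. "I1" through "IF"
    depth: float   # culet-to-table height as a percentage
    table: float   # top facet width relative to the girdle
    x: float       # length in millimeters
    y: float       # width in millimeters
    z: float       # depth in millimeters
```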
Workflow and guardrails
The end-to-end flow mirrors how production data science teams operate: ingestion, feature engineering, modeling, evaluation, deployment, and monitoring. Each layer surfaces artifacts so future contributors can extend the stack with gradient boosting or cloud hosting.
Data ingestion
Batch jobs pull raw Kaggle exports, validate the schema, and persist clean Parquet files so training stays reproducible.
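As a sketch, assuming pandas with a Parquet engine (pyarrow) installed and illustrative paths, the ingestion step could be as small as:

```python
from pathlib import Path

import pandas as pd

EXPECTED = {"carat", "cut", "color", "clarity", "depth",
            "table", "x", "y", "z", "price"}

def ingest(raw_csv: str, out_dir: str = "data/clean") -> Path:
    """Validate a raw export's schema, drop incomplete rows, persist Parquet."""
    df = pd.read_csv(raw_csv)
    missing = EXPECTED - set(df.columns)
    if missing:
        raise ValueError(f"raw export is missing columns: {sorted(missing)}")
    df = df.dropna(subset=sorted(EXPECTED))  # catch missing attributes early
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / "diamonds.parquet"
    df.to_parquet(target, index=False)  # needs pyarrow or fastparquet
    return target
```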
Feature engineering
Numeric fields are scaled with RobustScaler, categorical grades become one-hot vectors, and interaction terms capture non-linear relationships.
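The repo's exact wiring isn't reproduced here, but in scikit-learn that combination could look like the sketch below; the interaction-only `PolynomialFeatures` step is one plausible way to realize the interaction terms.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

NUMERIC = ["carat", "depth", "table", "x", "y", "z"]
CATEGORICAL = ["cut", "color", "clarity"]

# Scale numerics robustly, add pairwise interaction terms on the scaled
# values, and one-hot encode the three categorical grades.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("scale", RobustScaler()),
        ("interact", PolynomialFeatures(degree=2, interaction_only=True,
                                        include_bias=False)),
    ]), NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])
```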
Model comparison
Grid search tunes Linear, Lasso, Ridge, and Decision Tree regressors. Metrics, coefficients, and feature importances are logged for review.
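A condensed sketch of that comparison loop, reusing the `preprocessor` from the previous step; the grids shown are illustrative, not the repo's exact search space.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

CANDIDATES = {
    "linear": (LinearRegression(), {}),  # no hyperparameters to search
    "lasso": (Lasso(max_iter=10_000), {"model__alpha": [0.01, 0.1, 1.0]}),
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=42),
             {"model__max_depth": [4, 8, 12]}),
}

def compare_models(X, y, preprocessor):
    """Grid-search each candidate and return its best CV score and params."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        pipeline = Pipeline([("prep", preprocessor), ("model", estimator)])
        search = GridSearchCV(pipeline, grid, cv=5, scoring="r2")
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results
```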
Serving layer
The champion model is serialized to pickle, loaded inside a Flask Blueprint, and fronted by a user-friendly HTML form with validation.
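A minimal sketch of that Blueprint, with assumed template and model paths (`form.html`, `model/model.pkl`):

```python
import pickle

import pandas as pd
from flask import Blueprint, render_template, request

bp = Blueprint("estimator", __name__)
NUMERIC = ["carat", "depth", "table", "x", "y", "z"]
FIELDS = ["cut", "color", "clarity"] + NUMERIC

with open("model/model.pkl", "rb") as f:  # champion pipeline from training
    MODEL = pickle.load(f)

@bp.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "GET":
        return render_template("form.html")
    values = {field: request.form.get(field, "") for field in FIELDS}
    if any(v == "" for v in values.values()):
        return render_template("form.html", error="All fields are required.")
    row = pd.DataFrame([values])
    row[NUMERIC] = row[NUMERIC].astype(float)  # cast the numeric form inputs
    price = float(MODEL.predict(row)[0])
    return render_template("form.html", prediction=round(price, 2))
```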
Outcomes
- R² 0.98 / RMSE 492 on a held-out validation split.
- Two-minute onboarding since the web form explains every field inline.
- Deployment ready with Dockerfile, requirements lock, and CI hooks.
Extend it
Want to swap in gradient boosting, add SHAP explanations, or host on Render? The repo includes templates and docs to accelerate your fork. Happy to pair on custom pricing tools for retail or wholesale operations.