Case Study
Diamond Price Prediction
An open-source workflow that helps gem traders, jewelers, and hobbyists price diamonds with confidence. The system ingests raw datasets, engineers features, benchmarks multiple regressors, and wraps the winning model in a polished Flask app so anyone can estimate prices from a browser.
- Role: Data Scientist
- Timeline: May–Aug 2023
- Stack highlights: Flask, scikit-learn, Docker
- Impact: R² of 0.98 on holdout
Why it mattered
Appraisers often rely on spreadsheets or manual heuristics to price stones. This project shows how a transparent ML workflow can turn raw data into an interactive estimator that educates end users about the levers that influence price while keeping the experience approachable.
- Guided ingestion that validates CSV schemas and catches missing attributes before training
- Transformation service that scales numeric fields and one-hot encodes cut, color, and clarity
- Model registry comparing Linear, Lasso, Ridge, and Decision Tree regressors
- Flask UI with a responsive form and instant predictions
- Background job that logs each prediction for drift analysis (sketched below)
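The drift log is the one piece not elaborated in the workflow section below, so here is a minimal sketch of what an append-only prediction log could look like. The `logs/predictions.csv` path and `log_prediction` helper are assumptions for illustration, not the repo's actual code; in the app this would be handed off to a worker thread so it never blocks a request.

```python
import csv
import threading
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("logs/predictions.csv")  # assumed location of the drift log
_lock = threading.Lock()

def log_prediction(features: dict, predicted_price: float) -> None:
    """Append one record per prediction so inputs and outputs can be
    compared against the training distribution later."""
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    row = {"ts": datetime.now(timezone.utc).isoformat(),
           **features, "predicted_price": predicted_price}
    with _lock:
        write_header = not LOG_PATH.exists()
        with LOG_PATH.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()
            writer.writerow(row)
```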
Architecture snapshot
Training pipelines run inside Docker so feature engineering and evaluation are identical locally and in CI. The Flask app consumes the packaged model via a lightweight service that performs validation, prediction, and response formatting in under 200 ms.
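The repo's exact service code isn't reproduced here, but a minimal sketch of that validate/predict/format boundary might look like the following; the `model/model.pkl` path and `PredictionService` name are illustrative assumptions.

```python
import pickle
import time

import pandas as pd

REQUIRED = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"]

class PredictionService:
    """Validates a raw payload, runs the model, and formats the response."""

    def __init__(self, model_path: str = "model/model.pkl"):  # assumed path
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)  # full preprocessing + regressor pipeline

    def predict(self, payload: dict) -> dict:
        start = time.perf_counter()
        missing = [field for field in REQUIRED if field not in payload]
        if missing:
            return {"error": f"missing fields: {missing}"}
        frame = pd.DataFrame([payload], columns=REQUIRED)
        price = float(self.model.predict(frame)[0])
        return {"price": round(price, 2),
                "latency_ms": round((time.perf_counter() - start) * 1000, 1)}
```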
- Python
- Pandas
- scikit-learn
- Flask
- HTML/CSS
- Docker
Dataset essentials
Users can inspect the exact attributes the model expects. Each field is documented in both the repo and the UI so collectors know why predictions move when they tweak inputs.
- carat: Gemstone weight captured to two decimal places
- cut: Categorical grade capturing light performance
- color: Graded scale from D (colorless) to J representing hue
- clarity: Visibility of inclusions under 10× magnification
- depth: Height from culet to table, expressed as a percentage of the average girdle diameter
- table: Top facet width relative to the girdle
- x: Length in millimeters
- y: Width in millimeters
- z: Depth in millimeters
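For readers who prefer code to prose, here are the same nine inputs as a typed record. The field names and meanings come from the list above; the dataclass itself is illustrative, and the example grades are the usual values in the public diamonds dataset.

```python
from dataclasses import dataclass

@dataclass
class DiamondFeatures:
    """The nine attributes the model expects, mirroring the list above."""
    carat: float   # weight, to two decimal places
    cut: str       # light-performance grade, e.g. "Fair" through "Ideal"
    color: str     # hue grade from "D" (colorless) to "J"
    clarity: str   # inclusion grade, e.g. "I1" through "IF"
    depth: float   # culet-to-table height as a percentage
    table: float   # top facet width relative to the girdle
    x: float       # length in millimeters
    y: float       # width in millimeters
    z: float       # depth in millimeters
```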
Workflow and guardrails
The end-to-end flow mirrors how production data science teams operate: ingestion, feature engineering, modeling, evaluation, deployment, and monitoring. Each layer surfaces artifacts so future contributors can extend the stack with gradient boosting or cloud hosting.
Data ingestion
Batch jobs pull raw Kaggle exports, validate the schema, and persist clean Parquet files so training stays reproducible.
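As a sketch, assuming pandas with a Parquet engine (pyarrow) installed and illustrative paths, the ingestion step could be as small as:

```python
from pathlib import Path

import pandas as pd

EXPECTED = {"carat", "cut", "color", "clarity", "depth",
            "table", "x", "y", "z", "price"}

def ingest(raw_csv: str, out_dir: str = "data/clean") -> Path:
    """Validate a raw export's schema, drop incomplete rows, persist Parquet."""
    df = pd.read_csv(raw_csv)
    missing = EXPECTED - set(df.columns)
    if missing:
        raise ValueError(f"raw export is missing columns: {sorted(missing)}")
    df = df.dropna(subset=sorted(EXPECTED))  # catch missing attributes early
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / "diamonds.parquet"
    df.to_parquet(target, index=False)  # needs pyarrow or fastparquet
    return target
```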
Feature engineering
Numeric fields are scaled with RobustScaler, categorical grades become one-hot vectors, and interaction terms capture non-linear relationships.
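The repo's exact wiring isn't reproduced here, but in scikit-learn that combination could look like the sketch below; the interaction-only `PolynomialFeatures` step is one plausible way to realize the interaction terms.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

NUMERIC = ["carat", "depth", "table", "x", "y", "z"]
CATEGORICAL = ["cut", "color", "clarity"]

# Scale numerics robustly, add pairwise interaction terms on the scaled
# values, and one-hot encode the three categorical grades.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("scale", RobustScaler()),
        ("interact", PolynomialFeatures(degree=2, interaction_only=True,
                                        include_bias=False)),
    ]), NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])
```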
Model comparison
Grid search tunes Linear, Lasso, Ridge, and Decision Tree regressors. Metrics, coefficients, and feature importances are logged for review.
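A condensed sketch of that comparison loop, reusing the `preprocessor` from the previous step; the grids shown are illustrative, not the repo's exact search space.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

CANDIDATES = {
    "linear": (LinearRegression(), {}),  # no hyperparameters to search
    "lasso": (Lasso(max_iter=10_000), {"model__alpha": [0.01, 0.1, 1.0]}),
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=42),
             {"model__max_depth": [4, 8, 12]}),
}

def compare_models(X, y, preprocessor):
    """Grid-search each candidate and return its best CV score and params."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        pipeline = Pipeline([("prep", preprocessor), ("model", estimator)])
        search = GridSearchCV(pipeline, grid, cv=5, scoring="r2")
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results
```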
Serving layer
The champion model is serialized to pickle, loaded inside a Flask Blueprint, and fronted by a user-friendly HTML form with validation.
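A minimal sketch of that Blueprint, with assumed template and model paths (`form.html`, `model/model.pkl`):

```python
import pickle

import pandas as pd
from flask import Blueprint, render_template, request

bp = Blueprint("estimator", __name__)
NUMERIC = ["carat", "depth", "table", "x", "y", "z"]
FIELDS = ["cut", "color", "clarity"] + NUMERIC

with open("model/model.pkl", "rb") as f:  # champion pipeline from training
    MODEL = pickle.load(f)

@bp.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "GET":
        return render_template("form.html")
    values = {field: request.form.get(field, "") for field in FIELDS}
    if any(v == "" for v in values.values()):
        return render_template("form.html", error="All fields are required.")
    row = pd.DataFrame([values])
    row[NUMERIC] = row[NUMERIC].astype(float)  # cast the numeric form inputs
    price = float(MODEL.predict(row)[0])
    return render_template("form.html", prediction=round(price, 2))
```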
Outcomes
- R² 0.98 / RMSE 492 on a held-out validation split.
- Two-minute onboarding since the web form explains every field inline.
- Deployment ready with Dockerfile, requirements lock, and CI hooks.
Extend it
Want to swap in gradient boosting, add SHAP explanations, or host on Render? The repo includes templates and docs to accelerate your fork. Happy to pair on custom pricing tools for retail or wholesale operations.