Doordash — SangHyun Kim Personal Site

I analyzed a historical subset of DoorDash deliveries to understand what drives delivery delays and built machine learning models to predict delivery duration. The goal was to move beyond just modeling and identify when ETAs become unreliable.

What I did

Created the target variable (delivery_duration) and handled noise/outliers
Captured lunch/dinner demand patterns with time-of-day features derived from local (US/Pacific) timestamps
Built marketplace features measuring supply vs. demand (orders per dasher, busy ratio) and interaction features (congestion during dinner rush)
Trained and compared multiple models: baseline, linear regression, random forest, XGBoost, and LightGBM
Performed model error analysis to understand failure modes
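
As a rough illustration of the first and third steps above, here is a minimal sketch of how the target and the supply vs. demand features could be built. The column names (created_at, actual_delivery_time, total_onshift_dashers, total_busy_dashers, total_outstanding_orders) and the 99th-percentile trim are assumptions made for this example, not necessarily what the full analysis uses.

```python
import pandas as pd


def add_target_and_marketplace_features(
    df: pd.DataFrame, upper_quantile: float = 0.99
) -> pd.DataFrame:
    """Add delivery_duration (seconds), orders_per_dasher, and busy_ratio.

    All input column names are assumed for illustration.
    """
    out = df.copy()

    # Target: seconds between order creation and actual delivery.
    created = pd.to_datetime(out["created_at"], utc=True)
    delivered = pd.to_datetime(out["actual_delivery_time"], utc=True)
    out["delivery_duration"] = (delivered - created).dt.total_seconds()

    # Drop rows with missing or non-positive durations, then trim the
    # extreme upper tail (the quantile threshold is an assumed choice).
    out = out[out["delivery_duration"] > 0]
    cap = out["delivery_duration"].quantile(upper_quantile)
    out = out[out["delivery_duration"] <= cap].copy()

    # Supply vs. demand ratios; zero on-shift dashers becomes NaN
    # instead of producing a division by zero.
    dashers = out["total_onshift_dashers"].where(out["total_onshift_dashers"] > 0)
    out["orders_per_dasher"] = out["total_outstanding_orders"] / dashers
    out["busy_ratio"] = out["total_busy_dashers"] / dashers
    return out
```

Leaving the ratios as NaN when no dashers are on shift keeps the decision about imputation separate from feature construction; tree-based libraries such as LightGBM and XGBoost can handle missing values directly.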

Early EDA

Delivery times: Most deliveries take 30–60 minutes, with a long tail of extreme delays caused by injected noise.
Time patterns: Converting timestamps from UTC to US/Pacific revealed clear lunch and dinner demand periods.
Marketplace conditions: Supply–demand variables (dashers vs. outstanding orders) were used to engineer an orders-per-dasher feature, which ended up being a key input to the models.
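
The timezone conversion and rush-hour features described above might look like the sketch below. The timestamp column name, the hour windows for lunch (11:00 to 13:59) and dinner (17:00 to 20:59), and the dinner_congestion feature name are assumptions for illustration only.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, ts_col: str = "created_at") -> pd.DataFrame:
    """Add local-hour and rush-period flags from a UTC timestamp column."""
    out = df.copy()

    # Parse as UTC, then convert to US/Pacific so hours reflect local time.
    local = pd.to_datetime(out[ts_col], utc=True).dt.tz_convert("US/Pacific")
    out["local_hour"] = local.dt.hour

    # Rush-period flags; between() is inclusive on both ends.
    # The hour windows are assumed, not taken from the analysis.
    out["is_lunch"] = out["local_hour"].between(11, 13).astype(int)
    out["is_dinner"] = out["local_hour"].between(17, 20).astype(int)

    # Interaction feature: congestion that only counts during the dinner rush.
    # Added only if an orders_per_dasher column is already present.
    if "orders_per_dasher" in out.columns:
        out["dinner_congestion"] = out["orders_per_dasher"] * out["is_dinner"]
    return out
```

Without the conversion, a 7 p.m. Pacific dinner order would appear at hour 3 (or hour 2 during daylight saving time) in UTC, which hides the lunch and dinner peaks.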

Model Performance (RMSE, seconds)

Baseline: 1014.86
Linear Regression: 854.99
Random Forest: 834.50
XGBoost: 813.97
LightGBM: 813.79 (best; about 13.6 minutes)
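
For reference, the sketch below shows how RMSE is computed and how the scores listed above translate into minutes and into relative improvement over the baseline. The rmse helper is a generic implementation written for this example, not code from the analysis; the scores are copied from the list above.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target (seconds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# RMSE values (seconds) as reported in the list above.
reported = {
    "Baseline": 1014.86,
    "Linear Regression": 854.99,
    "Random Forest": 834.50,
    "XGBoost": 813.97,
    "LightGBM": 813.79,
}

baseline = reported["Baseline"]
for name, score in reported.items():
    minutes = score / 60
    pct_better = 100 * (baseline - score) / baseline
    print(f"{name}: {score:.2f} s = {minutes:.1f} min, {pct_better:.1f}% below baseline")
```

Read this way, the best model lowers RMSE by roughly a fifth relative to the baseline, while the gap between XGBoost and LightGBM is under a second.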

Error Analysis Results

Prediction error increases with congestion, suggesting that ETAs become harder to predict under heavy load.

Error varies by hour: early morning shows high error (likely due to low sample size), and dinner hours show moderately higher error than midday.
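
One way to produce this kind of breakdown is to compute RMSE per group alongside the group's sample size, so that high-error groups with few rows are easy to spot. This is a minimal sketch; the column names (delivery_duration, predicted_duration, local_hour, orders_per_dasher) are assumed for illustration.

```python
import numpy as np
import pandas as pd


def rmse_by_group(
    df: pd.DataFrame,
    group_col: str,
    y_true: str = "delivery_duration",
    y_pred: str = "predicted_duration",
) -> pd.DataFrame:
    """Return per-group RMSE and row count for a frame of predictions."""
    sq_err = (df[y_true] - df[y_pred]) ** 2
    grouped = sq_err.groupby(df[group_col], observed=True)
    # RMSE is the square root of the mean squared error within each group.
    return pd.DataFrame({"rmse": np.sqrt(grouped.mean()), "n": grouped.size()})
```

For example, rmse_by_group(preds, "local_hour") gives error by hour, and bucketing congestion first with pd.qcut(preds["orders_per_dasher"], q=4, duplicates="drop") and grouping on the resulting column gives error by congestion quartile. Here preds is a hypothetical DataFrame of held-out rows with predictions attached.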

Tech Stack

Python, pandas, numpy, scikit-learn, LightGBM/XGBoost, matplotlib/seaborn

View full analysis -> GitHub

sanghyundkim@outlook.com