Overview
In this project, my team and I built machine learning models to predict a batter’s MLB performance using their Triple-A performance. The goal was to determine which offensive skills best transfer to Major League Baseball and identify prospects who may be undervalued by traditional scouting methods.
Question: Can we use Triple-A statistics and Statcast data to predict future MLB batting performance?
The target variable was wRC+ (Weighted Runs Created Plus). It adjusts for park effects and league environment, allowing hitters to be compared on a standardized scale. A wRC+ of 100 is league average, while a wRC+ of 120 indicates a hitter who creates 20% more runs than the league average.
Data Collection
We combined data from:
- FanGraphs Triple-A batting statistics
- Baseball Savant Statcast metrics
- MLB offensive outcomes
The final datasets consisted of:
- 502 Triple-A players who later had >= 100 MLB plate appearances
- 902 Triple-A players with < 100 MLB plate appearances for predictions
- Data spanning the 2021-2025 seasons
Features included:
Traditional Statistics:
- Batting Average
- OBP
- ISO
- Home Run Rate
- Walk Rate
- Strikeout Rate
Statcast Metrics:
- Exit Velocity
- Hard Hit %
- Barrel Rate
- Launch Angle
- Expected Statistics (xBA, xwOBA)
Player Context:
- Age
- Playing Time (Games, Plate Appearances)
- Consecutive Triple-A Seasons
Some features were added to standardize for playing time. For example:
- Strikeout Rate = Strikeouts / Plate Appearances
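As a rough illustration of this kind of per-plate-appearance feature, here is a minimal sketch in plain Python. The dictionary keys (`pa`, `so`, `bb`, `hr`) and the `hr_rate` name are assumptions made for the example, not the project's actual column names, and the stat line is invented.

```python
def add_rate_features(row):
    """Return a copy of a player's counting stats with per-PA rates added."""
    pa = row["pa"]
    out = dict(row)
    # Guard against division by zero for players with no plate appearances.
    out["k_rate"] = row["so"] / pa if pa else None
    out["bb_rate"] = row["bb"] / pa if pa else None
    out["hr_rate"] = row["hr"] / pa if pa else None
    return out

# Hypothetical Triple-A season line (made-up numbers for illustration).
player = {"name": "Example Hitter", "pa": 400, "so": 100, "bb": 40, "hr": 16}
rates = add_rate_features(player)
print(rates["k_rate"], rates["bb_rate"], rates["hr_rate"])
```

Dividing by plate appearances puts a part-time player and an everyday player on the same scale, which is the point of the standardization described above.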
We also:
- Removed unstable samples (<100 MLB PA)
- Imputed missing values using median imputation
- Standardized features where appropriate
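The three preprocessing steps can be sketched with the standard library alone. This is only an illustration under assumptions: the rows are invented, missing values are represented as `None`, and the helper names are hypothetical. In practice a library such as scikit-learn (`SimpleImputer(strategy="median")` and `StandardScaler`) would likely do the same job.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical rows: (MLB plate appearances, Triple-A exit velocity in mph).
rows = [(350, 90.0), (120, None), (40, 91.5), (500, 88.0), (100, 86.0)]

# Step 1: keep only players with at least 100 MLB PA for training.
train = [r for r in rows if r[0] >= 100]

# Step 2: fill the missing exit velocity with the median of the rest.
exit_velo = impute_median([ev for _, ev in train])

# Step 3: convert to z-scores so features share a common scale.
exit_velo_z = standardize(exit_velo)
print(exit_velo, exit_velo_z)
```

When a held-out test set is involved, the median, mean, and standard deviation should be computed on the training rows only and then reused on the test rows, so that no information leaks from the test set.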
Models
We compared five models:
- Linear Regression
- Lasso Regression
- Random Forest
- Tuned Random Forest
- XGBoost
We also built an ensemble model that averaged predictions from LASSO, Random Forest, and XGBoost.
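A minimal sketch of prediction averaging is shown here, assuming equal weights for the three models (the write-up says only that predictions were averaged). The function name and the predicted values are hypothetical.

```python
def ensemble_average(*model_predictions):
    """Average predictions element-wise across models (equal weights)."""
    n_models = len(model_predictions)
    return [sum(preds) / n_models for preds in zip(*model_predictions)]

# Hypothetical predicted wRC+ for three players from each model.
lasso_pred = [95.0, 110.0, 102.0]
rf_pred = [99.0, 104.0, 100.0]
xgb_pred = [100.0, 107.0, 98.0]

ensemble_pred = ensemble_average(lasso_pred, rf_pred, xgb_pred)
print(ensemble_pred)
```

Averaging tends to help when the individual models make different kinds of errors, since a linear model and tree-based models rarely miss in the same direction on every player.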
Models were evaluated with an 80/20 train-test split using:
- RMSE
- MAE
- R²
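To make the evaluation setup concrete, here is a small standard-library sketch of an 80/20 split and the three metrics. The seed, the helper names, and the example predictions are assumptions for illustration; only the 502-player count comes from the dataset described above.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: the average miss, in wRC+ points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of variance in the target explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# 502 players with >= 100 MLB PA, split 80/20.
train_idx, test_idx = split_indices(502)
print(len(train_idx), len(test_idx))

# Hypothetical actual vs. predicted wRC+ for four held-out players.
y_true = [100, 120, 80, 110]
y_pred = [105, 115, 90, 110]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE and MAE are both in wRC+ points, so they read directly as "how far off is a typical projection", while R² is unitless and easier to compare across targets.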
Results
The ensemble model achieved the strongest overall performance, with the lowest RMSE, lowest MAE, and highest R². Its predictive power was still modest, which made us realize that MLB success depends on far more than minor league statistics alone. Other factors, such as injuries and the coaching a player receives before reaching the majors, also matter.
What Traits Predict MLB Success?
One of the most interesting findings from this project was that different models emphasize different skills.
- Lasso Regression
LASSO highlighted:
- Walk rate (bb_rate)
- Singles rate
- Triples rate
- Differences between expected and actual offensive production
This suggests that plate discipline is a valuable skill at the Major League level.
- Random Forest & XGBoost
On the other hand, tree-based models placed greater emphasis on:
- Strikeout rate (k_rate)
- Exit velocity
- Barrel rate
- Hard-hit %
- Launch speed
These models favored players with strong power and good contact quality.
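One way to compare what a linear model and a tree-based model emphasize is to look at Lasso coefficients next to random forest feature importances. The sketch below assumes scikit-learn and NumPy (the write-up does not name its libraries) and uses purely synthetic data, so the numbers it prints say nothing about real players. The `alpha` value, the tree count, and the `barrel_rate` and `noise` column names are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["bb_rate", "k_rate", "barrel_rate", "noise"]

# Synthetic features and target, for illustration only (not real data).
X = rng.normal(size=(300, 4))
y = 100 + 5 * X[:, 0] - 4 * X[:, 1] + 6 * X[:, 2] + rng.normal(scale=2.0, size=300)

# Lasso is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.5).fit(X_scaled, y)

# Tree-based models split on thresholds and do not need scaling.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

for name, coef, imp in zip(feature_names, lasso.coef_, forest.feature_importances_):
    print(f"{name:12s} lasso_coef={coef:+.2f}  rf_importance={imp:.3f}")
```

The two views are not directly comparable: Lasso coefficients are signed and can be shrunk to exactly zero, while forest importances are non-negative shares that sum to one. That difference is one reason the models can appear to favor different skills.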
Prospect Archetype Experiment
The models consistently predicted higher wRC+ values for power hitters than for speed-oriented or pure contact hitters. The key takeaway: power metrics such as barrel rate, exit velocity, and home run rate were stronger predictors than the speed and batting-average traits that define the other archetypes.
Note: Unsurprisingly, elite all-around players had the highest wRC+.
Top Projected Prospects
Using the ensemble model, we ranked Triple-A hitters with fewer than 100 MLB plate appearances as of May 22, 2026.
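The ranking step itself is simple and might look something like the sketch below. The player names, field names (`mlb_pa`, `pred_wrc_plus`), and projected values are all hypothetical placeholders, not results from the project.

```python
def rank_prospects(players, max_mlb_pa=100):
    """Keep players under the MLB PA cutoff, sorted by predicted wRC+ (best first)."""
    eligible = [p for p in players if p["mlb_pa"] < max_mlb_pa]
    return sorted(eligible, key=lambda p: p["pred_wrc_plus"], reverse=True)

# Hypothetical ensemble outputs (placeholder names and numbers).
players = [
    {"name": "Player A", "mlb_pa": 0, "pred_wrc_plus": 104.2},
    {"name": "Player B", "mlb_pa": 250, "pred_wrc_plus": 118.0},
    {"name": "Player C", "mlb_pa": 60, "pred_wrc_plus": 111.5},
]

ranked = rank_prospects(players)
for position, p in enumerate(ranked, start=1):
    print(position, p["name"], p["pred_wrc_plus"])
```

Player B is excluded despite the highest projection because the list is limited to hitters who have not yet reached the 100 PA threshold.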
Credit: STATS 141XP Teammates - Andrew Bush, Fabiola Campuzano, Ashley Chan, and Stewart Fang
Data Sources: Triple-A and MLB statistics were obtained from FanGraphs, while Statcast batted-ball and quality-of-contact metrics were collected from Baseball Savant.