Send/Hold Grader — Methodology

Detailed data sources, modeling choices, approximations, and known limitations for the third-base coach send/hold decision module.

What we measure

For every ball hit to the outfield with a runner on second base, the third-base coach makes a binary call: send the runner to attempt to score, or hold him at third. We grade that call against the RE24 break-even probability for the situation.

This is distinct from player ability. A talented runner sent on a 50/50 opportunity produces the same coach grade regardless of whether he happens to be safe or out.

Scope

Situation: Runner on second base, ball hit to the outfield (single or double), opportunity to attempt to score.
Years: 2020–2026 MLB regular season. 2026 is in progress; entries are flagged Live in the leaderboard.
Data source: MLB Statcast event-level data via pybaseball.
Out of scope for this module: stolen bases, IBBs, other in-game decisions.

Opportunity identification

We filter Statcast play-by-play data for events where a runner was on second base (on_2b field) and the batter put the ball in play to the outfield (hit_location 7/8/9, events = single or double). We then parse the play description text (des field) to classify the runner's outcome: SCORED, OUT_AT_HOME, HELD_AT_BASE, or HELD_OR_UNKNOWN.

To avoid false positives in multi-runner situations, outcomes are attributed to the specific runner on second using Unicode-normalized name matching against the description text. Plays classified as HELD_OR_UNKNOWN (~2.5% of opportunities) are excluded from grading.
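A minimal sketch of the filter and the name-matched outcome parse, using plain dictionaries in place of Statcast rows. The column names (on_2b, events, hit_location, des) are the Statcast fields named above. The three description phrases ("scores", "out at home", "to 3rd") are assumptions chosen for illustration, not the module's actual parsing rules, and the helper names are hypothetical.

```python
import unicodedata

def normalize_name(text: str) -> str:
    """Strip accents and lowercase, so 'José Ramírez' matches 'Jose Ramirez'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def is_opportunity(row: dict) -> bool:
    """Runner on second, and a single or double fielded by LF/CF/RF (7/8/9).

    In a pandas DataFrame an empty on_2b would be NaN rather than None.
    """
    return (
        row.get("on_2b") is not None
        and row.get("events") in {"single", "double"}
        and row.get("hit_location") in {7, 8, 9}
    )

def classify_runner(des: str, runner_name: str) -> str:
    """Attribute the outcome to the named runner only (ASSUMED phrase patterns)."""
    text = normalize_name(des)
    name = normalize_name(runner_name)
    if f"{name} scores" in text:
        return "SCORED"
    if f"{name} out at home" in text:
        return "OUT_AT_HOME"
    if f"{name} to 3rd" in text:
        return "HELD_AT_BASE"
    return "HELD_OR_UNKNOWN"
```

Matching on the runner's own name is what keeps a description such as "Runner A scores. Runner B to 3rd." from crediting Runner B with Runner A's outcome.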

P(safe) — probability of scoring if sent

We use an empirical bin approach: divide sent plays into 18 bins (2 event types × 3 field positions × 3 out states) and compute the fraction of sent runners who scored within each bin. This avoids the selection bias inherent in training a predictive model on sent plays only — coaches send more often when they expect success, so a model trained on sent plays would overstate P(safe) for holds.

All 18 bins had empirical P(safe) ≥ 0.947 (range: 0.947–1.000). The minimum gap between empirical P(safe) and the maximum break-even probability across all bins was 0.033. This means sending was the statistically correct call in every bin — no bin produced a situation where holding was the right expected-value choice.
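The bin computation can be sketched as a grouped success rate over sent plays. This is an illustration only: the "outcome" key is a hypothetical field holding the classification from the opportunity step, and outs_when_up is assumed to be the Statcast column used for the out state.

```python
from collections import defaultdict

def empirical_p_safe(sent_plays):
    """Map (events, hit_location, outs_when_up) -> fraction of sent runners who scored."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [scored, total sent]
    for play in sent_plays:
        key = (play["events"], play["hit_location"], play["outs_when_up"])
        counts[key][1] += 1
        if play["outcome"] == "SCORED":
            counts[key][0] += 1
    return {key: scored / total for key, (scored, total) in counts.items()}

# Toy data, not real plays: three sends in one bin, two of which scored.
toy = [
    {"events": "single", "hit_location": 7, "outs_when_up": 2, "outcome": "SCORED"},
    {"events": "single", "hit_location": 7, "outs_when_up": 2, "outcome": "SCORED"},
    {"events": "single", "hit_location": 7, "outs_when_up": 2, "outcome": "OUT_AT_HOME"},
]
bins = empirical_p_safe(toy)
```

Held plays never enter the denominator; each one simply inherits the rate of its bin when it is graded.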

A logistic regression model (AUC = 0.78, features: throw distance, runner sprint speed, outfielder arm strength, hit type, field position) is retained as a secondary comparison column, but is not used as the primary grading signal due to the selection bias issue described above.

Throw distance approximation

We do not have raw ball-tracking data. Throw distance is approximated from Statcast hit coordinates (hc_x, hc_y) using a scale factor of 2.5 feet per coordinate unit, calibrated against known field geometry. Spot checks: RF single ≈ 239 ft, LF single ≈ 180 ft, LF double ≈ 275 ft, RF double ≈ 268 ft — all plausible.

This approximation affects the logistic regression model only. The empirical bin approach does not use throw distance.
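One way such an approximation could look, assuming throw distance is the straight-line distance from the hit location to home plate. The 2.5 ft/unit scale is the value stated above. The home-plate position in hc_x/hc_y units (125.42, 198.27) is an assumption taken from a commonly used community convention, not from this module.

```python
import math

FEET_PER_UNIT = 2.5               # scale factor stated in the text
HOME_X, HOME_Y = 125.42, 198.27   # ASSUMPTION: home plate in hc_x/hc_y units

def approx_throw_distance(hc_x: float, hc_y: float) -> float:
    """Approximate distance in feet from where the ball was fielded to home plate."""
    return FEET_PER_UNIT * math.hypot(hc_x - HOME_X, hc_y - HOME_Y)

# A ball fielded 100 coordinate units straight out from home plate.
distance_ft = approx_throw_distance(125.42, 98.27)
```

With these constants, 100 coordinate units correspond to 250 ft, the same order of magnitude as the spot checks listed above.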

Break-even probability (RE24)

The break-even probability is the P(safe) at which sending and holding produce equal expected run value:

P_be = (RE_hold − RE_out) / (RE_safe − RE_out)

RE values come from a 24-state run expectancy table computed from 2020–2024 Statcast data. State transitions account for other runners on base: the runner on 2B is assumed to score (RE_safe) or be retired (RE_out); the runner on 1B (if any) advances by one base on a single, two on a double.

The runner on 3B (if any) is assumed to score on all outfield hits — this holds approximately 95% of the time and is a minor source of error.
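The formula follows from setting the expected value of sending equal to the value of holding, P × RE_safe + (1 − P) × RE_out = RE_hold, and solving for P. A small sketch with a worked example follows; the three RE values are invented round numbers for illustration and do not come from the module's run expectancy table.

```python
def break_even(re_hold: float, re_safe: float, re_out: float) -> float:
    """P(safe) at which sending and holding have equal expected run value."""
    if re_safe == re_out:
        raise ValueError("RE_safe and RE_out must differ")
    return (re_hold - re_out) / (re_safe - re_out)

# Illustrative values only (NOT from the 2020-2024 table):
#   RE_safe = 1.50  run scored plus the resulting base-out state
#   RE_hold = 1.15  runner stops at third
#   RE_out  = 0.25  runner thrown out at home
p_be = break_even(re_hold=1.15, re_safe=1.50, re_out=0.25)
# (1.15 - 0.25) / (1.50 - 0.25) = 0.90 / 1.25 = 0.72
```

In this toy state a send is justified whenever the runner is expected to score at least 72% of the time.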

Grading logic

Each play is graded using empirical P(safe) as the primary signal:

GOOD_SEND: Runner was sent and empirical P(safe) ≥ P_be.
BAD_SEND: Runner was sent and empirical P(safe) < P_be. (No plays in this category under the empirical approach — see key finding.)
BAD_HOLD: Runner was held and empirical P(safe) ≥ P_be. These represent run value left on the table.
GOOD_HOLD: Runner was held and empirical P(safe) < P_be.

Run value = P(safe) × RE_safe + (1 − P(safe)) × RE_out − RE_hold, i.e. the expected run value of sending minus the run value of holding. A positive value means sending was the correct call. For BAD_HOLD plays this is the expected runs left on the table by the hold.
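The four grades and the run value formula translate directly into two small functions. This is a sketch of the rules as written above, with hypothetical function names. The 0.947 and 0.914 figures are the minimum empirical P(safe) and maximum break-even reported in this document.

```python
def grade(sent: bool, p_safe: float, p_be: float) -> str:
    """Grade a send/hold call against the break-even probability."""
    if sent:
        return "GOOD_SEND" if p_safe >= p_be else "BAD_SEND"
    return "BAD_HOLD" if p_safe >= p_be else "GOOD_HOLD"

def run_value(p_safe: float, re_safe: float, re_out: float, re_hold: float) -> float:
    """Expected run value of sending minus the run value of holding."""
    return p_safe * re_safe + (1 - p_safe) * re_out - re_hold

# Lowest empirical P(safe) against the highest break-even:
worst_case_send = grade(sent=True, p_safe=0.947, p_be=0.914)   # "GOOD_SEND"
worst_case_hold = grade(sent=False, p_safe=0.947, p_be=0.914)  # "BAD_HOLD"
```

Because the grade depends only on the decision and the two probabilities, the same call receives the same grade whether the runner turned out safe or out.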

Aggregation and leaderboards

Team-year and coach-career leaderboards report bad_hold_runs_per100 as the primary metric: the expected run value left on the table by over-holding, normalized per 100 opportunities. This accounts for variation in opportunity count across teams, seasons, and parks.

Entries with fewer than 150 graded opportunities are flagged Low sample. The 2020 season (60 games) is separately flagged.
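A sketch of how one leaderboard row could be assembled from graded plays. It assumes the denominator is all graded opportunities for the team-year or coach, which the text implies but does not state outright. The "grade" and "run_value" keys are hypothetical field names.

```python
LOW_SAMPLE_THRESHOLD = 150  # graded opportunities, per the text

def summarize(graded_plays):
    """Aggregate graded plays into one leaderboard row."""
    n = len(graded_plays)
    lost = sum(p["run_value"] for p in graded_plays if p["grade"] == "BAD_HOLD")
    return {
        "opportunities": n,
        "bad_hold_runs_per100": 100.0 * lost / n if n else None,
        "low_sample": n < LOW_SAMPLE_THRESHOLD,
    }

# Toy team-year: 200 graded plays, 20 of them bad holds worth 0.1 runs each.
toy = [{"grade": "BAD_HOLD", "run_value": 0.1}] * 20 + [{"grade": "GOOD_SEND", "run_value": 0.2}] * 180
row = summarize(toy)
```

Here 2.0 expected runs lost over 200 opportunities gives 1.0 run per 100, and the row is not flagged as low sample.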

Key structural finding

All run value loss in this dataset comes from over-holding — not from over-sending.

Empirical P(safe) ≥ 0.947 in all 18 bins. The maximum break-even probability across all bins is 0.914. The minimum gap is 0.033 — meaning every bin has P(safe) comfortably above the break-even threshold. Under the empirical approach, zero bins produced a situation where holding was the correct expected-value call. The 21 BAD_SENDs produced by the logistic regression model all flip to GOOD_SEND when the empirical P(safe) is substituted. This is a finding, not a model gap — stated explicitly rather than forcing the model to appear balanced.

External validation

Send rate rankings were correlated against Baseball Reference's Extra Bases Taken % (XBT%), the fraction of opportunities where a team took an extra base, as an independent external check. The two metrics are computed independently: our send_rate is derived from Statcast play-by-play, while XBT% is computed by Baseball Reference from its own play-by-play.

Spearman ρ = +0.780 (p < 0.0001, n = 120 team-years with ≥ 100 opportunities). This is a strong positive correlation. Teams our model identifies as aggressive senders are independently identified as aggressive by Baseball Reference, and vice versa.
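For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below is a dependency-free illustration, not the code used to produce the figure above; the source does not say which library was used. In practice scipy.stats.spearmanr returns both ρ and the p-value.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy inputs (not the 120 team-years): identical ordering gives rho = 1.
rho = spearman_rho([0.70, 0.75, 0.80, 0.88], [0.37, 0.40, 0.45, 0.50])
```

Working on ranks rather than raw values is what makes the comparison insensitive to the different scales of send_rate and XBT%.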

Year-over-year stability of send_rate within our dataset (same team, consecutive seasons):

Season pair	n teams	Spearman ρ	p-value
2020 vs 2021	30	+0.231	0.219 (not sig.)
2021 vs 2022	30	+0.349	0.059 (borderline)
2022 vs 2023	30	+0.436	0.016 ✓
2023 vs 2024	30	+0.345	0.062 (borderline)
Mean	—	+0.340	—

The 2020–21 pair is weaker, likely because 2020 was a 60-game season with smaller, noisier per-team send_rate samples. ρ is positive in all four pairs (mean +0.340), but only the 2022 vs 2023 pair is significant at the 0.05 level. The pattern therefore suggests, rather than confirms, that send_rate captures a real coaching philosophy signal and not just year-to-year noise.

Rank comparison: our rankings vs. Baseball Reference

Below are the top 15 team-seasons by our send_rate and their corresponding XBT% rank (Baseball Reference), plus the 10 most conservative team-seasons. Rank gaps arise from scope differences: XBT% covers all types of extra-base advancement (including from 1B on a single, or first-to-third on a double), while our module covers only the specific situation of a runner on 2B with a ball hit to the outfield.

Top 15 most aggressive (by our send_rate):

Team	Year	Send%	Our rank	XBT%	BR rank
WSH	2024	88.1%	1	.45	20
ATL	2022	87.4%	2	.50	2
DET	2024	87.2%	3	.49	4
BAL	2023	86.7%	4	.49	4
CIN	2023	85.2%	5	.47	9
TB	2022	84.7%	7	.47	9
COL	2021	84.4%	10	.45	20
STL	2022	84.4%	10	.46	14
LAD	2024	84.4%	10	.49	4

10 most conservative (by our send_rate):

Team	Year	Send%	Our rank	XBT%	BR rank
SEA	2023	72.0%	111	.40	78
SF	2021	71.4%	112	.38	99
NYY	2023	71.3%	113	.39	89
CHC	2021	70.9%	114	.40	78
MIN	2021	70.8%	115	.37	108
CIN	2021	70.8%	115	.35	119
BOS	2022	69.8%	117	.38	99
NYM	2021	69.5%	118	.37	108
MIA	2023	69.3%	119	.38	99
SEA	2022	69.2%	120	.38	99

Notable outliers (rank gap > 30): The largest discrepancies come from teams where our specific situation (runner on 2B scoring on OF hit) diverges from the broader XBT% population. TB 2024 (rank 6 ours, rank 44 BR) and ATH 2021 (rank 78 ours, rank 14 BR) are prominent examples. These outliers are a natural consequence of measuring a subset of baserunning situations, not a model error; both rankings are correct for what each measures. 20 of 120 team-years have rank gaps > 30.

Data sources

Play-by-play: MLB Statcast via pybaseball (statcast(), 2020–2024).
Sprint speed: Baseball Savant sprint speed leaderboard via pybaseball.
Arm strength: Baseball Savant arm strength leaderboard CSV (LF/CF/RF). Available 2020+ only.
Run expectancy: Computed from 2020–2024 Statcast data.
XBT%: Baseball Reference team baserunning pages.
Coach attribution: Baseball Reference team pages, manually compiled.

Known limitations

▲Throw distance is approximated from hit coordinates, not measured from tracking data. Scale factor 2.5 ft/unit is calibrated but not exact. Affects only the secondary logistic model.
▲Arm strength is season-average by position for each outfielder. Play-level arm strength data is not publicly available.
▲No relay-throw modeling. Cutoff and relay quality affects throw time to home, but this data is not in the public Statcast feed.
▲Runner from 3B assumed to always score on outfield hits. True ~95% of the time.
▲Runner from 1B advancement simplified: +1 base on a single, +2 bases on a double. Actual advancement varies.
▲HELD_OR_UNKNOWN plays (~2.5%) excluded. The play description text did not unambiguously indicate the runner's outcome.
▲2020 entries carry higher uncertainty due to the 60-game shortened season (~50–80 opportunities per team vs. ~150–220 in full seasons).
▲Empirical P(safe) uses bin averages. Within-bin variation in throw distance, runner speed, and fielder arm is not captured. The empirical approach trades granularity for freedom from selection bias.
▲Zero bad sends is a finding, not a gap. Under the empirical bin approach, no situation existed where sending was the wrong call at the bin level. This is stated plainly rather than forcing the model to produce bad sends to appear balanced.
