Methodology overview

Send/Hold Grader — Methodology

Detailed data sources, modeling choices, approximations, and known limitations for the third-base coach send/hold decision module.

What we measure

For every ball hit to the outfield with a runner on second base, the third-base coach makes a binary call: send the runner to attempt to score, or hold him at third. We grade that call against the RE24 break-even probability for the situation.

This is distinct from player ability. A talented runner sent on a 50/50 opportunity produces the same coach grade regardless of whether he happens to be safe or out.

Scope

  • Situation: Runner on second base, ball hit to the outfield (single or double), opportunity to attempt to score.
  • Years: 2020–2026 MLB regular season. 2026 is in progress; entries are flagged Live in the leaderboard.
  • Data source: MLB Statcast event-level data via pybaseball.
  • Out of scope for this module: stolen bases, IBBs, other in-game decisions.

Opportunity identification

We filter Statcast play-by-play data for events where a runner was on second base (on_2b field) and the batter put the ball in play to the outfield (hit_location 7/8/9, events = single or double). We then parse the play description text (des field) to classify the runner's outcome: SCORED, OUT_AT_HOME, HELD_AT_BASE, or HELD_OR_UNKNOWN.

To avoid false positives in multi-runner situations, outcomes are attributed to the specific runner on second using Unicode-normalized name matching against the description text. Plays classified as HELD_OR_UNKNOWN (~2.5% of opportunities) are excluded from grading.
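A minimal sketch of the filter and name-matching steps, using plain dicts in place of Statcast rows. The field names on_2b, hit_location, events, and des come from the text above; the helper names and the description phrases being matched ("scores", "out at home", "to 3rd") are illustrative assumptions, not the module's actual parsing rules.

```python
import unicodedata

OUTFIELD_LOCATIONS = {7, 8, 9}      # LF, CF, RF fielder codes
HIT_EVENTS = {"single", "double"}

def normalize_name(text):
    """Strip accents and case so 'José Ramírez' matches 'Jose Ramirez'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def is_opportunity(row):
    """True when a runner is on second and the batter singles or doubles to the outfield.

    Assumes an empty base is stored as None; a pandas frame would need a NaN check instead.
    """
    return (
        row.get("on_2b") is not None
        and row.get("hit_location") in OUTFIELD_LOCATIONS
        and row.get("events") in HIT_EVENTS
    )

def classify_outcome(des, runner_name):
    """Attribute the outcome to the named runner only (hypothetical phrase patterns)."""
    text = normalize_name(des)
    runner = normalize_name(runner_name)
    if f"{runner} scores" in text:
        return "SCORED"
    if f"{runner} out at home" in text:
        return "OUT_AT_HOME"
    if f"{runner} to 3rd" in text:
        return "HELD_AT_BASE"
    return "HELD_OR_UNKNOWN"
```

Matching on the runner's normalized name, rather than on any "scores" phrase in the description, is what keeps a run scored from third base from being credited to the runner on second.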

P(safe) — probability of scoring if sent

We use an empirical bin approach: divide sent plays into 18 bins (2 event types × 3 field positions × 3 out states) and compute the fraction of sent runners who scored within each bin. This avoids the selection bias inherent in training a predictive model on sent plays only — coaches send more often when they expect success, so a model trained on sent plays would overstate P(safe) for holds.

All 18 bins had empirical P(safe) ≥ 0.947 (range: 0.947–1.000). The minimum gap between empirical P(safe) and the maximum break-even probability across all bins was 0.033. This means sending was the statistically correct call in every bin — no bin produced a situation where holding was the right expected-value choice.
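The bin computation can be sketched as follows. The dict keys (event, field, outs, outcome) are hypothetical names chosen for illustration; only the binning scheme itself (event type × field position × out state, fraction of sent runners who scored) comes from the text.

```python
from collections import defaultdict

def empirical_p_safe(sent_plays):
    """Map each (event, field, outs) bin to the fraction of sent runners who scored."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [scored, sent]
    for play in sent_plays:
        key = (play["event"], play["field"], play["outs"])
        counts[key][1] += 1
        if play["outcome"] == "SCORED":
            counts[key][0] += 1
    return {key: scored / sent for key, (scored, sent) in counts.items()}

# Toy input: only SCORED and OUT_AT_HOME plays count as "sent".
toy_plays = [
    {"event": "single", "field": "LF", "outs": 2, "outcome": "SCORED"},
    {"event": "single", "field": "LF", "outs": 2, "outcome": "SCORED"},
    {"event": "single", "field": "LF", "outs": 2, "outcome": "OUT_AT_HOME"},
    {"event": "double", "field": "RF", "outs": 0, "outcome": "SCORED"},
]
p_safe_by_bin = empirical_p_safe(toy_plays)
```

With real data the resulting dict would have at most 18 keys, one per bin.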

A logistic regression model (AUC = 0.78, features: throw distance, runner sprint speed, outfielder arm strength, hit type, field position) is retained as a secondary comparison column, but is not used as the primary grading signal due to the selection bias issue described above.

Throw distance approximation

We do not have raw ball-tracking data. Throw distance is approximated from Statcast hit coordinates (hc_x, hc_y) using a scale factor of 2.5 feet per coordinate unit, calibrated against known field geometry. Spot checks: RF single ≈ 239 ft, LF single ≈ 180 ft, LF double ≈ 275 ft, RF double ≈ 268 ft — all plausible.

This approximation affects the logistic regression model only. The empirical bin approach does not use throw distance.
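As a rough illustration of the approximation, the sketch below converts hit coordinates to feet with the 2.5 ft/unit scale factor. The home-plate location (125.42, 198.27) is an assumption borrowed from common public conventions for hc_x/hc_y and is not stated in this document, as is the choice to measure the throw from the hit location to home plate.

```python
import math

FEET_PER_UNIT = 2.5               # scale factor stated in the text
HOME_X, HOME_Y = 125.42, 198.27   # ASSUMPTION: home plate in hc_x/hc_y space

def approx_throw_distance(hc_x, hc_y):
    """Straight-line distance in feet from the hit location to home plate."""
    return FEET_PER_UNIT * math.hypot(hc_x - HOME_X, hc_y - HOME_Y)
```

A point 100 coordinate units from home plate therefore maps to 250 ft, which is the order of magnitude of the spot checks above.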

Break-even probability (RE24)

The break-even probability is the P(safe) at which sending and holding produce equal expected run value:

P_be = (RE_hold − RE_out) / (RE_safe − RE_out)

RE values come from a 24-state run expectancy table computed from 2020–2024 Statcast data. State transitions account for other runners on base: the runner on 2B is assumed to score (RE_safe) or be retired (RE_out); the runner on 1B (if any) advances by one base on a single, two on a double.

The runner on 3B (if any) is assumed to score on all outfield hits — this holds approximately 95% of the time and is a minor source of error.
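The formula can be sketched directly. The RE values in the example are round illustrative numbers, not entries from the 2020–2024 table, and the sketch assumes RE_safe already includes the run that scores.

```python
def break_even(re_hold, re_out, re_safe):
    """P(safe) at which sending and holding have equal expected run value."""
    if re_safe <= re_out:
        raise ValueError("RE_safe must exceed RE_out for the break-even to be defined")
    return (re_hold - re_out) / (re_safe - re_out)

# Hypothetical two-out single with the bases otherwise empty:
#   hold -> runners on 1B and 3B, two outs
#   safe -> one run in, runner on 1B, two outs
#   out  -> inning over
p_be = break_even(re_hold=0.49, re_out=0.0, re_safe=1.22)  # roughly 0.40
```

With two outs the out state is worth nothing further, so the break-even is low; with no outs, RE_hold is much larger and the break-even climbs toward the 0.914 maximum reported in this document.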

Grading logic

Each play is graded using empirical P(safe) as the primary signal:

  • GOOD_SEND: Runner was sent and empirical P(safe) ≥ P_be.
  • BAD_SEND: Runner was sent and empirical P(safe) < P_be. (No plays in this category under the empirical approach — see key finding.)
  • BAD_HOLD: Runner was held and empirical P(safe) ≥ P_be. These represent run value left on the table.
  • GOOD_HOLD: Runner was held and empirical P(safe) < P_be.

Run value = P(safe) × RE_safe + (1 − P(safe)) × RE_out − RE_hold. A positive value means sending has the higher expected run value, so a send is the correct decision. For BAD_HOLD plays this is the expected runs left on the table by the hold.
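The four grades and the run value formula translate into a few lines. The function names are illustrative; the logic follows the definitions above.

```python
def run_value(p_safe, re_safe, re_out, re_hold):
    """Expected runs gained by sending instead of holding."""
    return p_safe * re_safe + (1 - p_safe) * re_out - re_hold

def grade(sent, p_safe, p_be):
    """Grade one decision against the break-even probability."""
    if sent:
        return "GOOD_SEND" if p_safe >= p_be else "BAD_SEND"
    return "BAD_HOLD" if p_safe >= p_be else "GOOD_HOLD"
```

Because the grade depends only on the decision, P(safe), and P_be, the actual result of the play (safe or out) never enters the calculation.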

Aggregation and leaderboards

Team-year and coach-career leaderboards report bad_hold_runs_per100 as the primary metric: the expected run value left on the table by over-holding, normalized per 100 opportunities. This accounts for variation in opportunity count across teams, seasons, and parks.

Entries with fewer than 150 graded opportunities are flagged Low sample. The 2020 season (60 games) is separately flagged.
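A sketch of the aggregation for one team-year or coach. It assumes the denominator is the number of graded opportunities and that each play carries hypothetical grade and run_value keys; the 150-opportunity threshold is the one stated above.

```python
LOW_SAMPLE_THRESHOLD = 150  # graded opportunities, from the text

def summarize(plays):
    """Aggregate graded plays into the leaderboard metric and sample flag."""
    n = len(plays)
    lost = sum(p["run_value"] for p in plays if p["grade"] == "BAD_HOLD")
    return {
        "opportunities": n,
        "bad_hold_runs_per100": 100 * lost / n if n else 0.0,
        "low_sample": n < LOW_SAMPLE_THRESHOLD,
    }

toy = [
    {"grade": "GOOD_SEND", "run_value": 0.5},
    {"grade": "GOOD_SEND", "run_value": 0.5},
    {"grade": "GOOD_SEND", "run_value": 0.5},
    {"grade": "BAD_HOLD", "run_value": 0.4},
]
summary = summarize(toy)
```

Only BAD_HOLD plays contribute to the numerator, so a team that sends on every opportunity scores zero regardless of how many opportunities it had.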

Key structural finding

All run value loss in this dataset comes from over-holding — not from over-sending.

Empirical P(safe) ≥ 0.947 in all 18 bins. The maximum break-even probability across all bins is 0.914. The minimum gap is 0.033 — meaning every bin has P(safe) comfortably above the break-even threshold. Under the empirical approach, zero bins produced a situation where holding was the correct expected-value call. The 21 BAD_SENDs produced by the logistic regression model all flip to GOOD_SEND when the empirical P(safe) is substituted. This is a finding, not a model gap — stated explicitly rather than forcing the model to appear balanced.

External validation

Send rate rankings were correlated against Baseball Reference's Extra Bases Taken % (XBT%) — the fraction of opportunities where a team took an extra base — as an independent external check. The two metrics share no data: our send_rate is computed from Statcast play-by-play; XBT% is computed by Baseball Reference from their play-by-play.

Spearman ρ = +0.780 (p < 0.0001, n = 120 team-years with ≥ 100 opportunities). This is a strong positive correlation. Teams our model identifies as aggressive senders are independently identified as aggressive by Baseball Reference, and vice versa.
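For readers unfamiliar with the statistic, Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The dependency-free sketch below shows the calculation; it is not necessarily how the figure above was computed, and a library routine such as scipy.stats.spearmanr would also return the p-value.

```python
def average_ranks(values):
    """1-based ranks, with ties sharing the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Here x would be the send_rate values and y the XBT% values for the same team-years, in the same order.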

Year-over-year stability of send_rate within our dataset (same team, consecutive seasons):

Season pair     n teams   Spearman ρ   p-value
2020 vs 2021    30        +0.231       0.219 (not sig.)
2021 vs 2022    30        +0.349       0.059 (borderline)
2022 vs 2023    30        +0.436       0.016 ✓
2023 vs 2024    30        +0.345       0.062 (borderline)
Mean                      +0.340

The 2020–21 pair is weaker, likely because 2020 was a 60-game season with smaller, noisier per-team send_rate samples. The consistently positive ρ across all pairs (mean +0.340) suggests that send_rate reflects a persistent coaching tendency rather than pure year-to-year noise, although only the 2022–23 pair is significant at the 0.05 level.

Rank comparison: our rankings vs. Baseball Reference

Below are the top 15 team-seasons by our send_rate and their corresponding XBT% rank (Baseball Reference), plus the 10 most conservative team-seasons. Rank gaps arise from scope differences: XBT% covers all types of extra-base advancement (including first-to-third on a single, or first-to-home on a double), while our module covers only the specific situation of a runner on 2B with a ball hit to the outfield.

Top 15 most aggressive (by our send_rate):

Team   Year   Send%    Our rank   XBT%   BR rank
WSH    2024   88.1%    1          .45    20
ATL    2022   87.4%    2          .50    2
DET    2024   87.2%    3          .49    4
BAL    2023   86.7%    4          .49    4
CIN    2023   85.2%    5          .47    9
TB     2022   84.7%    7          .47    9
COL    2021   84.4%    10         .45    20
STL    2022   84.4%    10         .46    14
LAD    2024   84.4%    10         .49    4

10 most conservative (by our send_rate):

Team   Year   Send%    Our rank   XBT%   BR rank
SEA    2023   72.0%    111        .40    78
SF     2021   71.4%    112        .38    99
NYY    2023   71.3%    113        .39    89
CHC    2021   70.9%    114        .40    78
MIN    2021   70.8%    115        .37    108
CIN    2021   70.8%    115        .35    119
BOS    2022   69.8%    117        .38    99
NYM    2021   69.5%    118        .37    108
MIA    2023   69.3%    119        .38    99
SEA    2022   69.2%    120        .38    99

Notable outliers (rank gap > 30): The largest discrepancies come from teams where our specific situation (runner on 2B scoring on an outfield hit) diverges from the broader XBT% population. TB 2024 (rank 6 ours, rank 44 BR) and ATH 2021 (rank 78 ours, rank 14 BR) are prominent examples. These outliers are a natural consequence of measuring a subset of baserunning situations, not a model error; both rankings are correct for what each measures. 20 of 120 team-years have rank gaps > 30.

Data sources

  • Play-by-play: MLB Statcast via pybaseball (statcast(), 2020–2024).
  • Sprint speed: Baseball Savant sprint speed leaderboard via pybaseball.
  • Arm strength: Baseball Savant arm strength leaderboard CSV (LF/CF/RF). Available 2020+ only.
  • Run expectancy: Computed from 2020–2024 Statcast data.
  • XBT%: Baseball Reference team baserunning pages.
  • Coach attribution: Baseball Reference team pages, manually compiled.

Known limitations

  • Throw distance is approximated from hit coordinates, not measured from tracking data. Scale factor 2.5 ft/unit is calibrated but not exact. Affects only the secondary logistic model.
  • Arm strength is season-average by position for each outfielder. Play-level arm strength data is not publicly available.
  • No relay-throw modeling. Cutoff and relay quality affects throw time to home, but this data is not in the public Statcast feed.
  • Runner from 3B assumed to always score on outfield hits. True ~95% of the time.
  • Runner from 1B advancement simplified: +1 base on a single, +2 bases on a double. Actual advancement varies.
  • HELD_OR_UNKNOWN plays (~2.5%) excluded. The play description text did not unambiguously indicate the runner's outcome.
  • 2020 entries carry higher uncertainty due to the 60-game shortened season (~50–80 opportunities per team vs. ~150–220 in full seasons).
  • Empirical P(safe) uses bin averages. Within-bin variation in throw distance, runner speed, and fielder arm is not captured. The empirical approach trades granularity for freedom from selection bias.
  • Zero bad sends is a finding, not a gap. Under the empirical bin approach, no situation existed where sending was the wrong call at the bin level. This is stated plainly rather than forcing the model to produce bad sends to appear balanced.