tennis methodology

Tennis Methodology

Surface-specific Elo + per-archetype aging + draw-aware Monte Carlo. Grass-Elo gets a sample-size shrinkage toward overall Elo, but capped at 50% — Sinner's correction is preserved while small-sample players (Fils, Tien, Fonseca) no longer get their hard-court rating laundered through grass.

Tennis projections use a separate Elo per surface (hard / clay / grass). Aging curves are fit per archetype, not league-wide. Each tournament's bracket runs Monte Carlo over the published draw with the matchup-by-matchup probabilities the model assigns. The /wimbledon and /wimbledon/match pages surface those drivers explicitly — what's the grass Elo, what's the aging contribution, what archetype matchup edge applies — so the model's reasoning is auditable on every match.

engine

Surface Elo (capped shrinkage) + per-archetype aging + FDR archetype matchup + draw-aware MC

Each player carries a separate Elo per surface. Grass-Elo on small samples is shrunk toward the player's cross-surface overall Elo with strength K/(K+n_grass), capped at 50% pull — so a player with very few grass matches can borrow from overall but never to the point that overall replaces grass entirely. Aging curves are fit per archetype (Big-Server, Counter-Puncher, Modern Tall Baseliner, etc.) — different player types peak and decline differently. Archetype-matchup edges (e.g. Modern Baseliner over Crafty Veteran on grass) only apply where the matchup matrix passes FDR significance; otherwise the matchup contribution is zero. Tournament projections run Monte Carlo (10k sims) over the actual published draw.
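A minimal Python sketch of the capped shrinkage described above. The function names and the linear blend between raw grass Elo and overall Elo are assumptions made for illustration; the text states only the weight K/(K+n_grass) and the 50% cap.

```python
def shrink_weight(n_grass: int, k: float = 40.0, w_max: float = 0.5) -> float:
    """Pull toward overall Elo: K / (K + n_grass), capped at w_max."""
    if n_grass < 0:
        raise ValueError("n_grass must be non-negative")
    return min(k / (k + n_grass), w_max)


def shrunk_grass_elo(raw_grass: float, overall: float, n_grass: int,
                     k: float = 40.0, w_max: float = 0.5) -> float:
    """Blend raw grass Elo with overall Elo (assumed linear interpolation)."""
    w = shrink_weight(n_grass, k, w_max)
    return (1.0 - w) * raw_grass + w * overall
```

With the defaults, a player with 11 grass matches hits the cap (weight 0.5 instead of the uncapped 0.784), while a player with 80 grass matches keeps the uncapped weight of about 0.333.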

data

Where the inputs come from

sources

ATP + WTA match results (Jeff Sackmann mirror), official Wimbledon draw JSON (wimbledon.com MS.json / LS.json), Wikidata for WTA DOB backfill, Tennis Abstract MCP for archetype clustering

training

Multi-year per-surface Elo fit; aging curves fit on multi-decade player histories; archetype matchup matrix on full available match panel

holdout

Per-match Brier + tournament champion Brier evaluated post-tournament. The grass-Elo shrinkage cap (w_max=0.5) was set after the Wimbledon 2026 audit found that K=40 alone over-shrank n<20 players; 5 of 7 flagged calls flipped to the intuitive winner with the cap applied.

calibration

Out-of-sample performance

metric

Per-match Brier + per-tournament champion Brier. Point-in-time OOS for surface shrinkage parameters.
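For reference, a short Python sketch of the two Brier scores named above. The per-match version is the standard binary Brier score. The champion version shown here is the multi-category form summed over the whole field, which is an assumption about how the champion Brier is computed; the function names are illustrative.

```python
from typing import Mapping, Sequence


def match_brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between P(player A wins) and the 0/1 result."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def champion_brier(champion_probs: Mapping[str, float], champion: str) -> float:
    """Multi-category Brier: squared error summed over every player in the field."""
    if champion not in champion_probs:
        raise ValueError("champion must appear in the forecast")
    return sum((p - (1.0 if name == champion else 0.0)) ** 2
               for name, p in champion_probs.items())
```

Lower is better in both cases; a coin-flip forecast on a single match scores 0.25.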

value

Grass-Elo shrinkage K=40, w_max=0.5 (cap added 2026-06-28 after audit)

The shrinkage cap is the lesson from the K=40 lever: K=40 alone correctly un-suppressed Sinner (39 grass matches, raw 1843 → shrunk 1986), but it over-credited cross-surface form on players with <20 grass matches (Fils, with 11 matches, got pulled 78% toward overall). Grass is a specialist surface, so overall Elo is a noisy prior, not a clean substitute. The 50% cap keeps Sinner's correction essentially intact (his uncapped weight was already w=0.506) and stops the laundering. Steepening the aging tail was explicitly rejected as an alternative Sinner fix because that would have been survivorship bias.
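The figures quoted above can be cross-checked with a few lines of arithmetic. This sketch assumes the shrunk rating is a linear blend, shrunk = raw + w * (overall - raw), and that the published 1843 and 1986 are rounded; the variable names are illustrative.

```python
K = 40
N_SINNER, N_FILS = 39, 11
RAW, SHRUNK_UNCAPPED = 1843.0, 1986.0  # Sinner's grass Elo, as quoted in the text

w_sinner = K / (K + N_SINNER)   # about 0.506
w_fils = K / (K + N_FILS)       # about 0.784, the "78%" pull

# Back out the gap between overall and raw grass Elo implied by the blend.
implied_gap = (SHRUNK_UNCAPPED - RAW) / w_sinner   # about 282 points

# Re-apply the blend with the 50% cap in place.
shrunk_capped = RAW + min(w_sinner, 0.5) * implied_gap   # about 1984

print(round(w_sinner, 3), round(w_fils, 3),
      round(implied_gap, 1), round(shrunk_capped, 1))
```

Under these assumptions the implied gap of roughly 282 points matches the "~280 points" figure, and the cap moves Sinner's shrunk rating by less than two points.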

key levers

What controls the projection

Surface-specific Elo

Hard / clay / grass Elo fit separately. Players are not equally strong across surfaces — surface-specific Elo captures the gap.

Grass-Elo shrinkage K=40, capped at 50%

Grass tournaments are sparse per player per year. Players with few grass matches get their grass Elo pulled toward their overall (cross-surface) Elo, but never more than halfway. The cap is the honest bound: overall isn't a valid substitute for grass data on a specialist surface, so a player can borrow at most half of the gap between grass and overall Elo.

Per-archetype aging

Aging curves fit per player archetype (Big-Server, Counter-Puncher, Modern Tall Baseliner, etc.). Aging tails differ by playstyle — and rejecting in-sample fits that steepen the tail to fix one player is a principle, not a debatable choice.

FDR-gated archetype matchup

Archetype-vs-archetype edges only apply where the matchup matrix passes FDR significance. On grass, only Modern Baseliner vs Crafty Veteran survives the gate (ATP). For pairs that don't, the matchup contribution is zero — we don't fit noise.
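One way such a gate could work is sketched below in Python. The text does not say which FDR procedure is used, so the Benjamini-Hochberg step-up rule here is an assumption, as are the function names, the alpha level, and the shape of the inputs.

```python
def bh_significant(p_values: dict[str, float], alpha: float = 0.05) -> set[str]:
    """Return the keys that pass the Benjamini-Hochberg step-up procedure."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p <= alpha * rank / m:
            cutoff_rank = rank
    return {name for name, _ in ranked[:cutoff_rank]}


def gated_edge(pair: str, raw_edges: dict[str, float],
               significant: set[str]) -> float:
    """Matchup contribution: the fitted edge if the pair passed the gate, else zero."""
    return raw_edges[pair] if pair in significant else 0.0
```

Any pair outside the returned set contributes exactly zero, which is the behaviour described above for matchups that fail the gate.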

Draw-aware Monte Carlo

Tournament projections simulate the actual draw, not an abstract bracket. Every potential matchup is simulated; the /wimbledon page's 'if they meet' picker and per-player champion% both come from the same 10k-sim distribution.
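A simplified Python sketch of a draw-aware simulation follows. It assumes a single-elimination draw listed in bracket order with a power-of-two size, and it uses the standard Elo logistic win probability on one rating per player; the real matchup probabilities also include aging and archetype terms, which are omitted here. All names are illustrative.

```python
import random
from collections import Counter


def elo_win_prob(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate_bracket(draw: list[str], ratings: dict[str, float],
                     rng: random.Random) -> str:
    """Play one tournament through the given draw and return the champion."""
    alive = list(draw)
    while len(alive) > 1:
        next_round = []
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            a_wins = rng.random() < elo_win_prob(ratings[a], ratings[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]


def champion_probs(draw: list[str], ratings: dict[str, float],
                   n_sims: int = 10_000, seed: int = 0) -> dict[str, float]:
    """Estimate each player's champion% from repeated bracket simulations."""
    n = len(draw)
    if n < 2 or n & (n - 1):
        raise ValueError("draw size must be a power of two")
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(draw, ratings, rng) for _ in range(n_sims))
    return {player: wins[player] / n_sims for player in draw}
```

Because every simulation walks the same bracket positions, a player's champion% reflects who they would actually have to beat along their path, not an average opponent.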

faq

Common questions

Why does the model sometimes pick a less-accomplished player over a name?

Usually because of the small-sample shrinkage + aging-curve interaction. A young player with few grass matches gets their grass Elo pulled up toward their hard/clay overall (capped at 50%), and the aging curve credits players in the grass-peak window (roughly age 22-27). Combined, this can tip a close matchup toward the younger, less-grass-experienced player. The /wimbledon/match page shows exactly which driver is doing the work for any given match — if grass Elo is favoring the veteran but aging flips it, you'll see both contributions side by side.

Why grass-Elo shrinkage K=40 with a 50% cap?

Without shrinkage, sparse grass samples produce artifacts: Sinner's 39 grass matches gave a raw Elo ~280 points below his cross-surface overall, leaving him underrated entering 2026 Wimbledon despite being world #1. K=40 fixed that. But uncapped, K=40 over-corrected on players with <20 grass matches: Fils (11 matches) got pulled 78% toward overall and ended up rated higher on grass than Shelton (80 matches), which is wrong. The 50% cap preserves Sinner's correction almost exactly (his uncapped w was 0.506) and stops the laundering for everyone else.

Why not steepen the aging tail to handle these cases?

Because that would be survivorship bias — steepening the aging tail to fix one player's projection is fitting noise. The shrinkage cap is OOS-testable and principled. The aging tail stays where the data put it.

How does the model update after each match?

Per-surface Elo updates with each completed match (standard Elo update with surface-specific K). Aging curves and archetype assignments are stable across the tournament. The Wimbledon refresh script (D:/Tennis Projections/scripts/refresh_during_wimbledon.py) pulls the previous day's results, re-derives Elo, re-runs the bracket MC, and re-deploys the bracket + per-pair odds + projected JSON. The /wimbledon page picks up the updated probabilities on the next page load.
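The per-match update step could look like the Python sketch below. The update rule is the standard Elo formula; the K values in SURFACE_K are placeholders invented for illustration (the text does not give them), and this update K is a different parameter from the shrinkage K=40. The data layout and function names are also assumptions.

```python
# Placeholder update factors per surface; hypothetical values, not the model's.
SURFACE_K = {"hard": 24.0, "clay": 24.0, "grass": 32.0}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_surface_elo(ratings: dict[str, dict[str, float]],
                       winner: str, loser: str, surface: str,
                       surface_k: dict[str, float] = SURFACE_K) -> float:
    """Apply one completed match to the ratings for that surface only."""
    k = surface_k[surface]
    r_w = ratings[winner][surface]
    r_l = ratings[loser][surface]
    # The winner scored 1, so the rating change is K * (1 - expected score).
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner][surface] = r_w + delta
    ratings[loser][surface] = r_l - delta
    return delta
```

Only the surface that was played changes; a grass result leaves both players' hard and clay ratings untouched.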

What does the model NOT account for?

Injuries (a player carrying a thigh strain isn't surfaced in the projection; we only see match results, not in-tournament withdrawals or load management). Weather (grass plays differently in heat / humidity / under the roof). Match scheduling and accumulated fatigue (back-to-back five-setters). Court assignment effects (Centre vs Court 1 vs outside courts, where bounce and feel differ). Recent off-court news (coaching changes, personal matters). Doubles partnerships (singles only for now). These all matter, but we don't ingest them and we won't pretend otherwise.

How should I read counterintuitive calls?

Open the /wimbledon/match page for that pairing. It shows the model's full breakdown: grass Elo (raw + shrunk + sample size), age + aging contribution, archetype matchup. If a call surprises you, one of those drivers is doing unusual work — usually a small-sample shrinkage on a younger player, or an archetype edge you didn't expect. The model is making a probabilistic statement, not a prediction; a 55% call still loses 45% of the time.

How was the K=40 shrinkage cap derived?

After the Wimbledon 2026 draw was posted, an audit of seven flagged matches found that 5 of them shared the same defect: K=40 was overpowered on small-grass-sample players, hauling their grass Elo 60-80% toward their cross-surface overall. Capping at w_max=0.5 was the smallest change that fixed all 5 without disturbing the n>=40 specialists, whose weight K/(K+n) is already at or below 0.5. Verified with a re-run MC: 5 of 7 calls flipped to the intuitive winner; Sinner's champion% was unchanged at 47.6%.

Is WTA the same fidelity as ATP?

Yes. Same engine class, same surface-specific Elo + cap, same per-archetype aging (WTA archetype clustering uses K=4 clusters vs ATP's K=7; this K is the cluster count, not the shrinkage K=40), same draw-aware sim. Equity-first ship: women's tennis is not a junior partner.

What about doubles?

Singles only for now. Doubles projection is on the roadmap but not yet shipped — when it ships, it'll have its own page + OOS calibration in release notes.

apex framework

For the platform-wide methodology framework (pre-registration policy, data philosophy, bias controls, and honesty notes), see the apex methodology page.