Country Roads, Take Me Home: Evaluating Home Effect in the Tour de France
This was a fun project derived from my PhD microeconometrics course. I am thankful to my "coauthors": Eren Cerik and Angelo Mimmo. I also want to thank Prof. Michael Lechner and Ana Armendariz for their ideas and comments. We planned to submit this work to the European Association of Sport Economics, but eventually we did not. I take most of the blame because of my procrastination. Still, this would make a good talk on how to use econometrics to answer interesting questions with publicly available data.
Read the paper: PDF
Get the presentation slides: PDF
The usual "home advantage" story is built on team sports: crowd support, travel fatigue for visitors, referee bias, familiarity with a venue, and so on. Road cycling does not fit that framework neatly: it is individual performance embedded in team tactics, the "venue" moves all the time, and crowds line public roads rather than fixed stadiums.
The core question here is: when a Tour de France stage takes place in a rider's home country, does that rider become more likely to perform well in that stage, or less?
1) Why the Tour de France is a special "home advantage" case
The Tour is a 21-24 day, ~3,500km stage race with heterogeneous terrain (flats, hills, mountains, time trials) and heavy strategic interaction inside teams. Not all riders are trying to "win the day" every stage. Many riders play roles: protecting the team leader, controlling breakaways, pacing climbs, or sacrificing their own ranking for a captain's overall position.
That matters for identification because "performance" is not a single-dimensional effort outcome. A rider can be "good" yet intentionally give up stage rank. The paper chooses a concrete metric tightly linked to incentives in the professional cycling labor market: whether the rider finishes in the Top 15 of a stage. Top-15 stage results generate UCI points and visibility, which can feed into contract renewals and future bargaining.
2) Data construction and the key variables
Sample
- Tour de France men's stages, 2020-2024.
- Primary source: ProCyclingStats (PCS), with supplementary pulls due to missing or inconsistent fields.
- The final dataset after cleaning retains 14,639 observations (down from the 14,744 mentioned in the paper's text).
Outcome: "high performance"
Top15 is a binary indicator: 1 if a rider finishes within the top 15 in a stage; 0 otherwise. With ~170-180 starters per stage, Top 15 targets a meaningful right-tail performance event rather than small rank noise.
Treatment: "home rider"
Rider home-country indicator: equals 1 if the stage's country matches the rider's nationality. Conceptually, this is the "home" environment most likely to matter in a road-race context: language, cultural salience, media attention, and crowd identification.
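To make the two definitions concrete, here is a minimal sketch of how the outcome and treatment indicators could be coded from rider-stage records. The record layout, field names, and example rows are hypothetical, and coding a missing rank (a rider who did not finish) as 0 is an assumption of this sketch, not something the paper states.

```python
from dataclasses import dataclass
from typing import Optional

TOP_N = 15  # rank threshold that defines "high performance"


@dataclass
class RiderStage:
    rider: str
    nationality: str           # rider nationality (country code)
    stage_country: str         # country in which the stage takes place
    stage_rank: Optional[int]  # finishing position; None if no result


def top15(row: RiderStage) -> int:
    """Outcome: 1 if the rider finished within the top 15, else 0."""
    # Assumption: a missing rank is coded as 0 rather than dropped.
    return int(row.stage_rank is not None and row.stage_rank <= TOP_N)


def home_rider(row: RiderStage) -> int:
    """Treatment: 1 if the stage country matches the rider's nationality."""
    return int(row.nationality == row.stage_country)


# Hypothetical example rows, not taken from the dataset.
rows = [
    RiderStage("rider_a", "FR", "FR", 7),
    RiderStage("rider_b", "DK", "FR", 15),
    RiderStage("rider_c", "IT", "IT", 88),
    RiderStage("rider_d", "ES", "FR", None),
]
coded = [(r.rider, top15(r), home_rider(r)) for r in rows]
```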
Controls
The empirical design relies on a selection-on-observables assumption, so the control set is intentionally broad:
- Tour features: year indicators; progress through the Tour (share completed), capturing fatigue and selection as riders drop out.
- Rider features: BMI, age, prior-year PCS points (proxy for rider quality and recent success), captain status, and specialization profiles.
- Team features: team strength proxied by the sum of riders' PCS points in the previous year.
- Stage features: distance; whether the stage is a road race vs. time trial; route "profile" categories (flat / hilly / mountain variants).
- Interactions of rider-stage fit:
- Time-trial specialist x time-trial stage match.
- Rider specialty x profile match for current stage; plus matches for previous and next stages (to capture energy expenditure or conservation incentives across adjacent days).
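The rider-stage fit interactions above can be sketched as follows. This is only an illustration: the mapping from specialty to stage profile, the field names, and the choice to code a missing neighbouring stage (first or last day) as 0 are all assumptions, and the paper's actual coding may differ.

```python
# Assumed mapping from rider specialty to the stage profile it suits.
SPECIALTY_TO_PROFILE = {
    "sprinter": "flat",
    "hilly": "hilly",
    "mountain": "mountain",
}


def fit_indicators(specialty, is_tt_specialist, stages):
    """Build fit indicators for one rider across the race.

    stages: list of (profile, is_time_trial) tuples in race order.
    Returns one dict of 0/1 indicators per stage.
    """
    target = SPECIALTY_TO_PROFILE[specialty]
    matches = [int(profile == target) for profile, _ in stages]
    rows = []
    for i, (_, is_tt) in enumerate(stages):
        rows.append({
            # Time-trial specialist riding a time-trial stage.
            "tt_match": int(is_tt_specialist and is_tt),
            # Specialty matches the profile of the current stage.
            "match_current": matches[i],
            # Same match for adjacent stages; 0 when there is no neighbour.
            "match_previous": matches[i - 1] if i > 0 else 0,
            "match_next": matches[i + 1] if i + 1 < len(stages) else 0,
        })
    return rows


# Hypothetical three-stage race: flat, mountain, flat time trial.
stages = [("flat", False), ("mountain", False), ("flat", True)]
sprinter_rows = fit_indicators("sprinter", False, stages)
climber_tt_rows = fit_indicators("mountain", True, stages)
```

The previous and next stage matches are kept as separate indicators so that an estimator can pick up effort spent yesterday and effort saved for tomorrow as distinct channels.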
A novel "familiarity" control: area knowledge
The paper tries to separate "country-level home salience" from "route familiarity" by geocoding rider birthplaces and stage start/end points, then computing driving distances. An area knowledge indicator is set to 1 if either the start or end of the stage is within 150km of the rider's birthplace.
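A rough sketch of the area knowledge indicator is shown below. Note the swap: the paper uses driving distances, while this sketch uses great-circle (haversine) distance as a stand-in because it needs no routing service. Great-circle distance is never longer than the road distance, so this version would flag at least as many rider-stage pairs as the driving-distance rule. Function names and example coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0
AREA_RADIUS_KM = 150.0  # threshold from the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def area_knowledge(birthplace, stage_start, stage_end, radius_km=AREA_RADIUS_KM):
    """1 if the stage start or end lies within radius_km of the birthplace."""
    return int(any(
        haversine_km(*birthplace, *point) <= radius_km
        for point in (stage_start, stage_end)
    ))


# Illustrative coordinates (lat, lon) for a hypothetical Paris-to-Lyon stage.
paris = (48.8566, 2.3522)
lyon = (45.7640, 4.8357)
versailles = (48.8049, 2.1204)
nice = (43.7102, 7.2620)

born_near_start = area_knowledge(versailles, paris, lyon)
born_far_away = area_knowledge(nice, paris, lyon)
```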
A second "home" channel: team home country
Teams have national registrations that reflect sponsorship and licensing structures. A control indicator flags whether the stage country matches the team's country. This aims to avoid conflating "rider is at home" with "team is at home," since the latter could influence media obligations, sponsor activation, and internal incentives.
3) Identification and estimation strategy
Target estimand
The estimand is the causal effect of being a home rider on the probability of achieving a Top-15 stage finish.
Formally, the paper frames this with potential outcomes Y(1) and Y(0), and treatment D.
Assumptions (as stated)
Assumption 1 - Conditional Independence (selection on observables)
Potential outcomes are independent of treatment assignment conditional on observed covariates (and fixed effects / controls). Practically: once you condition on the rich rider/team/stage controls (including match indicators and area knowledge), "home" is treated as as-good-as random.
Assumption 2 - Common support / overlap
For any covariate profile considered, there is a positive probability of being treated and untreated. The paper checks this via propensity scores and trims to the overlap region. Reported overlap is 83.64%; 1,261 observations are removed.
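One common way to enforce overlap is a min/max rule on estimated propensity scores: keep only observations whose score lies in the range covered by both the treated and the untreated group. The sketch below illustrates that rule on made-up scores. Whether the paper trims exactly this way is an assumption, and the numbers here are unrelated to the reported 83.64%.

```python
def common_support_bounds(pscores, treated):
    """Return (lower, upper) bounds of the propensity-score overlap region."""
    p1 = [p for p, d in zip(pscores, treated) if d == 1]
    p0 = [p for p, d in zip(pscores, treated) if d == 0]
    lower = max(min(p1), min(p0))  # largest of the two group minima
    upper = min(max(p1), max(p0))  # smallest of the two group maxima
    return lower, upper


def trim_to_overlap(pscores, treated):
    """Flag observations inside the overlap region and report the share kept."""
    lower, upper = common_support_bounds(pscores, treated)
    keep = [lower <= p <= upper for p in pscores]
    return keep, sum(keep) / len(keep)


# Made-up propensity scores: four untreated, then four treated observations.
pscores = [0.02, 0.10, 0.30, 0.55, 0.20, 0.40, 0.70, 0.95]
treated = [0, 0, 0, 0, 1, 1, 1, 1]

bounds = common_support_bounds(pscores, treated)
keep, share_kept = trim_to_overlap(pscores, treated)
```

Trimming changes the population the estimate refers to: effects are then identified only for rider-stage profiles that occur both at home and away.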
Assumption 3 - SUTVA
A rider's potential outcome depends only on that rider's own treatment assignment (no interference). In cycling, this is a strong assumption because riders interact; the paper proceeds with the standard formulation.
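In standard potential-outcomes notation, the three assumptions and what they buy can be written as below. The symbol X for the vector of observed covariates is introduced here for illustration; Y, D, Y(1), and Y(0) follow the text.

```latex
\begin{align*}
  &\text{Observation rule (embeds SUTVA):}
    && Y = D\,Y(1) + (1 - D)\,Y(0) \\
  &\text{Conditional independence:}
    && \{Y(1),\, Y(0)\} \perp D \mid X = x \\
  &\text{Common support:}
    && 0 < P(D = 1 \mid X = x) < 1 \\
  &\text{Identification:}
    && E[Y(d) \mid X = x] = E[Y \mid D = d,\, X = x], \quad d \in \{0, 1\}
\end{align*}
```

The last line is the payoff: under the first three conditions, the unobservable conditional mean of each potential outcome equals a conditional mean of the observed outcome, which can be estimated from data.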
Main estimator: Modified Causal Forest (MCF)
The workhorse estimator is MCF (Lechner). The appeal in this setting is not just an average effect; it is the ability to map heterogeneity at multiple levels:
- ATE: average treatment effect.
- IATE: individualized treatment effects (rider-stage-level heterogeneity).
- GATE: group average treatment effects (captains vs. non-captains; specialties; stage profiles; team-home interactions).
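The three aggregation levels can be stated compactly as follows. Here X denotes the observed covariates and Z a discrete grouping variable such as captain status or specialty; both symbols are notation assumed for this summary rather than taken from the paper.

```latex
\begin{align*}
  \mathrm{IATE}(x) &= E[\,Y(1) - Y(0) \mid X = x\,] \\
  \mathrm{GATE}(z) &= E[\,Y(1) - Y(0) \mid Z = z\,]
                    = E[\,\mathrm{IATE}(X) \mid Z = z\,] \\
  \mathrm{ATE}     &= E[\,Y(1) - Y(0)\,]
                    = E[\,\mathrm{IATE}(X)\,]
\end{align*}
```

Read from top to bottom, each level averages the one above it over a coarser conditioning set, which is why a negative ATE can coexist with positive IATEs for some rider-stage profiles.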
Robustness: logistic regression
A classical binary choice model is run as a check. Some inputs are log-transformed (e.g., distance and points) to address scale differences. Results are reported as marginal effects (percentage-point changes), not log-odds.
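For a binary regressor such as the home indicator, a marginal effect in percentage points is usually computed as the average difference in predicted probabilities with the indicator switched on versus off. The sketch below shows that calculation for a logit with made-up coefficients and rows. None of the numbers come from the paper, and the use of log(1 + points) to handle zero points is an assumption.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def average_marginal_effect(rows, coefs, intercept, treat_name):
    """Average change in predicted probability when treat_name goes 0 -> 1."""
    diffs = []
    for row in rows:
        # Linear index from all covariates except the treatment indicator.
        base = intercept + sum(
            coefs[name] * value
            for name, value in row.items()
            if name != treat_name
        )
        p_treated = sigmoid(base + coefs[treat_name])
        p_untreated = sigmoid(base)
        diffs.append(p_treated - p_untreated)
    return sum(diffs) / len(diffs)


# Hypothetical coefficients and covariate rows, for illustration only.
coefs = {"home": -0.3, "log_points": 0.4}
intercept = -2.5
rows = [
    {"home": 0, "log_points": math.log1p(0)},
    {"home": 1, "log_points": math.log1p(50)},
    {"home": 0, "log_points": math.log1p(400)},
]

ame_home = average_marginal_effect(rows, coefs, intercept, "home")
```

Because the logistic curve is nonlinear, the same log-odds coefficient translates into a different probability change for each row, which is why the effect is averaged over observations rather than read off the coefficient.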
4) Main findings
Average effect: home is (slightly) bad on average
The MCF estimate of the average treatment effect is -0.041, interpreted as a 4.1 percentage-point reduction in the probability of finishing Top 15 when the stage is in the rider's home country. The reported p-value is 0.307% (that is, about 0.003), so the estimate is statistically significant at the 1% level.
In plain terms: Riders racing "at home" are, on average, less likely to achieve a strong stage result.
Heterogeneity: the average masks wide dispersion
Individualized effects vary substantially. Two summary facts highlighted in the results:
- About 9.98% of individualized treatment effects (IATEs) are positive - i.e., for a minority, home conditions look beneficial.
- About 37.65% of riders exhibit IATEs significantly different from zero at the 5% level (suggesting many riders are meaningfully affected, not just a small tail).
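The two summary shares above can be computed from a vector of estimated IATEs and their standard errors, as in the sketch below. It assumes a two-sided normal-approximation test for each effect, and the input numbers are made up for illustration.

```python
from statistics import NormalDist


def iate_summary(iates, std_errors, alpha=0.05):
    """Return (share of positive IATEs, share significant at level alpha)."""
    # Two-sided critical value under a normal approximation (about 1.96).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    n = len(iates)
    share_positive = sum(effect > 0 for effect in iates) / n
    share_significant = sum(
        abs(effect / se) > critical
        for effect, se in zip(iates, std_errors)
    ) / n
    return share_positive, share_significant


# Made-up individualized effects and standard errors.
iates = [-0.08, -0.05, -0.01, 0.02, -0.04]
std_errors = [0.02, 0.02, 0.02, 0.03, 0.03]

share_positive, share_significant = iate_summary(iates, std_errors)
```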
The paper links this dispersion to psychological pressure mechanisms rather than terrain mechanics: crowd presence can raise expectations and attention, turning "support" into performance pressure.
5) Who is hit hardest? Patterns in treatment-effect heterogeneity
Prior success makes "home" more costly
The strongest continuous pattern reported is between IATEs and prior-year PCS points. Riders with higher points (a proxy for recent success and public exposure) show more negative home effects. The proposed mechanism: higher achievement increases expectations and scrutiny, raising the likelihood of choking or over-commitment under home attention.
Age: variance concentrated among mid-young riders
Ages 24-30 show higher variance in individualized effects. Older riders' effects "converge" (fewer extreme outliers), consistent with selection and experience: riders who remain Tour-level into older ages are likely more resilient to pressure, while early-career entrants who reach the Tour may already be highly supported and unusually strong.
Stage distance: not the main moderator, but crowd density may matter
Stage distance itself does not show a strong monotone pattern. However, the results note clustering that plausibly reflects time trials (often in more audience-dense settings), with a fitted relationship suggesting lower average IATEs for TT-like clusters - consistent with the crowd-pressure channel.
Group results (GATEs)
Captains vs. non-captains
- Non-captains: home effect ~ -0.034 (p ~ 1.19%).
- Captains: home effect ~ -0.052 (p ~ 1.94%).
- The difference between the two is not statistically distinguishable, so captain status is not treated as a clean moderator.
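A back-of-the-envelope check of the "not distinguishable" statement is possible from the reported numbers alone. The sketch below backs out implied standard errors from the rounded estimates and p-values, then tests the difference. It rests on several assumptions that the paper does not confirm: the reported p-values are two-sided, a normal approximation holds, and the two group estimates are independent. It is not the test the paper itself runs.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def implied_se(estimate, p_two_sided):
    """Standard error implied by an estimate and its two-sided p-value."""
    z = STD_NORMAL.inv_cdf(1 - p_two_sided / 2)
    return abs(estimate) / z


def difference_test(est_a, se_a, est_b, se_b):
    """z-test for est_a - est_b, assuming the two estimates are independent."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return diff, se_diff, p_value


# Reported GATEs and p-values (p-values converted from percent).
gate_non_captain, p_non_captain = -0.034, 0.0119
gate_captain, p_captain = -0.052, 0.0194

se_non_captain = implied_se(gate_non_captain, p_non_captain)
se_captain = implied_se(gate_captain, p_captain)
diff, se_diff, p_diff = difference_test(
    gate_captain, se_captain, gate_non_captain, se_non_captain
)
```

Under these assumptions the captain estimate is noisier than the non-captain one, and the gap of about 1.8 percentage points is small relative to its implied standard error, which is in line with the statement that the two groups cannot be told apart.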
Rider specialty
Specialties are derived from PCS profile points (sprinter / hilly / mountainous types). Reported group effects:
- Hilly specialists: ~ -0.035 (p ~ 0.32%).
- Mountain specialists: ~ -0.062 (p ~ 0.02%).
- Sprinters: slightly positive (~ +0.03) but not statistically compelling in the reported comparison.
The interpretation offered is strategic and behavioral: sprint stages are often the most spectator-dense (urban centers, major roads). Sprinters who are selected and succeed in that niche may be precisely those better prepared for the "audience" environment, whereas mountainous terrain can reduce crowd accessibility, changing the pressure profile and who is exposed.
Stage profile (parcours)
Grouping by stage profile yields negative effects that are individually significant, but the differences across profiles are not statistically distinct. This is taken as evidence that "home" operates more through rider-specific channels than through the stage's external physical conditions.
Team-home status
When riders are home riders but not on a home team, they show a negative home effect (reported around -0.041). Riders on a home team do not show the same home penalty. Differences relative to the overall ATE are not sharply estimated, but the pattern is consistent with buffering: teams "at home" may provide better logistics, insulation, or expectation management.
6) Robustness via logistic regression: what matches, what does not
The logistic regression is positioned as a check with more familiar interpretability. The reported marginal effects (percentage points) align with standard cycling priors on performance determinants:
- Prior-year rider points: strongly positive association with Top-15 probability.
- Captain status: positive association (consistent with team support directed toward leaders).
- Age: negative association (older riders less likely to hit Top-15, holding other factors fixed).
- Stage profile icon: negative association (harder profiles reduce Top-15 probability broadly).
- Tour completion share: small positive association in the table (selection and conditioning effects can drive this).
- TT match and specialty match (current stage): strongly positive; specialty match in previous stage negative; next stage match mildly negative - consistent with energy allocation across adjacent stages.
On the "home" variables specifically:
- Rider home country: negative sign but not statistically significant in the logistic specification (reported p-value ~ 0.154).
- Team home country: negative and statistically significant.
The authors treat the lack of significance on rider_home_country in the parametric logit as a limitation of the logit structure for capturing the effect, and as an argument for the flexibility of MCF (nonparametric conditional expectation, richer heterogeneity).
7) Interpretation: "home" as pressure, not a tailwind
The paper's preferred mechanism is psychological. Crowd support is ambiguous: it can be encouragement, but it can also be a demand for performance. In high-skill settings with narrow margins, heightened attention can increase errors, distort pacing, or push riders into strategically inefficient efforts.
Three pieces of evidence inside the results are consistent with that channel:
- The home penalty becomes more negative for riders with stronger recent performance signals (prior points), where expectations should be highest.
- Heterogeneity is large; the effect is not a uniform "travel/familiarity" mechanism.
- Stage-profile differences are not the main driver; rider-type differences dominate.
8) Bottom line
Across Tours 2020-2024, racing "at home" (stage country matches rider nationality) is associated with a lower probability of a strong stage result, not a higher one, once the analysis conditions on a broad set of rider, team, and stage features and trims to common support.
The average penalty is modest in absolute size (about four percentage points in Top-15 probability), but it is not uniform: effects are dispersed, systematically worse for riders with stronger recent success signals, and plausibly consistent with performance pressure rather than geographic familiarity.