
Methodology

Join the Excel LADZ Patreon to download models for Soccer, NBA, NFL, MLB and more - featuring spreadsheets that calculate single-game probabilities and compare them to live bookmaker odds.

Introduction

I believe in absolute transparency, especially when it comes to probability - the foundation of how risk is assessed and taken. Here, I’ll recount exactly how I built my model to forecast the 2026 FIFA World Cup and the reasoning behind it.

 

Hopefully, this helps you understand both the advantages and disadvantages of the model.

Team vs Player Ratings

The central component of the model is the estimation of attack (ATT) and defence (DEF) ratings for each country. These parameters are used to calculate expected goals (xG) for every match. 

 

A higher attacking rating increases the number of goals a team is expected to score, while a stronger defensive rating reduces the expected goals of its opponent. Together, these ratings allow the model to produce realistic goal expectations that form the foundation for match simulations and probability calculations.

 

In this model, I’ve decided to calculate the final ATT and DEF ratings based on two components.

 

The first, representing 80% of the ratings, is derived from historical match performance. This reflects how the team has actually performed against opposition and therefore captures tactical structure, coaching quality, and the ability of the squad to perform collectively.

 

The remaining 20% is based purely on the current lineup and roster strength of each country. This component accounts for the underlying quality of players available to the national team. By incorporating player-level information, the model can respond more quickly to changes in squad quality while still anchoring the ratings primarily to observed match outcomes.

 

Below, I’ll go through the calculation of the team-based ratings first. Then, I’ll move onto the player-based ratings, before they are combined and used to project each game in the World Cup.

ELO-Based Model

The first step in the model is to estimate the underlying strength of every national team. To do this, I implemented an ELO rating system, which assigns a numerical rating to each team based on their historical match results and updates that rating after every game.

 

The objective of this step is to produce a consistent, data-driven ranking of international teams that can be used as the starting point for the tournament simulations.

 

A major challenge when attempting to rank national teams is that teams face very different strengths of schedule. For example, a European nation may frequently play against top-tier opponents in competitive qualifiers and Nations League matches, while another nation might mostly face weaker opposition in regional competitions. 

 

Simply looking at win percentages or goal differences can therefore be misleading, because those statistics do not account for the strength of the opponent. A team that wins frequently against weak opposition may appear stronger than a team that performs reasonably well against elite teams.

 

The ELO system addresses this problem by updating ratings based on both the result of the match and the rating of the opponent. When two teams play, their current ratings are compared to determine how surprising the result is. 

 

After the match, the ratings of both teams are adjusted depending on the difference between the actual result and the expected result derived from the rating gap.

 

The rating update follows the standard ELO adjustment formula:

Rn = Ro + K × (W − We)

where Rn is the new rating after the match and Ro is the team’s rating prior to the match.

 

Next, the parameter W represents the actual result of the match, taking a value of 1 for a win, 0.5 for a draw, and 0 for a loss. 

 

The term We ​represents the expected result based on the difference in ratings between the two teams. And finally, the factor K controls the magnitude of the rating adjustment and reflects the importance of the match being played.

 

In the ELO framework, different competitions are assigned different K values to reflect their relative importance. In constructing the model, I took inspiration for these weighting parameters from the World Football ELO Ratings methodology (source: https://www.eloratings.net/).

 

Below is the complete list of the K values assigned to each competition included in the dataset. In simple terms, more important matches are given larger K values so that they have a greater influence on team ratings - for example, FIFA World Cup matches are worth 60 points, while friendly matches are always worth 20 points because they are less competitive and carry lower stakes.

Also from the ‘World Football ELO Ratings’ website, each country’s starting ELO is listed in a table below.

 

This is an expectation of the team’s strength before the first game they had ever played. Most countries have played so many games that their starting ELO is insignificant. Note that some countries/areas do not exist anymore.

The rating update is also adjusted to reflect the goal margin in the match. If a team wins by two goals, K is increased by half (a multiplier of 1.5). If a team wins by three goals, the increase becomes 3/4 (a multiplier of 1.75), and if the victory margin is four goals or greater, the increase becomes 3/4 + (N − 3)/8, where N represents the goal difference.

​

This modification allows the rating system to capture additional information contained in the scoreline while preventing extremely large rating changes from very high scoring matches.
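As a sketch, the rating update and the margin-adjusted K factor can be combined as follows. The expected-result expression and the specific K values are taken from the eloratings.net methodology the article cites; treat them as assumptions rather than the model's exact spreadsheet logic.

```python
# Margin-adjusted ELO update, following the weights described in the text.
def margin_multiplier(goal_diff: int) -> float:
    """Scale K by the victory margin: +1/2 for a two-goal win, +3/4 for
    three goals, +3/4 + (N - 3)/8 for margins of four or more."""
    n = abs(goal_diff)
    if n < 2:
        return 1.0
    if n == 2:
        return 1.5
    if n == 3:
        return 1.75
    return 1.75 + (n - 3) / 8

def expected_result(r_team: float, r_opp: float) -> float:
    """Standard ELO expectation We from the rating gap (eloratings.net form)."""
    return 1 / (1 + 10 ** ((r_opp - r_team) / 400))

def update_rating(r_old, r_opp, goals_for, goals_against, k):
    """Rn = Ro + K_adj * (W - We), with K scaled by the goal margin."""
    w = 1.0 if goals_for > goals_against else 0.5 if goals_for == goals_against else 0.0
    we = expected_result(r_old, r_opp)
    k_adj = k * margin_multiplier(goals_for - goals_against)
    return r_old + k_adj * (w - we)
```

For example, a 1,500-rated team beating an identically rated opponent 1-0 in a World Cup match (K = 60) gains 30 points.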

 

As of the 15th of April, 2026, the model produces the following ELO ratings for each national team, based on the full historical dataset of international matches processed through the rating-update framework described above. These ratings represent the model's estimate of each team's current strength entering the tournament simulations.

Expected Goals

Naturally, teams with higher ELO ratings are considered stronger, as they have consistently achieved better results against strong opposition over time.

Although the ELO model provides a strong method for ranking national teams, it is not suitable for generating expected goals (xG) in a specific matchup. This is because the system produces a single rating representing overall team strength, and as a result, does not distinguish between attacking and defensive ability.

​

In modelling football games, this distinction is critical because two teams with identical ELO ratings may play very differently. For example, one team may rely on a strong attack but concede many goals, while another may have a strong defensive structure but score relatively few goals. 

​

Since ELO compresses all performance into one rating, it cannot capture these differences. To address this limitation and estimate the ability gap between teams in terms of goals scored, I analysed the historical performance of teams grouped into ELO rating buckets of 50-point intervals. 

​

For each matchup between rating buckets, I examined the historical results (from the year 2000 onwards) to determine the distribution of goals scored by each side. This allows the model to translate differences in ELO ratings into an estimate of the multiple of goals stronger teams typically score against weaker opposition.

 

The results of this analysis are summarised in the table below.

In general, as expected, larger differences in ELO ratings correspond to higher average goal outputs for the stronger team.

​

However, the raw averages derived from the historical data do not form a perfectly smooth relationship between ELO difference and expected goals.

​

For example, a team with an ELO of 2,000 has historically scored at 1.38 times the rate of a side with an ELO of 1,950. Yet this multiple drops to 1.32 against an even weaker side rated 1,900.

​

How does this make sense???

​

Well, it’s purely to do with a small sample size. Only 29 games were played between teams with an ELO of approximately 2,000 and 1,950 since 2000. This figure becomes 39 when considering a matchup between teams with an ELO rating of 2,000 and 1,900.

​

Ultimately, these small samples create high variance, making the averages (by themselves) unreliable.

 

As a result, I’ve chosen to smooth the data in order to find a stable distribution of expected goals as a function of a team’s ELO rating.

Least Squares Estimation of the ELO-Goals Function

To implement this smoothing, I've divided the ELO rating range into three ELO intervals:

​

  • 1700 - 2050

  • 1200 - 1700

  • 400 - 1200

​

I'll explain the process using the 1200 - 1700 ELO interval as an example. First, the goal multipliers for teams within this ELO range were extracted from the table constructed earlier.

When two teams have identical ELO ratings, the multiplier is equal to 1, meaning each team is expected to score as many goals as each other. As the ratings gap increases, the multiplier adjusts to reflect the change in relative strength.

​

For example, a 1200-rated team facing a 1700-rated opponent is expected to score 0.15 times the goals of their opponent. On the other hand, a 1700-rated team facing a 1200-rated opponent has the reciprocal multiplier of 6.67, reflecting its substantially stronger scoring expectation.

 

Next, to ensure the multipliers are comparable across each ELO level, the array is then normalised for each rating. This is done by dividing every value in the array by the corresponding value in the first row, which represents the baseline scoring multipliers for a 1700-rated team against each opponent rating.

By performing this division, the first row is transformed so that all of its values equal 1, establishing a common reference point for the entire interval. The resulting matrix therefore expresses relative scoring strength within each interval, rather than absolute goal multipliers.

 

Finally, the finished matrix records how the average goals scored at each ELO rating change as the opponent becomes weaker, which is shown moving down the rows of the matrix.
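As a minimal illustration of this normalisation step, here is the row-wise division in numpy. The multiplier values below are invented for the example and are not the model's actual table.

```python
import numpy as np

# Toy 3x3 block of goal multipliers (rows and columns are descending ELO
# buckets); the numbers are illustrative, not the article's real data.
multipliers = np.array([
    [1.00, 1.40, 2.10],   # 1700-rated team vs 1700 / 1650 / 1600 opposition
    [0.71, 1.00, 1.50],   # 1650-rated team
    [0.48, 0.67, 1.00],   # 1600-rated team
])

# Divide every value by the corresponding value in the first (anchor) row,
# so the 1700-rated team's row becomes all ones and the remaining rows
# express relative scoring strength within the interval.
relative = multipliers / multipliers[0]
```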

The next step is to convert these discrete relative strengths into a smooth exponential function of ELO. For the 1200-1700 interval, the average of each row in the normalised matrix is first calculated.

 

This produces a single estimate of the relative scoring strength associated with each ELO bucket.

For example, a team with an ELO of 1650 has an average relative strength of 0.851, while a team with an ELO of 1200 has an average relative strength of 0.156. Since the 1700 ELO team is chosen as the anchor, its relative strength is fixed at 1.

 

To fit an exponential curve, the ELO ratings are then centred on the anchor point of 1700. Hence, if R denotes the ELO rating, the centred ELO variable is:

x = R − 1700

This gives x = 0 at 1700, x = -50 at 1650, and so on down to x = -500 at 1200.

​

The modelling assumption is that relative scoring strength follows an exponential form:

S(R) = exp(beta × x) = exp(beta × (R − 1700))

where S(R) is the relative strength at rating R, and beta is the exponential decay parameter to be estimated. Because S(1700) = 1, this specification automatically satisfies the anchor condition:

S(1700) = exp(beta × 0) = 1

To estimate beta, the model is linearised by taking natural logarithms. Let “Ar” denote the observed average relative strength for rating R. Then:

ln(Ar) = beta × x + epsilon

where epsilon is the fitting error for that ELO bucket. Now, a Log ATT column is calculated as the natural log of the average values.

Since the exponential curve must pass through (x,S) = (0,1), the fitted line in log-space is constrained to pass through the origin. Therefore, no intercept is estimated, and the least squares problem becomes:

minimise over beta:  Σi (yi − beta × xi)²

where

yi = ln(Ari)  and  xi = Ri − 1700

The least squares estimator for a regression through the origin is:

beta-hat = Σi (xi × yi) / Σi (xi²)

This formula finds the value of beta that best fits the data by minimising the total squared error between the observed values yi and the values predicted by the line beta × xi.

 

Using the values in the table yields the fitted slope:

[Equation image: the fitted slope beta-hat for the 1200 - 1700 interval]

This value appears in the LSS column. It represents the exponential rate at which scoring strength declines as ELO decreases within this interval.

 

Since centred ELO values are negative below 1700, multiplying by a positive beta produces negative exponents, which ensures that relative strength falls below 1 as ELO declines.

The final smoothed strength function is therefore:

S(R) = exp(beta-hat × (R − 1700))

This function translates any ELO rating in the 1200 - 1700 range into a smoothed relative scoring strength. For example, for R = 1650:

S(1650) = exp(beta-hat × (1650 − 1700)) = exp(−50 × beta-hat)

This appears in the Relative Strength column. The key advantage of this approach is that it removes the irregularities caused by noisy historical averages and replaces them with a stable monotonic function.
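The whole interval fit can be sketched in a few lines. Only the two bucket averages quoted above (0.851 at ELO 1650 and 0.156 at ELO 1200) are used here, so the resulting beta-hat is illustrative, not the model's actual LSS value.

```python
import math

# (centred ELO, observed average relative strength) pairs from the text.
data = [(-50, 0.851), (-500, 0.156)]

xs = [x for x, _ in data]
ys = [math.log(s) for _, s in data]   # linearise: y = ln(strength)

# Least squares through the origin: beta_hat = sum(x*y) / sum(x*x).
beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def strength(r: float) -> float:
    """Smoothed relative strength S(R) = exp(beta_hat * (R - 1700))."""
    return math.exp(beta_hat * (r - 1700))
```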

​

The same procedure was then repeated for the other ELO intervals. For each range, the relative scoring averages were calculated, transformed using the natural logarithm, and a least squares regression through the origin was applied to estimate the exponential decay parameter.

 

Once the parameters were obtained, I used the fitted exponential functions to project team strength values beyond the observed ranges - specifically, for ELO ratings in the 2050 - 2200 and 400 - 700 ranges, using the decay parameters 0.00335 and 0.00462 respectively.

These projections extend the model smoothly into the tails of the rating distribution while maintaining the same functional relationship estimated from the data. Note that the tails were excluded from the initial estimation due to extremely small sample sizes.

 

Throughout this process, the anchor condition S(1700) = 1 was preserved, meaning that all strengths are expressed relative to a team with an ELO rating of 1700. Using the estimated exponential functions, any ELO rating can therefore be translated into a corresponding relative scoring strength.

Attack and Defensive Adjustment

Next, I’m going to add a small attacking and defensive correction so the final xG values for a match are more realistic.

 

The idea is that raw team strength alone does not fully capture how a country plays. Two teams with similar overall strength may get there in different ways: one may be more dangerous going forward, while another may be more solid defensively.

​

Firstly, my sample for this adjustment included a country’s matches from 2023 to 2026. For every game, a team’s xG was calculated using the formula:

[Equation image: per-match xG as a function of Si, Sj and the 1.2-goal baseline]

where:

​

  • Si = the strength rating of the country

  • Sj = the strength rating of the opponent

  • 1.2 = the average number of goals scored by a team in an international match

​

Next, each team’s attacking and defensive over-performance is measured. Attacking over-performance is calculated as the difference between actual goals scored and expected goals. For each match:

att_delta = goals scored − xG

If the country scores more goals than expected in a match, its attacking delta is positive; if it scores fewer than expected, the delta is negative. A team's attacking delta is then averaged across all games in the sample.

avg_att_delta = Σ att_delta / number of matches

The same thing is then done for a country's defence, but from the perspective of goals conceded versus opponent xG:

def_delta = opponent xG − goals conceded

For example, if the opponent was expected to score 0.8 xG but scored 2, then the defensive delta for that match would be -1.2. This indicates poor defending. Averaging gives:

avg_def_delta = Σ def_delta / number of matches

The combined performance measure is defined as:

[Equation image: av_delta, combining avg_att_delta and avg_def_delta]

The “av_delta” metric becomes positive when a team’s attacking over-performance exceeds its defensive over-performance, indicating that it is relatively stronger going forward than would be expected given its overall strength.

​

Conversely, if “av_delta” is negative, it suggests the team’s defensive performance is stronger relative to its attack, meaning it prevents goals more effectively than it creates them.

​

Finally, the “av_delta” figure can be turned into a multiplicative adjustment.

[Equation images: the ATT and DEF multipliers derived from av_delta, capped at 0.8 and 1.25]

The square root transformation ensures that a team’s total strength is unchanged; a value multiplied by its reciprocal is exactly 1. Furthermore, the limits of 0.8 and 1.25 ensure that there are not any wild swings caused by a lack of sample data.
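The delta calculations can be sketched as below. The article does not show the exact mapping from av_delta to the multipliers, so the av_delta combination and the `1 + av_delta` term inside the square root are hypothetical stand-ins; only the delta definitions, the reciprocal DEF multiple, and the 0.8 / 1.25 caps come from the text.

```python
import math

def match_deltas(goals_for, xg_for, goals_against, xg_against):
    """Per-match over-performance: positive att_delta means the team scored
    more than expected; positive def_delta means it conceded less than the
    opponent's xG."""
    att_delta = goals_for - xg_for
    def_delta = xg_against - goals_against
    return att_delta, def_delta

def att_def_multipliers(matches):
    """`matches` is a list of (goals_for, xg_for, goals_against, xg_against).
    The av_delta combination and the sqrt(1 + av_delta) scaling below are
    assumptions; the caps and the reciprocal DEF multiple follow the text."""
    n = len(matches)
    avg_att = sum(gf - xf for gf, xf, _, _ in matches) / n
    avg_def = sum(xa - ga for _, _, ga, xa in matches) / n
    av_delta = (avg_att - avg_def) / 2            # assumed combination
    raw = math.sqrt(max(1 + av_delta, 0))         # assumed scaling
    att_mult = min(max(raw, 0.8), 1.25)           # caps from the article
    return att_mult, 1 / att_mult                 # reciprocal keeps strength fixed
```

The reciprocal return value is what guarantees the property described above: the ATT multiple times the DEF multiple is always exactly 1.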

Team-Based ATT & DEF Ratings

These ATT & DEF multipliers are then applied to the Unadjusted ATT & DEF Ratings. These unadjusted ratings are derived from a team’s total strength.

[Equation images: the Unadjusted ATT and DEF ratings derived from total team strength]

The final Team ATT Rating involves multiplying the ATT Multiple by the Unadjusted ATT. On the other hand, the final Team DEF Rating involves dividing the Unadjusted DEF by the DEF Multiple.

Player-Based ATT & DEF Ratings

To complement the performance-based ratings, a second layer of team strength is derived purely from player quality, using squad market values from Transfermarkt.

​

For each World Cup, the following variables are collected at the team level: country, average squad market value, relative market value (team market value divided by the tournament average), finishing ELO rating, relative team strength (derived from ELO), and year. This dataset allows us to quantify how squad value translates into on-field performance.

​

A linear regression is then fitted, with relative market value on the x-axis and relative strength on the y-axis. The fitted line has an R-squared value of 0.5905, indicating a moderate but meaningful relationship between squad value and team strength.

​

Using this linear model, the 64 teams in World Cup contention are assigned projected strength ratings. For each team, their average market value is converted into a relative market value (market value divided by the average market value of the competition), which is then input into the regression equation to estimate their market value-based strength.

 

Finally, these projected strength ratings are decomposed into ATT and DEF components using the transformation mentioned above, in the previous section.

Final ATT & DEF Ratings

The final ATT and DEF ratings are constructed as a weighted combination of the team-based ratings (derived from historical performance) and the player-based ratings (derived from squad market values).

​

A weighting of 80% is assigned to the team-based ratings, with the remaining 20% allocated to the player-based ratings. This reflects my view that actual on-field performance is more predictive of future outcomes, while still allowing squad quality to influence the final ratings.

Expected Goals

The entire purpose of constructing ATT and DEF ratings for each country is to translate team strength into expected goals (xG) for any given matchup. These ratings provide a structured way to estimate how many goals each team is likely to score against a specific opponent.

 

For a given team, expected goals are calculated as:

xG = ATT × Opp. DEF × Average Goals × Home Advantage

In this model:

​

  • ATT represents the team's attacking strength

  • Opp. DEF represents the opponent's defensive strength

  • Average Goals is set to 1.35, based on historical scoring averages from recent World Cups

  • Home Advantage is applied where relevant.

​

Home advantage is incorporated as a multiplicative factor of the square root of 1.12 for the home team, with the reciprocal applied to the away team. This creates a total advantage of 12% for the home team. This value was derived from the international match dataset post 2000, comparing the average goals scored by home and away sides.

​

Importantly, the home advantage is only applied in specific contexts:

​

  • During World Cup qualifying matches, where true home conditions exist.

  • At the World Cup itself, but only for host nations - USA, Mexico and Canada
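Putting these pieces together, a minimal xG helper might look like the sketch below, assuming the multiplicative structure described above; the function name and the `home` flag are illustrative, not the spreadsheet's actual layout.

```python
import math

AVERAGE_GOALS = 1.35           # World Cup scoring average used by the model
HOME_FACTOR = math.sqrt(1.12)  # home side multiplied, away side divided

def expected_goals(att, opp_def, home=None):
    """Multiplicative xG: ATT x Opp. DEF x Average Goals, with the home
    advantage applied only where the article says it applies (qualifiers
    and host nations). `home` is True for the home side, False for the
    away side, and None for neutral venues (no adjustment)."""
    xg = att * opp_def * AVERAGE_GOALS
    if home is True:
        xg *= HOME_FACTOR
    elif home is False:
        xg /= HOME_FACTOR
    return xg
```

Note that the square-root split means the ratio of home xG to away xG works out to exactly 1.12, matching the 12% advantage stated above.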

Simulating Goals

Once each team’s expected goals have been calculated, match outcomes in the World Cup can be simulated by treating goal scoring as a Poisson process over 90 minutes. Under this approach, the number of goals scored by a team is assumed to follow a Poisson distribution with mean equal to that team’s expected goals value, lambda.

​

In the spreadsheet, this is implemented using Excel’s BINOM.INV function. The formula simulating the number of goals scored treats a match as 10,000 small goal-scoring trials, each with probability xG/10,000. 

​

Mathematically, this returns the smallest number of “successes” (goals) such that the cumulative binomial probability exceeds a random draw from a uniform distribution. As the number of trials becomes large and the probability becomes small, this binomial setup converges to a Poisson distribution with mean xG, allowing us to simulate realistic goal counts.
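A Python mirror of this BINOM.INV construction, assuming the 10,000-trial setup described above: it returns the smallest goal count whose cumulative binomial probability exceeds a uniform draw.

```python
import random

def simulate_goals(xg: float, trials: int = 10_000, rng=random) -> int:
    """Mirror of BINOM.INV(trials, xg/trials, RAND()): return the smallest
    goal count whose cumulative binomial probability exceeds a uniform
    draw. With many tiny trials this converges to Poisson(xg)."""
    p = xg / trials
    u = rng.random()
    pmf = (1 - p) ** trials        # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < u and k < trials:
        # Binomial recurrence: P(k+1) = P(k) * (n-k)/(k+1) * p/(1-p)
        pmf *= (trials - k) / (k + 1) * p / (1 - p)
        k += 1
        cdf += pmf
    return k
```

Averaged over many simulations, the sampled goal counts converge on the xG value supplied, just as the spreadsheet's formula does.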

​

While we are using the Poisson distribution due to its simplicity and interpretability, it relies on several key assumptions:

​

  1. Goals occur independently.

  2. Goals occur at an average rate throughout the match.

  3. Two goals cannot occur at exactly the same time.

​

However, of course, football games in real-life do not perfectly satisfy the first two assumptions. Goal scoring is not truly independent, and the scoring rate is not constant throughout a match, as teams adjust tactics based on the scoreline, game context, red cards, fatigue, and time remaining.

 

When the appropriateness of the Poisson distribution is tested statistically, these limitations do show up. The Poisson distribution, using the dataset's mean of 1.36 goals, slightly underestimates both low-scoring outcomes (0 goals) and high-scoring outcomes (6+ goals). The table below compares the observed number of goals scored to the expected values under the Poisson model for all matches since 2000 (excluding games with a margin greater than 10 goals).

This table shows that goals scored in international football fixtures are not perfectly modelled by the Poisson distribution. Despite this imperfection, I've chosen the Poisson distribution because it still captures the overall shape of football scorelines and is very simple to implement.

​

Next, if a knockout match is tied after 90 minutes, the simulation then moves to extra time. An additional 30 minutes is simulated by scaling each team’s expected goals to a third of what they would be during normal time.

​

Goals in extra time are then simulated in the same way as during normal time. If the teams are still level after extra time, the match proceeds to penalties. In the model, penalties are treated as a pure coin flip, so each team is given a 50% chance of progressing, regardless of their expected goals or underlying strength ratings.

 

This is a deliberate simplifying assumption, reflecting the high randomness of penalty shootouts and avoiding the need to impose a separate penalty skill model.
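The knockout logic can be sketched as follows, with `sample_goals` standing in for any Poisson-style goal sampler; the function shape is illustrative, not the spreadsheet's implementation.

```python
import random

def knockout_winner(xg_a, xg_b, sample_goals, rng=random):
    """Resolve a knockout tie: 90 minutes, then extra time at one third of
    each side's xG, then a 50/50 shootout. `sample_goals` is any function
    mapping an xG value to a simulated goal count."""
    a, b = sample_goals(xg_a), sample_goals(xg_b)   # normal time
    if a == b:                                      # extra time, xG scaled to 1/3
        a += sample_goals(xg_a / 3)
        b += sample_goals(xg_b / 3)
    if a == b:                                      # penalties: pure coin flip
        return "A" if rng.random() < 0.5 else "B"
    return "A" if a > b else "B"
```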

Conclusion

Simulating individual matches represents the final stage of the model, bringing together the ATT and DEF ratings, expected goals framework, and goal distributions. By extending this process across an entire tournament and repeating it 10,000 times, the model generates a full distribution of outcomes, allowing us to estimate each team’s probability of finishing in any given position.
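As an illustration of that repetition, here is a minimal Monte Carlo that replays a single match 10,000 times and reads off outcome frequencies; it uses a textbook Knuth Poisson sampler as a stand-in for the spreadsheet's BINOM.INV approach, and the function names are illustrative.

```python
import math
import random
from collections import Counter

def match_probabilities(xg_a, xg_b, n_sims=10_000, seed=42):
    """Estimate win/draw/loss probabilities for team A by simulating the
    same match n_sims times with Poisson-distributed goal counts."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's algorithm: multiply uniforms until falling below e^-lambda.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

    counts = Counter()
    for _ in range(n_sims):
        a, b = poisson(xg_a), poisson(xg_b)
        counts["win" if a > b else "draw" if a == b else "loss"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}
```

Running this for two xG values produces the kind of outcome distribution the full tournament simulation aggregates across every fixture.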

 

Thanks for reading!
