Expected Goals model: an alternative approach
31/03/16 | SciSports
Goals in football matches are rather unpredictable events and the final result does not always reflect which team performed best. To get a better reflection of the real performance of teams one can look at the total shots, the shots on target and even (when available) the number of dangerous attacks. However, shot quality is not reflected in match statistics. A method widely used for evaluating shot quality is the calculation of expected goals. Unfortunately, there is no standard method to calculate this.
In recent years various expected goals models have been developed, ranging from very basic models to complex ones that take many variables into account. Our model focuses mainly on the strength and skill of the defensive line and of the individual attackers. This differs from most other models, but it can also produce acceptable results, as this article will show.
Factors included in the model
The first factors that come to mind are the distance to the goal and the angle between the shot position and both goal posts (or, more precisely, the angle with the projected point of origin, which takes swerve into account). At Cardiff City they might disagree, but the data show that shooting from long range and at a small angle is highly ineffective.
Defensive positioning is a key factor in the success rate of goal attempts. A goalkeeper, defenders and the occasional teammate blocking part of the goal area significantly reduce the chance of scoring. Currently, the datasets freely accessible from the internet contain no positional information other than that of the player in possession of the ball. It might be possible to estimate the position of the goalkeeper from the provided data, but reconstructing the positions of defenders is currently impossible. Therefore we can only assume that:
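As a minimal sketch of these two factors, the snippet below computes the distance to the centre of the goal and the angle subtended by the two posts. The coordinate convention (x measured from the goal line, y from the centre of the goal, both in metres) and the function name are assumptions made for illustration, and swerve is ignored, so this is the plain geometric angle rather than the projected one described above.

```python
import math

GOAL_WIDTH = 7.32  # metres, standard distance between the posts


def shot_distance_and_angle(x, y):
    """Return (distance, angle) for a shot taken at (x, y).

    Assumed convention: x is the distance from the goal line and y the
    lateral offset from the centre of the goal, both in metres.
    The angle is in radians and ignores swerve.
    """
    distance = math.hypot(x, y)
    # Direction to each post, measured from the line perpendicular to the goal line.
    to_far_post = math.atan2(y + GOAL_WIDTH / 2, x)
    to_near_post = math.atan2(y - GOAL_WIDTH / 2, x)
    angle = abs(to_far_post - to_near_post)
    return distance, angle


# Penalty spot versus a wide position at the same depth.
central = shot_distance_and_angle(11.0, 0.0)
wide = shot_distance_and_angle(11.0, 20.0)
```

The wide position is both further from goal and sees a much narrower opening, which is the combination the data penalise most.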
- Goal attempts from corners will be harder to convert into goals as the box is most likely filled with players.
- Fast transitions from the first third to the final third will result in more space and an out-of-position defence, therefore leading to more successfully converted goal attempts.
- Possession in the final third, with the opposing defence in position, will result in goal attempts being harder to convert.
The value of a striker largely depends on how many goals he scores and most clubs are willing to pay large sums of money for these players (of the 25 most expensive players according to transfermarkt.co.uk, 17 are attackers, only 1 is a goalkeeper and not a single one is a defender). Although strikers may be overvalued, let’s assume that the scoring ability of individual players differs. To illustrate this, you would rather have Messi take a shot from (for example) the edge of the box than Coquelin. We will explain this further later on.
Set pieces are a special case: they differ greatly from open play attempts and can be divided into three main categories: corners, free kicks and penalties. A penalty provides a free shot with a success rate of around 74%, depending on the quality of both taker and stopper. Free kicks provide an opportunity (mostly) outside the box in which part of the goal area is blocked by a wall.
Several other factors that might influence the success rate are briefly discussed in this paragraph, but are currently not used in the model. One of these is the pass leading to the goal attempt. It is plausible that a player in front of goal only passes the ball to a teammate if doing so increases the chance of scoring. In the case of a cross, the quality of the player taking the cross, the cross distance, the defensive positioning and the quality of the player receiving the cross all play a crucial role in a successful conversion.
Furthermore home advantage (see for instance this research article for penalty success rate in multiple competitions), headers, right or left footed shots, rebounds and dribbles are other factors that might have an effect on the outcome of a chance.
For this method, goal attempts from open play, free kicks, corners and penalties are evaluated separately. Furthermore, the ‘open play’ category is split into three subcategories, depending on the lowest position of the ball on the pitch (the point closest to the attacking team's own goal) in the last fifteen seconds before the attempt. These areas are illustrated in the figure below.
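The subcategory rule can be sketched as follows. The event structure, the 0 to 100 pitch scale and the equal-thirds boundaries are assumptions for illustration only; the actual areas are the ones defined in the figure.

```python
from dataclasses import dataclass


@dataclass
class BallEvent:
    t: float  # time of the event in seconds
    x: float  # assumed scale: 0 = own goal line, 100 = opponent's goal line


def open_play_subcategory(events, shot_time, window=15.0):
    """Classify an open play attempt by the lowest ball position in the window.

    Returns None when no ball events fall inside the window.
    The thirds used here are hypothetical equal splits of the pitch.
    """
    recent = [e.x for e in events if shot_time - window <= e.t <= shot_time]
    if not recent:
        return None
    lowest = min(recent)
    if lowest < 100 / 3:
        return "own_third"      # fast break starting deep
    if lowest < 200 / 3:
        return "middle_third"
    return "final_third"        # possession-based play around the box


events = [BallEvent(0.0, 20.0), BallEvent(10.0, 60.0), BallEvent(20.0, 90.0)]
category = open_play_subcategory(events, shot_time=20.0)
```

In this example the ball was at x = 20 more than fifteen seconds before the shot, so only the later two events count and the attempt falls in the middle-third subcategory.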
Base xG Maps
For three of the above-mentioned categories (penalties use a single number) a base xG map is created from approximately 240,000 attempts in seven competitions (Premier League, Bundesliga, La Liga, Serie A, Ligue 1, Eredivisie and the Championship). These base xG maps follow from structuring all available goal attempts and are used as a starting point for both team defensive lines and individual attackers. As mentioned earlier, the ‘open play’ category is divided into three subcategories. Two of these, namely fast breaks (own third) and possession-based play (final third) around the box, are captured in layers that are combined with the ‘open play’ base xG map.
Similar base maps are created for both corners and free kicks. Most of the base xG maps are not yet optimised or properly cleaned up, but they will do for now. These maps will change fairly rapidly and only serve as a starting point, as will be shown below.
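One simple way to build such a base map is to bin the attempts on a grid and take the conversion rate per cell. The sketch below assumes exactly that; the grid size, the input format and the function name are illustrative choices, not a description of how the maps in this article were actually produced.

```python
from collections import defaultdict


def build_base_xg_map(attempts, cell_size=2.0, min_attempts=1):
    """Build a grid-based xG map as goals / attempts per cell.

    attempts: iterable of (x, y, is_goal) tuples, positions in metres
    (assumed format). Cells with fewer than min_attempts are left out.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [goals, attempts]
    for x, y, is_goal in attempts:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell][1] += 1
        if is_goal:
            counts[cell][0] += 1
    return {
        cell: goals / total
        for cell, (goals, total) in counts.items()
        if total >= min_attempts
    }


# Tiny made-up sample, purely to show the data flow.
sample = [(11.0, 0.0, True), (11.5, 0.5, False), (30.0, 10.0, False), (30.5, 11.0, False)]
base_map = build_base_xg_map(sample)
```

Raising min_attempts is a crude way to avoid noisy cells; a production map would more likely need smoothing across neighbouring cells.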
After each match the xG maps for the individual players and the team’s defensive line are updated, by modifying a small area of the specific xG map. This process is shown in the figure below for Messi.
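The article does not specify the update rule, so the following is only a sketch of what modifying a small area of a map could look like: cells near the shot location are nudged towards the observed outcome, with a Gaussian weight that fades with distance. The learning rate, radius, sigma and the dictionary-based map are all assumptions introduced here.

```python
import math


def update_xg_map(xg_map, shot_cell, is_goal, learning_rate=0.05, radius=2, sigma=1.0):
    """Return a copy of xg_map with the area around shot_cell adjusted.

    xg_map: dict mapping (col, row) grid cells to xG values (assumed format).
    Cells within `radius` of the shot move towards 1.0 after a goal and
    towards 0.0 after a miss; cells outside the map are not created.
    """
    outcome = 1.0 if is_goal else 0.0
    cx, cy = shot_cell
    updated = dict(xg_map)
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            cell = (cx + dx, cy + dy)
            if cell not in xg_map:
                continue
            # Weight is 1 at the shot cell and decays with squared distance.
            weight = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
            step = learning_rate * weight
            updated[cell] = xg_map[cell] + step * (outcome - xg_map[cell])
    return updated


player_map = {(5, 0): 0.2, (6, 0): 0.2, (20, 0): 0.05}
after_goal = update_xg_map(player_map, shot_cell=(5, 0), is_goal=True)
```

Because each step moves a value only part of the way towards 0 or 1, the map values stay within that range, and cells far from the shot are left untouched.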
By doing this the specific behaviour of both attackers and defences is captured. The difference between individual players is illustrated in the figure below, in which the specific xG maps for Luis Suarez (FC Barcelona), Zlatan Ibrahimovic (Paris Saint-Germain), Wissam Ben Yedder (Toulouse) and Bafétimbi Gomis (Swansea City) are presented.
The method used to calculate the expected goals per chance is based on the following formula:
In this formula, x and y stand for the position on the pitch, xG is the expected goals value, c is the correction factor for the subcategories in ‘open play’ (for set pieces this value equals 1), xGA is the expected goals value for the player taking the shot at position (x,y), xGD is the value for the defensive line and xGbase is the base xG map value.
The accuracy of this model is calculated by combining all teams for the current (running) season. This results in a root mean square error (RMSE) of 0.1875 for goals scored and 0.1850 for goals conceded (penalties and own goals excluded). Separate competitions will produce slightly different values, but nothing substantial. The scatter plot for goals vs expected goals per game is presented below.
This shows a nice fit, even for the best and worst teams, with PSG being the only real outlier here.
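For reference, the RMSE quoted above can be computed as in the sketch below. It assumes one pair of values per team (goals per game and expected goals per game); the numbers in the example are made up and are not taken from the model.

```python
import math


def rmse(actual, expected):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(expected) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - e) ** 2 for a, e in zip(actual, expected)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical goals per game and xG per game for two teams.
goals_per_game = [1.5, 1.0]
xg_per_game = [1.2, 1.4]
error = rmse(goals_per_game, xg_per_game)
```

Since the errors are squared before averaging, a single team that deviates strongly from its xG, such as the outlier mentioned above, has a disproportionate effect on the result.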
Although this model uses a very limited number of parameters, its results are satisfying. These results, however, apply to long-term estimation of xG values. Short-term (per-match) xG estimates show rather high variance (see the article of Nils Mackay). Therefore, in the near future, other parameters will be added to the model in order to increase its accuracy for both long-term and short-term estimation of xG.