This repository contains the code for my Master's Thesis, where I try to replicate the Moneyball idea (building a team capable of winning the league, but with a restricted budget) using football statistics coming from FBRef Scouting Reports, Transfermarkt and PlayeRank data together.
If you want a visualization of aggregated PlayeRank data, you can visit the app I built at this link: https://alberto-allegri-moneyball2.herokuapp.com/.
The idea behind the thesis is to start from a dataset whose independent variables (our X) are all the statistics found in the FBRef scouting report, and whose dependent variable (our y) is a player's average PlayeRank score over the entire season. The data covers seasons 2017/2018 to 2020/2021 and includes every player who had an FBRef scouting report in those seasons. For each of these players, I had the average PlayeRank score, thanks to the data PlayeRank provided me. Unfortunately, the only dataset I can share is the one with FBRef data, since the PlayeRank data is protected by an NDA.
With the two resulting datasets (one for Goalkeepers and one for outfield players, since the variables in the scouting report differ between Goalkeepers and non-Goalkeepers), I proceeded as follows:
First, I split the data into 6 different sets:
● Goalkeepers
● Centre-Backs
● Full-Backs
● Midfielders
● Wingers
● Strikers
Then, to each of these sets, I applied 7 different variable selection algorithms to identify which FBRef statistics are the most important for predicting the PlayeRank score:
● BorutaPy
● Recursive Feature Elimination
● Univariate Feature Selection
● Lasso Regression
● Feature Shuffling
● Feature Performance
● Target Mean Performance
Once the 7 algorithms had been applied, a variable was included in my final choice if it was selected by at least 5 out of 7 algorithms (the final choice can be seen at this link).
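The voting rule can be illustrated with a minimal Python sketch. The function name `vote_features`, the selector keys and the feature names are hypothetical and only serve the example; the real selections come from the seven algorithms listed above.

```python
from collections import Counter


def vote_features(selections, min_votes=5):
    """Keep the features chosen by at least `min_votes` of the selectors."""
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # each selector gives at most one vote per feature
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical selections for one role (feature names are made up).
selections = {
    "boruta": ["xG", "shots", "touches_box"],
    "rfe": ["xG", "shots", "aerials_won"],
    "univariate": ["xG", "shots", "touches_box"],
    "lasso": ["xG", "touches_box"],
    "feature_shuffling": ["xG", "shots", "touches_box"],
    "feature_performance": ["xG", "shots"],
    "target_mean_performance": ["xG", "touches_box", "aerials_won"],
}

selected = vote_features(selections)
print(selected)  # ['shots', 'touches_box', 'xG']
```

Here `aerials_won` is dropped because only 2 of the 7 selectors picked it.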
Finally, once the variables were selected, I collected the players who won the league from season 2017/2018 to season 2020/2021 in each of the top-5 European leagues (the csv can be found at this link). Then, I calculated the mean values of the important variables within each role and, using the norm function, measured the distance between each observation in my dataset and the average statistics of the players who actually won the league, keeping the closest ones. I applied one restriction to this procedure: the selected players had to have a Transfermarkt valuation of 1/x of the average value of the players in the same role in the winning team, where x can be chosen based on how much money we want to save.
The final result is a full team that costs far less than the winning team, yet performed very similarly on the statistics that matter most for evaluating a player in his role!
To retrieve the team for a precise league/season we just have to run low_cost_winners(season, league, factor_to_save).
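The selection step described above can be sketched as follows. This is not the repository's `low_cost_winners` implementation: the function `closest_affordable_players`, the data layout and the sample values are hypothetical, the distance is assumed to be the Euclidean norm, and the budget rule is read as "valued at no more than 1/x of the winners' average".

```python
import math


def closest_affordable_players(players, winner_mean, winner_avg_value, factor, top_n=3):
    """Rank affordable players by distance to the winners' mean profile.

    A player is affordable when their valuation is at most
    winner_avg_value / factor (an assumed reading of the 1/x rule).
    """
    cap = winner_avg_value / factor
    ranked = [
        (math.dist(p["stats"], winner_mean), p["name"])  # Euclidean distance
        for p in players
        if p["value"] <= cap
    ]
    return sorted(ranked)[:top_n]  # smaller distance = more similar


# Hypothetical players for one role, described by two selected statistics.
players = [
    {"name": "A", "value": 30_000_000, "stats": [1.0, 2.0]},
    {"name": "B", "value": 8_000_000, "stats": [4.0, 6.0]},
    {"name": "C", "value": 5_000_000, "stats": [1.0, 3.0]},
    {"name": "D", "value": 10_000_000, "stats": [1.0, 2.5]},
]

result = closest_affordable_players(
    players, winner_mean=[1.0, 2.0], winner_avg_value=40_000_000, factor=4, top_n=2
)
print(result)  # [(0.5, 'D'), (1.0, 'C')]
```

Player A matches the winners' profile exactly but is excluded by the budget cap of €10.000.000, which is the point of the restriction.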
Within the repository, you can see four main folders:
- 01_Datasets
- 02_Machine Learning
- 03_Data Analytics
- 04_App
In the 01_Datasets folder, you can find the scripts I wrote to build my final dataset. Moreover, in the section Leagues Comparison you can see the code I applied to standardize statistics across leagues, producing the following map of European Leagues:
How was this computed? I took the data on all European matches from the last 12 years (from 2010 to 2022) and generated a contact matrix using the following reasoning, illustrated with an example:
Given four matches between a Serie A team and a Ligue 1 team with these results (Serie A 2-1 Ligue 1, Ligue 1 3-0 Serie A, Serie A 6-2 Ligue 1, Ligue 1 1-4 Serie A), the starting matrix would be:
| | Serie A | Ligue 1 |
|---|---|---|
| Serie A | 0 | 3 |
| Ligue 1 | 8 | 0 |
Here, Ci,j is the total goal difference by which league j has beaten league i in the matches that j won, while Cj,i is the total goal difference by which league i has beaten league j in the matches that i won. Then, to make the values comparable across pairs of leagues, I divided both Ci,j and Cj,i by the total number of matches the two leagues played against each other (here, 4). The resulting matrix in our case is:
| | Serie A | Ligue 1 |
|---|---|---|
| Serie A | 0 | 0.75 |
| Ligue 1 | 2 | 0 |
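The construction of the normalized matrix can be reproduced with a short Python sketch using the four example matches. The function name `contact_matrix` and the tuple layout are hypothetical, and treating draws as matches played with no margin is an assumption not stated in the original description.

```python
from collections import defaultdict


def contact_matrix(matches):
    """Return C[(i, j)]: total margin of j's wins over i, per match played."""
    margin = defaultdict(float)  # (loser, winner) -> summed goal difference
    played = defaultdict(int)    # unordered league pair -> matches played
    for league_a, goals_a, league_b, goals_b in matches:
        played[frozenset((league_a, league_b))] += 1
        if goals_a > goals_b:
            margin[(league_b, league_a)] += goals_a - goals_b
        elif goals_b > goals_a:
            margin[(league_a, league_b)] += goals_b - goals_a
        # draws add a match played but no margin (assumption)
    return {pair: total / played[frozenset(pair)] for pair, total in margin.items()}


matches = [
    ("Serie A", 2, "Ligue 1", 1),
    ("Ligue 1", 3, "Serie A", 0),
    ("Serie A", 6, "Ligue 1", 2),
    ("Ligue 1", 1, "Serie A", 4),
]

C = contact_matrix(matches)
print(C[("Serie A", "Ligue 1")])  # 0.75
print(C[("Ligue 1", "Serie A")])  # 2.0
```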
In this way, we can create a directed graph where the weight of the link from node i to node j equals Ci,j (in our case, the link from Ligue 1 to Serie A has a weight of 2, while the link from Serie A to Ligue 1 has a weight of 0.75). The weight of a link from node i to node j therefore represents the average goal margin per match by which league j beats league i. Comparing Ci,j and Cj,i then tells us which of the two leagues is stronger.
Finally, in the plot, the node size depends on the average degree of the in-edges. That's how I built the network, with node size representing how difficult (and therefore how strong) a certain league is. Further calculations were then applied to obtain 5 final difficulty coefficients, shown in the following table:
League | Difficulty Index |
---|---|
La Liga | 1 |
Premier League | 0.996 |
Fußball-Bundesliga | 0.931 |
Serie A | 0.904 |
Ligue 1 | 0.856 |
The rest of the scripts (the ones in the Player Statistics folder) build the dataset with the statistics coming from both FBRef and PlayeRank. Once this dataset was defined, I multiplied the FBRef statistics by the difficulty coefficients to standardize statistics across leagues.
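As a minimal sketch of this standardization step, the coefficients from the table above can be applied to a player's statistics as follows. The function `adjust_for_league` and the statistic names are hypothetical, and multiplying every statistic by the same league coefficient is an assumption about how the scaling is applied.

```python
# Difficulty coefficients as reported in the table above.
DIFFICULTY = {
    "La Liga": 1.0,
    "Premier League": 0.996,
    "Fußball-Bundesliga": 0.931,
    "Serie A": 0.904,
    "Ligue 1": 0.856,
}


def adjust_for_league(stats, league):
    """Scale every statistic by the difficulty coefficient of the league."""
    coeff = DIFFICULTY[league]
    return {name: value * coeff for name, value in stats.items()}


# Hypothetical per-90 statistics for a Serie A player.
adjusted = adjust_for_league({"goals_per90": 0.5, "xG_per90": 0.25}, "Serie A")
print(adjusted)
```

With these coefficients, the same raw output counts for slightly less in Ligue 1 than in La Liga.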
In the 02_Machine Learning folder, you can find the script I built to obtain the chosen variables for each role by applying the 7 different variable selection algorithms.
In the 03_Data Analytics folder, you can find both the scripts I wrote to obtain the winners dataset and the Jupyter notebook containing the function that returns the players most similar to the ones who won the league. Moreover, I built a "Trade-off" value by dividing the percentage of money saved by the similarity coefficient of the low-cost team (similarity is measured as the distance between a player's statistics and the average statistics of league winners, computed with the norm function).
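The trade-off value is a simple ratio, sketched below with the rounded figures reported for Serie A 2020-2021 (74.68% savings, similarity 4.424397). The function name `trade_off` is hypothetical.

```python
def trade_off(savings_pct, similarity):
    """Percentage of money saved divided by the similarity coefficient.

    A lower similarity coefficient means the team is closer to the winners,
    so a higher trade-off is better.
    """
    return savings_pct / similarity


value = trade_off(74.68, 4.424397)
print(round(value, 2))  # 16.88
```

The small difference from the published 16.880303 comes from using the savings percentage rounded to two decimals.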
In the final folder, 04_App, there is the script for building the app that can be found at this link. The code is in the notebook "bqplot.ipynb". Voila and Heroku were then used for deployment.
This app was created just to give a first view of PlayeRank data combined with Transfermarkt valuations, and it shows:
● A preview of PlayeRank data with a boxplot
● A scatterplot with PlayeRank score on the x-axis and Transfermarkt valuation on the y-axis, with the size of each point dependent on a "Likability" parameter (computed as PlayeRank index / Transfermarkt valuation)
● A pitch where the best 11 players by the likability parameter (divided by role) are represented.
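The likability parameter and the per-role ranking behind the pitch view can be sketched as follows. The functions `likability` and `best_by_role`, the data layout and the sample players are hypothetical, and expressing the valuation in millions of euros is an assumption, since the unit used in the app is not stated.

```python
def likability(playerank, valuation_meur):
    """PlayeRank index divided by Transfermarkt valuation (in EUR millions, assumed)."""
    return playerank / valuation_meur


def best_by_role(players, per_role=1):
    """Return the `per_role` most likable players for each role."""
    by_role = {}
    for p in players:
        by_role.setdefault(p["role"], []).append(p)
    return {
        role: sorted(
            group,
            key=lambda p: likability(p["playerank"], p["value_meur"]),
            reverse=True,
        )[:per_role]
        for role, group in by_role.items()
    }


# Hypothetical players.
players = [
    {"name": "S1", "role": "Striker", "playerank": 0.20, "value_meur": 40.0},
    {"name": "S2", "role": "Striker", "playerank": 0.15, "value_meur": 10.0},
    {"name": "G1", "role": "Goalkeeper", "playerank": 0.10, "value_meur": 2.0},
]

best = best_by_role(players)
print(best["Striker"][0]["name"])  # S2
```

S2 has a lower PlayeRank score than S1 but costs a quarter as much, so it ranks first on likability.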
Here, a preview with some screenshots of the three plots (various interactive filters can be applied to plots):
The table below details the similarity coefficient, the percentage savings and the trade-off coefficient for the 5 highest trade-off values obtained. The higher the trade-off, the better the combination of statistical similarity and money saved.
Season | League | Similarity Coeff. | Savings | Trade Off |
---|---|---|---|---|
2020-2021 | Serie A | 4.424397 | 74.68% | 16.880303 |
2020-2021 | Ligue 1 | 4.480457 | 75.25% | 16.796812 |
2019-2020 | Serie A | 5.254009 | 80.59% | 15.338965 |
2020-2021 | Bundesliga | 5.364341 | 82.05% | 15.296018 |
2018-2019 | Ligue 1 | 5.763644 | 84.11% | 14.59353 |
Looking at the resulting values, the best trade-off is found in Serie A 2020/2021, with a similarity coefficient of 4.424397 and a 74.68% saving on the budget. In absolute terms, the resulting team costs €152.941.662 (very close to the total value of Cagliari Calcio, the 12th most valuable squad in Serie A 2020/2021).
If we look at two of the coefficients in our dataset (likability of the signing and average PlayeRank index), we find that the two teams are very similar in total PlayeRank score (2.96 for the low-cost players, 3.02 for the winners), but the low-cost players have a much higher total likability (957.38 vs 851.27). The following table shows the low-cost team for Serie A, 2020/2021:
Similarity | Player | Valuation | Position |
---|---|---|---|
4.293194 | Claudio Bravo | €1.000.000 | Goalkeeper |
3.511304 | Niklas Stark | €10.750.000 | Centre-Back |
3.824119 | Amir Rrahmani | €13.333.333 | Centre-Back |
4.419091 | Jordan Torunarigha | €11.333.333 | Centre-Back |
2.988132 | Bruno Peres | €2.666.666 | Full-Back |
3.917845 | Andrea Conti | €7.166.666 | Full-Back |
3.961353 | Maxime Busi | €4.250.000 | Full-Back |
5.423287 | Óscar De Marcos | €2.333.333 | Full-Back |
3.433324 | Roberto Gagliardini | €14.666.666 | Midfielder |
3.889508 | Jasmin Kurtič | €2.833.333 | Midfielder |
4.096985 | Otávio | €7.000.000 | Midfielder |
5.068749 | Marko Rog | €11.666.666 | Midfielder |
5.765697 | Ivan Ilić | €5.333.333 | Midfielder |
6.038868 | Danilo Cataldi | €2.900.000 | Midfielder |
3.687857 | Patrik Schick | €24.333.333 | Striker |
5.250205 | Saša Kalajdžić | €12.625.000 | Striker |
6.221749 | Serhou Guirassy | €13.000.000 | Striker |
3.847888 | Karim Bellarabi | €5.750.000 | Winger |
On the other hand, if we look for the best substitute for Romelu Lukaku after F.C. Internazionale lost him, we obtain the following table:
Similarity | Player | Valuation | Position |
---|---|---|---|
5.680932 | Tammy Abraham | €39.333.333 | Striker |
7.554334 | Krzysztof Piątek | €16.000.000 | Striker |
7.809493 | Joel Pohjanpalo | €2.166.666 | Striker |
8.500613 | Maxi Gómez | €29.250.000 | Striker |
8.565245 | Serhou Guirassy | €13.000.000 | Striker |
As we can see, Tammy Abraham was his best substitute according to data from 2020/2021. In summer 2021, he was bought by A.S. Roma for €40.000.000 and recorded 27 goals and 5 assists in 53 games (1 goal every 157 minutes). Krzysztof Piątek was also signed in January by A.C.F. Fiorentina, playing 948 minutes and scoring 6 goals (1 goal every 158 minutes). While they have a similarity of 5.68 and 7.55 with Lukaku, the actual signings of F.C. Internazionale had an average similarity of 18.95 (16.20 for Edin Džeko and 21.70 for Joaquín Correa) and were bought for a total of €30.000.000. Together, they scored 1 goal every 203 minutes (23 goals in 4688 minutes), and neither comes close to the minutes-per-goal rate of Abraham or Piątek.
This repository is by Alberto Allegri ([email protected]) as part of the Master's Thesis in Data Science & Business Analytics at Bocconi University, under the supervision of professor Carlo Ambrogio Favero.