Alberto199810/Master-s-thesis-Moneyball-II


Statistics and Sport: The Moneyball idea applied to Football world

Alberto Allegri, Bocconi University, Master's Thesis in Data Science & Business Analytics

Tech Stack:
R Python

Table of Contents

  1. General Info
  2. Methodology
  3. Repository Structure
  4. Results
  5. Credits

General Info

This repository contains the code for my Master's Thesis, where I try to replicate the Moneyball idea (building a team capable of winning the league, but with a restricted budget) by combining football statistics from FBRef Scouting Reports, Transfermarkt and PlayeRank.

If you want a visualization of aggregated PlayeRank data, you can visit the app I built at this link: https://alberto-allegri-moneyball2.herokuapp.com/.

Methodology

The idea behind the thesis is to start from a dataset whose independent variables (our X) are all the statistics found in the FBRef scouting report, and whose dependent variable (our y) is the average PlayeRank score of a player over the entire season. The data covers seasons 2017/2018 to 2020/2021, for all the players who had a scouting report on FBRef in those seasons. For each of these players, I had the average PlayeRank score, thanks to the data PlayeRank provided me. Unfortunately, the only dataset I can share is the one with FBRef data, since PlayeRank data is protected by an NDA.

Starting from the two resulting datasets (one for goalkeepers and one for outfield players, since the variables in the scouting report differ between goalkeepers and non-goalkeepers), I proceeded as follows:

First, I split the data into 6 different sets:

● Goalkeepers
● Centre-Backs
● Full-Backs
● Midfielders
● Wingers
● Strikers

Then, to each of these sets, I applied 7 different variable selection algorithms to select the FBRef statistics that are most important for predicting the PlayeRank score:

● BorutaPy
● Recursive Feature Elimination
● Univariate Feature Selection
● Lasso Regression
● Feature Shuffling
● Feature Performance
● Target Mean Performance

Once the 7 algorithms had been applied, a variable was included in my final choice if it was selected by at least 5 out of the 7 algorithms (the final choice can be seen at this link).
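The "at least 5 out of 7" rule can be sketched as a simple vote count. This is only an illustration: the function name `vote_on_features`, the selector keys and the feature names are hypothetical, and the real selector outputs come from the scripts in this repository.

```python
from collections import Counter


def vote_on_features(selections, min_votes=5):
    """Return the features chosen by at least `min_votes` selectors, sorted by name."""
    # Each selector gets one vote per feature, so duplicates are removed with set().
    votes = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    return sorted(feature for feature, n in votes.items() if n >= min_votes)


# Hypothetical outputs of the 7 selectors for one role (illustrative names only).
selections = {
    "boruta": ["shots", "xg", "touches_in_box"],
    "rfe": ["shots", "xg", "aerials_won"],
    "univariate": ["shots", "xg", "touches_in_box"],
    "lasso": ["shots", "xg"],
    "feature_shuffling": ["shots", "touches_in_box"],
    "feature_performance": ["shots", "xg", "touches_in_box"],
    "target_mean_performance": ["xg", "touches_in_box", "aerials_won"],
}

selected = vote_on_features(selections, min_votes=5)
print(selected)  # ['shots', 'touches_in_box', 'xg']
```

Here `aerials_won` is dropped because only 2 of the 7 selectors picked it.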

Finally, once the variables were selected, I collected the players who won the league from season 2017/2018 to season 2020/2021 in each of the top-5 European leagues (the csv can be found at this link). Then, I calculated the mean values of the important variables within each role and, using the norm function, found which observations in my dataset were closest to the average statistics of the players who actually won the league. I applied one restriction to this reasoning: the selected players had to have a Transfermarkt valuation equal to 1/x of the average value of the players in the same role in the winning team (x can be chosen based on how much money we want to save).

The final result is a full team that costs far less than the winning team, but whose players performed very similarly on the statistics that matter most for evaluating a player in his role.

To retrieve the team for a specific league and season, we just have to run low_cost_winners(season, league, factor_to_save).
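The matching step for a single role can be sketched as follows. This is not the code of low_cost_winners: the helper `closest_affordable`, the field names and the toy numbers are hypothetical, the distance is assumed to be the Euclidean norm, and the budget rule is read as "at most 1/x of the winners' average value".

```python
import math


def closest_affordable(candidates, winners, features, factor_to_save, top_n=1):
    """Rank affordable candidates by distance to the winners' average profile."""
    # Average statistics of the league winners in this role.
    target = [sum(w[f] for w in winners) / len(winners) for f in features]
    # Assumed budget rule: at most 1/x of the winners' average valuation.
    budget = sum(w["value"] for w in winners) / len(winners) / factor_to_save
    scored = [
        (math.dist([c[f] for f in features], target), c["name"])
        for c in candidates
        if c["value"] <= budget
    ]
    return sorted(scored)[:top_n]


# Toy data, values in millions of euros (made up for illustration).
winners = [
    {"name": "W1", "value": 60.0, "shots": 4.0, "xg": 0.6},
    {"name": "W2", "value": 40.0, "shots": 3.0, "xg": 0.4},
]
candidates = [
    {"name": "A", "value": 10.0, "shots": 3.5, "xg": 0.5},
    {"name": "B", "value": 12.0, "shots": 3.0, "xg": 0.5},
    {"name": "C", "value": 30.0, "shots": 3.5, "xg": 0.5},  # over budget
]

result = closest_affordable(candidates, winners, ["shots", "xg"], factor_to_save=4, top_n=2)
print(result)
```

A smaller distance means a closer match, so candidate C is excluded by price even though its statistics match the winners' average.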

Repository Structure

Within the repository, you can see four main folders:

  1. 01_Datasets
  2. 02_Machine Learning
  3. 03_Data Analytics
  4. 04_App

01_Datasets

In this folder, you can find the scripts I wrote to build my final dataset. Moreover, in the section Leagues Comparison you can see the code I applied to standardize statistics across leagues, producing the following map of European Leagues:

How was this computed? I took the data regarding all European matches in the last 12 years (from 2010 to 2022), and I generated a contact matrix, with this reasoning (example following):

Given four matches between a Serie A team and a Ligue 1 team with these results (Serie A 2-1 Ligue 1, Ligue 1 3-0 Serie A, Serie A 6-2 Ligue 1, Ligue 1 1-4 Serie A), the starting matrix would be:

| | Serie A | Ligue 1 |
|---|---|---|
| Serie A | 0 | 3 |
| Ligue 1 | 8 | 0 |

Here, Ci,j is the total goal difference by which league j has beaten league i in the matches that j won against i, while Cj,i is the total goal difference by which league i has beaten league j in the matches that i won against j. In the example, Ligue 1 won once by 3 goals, while Serie A won three times by 1, 4 and 3 goals (8 in total). Then, to standardize everything and make it consistent, I divided both Ci,j and Cj,i by the total number of matches the two leagues played against each other (here, 4). So the resulting matrix, in our case, would be:

| | Serie A | Ligue 1 |
|---|---|---|
| Serie A | 0 | 0.75 |
| Ligue 1 | 2 | 0 |
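The worked example above can be reproduced with a short sketch. The function name `contact_matrix` and the tuple layout of the matches are assumptions made for illustration; the actual code lives in the Leagues Comparison section of the repository.

```python
def contact_matrix(matches, leagues):
    """Build the normalized matrix C, where C[i][j] is the total margin of
    league j's wins over league i divided by the matches played between them."""
    idx = {name: k for k, name in enumerate(leagues)}
    n = len(leagues)
    margins = [[0.0] * n for _ in range(n)]
    played = [[0] * n for _ in range(n)]
    for league_a, goals_a, league_b, goals_b in matches:
        a, b = idx[league_a], idx[league_b]
        played[a][b] += 1
        played[b][a] += 1
        if goals_a > goals_b:
            margins[b][a] += goals_a - goals_b  # row = loser, column = winner
        elif goals_b > goals_a:
            margins[a][b] += goals_b - goals_a
    return [
        [margins[i][j] / played[i][j] if played[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


leagues = ["Serie A", "Ligue 1"]
matches = [
    ("Serie A", 2, "Ligue 1", 1),
    ("Ligue 1", 3, "Serie A", 0),
    ("Serie A", 6, "Ligue 1", 2),
    ("Ligue 1", 1, "Serie A", 4),
]

C = contact_matrix(matches, leagues)
print(C)  # [[0.0, 0.75], [2.0, 0.0]]
```

Draws add to the number of matches played but not to either margin, which is one possible reading of the normalization described above.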

In this way, we can create a directed graph where the weight of the directed link from node i to node j is equal to Ci,j (in our case, the link from Ligue 1 to Serie A has a weight of 2, while the link from Serie A to Ligue 1 has a weight of 0.75). A link from node i to node j therefore represents the average goal margin, per match played, by which league j beats league i. By comparing Ci,j and Cj,i, we can infer which of the two leagues is stronger.
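A minimal sketch of how the graph can be read, using plain Python rather than a graph library. It assumes that a league's strength is summarized by the average weight of the edges pointing into its node; the helper name `incoming_strength` is hypothetical, and this is not necessarily the exact measure used for the plot.

```python
def incoming_strength(C, leagues):
    """Average weight of the edges entering each node (edge i -> j has weight C[i][j])."""
    n = len(leagues)
    strength = {}
    for j, name in enumerate(leagues):
        # Collect the weights of all existing edges that point to node j.
        weights = [C[i][j] for i in range(n) if i != j and C[i][j] > 0]
        strength[name] = sum(weights) / len(weights) if weights else 0.0
    return strength


leagues = ["Serie A", "Ligue 1"]
C = [[0.0, 0.75], [2.0, 0.0]]  # normalized matrix from the example

strength = incoming_strength(C, leagues)
print(strength)  # {'Serie A': 2.0, 'Ligue 1': 0.75}
```

In the two-league example, Serie A receives the heavier edge, which matches the intuition that it won more often and by larger margins.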

Finally, to represent everything in the plot, the node size depends on the average degree of the in-edges. That's how I built the network, with node size representing how difficult (and therefore how strong) a certain league is. Further calculations were then applied to obtain 5 final difficulty coefficients, shown in the following table:

| League | Difficulty Index |
|---|---|
| La Liga | 1 |
| Premier League | 0.996 |
| Fußball-Bundesliga | 0.931 |
| Serie A | 0.904 |
| Ligue 1 | 0.856 |

The rest of the scripts (the ones in the Player Statistics folder) build the dataset with the statistics coming from both FBRef and PlayeRank. Once this dataset was defined, I multiplied the FBRef statistics by the difficulty coefficients, to standardize statistics across leagues.
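The scaling step can be sketched like this. The coefficients are the ones from the table above, while the helper `scale_stats`, the record layout and the example player are hypothetical.

```python
# Difficulty coefficients taken from the table above.
DIFFICULTY = {
    "La Liga": 1.0,
    "Premier League": 0.996,
    "Fußball-Bundesliga": 0.931,
    "Serie A": 0.904,
    "Ligue 1": 0.856,
}


def scale_stats(player, stat_names):
    """Return a copy of the player record with the given stats multiplied
    by the difficulty coefficient of the player's league."""
    factor = DIFFICULTY[player["league"]]
    scaled = dict(player)
    for stat in stat_names:
        scaled[stat] = player[stat] * factor
    return scaled


# Made-up player record, for illustration only.
player = {"name": "Example Player", "league": "Serie A", "shots": 3.0, "xg": 0.5}
scaled = scale_stats(player, ["shots", "xg"])
print(scaled)
```

With these coefficients, the same raw statistic counts for slightly less in Ligue 1 than in La Liga.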

02_Machine Learning

In this folder, you can find the script I built to obtain the chosen variables for each role, applying the 7 different variable selection algorithms.

03_Data Analytics

In this folder, you can find both the scripts I wrote to obtain the winners dataset and the Jupyter notebook containing the function that finds the players most similar to the ones who won the league. Moreover, I built a "Trade-off" value by dividing the percentage of money saved by the similarity coefficient of the low-cost team. Similarity is measured as the distance between a player's statistics and the average statistics of the league winners, computed with the norm function, so a lower value means a closer match.
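As a quick check of the definition, the trade-off for Serie A 2020/2021 can be recomputed from the rounded figures reported in the Results section (74.68% savings, similarity 4.424397). The function name `trade_off` is only illustrative.

```python
def trade_off(savings_pct, similarity):
    """Percentage of money saved divided by the similarity coefficient (a distance)."""
    return savings_pct / similarity


# Rounded inputs from the Results table for Serie A 2020/2021.
value = trade_off(74.68, 4.424397)
print(round(value, 2))  # 16.88
```

The small difference from the reported 16.880303 comes from the savings percentage being rounded to two decimals in the table.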

04_App

In this final folder, there is the script for building the app that can be found at this link. The code is in the notebook "bqplot.ipynb". Voila and Heroku were used for deployment.

This app was created just to give a first view of PlayeRank data, combined with Transfermarkt valuations, and it shows:

● A preview of PlayeRank data with a boxplot
● A scatterplot with PlayeRank score on the x-axis and Transfermarkt valuation on the y-axis, with the size of each point dependent on a "Likability" parameter (computed as PlayeRank index / Transfermarkt valuation)
● A pitch where the best 11 players by the likability parameter (divided by role) are represented.
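The likability parameter and the "best players by role" selection can be sketched as follows. The helper names, the number of slots per role and the toy players are hypothetical, and valuations are assumed to be in millions of euros; the app itself is built in "bqplot.ipynb".

```python
def likability(playerank, valuation):
    """PlayeRank index divided by Transfermarkt valuation."""
    return playerank / valuation


def best_by_role(players, slots):
    """Pick the players with the highest likability for each role, up to slots[role]."""
    ranked = sorted(
        players,
        key=lambda p: likability(p["playerank"], p["valuation"]),
        reverse=True,
    )
    lineup = {role: [] for role in slots}
    for p in ranked:
        role = p["role"]
        if role in lineup and len(lineup[role]) < slots[role]:
            lineup[role].append(p["name"])
    return lineup


# Made-up players, for illustration only.
players = [
    {"name": "A", "role": "Striker", "playerank": 0.30, "valuation": 30.0},
    {"name": "B", "role": "Striker", "playerank": 0.25, "valuation": 5.0},
    {"name": "C", "role": "Goalkeeper", "playerank": 0.20, "valuation": 2.0},
    {"name": "D", "role": "Goalkeeper", "playerank": 0.22, "valuation": 11.0},
]

lineup = best_by_role(players, {"Goalkeeper": 1, "Striker": 1})
print(lineup)  # {'Goalkeeper': ['C'], 'Striker': ['B']}
```

Player A has the highest PlayeRank score among the strikers but loses to B once the valuation is taken into account.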

Here is a preview with some screenshots of the three plots (various interactive filters can be applied to the plots):

Results

The table below details the similarity coefficient, the percentage savings and the trade-off coefficient for the 5 highest trade-off values obtained. The higher the trade-off, the better the combination of statistical similarity and money saved.

| Season | League | Similarity Coeff. | Savings | Trade Off |
|---|---|---|---|---|
| 2020-2021 | Serie A | 4.424397 | 74.68% | 16.880303 |
| 2020-2021 | Ligue 1 | 4.480457 | 75.25% | 16.796812 |
| 2019-2020 | Serie A | 5.254009 | 80.59% | 15.338965 |
| 2020-2021 | Bundesliga | 5.364341 | 82.05% | 15.296018 |
| 2018-2019 | Ligue 1 | 5.763644 | 84.11% | 14.59353 |

Looking at the resulting values, the best trade-off is found in Serie A 2020/2021, with a similarity coefficient of 4.424397 and a 74.68% saving on the budget. In absolute terms, the resulting team costs €152.941.662 (very close to the total value of Cagliari Calcio, the 12th most valuable squad in Serie A 2020/2021).

If we look at two of the coefficients we have in our dataset (likability of the signing and average PlayeRank index), we discover that the two teams are very similar in the total sum of the PlayeRank score (2.96 for the low-cost players, 3.02 for the winners), but the low-cost players have a much higher total likability (957.38 vs 851.27). The following table shows the low-cost team for Serie A 2020/2021:

| Similarity | Player | Valuation | Position |
|---|---|---|---|
| 4.293194 | Claudio Bravo | €1.000.000 | Goalkeeper |
| 3.511304 | Niklas Stark | €10.750.000 | Centre-Back |
| 3.824119 | Amir Rrahmani | €13.333.333 | Centre-Back |
| 4.419091 | Jordan Torunarigha | €11.333.333 | Centre-Back |
| 2.988132 | Bruno Peres | €2.666.666 | Full-Back |
| 3.917845 | Andrea Conti | €7.166.666 | Full-Back |
| 3.961353 | Maxime Busi | €4.250.000 | Full-Back |
| 5.423287 | Óscar De Marcos | €2.333.333 | Full-Back |
| 3.433324 | Roberto Gagliardini | €14.666.666 | Midfielder |
| 3.889508 | Jasmin Kurtič | €2.833.333 | Midfielder |
| 4.096985 | Otávio | €7.000.000 | Midfielder |
| 5.068749 | Marko Rog | €11.666.666 | Midfielder |
| 5.765697 | Ivan Ilić | €5.333.333 | Midfielder |
| 6.038868 | Danilo Cataldi | €2.900.000 | Midfielder |
| 3.687857 | Patrik Schick | €24.333.333 | Striker |
| 5.250205 | Saša Kalajdžić | €12.625.000 | Striker |
| 6.221749 | Sehrou Guirassy | €13.000.000 | Striker |
| 3.847888 | Karim Bellarabi | €5.750.000 | Winger |

As a second use case, if we look for the best substitute for Romelu Lukaku when F.C. Internazionale lost him, we obtain the following table:

| Similarity | Player | Valuation | Position |
|---|---|---|---|
| 5.680932 | Tammy Abraham | €39.333.333 | Striker |
| 7.554334 | Krzysztof Piątek | €16.000.000 | Striker |
| 7.809493 | Joel Pohjanpalo | €2.166.666 | Striker |
| 8.500613 | Maxi Gómez | €29.250.000 | Striker |
| 8.565245 | Sehrou Guirassy | €13.000.000 | Striker |

As we can see, Tammy Abraham was his best substitute according to the data from 2020/2021. In summer 2021, he was bought by A.S. Roma for €40.000.000 and recorded 27 goals and 5 assists in 53 games (1 goal every 157 minutes). Krzysztof Piątek was also bought, in January, by A.C.F. Fiorentina, playing 948 minutes and scoring 6 goals (1 goal every 158 minutes). While they have a similarity of 5.68 and 7.55 with Lukaku, the actual signings of F.C. Internazionale had an average similarity of 18.95 (16.20 for Edin Džeko and 21.70 for Joaquín Correa) and were bought for a total of €30.000.000. Together, they scored 1 goal every 203 minutes (23 goals in 4688 minutes), and neither of them comes close to the minutes-per-goal rate of Abraham or Piątek.

Credits

This repository is by Alberto Allegri ([email protected]) as part of the Master's Thesis in Data Science & Business Analytics at Bocconi University, under the supervision of Professor Carlo Ambrogio Favero.
