From 2f8e84840ebcb55bf6abba0dfe57c302cb65ad89 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Tue, 5 Nov 2024 19:42:47 +0530 Subject: [PATCH 1/2] Fixes #776 --- ...a-ml-model-fight-predictions-ufc-259.ipynb | 1754 +++++++++++++++++ 1 file changed, 1754 insertions(+) create mode 100644 Prediction Models/MMA_Fight_prediction/mma-ml-model-fight-predictions-ufc-259.ipynb diff --git a/Prediction Models/MMA_Fight_prediction/mma-ml-model-fight-predictions-ufc-259.ipynb b/Prediction Models/MMA_Fight_prediction/mma-ml-model-fight-predictions-ufc-259.ipynb new file mode 100644 index 00000000..dc6ce51a --- /dev/null +++ b/Prediction Models/MMA_Fight_prediction/mma-ml-model-fight-predictions-ufc-259.ipynb @@ -0,0 +1,1754 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mixed Martial Arts and the UFC\n", + "\n", + "\n", + "The UFC is the largest MMA promotion company in the world and features some of the highest-level fighters in the sport. As of 2020 the UFC has held over 500 events featuring fighters in 12 different weight divisions. The data set is a collection of over 5000 fights from the years 1993 to 2019.\n", + "\n", + "Being a huge fan of MMA, I wanted to design some Machine Learning Models to experiment with the available data. The goal is to make a model to predict fight outcomes, and see if it has any usefulness in real-world applications.\n", + "\n", + "In this particular notebook I reduce the data down to (what I felt was) core stats, so despite this dataset having over 145 features, I reduce it down to height, weight, reach, win streak, lose streak, total wins, total losses, and total draws. 
In the future I will apply more features to see if the model accuracy improves at all.\n", + "\n", + "In this notebook I use the following algorithms for model building:\n", + "* Gaussian Naive Bayes\n", + "* Logistic Regression\n", + "* Decision Tree\n", + "* KNN\n", + "* Random Forest\n", + "* Support Vector Classifier\n", + "* XGBoost\n", + "* Artificial Neural Network\n", + "\n", + "The models with the highest accuracy score (using k-fold cross-validation) on the training data are then assessed on the testing data.\n", + "\n", + "Finally, the models that performed well are applied to the upcoming event (March 6th 2021) to make predictions on fight winners." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", + "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "\n", + "\n", + "from sklearn.model_selection import KFold\n", + "from sklearn import tree\n", + "from sklearn.metrics import accuracy_score\n", + "from sklearn.metrics import confusion_matrix\n", + "\n", + "from sklearn.model_selection import StratifiedKFold\n", + "from sklearn.model_selection import cross_val_score\n", + "from sklearn.naive_bayes import GaussianNB\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.svm import SVC\n", + "import xgboost\n", + "from xgboost import XGBClassifier\n", + "\n", + "import keras \n", + "from keras.models import Sequential\n", + "from keras.layers import Dense\n", + "from keras import layers, models, optimizers\n", + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "from sklearn.metrics 
import confusion_matrix\n", + "from sklearn.metrics import plot_confusion_matrix\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "import os\n", + "for dirname, _, filenames in os.walk('/kaggle/input'):\n", + " for filename in filenames:\n", + " print(os.path.join(dirname, filename))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import and clean data for use in models" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "data_df = pd.read_csv('../input/ufcdata/data.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 5144 entries, 0 to 5143\n", + "Columns: 145 entries, R_fighter to R_age\n", + "dtypes: bool(1), float64(134), int64(1), object(9)\n", + "memory usage: 5.7+ MB\n" + ] + } + ], + "source": [ + "data_df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "df = data_df.dropna()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 3202 entries, 0 to 5008\n", + "Columns: 145 entries, R_fighter to R_age\n", + "dtypes: bool(1), float64(134), int64(1), object(9)\n", + "memory usage: 3.5+ MB\n" + ] + } + ], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "columns=df.select_dtypes(include='object').columns" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "Index(['R_fighter', 'B_fighter', 'Referee', 'date', 'location', 'Winner',\n", + " 'weight_class', 'B_Stance', 'R_Stance'],\n", + " dtype='object')" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "columns" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " errors=errors,\n" + ] + } + ], + "source": [ + "df.drop(columns=['R_fighter', 'B_fighter', 'Referee', 'date', 'location','weight_class'], inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Winner B_Stance R_Stance\n", + "0 Red Orthodox Orthodox\n", + "1 Red Orthodox Southpaw\n", + "2 Red Orthodox Orthodox\n", + "3 Blue Switch Orthodox\n", + "4 Blue Southpaw Southpaw\n", + "... ... ... ...\n", + "4887 Red Orthodox Orthodox\n", + "4901 Red Orthodox Orthodox\n", + "4923 Red Orthodox Orthodox\n", + "4967 Red Orthodox Orthodox\n", + "5008 Red Southpaw Orthodox\n", + "\n", + "[3202 rows x 3 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.select_dtypes(include='object')" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " \n", + "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " This is separate from the ipykernel package so we can avoid doing imports until\n", + "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " \n" + ] + } + ], + "source": [ + "map_stance = {'Orthodox': 0, 'Switch': 1, 'Southpaw': 2, 'Open Stance': 3}\n", + "df['B_Stance'] = df['B_Stance'].replace(map_stance)\n", + "df['R_Stance'] = df['R_Stance'].replace(map_stance)\n", + "\n", + "map_winner = {'Red': 0, 'Blue': 1, 'Draw': 2}\n", + "df['Winner'] = df['Winner'].replace(map_winner)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 3202 entries, 0 to 5008\n", + "Columns: 138 entries, Winner to R_age\n", + "dtypes: float64(134), int64(4)\n", + "memory usage: 3.4 MB\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " errors=errors,\n" + ] + } + ], + "source": [ + "df.drop(columns=df.select_dtypes(include='bool').columns, inplace=True)\n", + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1, 2])" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['Winner'].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "df = df[df['Winner'] != 2]" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 3151 entries, 0 to 5008\n", + "Columns: 138 entries, Winner to R_age\n", + "dtypes: 
float64(134), int64(4)\n", + "memory usage: 3.3 MB\n" + ] + } + ], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [], + "source": [ + "X = df.drop(columns=['Winner'])\n", + "Y = df['Winner']" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training size = 2363\n", + "Testing size = 788\n" + ] + } + ], + "source": [ + "print(\"Training size = \" + str(X_train.shape[0]))\n", + "print(\"Testing size = \" + str(X_test.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Training and Evaluation on Data using k-fold cross validation" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "seed = 404\n", + "np.random.seed(seed)" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Gaussian Naive Bayes K-fold Scores:\n", + "[0.60337553 0.59915612 0.6371308 0.51271186 0.59322034 0.55508475\n", + " 0.58898305 0.6059322 0.58050847 0.5720339 ]\n", + "\n", + "Gaussian Naive Bayes Average Score:\n", + "0.584813702352857\n", + "\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import StratifiedKFold\n", + "from sklearn.model_selection import cross_val_score\n", + "from sklearn.naive_bayes import GaussianNB\n", + "\n", + "gnb = GaussianNB()\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(gnb, X_train, y_train.values.ravel(), cv=kfold)\n", + "gnb_score = 
cv_score.mean()\n", + "print('Gaussian Naive Bayes K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Gaussian Naive Bayes Average Score:')\n", + "print(gnb_score)\n", + "print()" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):\n", + "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", + "\n", + "Increase the number of iterations (max_iter) or scale the data as shown in:\n", + " https://scikit-learn.org/stable/modules/preprocessing.html\n", + "Please also refer to the documentation for alternative solver options:\n", + " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", + " extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Logistic Regression K-fold Scores (training):\n", + "[0.6835443 0.67088608 0.62447257 0.65677966 0.63559322 0.69915254\n", + " 0.6779661 0.68220339 0.61864407 0.62288136]\n", + "\n", + "Logistic Regression Average Score:\n", + "0.6572123292569548\n" + ] + } + ], + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "lr = LogisticRegression(max_iter = 10000)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(lr, X_train, y_train.values.ravel(), cv=kfold)\n", + "lr_score = cv_score.mean()\n", + "print('Logistic Regression K-fold Scores (training):')\n", + "print(cv_score)\n", + "print()\n", + "print('Logistic Regression Average Score:')\n", + "print(lr_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Decision Tree K-fold Scores:\n", + "[0.5907173 0.56962025 0.55696203 
0.55508475 0.58050847 0.63559322\n", + " 0.54661017 0.55932203 0.55508475 0.52966102]\n", + "\n", + "Decision Tree Average Score:\n", + "0.5679163984838732\n" + ] + } + ], + "source": [ + "from sklearn import tree\n", + "\n", + "dt = tree.DecisionTreeClassifier(random_state = 1)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(dt, X_train, y_train.values.ravel(), cv=kfold)\n", + "dt_score = cv_score.mean()\n", + "print('Decision Tree K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Decision Tree Average Score:')\n", + "print(dt_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "KNN K-fold Scores:\n", + "[0.60337553 0.5443038 0.61603376 0.59322034 0.59322034 0.54661017\n", + " 0.56355932 0.61016949 0.58474576 0.58898305]\n", + "\n", + "KNN Average Score:\n", + "0.5844221554745047\n" + ] + } + ], + "source": [ + "from sklearn.neighbors import KNeighborsClassifier\n", + "\n", + "knn = KNeighborsClassifier()\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(knn, X_train, y_train.values.ravel(), cv=kfold)\n", + "knn_score = cv_score.mean()\n", + "print('KNN K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('KNN Average Score:')\n", + "print(knn_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Random Forest K-fold Scores:\n", + "[0.67510549 0.62869198 0.63291139 0.6440678 0.63559322 0.64830508\n", + " 0.65677966 0.68220339 0.62711864 0.6779661 ]\n", + "\n", + "Random Forest Average Score:\n", + "0.6508742759064579\n" + ] + } + ], + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "rf = RandomForestClassifier(random_state = 1)\n", + "kfold = 
StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(rf, X_train, y_train.values.ravel(), cv=kfold)\n", + "rf_score = cv_score.mean()\n", + "print('Random Forest K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Random Forest Average Score:')\n", + "print(rf_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Support Vector Classification K-fold Scores:\n", + "[0.64978903 0.64978903 0.64978903 0.64830508 0.64830508 0.64830508\n", + " 0.65254237 0.65254237 0.65254237 0.65254237]\n", + "\n", + "Support Vector Classification Average Score:\n", + "0.6504451834370307\n" + ] + } + ], + "source": [ + "from sklearn.svm import SVC\n", + "\n", + "svc = SVC(probability = True)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(svc, X_train, y_train.values.ravel(), cv=kfold)\n", + "svc_score = cv_score.mean()\n", + "print('Support Vector Classification K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Support Vector Classification Average Score:')\n", + "print(svc_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[11:43:35] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:38] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. 
Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:41] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:44] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:46] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:48] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:51] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:54] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:56] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. 
Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "[11:43:59] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", + "XGBoost Classifier K-fold Scores:\n", + "[0.65400844 0.61181435 0.5907173 0.6440678 0.63559322 0.64830508\n", + " 0.65254237 0.66949153 0.62711864 0.65254237]\n", + "\n", + "XGBoost Classifier Average Score:\n", + "0.6386201101337339\n" + ] + } + ], + "source": [ + "import xgboost\n", + "from xgboost import XGBClassifier\n", + "\n", + "xgb = XGBClassifier(objective='binary:logistic',random_state =1, use_label_encoder=False)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(xgb, X_train, y_train.values.ravel(), cv=kfold)\n", + "xgb_score = cv_score.mean()\n", + "print('XGBoost Classifier K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('XGBoost Classifier Average Score:')\n", + "print(xgb_score)" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [], + "source": [ + "from keras.utils import np_utils\n", + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "encoder = LabelEncoder()\n", + "encoder.fit(y_train)\n", + "encoded_Y = encoder.transform(y_train)\n", + "y_Train = np_utils.to_categorical(encoded_Y)\n", + "\n", + "encoder = LabelEncoder()\n", + "encoder.fit(y_test)\n", + "y_Test = encoder.transform(y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [], + "source": [ + "import keras \n", + "from keras.models import Sequential\n", + "from keras.layers import Dense\n", + "# from keras import layers, models, optimizers\n", + "\n", + "\n", + "def create_model():\n", + " model = Sequential()\n", + " \n", + " model.add(Dense(X_train.shape[1], 
input_dim=X_train.shape[1], activation='relu'))\n", + " model.add(Dense(64, activation='tanh'))\n", + " model.add(Dense(128, activation='tanh'))\n", + " model.add(Dense(128, activation='tanh')) \n", + " model.add(Dense(32, activation='relu'))\n", + " model.add(Dense(2, activation='sigmoid'))\n", + "\n", + " model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n", + " return model" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Neural Network K-fold Scores:\n", + "[0.65400844 0.61181435 0.5907173 0.6440678 0.63559322 0.64830508\n", + " 0.65254237 0.66949153 0.62711864 0.65254237]\n", + "\n", + "Neural Network Average Score:\n", + "0.6386201101337339\n" + ] + } + ], + "source": [ + "from keras.wrappers.scikit_learn import KerasClassifier\n", + "from sklearn.model_selection import KFold\n", + "\n", + "seed = 7\n", + "np.random.seed(seed)\n", + "\n", + "model = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)\n", + "\n", + "\n", + "kfold = KFold(n_splits=10, shuffle=True)\n", + "# Use the neural network's own cross-validation results, not the stale XGBoost cv_score\n", + "results = cross_val_score(model, X_train, y_Train, cv=kfold)\n", + "nn_score = results.mean()\n", + "print('Neural Network K-fold Scores:')\n", + "print(results)\n", + "print()\n", + "print('Neural Network Average Score:')\n", + "print(nn_score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best performing models\n", + "\n", + "With the training accuracy in mind, we will grab the top 3 models and evaluate them on the testing set" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Model Score Average\n", + "0 Gaussian Naive Bayes 0.584814\n", + "1 Logistic Regression 0.657212\n", + "2 Random Forest 0.650874\n", + "3 Decision Tree 0.567916\n", + "4 K-Nearest Neighbor 0.584422\n", + "5 Support Vector Classifier 0.650445\n", + "6 XGBoost 0.638620\n", + "7 Neural Network 0.638620" + ] + }, + "execution_count": 92, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scores = [['Gaussian Naive Bayes', gnb_score],\n", + " ['Logistic Regression', lr_score],\n", + " ['Random Forest', rf_score],\n", + " ['Decision Tree', dt_score],\n", + " ['K-Nearest Neighbor', knn_score],\n", + " ['Support Vector Classifier', svc_score],\n", + " ['XGBoost', xgb_score],\n", + " ['Neural Network', nn_score]]\n", + "\n", + "df_scores = pd.DataFrame(scores,\n", + " columns = ['Model', 'Score Average']\n", + " )\n", + "df_scores" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Gaussian Naive Bayes Model Accuracy (on testing set): \n", + "0.618020304568528\n" + ] + } + ], + "source": [ + "from sklearn.metrics import plot_confusion_matrix\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "GNB = GaussianNB()\n", + "GNB_model = GNB.fit(X_train, y_train.values.ravel())\n", + "y_pred = GNB_model.predict(X_test)\n", + "\n", + "disp = plot_confusion_matrix(GNB_model, X_test, y_test)\n", + "disp.ax_.set_title('Gaussian Naive Bayes Confusion Matrix')\n", + "\n", + "plt.show()\n", + "print('Gaussian Naive Bayes Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Logistic Regression Model Accuracy (on testing set): \n", + "0.6573604060913706\n" + ] + } + ], + "source": [ + "lr = LogisticRegression(max_iter = 10000)\n", + "lr_model = lr.fit(X_train, y_train.values.ravel())\n", + "y_pred = lr_model.predict(X_test)\n", + "\n", + "disp = plot_confusion_matrix(lr_model, X_test, y_test)\n", + "disp.ax_.set_title('Logistic Regression Confusion Matrix')\n", + "\n", + "plt.show()\n", + "print('Logistic Regression Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Random Forest Model Accuracy (on testing set): \n", + "0.6307106598984772\n" + ] + } + ], + "source": [ + "rf = RandomForestClassifier(random_state = 1)\n", + "rf_model = rf.fit(X_train, y_train.values.ravel())\n", + "y_pred = rf_model.predict(X_test)\n", + "disp = plot_confusion_matrix(rf_model, X_test, y_test)\n", + "disp.ax_.set_title('Random Forest Confusion Matrix')\n", + "\n", + "plt.show()\n", + "\n", + "print('Random Forest Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SVC Model Accuracy (on testing set): \n", + "0.6078680203045685\n" + ] + } + ], + "source": [ + "svc = SVC(probability = True)\n", + "svc_model = svc.fit(X_train, y_train.values.ravel())\n", + "y_pred = svc_model.predict(X_test)\n", + "disp = plot_confusion_matrix(svc_model, X_test, y_test)\n", + "disp.ax_.set_title('Support Vector Classifier Confusion Matrix')\n", + "\n", + "plt.show()\n", + "\n", + "print('SVC Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Prediction Time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# UFC 259: Blachowicz vs Adesanya\n", + "\n", + "### Title fights (5 rounds):\n", + "Jan Blachowicz vs Israel Adesanya\n", + "\n", + "Amanda Nunes vs Megan Anderson\n", + "\n", + "Petr Yan vs Aljamain Sterling\n", + "\n", + "### 3 round fights:\n", + "Islam Makhachev vs Drew Dober\n", + "\n", + "Thiago Santos vs Aleksandar Rakic\n", + "\n", + "\n", + "## The Stats: ([https://www.espn.co.uk/mma/fightcenter/_/id/600001860/league/ufc](http://))\n", + "### Fight 1\n", + "#### Jan Blachowicz (Blue):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 4\n", + "* Draws: 0\n", + "* Losses: 8\n", + "* Wins: 27\n", + "* Stance: Orthodox\n", + "* Height: 188\n", + "* Reach: 198\n", + "* Weight: 205\n", + "\n", + "#### Israel Adesanya (Red):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 20\n", + "* Draws: 0\n", + "* Losses: 0\n", + "* Wins: 20\n", + "* Stance: Switch\n", + "* Height: 193\n", + "* Reach: 203\n", + "* Weight: 193 (speculation based on interview, weigh-ins to come)\n", + "\n", + "\n", + "### Fight 2\n", + "#### Amanda Nunes (Blue):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 11\n", + "* Draws: 0\n", + "* 
Losses: 4\n", + "* Wins: 20\n", + "* Stance: Orthodox\n", + "* Height: 173\n", + "* Reach: 165\n", + "* Weight: 145\n", + "\n", + "#### Megan Anderson (Red):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 2\n", + "* Draws: 0\n", + "* Losses: 4\n", + "* Wins: 11\n", + "* Stance: Orthodox\n", + "* Height: 183\n", + "* Reach: 183\n", + "* Weight: 145\n", + "\n", + "### Fight 3\n", + "#### Petr Yan (Blue):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 10\n", + "* Draws: 0\n", + "* Losses: 1\n", + "* Wins: 15\n", + "* Stance: Switch\n", + "* Height: 170\n", + "* Reach: 170\n", + "* Weight: 135\n", + "\n", + "#### Aljamain Sterling (Red):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 5\n", + "* Draws: 0\n", + "* Losses: 3\n", + "* Wins: 19\n", + "* Stance: Orthodox\n", + "* Height: 170\n", + "* Reach: 180\n", + "* Weight: 135\n", + "\n", + "### Fight 4\n", + "#### Islam Makhachev (Blue):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 6\n", + "* Draws: 0\n", + "* Losses: 1\n", + "* Wins: 18\n", + "* Stance: Orthodox\n", + "* Height: 178\n", + "* Reach: 178\n", + "* Weight: 155\n", + "\n", + "#### Drew Dober (Red):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 3\n", + "* Draws: 0\n", + "* Losses: 9\n", + "* Wins: 23\n", + "* Stance: Southpaw\n", + "* Height: 173\n", + "* Reach: 178\n", + "* Weight: 155\n", + "\n", + "### Fight 5\n", + "#### Thiago Santos (Blue):\n", + "* Current Lose Streak: 2\n", + "* Current Win Streak: 0\n", + "* Draws: 0\n", + "* Losses: 8\n", + "* Wins: 21\n", + "* Stance: Orthodox\n", + "* Height: 188\n", + "* Reach: 193\n", + "* Weight: 205\n", + "\n", + "#### Aleksandar Rakic (Red):\n", + "* Current Lose Streak: 0\n", + "* Current Win Streak: 1\n", + "* Draws: 0\n", + "* Losses: 2\n", + "* Wins: 13\n", + "* Stance: Orthodox\n", + "* Height: 193\n", + "* Reach: 198\n", + "* Weight: 205" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With all the 
stats available to us, we can create a data frame to feed into our models and get predictions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns = X.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fight1 = [5, 0, 4, 0, 8, 27, 0, 188, 198, 205, 0, 20, 0, 0, 20, 1, 193, 203, 193]\n", + "fight2 = [5, 0, 11, 0, 4, 20, 0, 173, 165, 145, 0, 2, 0, 4, 11, 0, 183, 183, 145]\n", + "fight3 = [5, 0, 10, 0, 1, 15, 1, 170, 170, 135, 0, 5, 0, 3, 19, 0, 170, 180, 135]\n", + "fight4 = [3, 0, 6, 0, 1, 18, 0, 178, 178, 155, 0, 3, 0, 9, 23, 2, 173, 178, 155]\n", + "fight5 = [3, 2, 0, 0, 8, 21, 0, 188, 193, 205, 0, 1, 0, 2, 13, 0, 193, 198, 205]\n", + "\n", + "df1 = pd.DataFrame(np.array([fight1, fight2, fight3, fight4, fight5]), columns = columns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Model Predictions: (0 indicates Red Fighter Wins, 1 indicates Blue Fighter Wins)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Support Vector Classifier:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "svc_model.predict(df1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Predicted Winners:\n", + "* Fight 1 - Israel Adesanya\n", + "* Fight 2 - Megan Anderson\n", + "* Fight 3 - Aljamain Sterling\n", + "* Fight 4 - Drew Dober\n", + "* Fight 5 - Aleksandar Rakic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Random Forest:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rf_model.predict(df1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Predicted Winners:\n", + "* Fight 1 - Israel Adesanya\n", + "* Fight 2 - Megan Anderson\n", + "* Fight 3 - Aljamain 
Sterling\n", + "* Fight 4 - Islam Makhachev\n", + "* Fight 5 - Aleksandar Rakic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Logistic Regression:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lr_model.predict(df1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Predicted Winners:\n", + "* Fight 1 - Israel Adesanya\n", + "* Fight 2 - Megan Anderson\n", + "* Fight 3 - Aljamain Sterling\n", + "* Fight 4 - Drew Dober\n", + "* Fight 5 - Aleksandar Rakic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Full Feature Modeling\n", + "We'll reduce this to a binary classification problem by dropping draws, especially since draws essentially no longer occur in the UFC under the new scoring system." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df['Winner'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = df[df['Winner'] != 'Draw']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = df.drop(columns=['R_fighter', 'B_fighter', 'Referee', 'date', 'location', 'title_bout', 'weight_class', 'B_draw', 'R_draw'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mapping = {'Orthodox': 0, 'Switch': 1, 'Southpaw': 2, 'Open Stance': 3}\n", + "df['B_Stance'] = df['B_Stance'].replace(mapping)\n", + "df['R_Stance'] = df['R_Stance'].replace(mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X = df.drop(columns=['Winner'])\n", + "Y = df['Winner']\n", + "mapping = {'Red': 0, 'Blue': 1}\n", + "Y = Y.replace(mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + 
"outputs": [], + "source": [ + "X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seed = 404\n", + "np.random.seed(seed)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gnb = GaussianNB()\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(gnb, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('Gaussian Naive Bayes K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Gaussian Naive Bayes Average Score:')\n", + "print(cv_score.mean())\n", + "print()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lr = LogisticRegression(max_iter = 10000)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(lr, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('Logistic Regression K-fold Scores (training):')\n", + "print(cv_score)\n", + "print()\n", + "print('Logistic Regression Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dt = tree.DecisionTreeClassifier(random_state = 1)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(dt, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('Decision Tree K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Decision Tree Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "knn = KNeighborsClassifier()\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(knn, 
X_train, y_train.values.ravel(), cv=kfold)\n", + "print('KNN K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('KNN Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rf = RandomForestClassifier(random_state = 1)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(rf, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('Random Forest K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Random Forest Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "svc = SVC(probability = True)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(svc, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('Support Vector Classification K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('Support Vector Classification Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xgb = XGBClassifier(objective='binary:logistic',random_state =1, use_label_encoder=False)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(xgb, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('XGBoost Classifier K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('XGBoost Classifier Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from keras.wrappers.scikit_learn import KerasClassifier\n", + "from keras.utils import np_utils\n", + "\n", + "encoder = LabelEncoder()\n", + "encoder.fit(y_train)\n", + "encoded_Y = 
encoder.transform(y_train)\n", + "y_Train = np_utils.to_categorical(encoded_Y)\n", + "\n", + "encoder = LabelEncoder()\n", + "encoder.fit(y_test)\n", + "y_Test = encoder.transform(y_test)\n", + "\n", + "\n", + "def create_model():\n", + " model = Sequential()\n", + " \n", + " model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], activation='relu'))\n", + " model.add(Dense(X_train.shape[1]*2, activation='tanh'))\n", + " model.add(Dense(X_train.shape[1]*4, activation='tanh'))\n", + " model.add(Dense(X_train.shape[1]*2, activation='tanh')) \n", + " model.add(Dense(X_train.shape[1], activation='relu'))\n", + " model.add(Dense(2, activation='softmax'))\n", + "\n", + " model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", + " return model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seed = 7\n", + "np.random.seed(seed)\n", + "\n", + "model = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)\n", + "\n", + "\n", + "kfold = KFold(n_splits=10, shuffle=True)\n", + "results = cross_val_score(model, X_train, y_Train, cv=kfold)\n", + "print('Neural Network K-fold Scores:')\n", + "print(results)\n", + "print()\n", + "print('Neural Network Average Score:')\n", + "print(results.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lr = LogisticRegression(max_iter = 2000)\n", + "lr_model = lr.fit(X_train, y_train.values.ravel())\n", + "y_pred = lr_model.predict(X_test)\n", + "\n", + "disp = plot_confusion_matrix(lr_model, X_test, y_test)\n", + "disp.ax_.set_title('Logistic Regression Confusion Matrix')\n", + "\n", + "plt.show()\n", + "print('Logistic Regression Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rf = 
RandomForestClassifier(random_state = 1)\n", + "rf_model = rf.fit(X_train, y_train.values.ravel())\n", + "y_pred = rf_model.predict(X_test)\n", + "disp = plot_confusion_matrix(rf_model, X_test, y_test)\n", + "disp.ax_.set_title('Random Forest Confusion Matrix')\n", + "\n", + "plt.show()\n", + "\n", + "print('Random Forest Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "svc = SVC(probability = True)\n", + "svc_model = svc.fit(X_train, y_train.values.ravel())\n", + "y_pred = svc_model.predict(X_test)\n", + "disp = plot_confusion_matrix(svc_model, X_test, y_test)\n", + "disp.ax_.set_title('Support Vector Classifier Confusion Matrix')\n", + "\n", + "plt.show()\n", + "\n", + "print('SVC Model Accuracy (on testing set): ')\n", + "print(accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xgb = XGBClassifier(objective='binary:logistic',random_state =1, use_label_encoder=False)\n", + "kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)\n", + "cv_score = cross_val_score(xgb, X_train, y_train.values.ravel(), cv=kfold)\n", + "print('XGBoost Classifier K-fold Scores:')\n", + "print(cv_score)\n", + "print()\n", + "print('XGBoost Classifier Average Score:')\n", + "print(cv_score.mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "# Conclusion\n", + "\n", + "If we look at the current under/over odds in the betting world, most agree with the model predictions for Fight 1 and Fight 5, but only the Random Forest is in line with the odds for Fight 4. 
Fight 3 has the odds at -110 to -110, so Vegas seems to be split evenly on this fight.\n", + "\n", + "I hope to expand the models to take in more features, and as a fan I can't help but think strike %, takedown %, takedown defence %, and a number of the various other features definitely come into play when assessing winner outcome.\n", + "\n", + "I hope you enjoyed this!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 14159da7083687d43a21dad9175e10b6f034aea0 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Tue, 5 Nov 2024 19:44:00 +0530 Subject: [PATCH 2/2] Create Readme.md --- .../MMA_Fight_prediction/Readme.md | 45 +++++++++++++++++++ 1 file changed, 45 insertions(+) create mode 100644 Prediction Models/MMA_Fight_prediction/Readme.md diff --git a/Prediction Models/MMA_Fight_prediction/Readme.md b/Prediction Models/MMA_Fight_prediction/Readme.md new file mode 100644 index 00000000..4ef4300b --- /dev/null +++ b/Prediction Models/MMA_Fight_prediction/Readme.md @@ -0,0 +1,45 @@ +# MMA Fight Prediction Model + +This repository contains a machine learning model designed to predict the outcome of MMA fights. Using historical data and various fighter statistics, the model aims to determine the probability of each fighter winning a given matchup. 
+ +## Table of Contents +- [Introduction](#introduction) +- [Problem Statement](#problem-statement) +- [Model Overview](#model-overview) +- [Data](#data) +- [Installation](#installation) +- [Usage](#usage) +- [Results](#results) +- [Contributing](#contributing) +- [License](#license) + +## Introduction + +Predicting the outcome of MMA fights is challenging due to the high variability of the sport. Factors such as a fighter's style, reach, weight, previous fight record, and current form all influence the fight outcome. This project explores using machine learning techniques to analyze fight data and predict the probability of each fighter winning a matchup. + +## Problem Statement + +MMA fight outcomes are influenced by numerous factors, many of which are dynamic and hard to quantify. This project aims to address: +- **Outcome Variability**: Accurately predicting outcomes amid unpredictable events and variations. +- **Data Limitations**: Working with potentially sparse or incomplete historical data. +- **Feature Engineering**: Identifying significant features to improve prediction accuracy. +- **Generalization**: Ensuring the model works well across various fighters and events. + +## Model Overview + +The model uses historical fight data and fighter statistics to predict fight outcomes. It leverages a mix of machine learning techniques, such as: +- Logistic Regression +- Decision Trees +- Ensemble Methods + +Key features include fighter-specific data like win-loss records, average fight time, strike accuracy, and takedown success rate. By training on historical fight outcomes, the model aims to generalize to future fight predictions. + +## Data + +The model is built using historical MMA data, which includes: +- Fighter statistics: strikes, takedowns, reach, weight, etc. +- Fight records: win/loss record, recent performance, and historical matchup outcomes. +- Event details: location, fight date, and weight class. 
+ +**Note**: Data files should be placed in the `data/` directory in CSV format. Example data files are provided in the repository. +