From a4042bcacff19b04633883493db4981bdd411041 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sat, 9 Nov 2024 18:13:22 +0530 Subject: [PATCH 1/4] fixes 848 --- ...diction-with-80-accuracy-in-dl-model.ipynb | 1021 +++++++++++++++++ 1 file changed, 1021 insertions(+) create mode 100644 Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb diff --git a/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb b/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb new file mode 100644 index 00000000..57e497a2 --- /dev/null +++ b/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb @@ -0,0 +1,1021 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "_uuid": "6e01264e76113262d78c91b0f0a173a494deb75d" + }, + "source": [ + "# How can player names be used to predict the World Cup with 80% accuracy?\n", + "\n", + "![](https://upload.wikimedia.org/wikipedia/en/thumb/6/67/2018_FIFA_World_Cup.svg/227px-2018_FIFA_World_Cup.svg.png)\n", + "Photo Credit: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup\n", + "\n", + "In this kernel, I am going to demonstrate how player names can be used with TensorFlow to predict historical World Cup results with 80% accuracy. 
Later on, I tried to predict the World Cup 2018 result, and it seems there are some interesting findings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", + "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# This Python 3 environment comes with many helpful analytics libraries installed\n", + "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", + "# For example, here's several helpful packages to load in \n", + "\n", + "import numpy as np # linear algebra\n", + "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", + "\n", + "# Input data files are available in the \"../input/\" directory.\n", + "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", + "\n", + "import os\n", + "print(os.listdir(\"../input\"))\n", + "\n", + "# Any results you write to the current directory are saved as output." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "529b611e7f925b7bdb38803636dba9d84317fc39" + }, + "source": [ + "# Stage\n", + "1. Data Ingestion: First of all, I load 3 dataframes from the given data set and build the World Cup 2018 data manually from websites\n", + "2. Preprocessing: Prepare the features needed to build the character embedding input\n", + "3. Transformation: Transform the processed data for model input\n", + "4. Model Building: Build the Character Embedding model\n", + "5. 
Evaluation: Try to predict World Cup 2018 result\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "eff2e82e7d082597c2051f4e4c5b275ad364dd15" + }, + "source": [ + "# Data Ingestion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "07b8e2651463ab0c72fea1f23fc330a57e8714fe", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "summary_df = pd.read_csv('../input/WorldCups.csv')\n", + "summary_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "a7684d9576da866f230d8fa59f558f2e3b467a83", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "match_df = pd.read_csv('../input/WorldCupMatches.csv')\n", + "match_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", + "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "player_df = pd.read_csv('../input/WorldCupPlayers.csv')\n", + "player_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "fec1faa7b94e977905f97817416a5b51614bcdf5" + }, + "source": [ + "### Get 2018 player list" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "b3db5111ea1697ff56b14ff0d84aac21feacd26a", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# Players Source :http://www.goal.com/en/news/revealed-every-world-cup-2018-squad-23-man-preliminary-lists/oa0atsduflsv1nsf6oqk576rb\n", + "# Coach Source: https://www.fifa.com/worldcup/players/coaches/\n", + "\n", + "worldcup_2018_data = []\n", + "worldcup_2018_data.extend([{\n", + " 'country': 'russia',\n", + " 'group': 'A',\n", + " 'coach': 'Cherchesov Stanislav',\n", + " 'name': 'Igor 
Akinfeev, Vladimir Gabulov, Andrey Lunev; Sergei Ignashevich, Mario Fernandes, Vladimir Granat, Fyodor Kudryashov, Andrei Semyonov, Igor Smolnikov, Ilya Kutepov, Aleksandr Yerokhin, Yuri Zhirkov, Daler Kuzyaev, Aleksandr Golovin, Alan Dzagoev, Roman Zobnin, Aleksandr Samedov, Yuri Gazinsky, Anton Miranchuk, Denis Cheryshev, Artyom Dzyuba, Aleksei Miranchuk, Fyodor Smolov'\n", + "}, {\n", + " 'country': 'saudi_arabia',\n", + " 'group': 'A',\n", + " 'coach': 'Pizzi Juan Antonio',\n", + " 'name': 'Mohammed Al-Owais, Yasser Al-Musailem, Abdullah Al-Mayuf; Mansoor Al-Harbi, Yasser Al-Shahrani, Mohammed Al-Burayk, Motaz Hawsawi, Osama Hawsawi, Ali Al-Bulaihi, Omar Othman; Abdullah Alkhaibari, Abdulmalek Alkhaibri, Abdullah Otayf, Taiseer Al-Jassam, Hussain Al-Moqahwi, Salman Al-Faraj, Mohamed Kanno, Hatan Bahbir, Salem Al-Dawsari, Yahia Al-Shehri, Fahad Al-Muwallad, Mohammad Al-Sahlawi, Muhannad Assiri'\n", + "}, {\n", + " 'country': 'egypt',\n", + " 'group': 'A',\n", + " 'coach': 'Cuper Hector',\n", + " 'name': 'Essam El Hadary, Mohamed El-Shennawy, Sherif Ekramy; Ahmed Fathi, Abdallah Said, Saad Samir, Ayman Ashraf, Mohamed Abdel-Shafy, Ahmed Hegazi, Ali Gabr, Ahmed Elmohamady, Omar Gaber; Tarek Hamed, Mahmoud Shikabala, Sam Morsy, Mohamed Elneny, Mahmoud Kahraba, Ramadan Sobhi, Trezeguet, Amr Warda; Marwan Mohsen, Mohamed Salah, Mahmoud Elwensh'\n", + "}, {\n", + " 'country': 'uruguay',\n", + " 'group': 'A',\n", + " 'coach': 'Tabarez Oscar',\n", + " 'name': 'Fernando Muslera, Martin Silva, Martin Campana, Diego Godin, Sebastian Coates, Jose Maria Gimenez, Maximiliano Pereira, Gaston Silva, Martin Caceres, Guillermo Varela, Nahitan Nandez, Lucas Torreira, Matias Vecino, Rodrigo Bentancur, Carlos Sanchez, Giorgian De Arrascaeta, Diego Laxalt, Cristian Rodriguez, Jonathan Urretaviscaya, Cristhian Stuani, Maximiliano Gomez, Edinson Cavani, Luis Suarez'\n", + "}, {\n", + " 'country': 'portugal',\n", + " 'group': 'B',\n", + " 'coach': 'Santos Fernando',\n", + " 'name': 
'Anthony Lopes, Beto, Rui Patricio, Bruno Alves, Cedric Soares, Jose Fonte, Mario Rui, Pepe, Raphael Guerreiro, Ricardo Pereira, Ruben Dias, Adrien Silva, Bruno Fernandes, Joao Mario, Joao Moutinho, Manuel Fernandes, William Carvalho, Andre Silva, Bernardo Silva, Cristiano Ronaldo, Gelson Martins, Goncalo Guedes, Ricardo Quaresma'\n", + "}, {\n", + " 'country': 'spain',\n", + " 'group': 'B',\n", + " 'coach': 'Hierro Fernando',\n", + " 'name': 'David de Gea, Pepe Reina, Kepa Arrizabalaga; Dani Carvajal, Alvaro Odriozola, Gerard Pique, Sergio Ramos, Nacho, Cesar Azpilicueta, Jordi Alba, Nacho Monreal; Sergio Busquets, Saul Niquez, Koke, Thiago Alcantara, Andres Iniesta, David Silva; Isco, Marcio Asensio, Lucas Vazquez, Iago Aspas, Rodrigo, Diego Costa'\n", + "}, {\n", + " 'country': 'morocco',\n", + " 'group': 'B',\n", + " 'coach': 'Renard Herve',\n", + " 'name': \"Mounir El Kajoui, Yassine Bounou, Ahmad Reda Tagnaouti, Mehdi Benatia, Romain Saiss, Manuel Da Costa, Badr Benoun, Nabil Dirar, Achraf Hakimi, Hamza Mendyl; M'bark Boussoufa, Karim El Ahmadi, Youssef Ait Bennasser, Sofyan Amrabat, Younes Belhanda, Faycal Fajr, Amine Harit; Khalid Boutaib, Aziz Bouhaddouz, Ayoub El Kaabi, Nordin Amrabat, Mehdi Carcela, Hakim Ziyech\"\n", + "}, {\n", + " 'country': 'iran',\n", + " 'group': 'B',\n", + " 'coach': 'Queiroz Carlos',\n", + " 'name': 'Alireza Beiranvand, Rashid Mazaheri, Amir Abedzadeh; Ramin Rezaeian, Mohammad Reza Khanzadeh, Morteza Pouraliganji, Pejman Montazeri, Seyed Majid Hosseini, Milad Mohammadi, Roozbeh Cheshmi; Saeid Ezatolahi, Masoud Shojaei, Saman Ghoddos, Mehdi Torabi, Ashkan Dejagah, Omid Ebrahimi, Ehsan Hajsafi, Vahid Amiri; Alireza Jahanbakhsh, Karim Ansarifard, Mahdi Taremi, Sardar Azmoun, Reza Ghoochannejhad'\n", + "}, {\n", + " 'country': 'france',\n", + " 'group': 'C',\n", + " 'coach': 'Deschamps Didier',\n", + " 'name': \"Alphonse Areola, Hugo Lloris, Steve Mandanda; Lucas Hernandez, Presnel Kimpembe, Benjamin Mendy, Benjamin Pavard, Adil 
Rami, Djibril Sidibe, Samuel Umtiti, Raphael Varane; N'Golo Kante, Blaise Matuidi, Steven N'Zonzi, Paul Pogba, Corentin Tolisso, Ousmane Dembele, Nabil Fekir; Olivier Giroud, Antoine Griezmann, Thomas Lemar, Kylian Mbappe, Florian Thauvin\"\n", + "}, {\n", + " 'country': 'australia',\n", + " 'group': 'C',\n", + " 'coach': 'Van Marwur bert',\n", + " 'name': 'Brad Jones, Mat Ryan, Danny Vukovic; Aziz Behich, Milos Degenek, Matthew Jurman, James Meredith, Josh Risdon, Trent Sainsbury; Jackson Irvine, Mile Jedinak, Robbie Kruse, Massimo Luongo, Mark Milligan, Aaron Mooy, Tom Rogic; Daniel Arzani, Tim Cahill, Tomi Juric, Mathew Leckie, Andrew Nabbout, Dimitri Petratos, Jamie Maclaren'\n", + "}, {\n", + " 'country': 'peru',\n", + " 'group': 'C',\n", + " 'coach': 'Gareca Ricardo',\n", + " 'name': 'Carlos Caceda, Jose Carvallo, Pedro Gallese, Luis Advincula, Pedro Aquino, Miguel Araujo, Andre Carrillo, Wilder Cartagena, Aldo Corzo, Christian Cueva, Jefferson Farfan, Edison Flores, Paolo Hurtado, Nilson Loyola, Andy Polo, Christian Ramos, Alberto Rodriguez, Raul Ruidiaz, Anderson Santamaria, Renato Tapia, Miguel Trauco, Yoshimar Yotun, Paolo Guerrero'\n", + "}, {\n", + " 'country': 'denmark',\n", + " 'group': 'C',\n", + " 'coach': 'Hareide Age',\n", + " 'name': 'Kasper Schmeichel, Jonas Lossl, Frederik Ronow; Simon Kjaer, Andreas Christensen, Mathias Jorgensen, Jannik Vestergaard, Henrik Dalsgaard, Jens Stryger, Jonas Knudsen; William Kvist, Thomas Delaney, Lukas Lerager, Lasse Schone, Christian Eriksen, Michael Krohn-Dehli; Pione Sisto, Martin Braithwaite, Andreas Cornelius, Viktor Fischer, Yussuf Poulsen, Nicolai Jorgensen, Kasper Dolberg'\n", + "}, {\n", + " 'country': 'argentina',\n", + " 'group': 'D',\n", + " 'coach': 'Sampaoli Jorge',\n", + " 'name': 'Nahuel Guzmán, Willy Caballero, Franco Armani; Gabriel Mercado, Nicolas Otamendi, Federico Fazio, Nicolas Tagliafico, Marcos Rojo, Marcos Acuna, Cristian Ansaldi, Eduardo Salvio; Javier Mascherano, Angel Di Maria, Ever 
Banega, Lucas Biglia, Manuel Lanzini, Gio Lo Celso, Maximiliano Meza; Lionel Messi, Sergio Aguero, Gonzalo Higuain, Paulo Dybala, Cristian Pavon'\n", + "}, {\n", + " 'country': 'iceland',\n", + " 'group': 'D',\n", + " 'coach': 'Hallgrimsson Heimir',\n", + " 'name': 'Hannes Thor Halldorsson, Runar Alex Runarsson, Frederik Schram; Kari Arnason, Ari Freyr Skulason, Birkir Mar Saevarsson, Sverrir Ingi Ingason, Hordur Magnusson, Holmar Orn Eyjolfsson, Ragnar Sigurdsson; Johann Berg Gudmundsson, Birkir Bjarnason, Arnor Ingvi Traustason, Emil Hallfredsson, Gylfi Sigurdsson, Olafur Ingi Skulason, Rurik Gislason, Samuel Fridjonsson, Aron Gunnarsson; Alfred Finnbogason, Bjorn Bergmann Sigurdarson, Jon Dadi Bodvarsson, Albert Gudmundsson'\n", + "}, {\n", + " 'country': 'croatia',\n", + " 'group': 'D',\n", + " 'coach': 'Dalic Zlatko',\n", + " 'name': 'Danijel Subasic, Lovre Kalinic, Dominik Livakovic; Vedran Corluka, Domagoj Vida, Ivan Strinic, Dejan Lovren, Sime Vrsaljko, Josip Pivaric, Tin Jedvaj, Duje Caleta-Car; Luka Modric, Ivan Rakitic, Mateo Kovacic, Milan Badelj, Marcelo Brozovic, Filip Bradaric; Mario Mandzukic, Ivan Perisic, Nikola Kalinic, Andrej Kramaric, Marko Pjaca, Ante Rebic'\n", + "}, {\n", + " 'country': 'nigeria',\n", + " 'group': 'D',\n", + " 'coach': 'Rohr Gernot',\n", + " 'name': 'Ikechukwu Ezenwa, Daniel Akpeyi, Francis Uzoho; William Troost-Ekong, Leon Balogun, Kenneth Omeruo, Bryan Idowu, Chidozie Awaziem, Abdullahi Shehu, Elderson Echiejile, Tyronne Ebuehi; John Obi Mikel, Ogenyi Onazi, John Ogu, Wilfred Ndidi, Oghenekaro Etebo, Joel Obi; Odion Ighalo, Ahmed Musa, Victor Moses, Alex Iwobi, Kelechi Iheanacho, Simeon Nwankwo'\n", + "}, {\n", + " 'country': 'brazil',\n", + " 'group': 'E',\n", + " 'coach': 'Tite',\n", + " 'name': ' Alisson, Ederson, Cassio; Danilo, Fagner, Marcelo, Filipe Luis, Thiago Silva, Marquinhos, Miranda, Pedro Geromel; Casemiro, Fernandinho, Paulinho, Fred, Renato Augusto, Philippe Coutinho, Willian, Douglas Costa; Neymar, Taison, 
Gabriel Jesus, Roberto Firmino'\n", + "}, {\n", + " 'country': 'switzerland',\n", + " 'group': 'E',\n", + " 'coach': 'Petkovic Vladimir',\n", + " 'name': 'Roman Burki, Yvon Mvogo, Yann Sommer; Manuel Akanji, Johan Djourou, Nico Elvedi, Michael Lang, Stephan Lichtsteiner, Jacques-Francois Moubandje, Ricardo Rodriguez, Fabian Schaer; Valon Behrami, Blerim Dzemaili, Gelson Fernandes, Remo Freuler, Xherdan Shaqiri, Granit Xhaka, Steven Zuber, Denis Zakaria; Josip Drmic, Breel Embolo, Mario Gavranovic, Haris Seferovic'\n", + "}, {\n", + " 'country': 'costa_rica',\n", + " 'group': 'E',\n", + " 'coach': 'Ranurez Oscar',\n", + " 'name': 'Keylor Navas, Patrick Pemberton, Leonel Moreira, Cristian Gamboa, Ian Smith, Ronald Matarrita, Bryan Oviedo, Oscar Duarte, Giancarlo Gonzalez, Francisco Calvo, Kendall Waston, Johnny Acosta, David Guzman, Yeltsin Tejeda, Celso Borges, Randall Azofeifa, Rodney Wallace, Bryan Ruiz, Daniel Colindres, Christian Bolanos, Johan Venegas, Joel Campbell, Marco Urena'\n", + "}, {\n", + " 'country': 'serbia',\n", + " 'group': 'E',\n", + " 'coach': 'Krstajic Mladen',\n", + " 'name': ' Vladimir Stojkovic, Predrag Rajkovic, Marko Dmitrovic, Aleksandar Kolarov, Antonio Rukavina, Milan Rodic, Branislav Ivanovic, Uros Spajic, Milos Veljkovic, Dusko Tosic, Nikola Milenkovic; Nemanja Matic, Luka Milivojevic, Marko Grujic, Dusan Tadic, Andrija Zivkovic, Filip Kostic, Nemanja Radonjic, Sergej Milinkovic-Savic, Adem Ljajic; Aleksandar Mitrovic, Aleksandar Prijovic, Luka Jovic'\n", + "}, {\n", + " 'country': 'germany',\n", + " 'group': 'F',\n", + " 'coach': 'Low Joachim',\n", + " 'name': 'Manuel Neuer, Marc-Andre ter Stegen, Kevin Trapp; Jerome Boateng, Matthias Ginter, Jonas Hector, Mats Hummels, Joshua Kimmich, Marvin Plattenhardt, Antonio Rudiger, Niklas Sule; Julian Brandt, Julian Draxler, Mario Gomez, Leon Goretzka, Ilkay Gundogan, Sami Khedira, Toni Kroos, Thomas Muller, Mesut Ozil, Marco Reus, Sebastian Rudy, Timo Werner'\n", + "}, {\n", + " 'country': 
'mexico',\n", + " 'group': 'F',\n", + " 'coach': 'Osorio Juan Carlos',\n", + " 'name': 'Jesus Corona, Alfredo Talavera, Guillermo Ochoa; Hugo Ayala, Carlos Salcedo, Diego Reyes, Miguel Layun, Hector Moreno, Edson Alvarez; Rafael Marquez, Jonathan dos Santos, Marco Fabian, Giovani dos Santos, Hector Herrera, Andres Guardado; Raul Jimenez, Carlos Vela, Javier Hernandez, Jesus Corona, Oribe Peralta, Javier Aquino, Hirving Lozano'\n", + "}, {\n", + " 'country': 'sweden',\n", + " 'group': 'F',\n", + " 'coach': 'Andersson Janne',\n", + " 'name': 'Robin Olsen, Karl-Johan Johnsson, Kristoffer Nordfeldt, Mikael Lustig, Victor Lindelof, Andreas Granqvist, Martin Olsson, Ludwig Augustinsson, Filip Helander, Emil Krafth, Pontus Jansson, Sebastian Larsson, Albin Ekdal, Emil Forsberg, Gustav Svensson, Oscar Hiljemark, Viktor Claesson, Marcus Rohden, Jimmy Durmaz, Marcus Berg, John Guidetti, Ola Toivonen, Isaac Kiese Thelin'\n", + "}, {\n", + " 'country': 'south_korea',\n", + " 'group': 'F',\n", + " 'coach': 'Shin Taeyong',\n", + " 'name': 'Kim Seunggyu, Kim Jinhyeon, Cho Hyeonwoo, Kim Younggwon, Jang Hyunsoo, Jeong Seunghyeon, Yun Yeongseon, Oh Bansuk, Kim Minwoo, Park Jooho, Hong Chul, Go Yohan, Lee Yong, Ki Sungyueng, Jeong Wooyoung, Ju Sejong, Koo Jacheol, Lee Jaesung, Lee Seungwoo, Moon Sunmin, Kim Shinwook, Son Heungmin, Hwang Heechan'\n", + "}, {\n", + " 'country': 'belgium',\n", + " 'group': 'G',\n", + " 'coach': 'Martinez Roberto',\n", + " 'name': 'Koen Casteels, Thibaut Courtois, Simon Mignolet; Toby Alderweireld, Dedryck Boyata, Vincent Kompany, Thomas Meunier, Thomas Vermaelen, Jan Vertonghen; Nacer Chadli, Kevin De Bruyne, Mousa Dembele, Leander Dendoncker, Marouane Fellaini, Youri Tielemans, Axel Witsel; Michy Batshuayi, Yannick Carrasco, Eden Hazard, Thorgan Hazard, Adnan Januzaj, Romelu Lukaku, Dries Mertens'\n", + "}, {\n", + " 'country': 'panama',\n", + " 'group': 'G',\n", + " 'coach': 'Gomez Hernan',\n", + " 'name': 'Jose Calderon, Jaime Penedo, Alex Rodríguez; 
Felipe Baloy, Harold Cummings, Eric Davis, Fidel Escobar, Adolfo Machado, Michael Murillo, Luis Ovalle, Roman Torres; Edgar Barcenas, Armando Cooper, Anibal Godoy, Gabriel Gomez, Valentin Pimentel, Alberto Quintero, Jose Luis Rodriguez; Abdiel Arroyo, Ismael Diaz, Blas Perez, Luis Tejada, Gabriel Torres'\n", + "}, {\n", + " 'country': 'tunisia',\n", + " 'group': 'G',\n", + " 'coach': 'Maaloul Nabil',\n", + " 'name': 'Farouk Ben Mustapha, Moez Hassen, Aymen Mathlouthi, Rami Bedoui, Yohan Benalouane, Syam Ben Youssef, Dylan Bronn, Oussama Haddadi, Ali Maaloul, Yassine Meriah, Hamdi Nagguez, Anice Badri, Mohamed Amine Ben Amor, Ghaylene Chaalali, Ahmed Khalil, Saifeddine Khaoui, Ferjani Sassi, Ellyes Skhiri, Naim Sliti, Bassem Srarfi, Fakhreddine Ben Youssef, Saber Khalifa, Wahbi Khazri'\n", + "}, {\n", + " 'country': 'england',\n", + " 'group': 'G',\n", + " 'coach': 'Southgate Gareth',\n", + " 'name': 'Jack Butland, Nick Pope, Jordan Pickford; Fabian Delph, Danny Rose, Eric Dier, Kyle Walker, Kieran Trippier, Trent Alexander-Arnold, Harry Maguire, John Stones, Phil Jones, Gary Cahill; Jordan Henderson, Jesse Lingard, Ruben Loftus-Cheek, Ashley Young, Dele Alli, Raheem Sterling; Harry Kane, Jamie Vardy, Marcus Rashford, Danny Welbeck'\n", + "}, {\n", + " 'country': 'poland',\n", + " 'group': 'H',\n", + " 'coach': 'Nawalka Adam',\n", + " 'name': 'Bartosz Bialkowski, Lukasz Fabianski, Wojciech Szczesny; Jan Bednarek, Bartosz Bereszynski, Thiago Cionek, Kamil Glik, Artur Jedrzejczyk, Michal Pazdan, Lukasz Piszczek; Jakub Blaszczykowski, Jacek Goralski, Kamil Goricki, Grzegorz Krychowiak, Slawomir Peszko, Maciej Rybus, Piotr Zielinski, Rafal Kurzawa, Karol Linetty; Dawid Kownacki, Robert Lewandowski, Arkadiusz Milik, Lukasz Teodorczyk'\n", + "}, {\n", + " 'country': 'senegal',\n", + " 'group': 'H',\n", + " 'coach': 'Cisse Aliou',\n", + " 'name': 'Abdoulaye Diallo, Khadim Ndiaye, Alfred Gomis, Lamine Gassama, Moussa Wague, Saliou Ciss, Youssouf Sabaly, Kalidou Koulibaly, 
Salif Sane, Cheikhou Kouyate, Kara Mbodji, Idrisa Gana Gueye, Cheikh Ndoye, Alfred Ndiaye, Pape Alioune Ndiaye, Moussa Sow, Moussa Konate, Diafra Sakho, Sadio Mane, Ismaila Sarr, Mame Biram Diouf, Mbaye Niang, Diao Keita Balde'\n", + "}, {\n", + " 'country': 'colombia',\n", + " 'group': 'H',\n", + " 'coach': 'Pekerman Jose',\n", + " 'name': 'David Ospina, Camilo Vargas, Jose Fernando Cuadrado; Cristian Zapata, Davinson Sanchez, Santiago Arias, Oscar Murillo, Frank Fabra, Johan Mojica, Yerry Mina; Wilmar Barrios, Carlos Sanchez, Jefferson Lerma, Jose Izquierdo, James Rodriguez, Abel Aguilar, Juan Fernando Quintero, Mateus Uribe, Juan Guillermo Cuadrado; Radamel Falcao Garcia, Miguel Borja, Carlos Bacca, Luis Fernando Muriel'\n", + "}, {\n", + " 'country': 'japan',\n", + " 'group': 'H',\n", + " 'coach': 'Nishino Akira',\n", + " 'name': 'Eiji Kawashima, Masaaki Higashiguchi, Kosuke Nakamura, Yuto Nagatomo, Tomoaki Makino, Maya Yoshida, Hiroki Sakai, Gotoku Sakai, Gen Shoji, Wataru Endo, Naomichi Ueda, Makoto Hasebe, Keisuke Honda, Takashi Inui, Shinji Kagawa, Hotaru Yamaguchi, Genki Haraguchi, Takashi Usami, Gaku Shibasaki, Ryota Oshima, Shinji Okazaki, Yuya Osako, Yoshinori Muto'\n", + "}])\n", + "\n", + "worldcup_2018_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "58343c8a068358511995efd9a0c08a5b5d27999d" + }, + "source": [ + "# Preprocessing" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "4230dcf4378460efc5d86070b7ce885896a5f028" + }, + "source": [ + "### clean 2018 data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "0b5f50a88ab1e4ddf9e94cf199451c40a8cc3645", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "worldcup_2018_df = pd.DataFrame(worldcup_2018_data)\n", + "\n", + "def clean_merge(row):\n", + " name = row['name'].replace(';', ',').strip()\n", + " names = name.split(',')\n", + " names.append(row['coach'])\n", + " 
names = sorted(n.strip() for n in names)\n", + " return ' '.join(names)\n", + "\n", + "worldcup_2018_df['participants'] = worldcup_2018_df.apply(clean_merge, axis=1)\n", + "\n", + "worldcup_2018_df['label'] = 0\n", + "worldcup_2018_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "74c5d8e5d2b4caf11697cc59baaca508d50878f0" + }, + "source": [ + "### Get map of country code and name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "e169c6a453f7102be4ac2565ed30982cab3eaeeb", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "def country_name_code_mapping(df):\n", + " code2name = {}\n", + " name2code = {}\n", + " name2pos = {}\n", + " pos2name = {}\n", + "\n", + " working_df = df[['Home Team Name', 'Home Team Initials']].drop_duplicates()\n", + "\n", + " for i, row in working_df.iterrows():\n", + " code2name[row['Home Team Initials']] = row['Home Team Name']\n", + " name2code[row['Home Team Name']] = row['Home Team Initials']\n", + " \n", + " for i, name in enumerate(name2code):\n", + " name2pos[name] = i\n", + " pos2name[i] = name\n", + " \n", + " print('Name to Code Sample')\n", + " for x in name2code:\n", + " print(x, name2code[x])\n", + " break\n", + "\n", + " print('Code to Name Sample')\n", + " for x in code2name:\n", + " print(x, code2name[x])\n", + " break\n", + " \n", + " print('Name to Position Sample')\n", + " for x in name2pos:\n", + " print(x, name2pos[x])\n", + " break\n", + " \n", + " print('Position to Name Sample')\n", + " for x in pos2name:\n", + " print(x, pos2name[x])\n", + " break\n", + " \n", + " return code2name, name2code, name2pos, pos2name\n", + "\n", + "code2name, name2code, name2pos, pos2name = country_name_code_mapping(match_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "e1db34fc15dbbac46608ac6e6d12acd6b24ade18" + }, + "source": [ + "### Get all participants per match" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "_uuid": "5b1e11b23589b57f1e9cc86caebbfda8600b325b", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "def join_participants(match_id, team):\n", + " working_df = player_df[\n", + " (player_df['MatchID'] == match_id)\n", + " & (player_df['Team Initials'] == team)\n", + " ]\n", + " coachs = working_df['Coach Name'].unique().tolist()\n", + " players = working_df['Player Name'].unique().tolist()\n", + " \n", + " return coachs + players\n", + "\n", + "match_df['home_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Home Team Initials']), axis=1)\n", + "match_df['away_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Away Team Initials']), axis=1)\n", + "match_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "b2f8930084a2d3cb1e9fd9e929fcf2219ad27b14" + }, + "source": [ + "# Transformation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "b5676322aace7d598dd7248e265f34f58bce7900", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# code2name, name2code, name2pos, pos2name\n", + "\n", + "# results = []\n", + "\n", + "def _build_record(year, team_name):\n", + " list_of_home_participants = match_df[\n", + " (match_df['Year'] == year)\n", + " & (match_df['Home Team Name'] == team_name)\n", + " ]['home_participants'].tolist()\n", + " \n", + " list_of_away_participants = match_df[\n", + " (match_df['Year'] == year)\n", + " & (match_df['Away Team Name'] == team_name)\n", + " ]['away_participants'].tolist()\n", + " \n", + " participants = []\n", + " for ps in list_of_home_participants + list_of_away_participants:\n", + " participants.extend(ps)\n", + " participants = sorted(list(set(participants)))\n", + " \n", + " return ' '.join(participants)\n", + "\n", + "def get_non_first_fouth_team(year, 
positive_team_names):\n", + " home_names = match_df[match_df['Year'] == year]['Home Team Name'].unique().tolist()\n", + " away_names = match_df[match_df['Year'] == year]['Away Team Name'].unique().tolist()\n", + " non_winner_names = list(set(home_names + away_names))\n", + " for name in positive_team_names:\n", + " non_winner_names.remove(name)\n", + " \n", + " return non_winner_names\n", + "\n", + "def build_negative(year, positive_team_names):\n", + " non_winner_names = get_non_first_fouth_team(year, positive_team_names)\n", + " \n", + " results = []\n", + " for name in non_winner_names:\n", + " results.append({\n", + " 'label': 0,\n", + " 'name': _build_record(year, name)\n", + " })\n", + " \n", + " return results\n", + "\n", + "trainin_data = []\n", + "for i, row in summary_df.iterrows():\n", + " # positive samples (top four teams)\n", + " trainin_data.append({\n", + " 'label': 1,\n", + " 'name': _build_record(row['Year'], row['Winner'])\n", + " })\n", + " trainin_data.append({\n", + " 'label': 2,\n", + " 'name': _build_record(row['Year'], row['Runners-Up'])\n", + " })\n", + " trainin_data.append({\n", + " 'label': 3,\n", + " 'name': _build_record(row['Year'], row['Third'])\n", + " })\n", + " trainin_data.append({\n", + " 'label': 4,\n", + " 'name': _build_record(row['Year'], row['Fourth'])\n", + " })\n", + " \n", + " winner_names = [row['Winner'], row['Runners-Up'], row['Third'], row['Fourth']]\n", + " \n", + " # negative samples (all other teams)\n", + " results = build_negative(row['Year'], winner_names)\n", + " trainin_data.extend(results)\n", + " \n", + "training_df = pd.DataFrame(trainin_data)\n", + "\n", + "print('Number of Training Records: %d' % len(training_df))\n", + "training_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "48de565289b935ac42d7ce3826a715d330d4318b" + }, + "source": [ + "# Model Building" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "d79c8e4c13ea24622e4a2699aac1279ee857deb3" + }, + "source": [ + "You may think that 
Character Embedding seems meaningless, but there is a reason for using characters as features. If you want to understand more about Character Embedding, you may check out this article for reference\n", + "https://medium.com/@makcedward/besides-word-embedding-why-you-need-to-know-character-embedding-6096a34a3b10\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "a702beda72b94c33f1cd10dff73139178834d1bd", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "from nltk.tokenize import sent_tokenize\n", + "\n", + "class CharCNN:\n", + " CHAR_DICT = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .!?:,\\'%-\\(\\)/$|&;[]\"'\n", + " \n", + " def __init__(self, max_len_of_sentence, max_num_of_setnence, verbose=10):\n", + " self.max_len_of_sentence = max_len_of_sentence\n", + " self.max_num_of_setnence = max_num_of_setnence\n", + " self.verbose = verbose\n", + " \n", + " self.num_of_char = 0\n", + " self.num_of_label = 0\n", + " self.unknown_label = ''\n", + " \n", + " def build_char_dictionary(self, char_dict=None, unknown_label='UNK'):\n", + " \"\"\"\n", + " Define possible char set. 
Using \"UNK\" if character does not exist in this set\n", + " \"\"\" \n", + " \n", + " if char_dict is None:\n", + " char_dict = self.CHAR_DICT\n", + " \n", + " self.unknown_label = unknown_label\n", + "\n", + " chars = []\n", + "\n", + " for c in char_dict:\n", + " chars.append(c)\n", + "\n", + " chars = list(set(chars))\n", + " \n", + " chars.insert(0, unknown_label)\n", + "\n", + " self.num_of_char = len(chars)\n", + " self.char_indices = dict((c, i) for i, c in enumerate(chars))\n", + " self.indices_char = dict((i, c) for i, c in enumerate(chars))\n", + " \n", + " if self.verbose > 5:\n", + " print('Total number of chars:', self.num_of_char)\n", + "\n", + " print('First 3 char_indices sample:', {k: self.char_indices[k] for k in list(self.char_indices)[:3]})\n", + " print('First 3 indices_char sample:', {k: self.indices_char[k] for k in list(self.indices_char)[:3]})\n", + " \n", + "\n", + " return self.char_indices, self.indices_char, self.num_of_char\n", + " \n", + " def convert_labels(self, labels):\n", + " \"\"\"\n", + " Convert label to numeric\n", + " \"\"\"\n", + " self.label2indexes = dict((l, i) for i, l in enumerate(labels))\n", + " self.index2labels = dict((i, l) for i, l in enumerate(labels))\n", + "\n", + " if self.verbose > 5:\n", + " print('Label to Index: ', self.label2indexes)\n", + " print('Index to Label: ', self.index2labels)\n", + " \n", + " self.num_of_label = len(self.label2indexes)\n", + "\n", + " return self.label2indexes, self.index2labels\n", + " \n", + " def _transform_raw_data(self, df, x_col, y_col, label2indexes=None, sample_size=None):\n", + " \"\"\"\n", + " ##### Transform raw data to list\n", + " \"\"\"\n", + " \n", + " x = []\n", + " y = []\n", + "\n", + " actual_max_sentence = 0\n", + " \n", + " if sample_size is None:\n", + " sample_size = len(df)\n", + "\n", + " for i, row in df.head(sample_size).iterrows():\n", + " x_data = row[x_col]\n", + " y_data = row[y_col]\n", + "\n", + " sentences = sent_tokenize(x_data)\n", + " 
x.append(sentences)\n", + "\n", + " if len(sentences) > actual_max_sentence:\n", + " actual_max_sentence = len(sentences)\n", + "\n", + " y.append(label2indexes[y_data])\n", + "\n", + " if self.verbose > 5:\n", + " print('Number of records: %d' % (len(x)))\n", + " print('Actual max sentence: %d' % actual_max_sentence)\n", + "\n", + " return x, y\n", + " \n", + " def _transform_training_data(self, x_raw, y_raw, max_len_of_sentence=None, max_num_of_setnence=None):\n", + " \"\"\"\n", + " ##### Transform preprocessed data to numpy\n", + " \"\"\"\n", + " unknown_value = self.char_indices[self.unknown_label]\n", + " \n", + " if max_len_of_sentence is None:\n", + " max_len_of_sentence = self.max_len_of_sentence\n", + " if max_num_of_setnence is None:\n", + " max_num_of_setnence = self.max_num_of_setnence\n", + " \n", + " x = np.ones((len(x_raw), max_num_of_setnence, max_len_of_sentence), dtype=np.int64) * unknown_value\n", + " y = np.array(y_raw)\n", + "\n", + " for i, doc in enumerate(x_raw):\n", + " for j, sentence in enumerate(doc):\n", + " if j < max_num_of_setnence:\n", + " for t, char in enumerate(sentence[-max_len_of_sentence:]):\n", + " if char not in self.char_indices:\n", + " x[i, j, (max_len_of_sentence-1-t)] = unknown_value\n", + " else:\n", + " x[i, j, (max_len_of_sentence-1-t)] = self.char_indices[char]\n", + "\n", + " return x, y\n", + "\n", + " def _build_character_block(self, block, dropout=0.3, filters=[64, 100], kernel_size=[3, 3], \n", + " pool_size=[2, 2], padding='valid', activation='relu', \n", + " kernel_initializer='glorot_normal'):\n", + " \n", + " for i in range(len(filters)):\n", + " block = Conv1D(\n", + " filters=filters[i], kernel_size=kernel_size[i],\n", + " padding=padding, activation=activation, kernel_initializer=kernel_initializer)(block)\n", + "\n", + " block = Dropout(dropout)(block)\n", + " block = MaxPooling1D(pool_size=pool_size[i])(block)\n", + "\n", + " block = GlobalMaxPool1D()(block)\n", + " block = Dense(128, 
activation='relu')(block)\n", + " return block\n", + " \n", + " def _build_sentence_block(self, max_len_of_sentence, max_num_of_setnence, \n", + " char_dimension=16,\n", + " filters=[[3, 5, 7], [200, 300, 300], [300, 400, 400]], \n", + "# filters=[[100, 200, 200], [200, 300, 300], [300, 400, 400]], \n", + " kernel_sizes=[[4, 3, 3], [5, 3, 3], [6, 3, 3]], \n", + " pool_sizes=[[2, 2, 2], [2, 2, 2], [2, 2, 2]],\n", + " dropout=0.4):\n", + " \n", + " sent_input = Input(shape=(max_len_of_sentence, ), dtype='int64')\n", + " embedded = Embedding(self.num_of_char, char_dimension, input_length=max_len_of_sentence)(sent_input)\n", + " \n", + " blocks = []\n", + " for i, filter_layers in enumerate(filters):\n", + " blocks.append(\n", + " self._build_character_block(\n", + " block=embedded, filters=filters[i], kernel_size=kernel_sizes[i], pool_size=pool_sizes[i])\n", + " )\n", + "\n", + " sent_output = concatenate(blocks, axis=-1)\n", + " sent_output = Dropout(dropout)(sent_output)\n", + " sent_encoder = Model(inputs=sent_input, outputs=sent_output)\n", + "\n", + " return sent_encoder\n", + " \n", + " def _build_document_block(self, sent_encoder, max_len_of_sentence, max_num_of_setnence, \n", + " num_of_label, dropout=0.3, \n", + " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", + " doc_input = Input(shape=(max_num_of_setnence, max_len_of_sentence), dtype='int64')\n", + " doc_output = TimeDistributed(sent_encoder)(doc_input)\n", + "\n", + " doc_output = Bidirectional(LSTM(128, return_sequences=False, dropout=dropout))(doc_output)\n", + "\n", + " doc_output = Dropout(dropout)(doc_output)\n", + " doc_output = Dense(128, activation='relu')(doc_output)\n", + " doc_output = Dropout(dropout)(doc_output)\n", + " doc_output = Dense(num_of_label, activation='sigmoid')(doc_output)\n", + "\n", + " doc_encoder = Model(inputs=doc_input, outputs=doc_output)\n", + " doc_encoder.compile(loss=loss, optimizer=optimizer, metrics=metrics)\n", + " return 
doc_encoder\n", + " \n", + " def preporcess(self, labels, char_dict=None, unknown_label='UNK'):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: preprocess')\n", + " \n", + " self.build_char_dictionary(char_dict, unknown_label)\n", + " self.convert_labels(labels)\n", + " \n", + " def process(self, df, x_col, y_col, \n", + " max_len_of_sentence=None, max_num_of_setnence=None, label2indexes=None, sample_size=None):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: process')\n", + " \n", + " if sample_size is None:\n", + " sample_size = 1000\n", + " if label2indexes is None:\n", + " if self.label2indexes is None:\n", + " raise Exception('label2indexes is not initialized. Please invoke the preporcess step first')\n", + " label2indexes = self.label2indexes\n", + " if max_len_of_sentence is None:\n", + " max_len_of_sentence = self.max_len_of_sentence\n", + " if max_num_of_setnence is None:\n", + " max_num_of_setnence = self.max_num_of_setnence\n", + "\n", + " x_preprocess, y_preprocess = self._transform_raw_data(\n", + " df=df, x_col=x_col, y_col=y_col, label2indexes=label2indexes)\n", + " \n", + " x_preprocess, y_preprocess = self._transform_training_data(\n", + " x_raw=x_preprocess, y_raw=y_preprocess,\n", + " max_len_of_sentence=max_len_of_sentence, max_num_of_setnence=max_num_of_setnence)\n", + " \n", + " if self.verbose > 5:\n", + " print('Shape: ', x_preprocess.shape, y_preprocess.shape)\n", + "\n", + " return x_preprocess, y_preprocess\n", + " \n", + " def build_model(self, char_dimension=16, display_summary=False, display_architecture=False, \n", + " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: build model')\n", + " \n", + " sent_encoder = self._build_sentence_block(\n", + " char_dimension=char_dimension,\n", + " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence)\n", + " \n", + " doc_encoder = 
self._build_document_block(\n", + " sent_encoder=sent_encoder, num_of_label=self.num_of_label,\n", + " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence, \n", + " loss=loss, optimizer=optimizer, metrics=metrics)\n", + " \n", + " if display_architecture:\n", + " print('Sentence Architecture')\n", + " IPython.display.display(SVG(model_to_dot(sent_encoder).create(prog='dot', format='svg')))\n", + " print()\n", + " print('Document Architecture')\n", + " IPython.display.display(SVG(model_to_dot(doc_encoder).create(prog='dot', format='svg')))\n", + " \n", + " if display_summary:\n", + " doc_encoder.summary()\n", + " \n", + " \n", + " self.model = {\n", + " 'sent_encoder': sent_encoder,\n", + " 'doc_encoder': doc_encoder\n", + " }\n", + " \n", + " return doc_encoder\n", + " \n", + " def train(self, x_train, y_train, x_test, y_test, batch_size=128, epochs=1, shuffle=True):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: train model')\n", + " \n", + " self.get_model().fit(\n", + " x_train, y_train, validation_data=(x_test, y_test), \n", + " batch_size=batch_size, epochs=epochs, shuffle=shuffle)\n", + " \n", + "# return self.model['doc_encoder']\n", + "\n", + " def predict(self, x, model=None, return_prob=False):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: predict')\n", + " \n", + " if model is None:\n", + " model = self.get_model()\n", + " \n", + " if return_prob:\n", + " return model.predict(x)\n", + " \n", + " return model.predict(x).argmax(axis=-1)\n", + " \n", + " def get_model(self):\n", + " return self.model['doc_encoder']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "97425d0a69723820392eedfb1d4a2624d440b0ec", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "train_df, test_df = train_test_split(training_df, test_size=0.2)" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "7fc8d75dc5f14237a0eba2b8a34b6b4347e11964", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "char_cnn = CharCNN(max_len_of_sentence=256, max_num_of_setnence=1)\n", + "char_cnn.preporcess(labels=training_df['label'].unique())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "931f26a83e15ae8f617e25836502e212c8b5b321", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "x_train, y_train = char_cnn.process(\n", + " df=train_df, x_col='name', y_col='label')\n", + "x_test, y_test = char_cnn.process(\n", + " df=test_df, x_col='name', y_col='label')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "60a3811d3c8ab657a898110948aaa67948d581d4", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "import keras\n", + "from keras.models import Model, load_model\n", + "from keras.layers import Dense, Input, Dropout, MaxPooling1D, Conv1D, GlobalMaxPool1D, Bidirectional\n", + "from keras.layers import LSTM, Lambda, concatenate, BatchNormalization, Embedding\n", + "from keras.layers import TimeDistributed\n", + "from keras.optimizers import Adam\n", + "import tensorflow as tf\n", + "import keras.backend as K\n", + "\n", + "char_cnn.build_model()\n", + "char_cnn.train(x_train, y_train, x_test, y_test, batch_size=32, epochs=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "1ddf3f1b5982d8852903b841ec0df804b4bf68a1" + }, + "source": [ + "# Evaluations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "aa78af278ad827f9a8b3e6f0745592c456434802", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + }, + "scrolled": true + }, + "outputs": [], + "source": [ + "# Passing dummy label 
and getting dummy y_real only so that the existing process function can be reused to convert the input\n", + "x_real, y_real = char_cnn.process(\n", + " df=worldcup_2018_df, x_col='participants', y_col='label')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "41e16db1bdc3f29eb2fffa817884d3c55896a460" + }, + "source": [ + "The predictions are encoded label indices. Index 4 corresponds to the original label 0, and 0 means the team lost (it did not finish in the top four)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "18fcf0fd02415f69e1aaf25b8b15f3e8f676f18d", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "char_cnn.predict(x_real)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "91e195c8ed3b20722fc39a0f2bcd43d2d7604c02" + }, + "source": [ + "# Conclusion\n", + "\n", + "Thank you for reading about this meaningless model. What I actually wanted to demonstrate is the importance of features rather than model architecture. \n", + "\n", + "* When people say \"We are using Machine Learning\" or \"We applied a Deep Neural Network\", you are better off asking about the features and data than about the model architecture. Of course, model architecture is important, but features and data are important as well. Please remember: **GARBAGE IN, GARBAGE OUT**.\n", + "* When people talk about having 80% or even 90% accuracy, check whether that result comes from an **experiment or actual use**. Lots of models are overfit at the experiment stage, even though the data scientists believe they have already prevented it.\n", + "* For measurement, it is better to understand **other metrics** and not just accuracy. For example, we also have precision and recall in classification, and BLEU in machine translation. 
\n", + "* As a Data Scientist, rather than talking about using CNN, LSTM, blah blah blah, **spend more time on understanding your features and data**.\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "93f9edbaf910a41290ea0226926fa392b7450768" + }, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 8c4771c08891aa9d45191f3c1b81d836379edaf2 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sat, 9 Nov 2024 18:14:05 +0530 Subject: [PATCH 2/4] Delete Prediction Models/World_cup_Prtediction directory --- ...diction-with-80-accuracy-in-dl-model.ipynb | 1021 ----------------- 1 file changed, 1021 deletions(-) delete mode 100644 Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb diff --git a/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb b/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb deleted file mode 100644 index 57e497a2..00000000 --- a/Prediction Models/World_cup_Prtediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb +++ /dev/null @@ -1,1021 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "_uuid": "6e01264e76113262d78c91b0f0a173a494deb75d" - }, - "source": [ - "# How can use player name to predict World Cup with 80% accuracy?\n", - "\n", - "![](https://upload.wikimedia.org/wikipedia/en/thumb/6/67/2018_FIFA_World_Cup.svg/227px-2018_FIFA_World_Cup.svg.png)\n", - "Photo Credit: 
https://en.wikipedia.org/wiki/2018_FIFA_World_Cup\n", - "\n", - "In this kernel, I am going to demonstrate how can I use player name with Tensorflow to predict historical world cup result with 80% accuracy. Later on, I tried to predict World Cup 2018 result and seems like there is some interest findings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", - "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "# This Python 3 environment comes with many helpful analytics libraries installed\n", - "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", - "# For example, here's several helpful packages to load in \n", - "\n", - "import numpy as np # linear algebra\n", - "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", - "\n", - "# Input data files are available in the \"../input/\" directory.\n", - "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", - "\n", - "import os\n", - "print(os.listdir(\"../input\"))\n", - "\n", - "# Any results you write to the current directory are saved as output." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "529b611e7f925b7bdb38803636dba9d84317fc39" - }, - "source": [ - "# Stage\n", - "1. Data Ingestion: First of all, I got 3 dataframe from given data set and building World Cup 2018 data manully from website\n", - "2. Preprocessing: Preprocessing the needed feature for building characeter embedding input\n", - "3. Transformation: Transform the processed data for model input\n", - "4. Model Building: Build Character Embedding model\n", - "5. 
Evaluation: Try to predict World Cup 2018 result\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "eff2e82e7d082597c2051f4e4c5b275ad364dd15" - }, - "source": [ - "# Data Ingestion" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "07b8e2651463ab0c72fea1f23fc330a57e8714fe", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "summary_df = pd.read_csv('../input/WorldCups.csv')\n", - "summary_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "a7684d9576da866f230d8fa59f558f2e3b467a83", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "match_df = pd.read_csv('../input/WorldCupMatches.csv')\n", - "match_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", - "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "player_df = pd.read_csv('../input/WorldCupPlayers.csv')\n", - "player_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "fec1faa7b94e977905f97817416a5b51614bcdf5" - }, - "source": [ - "### Get 2018 player list" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "b3db5111ea1697ff56b14ff0d84aac21feacd26a", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "# Players Source :http://www.goal.com/en/news/revealed-every-world-cup-2018-squad-23-man-preliminary-lists/oa0atsduflsv1nsf6oqk576rb\n", - "# Coach Source: https://www.fifa.com/worldcup/players/coaches/\n", - "\n", - "worldcup_2018_data = []\n", - "worldcup_2018_data.extend([{\n", - " 'country': 'russia',\n", - " 'group': 'A',\n", - " 'coach': 'Cherchesov Stanislav',\n", - " 'name': 'Igor 
Akinfeev, Vladimir Gabulov, Andrey Lunev; Sergei Ignashevich, Mario Fernandes, Vladimir Granat, Fyodor Kudryashov, Andrei Semyonov, Igor Smolnikov, Ilya Kutepov, Aleksandr Yerokhin, Yuri Zhirkov, Daler Kuzyaev, Aleksandr Golovin, Alan Dzagoev, Roman Zobnin, Aleksandr Samedov, Yuri Gazinsky, Anton Miranchuk, Denis Cheryshev, Artyom Dzyuba, Aleksei Miranchuk, Fyodor Smolov'\n", - "}, {\n", - " 'country': 'saudi_arabia',\n", - " 'group': 'A',\n", - " 'coach': 'Pizzi Juan Antonio',\n", - " 'name': 'Mohammed Al-Owais, Yasser Al-Musailem, Abdullah Al-Mayuf; Mansoor Al-Harbi, Yasser Al-Shahrani, Mohammed Al-Burayk, Motaz Hawsawi, Osama Hawsawi, Ali Al-Bulaihi, Omar Othman; Abdullah Alkhaibari, Abdulmalek Alkhaibri, Abdullah Otayf, Taiseer Al-Jassam, Hussain Al-Moqahwi, Salman Al-Faraj, Mohamed Kanno, Hatan Bahbir, Salem Al-Dawsari, Yahia Al-Shehri, Fahad Al-Muwallad, Mohammad Al-Sahlawi, Muhannad Assiri'\n", - "}, {\n", - " 'country': 'egypt',\n", - " 'group': 'A',\n", - " 'coach': 'Cuper Hector',\n", - " 'name': 'Essam El Hadary, Mohamed El-Shennawy, Sherif Ekramy; Ahmed Fathi, Abdallah Said, Saad Samir, Ayman Ashraf, Mohamed Abdel-Shafy, Ahmed Hegazi, Ali Gabr, Ahmed Elmohamady, Omar Gaber; Tarek Hamed, Mahmoud Shikabala, Sam Morsy, Mohamed Elneny, Mahmoud Kahraba, Ramadan Sobhi, Trezeguet, Amr Warda; Marwan Mohsen, Mohamed Salah, Mahmoud Elwensh'\n", - "}, {\n", - " 'country': 'uruguay',\n", - " 'group': 'A',\n", - " 'coach': 'Tabarez Oscar',\n", - " 'name': 'Fernando Muslera, Martin Silva, Martin Campana, Diego Godin, Sebastian Coates, Jose Maria Gimenez, Maximiliano Pereira, Gaston Silva, Martin Caceres, Guillermo Varela, Nahitan Nandez, Lucas Torreira, Matias Vecino, Rodrigo Bentancur, Carlos Sanchez, Giorgian De Arrascaeta, Diego Laxalt, Cristian Rodriguez, Jonathan Urretaviscaya, Cristhian Stuani, Maximiliano Gomez, Edinson Cavani, Luis Suarez'\n", - "}, {\n", - " 'country': 'portugal',\n", - " 'group': 'B',\n", - " 'coach': 'Santos Fernando',\n", - " 'name': 
'Anthony Lopes, Beto, Rui Patricio, Bruno Alves, Cedric Soares, Jose Fonte, Mario Rui, Pepe, Raphael Guerreiro, Ricardo Pereira, Ruben Dias, Adrien Silva, Bruno Fernandes, Joao Mario, Joao Moutinho, Manuel Fernandes, William Carvalho, Andre Silva, Bernardo Silva, Cristiano Ronaldo, Gelson Martins, Goncalo Guedes, Ricardo Quaresma'\n", - "}, {\n", - " 'country': 'spain',\n", - " 'group': 'B',\n", - " 'coach': 'Hierro Fernando',\n", - " 'name': 'David de Gea, Pepe Reina, Kepa Arrizabalaga; Dani Carvajal, Alvaro Odriozola, Gerard Pique, Sergio Ramos, Nacho, Cesar Azpilicueta, Jordi Alba, Nacho Monreal; Sergio Busquets, Saul Niquez, Koke, Thiago Alcantara, Andres Iniesta, David Silva; Isco, Marcio Asensio, Lucas Vazquez, Iago Aspas, Rodrigo, Diego Costa'\n", - "}, {\n", - " 'country': 'morocco',\n", - " 'group': 'B',\n", - " 'coach': 'Renard Herve',\n", - " 'name': \"Mounir El Kajoui, Yassine Bounou, Ahmad Reda Tagnaouti, Mehdi Benatia, Romain Saiss, Manuel Da Costa, Badr Benoun, Nabil Dirar, Achraf Hakimi, Hamza Mendyl; M'bark Boussoufa, Karim El Ahmadi, Youssef Ait Bennasser, Sofyan Amrabat, Younes Belhanda, Faycal Fajr, Amine Harit; Khalid Boutaib, Aziz Bouhaddouz, Ayoub El Kaabi, Nordin Amrabat, Mehdi Carcela, Hakim Ziyech\"\n", - "}, {\n", - " 'country': 'iran',\n", - " 'group': 'B',\n", - " 'coach': 'Queiroz Carlos',\n", - " 'name': 'Alireza Beiranvand, Rashid Mazaheri, Amir Abedzadeh; Ramin Rezaeian, Mohammad Reza Khanzadeh, Morteza Pouraliganji, Pejman Montazeri, Seyed Majid Hosseini, Milad Mohammadi, Roozbeh Cheshmi; Saeid Ezatolahi, Masoud Shojaei, Saman Ghoddos, Mehdi Torabi, Ashkan Dejagah, Omid Ebrahimi, Ehsan Hajsafi, Vahid Amiri; Alireza Jahanbakhsh, Karim Ansarifard, Mahdi Taremi, Sardar Azmoun, Reza Ghoochannejhad'\n", - "}, {\n", - " 'country': 'france',\n", - " 'group': 'C',\n", - " 'coach': 'Deschamps Didier',\n", - " 'name': \"Alphonse Areola, Hugo Lloris, Steve Mandanda; Lucas Hernandez, Presnel Kimpembe, Benjamin Mendy, Benjamin Pavard, Adil 
Rami, Djibril Sidibe, Samuel Umtiti, Raphael Varane; N'Golo Kante, Blaise Matuidi, Steven N'Zonzi, Paul Pogba, Corentin Tolisso, Ousmane Dembele, Nabil Fekir; Olivier Giroud, Antoine Griezmann, Thomas Lemar, Kylian Mbappe, Florian Thauvin\"\n", - "}, {\n", - " 'country': 'australia',\n", - " 'group': 'C',\n", - " 'coach': 'Van Marwur bert',\n", - " 'name': 'Brad Jones, Mat Ryan, Danny Vukovic; Aziz Behich, Milos Degenek, Matthew Jurman, James Meredith, Josh Risdon, Trent Sainsbury; Jackson Irvine, Mile Jedinak, Robbie Kruse, Massimo Luongo, Mark Milligan, Aaron Mooy, Tom Rogic; Daniel Arzani, Tim Cahill, Tomi Juric, Mathew Leckie, Andrew Nabbout, Dimitri Petratos, Jamie Maclaren'\n", - "}, {\n", - " 'country': 'peru',\n", - " 'group': 'C',\n", - " 'coach': 'Gareca Ricardo',\n", - " 'name': 'Carlos Caceda, Jose Carvallo, Pedro Gallese, Luis Advincula, Pedro Aquino, Miguel Araujo, Andre Carrillo, Wilder Cartagena, Aldo Corzo, Christian Cueva, Jefferson Farfan, Edison Flores, Paolo Hurtado, Nilson Loyola, Andy Polo, Christian Ramos, Alberto Rodriguez, Raul Ruidiaz, Anderson Santamaria, Renato Tapia, Miguel Trauco, Yoshimar Yotun, Paolo Guerrero'\n", - "}, {\n", - " 'country': 'denmark',\n", - " 'group': 'C',\n", - " 'coach': 'Hareide Age',\n", - " 'name': 'Kasper Schmeichel, Jonas Lossl, Frederik Ronow; Simon Kjaer, Andreas Christensen, Mathias Jorgensen, Jannik Vestergaard, Henrik Dalsgaard, Jens Stryger, Jonas Knudsen; William Kvist, Thomas Delaney, Lukas Lerager, Lasse Schone, Christian Eriksen, Michael Krohn-Dehli; Pione Sisto, Martin Braithwaite, Andreas Cornelius, Viktor Fischer, Yussuf Poulsen, Nicolai Jorgensen, Kasper Dolberg'\n", - "}, {\n", - " 'country': 'argentina',\n", - " 'group': 'D',\n", - " 'coach': 'Sampaoli Jorge',\n", - " 'name': 'Nahuel Guzmán, Willy Caballero, Franco Armani; Gabriel Mercado, Nicolas Otamendi, Federico Fazio, Nicolas Tagliafico, Marcos Rojo, Marcos Acuna, Cristian Ansaldi, Eduardo Salvio; Javier Mascherano, Angel Di Maria, Ever 
Banega, Lucas Biglia, Manuel Lanzini, Gio Lo Celso, Maximiliano Meza; Lionel Messi, Sergio Aguero, Gonzalo Higuain, Paulo Dybala, Cristian Pavon'\n", - "}, {\n", - " 'country': 'iceland',\n", - " 'group': 'D',\n", - " 'coach': 'Hallgrimsson Heimir',\n", - " 'name': 'Hannes Thor Halldorsson, Runar Alex Runarsson, Frederik Schram; Kari Arnason, Ari Freyr Skulason, Birkir Mar Saevarsson, Sverrir Ingi Ingason, Hordur Magnusson, Holmar Orn Eyjolfsson, Ragnar Sigurdsson; Johann Berg Gudmundsson, Birkir Bjarnason, Arnor Ingvi Traustason, Emil Hallfredsson, Gylfi Sigurdsson, Olafur Ingi Skulason, Rurik Gislason, Samuel Fridjonsson, Aron Gunnarsson; Alfred Finnbogason, Bjorn Bergmann Sigurdarson, Jon Dadi Bodvarsson, Albert Gudmundsson'\n", - "}, {\n", - " 'country': 'croatia',\n", - " 'group': 'D',\n", - " 'coach': 'Dalic Zlatko',\n", - " 'name': 'Danijel Subasic, Lovre Kalinic, Dominik Livakovic; Vedran Corluka, Domagoj Vida, Ivan Strinic, Dejan Lovren, Sime Vrsaljko, Josip Pivaric, Tin Jedvaj, Duje Caleta-Car; Luka Modric, Ivan Rakitic, Mateo Kovacic, Milan Badelj, Marcelo Brozovic, Filip Bradaric; Mario Mandzukic, Ivan Perisic, Nikola Kalinic, Andrej Kramaric, Marko Pjaca, Ante Rebic'\n", - "}, {\n", - " 'country': 'nigeria',\n", - " 'group': 'D',\n", - " 'coach': 'Rohr Gernot',\n", - " 'name': 'Ikechukwu Ezenwa, Daniel Akpeyi, Francis Uzoho; William Troost-Ekong, Leon Balogun, Kenneth Omeruo, Bryan Idowu, Chidozie Awaziem, Abdullahi Shehu, Elderson Echiejile, Tyronne Ebuehi; John Obi Mikel, Ogenyi Onazi, John Ogu, Wilfred Ndidi, Oghenekaro Etebo, Joel Obi; Odion Ighalo, Ahmed Musa, Victor Moses, Alex Iwobi, Kelechi Iheanacho, Simeon Nwankwo'\n", - "}, {\n", - " 'country': 'brazil',\n", - " 'group': 'E',\n", - " 'coach': 'Tite',\n", - " 'name': ' Alisson, Ederson, Cassio; Danilo, Fagner, Marcelo, Filipe Luis, Thiago Silva, Marquinhos, Miranda, Pedro Geromel; Casemiro, Fernandinho, Paulinho, Fred, Renato Augusto, Philippe Coutinho, Willian, Douglas Costa; Neymar, Taison, 
Gabriel Jesus, Roberto Firmino'\n", - "}, {\n", - " 'country': 'switzerland',\n", - " 'group': 'E',\n", - " 'coach': 'Petkovic Vladimir',\n", - " 'name': 'Roman Burki, Yvon Mvogo, Yann Sommer; Manuel Akanji, Johan Djourou, Nico Elvedi, Michael Lang, Stephan Lichtsteiner, Jacques-Francois Moubandje, Ricardo Rodriguez, Fabian Schaer; Valon Behrami, Blerim Dzemaili, Gelson Fernandes, Remo Freuler, Xherdan Shaqiri, Granit Xhaka, Steven Zuber, Denis Zakaria; Josip Drmic, Breel Embolo, Mario Gavranovic, Haris Seferovic'\n", - "}, {\n", - " 'country': 'costa_rica',\n", - " 'group': 'E',\n", - " 'coach': 'Ranurez Oscar',\n", - " 'name': 'Keylor Navas, Patrick Pemberton, Leonel Moreira, Cristian Gamboa, Ian Smith, Ronald Matarrita, Bryan Oviedo, Oscar Duarte, Giancarlo Gonzalez, Francisco Calvo, Kendall Waston, Johnny Acosta, David Guzman, Yeltsin Tejeda, Celso Borges, Randall Azofeifa, Rodney Wallace, Bryan Ruiz, Daniel Colindres, Christian Bolanos, Johan Venegas, Joel Campbell, Marco Urena'\n", - "}, {\n", - " 'country': 'serbia',\n", - " 'group': 'E',\n", - " 'coach': 'Krstajic Mladen',\n", - " 'name': ' Vladimir Stojkovic, Predrag Rajkovic, Marko Dmitrovic, Aleksandar Kolarov, Antonio Rukavina, Milan Rodic, Branislav Ivanovic, Uros Spajic, Milos Veljkovic, Dusko Tosic, Nikola Milenkovic; Nemanja Matic, Luka Milivojevic, Marko Grujic, Dusan Tadic, Andrija Zivkovic, Filip Kostic, Nemanja Radonjic, Sergej Milinkovic-Savic, Adem Ljajic; Aleksandar Mitrovic, Aleksandar Prijovic, Luka Jovic'\n", - "}, {\n", - " 'country': 'germany',\n", - " 'group': 'F',\n", - " 'coach': 'Low Joachim',\n", - " 'name': 'Manuel Neuer, Marc-Andre ter Stegen, Kevin Trapp; Jerome Boateng, Matthias Ginter, Jonas Hector, Mats Hummels, Joshua Kimmich, Marvin Plattenhardt, Antonio Rudiger, Niklas Sule; Julian Brandt, Julian Draxler, Mario Gomez, Leon Goretzka, Ilkay Gundogan, Sami Khedira, Toni Kroos, Thomas Muller, Mesut Ozil, Marco Reus, Sebastian Rudy, Timo Werner'\n", - "}, {\n", - " 'country': 
'mexico',\n", - " 'group': 'F',\n", - " 'coach': 'Osorio Juan Carlos',\n", - " 'name': 'Jesus Corona, Alfredo Talavera, Guillermo Ochoa; Hugo Ayala, Carlos Salcedo, Diego Reyes, Miguel Layun, Hector Moreno, Edson Alvarez; Rafael Marquez, Jonathan dos Santos, Marco Fabian, Giovani dos Santos, Hector Herrera, Andres Guardado; Raul Jimenez, Carlos Vela, Javier Hernandez, Jesus Corona, Oribe Peralta, Javier Aquino, Hirving Lozano'\n", - "}, {\n", - " 'country': 'sweden',\n", - " 'group': 'F',\n", - " 'coach': 'Andersson Janne',\n", - " 'name': 'Robin Olsen, Karl-Johan Johnsson, Kristoffer Nordfeldt, Mikael Lustig, Victor Lindelof, Andreas Granqvist, Martin Olsson, Ludwig Augustinsson, Filip Helander, Emil Krafth, Pontus Jansson, Sebastian Larsson, Albin Ekdal, Emil Forsberg, Gustav Svensson, Oscar Hiljemark, Viktor Claesson, Marcus Rohden, Jimmy Durmaz, Marcus Berg, John Guidetti, Ola Toivonen, Isaac Kiese Thelin'\n", - "}, {\n", - " 'country': 'south_korea',\n", - " 'group': 'F',\n", - " 'coach': 'Shin Taeyong',\n", - " 'name': 'Kim Seunggyu, Kim Jinhyeon, Cho Hyeonwoo, Kim Younggwon, Jang Hyunsoo, Jeong Seunghyeon, Yun Yeongseon, Oh Bansuk, Kim Minwoo, Park Jooho, Hong Chul, Go Yohan, Lee Yong, Ki Sungyueng, Jeong Wooyoung, Ju Sejong, Koo Jacheol, Lee Jaesung, Lee Seungwoo, Moon Sunmin, Kim Shinwook, Son Heungmin, Hwang Heechan'\n", - "}, {\n", - " 'country': 'belgium',\n", - " 'group': 'G',\n", - " 'coach': 'Martinez Roberto',\n", - " 'name': 'Koen Casteels, Thibaut Courtois, Simon Mignolet; Toby Alderweireld, Dedryck Boyata, Vincent Kompany, Thomas Meunier, Thomas Vermaelen, Jan Vertonghen; Nacer Chadli, Kevin De Bruyne, Mousa Dembele, Leander Dendoncker, Marouane Fellaini, Youri Tielemans, Axel Witsel; Michy Batshuayi, Yannick Carrasco, Eden Hazard, Thorgan Hazard, Adnan Januzaj, Romelu Lukaku, Dries Mertens'\n", - "}, {\n", - " 'country': 'panama',\n", - " 'group': 'G',\n", - " 'coach': 'Gomez Hernan',\n", - " 'name': 'Jose Calderon, Jaime Penedo, Alex Rodríguez; 
Felipe Baloy, Harold Cummings, Eric Davis, Fidel Escobar, Adolfo Machado, Michael Murillo, Luis Ovalle, Roman Torres; Edgar Barcenas, Armando Cooper, Anibal Godoy, Gabriel Gomez, Valentin Pimentel, Alberto Quintero, Jose Luis Rodriguez; Abdiel Arroyo, Ismael Diaz, Blas Perez, Luis Tejada, Gabriel Torres'\n", - "}, {\n", - " 'country': 'tunisia',\n", - " 'group': 'G',\n", - " 'coach': 'Maaloul Nabil',\n", - " 'name': 'Farouk Ben Mustapha, Moez Hassen, Aymen Mathlouthi, Rami Bedoui, Yohan Benalouane, Syam Ben Youssef, Dylan Bronn, Oussama Haddadi, Ali Maaloul, Yassine Meriah, Hamdi Nagguez, Anice Badri, Mohamed Amine Ben Amor, Ghaylene Chaalali, Ahmed Khalil, Saifeddine Khaoui, Ferjani Sassi, Ellyes Skhiri, Naim Sliti, Bassem Srarfi, Fakhreddine Ben Youssef, Saber Khalifa, Wahbi Khazri'\n", - "}, {\n", - " 'country': 'england',\n", - " 'group': 'G',\n", - " 'coach': 'Southgate Gareth',\n", - " 'name': 'Jack Butland, Nick Pope, Jordan Pickford; Fabian Delph, Danny Rose, Eric Dier, Kyle Walker, Kieran Trippier, Trent Alexander-Arnold, Harry Maguire, John Stones, Phil Jones, Gary Cahill; Jordan Henderson, Jesse Lingard, Ruben Loftus-Cheek, Ashley Young, Dele Alli, Raheem Sterling; Harry Kane, Jamie Vardy, Marcus Rashford, Danny Welbeck'\n", - "}, {\n", - " 'country': 'poland',\n", - " 'group': 'H',\n", - " 'coach': 'Nawalka Adam',\n", - " 'name': 'Bartosz Bialkowski, Lukasz Fabianski, Wojciech Szczesny; Jan Bednarek, Bartosz Bereszynski, Thiago Cionek, Kamil Glik, Artur Jedrzejczyk, Michal Pazdan, Lukasz Piszczek; Jakub Blaszczykowski, Jacek Goralski, Kamil Goricki, Grzegorz Krychowiak, Slawomir Peszko, Maciej Rybus, Piotr Zielinski, Rafal Kurzawa, Karol Linetty; Dawid Kownacki, Robert Lewandowski, Arkadiusz Milik, Lukasz Teodorczyk'\n", - "}, {\n", - " 'country': 'senegal',\n", - " 'group': 'H',\n", - " 'coach': 'Cisse Aliou',\n", - " 'name': 'Abdoulaye Diallo, Khadim Ndiaye, Alfred Gomis, Lamine Gassama, Moussa Wague, Saliou Ciss, Youssouf Sabaly, Kalidou Koulibaly, 
Salif Sane, Cheikhou Kouyate, Kara Mbodji, Idrisa Gana Gueye, Cheikh Ndoye, Alfred Ndiaye, Pape Alioune Ndiaye, Moussa Sow, Moussa Konate, Diafra Sakho, Sadio Mane, Ismaila Sarr, Mame Biram Diouf, Mbaye Niang, Diao Keita Balde'\n", - "}, {\n", - " 'country': 'colombia',\n", - " 'group': 'H',\n", - " 'coach': 'Pekerman Jose',\n", - " 'name': 'David Ospina, Camilo Vargas, Jose Fernando Cuadrado; Cristian Zapata, Davinson Sanchez, Santiago Arias, Oscar Murillo, Frank Fabra, Johan Mojica, Yerry Mina; Wilmar Barrios, Carlos Sanchez, Jefferson Lerma, Jose Izquierdo, James Rodriguez, Abel Aguilar, Juan Fernando Quintero, Mateus Uribe, Juan Guillermo Cuadrado; Radamel Falcao Garcia, Miguel Borja, Carlos Bacca, Luis Fernando Muriel'\n", - "}, {\n", - " 'country': 'japan',\n", - " 'group': 'H',\n", - " 'coach': 'Nishino Akira',\n", - " 'name': 'Eiji Kawashima, Masaaki Higashiguchi, Kosuke Nakamura, Yuto Nagatomo, Tomoaki Makino, Maya Yoshida, Hiroki Sakai, Gotoku Sakai, Gen Shoji, Wataru Endo, Naomichi Ueda, Makoto Hasebe, Keisuke Honda, Takashi Inui, Shinji Kagawa, Hotaru Yamaguchi, Genki Haraguchi, Takashi Usami, Gaku Shibasaki, Ryota Oshima, Shinji Okazaki, Yuya Osako, Yoshinori Muto'\n", - "}])\n", - "\n", - "worldcup_2018_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "58343c8a068358511995efd9a0c08a5b5d27999d" - }, - "source": [ - "# Preprocessing" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "4230dcf4378460efc5d86070b7ce885896a5f028" - }, - "source": [ - "### clean 2018 data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "0b5f50a88ab1e4ddf9e94cf199451c40a8cc3645", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "worldcup_2018_df = pd.DataFrame(worldcup_2018_data)\n", - "\n", - "def clean_merge(row):\n", - " name = row['name'].replace(';', ',').strip()\n", - " names = name.split(',')\n", - " names.append(row['coach'])\n", - " 
names = sorted(names)\n", - " return ' '.join(names)\n", - "\n", - "worldcup_2018_df['participants'] = worldcup_2018_df.apply(clean_merge, axis=1)\n", - "\n", - "worldcup_2018_df['label'] = 0\n", - "worldcup_2018_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "74c5d8e5d2b4caf11697cc59baaca508d50878f0" - }, - "source": [ - " ### Get map of country code and name" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "e169c6a453f7102be4ac2565ed30982cab3eaeeb", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "def country_name_code_mapping(df):\n", - " code2name = {}\n", - " name2code = {}\n", - " name2pos = {}\n", - " pos2name = {}\n", - "\n", - " working_df = df[['Home Team Name', 'Home Team Initials']].drop_duplicates()\n", - "\n", - " for i, row in working_df.iterrows():\n", - " code2name[row['Home Team Initials']] = row['Home Team Name']\n", - " name2code[row['Home Team Name']] = row['Home Team Initials']\n", - " \n", - " for i, name in enumerate(name2code):\n", - " name2pos[name] = i\n", - " pos2name[i] = name\n", - " \n", - " print('Name to Code Sample')\n", - " for x in name2code:\n", - " print(x, name2code[x])\n", - " break\n", - "\n", - " print('Code to Name Sample')\n", - " for x in code2name:\n", - " print(x, code2name[x])\n", - " break\n", - " \n", - " print('Name to Position Sample')\n", - " for x in name2pos:\n", - " print(x, name2pos[x])\n", - " break\n", - " \n", - " print('Position to Name Sample')\n", - " for x in pos2name:\n", - " print(x, pos2name[x])\n", - " break\n", - " \n", - " return code2name, name2code, name2pos, pos2name\n", - "\n", - "code2name, name2code, name2pos, pos2name = country_name_code_mapping(match_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "e1db34fc15dbbac46608ac6e6d12acd6b24ade18" - }, - "source": [ - "### Get all participants per match**" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "_uuid": "5b1e11b23589b57f1e9cc86caebbfda8600b325b", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "def join_participants(match_id, team):\n", - " working_df = player_df[\n", - " (player_df['MatchID'] == match_id)\n", - " & (player_df['Team Initials'] == team)\n", - " ]\n", - " coachs = working_df['Coach Name'].unique().tolist()\n", - " players = working_df['Player Name'].unique().tolist()\n", - " \n", - " return coachs + players\n", - "\n", - "match_df['home_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Home Team Initials']), axis=1)\n", - "match_df['away_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Away Team Initials']), axis=1)\n", - "match_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "b2f8930084a2d3cb1e9fd9e929fcf2219ad27b14" - }, - "source": [ - "# Transformation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "b5676322aace7d598dd7248e265f34f58bce7900", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "# code2name, name2code, name2pos, pos2name\n", - "\n", - "# results = []\n", - "\n", - "def _build_record(year, team_name):\n", - " list_of_home_participants = match_df[\n", - " (match_df['Year'] == year)\n", - " & (match_df['Home Team Name'] == team_name)\n", - " ]['home_participants'].tolist()\n", - " \n", - " list_of_away_participants = match_df[\n", - " (match_df['Year'] == year)\n", - " & (match_df['Away Team Name'] == team_name)\n", - " ]['away_participants'].tolist()\n", - " \n", - " participants = []\n", - " for ps in list_of_home_participants + list_of_away_participants:\n", - " participants.extend(ps)\n", - " participants = sorted(list(set(participants)))\n", - " \n", - " return ' '.join(participants)\n", - "\n", - "def get_non_first_fouth_team(year, 
positive_team_names):\n", - " home_names = match_df[match_df['Year'] == year]['Home Team Name'].unique().tolist()\n", - " away_names = match_df[match_df['Year'] == year]['Away Team Name'].unique().tolist()\n", - " non_winner_names = list(set(home_names + away_names))\n", - " for name in positive_team_names:\n", - " non_winner_names.remove(name)\n", - " \n", - " return non_winner_names\n", - "\n", - "def build_negative(year, positive_team_names):\n", - " non_winner_names = get_non_first_fouth_team(year, positive_team_names)\n", - " \n", - " results = []\n", - " for name in non_winner_names:\n", - " results.append({\n", - " 'label': 0,\n", - " 'name': _build_record(year, name)\n", - " })\n", - " \n", - " return results\n", - "\n", - "trainin_data = []\n", - "for i, row in summary_df.iterrows():\n", - " # positve sample\n", - " trainin_data.append({\n", - " 'label': 1,\n", - " 'name': _build_record(row['Year'], row['Winner'])\n", - " })\n", - " trainin_data.append({\n", - " 'label': 2,\n", - " 'name': _build_record(row['Year'], row['Runners-Up'])\n", - " })\n", - " trainin_data.append({\n", - " 'label': 3,\n", - " 'name': _build_record(row['Year'], row['Third'])\n", - " })\n", - " trainin_data.append({\n", - " 'label': 4,\n", - " 'name': _build_record(row['Year'], row['Fourth'])\n", - " })\n", - " \n", - " winner_names = [row['Winner'], row['Runners-Up'], row['Third'], row['Fourth']]\n", - " \n", - " # negative sample\n", - " results = build_negative(row['Year'], winner_names)\n", - " trainin_data.extend(results)\n", - " \n", - "training_df = pd.DataFrame(trainin_data)\n", - "\n", - "print('Number of Training Record: %d' % len(training_df))\n", - "training_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "48de565289b935ac42d7ce3826a715d330d4318b" - }, - "source": [ - "# Model Building" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "d79c8e4c13ea24622e4a2699aac1279ee857deb3" - }, - "source": [ - "You may think that 
Character Embedding seems like meaninless but there is some reason behind of using Character as a feature. If you want to understand more about Character Embedding. You may check out this article for reference\n", - "https://medium.com/@makcedward/besides-word-embedding-why-you-need-to-know-character-embedding-6096a34a3b10\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "a702beda72b94c33f1cd10dff73139178834d1bd", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "from nltk.tokenize import sent_tokenize\n", - "\n", - "class CharCNN:\n", - " CHAR_DICT = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .!?:,\\'%-\\(\\)/$|&;[]\"'\n", - " \n", - " def __init__(self, max_len_of_sentence, max_num_of_setnence, verbose=10):\n", - " self.max_len_of_sentence = max_len_of_sentence\n", - " self.max_num_of_setnence = max_num_of_setnence\n", - " self.verbose = verbose\n", - " \n", - " self.num_of_char = 0\n", - " self.num_of_label = 0\n", - " self.unknown_label = ''\n", - " \n", - " def build_char_dictionary(self, char_dict=None, unknown_label='UNK'):\n", - " \"\"\"\n", - " Define possbile char set. 
Using \"UNK\" if character does not exist in this set\n", - " \"\"\" \n", - " \n", - " if char_dict is None:\n", - " char_dict = self.CHAR_DICT\n", - " \n", - " self.unknown_label = unknown_label\n", - "\n", - " chars = []\n", - "\n", - " for c in char_dict:\n", - " chars.append(c)\n", - "\n", - " chars = list(set(chars))\n", - " \n", - " chars.insert(0, unknown_label)\n", - "\n", - " self.num_of_char = len(chars)\n", - " self.char_indices = dict((c, i) for i, c in enumerate(chars))\n", - " self.indices_char = dict((i, c) for i, c in enumerate(chars))\n", - " \n", - " if self.verbose > 5:\n", - " print('Totoal number of chars:', self.num_of_char)\n", - "\n", - " print('First 3 char_indices sample:', {k: self.char_indices[k] for k in list(self.char_indices)[:3]})\n", - " print('First 3 indices_char sample:', {k: self.indices_char[k] for k in list(self.indices_char)[:3]})\n", - " \n", - "\n", - " return self.char_indices, self.indices_char, self.num_of_char\n", - " \n", - " def convert_labels(self, labels):\n", - " \"\"\"\n", - " Convert label to numeric\n", - " \"\"\"\n", - " self.label2indexes = dict((l, i) for i, l in enumerate(labels))\n", - " self.index2labels = dict((i, l) for i, l in enumerate(labels))\n", - "\n", - " if self.verbose > 5:\n", - " print('Label to Index: ', self.label2indexes)\n", - " print('Index to Label: ', self.index2labels)\n", - " \n", - " self.num_of_label = len(self.label2indexes)\n", - "\n", - " return self.label2indexes, self.index2labels\n", - " \n", - " def _transform_raw_data(self, df, x_col, y_col, label2indexes=None, sample_size=None):\n", - " \"\"\"\n", - " ##### Transform raw data to list\n", - " \"\"\"\n", - " \n", - " x = []\n", - " y = []\n", - "\n", - " actual_max_sentence = 0\n", - " \n", - " if sample_size is None:\n", - " sample_size = len(df)\n", - "\n", - " for i, row in df.head(sample_size).iterrows():\n", - " x_data = row[x_col]\n", - " y_data = row[y_col]\n", - "\n", - " sentences = sent_tokenize(x_data)\n", - " 
x.append(sentences)\n", - "\n", - " if len(sentences) > actual_max_sentence:\n", - " actual_max_sentence = len(sentences)\n", - "\n", - " y.append(label2indexes[y_data])\n", - "\n", - " if self.verbose > 5:\n", - " print('Number of news: %d' % (len(x)))\n", - " print('Actual max sentence: %d' % actual_max_sentence)\n", - "\n", - " return x, y\n", - " \n", - " def _transform_training_data(self, x_raw, y_raw, max_len_of_sentence=None, max_num_of_setnence=None):\n", - " \"\"\"\n", - " ##### Transform preorcessed data to numpy\n", - " \"\"\"\n", - " unknown_value = self.char_indices[self.unknown_label]\n", - " \n", - " x = np.ones((len(x_raw), max_num_of_setnence, max_len_of_sentence), dtype=np.int64) * unknown_value\n", - " y = np.array(y_raw)\n", - " \n", - " if max_len_of_sentence is None:\n", - " max_len_of_sentence = self.max_len_of_sentence\n", - " if max_num_of_setnence is None:\n", - " max_num_of_setnence = self.max_num_of_setnence\n", - "\n", - " for i, doc in enumerate(x_raw):\n", - " for j, sentence in enumerate(doc):\n", - " if j < max_num_of_setnence:\n", - " for t, char in enumerate(sentence[-max_len_of_sentence:]):\n", - " if char not in self.char_indices:\n", - " x[i, j, (max_len_of_sentence-1-t)] = self.char_indices['UNK']\n", - " else:\n", - " x[i, j, (max_len_of_sentence-1-t)] = self.char_indices[char]\n", - "\n", - " return x, y\n", - "\n", - " def _build_character_block(self, block, dropout=0.3, filters=[64, 100], kernel_size=[3, 3], \n", - " pool_size=[2, 2], padding='valid', activation='relu', \n", - " kernel_initializer='glorot_normal'):\n", - " \n", - " for i in range(len(filters)):\n", - " block = Conv1D(\n", - " filters=filters[i], kernel_size=kernel_size[i],\n", - " padding=padding, activation=activation, kernel_initializer=kernel_initializer)(block)\n", - "\n", - " block = Dropout(dropout)(block)\n", - " block = MaxPooling1D(pool_size=pool_size[i])(block)\n", - "\n", - " block = GlobalMaxPool1D()(block)\n", - " block = Dense(128, 
activation='relu')(block)\n", - " return block\n", - " \n", - " def _build_sentence_block(self, max_len_of_sentence, max_num_of_setnence, \n", - " char_dimension=16,\n", - " filters=[[3, 5, 7], [200, 300, 300], [300, 400, 400]], \n", - "# filters=[[100, 200, 200], [200, 300, 300], [300, 400, 400]], \n", - " kernel_sizes=[[4, 3, 3], [5, 3, 3], [6, 3, 3]], \n", - " pool_sizes=[[2, 2, 2], [2, 2, 2], [2, 2, 2]],\n", - " dropout=0.4):\n", - " \n", - " sent_input = Input(shape=(max_len_of_sentence, ), dtype='int64')\n", - " embedded = Embedding(self.num_of_char, char_dimension, input_length=max_len_of_sentence)(sent_input)\n", - " \n", - " blocks = []\n", - " for i, filter_layers in enumerate(filters):\n", - " blocks.append(\n", - " self._build_character_block(\n", - " block=embedded, filters=filters[i], kernel_size=kernel_sizes[i], pool_size=pool_sizes[i])\n", - " )\n", - "\n", - " sent_output = concatenate(blocks, axis=-1)\n", - " sent_output = Dropout(dropout)(sent_output)\n", - " sent_encoder = Model(inputs=sent_input, outputs=sent_output)\n", - "\n", - " return sent_encoder\n", - " \n", - " def _build_document_block(self, sent_encoder, max_len_of_sentence, max_num_of_setnence, \n", - " num_of_label, dropout=0.3, \n", - " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", - " doc_input = Input(shape=(max_num_of_setnence, max_len_of_sentence), dtype='int64')\n", - " doc_output = TimeDistributed(sent_encoder)(doc_input)\n", - "\n", - " doc_output = Bidirectional(LSTM(128, return_sequences=False, dropout=dropout))(doc_output)\n", - "\n", - " doc_output = Dropout(dropout)(doc_output)\n", - " doc_output = Dense(128, activation='relu')(doc_output)\n", - " doc_output = Dropout(dropout)(doc_output)\n", - " doc_output = Dense(num_of_label, activation='sigmoid')(doc_output)\n", - "\n", - " doc_encoder = Model(inputs=doc_input, outputs=doc_output)\n", - " doc_encoder.compile(loss=loss, optimizer=optimizer, metrics=metrics)\n", - " return 
doc_encoder\n", - " \n", - " def preporcess(self, labels, char_dict=None, unknown_label='UNK'):\n", - " if self.verbose > 3:\n", - " print('-----> Stage: preprocess')\n", - " \n", - " self.build_char_dictionary(char_dict, unknown_label)\n", - " self.convert_labels(labels)\n", - " \n", - " def process(self, df, x_col, y_col, \n", - " max_len_of_sentence=None, max_num_of_setnence=None, label2indexes=None, sample_size=None):\n", - " if self.verbose > 3:\n", - " print('-----> Stage: process')\n", - " \n", - " if sample_size is None:\n", - " sample_size = 1000\n", - " if label2indexes is None:\n", - " if self.label2indexes is None:\n", - " raise Exception('Does not initalize label2indexes. Please invoke preprocess step first')\n", - " label2indexes = self.label2indexes\n", - " if max_len_of_sentence is None:\n", - " max_len_of_sentence = self.max_len_of_sentence\n", - " if max_num_of_setnence is None:\n", - " max_num_of_setnence = self.max_num_of_setnence\n", - "\n", - " x_preprocess, y_preprocess = self._transform_raw_data(\n", - " df=df, x_col=x_col, y_col=y_col, label2indexes=label2indexes)\n", - " \n", - " x_preprocess, y_preprocess = self._transform_training_data(\n", - " x_raw=x_preprocess, y_raw=y_preprocess,\n", - " max_len_of_sentence=max_len_of_sentence, max_num_of_setnence=max_num_of_setnence)\n", - " \n", - " if self.verbose > 5:\n", - " print('Shape: ', x_preprocess.shape, y_preprocess.shape)\n", - "\n", - " return x_preprocess, y_preprocess\n", - " \n", - " def build_model(self, char_dimension=16, display_summary=False, display_architecture=False, \n", - " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", - " if self.verbose > 3:\n", - " print('-----> Stage: build model')\n", - " \n", - " sent_encoder = self._build_sentence_block(\n", - " char_dimension=char_dimension,\n", - " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence)\n", - " \n", - " doc_encoder = 
self._build_document_block(\n", - " sent_encoder=sent_encoder, num_of_label=self.num_of_label,\n", - " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence, \n", - " loss=loss, optimizer=optimizer, metrics=metrics)\n", - " \n", - " if display_architecture:\n", - " print('Sentence Architecture')\n", - " IPython.display.display(SVG(model_to_dot(sent_encoder).create(prog='dot', format='svg')))\n", - " print()\n", - " print('Document Architecture')\n", - " IPython.display.display(SVG(model_to_dot(doc_encoder).create(prog='dot', format='svg')))\n", - " \n", - " if display_summary:\n", - " print(doc_encoder.summary())\n", - " \n", - " \n", - " self.model = {\n", - " 'sent_encoder': sent_encoder,\n", - " 'doc_encoder': doc_encoder\n", - " }\n", - " \n", - " return doc_encoder\n", - " \n", - " def train(self, x_train, y_train, x_test, y_test, batch_size=128, epochs=1, shuffle=True):\n", - " if self.verbose > 3:\n", - " print('-----> Stage: train model')\n", - " \n", - " self.get_model().fit(\n", - " x_train, y_train, validation_data=(x_test, y_test), \n", - " batch_size=batch_size, epochs=epochs, shuffle=shuffle)\n", - " \n", - "# return self.model['doc_encoder']\n", - "\n", - " def predict(self, x, model=None, return_prob=False):\n", - " if self.verbose > 3:\n", - " print('-----> Stage: predict')\n", - " \n", - " if model is None:\n", - " model = self.get_model()\n", - " \n", - " if return_prob:\n", - " return model.predict(x_test)\n", - " \n", - " return model.predict(x_test).argmax(axis=-1)\n", - " \n", - " def get_model(self):\n", - " return self.model['doc_encoder']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "97425d0a69723820392eedfb1d4a2624d440b0ec", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "train_df, test_df = train_test_split(training_df, test_size=0.2)" - ] - 
}, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "7fc8d75dc5f14237a0eba2b8a34b6b4347e11964", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "char_cnn = CharCNN(max_len_of_sentence=256, max_num_of_setnence=1)\n", - "char_cnn.preporcess(labels=training_df['label'].unique())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "931f26a83e15ae8f617e25836502e212c8b5b321", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "x_train, y_train = char_cnn.process(\n", - " df=train_df, x_col='name', y_col='label')\n", - "x_test, y_test = char_cnn.process(\n", - " df=test_df, x_col='name', y_col='label')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "60a3811d3c8ab657a898110948aaa67948d581d4", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "import keras\n", - "from keras.models import Model, load_model\n", - "from keras.layers import Dense, Input, Dropout, MaxPooling1D, Conv1D, GlobalMaxPool1D, Bidirectional\n", - "from keras.layers import LSTM, Lambda, Bidirectional, concatenate, BatchNormalization, Embedding\n", - "from keras.layers import TimeDistributed\n", - "from keras.optimizers import Adam\n", - "import tensorflow as tf\n", - "import keras.backend as K\n", - "\n", - "char_cnn.build_model()\n", - "char_cnn.train(x_train, y_train, x_test, y_test, batch_size=32, epochs=10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "1ddf3f1b5982d8852903b841ec0df804b4bf68a1" - }, - "source": [ - "# Evaluations" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "aa78af278ad827f9a8b3e6f0745592c456434802", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - }, - "scrolled": true - }, - "outputs": [], - "source": [ - "# Passing dummpy label 
and getting dummpy y_real just because try to reuse defined function to convert input\n", - "x_real, y_real = char_cnn.process(\n", - " df=worldcup_2018_df, x_col='participants', y_col='label')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "41e16db1bdc3f29eb2fffa817884d3c55896a460" - }, - "source": [ - "Encoded result. 4 means it is 0 actually. 0 means lost." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_uuid": "18fcf0fd02415f69e1aaf25b8b15f3e8f676f18d", - "collapsed": true, - "jupyter": { - "outputs_hidden": true - } - }, - "outputs": [], - "source": [ - "char_cnn.predict(x_real)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "91e195c8ed3b20722fc39a0f2bcd43d2d7604c02" - }, - "source": [ - "# Conclusion\n", - "\n", - "Thank you for reading this meaningless model. Actually, I would like to demonstrate how is the importance of feature but not model architecture. \n", - "\n", - "* When people talk about \"We are using Machine Learning\", \"Applied Deep Neural Network\", you may better ask about feature and data instead of asking about model architecture. Of course, model architecture is important but feature and data are also important as well. Please remember that **GARBAGE IN, GARBAGE OUT**.\n", - "* When people talk about having 80% or even 90% accuracy. You may better check whether it is a result in **experiment or actual**. Lots of model is overfit in experiment stage although data scientist believe that they already prevent it very well.\n", - "* For the measurement, better understand **other metric** but not just accuracy. For example, we also have precision and recall in classification. We have BELU in machine translation. 
\n", - "* As a Data Scientist, rather talking about using CNN, LSTM bla bla bla, **spend more time on understanding your feature and data**.\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "_uuid": "93f9edbaf910a41290ea0226926fa392b7450768" - }, - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.7" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From 9d455ccad973f5fe008fb2469285d19658dc0d54 Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sat, 9 Nov 2024 18:14:33 +0530 Subject: [PATCH 3/4] fixes 848 --- ...diction-with-80-accuracy-in-dl-model.ipynb | 1021 +++++++++++++++++ 1 file changed, 1021 insertions(+) create mode 100644 Prediction Models/World_cup_Prediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb diff --git a/Prediction Models/World_cup_Prediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb b/Prediction Models/World_cup_Prediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb new file mode 100644 index 00000000..57e497a2 --- /dev/null +++ b/Prediction Models/World_cup_Prediction/world-cup-prediction-with-80-accuracy-in-dl-model.ipynb @@ -0,0 +1,1021 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "_uuid": "6e01264e76113262d78c91b0f0a173a494deb75d" + }, + "source": [ + "# How can use player name to predict World Cup with 80% accuracy?\n", + "\n", + "![](https://upload.wikimedia.org/wikipedia/en/thumb/6/67/2018_FIFA_World_Cup.svg/227px-2018_FIFA_World_Cup.svg.png)\n", + "Photo Credit: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup\n", + "\n", + "In this 
kernel, I am going to demonstrate how I can use player names with TensorFlow to predict historical World Cup results with 80% accuracy. Later on, I try to predict the World Cup 2018 result, and there seem to be some interesting findings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", + "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# This Python 3 environment comes with many helpful analytics libraries installed\n", + "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", + "# For example, here's several helpful packages to load in \n", + "\n", + "import numpy as np # linear algebra\n", + "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", + "\n", + "# Input data files are available in the \"../input/\" directory.\n", + "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", + "\n", + "import os\n", + "print(os.listdir(\"../input\"))\n", + "\n", + "# Any results you write to the current directory are saved as output." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "529b611e7f925b7bdb38803636dba9d84317fc39" + }, + "source": [ + "# Stage\n", + "1. Data Ingestion: Load 3 dataframes from the given data set and build the World Cup 2018 data manually from websites\n", + "2. Preprocessing: Prepare the features needed to build the character embedding input\n", + "3. Transformation: Transform the processed data for model input\n", + "4. Model Building: Build Character Embedding model\n", + "5. 
Evaluation: Try to predict World Cup 2018 result\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "eff2e82e7d082597c2051f4e4c5b275ad364dd15" + }, + "source": [ + "# Data Ingestion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "07b8e2651463ab0c72fea1f23fc330a57e8714fe", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "summary_df = pd.read_csv('../input/WorldCups.csv')\n", + "summary_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "a7684d9576da866f230d8fa59f558f2e3b467a83", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "match_df = pd.read_csv('../input/WorldCupMatches.csv')\n", + "match_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", + "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "player_df = pd.read_csv('../input/WorldCupPlayers.csv')\n", + "player_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "fec1faa7b94e977905f97817416a5b51614bcdf5" + }, + "source": [ + "### Get 2018 player list" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "b3db5111ea1697ff56b14ff0d84aac21feacd26a", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# Players Source :http://www.goal.com/en/news/revealed-every-world-cup-2018-squad-23-man-preliminary-lists/oa0atsduflsv1nsf6oqk576rb\n", + "# Coach Source: https://www.fifa.com/worldcup/players/coaches/\n", + "\n", + "worldcup_2018_data = []\n", + "worldcup_2018_data.extend([{\n", + " 'country': 'russia',\n", + " 'group': 'A',\n", + " 'coach': 'Cherchesov Stanislav',\n", + " 'name': 'Igor 
Akinfeev, Vladimir Gabulov, Andrey Lunev; Sergei Ignashevich, Mario Fernandes, Vladimir Granat, Fyodor Kudryashov, Andrei Semyonov, Igor Smolnikov, Ilya Kutepov, Aleksandr Yerokhin, Yuri Zhirkov, Daler Kuzyaev, Aleksandr Golovin, Alan Dzagoev, Roman Zobnin, Aleksandr Samedov, Yuri Gazinsky, Anton Miranchuk, Denis Cheryshev, Artyom Dzyuba, Aleksei Miranchuk, Fyodor Smolov'\n", + "}, {\n", + " 'country': 'saudi_arabia',\n", + " 'group': 'A',\n", + " 'coach': 'Pizzi Juan Antonio',\n", + " 'name': 'Mohammed Al-Owais, Yasser Al-Musailem, Abdullah Al-Mayuf; Mansoor Al-Harbi, Yasser Al-Shahrani, Mohammed Al-Burayk, Motaz Hawsawi, Osama Hawsawi, Ali Al-Bulaihi, Omar Othman; Abdullah Alkhaibari, Abdulmalek Alkhaibri, Abdullah Otayf, Taiseer Al-Jassam, Hussain Al-Moqahwi, Salman Al-Faraj, Mohamed Kanno, Hatan Bahbir, Salem Al-Dawsari, Yahia Al-Shehri, Fahad Al-Muwallad, Mohammad Al-Sahlawi, Muhannad Assiri'\n", + "}, {\n", + " 'country': 'egypt',\n", + " 'group': 'A',\n", + " 'coach': 'Cuper Hector',\n", + " 'name': 'Essam El Hadary, Mohamed El-Shennawy, Sherif Ekramy; Ahmed Fathi, Abdallah Said, Saad Samir, Ayman Ashraf, Mohamed Abdel-Shafy, Ahmed Hegazi, Ali Gabr, Ahmed Elmohamady, Omar Gaber; Tarek Hamed, Mahmoud Shikabala, Sam Morsy, Mohamed Elneny, Mahmoud Kahraba, Ramadan Sobhi, Trezeguet, Amr Warda; Marwan Mohsen, Mohamed Salah, Mahmoud Elwensh'\n", + "}, {\n", + " 'country': 'uruguay',\n", + " 'group': 'A',\n", + " 'coach': 'Tabarez Oscar',\n", + " 'name': 'Fernando Muslera, Martin Silva, Martin Campana, Diego Godin, Sebastian Coates, Jose Maria Gimenez, Maximiliano Pereira, Gaston Silva, Martin Caceres, Guillermo Varela, Nahitan Nandez, Lucas Torreira, Matias Vecino, Rodrigo Bentancur, Carlos Sanchez, Giorgian De Arrascaeta, Diego Laxalt, Cristian Rodriguez, Jonathan Urretaviscaya, Cristhian Stuani, Maximiliano Gomez, Edinson Cavani, Luis Suarez'\n", + "}, {\n", + " 'country': 'portugal',\n", + " 'group': 'B',\n", + " 'coach': 'Santos Fernando',\n", + " 'name': 
'Anthony Lopes, Beto, Rui Patricio, Bruno Alves, Cedric Soares, Jose Fonte, Mario Rui, Pepe, Raphael Guerreiro, Ricardo Pereira, Ruben Dias, Adrien Silva, Bruno Fernandes, Joao Mario, Joao Moutinho, Manuel Fernandes, William Carvalho, Andre Silva, Bernardo Silva, Cristiano Ronaldo, Gelson Martins, Goncalo Guedes, Ricardo Quaresma'\n", + "}, {\n", + " 'country': 'spain',\n", + " 'group': 'B',\n", + " 'coach': 'Hierro Fernando',\n", + " 'name': 'David de Gea, Pepe Reina, Kepa Arrizabalaga; Dani Carvajal, Alvaro Odriozola, Gerard Pique, Sergio Ramos, Nacho, Cesar Azpilicueta, Jordi Alba, Nacho Monreal; Sergio Busquets, Saul Niguez, Koke, Thiago Alcantara, Andres Iniesta, David Silva; Isco, Marco Asensio, Lucas Vazquez, Iago Aspas, Rodrigo, Diego Costa'\n", + "}, {\n", + " 'country': 'morocco',\n", + " 'group': 'B',\n", + " 'coach': 'Renard Herve',\n", + " 'name': \"Mounir El Kajoui, Yassine Bounou, Ahmad Reda Tagnaouti, Mehdi Benatia, Romain Saiss, Manuel Da Costa, Badr Benoun, Nabil Dirar, Achraf Hakimi, Hamza Mendyl; M'bark Boussoufa, Karim El Ahmadi, Youssef Ait Bennasser, Sofyan Amrabat, Younes Belhanda, Faycal Fajr, Amine Harit; Khalid Boutaib, Aziz Bouhaddouz, Ayoub El Kaabi, Nordin Amrabat, Mehdi Carcela, Hakim Ziyech\"\n", + "}, {\n", + " 'country': 'iran',\n", + " 'group': 'B',\n", + " 'coach': 'Queiroz Carlos',\n", + " 'name': 'Alireza Beiranvand, Rashid Mazaheri, Amir Abedzadeh; Ramin Rezaeian, Mohammad Reza Khanzadeh, Morteza Pouraliganji, Pejman Montazeri, Seyed Majid Hosseini, Milad Mohammadi, Roozbeh Cheshmi; Saeid Ezatolahi, Masoud Shojaei, Saman Ghoddos, Mehdi Torabi, Ashkan Dejagah, Omid Ebrahimi, Ehsan Hajsafi, Vahid Amiri; Alireza Jahanbakhsh, Karim Ansarifard, Mahdi Taremi, Sardar Azmoun, Reza Ghoochannejhad'\n", + "}, {\n", + " 'country': 'france',\n", + " 'group': 'C',\n", + " 'coach': 'Deschamps Didier',\n", + " 'name': \"Alphonse Areola, Hugo Lloris, Steve Mandanda; Lucas Hernandez, Presnel Kimpembe, Benjamin Mendy, Benjamin Pavard, Adil 
Rami, Djibril Sidibe, Samuel Umtiti, Raphael Varane; N'Golo Kante, Blaise Matuidi, Steven N'Zonzi, Paul Pogba, Corentin Tolisso, Ousmane Dembele, Nabil Fekir; Olivier Giroud, Antoine Griezmann, Thomas Lemar, Kylian Mbappe, Florian Thauvin\"\n", + "}, {\n", + " 'country': 'australia',\n", + " 'group': 'C',\n", + " 'coach': 'Van Marwijk Bert',\n", + " 'name': 'Brad Jones, Mat Ryan, Danny Vukovic; Aziz Behich, Milos Degenek, Matthew Jurman, James Meredith, Josh Risdon, Trent Sainsbury; Jackson Irvine, Mile Jedinak, Robbie Kruse, Massimo Luongo, Mark Milligan, Aaron Mooy, Tom Rogic; Daniel Arzani, Tim Cahill, Tomi Juric, Mathew Leckie, Andrew Nabbout, Dimitri Petratos, Jamie Maclaren'\n", + "}, {\n", + " 'country': 'peru',\n", + " 'group': 'C',\n", + " 'coach': 'Gareca Ricardo',\n", + " 'name': 'Carlos Caceda, Jose Carvallo, Pedro Gallese, Luis Advincula, Pedro Aquino, Miguel Araujo, Andre Carrillo, Wilder Cartagena, Aldo Corzo, Christian Cueva, Jefferson Farfan, Edison Flores, Paolo Hurtado, Nilson Loyola, Andy Polo, Christian Ramos, Alberto Rodriguez, Raul Ruidiaz, Anderson Santamaria, Renato Tapia, Miguel Trauco, Yoshimar Yotun, Paolo Guerrero'\n", + "}, {\n", + " 'country': 'denmark',\n", + " 'group': 'C',\n", + " 'coach': 'Hareide Age',\n", + " 'name': 'Kasper Schmeichel, Jonas Lossl, Frederik Ronow; Simon Kjaer, Andreas Christensen, Mathias Jorgensen, Jannik Vestergaard, Henrik Dalsgaard, Jens Stryger, Jonas Knudsen; William Kvist, Thomas Delaney, Lukas Lerager, Lasse Schone, Christian Eriksen, Michael Krohn-Dehli; Pione Sisto, Martin Braithwaite, Andreas Cornelius, Viktor Fischer, Yussuf Poulsen, Nicolai Jorgensen, Kasper Dolberg'\n", + "}, {\n", + " 'country': 'argentina',\n", + " 'group': 'D',\n", + " 'coach': 'Sampaoli Jorge',\n", + " 'name': 'Nahuel Guzmán, Willy Caballero, Franco Armani; Gabriel Mercado, Nicolas Otamendi, Federico Fazio, Nicolas Tagliafico, Marcos Rojo, Marcos Acuna, Cristian Ansaldi, Eduardo Salvio; Javier Mascherano, Angel Di Maria, Ever 
Banega, Lucas Biglia, Manuel Lanzini, Gio Lo Celso, Maximiliano Meza; Lionel Messi, Sergio Aguero, Gonzalo Higuain, Paulo Dybala, Cristian Pavon'\n", + "}, {\n", + " 'country': 'iceland',\n", + " 'group': 'D',\n", + " 'coach': 'Hallgrimsson Heimir',\n", + " 'name': 'Hannes Thor Halldorsson, Runar Alex Runarsson, Frederik Schram; Kari Arnason, Ari Freyr Skulason, Birkir Mar Saevarsson, Sverrir Ingi Ingason, Hordur Magnusson, Holmar Orn Eyjolfsson, Ragnar Sigurdsson; Johann Berg Gudmundsson, Birkir Bjarnason, Arnor Ingvi Traustason, Emil Hallfredsson, Gylfi Sigurdsson, Olafur Ingi Skulason, Rurik Gislason, Samuel Fridjonsson, Aron Gunnarsson; Alfred Finnbogason, Bjorn Bergmann Sigurdarson, Jon Dadi Bodvarsson, Albert Gudmundsson'\n", + "}, {\n", + " 'country': 'croatia',\n", + " 'group': 'D',\n", + " 'coach': 'Dalic Zlatko',\n", + " 'name': 'Danijel Subasic, Lovre Kalinic, Dominik Livakovic; Vedran Corluka, Domagoj Vida, Ivan Strinic, Dejan Lovren, Sime Vrsaljko, Josip Pivaric, Tin Jedvaj, Duje Caleta-Car; Luka Modric, Ivan Rakitic, Mateo Kovacic, Milan Badelj, Marcelo Brozovic, Filip Bradaric; Mario Mandzukic, Ivan Perisic, Nikola Kalinic, Andrej Kramaric, Marko Pjaca, Ante Rebic'\n", + "}, {\n", + " 'country': 'nigeria',\n", + " 'group': 'D',\n", + " 'coach': 'Rohr Gernot',\n", + " 'name': 'Ikechukwu Ezenwa, Daniel Akpeyi, Francis Uzoho; William Troost-Ekong, Leon Balogun, Kenneth Omeruo, Bryan Idowu, Chidozie Awaziem, Abdullahi Shehu, Elderson Echiejile, Tyronne Ebuehi; John Obi Mikel, Ogenyi Onazi, John Ogu, Wilfred Ndidi, Oghenekaro Etebo, Joel Obi; Odion Ighalo, Ahmed Musa, Victor Moses, Alex Iwobi, Kelechi Iheanacho, Simeon Nwankwo'\n", + "}, {\n", + " 'country': 'brazil',\n", + " 'group': 'E',\n", + " 'coach': 'Tite',\n", + " 'name': ' Alisson, Ederson, Cassio; Danilo, Fagner, Marcelo, Filipe Luis, Thiago Silva, Marquinhos, Miranda, Pedro Geromel; Casemiro, Fernandinho, Paulinho, Fred, Renato Augusto, Philippe Coutinho, Willian, Douglas Costa; Neymar, Taison, 
Gabriel Jesus, Roberto Firmino'\n", + "}, {\n", + " 'country': 'switzerland',\n", + " 'group': 'E',\n", + " 'coach': 'Petkovic Vladimir',\n", + " 'name': 'Roman Burki, Yvon Mvogo, Yann Sommer; Manuel Akanji, Johan Djourou, Nico Elvedi, Michael Lang, Stephan Lichtsteiner, Jacques-Francois Moubandje, Ricardo Rodriguez, Fabian Schaer; Valon Behrami, Blerim Dzemaili, Gelson Fernandes, Remo Freuler, Xherdan Shaqiri, Granit Xhaka, Steven Zuber, Denis Zakaria; Josip Drmic, Breel Embolo, Mario Gavranovic, Haris Seferovic'\n", + "}, {\n", + " 'country': 'costa_rica',\n", + " 'group': 'E',\n", + " 'coach': 'Ranurez Oscar',\n", + " 'name': 'Keylor Navas, Patrick Pemberton, Leonel Moreira, Cristian Gamboa, Ian Smith, Ronald Matarrita, Bryan Oviedo, Oscar Duarte, Giancarlo Gonzalez, Francisco Calvo, Kendall Waston, Johnny Acosta, David Guzman, Yeltsin Tejeda, Celso Borges, Randall Azofeifa, Rodney Wallace, Bryan Ruiz, Daniel Colindres, Christian Bolanos, Johan Venegas, Joel Campbell, Marco Urena'\n", + "}, {\n", + " 'country': 'serbia',\n", + " 'group': 'E',\n", + " 'coach': 'Krstajic Mladen',\n", + " 'name': ' Vladimir Stojkovic, Predrag Rajkovic, Marko Dmitrovic, Aleksandar Kolarov, Antonio Rukavina, Milan Rodic, Branislav Ivanovic, Uros Spajic, Milos Veljkovic, Dusko Tosic, Nikola Milenkovic; Nemanja Matic, Luka Milivojevic, Marko Grujic, Dusan Tadic, Andrija Zivkovic, Filip Kostic, Nemanja Radonjic, Sergej Milinkovic-Savic, Adem Ljajic; Aleksandar Mitrovic, Aleksandar Prijovic, Luka Jovic'\n", + "}, {\n", + " 'country': 'germany',\n", + " 'group': 'F',\n", + " 'coach': 'Low Joachim',\n", + " 'name': 'Manuel Neuer, Marc-Andre ter Stegen, Kevin Trapp; Jerome Boateng, Matthias Ginter, Jonas Hector, Mats Hummels, Joshua Kimmich, Marvin Plattenhardt, Antonio Rudiger, Niklas Sule; Julian Brandt, Julian Draxler, Mario Gomez, Leon Goretzka, Ilkay Gundogan, Sami Khedira, Toni Kroos, Thomas Muller, Mesut Ozil, Marco Reus, Sebastian Rudy, Timo Werner'\n", + "}, {\n", + " 'country': 
'mexico',\n", + " 'group': 'F',\n", + " 'coach': 'Osorio Juan Carlos',\n", + " 'name': 'Jesus Corona, Alfredo Talavera, Guillermo Ochoa; Hugo Ayala, Carlos Salcedo, Diego Reyes, Miguel Layun, Hector Moreno, Edson Alvarez; Rafael Marquez, Jonathan dos Santos, Marco Fabian, Giovani dos Santos, Hector Herrera, Andres Guardado; Raul Jimenez, Carlos Vela, Javier Hernandez, Jesus Corona, Oribe Peralta, Javier Aquino, Hirving Lozano'\n", + "}, {\n", + " 'country': 'sweden',\n", + " 'group': 'F',\n", + " 'coach': 'Andersson Janne',\n", + " 'name': 'Robin Olsen, Karl-Johan Johnsson, Kristoffer Nordfeldt, Mikael Lustig, Victor Lindelof, Andreas Granqvist, Martin Olsson, Ludwig Augustinsson, Filip Helander, Emil Krafth, Pontus Jansson, Sebastian Larsson, Albin Ekdal, Emil Forsberg, Gustav Svensson, Oscar Hiljemark, Viktor Claesson, Marcus Rohden, Jimmy Durmaz, Marcus Berg, John Guidetti, Ola Toivonen, Isaac Kiese Thelin'\n", + "}, {\n", + " 'country': 'south_korea',\n", + " 'group': 'F',\n", + " 'coach': 'Shin Taeyong',\n", + " 'name': 'Kim Seunggyu, Kim Jinhyeon, Cho Hyeonwoo, Kim Younggwon, Jang Hyunsoo, Jeong Seunghyeon, Yun Yeongseon, Oh Bansuk, Kim Minwoo, Park Jooho, Hong Chul, Go Yohan, Lee Yong, Ki Sungyueng, Jeong Wooyoung, Ju Sejong, Koo Jacheol, Lee Jaesung, Lee Seungwoo, Moon Sunmin, Kim Shinwook, Son Heungmin, Hwang Heechan'\n", + "}, {\n", + " 'country': 'belgium',\n", + " 'group': 'G',\n", + " 'coach': 'Martinez Roberto',\n", + " 'name': 'Koen Casteels, Thibaut Courtois, Simon Mignolet; Toby Alderweireld, Dedryck Boyata, Vincent Kompany, Thomas Meunier, Thomas Vermaelen, Jan Vertonghen; Nacer Chadli, Kevin De Bruyne, Mousa Dembele, Leander Dendoncker, Marouane Fellaini, Youri Tielemans, Axel Witsel; Michy Batshuayi, Yannick Carrasco, Eden Hazard, Thorgan Hazard, Adnan Januzaj, Romelu Lukaku, Dries Mertens'\n", + "}, {\n", + " 'country': 'panama',\n", + " 'group': 'G',\n", + " 'coach': 'Gomez Hernan',\n", + " 'name': 'Jose Calderon, Jaime Penedo, Alex Rodríguez; 
Felipe Baloy, Harold Cummings, Eric Davis, Fidel Escobar, Adolfo Machado, Michael Murillo, Luis Ovalle, Roman Torres; Edgar Barcenas, Armando Cooper, Anibal Godoy, Gabriel Gomez, Valentin Pimentel, Alberto Quintero, Jose Luis Rodriguez; Abdiel Arroyo, Ismael Diaz, Blas Perez, Luis Tejada, Gabriel Torres'\n", + "}, {\n", + " 'country': 'tunisia',\n", + " 'group': 'G',\n", + " 'coach': 'Maaloul Nabil',\n", + " 'name': 'Farouk Ben Mustapha, Moez Hassen, Aymen Mathlouthi, Rami Bedoui, Yohan Benalouane, Syam Ben Youssef, Dylan Bronn, Oussama Haddadi, Ali Maaloul, Yassine Meriah, Hamdi Nagguez, Anice Badri, Mohamed Amine Ben Amor, Ghaylene Chaalali, Ahmed Khalil, Saifeddine Khaoui, Ferjani Sassi, Ellyes Skhiri, Naim Sliti, Bassem Srarfi, Fakhreddine Ben Youssef, Saber Khalifa, Wahbi Khazri'\n", + "}, {\n", + " 'country': 'england',\n", + " 'group': 'G',\n", + " 'coach': 'Southgate Gareth',\n", + " 'name': 'Jack Butland, Nick Pope, Jordan Pickford; Fabian Delph, Danny Rose, Eric Dier, Kyle Walker, Kieran Trippier, Trent Alexander-Arnold, Harry Maguire, John Stones, Phil Jones, Gary Cahill; Jordan Henderson, Jesse Lingard, Ruben Loftus-Cheek, Ashley Young, Dele Alli, Raheem Sterling; Harry Kane, Jamie Vardy, Marcus Rashford, Danny Welbeck'\n", + "}, {\n", + " 'country': 'poland',\n", + " 'group': 'H',\n", + " 'coach': 'Nawalka Adam',\n", + " 'name': 'Bartosz Bialkowski, Lukasz Fabianski, Wojciech Szczesny; Jan Bednarek, Bartosz Bereszynski, Thiago Cionek, Kamil Glik, Artur Jedrzejczyk, Michal Pazdan, Lukasz Piszczek; Jakub Blaszczykowski, Jacek Goralski, Kamil Goricki, Grzegorz Krychowiak, Slawomir Peszko, Maciej Rybus, Piotr Zielinski, Rafal Kurzawa, Karol Linetty; Dawid Kownacki, Robert Lewandowski, Arkadiusz Milik, Lukasz Teodorczyk'\n", + "}, {\n", + " 'country': 'senegal',\n", + " 'group': 'H',\n", + " 'coach': 'Cisse Aliou',\n", + " 'name': 'Abdoulaye Diallo, Khadim Ndiaye, Alfred Gomis, Lamine Gassama, Moussa Wague, Saliou Ciss, Youssouf Sabaly, Kalidou Koulibaly, 
Salif Sane, Cheikhou Kouyate, Kara Mbodji, Idrisa Gana Gueye, Cheikh Ndoye, Alfred Ndiaye, Pape Alioune Ndiaye, Moussa Sow, Moussa Konate, Diafra Sakho, Sadio Mane, Ismaila Sarr, Mame Biram Diouf, Mbaye Niang, Diao Keita Balde'\n", + "}, {\n", + " 'country': 'colombia',\n", + " 'group': 'H',\n", + " 'coach': 'Pekerman Jose',\n", + " 'name': 'David Ospina, Camilo Vargas, Jose Fernando Cuadrado; Cristian Zapata, Davinson Sanchez, Santiago Arias, Oscar Murillo, Frank Fabra, Johan Mojica, Yerry Mina; Wilmar Barrios, Carlos Sanchez, Jefferson Lerma, Jose Izquierdo, James Rodriguez, Abel Aguilar, Juan Fernando Quintero, Mateus Uribe, Juan Guillermo Cuadrado; Radamel Falcao Garcia, Miguel Borja, Carlos Bacca, Luis Fernando Muriel'\n", + "}, {\n", + " 'country': 'japan',\n", + " 'group': 'H',\n", + " 'coach': 'Nishino Akira',\n", + " 'name': 'Eiji Kawashima, Masaaki Higashiguchi, Kosuke Nakamura, Yuto Nagatomo, Tomoaki Makino, Maya Yoshida, Hiroki Sakai, Gotoku Sakai, Gen Shoji, Wataru Endo, Naomichi Ueda, Makoto Hasebe, Keisuke Honda, Takashi Inui, Shinji Kagawa, Hotaru Yamaguchi, Genki Haraguchi, Takashi Usami, Gaku Shibasaki, Ryota Oshima, Shinji Okazaki, Yuya Osako, Yoshinori Muto'\n", + "}])\n", + "\n", + "worldcup_2018_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "58343c8a068358511995efd9a0c08a5b5d27999d" + }, + "source": [ + "# Preprocessing" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "4230dcf4378460efc5d86070b7ce885896a5f028" + }, + "source": [ + "### clean 2018 data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "0b5f50a88ab1e4ddf9e94cf199451c40a8cc3645", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "worldcup_2018_df = pd.DataFrame(worldcup_2018_data)\n", + "\n", + "def clean_merge(row):\n", + " name = row['name'].replace(';', ',').strip()\n", + " names = name.split(',')\n", + " names.append(row['coach'])\n", + " 
names = sorted(names)\n", + " return ' '.join(names)\n", + "\n", + "worldcup_2018_df['participants'] = worldcup_2018_df.apply(clean_merge, axis=1)\n", + "\n", + "worldcup_2018_df['label'] = 0\n", + "worldcup_2018_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "74c5d8e5d2b4caf11697cc59baaca508d50878f0" + }, + "source": [ + "### Get map of country code and name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "e169c6a453f7102be4ac2565ed30982cab3eaeeb", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "def country_name_code_mapping(df):\n", + " code2name = {}\n", + " name2code = {}\n", + " name2pos = {}\n", + " pos2name = {}\n", + "\n", + " working_df = df[['Home Team Name', 'Home Team Initials']].drop_duplicates()\n", + "\n", + " for i, row in working_df.iterrows():\n", + " code2name[row['Home Team Initials']] = row['Home Team Name']\n", + " name2code[row['Home Team Name']] = row['Home Team Initials']\n", + " \n", + " for i, name in enumerate(name2code):\n", + " name2pos[name] = i\n", + " pos2name[i] = name\n", + " \n", + " print('Name to Code Sample')\n", + " for x in name2code:\n", + " print(x, name2code[x])\n", + " break\n", + "\n", + " print('Code to Name Sample')\n", + " for x in code2name:\n", + " print(x, code2name[x])\n", + " break\n", + " \n", + " print('Name to Position Sample')\n", + " for x in name2pos:\n", + " print(x, name2pos[x])\n", + " break\n", + " \n", + " print('Position to Name Sample')\n", + " for x in pos2name:\n", + " print(x, pos2name[x])\n", + " break\n", + " \n", + " return code2name, name2code, name2pos, pos2name\n", + "\n", + "code2name, name2code, name2pos, pos2name = country_name_code_mapping(match_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "e1db34fc15dbbac46608ac6e6d12acd6b24ade18" + }, + "source": [ + "### Get all participants per match" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "_uuid": "5b1e11b23589b57f1e9cc86caebbfda8600b325b", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "def join_participants(match_id, team):\n", + " working_df = player_df[\n", + " (player_df['MatchID'] == match_id)\n", + " & (player_df['Team Initials'] == team)\n", + " ]\n", + " coachs = working_df['Coach Name'].unique().tolist()\n", + " players = working_df['Player Name'].unique().tolist()\n", + " \n", + " return coachs + players\n", + "\n", + "match_df['home_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Home Team Initials']), axis=1)\n", + "match_df['away_participants'] = match_df.apply(lambda x: join_participants(x['MatchID'], x['Away Team Initials']), axis=1)\n", + "match_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "b2f8930084a2d3cb1e9fd9e929fcf2219ad27b14" + }, + "source": [ + "# Transformation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "b5676322aace7d598dd7248e265f34f58bce7900", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "# code2name, name2code, name2pos, pos2name\n", + "\n", + "# results = []\n", + "\n", + "def _build_record(year, team_name):\n", + " list_of_home_participants = match_df[\n", + " (match_df['Year'] == year)\n", + " & (match_df['Home Team Name'] == team_name)\n", + " ]['home_participants'].tolist()\n", + " \n", + " list_of_away_participants = match_df[\n", + " (match_df['Year'] == year)\n", + " & (match_df['Away Team Name'] == team_name)\n", + " ]['away_participants'].tolist()\n", + " \n", + " participants = []\n", + " for ps in list_of_home_participants + list_of_away_participants:\n", + " participants.extend(ps)\n", + " participants = sorted(list(set(participants)))\n", + " \n", + " return ' '.join(participants)\n", + "\n", + "def get_non_first_fouth_team(year, 
positive_team_names):\n", + " home_names = match_df[match_df['Year'] == year]['Home Team Name'].unique().tolist()\n", + " away_names = match_df[match_df['Year'] == year]['Away Team Name'].unique().tolist()\n", + " non_winner_names = list(set(home_names + away_names))\n", + " for name in positive_team_names:\n", + " non_winner_names.remove(name)\n", + " \n", + " return non_winner_names\n", + "\n", + "def build_negative(year, positive_team_names):\n", + " non_winner_names = get_non_first_fouth_team(year, positive_team_names)\n", + " \n", + " results = []\n", + " for name in non_winner_names:\n", + " results.append({\n", + " 'label': 0,\n", + " 'name': _build_record(year, name)\n", + " })\n", + " \n", + " return results\n", + "\n", + "training_data = []\n", + "for i, row in summary_df.iterrows():\n", + " # positive samples: the top four finishers, labelled 1 to 4\n", + " training_data.append({\n", + " 'label': 1,\n", + " 'name': _build_record(row['Year'], row['Winner'])\n", + " })\n", + " training_data.append({\n", + " 'label': 2,\n", + " 'name': _build_record(row['Year'], row['Runners-Up'])\n", + " })\n", + " training_data.append({\n", + " 'label': 3,\n", + " 'name': _build_record(row['Year'], row['Third'])\n", + " })\n", + " training_data.append({\n", + " 'label': 4,\n", + " 'name': _build_record(row['Year'], row['Fourth'])\n", + " })\n", + " \n", + " winner_names = [row['Winner'], row['Runners-Up'], row['Third'], row['Fourth']]\n", + " \n", + " # negative samples: every other team of that year, labelled 0\n", + " results = build_negative(row['Year'], winner_names)\n", + " training_data.extend(results)\n", + " \n", + "training_df = pd.DataFrame(training_data)\n", + "\n", + "print('Number of training records: %d' % len(training_df))\n", + "training_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "48de565289b935ac42d7ce3826a715d330d4318b" + }, + "source": [ + "# Model Building" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "d79c8e4c13ea24622e4a2699aac1279ee857deb3" + }, + "source": [ + "You may think that 
character embedding seems meaningless, but there is a reason for using characters as features. To learn more about character embedding, see this article:\n", + "https://medium.com/@makcedward/besides-word-embedding-why-you-need-to-know-character-embedding-6096a34a3b10\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "a702beda72b94c33f1cd10dff73139178834d1bd", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "from nltk.tokenize import sent_tokenize\n", + "\n", + "class CharCNN:\n", + " CHAR_DICT = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .!?:,\\'%-\\(\\)/$|&;[]\"'\n", + " \n", + " def __init__(self, max_len_of_sentence, max_num_of_setnence, verbose=10):\n", + " self.max_len_of_sentence = max_len_of_sentence\n", + " self.max_num_of_setnence = max_num_of_setnence\n", + " self.verbose = verbose\n", + " \n", + " self.num_of_char = 0\n", + " self.num_of_label = 0\n", + " self.unknown_label = ''\n", + " \n", + " def build_char_dictionary(self, char_dict=None, unknown_label='UNK'):\n", + " \"\"\"\n", + " Define the possible character set. 
Use \"UNK\" for any character that is not in this set\n", + " \"\"\" \n", + " \n", + " if char_dict is None:\n", + " char_dict = self.CHAR_DICT\n", + " \n", + " self.unknown_label = unknown_label\n", + "\n", + " chars = []\n", + "\n", + " for c in char_dict:\n", + " chars.append(c)\n", + "\n", + " chars = sorted(set(chars))\n", + " \n", + " chars.insert(0, unknown_label)\n", + "\n", + " self.num_of_char = len(chars)\n", + " self.char_indices = dict((c, i) for i, c in enumerate(chars))\n", + " self.indices_char = dict((i, c) for i, c in enumerate(chars))\n", + " \n", + " if self.verbose > 5:\n", + " print('Total number of chars:', self.num_of_char)\n", + "\n", + " print('First 3 char_indices sample:', {k: self.char_indices[k] for k in list(self.char_indices)[:3]})\n", + " print('First 3 indices_char sample:', {k: self.indices_char[k] for k in list(self.indices_char)[:3]})\n", + " \n", + "\n", + " return self.char_indices, self.indices_char, self.num_of_char\n", + " \n", + " def convert_labels(self, labels):\n", + " \"\"\"\n", + " Convert labels to numeric indexes\n", + " \"\"\"\n", + " self.label2indexes = dict((l, i) for i, l in enumerate(labels))\n", + " self.index2labels = dict((i, l) for i, l in enumerate(labels))\n", + "\n", + " if self.verbose > 5:\n", + " print('Label to Index: ', self.label2indexes)\n", + " print('Index to Label: ', self.index2labels)\n", + " \n", + " self.num_of_label = len(self.label2indexes)\n", + "\n", + " return self.label2indexes, self.index2labels\n", + " \n", + " def _transform_raw_data(self, df, x_col, y_col, label2indexes=None, sample_size=None):\n", + " \"\"\"\n", + " Transform raw data to lists\n", + " \"\"\"\n", + " \n", + " x = []\n", + " y = []\n", + "\n", + " actual_max_sentence = 0\n", + " \n", + " if sample_size is None:\n", + " sample_size = len(df)\n", + "\n", + " for i, row in df.head(sample_size).iterrows():\n", + " x_data = row[x_col]\n", + " y_data = row[y_col]\n", + "\n", + " sentences = sent_tokenize(x_data)\n", + " 
x.append(sentences)\n", + "\n", + " if len(sentences) > actual_max_sentence:\n", + " actual_max_sentence = len(sentences)\n", + "\n", + " y.append(label2indexes[y_data])\n", + "\n", + " if self.verbose > 5:\n", + " print('Number of records: %d' % (len(x)))\n", + " print('Actual max sentence: %d' % actual_max_sentence)\n", + "\n", + " return x, y\n", + " \n", + " def _transform_training_data(self, x_raw, y_raw, max_len_of_sentence=None, max_num_of_setnence=None):\n", + " \"\"\"\n", + " Transform preprocessed data to numpy arrays\n", + " \"\"\"\n", + " unknown_value = self.char_indices[self.unknown_label]\n", + " \n", + " if max_len_of_sentence is None:\n", + " max_len_of_sentence = self.max_len_of_sentence\n", + " if max_num_of_setnence is None:\n", + " max_num_of_setnence = self.max_num_of_setnence\n", + " \n", + " x = np.ones((len(x_raw), max_num_of_setnence, max_len_of_sentence), dtype=np.int64) * unknown_value\n", + " y = np.array(y_raw)\n", + "\n", + " for i, doc in enumerate(x_raw):\n", + " for j, sentence in enumerate(doc):\n", + " if j < max_num_of_setnence:\n", + " for t, char in enumerate(sentence[-max_len_of_sentence:]):\n", + " if char not in self.char_indices:\n", + " x[i, j, (max_len_of_sentence-1-t)] = unknown_value\n", + " else:\n", + " x[i, j, (max_len_of_sentence-1-t)] = self.char_indices[char]\n", + "\n", + " return x, y\n", + "\n", + " def _build_character_block(self, block, dropout=0.3, filters=[64, 100], kernel_size=[3, 3], \n", + " pool_size=[2, 2], padding='valid', activation='relu', \n", + " kernel_initializer='glorot_normal'):\n", + " \n", + " for i in range(len(filters)):\n", + " block = Conv1D(\n", + " filters=filters[i], kernel_size=kernel_size[i],\n", + " padding=padding, activation=activation, kernel_initializer=kernel_initializer)(block)\n", + "\n", + " block = Dropout(dropout)(block)\n", + " block = MaxPooling1D(pool_size=pool_size[i])(block)\n", + "\n", + " block = GlobalMaxPool1D()(block)\n", + " block = Dense(128, 
activation='relu')(block)\n", + " return block\n", + " \n", + " def _build_sentence_block(self, max_len_of_sentence, max_num_of_setnence, \n", + " char_dimension=16,\n", + " filters=[[3, 5, 7], [200, 300, 300], [300, 400, 400]], \n", + "# filters=[[100, 200, 200], [200, 300, 300], [300, 400, 400]], \n", + " kernel_sizes=[[4, 3, 3], [5, 3, 3], [6, 3, 3]], \n", + " pool_sizes=[[2, 2, 2], [2, 2, 2], [2, 2, 2]],\n", + " dropout=0.4):\n", + " \n", + " sent_input = Input(shape=(max_len_of_sentence, ), dtype='int64')\n", + " embedded = Embedding(self.num_of_char, char_dimension, input_length=max_len_of_sentence)(sent_input)\n", + " \n", + " blocks = []\n", + " for i, filter_layers in enumerate(filters):\n", + " blocks.append(\n", + " self._build_character_block(\n", + " block=embedded, filters=filters[i], kernel_size=kernel_sizes[i], pool_size=pool_sizes[i])\n", + " )\n", + "\n", + " sent_output = concatenate(blocks, axis=-1)\n", + " sent_output = Dropout(dropout)(sent_output)\n", + " sent_encoder = Model(inputs=sent_input, outputs=sent_output)\n", + "\n", + " return sent_encoder\n", + " \n", + " def _build_document_block(self, sent_encoder, max_len_of_sentence, max_num_of_setnence, \n", + " num_of_label, dropout=0.3, \n", + " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", + " doc_input = Input(shape=(max_num_of_setnence, max_len_of_sentence), dtype='int64')\n", + " doc_output = TimeDistributed(sent_encoder)(doc_input)\n", + "\n", + " doc_output = Bidirectional(LSTM(128, return_sequences=False, dropout=dropout))(doc_output)\n", + "\n", + " doc_output = Dropout(dropout)(doc_output)\n", + " doc_output = Dense(128, activation='relu')(doc_output)\n", + " doc_output = Dropout(dropout)(doc_output)\n", + " doc_output = Dense(num_of_label, activation='softmax')(doc_output)\n", + "\n", + " doc_encoder = Model(inputs=doc_input, outputs=doc_output)\n", + " doc_encoder.compile(loss=loss, optimizer=optimizer, metrics=metrics)\n", + " return 
doc_encoder\n", + " \n", + " def preporcess(self, labels, char_dict=None, unknown_label='UNK'):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: preprocess')\n", + " \n", + " self.build_char_dictionary(char_dict, unknown_label)\n", + " self.convert_labels(labels)\n", + " \n", + " def process(self, df, x_col, y_col, \n", + " max_len_of_sentence=None, max_num_of_setnence=None, label2indexes=None, sample_size=None):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: process')\n", + " \n", + " if sample_size is None:\n", + " sample_size = 1000\n", + " if label2indexes is None:\n", + " if getattr(self, 'label2indexes', None) is None:\n", + " raise Exception('label2indexes is not initialized. Please call preporcess() first')\n", + " label2indexes = self.label2indexes\n", + " if max_len_of_sentence is None:\n", + " max_len_of_sentence = self.max_len_of_sentence\n", + " if max_num_of_setnence is None:\n", + " max_num_of_setnence = self.max_num_of_setnence\n", + "\n", + " x_preprocess, y_preprocess = self._transform_raw_data(\n", + " df=df, x_col=x_col, y_col=y_col, label2indexes=label2indexes)\n", + " \n", + " x_preprocess, y_preprocess = self._transform_training_data(\n", + " x_raw=x_preprocess, y_raw=y_preprocess,\n", + " max_len_of_sentence=max_len_of_sentence, max_num_of_setnence=max_num_of_setnence)\n", + " \n", + " if self.verbose > 5:\n", + " print('Shape: ', x_preprocess.shape, y_preprocess.shape)\n", + "\n", + " return x_preprocess, y_preprocess\n", + " \n", + " def build_model(self, char_dimension=16, display_summary=False, display_architecture=False, \n", + " loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: build model')\n", + " \n", + " sent_encoder = self._build_sentence_block(\n", + " char_dimension=char_dimension,\n", + " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence)\n", + " \n", + " doc_encoder = 
self._build_document_block(\n", + " sent_encoder=sent_encoder, num_of_label=self.num_of_label,\n", + " max_len_of_sentence=self.max_len_of_sentence, max_num_of_setnence=self.max_num_of_setnence, \n", + " loss=loss, optimizer=optimizer, metrics=metrics)\n", + " \n", + " if display_architecture:\n", + " print('Sentence Architecture')\n", + " IPython.display.display(SVG(model_to_dot(sent_encoder).create(prog='dot', format='svg')))\n", + " print()\n", + " print('Document Architecture')\n", + " IPython.display.display(SVG(model_to_dot(doc_encoder).create(prog='dot', format='svg')))\n", + " \n", + " if display_summary:\n", + " print(doc_encoder.summary())\n", + " \n", + " \n", + " self.model = {\n", + " 'sent_encoder': sent_encoder,\n", + " 'doc_encoder': doc_encoder\n", + " }\n", + " \n", + " return doc_encoder\n", + " \n", + " def train(self, x_train, y_train, x_test, y_test, batch_size=128, epochs=1, shuffle=True):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: train model')\n", + " \n", + " self.get_model().fit(\n", + " x_train, y_train, validation_data=(x_test, y_test), \n", + " batch_size=batch_size, epochs=epochs, shuffle=shuffle)\n", + " \n", + "# return self.model['doc_encoder']\n", + "\n", + " def predict(self, x, model=None, return_prob=False):\n", + " if self.verbose > 3:\n", + " print('-----> Stage: predict')\n", + " \n", + " if model is None:\n", + " model = self.get_model()\n", + " \n", + " if return_prob:\n", + " return model.predict(x)\n", + " \n", + " return model.predict(x).argmax(axis=-1)\n", + " \n", + " def get_model(self):\n", + " return self.model['doc_encoder']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "97425d0a69723820392eedfb1d4a2624d440b0ec", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "train_df, test_df = train_test_split(training_df, test_size=0.2)" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "7fc8d75dc5f14237a0eba2b8a34b6b4347e11964", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "char_cnn = CharCNN(max_len_of_sentence=256, max_num_of_setnence=1)\n", + "char_cnn.preporcess(labels=training_df['label'].unique())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "931f26a83e15ae8f617e25836502e212c8b5b321", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "x_train, y_train = char_cnn.process(\n", + " df=train_df, x_col='name', y_col='label')\n", + "x_test, y_test = char_cnn.process(\n", + " df=test_df, x_col='name', y_col='label')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "60a3811d3c8ab657a898110948aaa67948d581d4", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "import keras\n", + "from keras.models import Model, load_model\n", + "from keras.layers import Dense, Input, Dropout, MaxPooling1D, Conv1D, GlobalMaxPool1D, Bidirectional\n", + "from keras.layers import LSTM, Lambda, concatenate, BatchNormalization, Embedding\n", + "from keras.layers import TimeDistributed\n", + "from keras.optimizers import Adam\n", + "import tensorflow as tf\n", + "import keras.backend as K\n", + "\n", + "char_cnn.build_model()\n", + "char_cnn.train(x_train, y_train, x_test, y_test, batch_size=32, epochs=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "1ddf3f1b5982d8852903b841ec0df804b4bf68a1" + }, + "source": [ + "# Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "aa78af278ad827f9a8b3e6f0745592c456434802", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + }, + "scrolled": true + }, + "outputs": [], + "source": [ + "# Passing dummy label 
and getting dummy y_real only to reuse the function defined above to convert the input\n", + "x_real, y_real = char_cnn.process(\n", + " df=worldcup_2018_df, x_col='participants', y_col='label')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "41e16db1bdc3f29eb2fffa817884d3c55896a460" + }, + "source": [ + "The predictions are encoded label indexes. An encoded 4 corresponds to the original label 0, and label 0 means the team lost." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_uuid": "18fcf0fd02415f69e1aaf25b8b15f3e8f676f18d", + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [], + "source": [ + "char_cnn.predict(x_real)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "91e195c8ed3b20722fc39a0f2bcd43d2d7604c02" + }, + "source": [ + "# Conclusion\n", + "\n", + "Thank you for reading about this meaningless model. What I actually want to demonstrate is the importance of features, not of model architecture. \n", + "\n", + "* When people say \"We are using Machine Learning\" or \"We applied a Deep Neural Network\", you had better ask about the features and data rather than the model architecture. Of course, model architecture is important, but features and data are important as well. Please remember: **GARBAGE IN, GARBAGE OUT**.\n", + "* When people talk about having 80% or even 90% accuracy, check whether that result comes from an **experiment or from actual use**. Lots of models are overfit at the experiment stage even though data scientists believe they have already prevented it.\n", + "* For measurement, understand **other metrics**, not just accuracy. For example, we also have precision and recall in classification, and BLEU in machine translation. 
\n", + "* As a Data Scientist, rather than talking about using CNN, LSTM, bla bla bla, **spend more time on understanding your features and data**.\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "_uuid": "93f9edbaf910a41290ea0226926fa392b7450768" + }, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 6f5c02590a2e27d5a0050cec908baa1a092ac6ca Mon Sep 17 00:00:00 2001 From: VARUNSHIYAM <138989960+Varunshiyam@users.noreply.github.com> Date: Sat, 9 Nov 2024 18:15:08 +0530 Subject: [PATCH 4/4] Create Readme.md --- .../World_cup_Prediction/Readme.md | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 Prediction Models/World_cup_Prediction/Readme.md diff --git a/Prediction Models/World_cup_Prediction/Readme.md b/Prediction Models/World_cup_Prediction/Readme.md new file mode 100644 index 00000000..2ed7b0fe --- /dev/null +++ b/Prediction Models/World_cup_Prediction/Readme.md @@ -0,0 +1,51 @@ +# World Cup Prediction with Deep Learning + +This project uses a Character Embedding model to predict FIFA World Cup match outcomes with 80% accuracy. The prediction model is developed using TensorFlow, with a focus on leveraging player-specific data along with team information to enhance prediction accuracy. + +## Project Overview + +In this project, we integrate individual player data with match history to construct a deep learning model that predicts match results. This model demonstrates the importance of feature selection, showing how player data and match records contribute to accurate predictions. 
+ +### Key Features + +- **Data Ingestion:** Load and preprocess historical World Cup match data and player rosters. +- **Data Transformation:** Create character embeddings for player names to use in deep learning model inputs. +- **Model Building:** Use TensorFlow to construct a Character Embedding model tailored for predictive analysis. +- **Evaluation:** Achieve 80% prediction accuracy on test data, using World Cup 2018 results as a validation set. + +## Project Structure + +- `data/`: Contains historical World Cup data and player rosters. +- `notebooks/`: Jupyter notebooks with data preprocessing, model building, and evaluation code. +- `scripts/`: Python scripts for data loading, model training, and evaluation. +- `models/`: Saved models for reproducibility. + +## Getting Started + +1. **Install dependencies**: + ```bash + pip install -r requirements.txt + ``` + +2. **Run the model**: + - Load the data in the `data/` directory. + - Execute the Jupyter notebooks in `notebooks/` to preprocess data, build, and evaluate the model. + +3. **Model Evaluation**: + - The model's performance metrics are logged, including accuracy, precision, and recall. + +## Dependencies + +- TensorFlow +- Pandas +- NumPy +- Scikit-learn + +## Results + +The model achieves approximately 80% accuracy in predicting the outcome of FIFA World Cup matches. This result highlights the effectiveness of character embeddings for player names and team rosters in sports outcome prediction. + +## Conclusion + +The project emphasizes that feature selection and data quality are as crucial as model architecture in predictive modeling. By focusing on player-specific data, this project offers insights into the predictive power of individual player contributions to team performance. +
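The data transformation step listed under Key Features (turning player names into character indexes before they reach an embedding layer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's `CharCNN` class: the names `ALPHABET`, `UNKNOWN`, `build_char_index`, and `encode_roster` are hypothetical, the alphabet is deliberately tiny, and the sequence is right-padded for simplicity.

```python
# Hypothetical helper names; none of these come from the notebook itself.
ALPHABET = "abcdefghijklmnopqrstuvwxyz -"
UNKNOWN = 0  # index reserved for padding and for characters outside ALPHABET


def build_char_index(alphabet=ALPHABET):
    """Map each character to a stable integer index, starting at 1."""
    return {ch: i for i, ch in enumerate(alphabet, start=1)}


def encode_roster(names, char_index, max_len=32):
    """Sort the names, join them with spaces, and encode to a fixed-length list."""
    text = " ".join(sorted(n.strip().lower() for n in names))
    ids = [char_index.get(ch, UNKNOWN) for ch in text[:max_len]]
    # Right-pad with UNKNOWN so every roster has the same length.
    return ids + [UNKNOWN] * (max_len - len(ids))


char_index = build_char_index()
encoded = encode_roster(["Luka Modric", "Ivan Rakitic"], char_index, max_len=16)
print(encoded)
```

Each roster becomes a fixed-length list of integers, which is the shape an embedding layer expects as input. Sorting the names first makes the encoding independent of the order in which players were listed.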