v1.0 #49

Open
wants to merge 1 commit into base: hnedenko_master
Binary file added Naive_Bayes_text_classification/Instructions.pdf
Binary file not shown.
358 changes: 358 additions & 0 deletions Naive_Bayes_text_classification/NaiveBayesTextClassification.ipynb
@@ -0,0 +1,358 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Text Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using libraries:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import math\n",
"import nltk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Global variable:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"dataBaseFile = \"traning.csv\"\n",
"dataBase = None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Support method printDataBase:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def printDataBase():\n",
" for i in dataBase:\n",
" print(i)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data preprocesing:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading .csv file:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def readCSVFileToArray(dataBase):\n",
" with open(dataBaseFile, 'r') as fp:\n",
" reader = csv.reader(fp, delimiter=',', quotechar='\"')\n",
" dataBase = [row for row in reader]\n",
" fp.close() \n",
" for i in range(0,len(dataBase)):\n",
" temp = dataBase[i][-1]\n",
" dataBase[i][-1] = dataBase[i][0]\n",
" dataBase[i][0] = temp\n",
" return dataBase"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spliting:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def splitDataBase(dataBase):\n",
" for i in range(0,len(dataBase)): \n",
" for j in range(1,len(dataBase[i])):\n",
" tempList = str(dataBase[i][1]).split()\n",
" dataBase[i].pop(1)\n",
" for k in range(0,len(tempList)):\n",
" dataBase[i].append(tempList[k])\n",
" return dataBase"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lowercase:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def lowercase(dataBase):\n",
" for i in range(0,len(dataBase)):\n",
" for j in range(1,len(dataBase[i])):\n",
" dataBase[i][j] = dataBase[i][j].lower()\n",
" return dataBase"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove STOP-WORDS:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def deleteSTOPWORDS(dataBase):\n",
" nltk.download('stopwords')\n",
" for i in range(0,len(dataBase)):\n",
" newData = []\n",
" for j in dataBase[i]:\n",
" if (j not in nltk.corpus.stopwords.words('english')):\n",
" newData.append(j)\n",
" dataBase[i] = newData\n",
" return dataBase"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Naive Bayes:"
]
},
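{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief note on what the class below computes (matching the code, which uses Laplace add-one smoothing): each class $c$ is scored in log space for a document $d$, and the highest-scoring class wins,\n",
"\n",
"$$c_{map} = \\arg\\max_{c} \\Big[ \\log P(c) + \\sum_{t \\in d} \\log P(t \\mid c) \\Big], \\qquad P(t \\mid c) = \\frac{T_{ct} + 1}{\\sum_{t'} T_{ct'} + |V|}$$\n",
"\n",
"where $T_{ct}$ is the count of term $t$ in training documents of class $c$ and $|V|$ is the vocabulary size; logarithms avoid floating-point underflow when many small probabilities are multiplied."
]
},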
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"class NaiveBayes:\n",
" dataBase = None\n",
" classesProb = []\n",
" probOccurrence = []\n",
" def __init__(self, dataBase):\n",
" self.dataBase = dataBase\n",
" \n",
" def countClassesProb(self):\n",
" for doc in self.dataBase:\n",
" if (doc[0] not in self.classesProb):\n",
" self.classesProb.append(doc[0])\n",
" for i in range(0,len(self.classesProb)):\n",
" self.classesProb[i] = 0\n",
" for doc in self.dataBase:\n",
" self.classesProb[int(doc[0])] = int(self.classesProb[int(doc[0])]) + 1 \n",
" for i in range(0,len(self.classesProb)):\n",
" self.classesProb[i] = self.classesProb[i] / len(self.dataBase) \n",
" \n",
" def countProbOccurrenceInClasses(self, sample):\n",
" self.countClassesProb()\n",
" tableVocabulary = [[]]\n",
" \n",
" #generating tableVocabulary\n",
" for i in range(0,len(self.dataBase)):\n",
" for j in range(1,len(self.dataBase[i])):\n",
" if (self.dataBase[i][j] not in tableVocabulary[0]):\n",
" tableVocabulary[0].append(self.dataBase[i][j])\n",
" tableVocabulary[0].append(\"ALL\")\n",
" \n",
" for i in range(0,len(self.classesProb)):\n",
" tableVocabulary.append([])\n",
" \n",
" for k in range(1,len(tableVocabulary)):\n",
" for q in range(0,len(tableVocabulary[0])):\n",
" tableVocabulary[k].append(0)\n",
" \n",
" for i in range(0,len(self.dataBase)):\n",
" for j in range(1,len(self.dataBase[i])):\n",
" for k in range(0,len(tableVocabulary[0])):\n",
" if (tableVocabulary[0][k]==self.dataBase[i][j]):\n",
" tableVocabulary[int(dataBase[i][0])+1][k] += 1\n",
" \n",
" print(tableVocabulary)\n",
" \n",
" for i in range(1,len(tableVocabulary)):\n",
" for j in range(0,len(tableVocabulary[i])-1):\n",
" tableVocabulary[i][j] += 1\n",
" \n",
" for i in range(1,len(tableVocabulary)):\n",
" for j in range(0,len(tableVocabulary[i])-1):\n",
" tableVocabulary[i][len(tableVocabulary[i])-1] += tableVocabulary[i][j]\n",
" \n",
" for i in range(1,len(tableVocabulary)):\n",
" for j in range(0,len(tableVocabulary[i])-1):\n",
" tableVocabulary[i][j] /= tableVocabulary[i][len(tableVocabulary[i])-1] \n",
" \n",
" #preprosesing sample to correct x-vector\n",
" x = [[]]\n",
" x[0].append('0')\n",
" x[0].append(sample)\n",
" x = lowercase(x)\n",
" x = splitDataBase(x)\n",
" x = deleteSTOPWORDS(x)\n",
" \n",
" #countProbOccurrenceInClasses \n",
" for c in range(0,len(self.classesProb)):\n",
" self.probOccurrence.append(0)\n",
" P = 0\n",
" for i in range(1,len(x[0])):\n",
" for j in range(0,len(tableVocabulary[0])-1):\n",
" if (x[0][i] == tableVocabulary[0][j]):\n",
" P = P + math.log(tableVocabulary[c+1][j])\n",
" self.probOccurrence[c] = math.log(self.classesProb[c]) + P \n",
" \n",
" def fit(self): \n",
" accuracy = 0\n",
" trainDataBase = []\n",
" testDataBase = []\n",
" for i in range(0, len(self.dataBase)):\n",
" if ((i+1)%5 == 0):\n",
" testDataBase.append(self.dataBase[i])\n",
" else:\n",
" trainDataBase.append(self.dataBase[i])\n",
" self.dataBase = trainDataBase;\n",
" for i in range(0, len(testDataBase)):\n",
" sample = \"\"\n",
" for j in range(1, len(testDataBase[i])):\n",
" sample = sample + \" \" + testDataBase[i][j]\n",
" if (self.predict(sample)[1]==testDataBase[i][0]):\n",
" accuracy += 1\n",
" accuracy = (accuracy*100)/len(testDataBase)\n",
" accuracy = str(int(accuracy)) + \"%\"\n",
" return accuracy \n",
" \n",
" def predict(self, sample):\n",
" answer = [sample]\n",
" answer.append(str(0))\n",
" self.countProbOccurrenceInClasses(sample)\n",
" m = self.probOccurrence[0]\n",
" for i in range(0,len(self.probOccurrence)):\n",
" if (self.probOccurrence[i]>m):\n",
" m = self.probOccurrence[i]\n",
" answer[-1] = str(i)\n",
" return answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test model:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['chinese', 'beijing', 'shanghai', 'macao', 'tokyo', 'japan', 'ALL'], [5, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 0]]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] C:\\Users\\Александр\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] C:\\Users\\Александр\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"['Chinese The Chinese In Chinese Tokyo The In The Japan', '0']"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataBase = readCSVFileToArray(dataBase)\n",
"dataBase = lowercase(dataBase)\n",
"dataBase = splitDataBase(dataBase)\n",
"dataBase = deleteSTOPWORDS(dataBase)\n",
"\n",
"nb = NaiveBayes(dataBase)\n",
"nb.predict(\"Chinese The Chinese In Chinese Tokyo The In The Japan\")\n",
"#nb.fit()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
7 changes: 7 additions & 0 deletions Naive_Bayes_text_classification/README.md
@@ -0,0 +1,7 @@
In text classification, the goal is to find the best class for a document.
Multinomial Naive Bayes (multinomial NB) is a probabilistic, supervised learning method.

Please use the PDF file for more detailed instructions; for a worked example, see slide 44.
Use a Naive Bayes model with the TF/IDF algorithm to solve a text classification problem. Feel free to choose any task for applying this algorithm.
Datasets can be taken from a broad set of sources, e.g., Kaggle.
For testing purposes you can use the email spam filter dataset from here.
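
A minimal sketch of the suggested pipeline (multinomial NB over TF/IDF features) using scikit-learn is shown below; the file name `spam.csv` and the `text`/`label` column names are placeholder assumptions, not part of these instructions.

```python
# Minimal sketch: multinomial Naive Bayes over TF/IDF features.
# NOTE: "spam.csv" and the "text"/"label" columns are placeholder assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = pd.read_csv("spam.csv")  # expected columns: "text", "label"
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42)

# learn the TF/IDF vocabulary from the training split only
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = MultinomialNB()  # Laplace smoothing (alpha=1.0) by default
model.fit(X_train_tfidf, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test_tfidf)))
```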
Binary file added Naive_Bayes_text_classification/naivebayes.pdf
Binary file not shown.