- In this project, I have built models to classify Bangla texts into seven categories (economy, sports, international, state, technology, entertainment, education) using machine learning and deep learning.
- For the dataset, I have used a public dataset called banglamct7-bangla-multiclass-text-dataset-7-tags, which is available in this link.
- Word embedding is used for the deep learning models and TF-IDF is used for the machine learning models as feature representations to capture the semantic meaning of the words.
- Machine learning models have been built using Logistic Regression and Multinomial Naive Bayes.
- Deep learning models have been built using a Deep Neural Network, Convolutional Neural Network, Bidirectional LSTM, and a CNN-BiLSTM hybrid model.
- Finally, the models' performance is evaluated using a confusion matrix, accuracy, precision, recall, and F1-score, along with a classification report.
- Development Environment : Kaggle
- Python Version : 3.7
- Framework and Packages : TensorFlow, Scikit-Learn, Pandas, NumPy, Matplotlib, Seaborn
- Data Collection and Cleaning
- Data Summary
- Data Preparation
- Model
- Model Evaluation
The dataset contains clean data that is ready for feature extraction and classification, but I have additionally applied stopword removal and stemming to the clean text, as sketched below.
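A minimal sketch of this preprocessing step is shown below. The stopword list, suffix list, and column names are illustrative placeholders, not the exact resources used in the project.

```python
import re

# Assumed: a small illustrative Bangla stopword list and suffix set; the actual
# lists/stemmer used in the project are not shown in this README.
STOPWORDS = {"এবং", "ও", "কিন্তু", "থেকে", "এই", "যে", "তার"}
SUFFIXES = ["গুলো", "গুলি", "দের", "টির", "টি", "রা", "কে", "ের", "ে"]

def simple_stem(word):
    """Strip one common inflectional suffix (a crude stand-in for a real Bangla stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[\u0980-\u09FF]+", text)        # keep Bangla characters only
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return " ".join(simple_stem(t) for t in tokens)       # stemming

# df["clean_text"] = df["text"].apply(preprocess)   # assumed column names
```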
Data summary is done in a single notebook, available in this link. In the EDA notebook, I show the number of documents, words, and unique words in each category, a histogram analysis of text length, and an n-gram analysis up to trigrams for each category.
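A rough sketch of the kind of summaries computed in the EDA, assuming a dataframe with `clean_text` and `category` columns and an assumed file name (the actual notebook may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("banglamct7_train.csv")          # assumed file name

# Documents, total words and unique words per category
for label, group in df.groupby("category"):
    words = " ".join(group["clean_text"]).split()
    print(label, len(group), len(words), len(set(words)))

# Histogram of text length (in words)
df["clean_text"].str.split().str.len().hist(bins=50)
plt.xlabel("words per document")
plt.show()

# Top trigrams for one category (unigrams/bigrams work the same via ngram_range)
vec = CountVectorizer(ngram_range=(3, 3), max_features=20)
counts = vec.fit_transform(df.loc[df["category"] == "sports", "clean_text"])
top = sorted(zip(vec.get_feature_names_out(), counts.sum(axis=0).A1),
             key=lambda x: -x[1])                 # needs scikit-learn >= 1.0
print(top[:10])
```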
To prepare the data before model building, I have used TF-IDF for the machine learning models and word embedding for the deep learning models. The parameters are optimized and tuned with respect to the EDA results.
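The sketch below illustrates both feature representations; `max_features`, the vocabulary size, and the sequence length are placeholder values, since the tuned settings come from the EDA and are not listed in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = df["clean_text"].tolist()                 # assumed column name

# --- Machine learning: TF-IDF features ---
tfidf = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))   # illustrative values
X_tfidf = tfidf.fit_transform(texts)

# --- Deep learning: integer sequences fed to an Embedding layer ---
# vocab_size and max_len would be chosen from the EDA (vocabulary size and
# text-length histogram); the numbers below are placeholders.
vocab_size, max_len = 50000, 300
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X_seq = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
```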
I have used Logistic Regression, Multinomial Naive Bayes, a Deep Neural Network, CNN, Bi-LSTM, and a CNN-BiLSTM hybrid model. All the model parameters are tuned and optimized.
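As an example of one of the deep learning architectures, here is a Keras sketch of a CNN-BiLSTM hybrid; the layer sizes and hyperparameters are illustrative, not the tuned values from the project.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Bidirectional, LSTM, Dense, Dropout)

vocab_size, max_len = 50000, 300   # placeholders, matching the tokenizer/padding step
num_classes = 7                    # economy, sports, international, state, technology, entertainment, education
embedding_dim = 128                # illustrative

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len),
    Conv1D(128, kernel_size=5, activation="relu"),   # local n-gram features
    MaxPooling1D(pool_size=2),
    Bidirectional(LSTM(64)),                         # context in both directions
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_seq, y, validation_split=0.1, epochs=5, batch_size=128)   # y: integer labels
```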
Model Name | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression | 0.9156 | 0.9157 | 0.9156 | 0.9156 |
Multinomial Naive Bayes | 0.8858 | 0.8863 | 0.8858 | 0.8859 |
Deep Neural Network | 0.9318 | 0.9320 | 0.9318 | 0.9318 |
CNN | 0.9061 | 0.9067 | 0.9061 | 0.9060 |
Bi-LSTM | 0.9276 | 0.9277 | 0.9275 | 0.9275 |
CNN-BiLSTM Hybrid | 0.9061 | 0.9067 | 0.9061 | 0.9060 |
In this project, the best score I have found is about 93% accuracy, achieved with the Deep Neural Network model.
The classification report and the confusion matrix show that the models most often misclassify the economy, state, and education categories, because the texts in these categories are very similar and contain similar words.
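A typical way to produce these evaluation outputs with scikit-learn, assuming `y_test` and `y_pred` already exist and the label order matches the dataset's seven categories:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             classification_report, confusion_matrix)

# y_test: true labels, y_pred: model predictions (assumed to be available).
print(classification_report(y_test, y_pred))

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")

# Confusion matrix heatmap; off-diagonal mass between economy, state and
# education shows where the overlapping vocabulary causes confusion.
labels = ["economy", "sports", "international", "state",
          "technology", "entertainment", "education"]
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("predicted")
plt.ylabel("true")
plt.show()
```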
In conclusion, I have achieved good accuracy with the different models. This accuracy can be further improved through more hyperparameter tuning and by employing more sophisticated architectures such as Transformers and pretrained embeddings.