- This project aims to identify and filter out content that is offensive, harmful, or disrespectful, in order to provide a safer online environment.
- With the rapid growth of online platforms and social media, there is an increasing need to address the critical natural language processing task of abusive comment detection in low-resource languages such as Telugu.
- Python, with Google Colab for training models on a GPU.
- HuggingFace Transformers for model training and saving.
- Visual Studio (VS) Code with a Conda environment.
- Streamlit for the frontend, CSS for styling.
Live Demo: Streamlit Abusive Comment Detector App
Video Demo of Local Run: YouTube Video
Clone the project
git clone https://github.com/Revanth-Reddy-Pingala/Abusive_Comment_Detector_BERT
Go to the project directory
cd Abusive_Comment_Detector_BERT
After setting up the environment and installing the packages listed in requirements.txt, run the following command to launch the app locally
streamlit run ./app.py
- GitHub and code setup.
- Text preprocessing.
- Exploratory data analysis.
- Model hyperparameter tuning.
- Fine-tuning the BERT model on Google Colab's GPU.
- Saving the fine-tuned BERT model locally.
- Pushing the saved model to Hugging Face.
- Developing the web app with Streamlit.
- Loading the saved model from Hugging Face using transformers.
- Deploying the app to Streamlit Community Cloud.
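
The model-loading and prediction steps in the list above might look roughly like the following. This is a minimal sketch, not the repository's actual code: `MODEL_ID` is a hypothetical placeholder (the Hugging Face repo id is not given here), and `load_classifier` and `predict` are illustrative helper names.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder: replace with the real Hugging Face repo id.
MODEL_ID = "your-username/your-finetuned-bert"


def load_classifier(model_id: str = MODEL_ID) -> Callable[[str], List[Dict]]:
    """Build a text-classification pipeline from a Hugging Face model id."""
    # Imported lazily so predict() can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def predict(classifier: Callable[[str], List[Dict]], comment: str) -> Dict:
    """Return the top label and its score for a single comment."""
    # A text-classification pipeline returns a list of {"label", "score"} dicts.
    result = classifier(comment)[0]
    return {"label": result["label"], "score": float(result["score"])}
```

In a Streamlit app, the classifier would typically be loaded once and reused across reruns, since loading a transformer model is the expensive step.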
- The project in this repository is a sample version of my main project at the Central University of Tamil Nadu.
- The final dataset combines two sources: a real-time dataset from a research paper that uses SnapChat data, and a dataset from a CodaLab competition.
- The Abusive Comment Dataset contains comments in native Telugu script, code-mixed Telugu (Telugu and English combined in the same comment, or Telugu comments written in the English alphabet), and combined Telugu-English comments.
- Data: comments in native Telugu, code-mixed Telugu, and Telugu-English.
- Sentiment: Abusive or Not Abusive.
- Missing values
- Duplicates
- Data types
- Count of the Abusive and Not Abusive classes
- A more detailed description can be found in the files in the notebook folder.
- The dataset is balanced, so no class-balancing steps are needed.
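
The checks listed above (missing values, duplicates, data types, and class counts) can be sketched in plain Python. This is only an illustration: it assumes each row is a dict with the `Data` and `Sentiment` fields described above, and `summarize` is a hypothetical helper, not a function from the notebooks.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(rows: List[Dict[str, Any]],
              text_col: str = "Data",
              label_col: str = "Sentiment") -> Dict[str, Any]:
    """Count missing rows, duplicate comments, value types, and class sizes."""
    missing = 0
    texts: List[str] = []
    class_counts: Counter = Counter()
    dtypes: Counter = Counter()

    for row in rows:
        text = row.get(text_col)
        label = row.get(label_col)
        dtypes[type(text).__name__] += 1  # type of the raw text value
        # Treat a row as missing if either field is absent or the text is blank.
        if text is None or label is None or not str(text).strip():
            missing += 1
            continue
        texts.append(str(text).strip())
        class_counts[label] += 1

    return {
        "rows": len(rows),
        "missing": missing,
        "duplicates": len(texts) - len(set(texts)),  # repeated comment texts
        "dtypes": dict(dtypes),
        "class_counts": dict(class_counts),
    }
```

Comparing the two entries in `class_counts` is a quick way to confirm that the classes are roughly balanced.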
- BERT
- mBERT
- XLMRoBERTa
- loss
- accuracy
- precision
- recall
- F1 score
- BERT - 69% (sample dataset of 1k comments)
- mBERT - 86% (entire dataset of 34.5k comments)
- XLMRoBERTa - 89% (entire dataset of 34.5k comments)
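
For reference, the classification metrics listed above can be computed from true and predicted labels as in the sketch below. It assumes binary labels with 1 standing for Abusive (an illustrative encoding, not one confirmed by the project), and it leaves out loss, which needs the model's predicted probabilities rather than hard labels.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int],
                   y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so empty inputs or absent classes return 0.0.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```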
- Due to the limited cloud RAM available after deployment.
- The fine-tuned mBERT and XLMRoBERTa models use a lot of RAM while fetching and predicting.