BERT is a state-of-the-art natural language processing model from Google. Using its latent space, it can be repurposed for various NLP tasks, such as sentiment analysis.
I used Hugging Face Transformers and PyTorch; the task is predicting the positivity/negativity of IMDB reviews.
First, you need to prepare the IMDB data, which is publicly available. The format used here is one review per line, with the first 12500 lines being positive, followed by 12500 negative lines. Positive is encoded as 0 and negative as 1.
You can download data and weights (in the correct format) directly from my drive link here.
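As an illustration, here is a minimal sketch of reading the data in the format described above into review/label lists. The filename `imdb_reviews.txt` is a placeholder for wherever you saved the downloaded file.

```python
def load_imdb_reviews(path):
    """Read one review per line; first 12500 are positive (0), rest negative (1)."""
    with open(path, encoding="utf-8") as f:
        reviews = [line.strip() for line in f]
    # Labels follow the file layout described above.
    labels = [0] * 12500 + [1] * 12500
    return reviews, labels

# Placeholder path; point this at your downloaded file.
reviews, labels = load_imdb_reviews("imdb_reviews.txt")
```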
I used three models (a minimal loading sketch for the two Hugging Face variants follows this list):
- BertForSequenceClassification (Hugging Face)
- BertModel (Hugging Face)
- PyTorch pretrained BERT (not from Hugging Face)
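A sketch of loading the two Hugging Face variants, assuming the standard `bert-base-uncased` pretrained checkpoint (the exact checkpoint and settings in my experiments may differ):

```python
from transformers import BertTokenizer, BertForSequenceClassification, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Variant 1: BERT with a ready-made classification head (2 labels: positive/negative).
clf_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Variant 2: the bare encoder; you attach your own classifier on top of its output.
base_model = BertModel.from_pretrained("bert-base-uncased")
```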
- BertForSequenceClassification:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0.0 | 0.90 | 0.93 | 0.91 | 12500 |
| 1.0 | 0.93 | 0.90 | 0.91 | 12500 |
| accuracy | | | 0.91 | 25000 |
| macro avg | 0.91 | 0.91 | 0.91 | 25000 |
| weighted avg | 0.91 | 0.91 | 0.91 | 25000 |
Accuracy achieved: 91%
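For reference, the table above has the layout of scikit-learn's `classification_report`, so given ground-truth labels and model predictions you can reproduce it like this (`y_true` and `y_pred` below are toy placeholders):

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1]   # ground-truth labels (0 = positive, 1 = negative)
y_pred = [0, 1, 1, 1]   # model predictions
print(classification_report(y_true, y_pred))
```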
- After optimization experiments, the base BertModel does better, with an accuracy of 93% (see the classifier-head sketch below).
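Since BertModel is only the bare encoder, classification requires attaching your own head. Below is a minimal sketch of one common way to do it (pooled [CLS] output into a dropout plus linear layer); this is an illustrative assumption, not necessarily the exact head used in my experiments.

```python
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is the [CLS] representation (transformers v4+ dict-style output).
        pooled = outputs.pooler_output
        return self.head(self.dropout(pooled))
```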
I will optimize the hyperparameters further later, to get as close to the state of the art (SOTA) as possible.
You can view the optimization experiments here.
The code has been uploaded both as a notebook and as a .py file.
Note: for the .py file, make sure transformers is installed (`pip install transformers`) and set the correct paths on lines 76 and 227.
Code with the base BertModel can be found here.
Useful comments and links to tutorials are provided inside the notebook to guide you through the code.