Bots are a familiar presence on Twitter. An estimated 20% to 29% of the content on Twitter in the US is generated by bots (Varanasi, 2022). Some of these bots are harmless, but others engage in fraudulent activities, broadly defined as wrongful or criminal deception intended to result in financial or personal gain. Examples include manipulating election votes (Metz, 2020) and running cryptocurrency scams (Perez, 2022). These fraudulent bots need to be detected quickly and penalized accordingly before they cause further harm to users.
Twitter currently culls 1 million bot accounts per day (Sutcliffe, 2022). However, this is far from enough, as bots continue to plague the platform. Twitter itself admits that fraudulent bot detection is a highly complex and nuanced problem (Twitter, 2021). We therefore propose a data-driven approach that combines traditional machine learning and neural networks to tackle the uncertainty and complexity of fraudulent bot detection. For simplicity, we refer to fraudulent bots as "bots" in the rest of this report.
- scrape_profile_pic.ipynb
  - Scrapes the profile pictures of Twitter users
- Data Cleaning.ipynb
  - Cleans the data based on the file generated in scrape_profile_pic.ipynb, e.g. removal of invalid rows and columns
- Get Face.ipynb
  - Reads the profile pictures of Twitter users to detect the presence of faces
- Graph.ipynb
  - Creates the reciprocity feature of users, based on a graph structure
- Feature Engineering.ipynb
  - Feature engineering based on the files generated in Data Cleaning.ipynb, Get Face.ipynb, and Graph.ipynb
- All notebooks in /Traditional Models and /Neural Networks
  - Each notebook contains code that trains a different ML/NN model and evaluates its performance
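
As a rough illustration of the reciprocity feature mentioned for Graph.ipynb, the sketch below computes, for each user in a directed follow graph, the share of accounts they follow that follow them back. This definition, the function name `reciprocity_by_user`, and the toy edge list are assumptions made for illustration only; the notebook may define or compute reciprocity differently.

```python
from collections import defaultdict


def reciprocity_by_user(follow_edges):
    """Return {user: share of the accounts they follow that follow them back}.

    follow_edges is an iterable of (follower, followee) pairs.
    NOTE: hypothetical helper; this definition of reciprocity is an assumption.
    """
    following = defaultdict(set)  # user -> accounts the user follows
    followers = defaultdict(set)  # user -> accounts that follow the user
    for src, dst in follow_edges:
        if src == dst:
            continue  # ignore self-follows
        following[src].add(dst)
        followers[dst].add(src)

    scores = {}
    for user in set(following) | set(followers):
        out = following.get(user, set())
        if not out:
            scores[user] = 0.0  # follows nobody, so nothing can be reciprocated
            continue
        mutual = out & followers.get(user, set())
        scores[user] = len(mutual) / len(out)
    return scores


# Toy follow graph (made-up data): "a" and "b" follow each other.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
scores = reciprocity_by_user(edges)
```

One reason a feature like this might be useful is that an account following many users with few follow-backs would receive a low score, which could then be joined to the other per-user features in Feature Engineering.ipynb.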