This repository contains a Dockerized pipeline for collecting tweets from two sources: Twitter scraping and dataset sampling. The pipeline is built on Airflow DAGs, and the tweets are stored in Azure Database for PostgreSQL. The data in this warehouse is then used as input for sentiment analysis predictive modelling and fine-tuning.
- Aulia Nur Fajriyah - 20/456360/TK/50490
- Daffa Muhammad Romero - 20/456363/TK/50493
- Hafizha Ulinnuha Ahmad - 20/456365/TK/50495
- Mochammad Novaldy Pratama Hakim - 20/463606/TK/51598
The Airflow image has been extended to include the Python dependencies listed in requirements.txt.
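For reference, this kind of extension usually amounts to a small Dockerfile along the following lines (a sketch only; the repository's actual Dockerfile may differ):
FROM apache/airflow:2.4.3
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt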
Change the image line in docker-compose.yaml to this:
image: ${AIRFLOW_IMAGE_NAME:-extending_airflow:latest}
Note: The Dockerfile has been set up for Airflow version 2.4.3. Change this line to suit a different version of Airflow:
FROM apache/airflow:2.4.3
Build the image with:
$ docker build . --tag extending_airflow:latest
Then, to start the containers:
$ docker-compose -f docker-compose.yaml up -d
Edit the DAGs as follows:
- Insert your Twitter API credentials in dags/twitter_dag_azure_ETL.py, lines 17-20 (see the sketch after this list):
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
- Insert the database credentials listed in CREDENTIALS.md into dags/csv_to_azure_dag.py, lines 33-37 (see the sketch after this list):
# Update connection string information
host = ""
dbname = ""
user = ""
password = ""
sslmode = ""
To access the local database via pgAdmin:
- Go to https://localhost:15432
- Input the default email and password
- Add a new server.
- Set host to 'airflow-postgres-1'
- Set username and password to 'airflow'
- You can now access the table for the scraped Tweets.
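Alternatively, the local database can be queried directly inside the Postgres container; the container name and database name below are assumptions based on the default Airflow docker-compose naming:
$ docker exec -it airflow-postgres-1 psql -U airflow -d airflow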
To access the Azure database via pgAdmin:
- Add a new server.
- Set the host, username, and password using the credentials listed in CREDENTIALS.md
- You can now access the table for the scraped Tweets (DAG ID: twitter_crawl_azure).