This repository contains a Dockerized pipeline for collecting tweets from two sources: Twitter scraping and dataset sampling. The pipeline is built on Airflow DAGs, and the tweets are stored in Azure Database for PostgreSQL. The data in this warehouse is then used as input for sentiment analysis predictive modelling and fine-tuning.
- Aulia Nur Fajriyah - 20/456360/TK/50490
- Daffa Muhammad Romero - 20/456363/TK/50493
- Hafizha Ulinnuha Ahmad - 20/456365/TK/50495
- Mochammad Novaldy Pratama Hakim - 20/463606/TK/51598
The Airflow image has been extended to include the Python dependencies listed in requirements.txt.
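For reference, this kind of extension usually amounts to a small Dockerfile along the following lines (a sketch only; the repository's actual Dockerfile may differ):
FROM apache/airflow:2.4.3
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt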
Change the image line in docker-compose.yaml to this:
image: ${AIRFLOW_IMAGE_NAME:-extending_airflow:latest}
Note: The Dockerfile has been set up for Airflow version 2.4.3. Change this line to suit a different version of Airflow:
FROM apache/airflow:2.4.3
Build the image with:
$ docker build . --tag extending_airflow:latest
Then, to start the containers:
$ docker-compose -f docker-compose.yaml up -d
Edit the DAGs as follows:
- Insert your Twitter API credentials in dags/twitter_dag_azure_ETL.py, lines 17-20 (see the sketch after this list):
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
- Insert the database credentials listed in CREDENTIALS.md into dags/csv_to_azure_dag.py, lines 33-37 (see the sketch after this list):
# Update connection string information
host = ""
dbname = ""
user = ""
password = ""
sslmode = ""
To access the local database via pgAdmin:
- Go to https://localhost:15432
- Input the default email and password
- Add a new server.
- Set host to 'airflow-postgres-1'
- Set username and password to 'airflow'
- You can now access the table for the scraped Tweets.
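Alternatively, the local database can be queried directly inside the Postgres container; the container name and database name below are assumptions based on the default Airflow docker-compose naming:
$ docker exec -it airflow-postgres-1 psql -U airflow -d airflow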
To access the Azure database via pgAdmin:
- Add a new server.
- Set the host, username, and password using the credentials listed in CREDENTIALS.md
- You can now access the table for the scraped Tweets (DAG ID: twitter_crawl_azure).