
End-to-end Data Engineering: World Cup 2022 Tweets Pipelining

This repository contains a Dockerized pipeline that collects tweets from two sources: Twitter scraping and dataset sampling. The pipeline is orchestrated with Airflow DAGs, and the tweets are stored in Azure Database for PostgreSQL. The warehoused data is then used as input for sentiment-analysis predictive modelling and fine-tuning.

Group 11

Running the project

The Airflow image has been extended to include the Python dependencies listed in requirements.txt.

Change the image line in docker-compose.yaml to this:

image: ${AIRFLOW_IMAGE_NAME:-extending_airflow:latest}

Note: The Dockerfile targets Airflow version 2.4.3. To use a different version of Airflow, change this line:

FROM apache/airflow:2.4.3 
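
For reference, a minimal extended Dockerfile along these lines would install the extra dependencies on top of the base image (the COPY path and pip invocation here are assumptions for illustration, not necessarily the exact contents of this repository's Dockerfile):

FROM apache/airflow:2.4.3

# Copy the dependency list into the image and install it on top of the base image
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt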

Build the image with:

$ docker build . --tag extending_airflow:latest

Then, to start the containers:

$ docker-compose -f docker-compose.yaml up -d

Edit the DAGs as follows:

  • Insert your Twitter API credentials in dags/twitter_dag_azure_ETL.py, lines 17-20 (a usage sketch follows this list):
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
  • Insert the database credentials listed in CREDENTIALS.md into dags/csv_to_azure_dag.py, lines 33-37 (see the connection sketch after this list):
# Update connection string information
host = ""
dbname = ""
user = ""
password = ""
sslmode = ""

To access the local database via pgAdmin:

  • Go to https://localhost:15432.
  • Log in with the default email and password.
  • Add a new server.
  • Set the host to 'airflow-postgres-1'.
  • Set the username and password to 'airflow'.
  • You can now access the table for the scraped Tweets.

To access the Azure database via pgAdmin:

  • Add a new server.
  • Set the host, username, and password using the credentials listed in CREDENTIALS.md.
  • You can now access the table for the scraped Tweets (DAG ID: twitter_crawl_azure).
