This project, based on the Machine Learning Zoomcamp course and the Flight Delay dataset from Kaggle, builds a real-time machine learning system for predicting flight durations. It uses Azure Event Hubs for both data ingestion and output. Predicted flight durations are updated continually and can be used to optimize airline operations, feed customer applications with real-time information, and support business decisions, improving operational efficiency and passenger satisfaction in the aviation sector.
When booking a flight, passengers often receive estimated flight durations that do not reflect real-world conditions. Departure and arrival delays can significantly change the actual flight duration, causing uncertainty and inconvenience for travelers. To give customers a more accurate and reliable flight duration estimate, this project applies machine learning techniques. By training a model on historical flight delay data, the project aims to build a system that predicts flight durations in real time. The predictive model is intended to consider factors such as weather, aircraft type, historical performance, and real-time data feeds to provide up-to-date and more precise estimates. The same system can also help airlines optimize their operations, resource allocation, and scheduling.
This project uses the tools below.
- Infrastructure Setup: Terraform (for provisioning and managing infrastructure)
- Containerization: Docker and Docker Compose (for containerized deployment and management)
- Cloud Storage: Azure Blob Storage (for data storage)
- Reproducibility: Makefile (for ease of project reproducibility)
- Machine Learning: Scikit-learn (for model training and prediction)
We begin by acquiring the Flight Delay dataset from Kaggle and exploring the data to gain insights. We experiment with machine learning models and settle on Linear Regression for its simplicity and accuracy. After training, we save a model artifact that contains both the model and a DictVectorizer, so the same feature encoding can be reused in production. We also create sample event data to simulate real-time flight data for the next step.
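The training and artifact step can be sketched as follows. This is a minimal illustration, not the project's actual training code: the records, field names, and durations are hypothetical placeholders rather than the real Kaggle columns, and pickling the vectorizer and model together as a tuple is an assumption about the artifact layout.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical flight records; these field names are placeholders,
# not the actual columns of the Kaggle dataset.
flights = [
    {"carrier": "AA", "origin": "JFK", "dest": "LAX", "distance": 2475.0, "dep_delay": 5.0},
    {"carrier": "DL", "origin": "ATL", "dest": "ORD", "distance": 606.0, "dep_delay": 0.0},
    {"carrier": "UA", "origin": "SFO", "dest": "SEA", "distance": 679.0, "dep_delay": 12.0},
    {"carrier": "AA", "origin": "DFW", "dest": "MIA", "distance": 1121.0, "dep_delay": 3.0},
]
durations = [330.0, 115.0, 125.0, 170.0]  # made-up flight durations in minutes

# One-hot encode string fields and pass numeric fields through unchanged.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(flights)

model = LinearRegression()
model.fit(X_train, durations)

# Bundle the vectorizer with the model so inference reuses the same encoding.
artifact = pickle.dumps((dv, model))

# Later, in the consumer: restore both objects and score a new record.
loaded_dv, loaded_model = pickle.loads(artifact)
new_flight = {"carrier": "DL", "origin": "JFK", "dest": "LAX", "distance": 2475.0, "dep_delay": 8.0}
predicted = loaded_model.predict(loaded_dv.transform([new_flight]))
print(f"Predicted duration: {predicted[0]:.1f} minutes")
```

Keeping the vectorizer inside the artifact matters because categories unseen at training time are simply ignored by `transform`, so the consumer never has to rebuild the feature mapping itself.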
Using the sample data generated in step 1, we send events to an Azure Event Hub to simulate real-time data transmission.
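A rough sketch of the replay logic is shown below. The `send` callable is a hypothetical stand-in for whatever hands a payload to the Event Hub producer (the Azure client code is deliberately left out), and the event fields and the delay between events are assumptions made for illustration.

```python
import json
import time


def replay_events(events, send, delay_seconds=1.0):
    """Send each event as UTF-8 JSON, pausing between events to mimic a live feed."""
    sent = 0
    for event in events:
        send(json.dumps(event).encode("utf-8"))
        sent += 1
        time.sleep(delay_seconds)
    return sent


# Hypothetical sample events; the field names are placeholders.
sample_events = [
    {"flight_id": "FL0001", "carrier": "AA", "distance": 2475.0},
    {"flight_id": "FL0002", "carrier": "DL", "distance": 606.0},
    {"flight_id": "FL0003", "carrier": "UA", "distance": 679.0},
]

# Collect payloads in a list instead of sending them to a real Event Hub.
outbox = []
sent_count = replay_events(sample_events, outbox.append, delay_seconds=0.0)
print(f"Sent {sent_count} events")
```

Passing the sender in as a function keeps the replay loop easy to test locally before pointing it at a real Event Hub.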
A Python script is designed to listen to an Azure Event Hub, fetching incoming events. The script utilizes the trained model from step 1 to make predictions based on received data. Checkpoint events are created and stored in Azure Blob Storage to prevent event duplication. Once the data is processed, the predicted events are pushed into another Azure Event Hub.
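The consumer's per-event logic might look like the sketch below. It is only an illustration under stated assumptions: `InMemoryCheckpointStore` is a hypothetical stand-in for the checkpoints kept in Azure Blob Storage, `dummy_predict` replaces the trained model, `publish` stands in for the producer of the output Event Hub, and the event fields are invented.

```python
import json


class InMemoryCheckpointStore:
    """Hypothetical stand-in for a blob-backed store: last sequence number per partition."""

    def __init__(self):
        self._last_seen = {}

    def is_processed(self, partition_id, sequence_number):
        return sequence_number <= self._last_seen.get(partition_id, -1)

    def update(self, partition_id, sequence_number):
        self._last_seen[partition_id] = sequence_number


def handle_event(partition_id, sequence_number, body, predict, checkpoints, publish):
    """Score one event unless it was already processed; return True if it was scored."""
    if checkpoints.is_processed(partition_id, sequence_number):
        return False
    features = json.loads(body)
    enriched = dict(features, predicted_duration=predict(features))
    publish(json.dumps(enriched).encode("utf-8"))
    # Record progress only after the prediction has been published.
    checkpoints.update(partition_id, sequence_number)
    return True


def dummy_predict(features):
    # Placeholder for the trained model: a made-up rule, not a real estimate.
    return round(features["distance"] / 8.0, 1)


store = InMemoryCheckpointStore()
published = []
body = json.dumps({"flight_id": "FL0001", "distance": 800.0}).encode("utf-8")

first = handle_event("0", 0, body, dummy_predict, store, published.append)
second = handle_event("0", 0, body, dummy_predict, store, published.append)  # redelivery is skipped
print(first, second, len(published))
```

Updating the checkpoint after publishing means a crash between the two steps leads to a repeated prediction rather than a lost one, which is usually the safer failure mode for this kind of pipeline.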
Another Python script is developed to monitor the Azure Event Hub containing the predicted events. This output can serve as an API response or be used in customer-facing applications and for critical business decisions.
Prerequisites:
To reproduce this project, you need an Azure account.
You also need the tools below.
- Makefile
pip install make
- Azure CLI
- Terraform
- Docker
- Docker Compose
You will also need the Python packages used to train the model. Run
make prerequisite
to install the required packages.
Once all packages are installed, follow the steps in Reproduce to re-create the project.
Deploy the consumer Docker container as a function app
Improve prediction accuracy