Project based on the XGBoost algorithm that estimates the load factor of competitors from advance purchase 15 down to 0.
These instructions will get you a copy of the project up and running on your local machine for development.
The application is authenticated with the
GOOGLE_APPLICATION_CREDENTIALS
application credential.
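As a minimal sketch of a pre-flight check, the snippet below verifies that GOOGLE_APPLICATION_CREDENTIALS is set and points to an existing file before any BigQuery call is made. The helper name check_credentials is hypothetical and is not part of this project.

```python
import os


def check_credentials(env=None):
    """Return the credentials path, or raise if it is unusable (hypothetical helper)."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("Credentials file not found: " + path)
    return path
```

Calling check_credentials() at the start of a run would fail early with a readable message instead of a later authentication error.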
- Python 3.6 packages
google-cloud-bigquery==1.10.0
python-dateutil==2.8.0
numpy==1.16.3
pandas==0.24.2
pyod==0.6.8
pyarrow==0.13.0
scikit-learn==0.20.3
xgboost==0.82
- BigQuery tables
desa-cli-aa360.IORM_MINUTA_DISPO.MINUTA_DISPO_BI
desa-cli-aa360.IORM_INFARE_RM.FT_INFARE_AR
desa-cli-aa360.IORM_INFARE_RM.FT_INFARE_CL
desa-cli-aa360.IORM_INFARE_RM.FT_INFARE_CO
desa-cli-aa360.IORM_INFARE_RM.FT_INFARE_EC
desa-cli-aa360.IORM_INFARE_RM.FT_INFARE_PE
git clone [email protected]:revenue-latam/competitor-load-factor-estimator.git
cd competitor-load-factor-estimator
virtualenv env -p python3.6
source env/bin/activate
pip3 install -r requirements.txt
python3 main.py [GBQ_DATASET] [GBQ_TABLE] [COUNTRY CODE]
Example
python3 main.py IORM_MODELS competitors_load_factors CL
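The three positional arguments above could be parsed roughly as follows. This is only a sketch using the standard argparse module; how main.py actually reads its arguments is not shown here, and the names parse_args, gbq_dataset, gbq_table and country_code are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments of the command line (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Estimate competitor load factors"
    )
    parser.add_argument("gbq_dataset", help="BigQuery dataset name")
    parser.add_argument("gbq_table", help="BigQuery table name")
    parser.add_argument("country_code", help="country code, for example CL")
    return parser.parse_args(argv)


# Same values as the example command above
args = parse_args(["IORM_MODELS", "competitors_load_factors", "CL"])
```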
This project estimates the load factor of competitors on domestic routes using machine learning techniques. To achieve this, data from historical LATAM flights was collected.
The data is preprocessed to remove canceled flights (flights with no record at advance purchase (AP) 0), fill missing price values, and perform feature engineering: encoding categorical fields as dummy variables and building time steps of prices together with their changes over time.
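The preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in preprocess.py: the input layout (a dict of prices keyed by AP for each flight), the forward-fill strategy for missing prices, and the names fill_prices and build_rows are all hypothetical.

```python
def fill_prices(prices_by_ap, max_ap):
    """Fill missing prices from AP = max_ap down to AP = 0 (assumed strategy)."""
    series = [prices_by_ap.get(ap) for ap in range(max_ap, -1, -1)]
    known = [p for p in series if p is not None]
    if not known:
        return None  # no price at all: nothing to fill from
    last = known[0]  # leading gaps are back-filled with the first known price
    filled = []
    for price in series:
        if price is not None:
            last = price
        filled.append(last)
    return filled  # index 0 is AP = max_ap, the last index is AP = 0


def build_rows(flights, max_ap=15, steps=3):
    """Build one row per (flight, AP) with `steps` prices and their deltas."""
    rows = []
    for flight_id, prices_by_ap in flights.items():
        if 0 not in prices_by_ap:
            continue  # no AP 0 record: treated as a canceled flight
        filled = fill_prices(prices_by_ap, max_ap)
        if filled is None:
            continue
        for i in range(steps - 1, len(filled)):
            window = filled[i - steps + 1:i + 1]  # oldest to newest price
            deltas = [b - a for a, b in zip(window, window[1:])]
            rows.append({
                "flight": flight_id,
                "ap": max_ap - i,
                "prices": window,
                "deltas": deltas,
            })
    return rows
```

With steps=3, each row carries the current price, the two preceding observations, and the two differences between them.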
A machine learning model is then trained using the xgboost algorithm with one set of hyperparameters for each country.
With the trained model, the process iterates over the competitor airlines, processes the data downloaded from the Infare database, and predicts their load factor.
Files: train.py
preprocess.py
query_train.py
The training process generates an XGBoost model based on LATAM data and uses the following variables:
- Advance purchase
- Route
- Month
- Day of week
- Hour of departure
- Price with N time steps
- Price delta based on time steps
- Load factor
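As a rough sketch of how the categorical variables in this list could become dummy variables, the snippet below one-hot encodes route, month and day of week for a single row. It assumes that advance purchase and hour of departure stay numeric and that load factor is the training target; the function names, the row keys and the route codes are illustrative and not taken from the project.

```python
def one_hot(value, categories, prefix):
    """Return {prefix_category: 0 or 1} for one categorical value."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def encode_row(row, routes):
    """Turn one raw row into a flat feature dict (hypothetical layout)."""
    features = {"ap": row["ap"], "hour": row["hour"]}
    features.update(one_hot(row["route"], routes, "route"))
    features.update(one_hot(row["month"], range(1, 13), "month"))
    features.update(one_hot(row["dow"], range(7), "dow"))
    return features


# Illustrative row: the route codes are examples only
example = encode_row(
    {"ap": 5, "hour": 14, "route": "SCL-ANF", "month": 3, "dow": 6},
    routes=["SCL-ANF", "SCL-CCP"],
)
```

Passing the same routes list at training and prediction time keeps the column set identical across both steps.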
Files: predict.py
preprocess.py
query_competitor.py
The prediction process uses the model generated by the training process and uploads the resulting dataframe to Google BigQuery.
- Output
observation_date
carrier
origin
destination
flight
departure_date
departure_time
predicted_load_factor
The config.py file is used to configure the parameters and initial settings of the process.
Parameters
- MAX_AP: Maximum advance purchase for training and prediction.
- STEPS: Number of time steps for price value.
- YEARS_TRAIN: Used to get the time window for the training dataset.
- MONTH_TRAIN: Used to get the time window for the training dataset.
- AIRLINES: Dictionary keyed by country code; each value is the list of target airlines for that country.
- XGB_PARAMS: Dictionary keyed by country code; each value holds the xgboost hyperparameters for that country.
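A possible shape for these settings is sketched below. Only MAX_AP = 15 follows from the advance purchase range stated at the top of this document; every other value, including the carrier codes and the hyperparameter values, is a placeholder, and the helper settings_for is hypothetical.

```python
# Hypothetical config.py layout; values are placeholders unless noted.
MAX_AP = 15       # matches the "advance purchase 15 to 0" range
STEPS = 3         # placeholder: number of price time steps
YEARS_TRAIN = 1   # placeholder: part of the training time window
MONTH_TRAIN = 0   # placeholder: part of the training time window

# Placeholder carrier codes, keyed by country code
AIRLINES = {"CL": ["XX", "YY"]}

# Placeholder hyperparameter values, keyed by country code
XGB_PARAMS = {
    "CL": {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 100},
}


def settings_for(country_code):
    """Return (airlines, xgb_params) for a country, or raise KeyError."""
    if country_code not in AIRLINES or country_code not in XGB_PARAMS:
        raise KeyError("No configuration for country code " + country_code)
    return AIRLINES[country_code], XGB_PARAMS[country_code]
```

Looking up both dictionaries through one helper makes a missing country code fail in one place, before any data is downloaded.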