In the first part of this project, you will create the training pipeline for a categorization model.
More specifically, you will train a model that receives product data and returns the best categories for each product.
More info about the data can be found here.
Your training pipeline should be composed of the following steps:
- Data extraction: Loads a dataset with product data from a specified path available in the environment variable DATASET_PATH.
- Data formatting: Processes the dataset to use it for training and validation.
- Modeling: Specifies a model to handle the categorization problem.
- Model validation: Generates metrics about the model accuracy (precision, recall, F1, etc.) for each category and exports them to a specified path available in the environment variable METRICS_PATH.
- Model exportation: Exports a candidate model to a specified path available in the environment variable MODEL_PATH.
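The extraction, validation, and export steps share one pattern: read a path from an environment variable, then read or write at that path. A minimal sketch of that pattern, together with per-category precision, recall, and F1 computed from scratch, might look like the following. The helper names (get_path, per_category_metrics, export_metrics) and the JSON output format are illustrative assumptions; the assignment does not prescribe them, and a library such as scikit-learn provides equivalent metrics.

```python
import json
import os
from collections import Counter


def get_path(variable_name):
    """Return the path stored in an environment variable, failing early if it is unset."""
    value = os.environ.get(variable_name)
    if not value:
        raise RuntimeError(f"Environment variable {variable_name} is not set")
    return value


def per_category_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for each category from two label lists."""
    actual_counts = Counter(y_true)
    predicted_counts = Counter(y_pred)
    true_positives = Counter(
        expected for expected, predicted in zip(y_true, y_pred) if expected == predicted
    )

    metrics = {}
    # Assumes category labels are mutually comparable (e.g. all strings).
    for category in sorted(set(y_true) | set(y_pred)):
        tp = true_positives[category]
        predicted = predicted_counts[category]
        actual = actual_counts[category]
        # Guard against division by zero for categories never predicted or never seen.
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[category] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return metrics


def export_metrics(metrics, path):
    """Write the metrics dictionary to `path` as JSON (format is an assumption)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)
```

In the notebook, the validation step could then end with export_metrics(per_category_metrics(y_true, y_pred), get_path("METRICS_PATH")), where y_true holds the validation labels and y_pred the model predictions.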
The training pipeline should be implemented using JupyterLab in a file named trainer.ipynb.
Use Markdown cells to document relevant details about your implementation. Remember that good documentation should focus on the why (e.g., why a specific type of model was chosen), since clean code should be enough to understand the how (e.g., how you selected a specific type of model).
In this directory, we provide a containerized environment that uses docker and docker-compose to run JupyterLab. This should standardize the development environment and avoid compatibility problems.
To install docker and docker-compose, check their official documentation here and here. Both tools can be installed on Linux, macOS, and Windows.
To execute JupyterLab, just run the following command:
docker-compose up --build
Then open the link shown at the end of the command's output.
To install an OS package (Debian-based), add the name of the package to the file packages.txt. To install a Python package (Pip-based), add the name and version of the package to the file requirements.txt.
The evaluation will be based on four criteria:
- Correctness: If the solution runs without unexpected errors.
- Compliance: If the solution respects all specified behaviors, in particular concerning inputs and outputs.
- Code Quality: If the solution follows the principles of clean code and general good practices discussed in class.
- Documentation: If the solution documents relevant decisions in the right measure.