
MLOps

Overview of Model Deployment

Why is modern deployment challenging?

  • Modern deployment presents the challenges of traditional software, such as reliability, reusability, maintainability and flexibility, and it also presents additional challenges that are specific to machine learning, like reproducibility.

  • Reliability refers to the ability of the software to produce the intended result and to be robust to errors.

  • Reusability refers to the ability of the software to be used across systems and across projects.

  • Maintainability is the ease with which we can maintain the software over time.

  • Flexibility is the ability to add or remove functionality from the software.

  • Reproducibility instead refers to the ability of the software to produce the same result, given the same data, across systems.

Let's discuss reproducibility in more detail

  • ML model deployment is also challenging because it requires the coordination of various teams within an organization to ensure that the model produces the intended result and works reliably. It usually needs the coordination of:

    • data scientists who develop the machine learning model
    • IT teams and software developers who help put the model in production
    • business professionals who understand how the model will be used in the organization, and the customers and purpose it is intended to serve.
  • Deployment is also challenging because there is often a discrepancy between the programming language in which the model is developed and the programming languages that are available in the production environments.

  • If this is the case, re-coding the model to fit the languages available in production extends the project timeline and also risks a lack of reproducibility.

Summary

  • Model deployment is one of the most important stages in the machine learning lifecycle.

  • And this is because, to start using the machine learning model, it needs to be effectively deployed into production so that it can provide its predictions to other software systems.

  • In other words, to maximize the value of the machine learning model we create, we need to be able to reliably extract the predictions from the model and share them with other systems.

Model Deployment Pipelines

We normally talk about the deployment of machine learning models when, in fact, we mean the deployment of machine learning pipelines.

So in this section, I want to introduce what a machine learning pipeline is and why we deploy pipelines instead of just models.

  • In a simple way of imagining machine learning models, the organization has some data, which can be stored in databases, in the cloud, or retrieved from third-party APIs. It then uses these data to build a model that is able to provide predictions that are of use to the business.

  • However, data is almost never ready to be used to train machine learning models.

  • In fact, data presents a variety of characteristics in terms of format and quality that make it unsuitable for machine learning models.

    • For example, data can present missing values, and some of the libraries used to train machine learning models cannot cope with missing values in a dataset.

    • Data can also contain strings instead of numbers, and the computer cannot perform computations with strings, so we need to transform them into numbers.

    • The variables in the data follow certain distributions, and the distribution of a variable may affect the performance of the model, so we may want to transform some distributions.

    • The data may also present outliers, and some models are sensitive to the presence of outliers, so we may choose to censor or remove these values. The different variables in our dataset may also be on different scales.

    • Some models are sensitive to the scale of the variables, so we need to rescale them.

    • Sometimes our data is in the form of text, so we need to be able to extract information from this text into some sort of input that can be computed. Sometimes our data are images.

    • And again, we should be able to extract data or features from these images.

    • Some of our data may be transactional and we may want to aggregate this data to have a more streamlined view of the customer.

    • Sometimes the data is geolocation data, and we may want to extract features from it.

    • Sometimes we have a time series that we need to process or aggregate in a certain way, and we can also have date and time variables that we cannot use straight away in our models, but instead we derive features from them.

So before we can use these data to train the model, we need to carry out a lot of variable transformations, extract features from the data, or create new features by combining existing ones. So we are not just talking about training the model; we are also talking about preprocessing the data to make it suitable for training. And sometimes we may not want to use all the features available to us, so we may include some form of feature selection.
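As a rough illustration, here is a minimal sketch of a few of the transformations described above, using pandas and scikit-learn; the dataframe, column names, and values are made up for the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data illustrating the issues described above.
df = pd.DataFrame({
    "age": [25.0, None, 40.0],                      # missing value
    "city": ["London", "Paris", "London"],          # strings, need encoding
    "income": [30000.0, 85000.0, 52000.0],          # very different scale from age
    "signup_date": pd.to_datetime(["2021-01-03", "2021-06-21", "2022-02-14"]),
})

# Impute missing numerical values.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Turn strings into numbers (one-hot encoding).
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# Rescale variables that live on very different scales.
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])

# Derive features from a date variable instead of using it directly.
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek
```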

So a machine learning pipeline contains a variety of steps that allow us to go from the raw data to a data format that is suitable to train the machine learning model and to obtain predictions. When we deploy our machine learning model, we do not just deploy the model itself; we need to deploy the entire pipeline.

And why is that? Well, because both in the research and in the production environments, we are most likely going to receive the raw data. So we need the steps of the pipeline that will allow us to produce the transformed data with which we will be able either to train the model or, once the model is trained, to obtain the predictions. I'm going to discuss the details of the different steps of the machine learning pipeline later on in Section four.
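A minimal sketch of that idea, assuming scikit-learn and joblib and using hypothetical column names: the preprocessing steps and the model are bundled into a single Pipeline object, and it is that whole object that gets persisted and deployed.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names.
numeric_features = ["age", "income"]
categorical_features = ["city"]

# Preprocessing steps bundled together with the model,
# so the exact same transformations run in research and in production.
preprocessing = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("model", Lasso(alpha=0.01, random_state=0)),
])

# Research environment: fit on historical raw data.
# pipeline.fit(X_train, y_train)

# Persist and deploy the *entire* pipeline, not just the fitted model.
# joblib.dump(pipeline, "regression_pipeline.joblib")

# Production environment: load the pipeline and feed it raw data directly.
# pipeline = joblib.load("regression_pipeline.joblib")
# predictions = pipeline.predict(new_raw_data)
```

Because the fitted pipeline carries its own preprocessing, the production system can pass raw data straight to predict(), which is exactly why we deploy the pipeline rather than the bare model.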

For the rest of this section, we're going to focus on the concept of building reproducible machine learning pipelines.

Creating Reproducible Machine Learning Pipelines

In this section, we will focus on building reproducible machine learning pipelines and why that matters.

What is deployment of a machine learning model?

The deployment of a machine learning model is the process of making our model available in production environments, where it can provide predictions to other software systems.

Does developing a machine learning model in a Jupyter notebook count as ML deployment?

No. The machine learning models that we develop in our Jupyter notebooks are developed in the so-called research environment.

What is a research environment?

This is an isolated environment, without contact with live data, where the data scientist has the freedom to try and research different models and find a solution to a certain product need. In a common setting, we have historical data and we use this data to train a machine learning model. Once we're happy with the performance of our model in the research environment, we're ready to migrate it to the production environment, where it can receive input from live data and output predictions that can be used to make decisions.

We often talk about deployment of machine learning models when, in fact, we mean deployment of a machine learning pipeline.

So what is a machine learning pipeline?

It is the series of steps that need to occur from the moment we receive the data up to the moment we obtain a prediction.

  • A typical machine learning pipeline includes a big proportion of feature transformation steps.
    Perhaps this is even the biggest part of the pipeline.

  • Then it includes steps to train the machine learning model and steps to output a prediction. We create this entire pipeline in the research environment.

  • We then need to deploy the entire pipeline, not just the model, to the production environment.

    • Why do we need to do this? Because in the live environment we are also going to receive raw data as input, and we need to transform it to create the necessary features so that the model can interpret them and return a prediction.
  • When we deploy our pipeline to production, we need to do it in a way so that the pipelines are reproducible.

What does this mean?

It means that if both the pipeline in the research environment and the pipeline in the production environment receive the same raw data as input, both pipelines should return the same prediction. Same data in, same output out of both pipelines.
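One way to make this expectation concrete is an automated check. The sketch below is a hypothetical pytest-style test: the artefact paths and the sample_raw_input fixture are assumptions, not the repository's actual test suite.

```python
import joblib
import numpy as np


def test_pipelines_return_the_same_predictions(sample_raw_input):
    # Hypothetical paths to the pipeline artefacts used in each environment.
    research_pipeline = joblib.load("research/regression_pipeline.joblib")
    production_pipeline = joblib.load("production/regression_pipeline.joblib")

    # Same raw data into both pipelines ...
    research_pred = research_pipeline.predict(sample_raw_input)
    production_pred = production_pipeline.predict(sample_raw_input)

    # ... should come out as the same predictions, up to floating-point tolerance.
    np.testing.assert_allclose(research_pred, production_pred)
```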

Why does this matter? In the research phase of our project, we build a model and optimize its performance so that it maximizes the business value.

When we develop a machine learning model, or a machine learning pipeline I should say, we test model performance from a statistical perspective, utilizing classical metrics like the accuracy, the precision, the recall, the mean squared error, or any other metric that you're familiar with. And we also evaluate metrics that translate this performance into business value. For example, we may measure the increased revenue that our model will create, or the increased customer satisfaction, or any other measure that we're interested in, which will vary with the organization.
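A minimal sketch of the statistical side of that evaluation, using scikit-learn's metrics module with made-up labels and predictions:

```python
from sklearn.metrics import (
    accuracy_score,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Made-up values for a classification model ...
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # of the predicted positives, how many were right
print(recall_score(y_true, y_pred))     # of the actual positives, how many we found

# ... and for a regression model.
y_true_reg = [102.0, 95.5, 130.0]
y_pred_reg = [100.0, 97.0, 128.5]
print(mean_squared_error(y_true_reg, y_pred_reg))
```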

If we want to translate this value, this revenue increase for example, from the research environment to real life, that is, to the production environment, we need to make sure that the live model reproduces, exactly or at least as closely as possible in practice, the model that we created in the research environment. And this is why our models need to be reproducible.

So why does reproducibility matter? First, if we cannot reproduce our machine learning pipelines across the research and production environments, we may incur financial costs.

But we also incur substantial time loss, because we need to spend a lot of time trying to figure out why the models are not identical, or at least very similar. And this will end up in a loss of reputation for the team. But more important, for me at least, is the fact that without the ability to replicate the model results exactly, it is very difficult to determine whether a new model is truly better than the existing one.

When we develop a new machine learning model, we usually compare it with the current model or set of rules that the organization is using. Without the ability to replicate the results exactly, it is very difficult to determine whether the new model is truly better than the current one.

Tox Code Input

[tox]

The [tox] section of the tox.ini file contains global configuration options for Tox, which apply to all environments. In this case, min_version = 4 specifies the minimum version of Tox that is required to run the configuration file. The envlist option specifies a list of environments to create, separated by commas. In this case, two environments are defined: test_package and checks. The skipsdist option specifies whether to skip creating a source distribution package for each environment. If skipsdist = True, Tox will not create a source distribution package. This can be useful if you are only interested in running tests or other commands, and don't need to create a package.
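Putting those options together, the [tox] section described here is likely along these lines (a sketch reconstructed from the description above; the tox.ini in the repository is the source of truth):

```ini
[tox]
min_version = 4
envlist = test_package, checks
skipsdist = True
```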

[testenv:test_package]

The [testenv:test_package] section of the tox.ini file defines an environment called test_package, which is one of the environments listed in the envlist option in the [tox] section.

The envdir option specifies the directory where the virtual environment for this environment will be created. In this case, it is set to {toxworkdir}/test_package, which means the virtual environment will be created in a subdirectory called test_package under the .tox directory.

The deps option specifies a list of dependencies that will be installed in the virtual environment for this environment. The -r option indicates that the dependencies will be read from a file, which is specified by {toxinidir}/requirements/test_requirements.txt. This file is a list of Python packages required to run the tests for the regression_model package.

The commands option specifies a list of commands that will be run when this environment is activated. In this case, two commands are specified. First, it will run the train_pipeline.py script in the regression_model package. Then it will run the pytest command with some options, including -s to show output from print() statements, -vv for more verbose output, and {posargs:tests/} to run tests in the tests/ directory by default, or other directories specified by additional arguments.
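Based on that description, the section is probably shaped roughly like this (the exact paths and command invocations may differ in the actual file):

```ini
[testenv:test_package]
envdir = {toxworkdir}/test_package
deps =
    -r{toxinidir}/requirements/test_requirements.txt
commands =
    python regression_model/train_pipeline.py
    pytest -s -vv {posargs:tests/}
```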

[testenv:train]

This section specifies a test environment called 'train'.

envdir: This sets the directory where the test environment will be created. In this case, it's set to {toxworkdir}/test_package, which means that Tox will create the environment in a subdirectory called test_package within the directory specified by toxworkdir. toxworkdir is a built-in variable that points to the directory where Tox will create its temporary files.

deps: This specifies the dependencies for the 'train' environment. It is inheriting from another environment called 'test_package', which means that the dependencies listed in the 'deps' section of the 'test_package' environment will also be installed in the 'train' environment. This is achieved using the syntax {[testenv:test_package]deps}, which references the 'deps' section of the 'test_package' environment.

commands: This specifies the commands that Tox should run in the 'train' environment. In this case, it's running a Python script called train_pipeline.py that is located in the regression_model directory. This script is presumably used to train a regression model.
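In other words, the 'train' environment reuses the test_package environment directory and dependencies and only runs the training script. A sketch of what that section likely looks like, based on the description above:

```ini
[testenv:train]
envdir = {toxworkdir}/test_package
deps =
    {[testenv:test_package]deps}
commands =
    python regression_model/train_pipeline.py
```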

[testenv:checks]

envdir: This sets the directory where the test environment will be created. In this case, it's set to {toxworkdir}/checks, which means that Tox will create the environment in a subdirectory called checks within the directory specified by toxworkdir. toxworkdir is a built-in variable that points to the directory where Tox will create its temporary files.

deps: This specifies the dependencies for the 'checks' environment. It's installing the dependencies from the file located at {toxinidir}/requirements/typing_requirements.txt using the -r option. toxinidir is a built-in variable that points to the root directory of the project. The typing_requirements.txt file presumably contains the requirements needed for type checking the code, such as mypy.

commands: This specifies the commands that Tox should run in the 'checks' environment. In this case, it's running a series of tools to check the code quality. Here's what each of the commands does:

flake8: This is a code linter that checks for syntax errors, code style issues, and other potential problems in the code. It's checking the regression_model and tests directories.

isort: This is a tool that sorts the import statements in Python code according to a defined style. It's checking the regression_model and tests directories.

black: This is a code formatter that reformats the code to adhere to a consistent style. It's checking the regression_model and tests directories.

{posargs:mypy regression_model}: This is running the mypy tool to perform static type checking on the code in the regression_model directory. The {posargs} syntax allows for additional arguments to be passed to the command line when running tox, such as specific files or directories to check. If no arguments are provided, it defaults to checking the regression_model directory.
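Taken together, the 'checks' environment described above probably looks something like the following sketch (tool options such as line length or isort profile are omitted here and may be present in the real file):

```ini
[testenv:checks]
envdir = {toxworkdir}/checks
deps =
    -r{toxinidir}/requirements/typing_requirements.txt
commands =
    flake8 regression_model tests
    isort regression_model tests
    black regression_model tests
    {posargs:mypy regression_model}
```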

Troubleshooting Tox

If you are using Python 3.10/3.11, downgrade to Python 3.9

For Windows users, when using tox, if you encounter an error like Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory \METADATA, this is likely because of an annoying issue in Windows with the default max path lengths. You have two options:

  1. Install the package to a shorter path

  2. Change your system max path length as explained here in the Python docs https://docs.python.org/3.7/using/windows.html#removing-the-max-path-limitation and in this stackoverflow answer https://stackoverflow.com/questions/54778630/could-not-install-packages-due-to-an-environmenterror-errno-2-no-such-file-or


Run pip3 install wheel


If you are using the Windows Python installed from the Microsoft Store (which we recommend against), you need to add the tox scripts directory to your system PATH. You'll know this is the case because when you run pip install tox you'll see a warning similar to this:

The Scripts directory will be of the format:

C:\Users{username}\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts

If you're using the Python install from the Microsoft Store you will also need to make sure you're using tox >4.0.0a8 to avoid this error: tox-dev/tox#2202

A catch here is that tox can get installed as tox4 (you can rename here: C:\Users{YOUR_USERNAME}\AppData\Local\Packages\PythonSoftwareFoundation.{YOUR PYTHON VERSION}\LocalCache\local-packages\Python39\Scripts)

If you are unable to run things with Tox (which hopefully shouldn't be the case), then you can run the Python commands you will find in the tox.ini file. However, in order to do this you must add the relevant directory (e.g. section-05, not just the root directory) to your system PYTHONPATH

Here's how you do that on windows https://stackoverflow.com/questions/3701646/how-to-add-to-the-pythonpath-in-windows-so-it-finds-my-modules-packages

Here's how you do that on MacOS and Linux https://stackoverflow.com/questions/3402168/permanently-add-a-directory-to-pythonpath

Why shouldn't I just use Python code for configuration?

Reference Link

Pydantic reference links

Reference Link 1 Reference Link 2

strictYAML reference link

Reference Link

MANIFEST.in

Specifies which files to include in the package.

Link for building the package

Reference Link: Generating distribution archives

FastAPI

FastAPI reference link
async and await link
APIRouter link
FastAPI features
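As a rough illustration of how APIRouter and an async path operation fit together for serving predictions, here is a minimal sketch; the endpoint path, the input schema, and the placeholder prediction value are made up for the example and are not the API defined in this repository.

```python
from fastapi import APIRouter, FastAPI
from pydantic import BaseModel


class HouseFeatures(BaseModel):
    # Hypothetical input schema for a regression model.
    area: float
    rooms: int


api_router = APIRouter()


@api_router.post("/predict")
async def predict(features: HouseFeatures) -> dict:
    # In a real service this would call the deployed regression pipeline;
    # a placeholder value keeps the sketch self-contained.
    prediction = 0.0
    return {"prediction": prediction}


app = FastAPI()
app.include_router(api_router)
```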

GemFury

Create account on GemFury
Private Index URL Format
