Keeping local (python) libraries in synch with those in runtime images #1647

simon-streamotion · 2021-05-06T02:38:50Z

simon-streamotion
May 6, 2021

If elyra becomes heavily adopted within our org I can see a potential issue when it comes to trying to keep python libraries in sync between a developers local notebook environment, and those provided within elyra runtime images (for running on kubeflow).

In our org, runtime images will be provided by the data engineering team and will be relatively static as compared to a data scientists notebook environment that will evolve and change rapidly. So an issue could arise were a data scientists installs a particular version of a ML lib into their notebook kernel and the stages/pipelines run successfully in their local environment, but when submitted to kubeflow, the pipeline fails due to some differences between the libraries.

I thought at first that maybe just a simple pip install within the notebook would install the same libraries into the kubeflow pod. The command is indeed executed, but for the changes to take effect the kernel needs to be restarted! Is there a way we can use elyra to pip install libraries into a kubeflow pod, and for those changes to take effect immediately?

Elyra installs its own dependencies (contained within https://raw.githubusercontent.com/elyra-ai/kfp-notebook/v0.23.0/etc/requirements-elyra.txt) into a kubeflow pod before the kernel starts using a bootstrap process (https://raw.githubusercontent.com/elyra-ai/kfp-notebook/v0.23.0/etc/docker-scripts/bootstrapper.py). Perhaps a user-supplied bootstrap process in which we could install user dependencies could be provided as a new feature ?

Many thanks -- Simon

ptitzler · 2021-05-06T14:56:46Z

ptitzler
May 6, 2021
Maintainer

For the sake of re-creatability dynamically extended runtime images as you are proposing is probably something that should be avoided because it can lead different package versions across multiple runs.

Static container images (are built once, registered as a runtime image and include all prerequisite libraries) are guaranteed to have the same (relevant) packages installed.

Here's the potential problem with "dynamically extended" images. Let's say a "base" image (registered as a runtime image and only includes a few of the prerequisite libraries) is "extended" prior to execution using a list of requirements, say package A and package B. Each one of the packages has dependencies (Ax or Bx) of its own, which might not be pinned to a specific version. (Wether or not those packages are pinned to specific versions might be out of your control.) Running the image today, pip install might pull A==V_JAN1 and it's unpinned dependency A1/V_JAN1. Running the "same" image again months later, pip install might pick up A==V_JAN1 and A1/V_MAR1. As a result the notebook's runtime environment has changed even though nothing has changed as far as the user-supplied process is concerned.

The only way I can think of to avoid this is to freeze/capture all package versions in the user-supplied process - which in essence yields the same results as a static image that has everything pre-installed but incurs the installation overhead every time a pipeline node is executed.

3 replies

ptitzler May 6, 2021
Maintainer

I am also not sure if/how KFP's step caching might have unintended consequences.

The cache key calculation is based on the component (base image, command-line, code), arguments passed to the component (values or artifacts) and any additional customizations.

If the only difference between two runs is the content of a requirements file that is fetched on demand prior to execution, then the second run might appear to be an exact duplicate and cached results used. The reason is that on the surface nothing has changed (base image, command-line, and code are the same), whereas in reality something did.

simon-streamotion May 7, 2021
Author

Thanks for the response @ptitzler .....

I can see your point re. different package version being imported between different runs - and the obvious complications that might cause. However, I still see a use case for this particular feature using pip freeze to capture all the package dependencies to ensure a consistent environment, but its definitely a 2-edged sword and may ultimately be more hassle than its worth.

Is there a recommendation on how this should be handled - how can developers be free to do pip installs to use their favourite libraries etc... without the need for the reproducing that environment in a 'hardened' runtime image. I'm concerned that if we need an image every time a developer installs a different version of a package, we'll finish up with 100s of runtime images and it will eventually become unmanageable.

I'm not too concerned about creating images for the standard ML frameworks - that will probably lead to 10's of images, which is manageable, and we can recommend developers pip install the same libraries when running locally. But what happens when one developer needs to use the boto package, and another wants to install pandas etc.... ontop of one of our predefined images. I'm just concerned it will be unmanageable.

Maybe the solution is a combination of 10's of runtime images covering the standard ML frameworks and packages, with the ability to bootstrap the odd pip install using specific versions to ensure repeatability. Thoughts ???

Cheers -- Simon

kevin-bates May 7, 2021
Collaborator

I can see both sides as well, but agree that pipeline development phases should have the ability to specify custom package requirements in some manner, despite the risk that non-reproducibility be incurred. That would be a warning we'd certainly want to emphasize but that should also be a decision left to the user.

Assuming the organizational structure @simon-streamotion references is in place, it seems like data scientists could then request the appropriate image be configured (based on results of their experiments) prior to the pipeline's production, while, at the same time, retracting the "custom requirements" in order to gain full reproducibility.

simon-streamotion · 2021-09-20T06:08:51Z

simon-streamotion
Sep 20, 2021
Author

Hi folks....
I'd like to resurrect this discussion please as we are now close to opening up our ML environment (which will be built on top of Elyra) to our data scientists. When we do, I'm sure we'll need this feature. The question is, whats the safest way to implement it!

One possibility could be that Elyra does a pip freeze in the current kernel (notebook/python) to get a list of python modules and then passes this to the boot-strapper loading that particular pipeline component/image. The boot-strapper can then pip install those modules before it starts the kernel to run that stage.

One issue wth this approach is that there could be lots of unused modules in the working environment which are not required to actually run the stage - but these would get picked up by pip freeze and installed each time by the boot strapper.

Another alternative might be to have users do all their pip installs via a requirements.txt file (one for each stage). When the pipeline/stage is run, if a requirements.txt file exists, elyra picks it up and the boot strapper installs the modules from the file. This is little cleaner, more controlled IMHO.

Either way, I don't think installing additional modules by the boot strapper should be the default setting - so as to minimise the number of cases which introduce unpredictability. It would be nice if it could be a checkbox that has to be checked for each stage which is going to do these "extended pip installs". That way, it's a a conscious decision on the part of the user.

Any thoughts ?

Cheers -- Simon

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keeping local (python) libraries in synch with those in runtime images #1647

{{title}}

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Keeping local (python) libraries in synch with those in runtime images #1647

simon-streamotion May 6, 2021

Replies: 2 comments · 3 replies

ptitzler May 6, 2021 Maintainer

ptitzler May 6, 2021 Maintainer

simon-streamotion May 7, 2021 Author

kevin-bates May 7, 2021 Collaborator

simon-streamotion Sep 20, 2021 Author

simon-streamotion
May 6, 2021

Replies: 2 comments 3 replies

ptitzler
May 6, 2021
Maintainer

ptitzler May 6, 2021
Maintainer

simon-streamotion May 7, 2021
Author

kevin-bates May 7, 2021
Collaborator

simon-streamotion
Sep 20, 2021
Author