-
Notifications
You must be signed in to change notification settings - Fork 187
Home
Deploy and use a multi framework Deep Learning platform on Kubernetes
Deploy and consume a Deep Learning Platform on Kubernetes, offering TensorFlow, Caffe, PyTorch etc. as a Service.
Cognitive
This Code provides a fabric for a scalable Deep Learning on Kubernetes, by giving users a platform to leverage deep learning libraries such as Caffe, Torch and TensorFlow, in the cloud in a scalable and resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
By Scott Boag, Waldemar Hummer, Tommy Li, Falk Pollok, Animesh Singh
Training deep neural networks, known as deep learning, is currently highly complex and computationally intensive. A typical user of deep learning is unnecessarily exposed to the details of the underlying hardware and software infrastructure, including configuring expensive GPU machines, installing deep learning libraries, and managing the jobs during execution to handle failures and recovery. Despite the ease of obtaining hardware from infrastructure as a service (IaaS) clouds and paying by the hour, the user still needs to manage those machines, install required libraries, and ensure resiliency of the deep learning training jobs.
This is where the opportunity of deep learning as a service lies. In this code pattern we are going to show how to deploy Deep Learning fabric on Kubernetes. By using Cloud native architectural artifacts like Kubernetes, Microservices, Helm charts, Object storage etc. we show to deploy and use a deep learning fabric which spans across multiple deep learnign engines like TensorFlow, Caffe, PyTorch etc. like It combines the flexibility, ease-of-use, and economics of a cloud service with the power of deep learning: It is easy to use using the REST APIs, one can train with different amounts of resources per user requirements or budget, it is resilient (handles failures), and it frees users so that they can spend time on deep learning and its applications.
- FfDL deployer deploys the FfDL code base it to a Kubernetes cluster. The Kubernetes cluster is configured to used GPUs or CPUs or both, and has access to a S3 compatible Object Storage. If not specified, a locally simulated S3 pod is created.
- Once deployed, the data scientist uploads the model training data to the S3 compatible Object storage. FfDL assumes the data is already in the required format as prescribed by different deep learning frameworks.
- User then creates a FfDL Model manifest file. The manifest file contains different fields describing the model in FfDL , its object store information, its resource requirements, and several arguments (including hyperparameters) required for model execution during training and testing. User then interacts with FfDL using CLI/SDK or UI to deploy the FfDL model manifest file with Model definition file. Finally the user launches the training job and monitors its progress
- Finally the user downloads the trained model and associated logs once the training job is complete.
- TensorFlow: An open-source library for implementing Deep Learning models
- Caffe: Caffe is a deep learning framework
- PyTorch: PyTorch is a deep learning framework that puts Python first.
- Kubernetes cluster: An open-source system for orchestrating containers on a cluster of servers
- IBM Cloud Container Service: A public service from IBM that hosts users applications on Docker and Kubernetes
- Machine Learning
- Deep Learning
- Container Orchestration
- https://www.slideshare.net/AnimeshSingh/fabric-for-deep-learning-94941117
- https://medium.com/ibm-watson/introducing-fabric-for-deep-learning-ffdl-542522774775
Related links: (pls list 3-4 related links)
-
Deep Learning as a Service (https://arxiv.org/abs/1709.05871)
-
Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs (http://learningsys.org/nips17/assets/papers/paper_29.pdf)
-
https://www.ibm.com/developerworks/library/cc-get-started-tensorflow/