Feature/gpu support extended #87
base: develop
Conversation
…nding values.yml capability definition example
…GPU related tests
Sandbox/Dockerfile_gpu (Outdated)
# Install dlib for CUDA
RUN git clone https://github.com/davisking/dlib.git
RUN mkdir -p /dlib/build

RUN cmake -H/dlib -B/dlib/build -DDLIB_USE_CUDA=1 -DUSE_AVX_INSTRUCTIONS=1
RUN cmake --build /dlib/build

RUN cd /dlib; python3 /dlib/setup.py install

# Install the face recognition package and tensorflow
RUN pip3 install face_recognition
RUN pip3 install tensorflow==2.1.0
I am not sure why we need to install all these custom libraries for GPU usage.
If the workflows need these libraries, then the workflows should specify them in the function requirements.
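For illustration only, a per-function pip requirements list along these lines could carry the same dependencies instead of baking them into the sandbox image (a sketch, assuming KNIX installs per-function pip requirements; whether dlib builds with CUDA this way depends on the base image providing the CUDA toolkit and cmake):

# Hypothetical per-function requirements list (same packages as the Dockerfile above)
dlib
face_recognition
tensorflow==2.1.0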
mfn_sdk/mfn_sdk/mfnclient.py (Outdated)
@@ -449,7 +449,7 @@ def _get_state_names_and_resource(self, desired_state_type, wf_dict):
         return state_list

-    def add_workflow(self,name,filename=None):
+    def add_workflow(self,name,filename=None, gpu_usage="None"):
Should read: gpu_usage=None
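To make the concern concrete, here is a small hypothetical demo (the function names are illustrative, not mfn_sdk code): the string "None" is truthy, so a default of "None" makes a plain truthiness check treat "no GPU requested" as a GPU request, while the None literal behaves as expected.

# Hypothetical demo, not mfn_sdk code: why gpu_usage="None" is a risky default.
def wants_gpu_with_string_default(gpu_usage="None"):
    # "None" is a non-empty string, so this returns True even for the default.
    return bool(gpu_usage)

def wants_gpu_with_none_default(gpu_usage=None):
    # The None literal makes "no GPU requested" explicit and easy to test.
    return gpu_usage is not None

assert wants_gpu_with_string_default() is True   # surprising
assert wants_gpu_with_none_default() is False    # expected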
deploy/ansible/Makefile (Outdated)
@@ -21,7 +21,7 @@ NAMES := $(YAML:%.yaml=%)
 .PHONY: $(NAMES)
 default: prepare_packages install

-install: init_once riak elasticsearch fluentbit datalayer sandbox management nginx
+install: init_once installnvidiadocker riak elasticsearch fluentbit datalayer frontend sandbox management nginx
I think the 'frontend' component does not exist anymore.
What happens if the host does not have any Nvidia GPUs? Will 'installnvidiadocker' still succeed?
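One way to address the second question would be to guard the nvidia-docker installation behind a GPU check. A rough sketch of such a check, written in Python here for consistency with the other examples (in practice it would more likely be a condition in the Ansible role), assuming nvidia-smi is the detection mechanism:

# Hypothetical GPU-detection helper an install step could consult before
# attempting to set up nvidia-docker on a host.
import shutil
import subprocess

def host_has_nvidia_gpu():
    """Return True if nvidia-smi is available and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi", "--list-gpus"],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and bool(result.stdout.strip())

if __name__ == "__main__":
    print("install nvidia-docker" if host_has_nvidia_gpu() else "skip nvidia-docker install")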
@@ -107,6 +118,7 @@ image_java: \

 push: image image_java
 	$(call push_image,microfn/sandbox)
+	$(call push_image,microfn/sandbox_gpu)
Should this be microfn/sandbox_java_gpu?
The dependencies of the push target also need to be updated so the GPU image is built before it is pushed.
            gpu_hosts[hostname] = hostip

    # instruct hosts to start the sandbox and deploy workflow
    if runtime=="Java" or sandbox_image_name == "microfn/sandbox": # can use any host
I thought we had the "microfn/sandbox_java_gpu" image?
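For what it's worth, a small hypothetical sketch of how the host-selection check could cover a Java GPU image as well (the image names beyond microfn/sandbox and the helper are illustrative, not taken from the PR):

# Hypothetical sketch, not PR code: restrict GPU sandbox images to GPU hosts and
# let every other image run anywhere.
GPU_IMAGES = {"microfn/sandbox_gpu", "microfn/sandbox_java_gpu"}

def eligible_hosts(sandbox_image_name, all_hosts, gpu_hosts):
    """Return the hosts that may run the given sandbox image."""
    if sandbox_image_name in GPU_IMAGES:
        return gpu_hosts   # GPU images need hosts with NVIDIA GPUs
    return all_hosts       # e.g. microfn/sandbox can use any host

# Example: a Java GPU workflow would be limited to the hosts collected in gpu_hosts.
hosts = eligible_hosts("microfn/sandbox_java_gpu",
                       all_hosts={"host1": "10.0.0.1", "host2": "10.0.0.2"},
                       gpu_hosts={"host2": "10.0.0.2"})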
…x-microfunctions/knix into feature/GPU_support_extended
This reverts commit 7a1b157.
This PR adds the capability to execute Python KNIX functions in sandboxes that use NVIDIA GPU resources, for both Ansible and Helm deployments of KNIX. GPU nodes are detected and configured automatically. The Kubernetes configuration required for deployments with GPU nodes is described in README_GPU_Installation.md.
Subsumes #11 and fixes #79.
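As a quick orientation for reviewers, a minimal client-side usage sketch of the new parameter; the no-argument MfnClient() setup and the value passed to gpu_usage are assumptions here, only the gpu_usage keyword itself comes from the diff above.

# Minimal sketch, not code from the PR: requesting GPU resources when creating a workflow.
from mfn_sdk import MfnClient

client = MfnClient()                                   # assumed: connection settings come from env/config
wf = client.add_workflow("gpu_workflow", gpu_usage=1)  # gpu_usage parameter added by this PR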