This example shows how to build a ChainerMN docker image extending the kube-openmpi base image, and how to use it.
Please note that ChainerMN base container images are already provided on DockerHub. This is only a sample.
The Dockerfile uses everpeace/kube-openmpi:0.7.0-cuda8.0 as its base image and installs:
- NCCL 2.1.4-1+cuda8.0
- CuPy 4.0.0b3
- Chainer 4.0.0b3
- ChainerMN 1.2.0
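
The Dockerfile ships alongside this README; a minimal sketch of what it does is below. The exact apt/pip package names, version pins, and the examples checkout path are assumptions here, so consult the real Dockerfile for the authoritative steps.

FROM everpeace/kube-openmpi:0.7.0-cuda8.0

# NCCL from NVIDIA's apt repository (assumes the repository is configured in the base image)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      libnccl2=2.1.4-1+cuda8.0 libnccl-dev=2.1.4-1+cuda8.0 && \
    rm -rf /var/lib/apt/lists/*

# CuPy, Chainer and ChainerMN from PyPI
RUN pip3 install cupy==4.0.0b3 chainer==4.0.0b3 chainermn==1.2.0

# ChainerMN examples, referenced by the mpiexec command below (checkout path is an assumption)
RUN git clone --depth 1 -b v1.2.0 https://github.com/chainer/chainermn.git /tmp/chainermn && \
    mv /tmp/chainermn/examples /chainermn-examples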
$ REPO=(your_org)/(your_registry_name):(your_tag)
$ docker build . -t $REPO
$ docker push $REPO
# generate a temporary ssh key pair (this writes ssh-key.yaml, used below)
$ ../gen-ssh-key.sh
# edit your values.yaml
# please replace 'image.repository' and 'image.tag' with your pushed image.
$ $EDITOR values.yaml
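
After editing, the relevant part of values.yaml should look roughly like this (repository and tag are placeholders for the image you pushed above):

image:
  repository: your_org/your_registry_name
  tag: your_tag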
$ MPI_CLUSTER_NAME=__CHANGE_ME__
$ KUBE_NAMESPACE=__CHANGE_ME__
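
For example (hypothetical values; $MPI_CLUSTER_NAME is used in pod names, so it must be a valid DNS label):

$ MPI_CLUSTER_NAME=chainermn-example
$ KUBE_NAMESPACE=default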
$ helm template ../chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE create -f -
You can ignore messages like 'Unexpected end of /proc/mounts line `overlay...`'. Please refer to here for details.
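
If you want to inspect the rendered manifests before creating anything, run the same helm template command without piping it into kubectl:

$ helm template ../chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | less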
# wait until $MPI_CLUSTER_NAME-master is ready
$ kubectl get -n $KUBE_NAMESPACE po $MPI_CLUSTER_NAME-master
NAME READY STATUS RESTARTS AGE
MPI_CLUSTER_NAME-master 1/1 Running 0 17s
# You can run mpiexec now via 'kubectl exec'!
# hostfile is automatically generated and located at '/kube-openmpi/generated/hostfile'
$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
--hostfile /kube-openmpi/generated/hostfile \
--display-map -n 2 -npernode 1 \
python3 /chainermn-examples/mnist/train_mnist.py -g
Warning: Permanently added 'MPI_CLUSTER_NAME-worker-0,172.23.36.171' (ECDSA) to the list of known hosts.
Warning: Permanently added 'MPI_CLUSTER_NAME-worker-1,172.23.36.38' (ECDSA) to the list of known hosts.
Data for JOB [28697,1] offset 0
======================== JOB MAP ========================
Data for node: MPI_CLUSTER_NAME-worker-0 Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [28697,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
Data for node: MPI_CLUSTER_NAME-worker-1 Num slots: 8 Max slots: 0 Num procs: 1
Process OMPI jobid: [28697,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../..][../../../..]
=============================================================
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/UNJN7WML5VB67GZI2C4GR77VJU:/var/lib/docker/overlay2/l/X3SWZTQ7PBMI6SFQQOYVWIW76Z:/var/lib/docker/overlay2/l/DHOJ5NNKETXKCXTPC643JBWPKF:/var/lib/docker/overlay2/l/U6GTZTDX5NPZZS6T3VI2CUH5OY:/var/lib/docker/overlay2/l/K3WGBUGMBI7QFN25T3W7E3DZXG:/var/lib/docker/overlay2/l/TEJH26S5SNGTRKBYMB5XSVKUNM:/var/lib/docker/overlay2/l/TATGQREE7UIRJ2XVPJFCFT6LVZ:/var/lib/docker/overlay2/l/53LU4L3PHZGHYZVAKUL2B6SVN4:/var/lib/docker/overlay2/l/4N2FFIIRA7HPB'
...
1 0.224002 0.102322 0.9335 0.9695 17.1341
2 0.0733692 0.0672879 0.977967 0.9765 24.7188
3 0.0462234 0.0652903 0.985134 0.9795 32.04
4 0.0303526 0.0639791 0.990867 0.9805 39.42
5 0.0252077 0.079451 0.991667 0.9775 46.8614
...
20 0.00531046 0.105093 0.998267 0.9799 160.794
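
The run above starts one process on each of the two workers (-n 2 -npernode 1). As a hypothetical variation, if each worker pod were allocated two or more GPUs, you could double the process count:

$ kubectl -n $KUBE_NAMESPACE exec -it $MPI_CLUSTER_NAME-master -- mpiexec --allow-run-as-root \
    --hostfile /kube-openmpi/generated/hostfile \
    --display-map -n 4 -npernode 2 \
    python3 /chainermn-examples/mnist/train_mnist.py -g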
# tear down the MPI cluster
$ helm template ../chart --namespace $KUBE_NAMESPACE --name $MPI_CLUSTER_NAME -f values.yaml -f ssh-key.yaml | kubectl -n $KUBE_NAMESPACE delete -f -
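
To confirm that everything was removed, check that no pods remain in the namespace:

$ kubectl -n $KUBE_NAMESPACE get pods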