
Distributed Deep Learning on IBM Cloud Private

IBM PowerAI Distributed Deep Learning (DDL) can be deployed directly into your enterprise private cloud with IBM Cloud Private (ICP). This blog post explains how to do that using TCP or InfiniBand communication between the worker nodes. We will use the command-line interface, although the web interface could also be used for most of the steps.

Minimum requirements

Before you begin

You need the Kubernetes (kubectl) and Helm CLIs to deploy your application from the command line. After installing the CLIs, add the IBM Helm Chart repository:

helm repo add ibm-charts https://raw.githubusercontent.com/IBM/charts/master/repo/stable/

Deploying IBM PowerAI DDL with TCP cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy the PowerAI Helm Chart with DDL enabled.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
  3. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.

  4. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.
    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  5. Train the model with DDL.
    ddlrun --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0' --tcp --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
    • --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0': Specify the network interface to use for MPI and NCCL; eth0 is the interface that connects the nodes in this example
    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2284.62
    ----------------------------------------------------------------
    
  6. Delete your deployment.
    helm delete ddl-instance --purge --tls
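The hostfile passed to ddlrun in step 5 is autogenerated inside the pod. As a sketch of the mechanics, the snippet below builds a hypothetical hostfile in MPI "slots" form and checks that the slot total matches the requested resources.gpu count; the pod hostnames and the exact file layout are assumptions, so inspect the real file with cat /powerai/config/hostfile.

```shell
# Hypothetical hostfile in MPI "slots" form; the real autogenerated file at
# /powerai/config/hostfile inside the pod may use a different layout.
cat > hostfile.sample <<'EOF'
ddl-instance-ibm-powerai-0 slots=4
ddl-instance-ibm-powerai-1 slots=4
EOF

# Sum the slots and check the total matches the requested resources.gpu count (8 here).
awk -F'slots=' '{sum += $2} END {printf "total GPU slots: %d\n", sum}' hostfile.sample
```

With two worker pods of 4 GPUs each, this prints "total GPU slots: 8", matching the --set resources.gpu=8 used in the deployment.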

Using host network on container

The Helm Chart provides the option to use the host network for communication, which can improve performance. The potential disadvantage of this option is that all the host network interfaces will be visible inside the container. An SSH port other than 22 must be chosen so the containers do not interfere with the host's SSH daemon. Here is an example of deploying with the host network:

helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useHostNetwork=true --set ddl.sshPort=2200
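The repeated --set flags above can also be collected into a values file, which is easier to keep under version control. This is a sketch: the key paths mirror the --set flags used in this post, and the install command is shown commented out because it needs a live ICP cluster.

```shell
# Values-file equivalent of the --set flags above (a sketch; key paths
# mirror the --set flags used in this post).
cat > ddl-values.yaml <<'EOF'
license: accept
resources:
  gpu: 8
ddl:
  enabled: true
  sshKeySecret: sshkeys-secret
  useHostNetwork: true
  sshPort: 2200
EOF

# Requires a live ICP cluster, so shown commented out:
# helm install --name ddl-instance ibm-charts/ibm-powerai --tls -f ddl-values.yaml
```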

Deploying IBM PowerAI DDL with InfiniBand cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy InfiniBand device plugin.
    kubectl -n kube-system apply -f https://raw.githubusercontent.com/nimbix/k8s-rdma-device-plugin/deploy-bionic/rdma-device-plugin.yml
  3. Install the latest Mellanox OFED (MOFED) user-space drivers in a PowerAI Docker container.
    • Download the latest MOFED archive into the container.
    • Install needed packages, decompress archive, and run the installer.
      sudo apt-get update; sudo apt-get install -y lsb-release
      tar -xzvf MLNX_OFED_LINUX-*
      MLNX_OFED_LINUX-*-ppc64le/mlnxofedinstall --user-space-only --without-fw-update --all -q
  4. Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
  5. Deploy the PowerAI Helm Chart with InfiniBand communication.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useInfiniBand=true --set image.repository=my_docker_repo --set image.tag=powerai-mofed
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
    • --set ddl.useInfiniBand=true: Use InfiniBand for communication
    • --set image.repository=repo: Repository containing the PowerAI image with MOFED installed
    • --set image.tag=tag: Tag of the PowerAI image with MOFED installed
  6. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.

  7. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.
    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  8. Restart a login session to get the correct ulimit settings.
    sudo su - $USER

    Note: Alternatively, you can modify the default ulimit by adding --default-ulimit memlock=-1 to the Docker daemon on all the worker nodes.

  9. Train the model with DDL using InfiniBand.
    ddlrun --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl

    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2855.78
    ----------------------------------------------------------------
    
  10. Delete your deployment.
    helm delete ddl-instance --purge --tls
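Comparing the sample outputs of the TCP run (2284.62 images/sec) and the InfiniBand run (2855.78 images/sec), both with 8 GPUs in total, gives a quick sense of the interconnect's effect. A small sketch of the arithmetic:

```shell
# Throughput figures taken from the two sample runs in this post
# (ResNet-50, batch size 64, 8 GPUs total).
tcp=2284.62
ib=2855.78
awk -v t="$tcp" -v i="$ib" 'BEGIN {
  printf "per-GPU (TCP):        %.1f images/sec\n", t/8
  printf "per-GPU (InfiniBand): %.1f images/sec\n", i/8
  printf "InfiniBand speedup:   %.2fx\n", i/t
}'
```

For these sample runs, InfiniBand delivers about 1.25x the throughput of TCP; your numbers will vary with model, batch size, and cluster hardware.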
