
Distributed Deep Learning on IBM Cloud Private

IBM PowerAI Distributed Deep Learning (DDL) can be deployed directly into your enterprise private cloud with IBM Cloud Private (ICP). This blog post explains how to do that using TCP or InfiniBand communication between the worker nodes. We will use the command-line interface, although the web interface could also be used for most of the steps.

Minimum requirements

Before you begin

You need the Kubernetes (kubectl) and Helm CLIs to deploy your application from the command line. After installing the CLIs, add the IBM Helm Chart repository:

helm repo add ibm-charts https://raw.githubusercontent.com/IBM/charts/master/repo/stable/

Deploying IBM PowerAI DDL with TCP cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy the PowerAI Helm Chart with DDL enabled.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
  3. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.

  4. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.
    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  5. Train the model with DDL.
    ddlrun --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0' --tcp --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl
    • --mpiarg '-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0': Specify the network interface to use for MPI and NCCL; eth0 is the interface that connects the nodes in this example
    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2284.62
    ----------------------------------------------------------------
    
  6. Delete your deployment.
    helm delete ddl-instance --purge --tls
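The hostfile passed to ddlrun in step 5 is autogenerated inside the pod. As a sketch of the mechanics, the snippet below builds a hypothetical hostfile in MPI "slots" form and checks that the slot total matches the requested resources.gpu count; the pod hostnames and the exact file layout are assumptions, so inspect the real file with cat /powerai/config/hostfile.

```shell
# Hypothetical hostfile in MPI "slots" form; the real autogenerated file at
# /powerai/config/hostfile inside the pod may use a different layout.
cat > hostfile.sample <<'EOF'
ddl-instance-ibm-powerai-0 slots=4
ddl-instance-ibm-powerai-1 slots=4
EOF

# Sum the slots and check the total matches the requested resources.gpu count (8 here).
awk -F'slots=' '{sum += $2} END {printf "total GPU slots: %d\n", sum}' hostfile.sample
```

With two worker pods of 4 GPUs each, this prints "total GPU slots: 8", matching the --set resources.gpu=8 used in the deployment.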

Using host network on container

The Helm Chart provides the option to use the host network for communication, which can improve performance. The potential disadvantage of this option is that all the host network interfaces will be visible inside the container. An SSH port other than 22 must be chosen so the containers do not interfere with the host's SSH daemon. Here is an example of deploying with the host network:

helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useHostNetwork=true --set ddl.sshPort=2200
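The repeated --set flags above can also be collected into a values file, which is easier to keep under version control. This is a sketch: the key paths mirror the --set flags used in this post, and the install command is shown commented out because it needs a live ICP cluster.

```shell
# Values-file equivalent of the --set flags above (a sketch; key paths
# mirror the --set flags used in this post).
cat > ddl-values.yaml <<'EOF'
license: accept
resources:
  gpu: 8
ddl:
  enabled: true
  sshKeySecret: sshkeys-secret
  useHostNetwork: true
  sshPort: 2200
EOF

# Requires a live ICP cluster, so shown commented out:
# helm install --name ddl-instance ibm-charts/ibm-powerai --tls -f ddl-values.yaml
```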

Deploying IBM PowerAI DDL with InfiniBand cross node communication

  1. Create container SSH keys as a Kubernetes secret.
    mkdir -p .tmp
    yes | ssh-keygen -N "" -f .tmp/id_rsa
    kubectl create secret generic sshkeys-secret --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub
  2. Deploy InfiniBand device plugin.
    kubectl -n kube-system apply -f https://raw.githubusercontent.com/nimbix/k8s-rdma-device-plugin/deploy-bionic/rdma-device-plugin.yml
  3. Install the latest Mellanox OFED (MOFED) user-space drivers in a PowerAI Docker container.
    • Download the latest MOFED archive into the container.
    • Install needed packages, decompress archive, and run the installer.
      sudo apt-get update; sudo apt-get install -y lsb-release
      tar -xzvf MLNX_OFED_LINUX-*
      MLNX_OFED_LINUX-*-ppc64le/mlnxofedinstall --user-space-only --without-fw-update --all -q
  4. Create a Docker image from this container and store it in a registry accessible by all the worker nodes.
  5. Deploy the PowerAI Helm Chart with InfiniBand communication.
    helm install --name ddl-instance --set license=accept ibm-charts/ibm-powerai --tls --set resources.gpu=8 --set ddl.enabled=true --set ddl.sshKeySecret=sshkeys-secret --set ddl.useInfiniBand=true --set image.repository=my_docker_repo --set image.tag=powerai-mofed
    • --name release_name: Name for the deployment
    • --set resources.gpu=gpu_count: Total number of requested GPUs
    • --set ddl.enabled=true: Enable Distributed Deep Learning
    • --set ddl.sshKeySecret=sshkeys-secret: Name of the Kubernetes secret containing the SSH keys
    • --set ddl.useInfiniBand=true: Use InfiniBand for communication
    • --set image.repository=repo: Repository containing the PowerAI image with MOFED installed
    • --set image.tag=tag: Tag of the PowerAI image with MOFED installed
  6. Check that the pods were created and wait until they are in a running and ready state.
    kubectl get pod -l app=ddl-instance-ibm-powerai
    NAME                         READY     STATUS    RESTARTS   AGE
    ddl-instance-ibm-powerai-0   1/1       Running   0          30s
    ddl-instance-ibm-powerai-1   1/1       Running   0          30s
    

    NOTE: One pod per worker node is created. DDL deployments currently always take all the GPUs of a node. Run kubectl describe pod pod_name to get more info about a pod.

  7. Get a shell to the first pod, create a local copy of the model, and run the activation script.
    We will use the TensorFlow framework with the High-Performance Models as an example.
    kubectl exec -it ddl-instance-ibm-powerai-0 bash
    cd; /opt/DL/tensorflow-performance-models/bin/tensorflow-install-models hpms
    source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
    
  8. Restart a login session to get the correct ulimit settings.
    sudo su - $USER

    Note: Alternatively, you can modify the default ulimit by adding --default-ulimit memlock=-1 to the Docker daemon on all the worker nodes.

  9. Train the model with DDL using InfiniBand.
    ddlrun --hostfile /powerai/config/hostfile python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=ddl

    • --hostfile /powerai/config/hostfile: Use the autogenerated hostfile available inside the pod

    The run output should display the IBM Corp. DDL banner and, for this model, the total images/sec.

    I 20:42:52.209 12173 12173 DDL:29  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
    ...
    ----------------------------------------------------------------
    total images/sec: 2855.78
    ----------------------------------------------------------------
    
  10. Delete your deployment.
    helm delete ddl-instance --purge --tls
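Comparing the sample outputs of the TCP run (2284.62 images/sec) and the InfiniBand run (2855.78 images/sec), both with 8 GPUs in total, gives a quick sense of the interconnect's effect. A small sketch of the arithmetic:

```shell
# Throughput figures taken from the two sample runs in this post
# (ResNet-50, batch size 64, 8 GPUs total).
tcp=2284.62
ib=2855.78
awk -v t="$tcp" -v i="$ib" 'BEGIN {
  printf "per-GPU (TCP):        %.1f images/sec\n", t/8
  printf "per-GPU (InfiniBand): %.1f images/sec\n", i/8
  printf "InfiniBand speedup:   %.2fx\n", i/t
}'
```

For these sample runs, InfiniBand delivers about 1.25x the throughput of TCP; your numbers will vary with model, batch size, and cluster hardware.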
