
Table of Contents

Lab Overview - - Machine Learning Workloads in vSphere Using GPUs - Getting Started
   Lab Guidance
Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes)
   Introduction
   Conclusion
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: NVIDIA GRID vGPUs in vSphere
   Conclusion
Module 3 - Using GPUs in Pass-through Mode (15 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: Configuring Passthrough for NVIDIA P100 on vSphere
   Conclusion
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: Using Bitfusion GPU virtualization in vSphere
   Conclusion
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes)
   Introduction
   Conclusion
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: Running Machine Learning Workloads using TensorFlow in vSphere
   Conclusion
Module 7 - vGPU Scheduling Options (15 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: vGPU Scheduling Options
   Conclusion
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes)
   Introduction
   Hands-on Labs Interactive Simulation: Maximizing the Power of vSphere for Diverse Workloads using GPUs
   Conclusion

Lab Overview - - Machine Learning Workloads in vSphere Using GPUs - Getting Started

Lab Guidance

Note: It may take more than 90 minutes to complete this lab. You should expect to finish only 2-3 of the modules during your time. The modules are independent of each other, so you can start at the beginning of any module and proceed from there. You can use the Table of Contents to access any module of your choosing. The Table of Contents can be accessed in the upper right-hand corner of the Lab Manual.

This lab explores how to accelerate machine learning workloads by using GPUs and vGPUs in vSphere. For this lab we will be using NVIDIA GRID GPUs installed in the ESXi hosts. Throughout all 8 modules, we will show you mechanisms for VMs to access GPUs, how to run machine learning workloads using TensorFlow, and how to maximize your datacenter resources, including GPUs, by running diverse workloads.

Lab Module List:

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic. In this module, you will get a basic overview of what machine learning is and how to run ML workloads with TensorFlow in vSphere VMs.
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic. In this module, you will enable NVIDIA GRID vGPU in vSphere.
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic. In this module, you will access GPUs in pass-through mode.
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic. In this module, you will enable Bitfusion Elastic GPUs on vSphere.
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic. In this module, you will perform vMotion on VMs running applications that are using GPUs for compute acceleration in order to remediate an ESXi host.
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Basic. In this module, you will learn how to run machine learning workloads on NVIDIA GPUs using TensorFlow in vSphere.
Module 7 - vGPU Scheduling Options (15 minutes) - Basic. In this module, you will learn about the different vGPU schedulers and how to select between them.
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Basic. In this module, you will maximize your datacenter resources, including GPUs.

Lab Captains:

Module 1 - Uday Kurkure, Staff Engineer 1, USA

Module 2 - Uday Kurkure, Staff Engineer 1, USA
Module 3 - Uday Kurkure, Staff Engineer 1, USA
Module 4 - Uday Kurkure, Staff Engineer 1, USA
Module 5 - Uday Kurkure, Staff Engineer 1, USA
Module 6 - Uday Kurkure, Staff Engineer 1, USA
Module 7 - Uday Kurkure, Staff Engineer 1, USA
Module 8 - Uday Kurkure, Staff Engineer 1, USA

This lab manual can be downloaded from the Hands-on Labs Document site found here: http://docs.hol.vmware.com

Location of the Main Console

1. The area in the RED box contains the Main Console. The Lab Manual is on the tab to the right of the Main Console.
2. A particular lab may have additional consoles found on separate tabs in the upper left. You will be directed to open another specific console if needed.
3. Your lab starts with 90 minutes on the timer. The lab cannot be saved. All your work must be done during the lab session, but you can click EXTEND to increase your time. If you are at a VMware event, you can extend your lab time twice, for up to 30 minutes; each click gives you an additional 15 minutes. Outside of VMware events, you can extend your lab time up to 9 hours and 30 minutes; each click gives you an additional hour.

Alternate Methods of Keyboard Data Entry

During this module, you will input text into the Main Console. Besides directly typing it in, there are two very helpful methods that make it easier to enter complex data.

Click and Drag Lab Manual Content Into Console Active Window

You can click and drag text and Command Line Interface (CLI) commands directly from the Lab Manual into the active window in the Main Console.

Accessing the Online International Keyboard

You can also use the Online International Keyboard found in the Main Console.

1. Click on the Keyboard icon found on the Windows Quick Launch taskbar.

Click once in active console window

In this example, you will use the Online Keyboard to enter the "@" sign used in email addresses. The "@" sign is Shift-2 on US keyboard layouts.

1. Click once in the active console window.
2. Click on the Shift key.

Click on the @ key

1. Click on the "@" key. Notice the @ sign entered in the active console window.

Activation Prompt or Watermark

When you first start your lab, you may notice a watermark on the desktop indicating that Windows is not activated.

One of the major benefits of virtualization is that virtual machines can be moved and run on any platform. The Hands-on Labs take advantage of this benefit, and we are able to run the labs out of multiple datacenters. However, these datacenters may not have identical processors, which triggers a Microsoft activation check through the Internet.

Rest assured, VMware and the Hands-on Labs are in full compliance with Microsoft licensing requirements. The lab that you are using is a self-contained pod and does not have full access to the Internet, which is required for Windows to verify the activation. Without full access to the Internet, this automated process fails and you see this watermark. This cosmetic issue has no effect on your lab.

Look at the lower right portion of the screen

Please check to see that your lab has finished all of its startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes)

Introduction

In this module, you will learn about machine learning (ML) and how to run ML workloads using TensorFlow in vSphere VMs.

Machine learning is an exciting area of technology that allows computers to learn without being explicitly programmed, much in the way a person might learn. It is increasingly applied in many areas such as health, science, finance, and intelligent systems. In recent years, the emergence of deep learning and the enhancement of accelerators like GPUs have driven tremendous adoption of machine learning applications across broader and deeper aspects of our lives. Some application areas include facial recognition in images, medical diagnosis from MRIs, robotics, automobile safety, and text and speech recognition. GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes.

Machine learning (ML), and especially deep learning (DL), workloads are growing in datacenters and cloud environments. The use of ML in intelligent applications usually includes two main stages: building models using ML methods (neural networks, support vector machines, hidden Markov models, etc.), which is known as the training stage, and then applying the models to intelligent tasks like recognition, prediction, or classification, which is known as the inference stage.

There are several ways you can run ML applications using GPUs, one of which is to use GPU compute applications inside virtual machines on VMware vSphere. In this lab we present three of these options:

Using NVIDIA vGPUs in vSphere
Using GPUs in Passthrough
Using Bitfusion FlexDirect

What to expect from each Module

NVIDIA GRID vGPU is a GPU virtualization solution by NVIDIA. It is a suitable option when you want multiple VMs to share the same physical GPU, and it also enables well-known virtualization benefits, such as cloning a VM or suspending and resuming a VM. The NVIDIA GRID vGPU manager is installed in vSphere to virtualize the underlying physical GPUs. The graphics memory of the physical GPU is divided into equal chunks and those chunks are given to each VM. The type of vGPU profile determines the amount of graphics memory each VM can have. We will show you this in Module 2.

Passthrough on vSphere (also known as VMware DirectPath I/O) allows direct access from the guest operating system in a virtual machine (VM) to the physical PCI or PCIe hardware devices of the server, controlled by the vSphere hypervisor layer. Each VM is assigned one or more GPUs as PCI devices. Pass-through is a suitable option when you want a VM to have one or more physical GPUs for the heavy computation needs of applications running inside the VM. Since the guest OS bypasses the virtualization layer to access the GPUs, the overhead of using pass-through mode is low. There is no GPU sharing among VMs when using this mode. We will show you this in Module 3.

Bitfusion FlexDirect is a GPU virtualization solution provided by a company named Bitfusion. It allows ML workflows running inside a VM to use one or more GPUs on the same vSphere host and/or on remote hosts. It also supports multiple VMs sharing a single physical GPU. We will show this in Module 4.

Machine learning training and high performance computing jobs can take weeks to complete even with GPUs. Previously, if a server needed maintenance, weeks of work were lost when the server was powered down. Now VMware vSphere has added the ability to perform live VM migrations using vMotion for vGPU-enabled VMs. The live VMs are migrated to another server before the maintenance begins, so no work is lost due to maintenance. We will show you this in Module 5.

Most ML methods are very computationally intensive. The training time for building prediction models can take hours, days, or even weeks for large datasets, and fast inference time is a critical requirement in many real-time applications. Hence, accelerators like GPUs, TPUs, and FPGAs are used to accelerate ML tasks. In this lab, we focus on the GPU because of its popular use for ML. We can use CUDA and its cuDNN library for developing ML applications for NVIDIA GPUs, or OpenCL for applications running on AMD's GPUs. Some ML frameworks supporting cuDNN are TensorFlow, Keras, Theano, Caffe, Torch, and MXNet. We will show you this in Module 6.

Our performance studies have shown that adding a vGPU to VMs often leads to underutilization of CPU resources. One can run CPU-only workloads concurrently with GPU workloads without significant performance penalties, for example by running machine learning training batch jobs at night and interactive 3D-CAD jobs during daytime hours, suspending and resuming VMs as needed. We will show you this in Module 7.

Conclusion

In this module, we reviewed the basics of what machine learning (ML) is and what you can expect in each module.

You've finished Module 1

Congratulations on completing Module 1.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes)

Introduction

In this module, we will take a closer look at how NVIDIA GRID vGPU is integrated into a vSphere environment. We will show you how to install the NVIDIA drivers in a VM to take advantage of the vSphere driver, and then run an ML workload to show you how it works.

NVIDIA GRID vGPU is a GPU virtualization solution by NVIDIA. This solution allows multiple VMs to share a physical GPU and is also called a mediated pass-through solution. To enable this solution, you need to install the NVIDIA GRID vGPU manager, also known as the NVIDIA vGPU driver or NVIDIA-ESX-HOST driver.

To run ML workloads using GPUs, you need to install the CUDA and cuDNN libraries from NVIDIA in a VM. cuDNN stands for CUDA Deep Neural Network; it is a GPU-accelerated library for deep neural networks. Many ML frameworks like TensorFlow and Caffe2 use this library to accelerate machine learning performance.

Once the driver is installed in the ESXi host, the graphics memory of the physical GPU is divided into equal chunks and given to each VM. The type of vGPU profile determines the amount of graphics memory each VM can have. The Pascal P40 card has 24 GB of memory that will be distributed across the VMs based on the assigned profile.
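Before moving on to the vGPU profiles, it can be useful to confirm that the framework inside the VM actually sees the GPU once the guest driver, CUDA, and cuDNN are installed. The short Python check below is a minimal sketch (not part of the lab's simulation) and assumes a TensorFlow 1.x installation in the guest.

# check_gpu.py - verify that TensorFlow can see the vGPU or pass-through GPU (sketch)
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow has registered; a working GPU shows up
# with device_type == 'GPU'.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)

# Convenience check (TensorFlow 1.x API).
print("GPU available:", tf.test.is_gpu_available())

If no GPU device is listed, the vGPU profile assignment, the guest NVIDIA driver, or the CUDA/cuDNN installation is usually the first thing to re-check.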

Table 1 lists the available NVIDIA Pascal P40 vGPU profiles. You would use the different vGPU profiles to give the VMs the resources needed to drive their type of ML workload. Currently, only one vGPU can be assigned to a VM.

ML frameworks allow rapid development of machine learning applications. We will use TensorFlow in this lab. TensorFlow is an open source machine learning framework. Once we have TensorFlow installed, we will run a machine learning workload. The workload we will run is a handwriting recognition benchmark known as MNIST. The benchmark employs a convolutional neural network and has a training set of 60,000 examples.
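For readers who want a feel for what this workload looks like in code, here is a minimal sketch of an MNIST convolutional network using TensorFlow's Keras API. It is not the exact benchmark script used in the lab; the layer sizes, epoch count, and batch size are illustrative assumptions.

# mnist_cnn.py - minimal MNIST CNN sketch (not the lab's benchmark script)
import tensorflow as tf

# Load the 60,000-image MNIST training set and the 10,000-image test set.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# A small convolutional network; layer sizes here are illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With a vGPU or pass-through GPU visible to TensorFlow, the convolutions
# run on the GPU automatically; no code change is needed.
model.fit(x_train, y_train, epochs=2, batch_size=128,
          validation_data=(x_test, y_test))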

Hands-on Labs Interactive Simulation: NVIDIA GRID vGPUs in vSphere

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, we reviewed how you can utilize vSphere, GPUs, and vGPUs to run ML workloads. We installed the NVIDIA drivers in both vSphere and a VM, and showed you how you could use the NVIDIA GPU by running a TensorFlow workload.

You've finished Module 2

Congratulations on completing Module 2.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 3 - Using GPUs in Pass-through Mode (15 minutes)

Introduction

In this module, we will walk through the major steps for configuring DirectPath I/O (passthrough) for an NVIDIA P100 GPU on vSphere 6.7.

In vSphere, a GPU can be configured in DirectPath I/O (passthrough) mode, which allows a guest OS to directly access the device, essentially bypassing the hypervisor. Because of the shortened access path, performance of applications accessing GPUs in this way can be very close to that of bare-metal systems. With DirectPath I/O, we can configure one or multiple GPU devices into a single VM. Each GPU device is dedicated to a VM and there is no GPU sharing among the VMs. Please note that some features are unavailable for VMs configured with DirectPath I/O, including hot-adding of virtual devices, taking snapshots, suspending/resuming VMs, and vMotion.

Requirements for configuring large-BAR GPU devices in passthrough mode

Some high-end compute GPUs like the NVIDIA V100, P100, K80, and K40 use large, multi-gigabyte memory-mapped I/O (MMIO) device memory regions to transfer data between the host and the device. For example, the NVIDIA P100's PCI MMIO space is slightly larger than 16 GB. To enable a device that uses large PCI MMIO regions, including the NVIDIA V100, P100, K80, and K40, there are some preliminaries for configuring it in passthrough mode:

1. Server BIOS. In the server BIOS, 4G mapping/decoding should be enabled. The step to enable it depends on the server OEM; you can search for keywords such as "above 4G decoding", "memory mapped I/O above 4GB", or "PCI 64 bit resource handling above 4G".

2. UEFI installation of the VM. Ensure that the virtual machine is UEFI enabled.

3. Advanced VM configuration parameters. Large PCI MMIO regions require 64-bit MMIO support. To enable 64-bit MMIO support, add this line to the VM's vmx file:

pcipassthru.use64bitmmio=true

Also specify a large enough MMIO region, as a power of two in GB, in the VM's vmx file. For example, to pass through 4 NVIDIA P100s into one VM (each with a bit more than 16 GB of MMIO space, so more than 64 GB in total, rounded up to the next power of two), add this line to the VM's vmx file:

pcipassthru.64bitmmiosizegb = 128

Please note that there are different MMIO limitations across vSphere versions, and if your GPU card doesn't use large PCI MMIO regions, you don't need to configure these special BIOS and advanced VM settings. For more details, please refer to "VMware vSphere VMDirectPath I/O: Requirements for Platforms and Devices".
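Taken together, the relevant excerpt of the VM's .vmx file (or the equivalent advanced configuration parameters in the vSphere Client) might look like the following. This is a sketch based on the two parameters named above; the 128 GB value is the four-P100 example from this module, and the exact capitalization and quoting should be checked against VMware's documentation for your vSphere version.

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"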

Hands-on Labs Interactive Simulation: Configuring Passthrough for NVIDIA P100 on vSphere

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, we showed you how to configure DirectPath I/O (passthrough) for using GPUs on vSphere.

You've finished Module 3

Congratulations on completing Module 3.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes)

Introduction

In this module, you will learn about Bitfusion FlexDirect and how a VM without a GPU can use the GPU on another VM.

Bitfusion FlexDirect is a GPU virtualization solution provided by a company named Bitfusion. GPU accelerators can be shared over the network and accessed remotely by VMs. With Bitfusion, GPU accelerators become part of a common infrastructure resource pool and are available for use by VMs in the vSphere-based environment. Bitfusion FlexDirect runs as a userspace application within each VM instance, without the need for changes or special software in the ESXi hypervisor or the AI applications. On the GPU-accelerated server VM, FlexDirect also runs as a transparent software layer and exposes the individual physical GPUs as a pooled resource to be consumed by client VMs (VMs that do not have GPUs). Upon completion of the AI runtime code, the shared GPU resources go back into the resource pool.

Bitfusion use cases on vSphere can be broadly categorized into three types.

Dynamic and Remote Attached GPUs

Bitfusion FlexDirect allows GPUs to be remotely attached to client VMs dynamically, as shown in Fig 4.1. GPUs can also be dynamically detached after use.

Fig 4.1 Dynamic and Remote Attached GPUs

Partial GPUs

Bitfusion FlexDirect can be used to slice GPUs into unequal partial GPUs. This serves as an optimal architecture for machine learning, in which each user/workload type is unpredictable and requires different performance and response times. The GPUs are sliced by GPU memory. For instance, given a GPU with 16 GB of GPU memory, one could use FlexDirect to create multiple partial GPUs, for example two 4 GB partial GPUs and four 2 GB partial GPUs. This allows the same GPU to be shared across multiple users in a multi-tenant environment, as shown in Fig 4.2.

Fig 4.2 Bitfusion FlexDirect Partial GPUs. Here, vGPU means the memory-sliced partial GPU.

Dynamic and Remote Attached Partial GPUs

Bitfusion FlexDirect can also be leveraged to remotely attach partial GPUs dynamically. A group of GPUs can be reconfigured into partial GPUs of different sizes and combinations, and then remotely attached to client VMs, as shown in Fig 4.3.

Fig 4.3 Bitfusion FlexDirect Remote Partial GPUs. Here, Virtual GPU means the memory-sliced partial GPU.

Summary

With VMware vSphere and Bitfusion, GPUs become a shared pool of resources that can be attached to any VM, as shown in Fig 4.4. A full-fledged GPU-as-a-Service offering can be created with VMware vSphere and Bitfusion FlexDirect: FlexDirect GPU resource schedulers are started on all the GPU server VMs in the pool, and each client VM uses FlexDirect to attach full or partial remote GPUs from the GPU pool.

For more information, see the Bitfusion FlexDirect documentation: https://docs.bitfusion.io

Hands-on Labs Interactive Simulation: Using Bitfusion GPU virtualization in vSphere

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, you learned one of the ways to use GPUs on vSphere by leveraging the Bitfusion GPU virtualization solution.

You've finished Module 4

Congratulations on completing Module 4.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes)

Introduction

In this module, we will discuss why live vMotion of a GPU-enabled VM is such a big deal. vMotion's ability to move running VMs between physical machines is well known, so why are we showing this in the ML lab? Because some significant challenges had to be overcome for vMotion to work with a GPU-enabled VM.

The first challenge was to enable a VM to have direct access to physical hardware and still be able to move from physical host to physical host. How many times have you tried to vMotion a VM that has a CD-ROM drive attached, and what happens? It fails, because we don't allow that. Yet here we are giving a VM direct access to the NVIDIA GPU installed in the ESXi host.

The second challenge is to move the GPU workload between physical hosts. This may seem like a simple addition to the capability of vMotion. However, the NVIDIA GRID vGPU allocates anywhere from 1 GB to 24 GB of RAM on the GPU, has thousands of state variables, and carries the state information of a sophisticated graphics pipeline, all of which must be transferred to the destination server and set up correctly so that the application in the VM that uses the GPU can continue without missing a beat. Simply transferring the contents of graphics RAM is an achievement; transferring the state information and loading it correctly at the destination makes this a significant achievement.

Watch this video to see a live vMotion of a GPU-enabled VM between two ESXi hosts.

Video - vMotion Demo (1:18)

Conclusion

In this module, we showed you how you can vMotion a VM so maintenance can be done on a host without affecting the ML workloads.

You've finished Module 5

Congratulations on completing Module 5.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes)

Introduction

In this module, we will run a complex language modeling ML workload. Given a history of words, this benchmark predicts the next word. The benchmark uses the Penn Tree Bank (PTB) database, which has 929K training words, 73K validation words, and 82K test words, with a vocabulary of 10K words.

The benchmark employs a recurrent neural network built from LSTM (Long Short-Term Memory) units and comes in three model sizes: the small model has 200 LSTM units per layer, the medium model has 650 LSTM units per layer, and the large model has 1,500 LSTM units per layer. The bigger models give better accuracy, but they take more time to train. For example, the large model can take around 56 hours to train, while using a Pascal P40 GPU brings this time down to about 3 hours.
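To give a sense of what this benchmark's model looks like in code, here is a minimal sketch of a word-level LSTM language model using TensorFlow's Keras API. It is not the lab's PTB benchmark script; the vocabulary size and the 200-unit layers mirror the small model described above, while the embedding size, sequence length, and stand-in training data are illustrative assumptions.

# ptb_lstm_sketch.py - word-level LSTM language model sketch (not the lab script)
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000   # PTB vocabulary size from the module text
SEQ_LEN = 35         # illustrative unrolled sequence length
UNITS = 200          # "small" model: 200 LSTM units per layer

# The model predicts the next word at every position in the input sequence.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, UNITS, input_length=SEQ_LEN),
    tf.keras.layers.LSTM(UNITS, return_sequences=True),
    tf.keras.layers.LSTM(UNITS, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random stand-in data; the real benchmark feeds PTB word IDs here.
x = np.random.randint(0, VOCAB_SIZE, size=(64, SEQ_LEN))
y = np.random.randint(0, VOCAB_SIZE, size=(64, SEQ_LEN, 1))
model.fit(x, y, epochs=1, batch_size=32)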

Hands-on Labs Interactive Simulation: Running Machine Learning Workloads using TensorFlow in vSphere

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, we ran a machine learning workload using TensorFlow on NVIDIA GPUs in vSphere.

You've finished Module 6

Congratulations on completing Module 6.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 7 - vGPU Scheduling Options (15 minutes)

Introduction

In this module, we will introduce you to the vGPU scheduling options.

Multiple VMs share a physical GPU by using the NVIDIA Virtual GPU manager. The vGPU scheduling policy specifies how the GPU is shared among the VMs. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share. The selection of a vGPU scheduling option depends on the use case.

The Best Effort scheduler optimizes GPU utilization. In some circumstances, a VM running a GPU-intensive application may affect the performance of a GPU-lightweight application running in another VM. To avoid such performance impact and ensure QoS (Quality of Service), you can choose to switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler ensures an equal share of GPU time for each powered-on VM. The Fixed Share scheduler gives a fixed share of GPU time to a VM based on the vGPU profiles associated with the VMs on the physical GPU.

NVIDIA supports the Best Effort vGPU scheduler for all supported architectures. For the NVIDIA Pascal and Volta architectures, it supports the Equal Share and Fixed Share schedulers in addition to the Best Effort scheduler.

The diagrams below illustrate the Best Effort and Equal Share schedulers.

(Figures: Best Effort Scheduler; Equal Share Scheduler)

Hands-on Labs Interactive Simulation: vGPU Scheduling Options

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, we reviewed the vGPU scheduling options available with NVIDIA GRID and how to choose between them.

You've finished Module 7

Congratulations on completing Module 7.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Module 8 - Maximizing the Power of vSphere for Diverse Workloads using GPUs (15 minutes)

Introduction

In this module, we will show you what the NVIDIA GPU can do, based on benchmarks.

The benchmarks are started using a script that runs in a controller VM running Ubuntu Linux. Once the script is started, it remotely invokes the SPECapc for 3ds Max 2015 benchmark on two VMs, and MNIST on the CentOS VM. Once the benchmarks run to completion, the VMs reboot automatically, and that signals completion to the controller VM. We will start the benchmark now.

The metric we'll use is simply the wall-clock time to complete the CAD and ML benchmarks. We'll compare the wall-clock time to run the ML and CAD benchmarks stand-alone with the time to run the CAD+ML benchmarks concurrently. Prior to this lab, we ran the CAD benchmark stand-alone and recorded its wall-clock run time, and we then ran the ML benchmark stand-alone and recorded its wall-clock run time. These times are recorded in the file WT.txt, which is printed out once the concurrently running ML+CAD benchmarks complete execution.

From the data we can see that the ML benchmark sees no increase in run time due to sharing the server with CAD. The CAD benchmarks do not show any increase in run time due to sharing either (data for this is not shown in this lab). What we have demonstrated in this lab is that NVIDIA GRID vGPU on vSphere is sufficiently powerful to run diverse workloads concurrently with no noticeable drop in performance and little or no special effort.
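The controller script itself is not included in this manual. As a rough illustration of the measurement methodology only, the sketch below shows how a controller could time a remotely invoked benchmark and append the wall-clock result to a WT.txt-style file. The host name, command, and file path are hypothetical placeholders, not the lab's actual values.

# time_benchmark.py - hypothetical sketch of the wall-clock measurement idea
import subprocess
import time

def run_and_time(label, command):
    """Run a benchmark command and return its label and wall-clock time in seconds."""
    start = time.time()
    subprocess.run(command, check=True)
    return label, time.time() - start

# Placeholder remote invocation; the real lab drives SPECapc and MNIST VMs.
label, elapsed = run_and_time(
    "ML benchmark",
    ["ssh", "ml-vm.example.local", "python", "mnist_benchmark.py"],
)

# Append the result in the spirit of the lab's WT.txt summary file.
with open("WT.txt", "a") as f:
    f.write("{}: {:.1f} seconds\n".format(label, elapsed))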

Hands-on Labs Interactive Simulation: Maximizing the Power of vSphere for Diverse Workloads using GPUs

This part of the lab is presented as a Hands-on Labs Interactive Simulation. This allows you to experience steps which are too time-consuming or resource-intensive to do live in the lab environment. In this simulation, you can use the software interface as if you are interacting with a live environment.

1. Click here to open the interactive simulation. It will open in a new browser window or tab.
2. When finished, click the "Return to the lab" link to continue with this lab.

The lab continues to run in the background. If the lab goes into standby mode, you can resume it after completing the module.

Conclusion

In this module, we showed how vSphere with NVIDIA GRID vGPU can run diverse CAD and ML workloads concurrently with no noticeable drop in performance.

You've finished Module 8

Congratulations on completing Module 8.

If you are looking for additional information on machine learning at VMware, try one of these:

Click on this link: https://blogs.vmware.com/apps/machine-learning-resources
Or use your smart device to scan the QR code.

Proceed to any module below which interests you most.

Module 1 - Machine Learning Apps in vSphere VMs Using GPUs (15 minutes) - Basic
Module 2 - Using NVIDIA GRID vGPUs in vSphere (15 minutes) - Basic
Module 3 - Using GPUs in Pass-through Mode on vSphere (15 minutes) - Basic
Module 4 - Using Bitfusion GPU virtualization in vSphere (15 minutes) - Basic
Module 5 - Performing Infrastructure Maintenance when VMs are using GPUs (15 minutes) - Basic
Module 6 - Running Machine Learning Workloads using TensorFlow in vSphere (30 minutes) - Intermediate
Module 7 - vGPU Scheduling Options (15 minutes) - Intermediate

How to End Lab

To end your lab, click on the END button.

Conclusion

Thank you for participating in the VMware Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.

Lab SKU:
Version: 20181104-153024