GPUs and Machine Learning at Scale
With the rise of machine learning and artificial intelligence, organizations are looking to adopt more GPUs. Recent advances in deep learning models for self-driving cars, in areas such as lane detection and perception, make it important to enable distributed deep learning on large-scale GPU clusters. GPU-enabled clusters are usually either dedicated to a specific team or shared across teams. In both scenarios, GPUs end up underutilized, or overutilized during peak times, leading to increased delays and a waste of the data science team's precious time and of cloud resources. Existing tools do not allow dynamic allocation of resources while also guaranteeing performance and isolation.

This presentation will show how Apache Mesos and DC/OS support allocating GPUs (Graphics Processing Units) to different services and teams. In particular, we will look at GPU isolation, which enables sharing (GPU-based) cluster resources between traditional and machine learning workloads, as well as dynamically allocating GPU resources inside those clusters. We will then show a demo running an image-classification job with TensorFlow, a popular library for machine learning, distributed on top of a DC/OS cluster.
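To give a flavor of how GPU allocation looks in practice, Marathon (the container orchestrator on DC/OS) lets an app definition request GPUs via a `gpus` field alongside `cpus` and `mem`. The sketch below is illustrative only; the app id, command, and image are placeholder assumptions, not taken from the talk:

```json
{
  "id": "/tensorflow-trainer",
  "cmd": "python train.py",
  "cpus": 4,
  "mem": 8192,
  "gpus": 1,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}
```

Mesos then offers GPU resources only to frameworks that opt in, which is what makes it possible to share the same cluster between GPU and non-GPU workloads.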