Democratizing AI Across Clouds: Low-Cost, Easy-to-Deploy Machine Learning
Machine learning (especially deep learning) is becoming increasingly complex and expensive. Many companies build their core businesses (e.g., self-driving, credit-card fraud detection, item recommendation) on continuous model training and/or inference, typically performed with dozens or even hundreds of GPU machines in a (public or on-premise) cloud. While a cloud-based environment lets these jobs scale dynamically with load (e.g., user requests), running them under the cloud's pay-as-you-go pricing model incurs large monetary costs, which grow rapidly with model size/complexity, dataset size, and the number of users.
BreezeML democratizes AI/ML by helping AI companies significantly increase their performance-per-dollar through effective use of preemptible GPU instances. Rooted in years of research at UCLA and Princeton, BreezeML provides (1) a preemption-resilient software system that allows users to reliably run ML training/inference jobs on preemptible instances (such as spot instances) and (2) a virtual cloud interface that performs intelligent selection and scheduling of (spot and on-demand) instances to minimize monetary cost while providing strong SLA guarantees.
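To make the selection problem concrete, here is a minimal sketch of the general idea behind cost-aware instance selection under an SLA deadline. This is an illustration only, not BreezeML's actual scheduler; all instance names, prices, and throughputs are made-up examples.

```python
# Illustrative sketch (not BreezeML's scheduler): pick the cheapest
# instance type whose expected runtime still meets the SLA deadline.
from dataclasses import dataclass

@dataclass
class InstanceOption:
    name: str
    price_per_hour: float   # USD per hour (hypothetical numbers)
    throughput: float       # training samples processed per hour
    preemptible: bool       # spot instances are cheaper but can be reclaimed

def pick_cheapest(options, total_samples, deadline_hours):
    """Return the lowest total-cost option whose runtime fits the deadline."""
    feasible = [o for o in options
                if total_samples / o.throughput <= deadline_hours]
    if not feasible:
        return None  # a single instance cannot meet the SLA; scale out instead
    return min(feasible,
               key=lambda o: (total_samples / o.throughput) * o.price_per_hour)

options = [
    InstanceOption("spot-gpu",      price_per_hour=0.9, throughput=1000, preemptible=True),
    InstanceOption("on-demand-gpu", price_per_hour=3.0, throughput=1000, preemptible=False),
]
choice = pick_cheapest(options, total_samples=4000, deadline_hours=8)
# With these numbers, the spot instance finishes in 4 hours for $3.60
# versus $12.00 on-demand, so the spot option is selected.
```

A real scheduler must additionally model preemption risk (a reclaimed spot instance may miss the deadline), which is why mixing spot and on-demand capacity matters.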
Currently, BreezeML provides two services:
An API server (http://windmill.breezeml.ai/apis/) that allows ML engineers to upload batch jobs for free trials. It also allows customers to log in with their own cloud (e.g., AWS) credentials and use BreezeML to run jobs under their own cloud configurations.
We provide a Docker image of the Breeze runtime, which includes Breeze-enhanced PyTorch/TensorFlow/XGBoost as well as a new K8s-based orchestration system that can be easily deployed in the user's local environment (compliant with the user's local security policies). Our runtime allows users to (a) use cheap spot instances in the cloud or (b) share resources between (low-priority) training and (high-priority) inference jobs in an on-premise cluster, thereby significantly improving GPU resource utilization.
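Running training on spot instances hinges on one general technique: checkpointing state when the cloud signals that an instance is about to be reclaimed (delivered as a termination notice, typically surfaced to the process as SIGTERM, with a short grace period). The sketch below illustrates that idea in plain Python; it is not BreezeML's implementation, and the checkpoint format is a made-up placeholder.

```python
# Minimal sketch of preemption-resilient training (not BreezeML's code):
# trap the instance-termination signal, checkpoint progress, and resume
# from the checkpoint when the job is rescheduled on a new instance.
import json
import os
import signal

CHECKPOINT = "checkpoint.json"   # hypothetical checkpoint location
_preempted = False

def _on_preempt(signum, frame):
    # Cloud providers give a short grace period before a spot instance
    # disappears; set a flag so the training loop can save and exit.
    global _preempted
    _preempted = True

signal.signal(signal.SIGTERM, _on_preempt)

def load_step():
    """Return the last checkpointed step, or 0 on a fresh start."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_step(step):
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step}, f)

def train(total_steps):
    step = load_step()           # resume from wherever we left off
    while step < total_steps:
        step += 1                # stand-in for one real training step
        if _preempted:
            save_step(step)      # persist before the instance is reclaimed
            return step
    save_step(step)
    return step
```

In a real system the checkpoint would hold model weights and optimizer state, and the orchestrator (rather than the job itself) would restart the job on a fresh spot or on-demand instance.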
Experiments across a wide range of vision, language, and classification models demonstrate that BreezeML improves performance-per-dollar by an average of 3x. Our approach also eliminates the need for resource over-provisioning in on-premise clusters by allowing (high-priority) inference jobs to safely preempt (low-priority) training jobs.