Operationalizing Data Science with Apache Spark
Today, many data science projects focus solely on the complexity of the algorithms used to address the data problem. While algorithmic sophistication is a critical consideration, without a plan for disseminating the resulting insights through the broader enterprise, many of these projects end up dying on the vine. This presentation will highlight not only why a turnkey model operationalization strategy is critical to the success of enterprise data science projects, but also how such a strategy can be achieved using Spark.

Spark enables data scientists to perform sophisticated analyses using complex machine learning algorithms; even when dataset sizes are measured in terabytes, it provides a broad selection of machine learning algorithms that scale effortlessly. However, the process by which the business consumes the results of these analyses is far less sophisticated. Indeed, results are frequently communicated via PowerPoint presentation rather than through a turnkey pipeline for deploying improved models into production.

In this session, we discuss the current challenges of operationalizing these results, including the shortcomings of model serialization standards such as PMML for expressing the complex pre- and post-processing of data that effortless operationalization requires. Finally, we discuss in detail the potential of the emerging PFA (Portable Format for Analytics) standard for turnkey model operationalization, highlight how PFA model scoring can be supported using Spark Streaming, and describe our efforts to drive support for PFA model export in MLlib.
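To make the contrast with PMML concrete, here is a minimal sketch (not taken from the talk itself) of a PFA document in which a pre-processing step, a log transform, travels with the scoring logic in one portable artifact. The transform and the linear "model" are illustrative assumptions, and the example assumes Titus, the Open Data Group's Python PFA engine, with its PFAEngine.fromYaml entry point:

```python
# A toy PFA document scored with Titus; the log transform and the
# 2*ln(input) + 1 "model" are stand-ins chosen for illustration.
from titus.genpy import PFAEngine

# Pre-processing and scoring live together in one engine-independent document.
pfa_document = '''
input: double
output: double
action:
  - let: {x: {m.ln: input}}          # pre-processing: natural log of the input
  - {+: [{"*": [x, 2.0]}, 1.0]}      # stand-in model: 2*ln(input) + 1
'''

engine, = PFAEngine.fromYaml(pfa_document)  # fromYaml returns a list of engines
print(engine.action(10.0))                  # => 2*ln(10) + 1, approx. 5.61
```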
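And a minimal sketch of what PFA scoring inside a Spark Streaming job might look like, assuming newline-delimited numeric records arrive over a local socket (the host, port, record format, and PFA document are all hypothetical). The engine is rebuilt from the document on each partition, since compiled engines are not generally serializable across workers:

```python
# A sketch of PFA scoring in a Spark Streaming (DStream) job. The PFA
# document is a plain string, so it ships cleanly to executors; the
# compiled engine is rebuilt once per partition.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from titus.genpy import PFAEngine

# Trivial PFA document for illustration: add 100 to each input.
pfa_document = '''
input: double
output: double
action: {+: [input, 100.0]}
'''

def score_partition(records):
    # Build the scoring engine on the executor from the portable document.
    engine, = PFAEngine.fromYaml(pfa_document)
    for record in records:
        yield engine.action(float(record))

sc = SparkContext(appName="pfa-streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
scores = lines.mapPartitions(score_partition)
scores.pprint()  # print a sample of scored records from each batch

ssc.start()
ssc.awaitTermination()
```

Because the PFA document is just a string, the same artifact a data scientist validates offline can be shipped unchanged into the streaming job, which is the crux of the turnkey operationalization story.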