Standing on the shoulders of giants: building Core Digital Media's data platform with Big Data technologies

BI, Reporting & Business Use Cases

Core Digital Media is one of the leading advertisers in the online space, responsible for about a fifth of the ads a person sees online on any given day. This is driven by our Marketing team, which continues to find performance wins by leveraging proprietary algorithms, deep learning models and advanced analytics to optimize ad spend. All of these frameworks rely on the performance data in Core Digital Media's Enterprise Data Warehouse (EDW) to make decisions. That data is collected from a multitude of marketing channels, such as Social (Facebook), Search (Google), Content (Taboola), Media, Affiliate and Retention, and integrated into the EDW. Timely and consistent availability of data in the EDW is therefore extremely critical for marketing optimization.

This talk details how we migrated our marketing data loads from a legacy ETL platform to a data infrastructure built around Apache Kafka and Apache Spark, using Python and Scala. The migration not only reduced data availability times from an average of over 60 minutes to about 2 minutes, but also allows us to load into the EDW 24x7. We will cover why we chose Kafka and Spark Structured Streaming, some of the challenges we faced, and best practices to follow when implementing streaming data architectures.
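To make the architecture concrete, here is a minimal sketch of the kind of pipeline described above: a Spark Structured Streaming job that reads marketing events from a Kafka topic and writes each micro-batch toward the warehouse. The topic name, event schema, broker address, and staging table are illustrative assumptions, not the actual production code.

```python
# Illustrative PySpark sketch; all names and connection details are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("marketing-stream-sketch").getOrCreate()

# Hypothetical schema for a marketing performance event.
event_schema = StructType([
    StructField("channel", StringType()),      # e.g. "facebook", "google", "taboola"
    StructField("campaign_id", StringType()),
    StructField("spend", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read events from Kafka as an unbounded stream and parse the JSON payload.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
    .option("subscribe", "marketing_events")           # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# foreachBatch lets a batch-style JDBC writer be reused against the EDW,
# landing each micro-batch within minutes instead of hourly ETL windows.
def write_to_edw(batch_df, batch_id):
    (batch_df.write
        .mode("append")
        .format("jdbc")
        .option("url", "jdbc:...")              # EDW connection (placeholder)
        .option("dbtable", "stg_marketing")     # assumed staging table
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_edw)
    .option("checkpointLocation", "/checkpoints/marketing")  # enables exactly-once recovery
    .start()
)
```

Checkpointing plus `foreachBatch` is one common pattern for streaming into a warehouse that only exposes a batch (JDBC) interface; the checkpoint location is what lets the job resume from the last committed Kafka offsets after a restart.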