Saving Lives Through Big Data: Using Data Science to Help Drug Researchers
When we think of real-time data analytics, our first thought is high tech, with companies such as Twitter generating enormous amounts of data. The manufacturing industry is often overlooked, but with the rise of IoT, sensors and probes are now ubiquitous. They generate data points every second, which makes it increasingly necessary to manage these enormous, rapidly growing data sets effectively.
Imagine hundreds to thousands of machines, each with tens to hundreds of sensors, and each sensor taking one measurement per second. Now multiply that by dozens of production sites across the globe, and you can begin to see how immense a data problem our client, a pharmaceutical company, faces, and how data science and real-time analytics can help save millions of dollars in drug research.
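To make that scale concrete, here is a rough back-of-the-envelope sketch. The specific numbers are assumptions chosen only to sit inside the ranges described above (for example, 24 sites standing in for "dozens"); they are not the client's actual figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def measurements_per_second(sites, machines_per_site, sensors_per_machine,
                            readings_per_sensor_per_second=1):
    """Total sensor readings produced per second across all sites."""
    return (sites * machines_per_site * sensors_per_machine
            * readings_per_sensor_per_second)


# Assumed, illustrative bounds: "dozens" of sites taken as 24,
# "hundreds to thousands" of machines as 100 to 1,000,
# "tens to hundreds" of sensors as 10 to 100.
low = measurements_per_second(sites=24, machines_per_site=100,
                              sensors_per_machine=10)
high = measurements_per_second(sites=24, machines_per_site=1000,
                               sensors_per_machine=100)

print(f"per second: {low:,} to {high:,}")
print(f"per day:    {low * SECONDS_PER_DAY:,} to {high * SECONDS_PER_DAY:,}")
```

Even the low end of these assumed ranges adds up to billions of readings per day, which is why storage and ingestion throughput dominate the design.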
As we worked to help our client overcome this challenge, we had to clear several barriers. We needed:
• Storage with high I/O capability to handle large volumes of dynamic data
• Data ingestion that could sustain extremely high throughput
• An “ultimate” analytics solution able to process large data volumes (most commonly used analytics tools work in memory and so are limited by the memory available)
We built a real-time lambda architecture using Kafka, Spark Streaming, and Axibase (a time-series database based on HBase). It reads and processes sensor data from ten global sites in real time, stores the data in the time-series database, and lets users perform complex analytics immediately. With multiple layers of parallelism, we achieved an ingestion throughput of 50,000 messages/sec on a cluster of only six nodes. More importantly, we empowered research scientists to do what we all want them to be doing: researching drug therapies to save lives, not troubleshooting time-consuming drug production issues.
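One common way to get this kind of parallelism is to partition messages by key so that independent workers can process them side by side. The sketch below illustrates the idea using only the Python standard library; it is not our production code. The sensor IDs, the partition count of six, and the CRC32 hash are all assumptions made for illustration (Kafka's own default partitioner uses a different hash), and the threads stand in for what would be separate consumers or Spark executors.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 6  # assumed value, chosen only for illustration


def partition_for(sensor_id, num_partitions=NUM_PARTITIONS):
    """Map a sensor ID to a partition; the same ID always maps to the same one."""
    return zlib.crc32(sensor_id.encode("utf-8")) % num_partitions


def split_by_partition(readings, num_partitions=NUM_PARTITIONS):
    """Group (sensor_id, timestamp, value) readings by partition number."""
    partitions = defaultdict(list)
    for reading in readings:
        partitions[partition_for(reading[0], num_partitions)].append(reading)
    return partitions


def summarize(batch):
    """Compute (count, mean) per sensor for one partition's readings."""
    totals = defaultdict(lambda: [0, 0.0])
    for sensor_id, _timestamp, value in batch:
        totals[sensor_id][0] += 1
        totals[sensor_id][1] += value
    return {sid: (count, total / count) for sid, (count, total) in totals.items()}


def process_in_parallel(readings, num_partitions=NUM_PARTITIONS):
    """Summarize each partition in its own worker, then merge the results."""
    partitions = split_by_partition(readings, num_partitions)
    merged = {}
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        for result in pool.map(summarize, partitions.values()):
            # Safe to merge directly: each sensor lives in exactly one partition.
            merged.update(result)
    return merged


# Hypothetical sensor IDs and values.
readings = [
    ("site1/machine1/temp", 0, 20.0),
    ("site1/machine1/temp", 1, 22.0),
    ("site2/machine7/ph", 0, 7.0),
]
summary = process_in_parallel(readings)
```

Because every reading from a given sensor lands in the same partition, each worker can aggregate its sensors without coordinating with the others, which is what lets throughput grow as partitions and nodes are added.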