Approaches to online quantile estimation

AI/ ML/ Data Science

Joe Ross

Principal Data Scientist at Splunk

This talk will explore and compare several compact data structures for estimation of quantiles on streams, including a discussion of how they balance accuracy against computational resource efficiency. A new approach providing more flexibility in specifying how computational resources should be expended across the distribution will also be explained. Quantiles (e.g., median, 99th percentile) are fundamental summary statistics of one-dimensional distributions. They are particularly important for SLA-type calculations and characterizing latency distributions, but unlike their simpler counterparts such as the mean and standard deviation, their computation is somewhat more expensive. The increasing importance of stream processing (in observability and other domains) and the impossibility of exact online quantile calculation together motivate the construction of compact data structures for estimation of quantiles on streams. In this talk we will explore and compare several such data structures (e.g., moment-based, KLL sketch, t-digest) with an eye towards how they balance accuracy against resource efficiency, theoretical guarantees, and desirable properties such as mergeability. We will also discuss a recent variation of the t-digest which provides more flexibility in specifying how computational resources should be expended across the distribution. No prior knowledge of the subject is assumed. Some familiarity with the general problem area would be helpful but is not required.