From Fragments to Full Story - Replaying History from Snapshots and Streams
This talk shows how we reconstructed historical activity from data lakes that stored only snapshots and CDC streams. Learn techniques for rebuilding complete timelines from incomplete records, so analysis stays accurate even after the source data has expired.
In many large-scale systems, application data has a limited retention period, which makes it difficult to recreate historical activity once the source records expire. At the same time, enterprise data lakes often store only periodic snapshots or partial change logs rather than complete transaction histories. This talk explores how we addressed this challenge by reconstructing full event timelines from snapshot-based data lake storage, using change data capture (CDC) streams as the backbone. We’ll walk through how we reassembled incomplete records into coherent, replayable sequences, which let us analyze historical trends and behaviors long after the original system data had expired. Attendees will gain practical insights into designing resilient pipelines, working with imperfect data, and balancing system constraints against the need for accurate historical reconstruction in real-world analytics environments.
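To make the idea concrete, here is a minimal sketch of one way such a reconstruction could work. It is not the pipeline from the talk. It assumes snapshots arrive as `(timestamp, {key: row})` pairs and CDC events carry a timestamp, a key, an operation, and the row state after the change. The names `Event`, `replay`, `diff_events`, and `reconstruct` are hypothetical. CDC events are used wherever they exist, and anything they fail to explain is filled in by diffing the replayed state against the next snapshot and marking those events as inferred.

```python
from dataclasses import dataclass
from typing import Any, Optional

Row = dict[str, Any]


@dataclass(frozen=True)
class Event:
    ts: int                 # event time (hypothetical unit, e.g. epoch seconds)
    key: str                # primary key of the affected record
    op: str                 # "insert", "update", or "delete"
    after: Optional[Row]    # row state after the change; None for deletes
    inferred: bool = False  # True when derived from a snapshot diff, not from CDC


def replay(state: dict[str, Row], events: list[Event]) -> dict[str, Row]:
    """Apply events in timestamp order to a copy of `state`."""
    out = dict(state)
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.op == "delete":
            out.pop(ev.key, None)
        else:
            out[ev.key] = ev.after
    return out


def diff_events(have: dict[str, Row], want: dict[str, Row], ts: int) -> list[Event]:
    """Synthesize the inferred events needed to turn `have` into `want`."""
    events = []
    for key in sorted(have.keys() | want.keys()):
        if key not in have:
            events.append(Event(ts, key, "insert", want[key], inferred=True))
        elif key not in want:
            events.append(Event(ts, key, "delete", None, inferred=True))
        elif have[key] != want[key]:
            events.append(Event(ts, key, "update", want[key], inferred=True))
    return events


def reconstruct(snapshots: list[tuple[int, dict[str, Row]]], cdc: list[Event]) -> list[Event]:
    """Build one ordered timeline from periodic snapshots and a partial CDC stream.

    Assumes at least one snapshot. CDC events at or before the first snapshot
    are ignored, because that snapshot already reflects them.
    """
    snapshots = sorted(snapshots, key=lambda s: s[0])
    first_ts, first_state = snapshots[0]
    # Baseline: every row in the first snapshot becomes an inferred insert.
    timeline = diff_events({}, first_state, first_ts)
    state = dict(first_state)
    for (prev_ts, _), (next_ts, next_state) in zip(snapshots, snapshots[1:]):
        window = sorted((e for e in cdc if prev_ts < e.ts <= next_ts), key=lambda e: e.ts)
        timeline.extend(window)
        state = replay(state, window)
        # Whatever CDC did not explain is attributed to the snapshot time.
        timeline.extend(diff_events(state, next_state, next_ts))
        state = dict(next_state)
    # CDC events newer than the last snapshot are appended unchanged.
    last_ts = snapshots[-1][0]
    timeline.extend(sorted((e for e in cdc if e.ts > last_ts), key=lambda e: e.ts))
    return timeline


snapshots = [
    (100, {"a": {"status": "new"}}),
    (200, {"a": {"status": "shipped"}, "b": {"status": "new"}}),
]
cdc = [Event(150, "a", "update", {"status": "paid"})]
timeline = reconstruct(snapshots, cdc)
```

In this example the CDC stream records only the intermediate "paid" update. The sketch therefore adds two inferred events at the second snapshot: the change of `a` to "shipped" and the arrival of `b`. Replaying the resulting timeline from an empty state reproduces the latest snapshot. The `inferred` flag lets downstream analysis treat observed and reconstructed events differently, since an inferred event's timestamp is only an upper bound on when the change actually happened.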