Data Versioning - does it mean what we think it means?
The demand for better versioning of data is growing. There is a plethora of open-source projects providing "data versioning", "Git for data" and "manage data like code" capabilities (e.g., Hudi, DoltHub, Delta Lake, DVC, Pachyderm, and lakeFS). So how do you know you are choosing the right one?
In this talk, we will go over the differences between these solutions by clustering them according to four main use cases:
*Collaboration over data: enabling teams to work on shared data over time, with each team contributing to the data's evolution.
*Managing ML pipelines: allowing pipeline management of ML projects, from model creation to production.
*The need for mutability: data formats that support insert, update, and delete operations on top of immutable object storage.
*The need for ACID guarantees over an object storage data lake: using branching logic to manage an object-storage-based data lake.
By the end of the talk, you should have a good understanding of how these solutions compare and which one to choose for each type of use case.