Embeddings at Scale: Lessons from LanceDB and the Lance Format
AI/ML & Data Science | Intermediate
Storing hundreds of millions of multimodal embeddings exposes the limits of traditional vector stores. This talk shows how the columnar Lance format handles that scale, with benchmarks and production pitfalls.
Storing hundreds of millions of image, text, and audio embeddings exposes the limits of traditional vector stores. This talk walks through how LanceDB's columnar Lance format tackles the problem at scale: storage layout for mixed modalities, IVF-PQ and HNSW index tradeoffs, zero-copy reads, dataset versioning, and predicate pushdown for filtered approximate nearest-neighbor search. It includes benchmarks, ingestion patterns, and pitfalls drawn from production ML pipelines.
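As a rough illustration of why predicate pushdown matters for filtered nearest-neighbor search, the sketch below contrasts filtering before the search with filtering after it. It is not LanceDB code and does not use the Lance format: the rows, the `modality` field, and the helper names `prefilter_search` and `postfilter_search` are all hypothetical, and an exact brute-force scan stands in for an approximate index such as IVF-PQ or HNSW.

```python
import math

# Hypothetical toy rows: (id, modality, embedding). Illustrative only.
ROWS = [
    (0, "image", [1.0, 0.0]),
    (1, "text",  [0.9, 0.1]),
    (2, "image", [0.0, 1.0]),
    (3, "text",  [0.8, 0.2]),
    (4, "audio", [0.1, 0.9]),
]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prefilter_search(rows, query, k, predicate):
    """Apply the predicate first, then rank only the surviving rows."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(key=lambda r: l2(r[2], query))
    return [r[0] for r in candidates[:k]]


def postfilter_search(rows, query, k, predicate):
    """Rank all rows, keep the top k, then drop rows that fail the predicate."""
    top_k = sorted(rows, key=lambda r: l2(r[2], query))[:k]
    return [r[0] for r in top_k if predicate(r)]


def is_image(row):
    return row[1] == "image"


query = [1.0, 0.0]
pre = prefilter_search(ROWS, query, 2, is_image)
post = postfilter_search(ROWS, query, 2, is_image)
print("pre-filter ids: ", pre)
print("post-filter ids:", post)
```

With this toy data, filtering first returns two image rows, while filtering after the top-2 cut returns only one, because a text row took one of the two slots. Pushing the predicate into the search is what keeps the result set at the requested size when the filter is selective.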