Embeddings at Scale: Lessons from LanceDB and the Lance Format

AI/ML & Data Science | Intermediate

Storing hundreds of millions of multimodal embeddings exposes the limits of traditional vector stores. This talk shows how the columnar Lance format handles that scale, with benchmarks and production pitfalls.

Justin Miller

Senior Software Engineer at LanceDB

Storing hundreds of millions of image, text, and audio embeddings exposes the limits of traditional vector stores. This talk walks through how LanceDB's columnar Lance format tackles the problem at scale: storage layout for mixed modalities, IVF-PQ and HNSW index tradeoffs, zero-copy reads, dataset versioning, and predicate pushdown for filtered approximate nearest-neighbor search. It includes benchmarks, ingestion patterns, and pitfalls drawn from production ML pipelines.