Building a Billion-Scale Vector Embeddings Dataset

For Real-World ANN Benchmarking and RAG Research

Jun 15, 2025

1B Wikipedia Paragraph Embeddings using Open Models

Billion Vector Embeddings Dataset

As part of the Billion-Scale Embeddings Dataset project, my proposal, under the mentorship of Jayjeet Chakraborty, aims to create the first large-scale, real-world vector embeddings dataset. The goal is to close a critical gap in the data available for benchmarking Approximate Nearest Neighbor (ANN) search and for evaluating Retrieval-Augmented Generation (RAG) systems.

Motivation

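Existing ANN benchmarks often fall short for modern workloads: some are synthetic, others are built on low-dimensional image descriptors rather than text embeddings (SIFT vectors, for example, are 128d), and many are small-scale (≤1M vectors). With embedding models now producing far higher-dimensional vectors (e.g., OpenAI’s 3072d text-embedding-3-large) for LLM-based vector search, there’s a growing need for: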

High-dimensional (>1000d), large-scale (>100M) embeddings
Real-world distributions (Wikipedia-scale text)
Open, reproducible benchmarks for the community

Project Goals

Generate 1 billion embeddings from English Wikipedia using open-source models.
Create multiple dimensional variants: 1024d, 4096d, and 8192d.
Deduplicate, compress, and store embeddings with rich metadata (source URL, timestamp, embedding model).
Benchmark ANN performance with FAISS, HNSW-based indexes, and Annoy.
Distribute the dataset via Hugging Face and AWS S3 with shard-level access.
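
To get a feel for the scale these goals imply, the raw size of the vectors can be estimated with simple arithmetic. The sketch below assumes dense, uncompressed storage in float32 or float16 and ignores metadata and index overhead. The storage formats are an assumption for illustration, not something the proposal specifies, and the function names are hypothetical.

```python
# Back-of-envelope storage estimate for dense, uncompressed embeddings.
# Assumptions (not from the proposal): float32/float16 storage, no metadata,
# no index overhead.

NUM_VECTORS = 1_000_000_000                      # 1 billion embeddings (project goal)
DIMENSIONS = (1024, 4096, 8192)                  # dimensional variants (project goal)
BYTES_PER_VALUE = {"float32": 4, "float16": 2}   # assumed storage dtypes


def raw_size_bytes(num_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Raw size of a dense matrix of num_vectors x dim values."""
    return num_vectors * dim * bytes_per_value


def to_tib(num_bytes: int) -> float:
    """Convert bytes to tebibytes (2**40 bytes)."""
    return num_bytes / 2**40


if __name__ == "__main__":
    for dim in DIMENSIONS:
        for dtype, width in BYTES_PER_VALUE.items():
            size = raw_size_bytes(NUM_VECTORS, dim, width)
            print(f"{dim:>5}d {dtype}: {to_tib(size):6.2f} TiB")
```

Under these assumptions, the 1024d float32 variant alone comes to about 4.1 TB (roughly 3.7 TiB) of raw vectors, and each doubling of dimension or precision doubles that figure. This is why compression and shard-level access are part of the goals.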

Open Source Impact

ANN Libraries: Enable reproducible benchmarking for real-world workloads.
RAG Systems: Evaluate and optimize retrieval at scale using real Wikipedia text.
Researchers: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.
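
ANN accuracy in studies like these is commonly reported as recall@k: the fraction of the true k nearest neighbors that an approximate search returns. The following is a minimal sketch using only NumPy, with small random vectors standing in for real embeddings. The "approximate" search here simply ignores half of the dimensions, a crude stand-in for an ANN index chosen only to keep the example self-contained. All sizes and function names are illustrative assumptions.

```python
import numpy as np


def exact_top_k(base: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Brute-force top-k neighbor ids by inner product (used as ground truth)."""
    scores = queries @ base.T
    # Sort each row by descending score and keep the first k ids.
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of true top-k neighbors that the approximate search returned."""
    k = exact_ids.shape[1]
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / (exact_ids.shape[0] * k)


def demo(seed: int = 0) -> float:
    """Compare a reduced-dimension search against exact search on random data."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((2000, 64)).astype(np.float32)
    queries = rng.standard_normal((20, 64)).astype(np.float32)
    truth = exact_top_k(base, queries, k=10)
    # Stand-in for an ANN index: search using only the first 32 dimensions.
    approx = exact_top_k(base[:, :32], queries[:, :32], k=10)
    return recall_at_k(approx, truth)


if __name__ == "__main__":
    print(f"recall@10 = {demo():.3f}")
```

In a real benchmark, the approximate ids would come from the index under test, and the ground truth would be computed once by exact search and shipped alongside the dataset so that results are comparable across libraries.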

Prathamesh Devadiga

B.Tech CSE Student at PES University