Building a Billion-Scale Vector Embeddings Dataset

For Real-World ANN Benchmarking and RAG Research

1B Wikipedia Paragraph Embeddings using Open Models

As part of the Billion-Scale Embeddings Dataset project, under the mentorship of Jayjeet Chakraborty, I propose to create the first large-scale, real-world vector embeddings dataset, bridging a critical gap in benchmarks for Approximate Nearest Neighbor (ANN) search and Retrieval-Augmented Generation (RAG) systems.

Motivation

Existing ANN benchmarks often fall short: they are either unrepresentative of modern text embeddings (SIFT, for example, consists of 128d handcrafted image descriptors) or too small in scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI's 3072d text-embedding-3-large), there is a growing need for:

  • High-dimensional (>1000d), large-scale (>100M) embeddings
  • Real-world distributions (Wikipedia-scale text)
  • Open, reproducible benchmarks for the community
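To put these requirements in perspective, a quick back-of-envelope calculation (assuming uncompressed float32 vectors, which the final dataset may not use) shows why storage and distribution become first-class concerns at this scale:

```python
def storage_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a dense embedding matrix in GiB (float32 by default)."""
    return num_vectors * dim * bytes_per_value / 2**30

# 1 billion vectors at 1024d in float32 is roughly 3.7 TiB before any
# compression; at 8192d it is 8x that, around 30 TiB.
print(f"{storage_gib(1_000_000_000, 1024) / 1024:.2f} TiB")
```

This is exactly why the project plan pairs generation with compression and shard-level access rather than a single monolithic download.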

Project Goals

  • Generate 1 billion embeddings from English Wikipedia using open-source models.
  • Create multiple dimensional variants: 1024d, 4096d, and 8192d.
  • Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).
  • Benchmark ANN performance with FAISS, hnswlib (HNSW), and Annoy.
  • Distribute the dataset via HuggingFace & AWS S3 with shard-level access.

Open Source Impact

  • ANN Libraries: Enable reproducible benchmarking for real-world workloads.
  • RAG Systems: Evaluate and optimize retrieval at scale using real Wikipedia text.
  • Researchers: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.
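To make the benchmarking use case concrete: the standard ANN accuracy metric is recall@k, the fraction of the exact top-k neighbors that an approximate index recovers. Below is an illustrative NumPy-only sketch of computing brute-force ground truth and scoring an index against it; the real libraries (FAISS and others) ship their own, far faster tooling for this:

```python
import numpy as np

def exact_topk(db: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Brute-force inner-product search; serves as ground truth for recall."""
    scores = queries @ db.T                    # (num_queries, num_db_vectors)
    return np.argsort(-scores, axis=1)[:, :k]  # top-k indices per query

def recall_at_k(true_nb: np.ndarray, approx_nb: np.ndarray, k: int) -> float:
    """Average fraction of the true top-k neighbors the ANN index returned."""
    hits = sum(len(set(t[:k]) & set(a[:k])) for t, a in zip(true_nb, approx_nb))
    return hits / (len(true_nb) * k)
```

At billion scale even the ground-truth computation is expensive, which is one reason shipping precomputed ground-truth neighbor sets alongside the vectors makes the dataset far more useful to index developers.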

Prathamesh Devadiga
B.Tech CSE Student at PES University

Prathamesh Devadiga is a B.Tech Computer Science student at PES University. He works on vector search, deep learning systems, and retrieval-augmented generation.