The Alexandria Index

Massive internet datasets, embedded, open-sourced, and free

Vector embeddings are incredibly powerful, compressing text into a low-dimensional 'meaning space', but they're also shockingly cheap. Alexandria is an open-source project to embed and release (for free) large datasets in research, law, medicine and more.

Will DePue
Today, I'm announcing Alexandria, an open-source initiative to embed the internet. To start, we're releasing the embeddings for every research paper on the Arxiv. That's over 4m items, 600m tokens, and 3.07 billion vector dimensions. We're not stopping here.

A significant number of the world's problems are just search, clustering, recommendation, or classification: all things embeddings are great at. For example, finding research papers via keywords is hard when there are 10 words that mean the same thing. Embeddings make this easy.

Embeddings are also a one-time cost and are incredibly cheap. In most cases, you'll never need to embed the same document twice. At the moment, we're embedding tokens at high performance for $1 per 100,000,000 tokens. That's the length of the Bible, 10 times, per dollar.

I was surprised when I couldn't find any open embedding datasets (research, law, finance, etc.), considering the immense value and low cost. There's too much to be built here... so we're building an org and doing it ourselves.

You can download the Arxiv embeddings (titles and abstracts, 6 GB and 8 GB respectively) at the link above. There are a lot of datasets to choose from, so we need your help to figure out what to work on next. Let us know by voting!

Note: Embeddings are most often used for search and question answering, so we're building those tools ourselves. Our Arxiv embedding search launches next week, with more to come. We're also experimenting with an AI agent personal research assistant that helps you learn, teach, and publish.
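To make the search use case concrete, here is a minimal sketch of nearest-neighbor lookup over embeddings using cosine similarity. It is not Alexandria's search implementation: the titles, the 3-dimensional vectors, and the helper names `cosine_similarity` and `top_k` are all made up for illustration. Real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=3):
    """Return the k (score, title) pairs most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), title)
        for title, vec in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy data: hand-written 3-dim vectors standing in for
# real paper embeddings. Two papers share a topic, one does not.
corpus = {
    "Graph neural networks for molecules": [0.9, 0.1, 0.0],
    "Message passing on chemical graphs": [0.8, 0.2, 0.1],
    "Monetary policy and inflation": [0.0, 0.1, 0.9],
}

# Hypothetical query embedding, close to the two graph papers.
query = [0.85, 0.15, 0.05]

results = top_k(query, corpus, k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

The two graph papers share no keywords in their titles, yet both rank ahead of the economics paper because their vectors point in a similar direction. That is the "10 words that mean the same thing" problem handled by geometry instead of string matching. At the scale of millions of vectors, a linear scan like this would normally be replaced by an approximate nearest-neighbor index.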
Chris Messina
@willdepue this is pretty rad — what's your business model and what's the license on what you've released?
Will DePue
@chrismessina For now, we're running this as a non-profit public good, but there's a future angle around generating revenue from search tools + our ChatGPT plugin. The license is MIT. Licensing as it relates to the contents of the papers themselves is still being solidified, but everything should be under a permissive, commercial-friendly license. More info soon.
Mert Deveci
The tabs are not working - are they not live? When will the SEC filings be live?
Leah Baloyi
This is crazy 🤩 Congrats on your launch.