Vector embeddings are incredibly powerful, compressing text into a low-dimensional 'meaning space', but they're also shockingly cheap. Alexandria is an open-source project to embed and release (for free) large datasets in research, law, medicine and more.
Today, I'm announcing Alexandria, an open-source initiative to embed the internet.
To start, we're releasing the embeddings for every research paper on arXiv. That's over 4m items, 600m tokens, and 3.07 billion vector dimensions.
We're not stopping here.
A significant number of the world's problems are just search, clustering, recommendation, or classification; all things embeddings are great at.
For example, finding research papers via keywords is hard when there are 10 words that mean the same thing. Embeddings make this easy.
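To make the search case concrete, here's a minimal sketch of similarity search over embeddings. The three-dimensional vectors and paper titles below are made up for illustration (real embeddings have hundreds of dimensions and come from a model), and the `search` helper is hypothetical, not part of any Alexandria release.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; near 1.0 means 'similar meaning'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, corpus, top_k=2):
    """Return the top_k titles whose vectors are closest to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), title)
              for title, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]

# ASSUMPTION: toy, hand-written vectors standing in for real embeddings.
TOY_CORPUS = {
    "Deep neural networks for image recognition": [0.9, 0.1, 0.0],
    "Convolutional nets for visual classification": [0.8, 0.2, 0.1],
    "Monetary policy and inflation": [0.0, 0.1, 0.9],
}

# A query vector that sits near the two vision papers, even though
# their titles share almost no keywords with each other.
query = [0.85, 0.15, 0.05]
results = search(query, TOY_CORPUS, top_k=2)
print(results)
```

The two vision papers rank above the economics paper because their vectors point in nearly the same direction as the query, regardless of which words the titles use.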
Embeddings are also a one-time cost and are incredibly cheap. In most cases, you'll never need to compute the same document twice.
At the moment, we're embedding tokens at high performance for $1 per 100,000,000 tokens. That's the length of the Bible, 10 times, per dollar.
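As a rough back-of-the-envelope check, the quoted rate can be turned into a cost estimate for any dataset. This sketch assumes the $1 per 100,000,000 tokens figure holds flat at any scale, which may not be true in practice; the function name is hypothetical.

```python
# Quoted rate from the post: $1 buys 100,000,000 tokens of embedding.
TOKENS_PER_DOLLAR = 100_000_000

def embedding_cost_usd(num_tokens):
    """Estimated one-time cost in USD, assuming the quoted flat rate."""
    return num_tokens / TOKENS_PER_DOLLAR

# The arXiv release is described as roughly 600m tokens.
arxiv_tokens = 600_000_000
arxiv_cost = embedding_cost_usd(arxiv_tokens)
print(f"Estimated cost: ${arxiv_cost:.2f}")
```

At that rate, the whole arXiv set works out to single-digit dollars, and since each document only needs to be embedded once, the cost isn't recurring.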
I was surprised when I couldn't find any open embedding datasets (research, law, finance, etc.), considering the immense value and low cost.
There's too much to be built here... so we're building an org. and doing it ourselves.
You can download the arXiv embeddings (titles and abstracts, 6 GB and 8 GB respectively) at the link above. There are a lot of datasets to choose from, so we need your help to figure out what to work on next. Let us know by voting!
Note:
Embeddings are most often used for search / question answering, so we're building those ourselves. Our arXiv embedding search launches next week, with more to come.
We're also experimenting with an AI agent personal research assistant that helps you learn, teach, and publish.
@chrismessina For now, we're running this as a non-profit public good, but there's a future angle around generating revenue from search tools + our ChatGPT plugin.
MIT. The license is still being finalized as it relates to the contents of the papers themselves, but everything should be under a permissive license that allows commercial use. More info soon.