Today, I'm announcing Alexandria, an open-source initiative to embed the internet. To start, we're releasing the embeddings for every research paper on arXiv. That's over 4M items, 600M tokens, and 3.07 billion vector dimensions. We're not stopping here.

A significant number of the world's problems are just search, clustering, recommendation, or classification: all things embeddings are great at. For example, finding research papers via keywords is hard when there are 10 words that mean the same thing. Embeddings make this easy.

Embeddings are also a one-time cost and are incredibly cheap. In most cases, you'll never need to embed the same document twice. At the moment, we're embedding tokens at high performance for $1 per 100,000,000 tokens. That's the length of the Bible, 10 times, per dollar.

I was surprised when I couldn't find any open embedding datasets (research, law, finance, etc.), considering the immense value and low cost. There's too much to be built here... so we're building an org. and doing it ourselves.

You can download the arXiv embeddings (titles and abstracts, 6 GB and 8 GB respectively) at the link above. There are a lot of datasets to choose from, so we need your help to figure out what to work on next. Let us know by voting!

Note: Embeddings are most often used for search / question answering, so we're building those ourselves. Our arXiv embedding search launches next week, with more to come. We're also experimenting with an AI agent personal research assistant that helps you learn, teach, and publish.
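To make the keyword-versus-embedding point concrete, here is a minimal sketch of how embedding search generally works: rank documents by cosine similarity between a query vector and each document vector. This is an illustration, not the Alexandria implementation. The function names, the toy titles, and the 3-dimensional vectors are all made up for the example; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus, top_k=3):
    """Return the top_k titles from (title, vector) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in corpus]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]


# Hypothetical toy vectors (assumption: hand-written, not produced by a model).
# The two vision papers share no keywords but point in a similar direction.
TOY_CORPUS = [
    ("Neural nets for image recognition", [0.9, 0.1, 0.0]),
    ("Deep learning approaches to vision", [0.8, 0.2, 0.1]),
    ("Monetary policy and inflation", [0.0, 0.1, 0.9]),
]

# Hypothetical embedding of a query such as "computer vision models".
QUERY = [1.0, 0.1, 0.0]

print(search(QUERY, TOY_CORPUS, top_k=2))
```

With these toy vectors, both vision papers rank above the economics paper even though their titles use different words, which is the behavior keyword search struggles to deliver. At the scale of millions of papers, a brute-force loop like this would typically be replaced by a vectorized or approximate nearest-neighbor index.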

Will DePue

Maker

2yr ago

Chris Messina

Ambassador

Hunter

@willdepue this is pretty rad — what's your business model and what's the license on what you've released?

@chrismessina For now, we're running this as a non-profit public good, but there's a future angle around generating revenue from search tools and our ChatGPT plugin. The license is MIT. Licensing as it relates to the contents of the papers themselves is still being finalized, but everything should be under a permissive license that allows commercial use. More info soon.

Mert Deveci

GodmodeHQ ✊ | Custom AI agents for sales

The tabs are not working - are they not live? When will the SEC filings be live?

Leah Baloyi

This is crazy 🤩 Congrats on your launch.

The Alexandria Index

Massive internet datasets, embedded, open-sourced, and free
