Zac Zuo

SmolDocling - 256M VLM for end-to-end document AI

SmolDocling, from Hugging Face and IBM Research, is an ultra-compact (256M-parameter) open VLM for end-to-end document conversion. It extracts text, layout, tables, code, and more from document images.

Zac Zuo
Hunter

Hi everyone!

Check out SmolDocling, a new open-source vision-language model from Hugging Face and IBM Research! True to its name, it's incredibly small (only 256M parameters!), yet it's designed for full, end-to-end document conversion.

You feed it an image of a document page (a scanned PDF, a photo, etc.), and it outputs a structured representation (called "DocTags") that includes everything:

📝 Text (OCR): It extracts the text, of course.
📑 Layout: It understands the page layout (paragraphs, headings, lists, etc.).
📊 Tables: It extracts table structure and content.
💻 Code: It recognizes and formats code blocks (with indentation!).
➕ Equations: It handles mathematical formulas.
🖼️ Figures: It identifies figures and links captions.

The key is that it does all of this in a single model, end-to-end, unlike traditional approaches that use separate OCR, layout analysis, and table extraction tools. And it does it with a model that's tiny compared to most VLMs.
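To give a rough feel for what "one structured output per page" means downstream, here is a minimal sketch in Python. Big caveat: the tag names (`heading`, `text`, `code`), the `<loc_N>` coordinate tokens, and the `SAMPLE` string are simplified assumptions made up for illustration, not the real DocTags grammar, and `parse_tagged_page` is a hypothetical helper, not part of any SmolDocling or Docling API. It only shows how a single tagged string could be split into typed elements with bounding boxes.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified tag vocabulary for illustration only;
# the real DocTags format may use different tags and tokens.
ELEMENT_RE = re.compile(r"<(?P<tag>[a-z_]+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)
LOC_RE = re.compile(r"<loc_(\d+)>")


@dataclass
class Element:
    kind: str       # element type, e.g. "heading", "text", "code"
    bbox: tuple     # location tokens as integers, in the order they appear
    content: str    # element body with location tokens removed


def parse_tagged_page(markup: str) -> list:
    """Split a flat tagged string into typed elements (no nesting support)."""
    elements = []
    for match in ELEMENT_RE.finditer(markup):
        body = match.group("body")
        coords = tuple(int(v) for v in LOC_RE.findall(body))
        content = LOC_RE.sub("", body).strip()
        elements.append(Element(match.group("tag"), coords, content))
    return elements


# Made-up sample output for one page (assumption, not real model output).
SAMPLE = (
    "<heading><loc_10><loc_12><loc_200><loc_30>Quarterly Report</heading>\n"
    "<text><loc_10><loc_40><loc_200><loc_90>Revenue grew this quarter.</text>\n"
    "<code><loc_10><loc_100><loc_200><loc_140>print('hi')</code>"
)

if __name__ == "__main__":
    for el in parse_tagged_page(SAMPLE):
        print(el.kind, el.bbox, repr(el.content))
```

In a multi-tool pipeline you would have to merge OCR text, layout boxes, and table output yourself; with a single tagged output, one pass like this is enough to get typed elements. For real use, the actual DocTags output should be handled with the tooling the project provides rather than a regex like this.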

It's built on SmolVLM (also open-source) and achieves results competitive with models many times its size.

You can try SmolDocling yourself here.

Zac Zuo
Hunter

@hamza_afzal_butt Good question! It's primarily English-focused, but the OCR should handle other languages. Best to test it with your specific documents, though, as mixed-language performance isn't specifically benchmarked.

Jun Shen

Automated document parsing is a great solution! 👀

Denis Sigal

I used Docling a couple of months ago and it was already cool. Now this mini version sounds even cooler!