SmolDocling, from Hugging Face and IBM Research, is an ultra-compact (256M-parameter) open VLM for end-to-end document conversion. It extracts text, layout, tables, code, and more from images.
Replies
Hi everyone!
Check out SmolDocling, a new open-source vision-language model from Hugging Face and IBM Research! True to its name, it's incredibly small – only 256M parameters! – yet it's designed for full, end-to-end document conversion.
You feed it an image of a document page (a scanned PDF, a photo, etc.), and it outputs a structured representation (called "DocTags") that includes everything:
📝 Text (OCR): It extracts the text, of course.
📑 Layout: It understands the page layout (paragraphs, headings, lists, etc.).
📊 Tables: It extracts table structure and content.
💻 Code: It recognizes and formats code blocks (with indentation!).
➕ Equations: It handles mathematical formulas.
🖼️ Figures: It identifies figures and links captions.
The key is that it does all of this in a single model, end-to-end, unlike traditional approaches that use separate OCR, layout analysis, and table extraction tools. And it does it with a model that's tiny compared to most VLMs.
It's built on SmolVLM (also open-source) and achieves competitive results with models many times its size.
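To make the "structured representation" idea concrete: once a page comes back as typed blocks instead of raw text, downstream conversion is mostly mechanical. The sketch below is purely illustrative — the block types and field names are hypothetical, not the actual DocTags schema — but it shows why a single structured output (headings, tables, code, all in one pass) is so convenient to consume:

```python
# Hypothetical structured blocks a document-conversion model might emit.
# NOTE: the "type"/"content"/"level" field names are illustrative only;
# they are NOT the real DocTags schema.
def blocks_to_markdown(blocks):
    """Render a list of typed document blocks as a Markdown string."""
    lines = []
    for block in blocks:
        kind = block["type"]
        content = block["content"]
        if kind == "heading":
            lines.append("#" * block.get("level", 1) + " " + content)
        elif kind == "paragraph":
            lines.append(content)
        elif kind == "code":
            lines.append("```\n" + content + "\n```")
        elif kind == "table":
            # content is a list of rows; first row is the header
            header, *rows = content
            lines.append("| " + " | ".join(header) + " |")
            lines.append("|" + " --- |" * len(header))
            for row in rows:
                lines.append("| " + " | ".join(row) + " |")
    return "\n\n".join(lines)

page = [
    {"type": "heading", "level": 1, "content": "Results"},
    {"type": "paragraph", "content": "Accuracy improved."},
    {"type": "table", "content": [["model", "params"], ["SmolDocling", "256M"]]},
]
print(blocks_to_markdown(page))
```

The point of the sketch: with a traditional pipeline you'd be stitching together outputs from separate OCR, layout, and table tools before you could write a renderer this simple.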
You can try SmolDocling yourself here.
@zaczuo any built-in support for multiple languages or specialized vocabularies? I would love to try it on academic journals that mix English text with foreign-language citations.
@hamza_afzal_butt Good question! It's primarily English-focused, but the OCR should handle other languages. Best to test it with your specific documents, though, as mixed-language performance isn't specifically benchmarked.
Automated document parsing is a great solution! 👀
I used Docling a couple of months ago and it was already cool; this mini version sounds even cooler!