SmolDocling, from Hugging Face and IBM Research, is an ultra-compact (256M-parameter) open VLM for end-to-end document conversion. It extracts text, layout, tables, code, and more from images.
Replies
Hi everyone!
Check out SmolDocling, a new open-source vision-language model from Hugging Face and IBM Research! True to its name, it's incredibly small – only 256M parameters! – yet it's designed for full, end-to-end document conversion.
You feed it an image of a document page (a scanned PDF, a photo, etc.), and it outputs a structured representation (called "DocTags") that includes everything:
📝 Text (OCR): It extracts the text, of course.
📑 Layout: It understands the page layout (paragraphs, headings, lists, etc.).
📊 Tables: It extracts table structure and content.
💻 Code: It recognizes and formats code blocks (with indentation!).
➕ Equations: It handles mathematical formulas.
🖼️ Figures: It identifies figures and links captions.
The key is that it does all of this in a single model, end-to-end, unlike traditional approaches that use separate OCR, layout analysis, and table extraction tools. And it does it with a model that's tiny compared to most VLMs.
It's built on SmolVLM (also open-source) and achieves competitive results with models many times its size.
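To make the "structured representation" idea concrete: once a page comes back as typed blocks instead of raw text, downstream conversion is mostly mechanical. The sketch below is purely illustrative — the block types and field names are hypothetical, not the actual DocTags schema — but it shows why a single structured output (headings, tables, code, all in one pass) is so convenient to consume:

```python
# Hypothetical structured blocks a document-conversion model might emit.
# NOTE: the "type"/"content"/"level" field names are illustrative only;
# they are NOT the real DocTags schema.
def blocks_to_markdown(blocks):
    """Render a list of typed document blocks as a Markdown string."""
    lines = []
    for block in blocks:
        kind = block["type"]
        content = block["content"]
        if kind == "heading":
            lines.append("#" * block.get("level", 1) + " " + content)
        elif kind == "paragraph":
            lines.append(content)
        elif kind == "code":
            lines.append("```\n" + content + "\n```")
        elif kind == "table":
            # content is a list of rows; first row is the header
            header, *rows = content
            lines.append("| " + " | ".join(header) + " |")
            lines.append("|" + " --- |" * len(header))
            for row in rows:
                lines.append("| " + " | ".join(row) + " |")
    return "\n\n".join(lines)

page = [
    {"type": "heading", "level": 1, "content": "Results"},
    {"type": "paragraph", "content": "Accuracy improved."},
    {"type": "table", "content": [["model", "params"], ["SmolDocling", "256M"]]},
]
print(blocks_to_markdown(page))
```

The point of the sketch: with a traditional pipeline you'd be stitching together outputs from separate OCR, layout, and table tools before you could write a renderer this simple.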
You can try SmolDocling yourself here.
@zaczuo any built-in support for multiple languages or specialized vocabularies? I would love to try it on academic journals that mix English text with foreign-language citations.
@hamza_afzal_butt Good question! It's primarily English-focused, but the OCR should handle other languages. Best to test it with your specific documents, though, as mixed-language performance isn't specifically benchmarked.
Automated document parsing is a great solution! 👀
I used Docling a couple of months ago and it was already cool; this mini version sounds even cooler!