OpenAI o3 and o4-mini are reasoning models that think with images & agentically use tools (Search, Code, DALL-E). SOTA multimodal performance. Available in ChatGPT & API.
OpenAI has released its next-generation reasoning models, o3 and o4-mini, and they come with two major advancements.
First, they introduce "Thinking with Images". These models don't just see images; they can apparently manipulate them (crop, zoom, rotate via internal tools) as part of their reasoning process to solve complex visual problems, even with imperfect inputs like diagrams or handwriting. This is pushing them to SOTA on multimodal benchmarks.
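If you want to poke at the visual reasoning yourself, here's a minimal sketch of sending an image to o3 through the Chat Completions API. The model name, the example image URL, and image-input access on your account are assumptions on my part, so double-check against the official docs.

```python
# Minimal sketch: asking o3 to reason over an image via the OpenAI Python SDK.
# Assumes the `openai` package is installed, OPENAI_API_KEY is set, and your
# account has access to the "o3" model with image inputs.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",  # or "o4-mini" for a faster, cheaper run
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What does this hand-drawn diagram describe? Walk through it step by step.",
                },
                {
                    "type": "image_url",
                    # Placeholder URL -- swap in your own diagram or photo.
                    "image_url": {"url": "https://example.com/whiteboard-sketch.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```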
Second, building on that, they have full agentic tool access. They intelligently decide when and how to use all available tools: Search, Code Interpreter, DALL-E, and these new image manipulation capabilities, often chaining them together to tackle multi-faceted tasks.
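And for the agentic side, here's a rough sketch of letting the model decide when to search via the Responses API. The `web_search_preview` tool type and the exact field names are assumptions based on OpenAI's published docs, so verify them before relying on this.

```python
# Sketch only: letting the model choose if/when to call a built-in web search tool.
# Assumes the Responses API and the "web_search_preview" tool type are enabled
# for your account; names may differ in the current API docs.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",
    tools=[{"type": "web_search_preview"}],  # the model decides whether to search
    input="Find the latest reported multimodal benchmark results for o3 and summarize them.",
)

print(response.output_text)
```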
Quick model comparison (because the naming is a puzzle, right? :)):
o3: The most powerful, particularly excels at complex visual reasoning.
o4-mini: Optimized for speed and efficiency, still very capable across tasks.
Both models follow instructions better and hold more natural conversations using memory. They're rolling out now in ChatGPT and available via the API. (OpenAI also released the open-source Codex CLI alongside them.)
Replies
Loving the visual reasoning features! 👍
The capabilities of o3 and o4-mini are seriously impressive—multimodal reasoning combined with tool usage takes things to a whole new level. Exciting to see how these models expand what's possible with real-time search, code execution, and visual understanding. Huge leap forward!