
OmniParser ‘tokenizes’ UI screenshots from pixel spaces into structured elements in the screenshot that are interpretable by LLMs. This enables the LLMs to do retrieval based next action prediction given a set of parsed interactable elements.
OmniParser ‘tokenizes’ UI screenshots from pixel spaces into structured elements in the screenshot that are interpretable by LLMs. This enables the LLMs to do retrieval based next action prediction given a set of parsed interactable elements.
Raycast
Microsoft Research has unveiled their own Computer Use model trained on a ton of labeled screenshots.
The v2 achieves a 60% improvement in latency compared to V1 (avg latency: 0.6s/frame on A100, 0.8s on single 4090).
Really cool! Hopefully it will be ported to more languages soon!
Shram
Congrats on the launch and lots of wins to the team :)