List of AI tools that can interact with user interfaces. PRs welcome.
These are VLMs that support pointing / bounding boxes for user interaction; a minimal usage sketch follows the list.
- Qwen 2.5-VL (Jan 2025): Open-weight VLM with grounding support; it can return bounding boxes and points for objects in absolute pixel coordinates.
- Moondream: Small open-weight VLM with pointing and object detection capabilities.
- Llama 3.2 (Sep 2024): The two largest models in the Llama 3.2 collection, 11B and 90B, support image reasoning use cases such as document-level understanding (including charts and graphs), image captioning, and visual grounding tasks such as directionally pinpointing objects in images based on natural language descriptions.
- Molmo (Sep 2024): Open VLM family from Ai2 that matches GPT-4V performance and can point at objects in images.
- CogAgent (Dec 2023): Open-source visual language model that can identify regions and points of UIs to interact with.
- Florence-2 (Nov 2023): Vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks, including producing bounding boxes.
- OpenAI Operator (Jan 2025): Backed by the Computer-Using Agent (CUA) model.
- Claude 3.5 Computer Use (Oct 2024): Version of Claude 3.5 that supports computer use: it takes structured text and screenshot tool inputs and returns actionable text outputs such as mouse and keyboard commands.
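To make the pointing capability concrete, here is a minimal sketch of asking Qwen 2.5-VL for the bounding box of a UI element via Hugging Face transformers. The preprocessing follows the official model card; the screenshot path, prompt wording, and the exact JSON shape of the answer are illustrative assumptions.

```python
# Minimal sketch: asking a pointing-capable VLM (Qwen 2.5-VL) for the
# bounding box of a UI element. Screenshot path, prompt, and output shape
# are illustrative; preprocessing follows the official model card.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},  # hypothetical screenshot
        {"type": "text",
         "text": "Locate the 'Submit' button and output its bounding box in JSON format."},
    ],
}]

# Render the chat template, batch the image with the prompt, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. {"bbox_2d": [x1, y1, x2, y2], "label": "Submit button"}
```

The returned coordinates can then be fed directly to a mouse/keyboard automation layer, which is the pattern most of the agent frameworks below build on.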
These are tools, agent frameworks, and cookbooks built on top of such models.
- Qwen 2.5-VL Cookbook: Official examples for the model, including grounding and agent use cases.
- OpenAdapt.AI: AI-first process automation with large language (LLM), action (LAM), multimodal (LMM), and visual language (VLM) models.
- ScreenAgent: VLM-driven agent for controlling a real computer screen.
- Mobile-Agent: Autonomous multimodal agent for operating mobile device UIs.
- UI-ACT: An AI agent for interacting with a computer using the graphical user interface
- OpenInterpreter: Uses code to interact with the operating system.
- AIOS: LLM agent operating system; can interact with the operating system as a backend.
- Manus AI (Mar 2025): General-purpose autonomous agent.
- Claude 3.5 Computer Use Cookbook: Anthropic's reference demo for the computer use tools (see the sketch after this list).
- Adept: Company looking to automate user interface interaction through ML
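As a companion to the Claude 3.5 Computer Use entries above, here is a hedged sketch of one turn of the computer-use loop: the model replies with tool_use blocks describing concrete actions, and the client executes them and responds with fresh screenshots. The tool type and beta flag follow Anthropic's October 2024 documentation; the display size and prompt are illustrative, and the action executor is left to you.

```python
# Sketch of one turn of the Anthropic computer-use loop (Oct 2024 beta).
# Display size and prompt are illustrative; executing the returned actions
# (and sending back a tool_result with a screenshot) is up to your client.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # structured computer-use tool
        "name": "computer",
        "display_width_px": 1280,      # assumed virtual display size
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the settings menu."}],
    betas=["computer-use-2024-10-22"],
)

# The model emits tool_use blocks describing concrete actions, e.g.
# {"action": "screenshot"} or {"action": "left_click", "coordinate": [x, y]}.
for block in response.content:
    if block.type == "tool_use":
        print(block.input)  # hand this to your own executor, then return
                            # a tool_result (usually a fresh screenshot)
```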
These agents are still mostly text-based.
- OpenAI Operator: A system that uses the Computer-Using Agent (CUA) model to interact with the user interface in your browser, asking the user for clarification when needed (a sketch of the underlying API follows this list).
- Google Project Mariner: Browser extension that interacts with web pages.
- HyperWrite AI Agent: Browser assistant that can operate websites on the user's behalf.
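OpenAI has also exposed the CUA model behind Operator as computer-use-preview in the Responses API. A minimal sketch of one request, assuming preview access; the display size, environment, and prompt are illustrative, and your client is responsible for executing each returned action and replying with a screenshot until the task completes.

```python
# Sketch of driving OpenAI's Computer-Using Agent (CUA) model via the
# Responses API. Model and tool names follow the computer-use preview docs;
# access is gated, and display size / prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser",  # the agent assumes a browser it can drive
    }],
    input=[{"role": "user", "content": "Find the pricing page on example.com."}],
    truncation="auto",  # required for computer use
)

# The output contains computer_call items (click, type, screenshot, ...);
# the client executes each action and sends back a computer_call_output
# with a fresh screenshot until the task completes.
for item in response.output:
    if item.type == "computer_call":
        print(item.action)
```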