OpenAdapt Architecture (draft)
OpenAdapt is the open source AI-First Process Automation library. We have created, tested, and documented a number of reusable software components that we believe can serve as building blocks for AI-First Process Automation. We provide these components to the community at no cost (MIT license).
We are seeking feedback on our proposed process automation architecture (below).
1. Client is installed on user's desktop computer (Windows or Mac)
2. User triggers "start recording" via Tray Icon to start recording time-aligned user Action events (mouse/keyboard), associated Screenshots, and active Window State (retrieved from the operating system accessibility API)
3. User triggers "stop recording" via Tray Icon to stop recording
4. Operating-system level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into Process-level events (e.g. a single mouse position)
5. Protected Health Information (PHI) / Personally Identifiable Information (PII) is scrubbed from all recorded data
6. Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643) and Marks are overlaid on objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441)
7. Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. "scroll_in_options_tab_until_save_button()", "click_save_button()")
8. LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description
9. Next Action is played
10. LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed
11. If successfully completed, advance to the next Process Step and continue from step 8. Otherwise, start a Recording and alert the user that assistance is required.
12. If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon.
13. Recordings and other data are optionally transferable peer-to-peer via a decentralized and safe open source protocol (magic wormhole).
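The replay portion of the steps above (generate the Next Action, play it, check the Process Step, then advance or request assistance) can be sketched as a simple loop. This is only an illustration under stated assumptions: `replay` and its three callbacks are hypothetical names, and the callbacks stand in for the LLM/LMM prompts and the playback mechanism, which are not specified here.

```python
from typing import Callable, Dict, List, Optional


def replay(
    process_steps: List[str],
    generate_next_action: Callable[[str], str],
    play_action: Callable[[str], None],
    step_completed: Callable[[str], bool],
) -> Dict[str, Optional[str]]:
    """Play one Next Action per Process Step; stop when a step is not completed."""
    for step in process_steps:
        action = generate_next_action(step)  # stand-in for prompting an LLM/LMM
        play_action(action)                  # stand-in for mouse/keyboard playback
        if not step_completed(step):         # stand-in for the completion prompt
            # Per the steps above, a Recording would start here and the user is alerted.
            return {"status": "assistance_required", "failed_step": step}
    return {"status": "completed", "failed_step": None}


# Usage with stub callbacks (no model or desktop involved).
played: List[str] = []
result = replay(
    ["scroll_in_options_tab_until_save_button", "click_save_button"],
    generate_next_action=lambda step: f"action_for:{step}",
    play_action=played.append,
    step_completed=lambda step: True,
)
print(result["status"])  # completed
```

Passing the prompts in as callbacks keeps the control flow testable without a model; whether a single Next Action per Process Step is enough in practice is left open by the description above.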
Event Merging optimizes user interaction data by applying a series of functions:
- merge_consecutive_keyboard_events
- merge_consecutive_mouse_move_events
- merge_consecutive_mouse_scroll_events
- remove_redundant_mouse_move_events
- merge_consecutive_mouse_click_events
These functions condense similar events, creating a cleaner and more efficient dataset.
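To make the idea concrete, here is a minimal sketch of what merging consecutive mouse move events could look like. The function name comes from the list above, but the `Event` type and the "keep the last position of each run" rule are assumptions made for illustration, not OpenAdapt's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Hypothetical, simplified Action event."""
    name: str          # e.g. "move", "click", "press"
    timestamp: float
    x: float = 0.0
    y: float = 0.0


def merge_consecutive_mouse_move_events(events: List[Event]) -> List[Event]:
    """Collapse each run of consecutive "move" events into its final event."""
    merged: List[Event] = []
    for event in events:
        if event.name == "move" and merged and merged[-1].name == "move":
            merged[-1] = event  # keep only the last position of the run
        else:
            merged.append(event)
    return merged


raw = [
    Event("move", 0.00, 1, 1),
    Event("move", 0.01, 2, 2),
    Event("click", 0.02, 2, 2),
    Event("move", 0.03, 3, 3),
]
reduced = merge_consecutive_mouse_move_events(raw)
print([e.name for e in reduced])  # ['move', 'click', 'move']
```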
PHI/PII Scrubbing uses tools like AWS Comprehend, Presidio, and Private AI for data anonymization. Local models provide assessments of the more advanced capabilities of hosted models. Comprehensive and user-friendly visualizations are provided.
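As a rough illustration of what scrubbing text means, the sketch below replaces matches with placeholder labels. It deliberately uses two simple regular expressions as a stand-in for the tools named above (AWS Comprehend, Presidio, Private AI); the patterns, the `scrub_text` name, and the placeholder format are assumptions, and real detection is considerably more involved.

```python
import re

# Hypothetical, intentionally simple patterns; a real scrubber covers many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_text(text: str) -> str:
    """Replace each detected entity with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(scrub_text("Mail jane.doe@example.com or call 555-123-4567"))
# Mail <EMAIL> or call <PHONE>
```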
Segment Anything is hosted on an EC2 server, with easy infrastructure deployment and teardown via the openadapt CLI and SDK. It is used to enable Set-of-Mark prompting. Offline alternatives (e.g. LLaVA) are available for development and testing.
A repeatable and versioned Process Description is extracted by combining Set-of-Mark and Chain-of-Code prompting:
- Set-of-Mark Analysis: Marks key objects in screenshots.
- Chain-of-Code Prompting: Generates high-level Python code, with function calls as Process Steps (e.g., "scroll_in_options_tab_until_save_button()", "click_save_button()").
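Since a Process Description is high-level Python code whose function calls are the Process Steps, the steps can be recovered by parsing that code. The sketch below assumes a hypothetical Process Description (the `save_settings` wrapper and `open_options_tab` are invented for illustration; the other two calls are the examples given above) and uses Python's standard `ast` module to list the calls in source order.

```python
import ast
from typing import List

# Hypothetical Process Description, as an LLM/LMM might produce it.
PROCESS_DESCRIPTION = '''
def save_settings():
    open_options_tab()
    scroll_in_options_tab_until_save_button()
    click_save_button()
'''


def extract_process_steps(source: str) -> List[str]:
    """Return the names of plain function calls, ordered by position in the source."""
    tree = ast.parse(source)
    calls = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    ]
    calls.sort(key=lambda call: (call.lineno, call.col_offset))
    return [call.func.id for call in calls]


print(extract_process_steps(PROCESS_DESCRIPTION))
# ['open_options_tab', 'scroll_in_options_tab_until_save_button', 'click_save_button']
```

Because the description is ordinary source text, it can also be stored and diffed under version control, which is one plausible reading of "repeatable and versioned".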
A Process is complete when a path from Start Step to End Step in the Process Graph is traversed. Nodes represent Process Steps, edges represent Completion Criteria. Completion Criteria are determined through Set-of-Mark + Chain-of-Code prompting.
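One way to picture the Process Graph is as an adjacency map in which each edge carries a Completion Criterion. The sketch below is an assumption-laden illustration: the `Graph` type, the `traverse` function, the dictionary-based application state, and the interpretation that an edge is followed once its criterion holds are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# node -> list of (next node, Completion Criterion evaluated on the application state)
Graph = Dict[str, List[Tuple[str, Callable[[dict], bool]]]]


def traverse(graph: Graph, start: str, end: str, state: dict, max_hops: int = 100) -> List[str]:
    """Follow the first edge whose criterion holds; return the path visited so far."""
    path = [start]
    for _ in range(max_hops):  # guard against cycles
        current = path[-1]
        if current == end:
            break
        next_node = next(
            (target for target, criterion in graph.get(current, []) if criterion(state)),
            None,
        )
        if next_node is None:
            break  # no criterion satisfied: traversal stalls at this step
        path.append(next_node)
    return path


graph: Graph = {
    "start": [("click_save_button", lambda s: s["options_tab_open"])],
    "click_save_button": [("end", lambda s: s["confirmation_visible"])],
}

done = traverse(graph, "start", "end", {"options_tab_open": True, "confirmation_visible": True})
print(done)  # ['start', 'click_save_button', 'end']
```

A traversal that stops before reaching the end node identifies the step at which assistance would be requested.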
Set-of-Mark Analysis Prompt:
- "Identify and mark key elements in the current screenshot that indicate the completion of the current Process Step."
Chain-of-Code Analysis Prompt:
- "Given the marks on the screenshot, generate Python code to verify if the specific conditions for the Process Step completion are met."
- Example: If the Process Step is 'click_save_button', the code might check for a confirmation message or a change in the button state.
Completion Criteria Validation Prompt:
- "Based on the Python code, determine if the current Process Step has been successfully completed and describe the outcome."
- This step involves executing the generated code and interpreting its results to confirm if the Process Step criteria have been satisfied.
- Alternatively, prompt a Large Multimodal Model with the current application state (e.g. latest screenshot, active window states, open files/sockets) to determine whether the Completion Criteria have been satisfied.
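The "execute the generated code and interpret its results" step could look roughly like the sketch below. Everything here is assumed for illustration: the generated snippet, the `step_completed` function name it is expected to define, and the shape of the `state` dictionary. Running model-generated code with `exec` is shown only for brevity; a real system would presumably need sandboxing.

```python
from typing import Any, Dict

# Hypothetical code returned by the Chain-of-Code Analysis Prompt for 'click_save_button':
# it checks for a confirmation message or a change in the button state.
GENERATED_CHECK = '''
def step_completed(state):
    return "Saved" in state["visible_text"] or not state["save_button_enabled"]
'''


def run_completion_check(code: str, state: Dict[str, Any]) -> bool:
    """Execute generated code in a fresh namespace and call its step_completed()."""
    namespace: Dict[str, Any] = {}
    exec(code, namespace)  # assumption: the code defines step_completed(state)
    return bool(namespace["step_completed"](state))


state = {"visible_text": "Settings Saved", "save_button_enabled": True}
print(run_completion_check(GENERATED_CHECK, state))  # True
```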
Please submit comments/questions at https://github.com/OpenAdaptAI/OpenAdapt/discussions/552 🙏