
OpenAdapt Architecture (draft)

Richard Abrich edited this page Dec 19, 2023 · 18 revisions

Background

OpenAdapt is the open source AI-first Process Automation library. We have created, tested, and documented a number of reusable software components that we believe can serve as building blocks for AI-first Process Automation. We provide these components to the community at no cost (MIT license).

We are seeking feedback on our proposed process automation architecture (below).

Overview

  1. Client is installed on user's desktop computer (Windows or Mac)
  2. User triggers "start recording" via Tray Icon to start recording time-aligned user Action events (mouse/keyboard), associated Screenshots, and active Window State (retrieved from operating system accessibility API)
  3. User triggers "stop recording" via Tray Icon to stop recording
  4. Operating-system level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into parent events (e.g. a single mouse position)
  5. Protected Health Information (PHI) / Personally Identifiable Information (PII) is scrubbed from all recorded data
  6. Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643) and Marks are overlaid on objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441).
  7. Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. "scroll_in_options_tab_until_save_button()", "click_save_button()")
  8. LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description.
  9. Next Action is played
  10. LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed
  11. If successfully completed, advance to the next Process Step and continue from step 8. Otherwise, start a Recording, and alert the user that assistance is required.
  12. If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon.
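
The replay portion of the overview (steps 8 through 11) can be sketched as a simple loop. This is a minimal illustration, not OpenAdapt's implementation: `replay`, `ReplayResult`, and the three callables are hypothetical names standing in for the LLM/LMM prompts and the action player.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReplayResult:
    completed_steps: List[str] = field(default_factory=list)
    needs_assistance_at: Optional[str] = None  # step where the user must take over


def replay(
    steps: List[str],
    generate_action: Callable[[str], str],    # step 8: prompt model for the Next Action
    play_action: Callable[[str], None],       # step 9: play the Next Action
    is_step_complete: Callable[[str], bool],  # step 10: prompt model to check completion
) -> ReplayResult:
    """Hypothetical replay loop over the Process Steps of a Process Description."""
    result = ReplayResult()
    for step in steps:
        play_action(generate_action(step))
        if not is_step_complete(step):
            # Step 11: stop and hand control back to the user.
            result.needs_assistance_at = step
            break
        result.completed_steps.append(step)
    return result
```

In this sketch a failed step ends the loop; starting a new Recording and resuming after the user's corrective action (steps 11 and 12) would wrap around it.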

Event Merging

Event Merging optimizes user interaction data by applying a series of functions:

  • merge_consecutive_keyboard_events
  • merge_consecutive_mouse_move_events
  • merge_consecutive_mouse_scroll_events
  • remove_redundant_mouse_move_events
  • merge_consecutive_mouse_click_events

These functions condense similar events, creating a cleaner and more efficient dataset.
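
As an illustration of what one of these functions might do, the sketch below collapses each run of consecutive mouse move events into a single parent event at the final position. The `Event` class and the merging rule are assumptions made for this example and may differ from OpenAdapt's actual data model and logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    # Hypothetical, simplified event record.
    name: str                                  # e.g. "move", "click", "press"
    timestamp: float
    x: Optional[float] = None
    y: Optional[float] = None
    children: Optional[List["Event"]] = None   # original events, kept for reference


def merge_consecutive_mouse_move_events(events: List[Event]) -> List[Event]:
    """Replace each run of two or more "move" events with one parent event."""
    merged: List[Event] = []
    run: List[Event] = []

    def flush() -> None:
        if len(run) == 1:
            merged.append(run[0])
        elif run:
            last = run[-1]
            merged.append(
                Event("move", run[0].timestamp, last.x, last.y, children=list(run))
            )
        run.clear()

    for event in events:
        if event.name == "move":
            run.append(event)
        else:
            flush()
            merged.append(event)
    flush()
    return merged
```

Keeping the original events as `children` is one way to make the reduction reversible; whether that is desirable depends on storage constraints.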

PHI/PII Scrubbing

PHI/PII Scrubbing uses tools like AWS Comprehend, Presidio, and Private AI for data anonymization. Local models provide assessments of the more advanced capabilities of hosted models. Comprehensive and user-friendly visualizations are provided.
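
To make the idea concrete, here is a deliberately simple, regex-based stand-in for text scrubbing. It does not use AWS Comprehend, Presidio, or Private AI, and the patterns (email, US-style SSN and phone number) are assumptions chosen for illustration; real scrubbing providers detect far more entity types.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scrub_text(text: str) -> str:
    """Replace each match with a placeholder naming the entity type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

The same placeholder approach would need to extend to screenshots (e.g. by redacting regions where such text is detected), which this sketch does not attempt.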

Segment Anything

Segment Anything is hosted on an EC2 server, with easy infrastructure deployment and teardown via the openadapt CLI and SDK. It is used to enable Set-of-Mark prompting. Offline alternatives (e.g. LLaVA) are available for development and testing.
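
Once segments are available, Set-of-Mark prompting needs each one to carry a visible label. The sketch below assumes segments arrive as `(x, y, width, height)` bounding boxes, which is a simplification of segmentation mask output, and assigns numeric marks in reading order. `assign_marks` is a hypothetical helper, not part of the openadapt CLI or SDK.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed format


def assign_marks(boxes: List[Box]) -> List[Dict]:
    """Number segments top-to-bottom, then left-to-right, anchoring each mark at the box center."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [
        {
            "mark": index + 1,
            "box": box,
            "anchor": (box[0] + box[2] / 2, box[1] + box[3] / 2),
        }
        for index, box in enumerate(ordered)
    ]
```

The resulting anchors are where the numeric labels would be drawn on the screenshot before it is sent to the model.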

Process Description via Set-of-Mark + Chain of Code Prompting

A repeatable and versioned Process Description is extracted by combining Set-of-Mark and Chain-of-Code prompting:

  1. Set-of-Mark Analysis: Marks key objects in screenshots.
  2. Chain of Code Prompting: Generates high-level Python code, with function calls as process steps (e.g., "scroll_in_options_tab_until_save_button()", "click_save_button()").
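
Because the Process Description is Python code, its Process Steps can be recovered by parsing rather than executing it. The sketch below uses the standard library `ast` module; the example description (including the `save_options` wrapper and the `open_options_tab` step) is invented for illustration.

```python
import ast
from typing import List

# Hypothetical Process Description, as a model might generate it.
PROCESS_DESCRIPTION = '''
def save_options():
    open_options_tab()
    scroll_in_options_tab_until_save_button()
    click_save_button()
'''


def extract_process_steps(source: str) -> List[str]:
    """Return the names of top-level call statements, in source order."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
        ):
            calls.append((node.lineno, node.value.func.id))
    return [name for _, name in sorted(calls)]
```

Parsing keeps the description inert: the step names can be versioned, diffed, and mapped to actions without running model-generated code.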

Process Step Completion Criteria

A Process Step is complete when a path from Start Step to End Step in the Process Graph is traversed. Nodes represent Process Steps, edges represent Completion Criteria. Completion Criteria are determined through Set-of-Mark + Chain-of-Code prompting.
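
One way to picture the Process Graph is as an adjacency mapping in which each edge carries a Completion Criterion predicate. The sketch below is an assumption about how such a graph could be represented and walked; `traverse`, the graph layout, and the state dictionary are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Edge = Tuple[Callable[[State], bool], str]  # (Completion Criterion, next step)


def traverse(
    graph: Dict[str, List[Edge]], start: str, end: str, state: State
) -> Tuple[List[str], bool]:
    """Follow the first satisfied criterion from each node; report the path and whether `end` was reached."""
    path = [start]
    current = start
    # Bound the number of hops so a cyclic graph cannot loop forever.
    for _ in range(len(graph) + 1):
        if current == end:
            return path, True
        for criterion, next_step in graph.get(current, []):
            if criterion(state):
                current = next_step
                path.append(next_step)
                break
        else:
            # No criterion satisfied: the current step is not complete.
            return path, False
    return path, current == end
```

In practice the predicates would be backed by the prompts described in the next section rather than by a static dictionary.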

Example Prompt Chain for Determining Process Step Completion Criteria

  1. Set-of-Mark Analysis Prompt:

    • "Identify and mark key elements in the current screenshot that indicate the completion of the current Process Step."
  2. Chain-of-Code Analysis Prompt:

    • "Given the marks on the screenshot, generate Python code to verify if the specific conditions for the Process Step completion are met."
    • Example: If the Process Step is 'click_save_button', the code might check for a confirmation message or a change in the button state.
  3. Completion Criteria Validation Prompt:

    • "Based on the Python code, determine if the current Process Step has been successfully completed and describe the outcome."
    • This step involves executing the generated code and interpreting its results to confirm if the Process Step criteria have been satisfied.
    • Alternatively, prompt a Large Multimodal Model with the current application state (e.g. latest screenshot, active window states, open files/sockets) to determine whether the Completion Criteria have been satisfied.
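
The prompt chain above can be sketched as three sequential model calls, each feeding the next. `ask_model` is a placeholder for whatever LLM/LMM client is used, and the instruction to begin the answer with 'yes' or 'no' is an assumption added here so the verdict can be parsed. The sketch passes the generated code to the model as text and does not execute it.

```python
from typing import Callable

SET_OF_MARK_PROMPT = (
    "Identify and mark key elements in the current screenshot that indicate "
    "the completion of the current Process Step."
)
CHAIN_OF_CODE_PROMPT = (
    "Given the marks on the screenshot, generate Python code to verify if the "
    "specific conditions for the Process Step completion are met."
)
VALIDATION_PROMPT = (
    "Based on the Python code, determine if the current Process Step has been "
    "successfully completed and describe the outcome."
)


def check_step_completion(step: str, ask_model: Callable[[str], str]) -> bool:
    """Run the three-prompt chain and parse a yes/no verdict (hypothetical helper)."""
    marks = ask_model(f"Process Step: {step}\n{SET_OF_MARK_PROMPT}")
    code = ask_model(f"Marks: {marks}\n{CHAIN_OF_CODE_PROMPT}")
    verdict = ask_model(
        f"Code:\n{code}\n{VALIDATION_PROMPT}\n"
        "Begin your answer with 'yes' or 'no'."  # assumption, for parseability
    )
    return verdict.strip().lower().startswith("yes")
```

A real client would also attach the Marked Screenshot to the first call; that is omitted here to keep the example self-contained.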

Request for Comments

Please submit comments/questions at https://github.com/OpenAdaptAI/OpenAdapt/discussions/552 🙏
