
OpenAdapt Architecture (draft)

  1. The client is installed on the user's desktop computer (Windows or macOS).
  2. The user triggers "Start Recording" via the Tray Icon to begin recording user Action events (mouse/keyboard), associated Screenshots, and the active Window State (retrieved from the operating system's accessibility API).
  3. The user triggers "Stop Recording" via the Tray Icon to end the recording.
  4. Operating-system-level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into parent events (e.g. a single mouse position); see the event-reduction sketch after this list.
  5. Protected Health Information (PHI) and Personally Identifiable Information (PII) are scrubbed from all recorded data (see the scrubbing sketch after this list).
  6. Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643), and Marks are overlaid on the detected objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441); see the segmentation sketch after this list.
  7. Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. "scroll_in_options_tab_until_save_button()", "click_save_button()"); see the summarization sketch after this list.
  8. LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description.
  9. The Next Action is replayed.
  10. LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed.
  11. If it was, advance to the next Process Step and continue from step 8. Otherwise, start a Recording and alert the user that assistance is required.
  12. If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon. (Steps 8-12 are sketched as a control loop after this list.)
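
A minimal sketch of the event reduction in step 4, assuming a simple `ActionEvent` record; all names here are illustrative, not OpenAdapt's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ActionEvent:
    """Illustrative stand-in for a recorded OS-level event."""
    name: str            # e.g. "move", "click", "press"
    timestamp: float
    x: float | None = None
    y: float | None = None

def merge_consecutive_moves(events: list[ActionEvent]) -> list[ActionEvent]:
    """Collapse each run of consecutive mouse-move events into a single
    parent event carrying only the final cursor position."""
    merged: list[ActionEvent] = []
    for event in events:
        if event.name == "move" and merged and merged[-1].name == "move":
            merged[-1] = event  # the later sample supersedes earlier ones
        else:
            merged.append(event)
    return merged
```

At 100 Hz, this reduces a one-second mouse drag of 100 raw move events to a single parent event per run, as described in step 4.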
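One way step 5's scrubbing could be implemented is with Microsoft Presidio; this is a hedged sketch for text fields (screenshot redaction would follow the same pattern with an image redactor), and the choice of provider is an assumption, not something this page confirms:

```python
# Hedged sketch: Presidio is an assumption; the actual scrubbing
# provider and entity configuration may differ.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_text(text: str) -> str:
    """Replace detected PHI/PII entities in `text` with placeholders
    such as <PERSON> or <PHONE_NUMBER>."""
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```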
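A hedged sketch of step 6 using the official `segment_anything` package; the checkpoint filename, mark styling, and helper name are assumptions:

```python
import numpy as np
from PIL import Image, ImageDraw
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Checkpoint path is an assumption (the publicly released ViT-H weights).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def mark_screenshot(screenshot: Image.Image) -> Image.Image:
    """Segment the screenshot and overlay a numeric Mark at the center
    of each segment's bounding box for Set-of-Mark prompting."""
    image = screenshot.convert("RGB")
    masks = mask_generator.generate(np.array(image))
    draw = ImageDraw.Draw(image)
    for i, mask in enumerate(masks):
        x, y, w, h = mask["bbox"]  # XYWH bounding box of the segment
        draw.text((x + w / 2, y + h / 2), str(i), fill="red")
    return image
```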
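A hedged sketch of step 7's summarization; the prompt wording and the `complete` callable are assumptions (the only names taken from this page are the two example Process Steps):

```python
# `complete` is any text-completion callable, e.g. a thin wrapper
# around an LLM/LMM API client (hypothetical, not OpenAdapt's API).
PROMPT = """You are given a recording of user actions and screenshots.
Summarize it as high-level Python code in which each function call is
a Process Step, e.g.:

    scroll_in_options_tab_until_save_button()
    click_save_button()

Recording:
{recording}
"""

def summarize_recording(recording_text: str, complete) -> str:
    """Prompt the model to emit a Process Description for the recording."""
    return complete(PROMPT.format(recording=recording_text))
```

Because the output is code, each Process Step has a stable name that the replay loop below can check off one at a time, which is the point of the Chain of Code framing.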
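Finally, a hedged control-loop sketch of steps 8-12; `driver` and all of its methods are hypothetical stand-ins for the screenshot, prompting, playback, recording, and alerting primitives described above:

```python
def replay(process_steps: list[str], driver) -> None:
    """Walk the Process Description one Process Step at a time,
    falling back to recording when a step cannot be completed."""
    step_index = 0
    while step_index < len(process_steps):
        step = process_steps[step_index]
        screenshot = driver.take_marked_screenshot()               # step 6
        action = driver.prompt_for_next_action(screenshot, step)   # step 8
        driver.play(action)                                        # step 9
        done = driver.prompt_for_step_completed(                   # step 10
            driver.take_marked_screenshot(), step)
        if done:
            step_index += 1  # step 11: advance to the next Process Step
        else:
            driver.start_recording()  # step 11: fall back to recording
            driver.alert_user(
                "Assistance required: take corrective action, then stop "
                "the recording and/or resume replay via the Tray Icon.")
            break  # step 12: wait for the user
```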

Comments: https://github.com/OpenAdaptAI/OpenAdapt/discussions/552
