OpenAdapt Architecture (draft)
Richard Abrich edited this page Dec 19, 2023
1. The Client is installed on the user's desktop computer (Windows or Mac).
2. The user triggers "start recording" via the Tray Icon to start recording user Action events (mouse/keyboard), associated Screenshots, and active Window State (retrieved from the operating system's accessibility API).
3. The user triggers "stop recording" via the Tray Icon to stop recording.
4. Operating-system-level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into parent events (e.g. a single mouse position).
5. Protected Health Information (PHI) and Personally Identifiable Information (PII) are scrubbed from all recorded data.
6. Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643), and Marks are overlaid on objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441).
7. Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. `scroll_in_options_tab_until_save_button()`, `click_save_button()`).
8. LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description.
9. The Next Action is played.
10. LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed.
11. If successfully completed, advance to the next Process Step and continue from step 8. Otherwise, start a Recording and alert the user that assistance is required.
12. If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon.
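The event-merging step above (reducing raw operating-system events into parent events) could be sketched as follows. This is a minimal illustration, not OpenAdapt's actual event model: the `MouseEvent` structure, `merge_mouse_moves` helper, and the `max_gap` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MouseEvent:
    """Hypothetical raw mouse-move sample (not OpenAdapt's real event type)."""
    timestamp: float  # seconds
    x: int
    y: int


def merge_mouse_moves(events, max_gap=0.05):
    """Collapse runs of raw mouse-move samples into parent events.

    Consecutive samples separated by less than ``max_gap`` seconds are
    treated as one continuous movement and reduced to a single event at
    the final position, mirroring the merge/reduce step above.
    """
    merged = []
    run = []
    for event in events:
        if run and event.timestamp - run[-1].timestamp > max_gap:
            merged.append(run[-1])  # keep only the run's final position
            run = []
        run.append(event)
    if run:
        merged.append(run[-1])
    return merged


# 100 samples at 100 Hz (one second of continuous movement) reduce to one event.
samples = [MouseEvent(i / 100, i, i) for i in range(100)]
print(len(merge_mouse_moves(samples)))  # prints 1
```

A real implementation would also merge keyboard events into typed strings and align parent events with their Screenshots, but the reduction principle is the same.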
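A generated Process Description, as produced by the Chain of Code step above, might look like the following sketch. The function names for the Process Steps come from the example above; the wrapper name `save_options` and the stub bodies are hypothetical, since in practice the bodies are grounded into concrete Actions at replay time.

```python
def scroll_in_options_tab_until_save_button():
    """Process Step: scroll the Options tab until the Save button is visible."""
    ...  # grounded into concrete Actions (scroll events) during replay


def click_save_button():
    """Process Step: click the Save button."""
    ...  # grounded into a concrete click Action during replay


def save_options():
    """Process Description: high-level code summarizing the Recording.

    Each function call represents one Process Step.
    """
    scroll_in_options_tab_until_save_button()
    click_save_button()
```

Representing the Recording as code rather than free text gives the replay loop a well-defined sequence of Process Steps to advance through.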
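The replay loop above (generate the Next Action from the Marked Screenshot, play it, check completion, then advance or request assistance) can be sketched as below. Every callable here is a hypothetical stand-in for OpenAdapt internals, injected as a parameter so the control flow is explicit.

```python
def replay(process_steps, prompt_lmm, play_action, capture_marked_screenshot,
           start_recording, alert_user):
    """Replay a Process Description one Process Step at a time.

    Hypothetical collaborators:
      prompt_lmm(prompt, screenshot) -> queries the LLM/LMM
      play_action(action)            -> executes a generated Next Action
      capture_marked_screenshot()    -> current screenshot with Marks overlaid
      start_recording()              -> begins a Recording of corrective actions
      alert_user(message)            -> notifies the user via the Tray Icon
    """
    i = 0
    while i < len(process_steps):
        step = process_steps[i]
        # Generate the Next Action from the current Marked Screenshot and step.
        action = prompt_lmm(f"Next action for: {step}",
                            capture_marked_screenshot())
        play_action(action)
        # Ask the model whether the current Process Step was completed.
        completed = prompt_lmm(f"Was this step completed? {step}",
                               capture_marked_screenshot())
        if completed:
            i += 1  # advance to the next Process Step
        else:
            start_recording()  # capture the user's corrective actions
            alert_user(f"Assistance required at step: {step}")
            return  # user resumes replay via the Tray Icon
```

In practice the "was this step completed?" check would parse a structured model response rather than a bare boolean, but the loop structure is the same.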
Comments: https://github.com/OpenAdaptAI/OpenAdapt/discussions/552