
OpenAdapt Architecture (draft)

  1. The client is installed on the user's desktop computer (Windows or macOS).
  2. The user triggers "Start Recording" via the Tray Icon to begin recording user Action events (mouse/keyboard), associated Screenshots, and the active Window State (retrieved from the operating system's accessibility API).
  3. The user triggers "Stop Recording" via the Tray Icon to end the recording.
  4. Operating-system-level events (e.g. 100 mouse movements sampled at 100 Hz) are merged/reduced into parent events (e.g. a single mouse position); see the event-reduction sketch after this list.
  5. Protected Health Information (PHI) and Personally Identifiable Information (PII) are scrubbed from all recorded data (see the scrubbing sketch after this list).
  6. Screenshots are segmented via Segment Anything (https://arxiv.org/abs/2304.02643), and Marks are overlaid on the detected objects for Set-of-Mark prompting (https://arxiv.org/abs/2310.11441); see the segmentation sketch after this list.
  7. Large Language Models (LLMs) / Large Multimodal Models (LMMs) are repeatedly prompted to summarize the Recording into a Process Description (i.e. high-level Python code) using Chain of Code prompting (https://arxiv.org/abs/2312.04474), in which function calls represent Process Steps (e.g. "scroll_in_options_tab_until_save_button()", "click_save_button()"); see the summarization sketch after this list.
  8. LLMs/LMMs are prompted to generate the Next Action given the current Marked Screenshot and the current Process Step in the Process Description.
  9. The Next Action is replayed.
  10. LLMs/LMMs are prompted to determine whether the current Process Step in the Process Description was successfully completed.
  11. If it was, advance to the next Process Step and continue from step 8. Otherwise, start a Recording and alert the user that assistance is required.
  12. If assistance is required, the user is asked to take corrective action, then to stop the recording and/or resume replay via the Tray Icon. (Steps 8-12 are sketched as a control loop after this list.)
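
A minimal sketch of the event reduction in step 4, assuming a simple `ActionEvent` record; all names here are illustrative, not OpenAdapt's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ActionEvent:
    """Illustrative stand-in for a recorded OS-level event."""
    name: str            # e.g. "move", "click", "press"
    timestamp: float
    x: float | None = None
    y: float | None = None

def merge_consecutive_moves(events: list[ActionEvent]) -> list[ActionEvent]:
    """Collapse each run of consecutive mouse-move events into a single
    parent event carrying only the final cursor position."""
    merged: list[ActionEvent] = []
    for event in events:
        if event.name == "move" and merged and merged[-1].name == "move":
            merged[-1] = event  # the later sample supersedes earlier ones
        else:
            merged.append(event)
    return merged
```

At 100 Hz, this reduces a one-second mouse drag of 100 raw move events to a single parent event per run, as described in step 4.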
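One way step 5's scrubbing could be implemented is with Microsoft Presidio; this is a hedged sketch for text fields (screenshot redaction would follow the same pattern with an image redactor), and the choice of provider is an assumption, not something this page confirms:

```python
# Hedged sketch: Presidio is an assumption; the actual scrubbing
# provider and entity configuration may differ.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_text(text: str) -> str:
    """Replace detected PHI/PII entities in `text` with placeholders
    such as <PERSON> or <PHONE_NUMBER>."""
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```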
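A hedged sketch of step 6 using the official `segment_anything` package; the checkpoint filename, mark styling, and helper name are assumptions:

```python
import numpy as np
from PIL import Image, ImageDraw
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Checkpoint path is an assumption (the publicly released ViT-H weights).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def mark_screenshot(screenshot: Image.Image) -> Image.Image:
    """Segment the screenshot and overlay a numeric Mark at the center
    of each segment's bounding box for Set-of-Mark prompting."""
    image = screenshot.convert("RGB")
    masks = mask_generator.generate(np.array(image))
    draw = ImageDraw.Draw(image)
    for i, mask in enumerate(masks):
        x, y, w, h = mask["bbox"]  # XYWH bounding box of the segment
        draw.text((x + w / 2, y + h / 2), str(i), fill="red")
    return image
```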
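A hedged sketch of step 7's summarization; the prompt wording and the `complete` callable are assumptions (the only names taken from this page are the two example Process Steps):

```python
# `complete` is any text-completion callable, e.g. a thin wrapper
# around an LLM/LMM API client (hypothetical, not OpenAdapt's API).
PROMPT = """You are given a recording of user actions and screenshots.
Summarize it as high-level Python code in which each function call is
a Process Step, e.g.:

    scroll_in_options_tab_until_save_button()
    click_save_button()

Recording:
{recording}
"""

def summarize_recording(recording_text: str, complete) -> str:
    """Prompt the model to emit a Process Description for the recording."""
    return complete(PROMPT.format(recording=recording_text))
```

Because the output is code, each Process Step has a stable name that the replay loop below can check off one at a time, which is the point of the Chain of Code framing.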
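Finally, a hedged control-loop sketch of steps 8-12; `driver` and all of its methods are hypothetical stand-ins for the screenshot, prompting, playback, recording, and alerting primitives described above:

```python
def replay(process_steps: list[str], driver) -> None:
    """Walk the Process Description one Process Step at a time,
    falling back to recording when a step cannot be completed."""
    step_index = 0
    while step_index < len(process_steps):
        step = process_steps[step_index]
        screenshot = driver.take_marked_screenshot()               # step 6
        action = driver.prompt_for_next_action(screenshot, step)   # step 8
        driver.play(action)                                        # step 9
        done = driver.prompt_for_step_completed(                   # step 10
            driver.take_marked_screenshot(), step)
        if done:
            step_index += 1  # step 11: advance to the next Process Step
        else:
            driver.start_recording()  # step 11: fall back to recording
            driver.alert_user(
                "Assistance required: take corrective action, then stop "
                "the recording and/or resume replay via the Tray Icon.")
            break  # step 12: wait for the user
```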

Comments: https://github.com/OpenAdaptAI/OpenAdapt/discussions/552
