Typical async actions that OH must be good at:
- OH must poll GitHub CI runs
- OH must poll gcloud Kubernetes deployments
Typical errors:
- An action is run and the agent checks the initial state, but it doesn't pause and check again to make sure the action works end-to-end
- The agent struggles to find the error logs
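The first failure mode above comes down to a missing wait-and-recheck loop. A minimal sketch of that loop, assuming the agent can call some `get_status` function (a hypothetical callable, not an OH API) that returns the current state of a CI run or deployment as a string. The state names, the interval, and the timeout are illustrative assumptions.

```python
import time
from typing import Callable

# Hypothetical state names; real CI systems and deployments use their own.
TERMINAL_STATES = frozenset({"success", "failure", "cancelled"})


def poll_until_terminal(
    get_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s}s")
        # Pause before checking again instead of trusting the initial state.
        sleep(interval_s)
```

`sleep` and `clock` are injectable so that a test can drive the loop without actually waiting.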
What should be done:
- Find a way to evaluate how good OH is at these tasks; we should add an integration test for this behavior
- Improve OH's system prompt for these types of interactive tasks if the evaluation shows failures
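One possible shape for such an integration test is a scripted fake CI service plus a checker that inspects how the agent interacted with it. The sketch below is only an illustration: `FakeCI`, `check_agent_run`, and the status strings are hypothetical names, and how OH would actually be wired to the fake is left open.

```python
from dataclasses import dataclass


@dataclass
class FakeCI:
    """Hypothetical stand-in for a CI service: scripted statuses, one per poll."""

    statuses: list[str]
    error_log: str = "step 'build' failed: exit code 1"
    status_calls: int = 0
    log_calls: int = 0

    def get_status(self) -> str:
        # Return the next scripted status; repeat the last one once exhausted.
        idx = min(self.status_calls, len(self.statuses) - 1)
        self.status_calls += 1
        return self.statuses[idx]

    def get_logs(self) -> str:
        self.log_calls += 1
        return self.error_log


def check_agent_run(ci: FakeCI, reported_outcome: str) -> list[str]:
    """Return the problems found in how the agent handled the fake CI run."""
    problems = []
    final = ci.statuses[-1]
    if ci.status_calls < len(ci.statuses):
        problems.append("agent stopped polling before the run finished")
    if final == "failure" and ci.log_calls == 0:
        problems.append("agent never fetched the error logs")
    if reported_outcome != final:
        problems.append(f"agent reported {reported_outcome!r}, expected {final!r}")
    return problems
```

An agent that checks the status once and reports success would trip all three checks, which correspond to the typical errors listed above. An agent that polls to the final status, fetches the logs, and reports the failure would produce an empty list.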