● builder: If you're building desktop or browser agents, this benchmark tests the authenticated, personalized task space your users actually care about.
● researcher: Closes the evaluation gap between impersonal sandboxes and real personal-assistant deployments by including authenticated, context-dependent web tasks.