Evaluating Multimodal Agents In Real Computer Environments

2 months ago

OSWORLD is an integrated platform for evaluating agents on open-ended computer tasks that can involve arbitrary applications. Its authors built a benchmark of 369 tasks that use real web and desktop applications, operating system file I/O, and workflows spanning multiple applications. Each task is derived from a real computer-use scenario and comes with a detailed configuration of the initial state and a custom execution-based evaluation script, so that results are reliable and reproducible.
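
To make the "initial-state setup plus execution-based evaluation" idea concrete, here is a minimal Python sketch. It is not the OSWORLD API: the names Task, run_task, setup_rename, check_rename and the two toy agents are all hypothetical, and a temporary directory stands in for a real desktop environment. The point it illustrates is that the checker inspects the final state of the environment rather than the text the agent produced.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record (not the OSWORLD schema)."""
    instruction: str
    setup: Callable[[Path], None]      # prepares the initial state
    evaluate: Callable[[Path], bool]   # inspects the final state


def setup_rename(workdir: Path) -> None:
    # Initial state: one file with known content.
    (workdir / "notes.txt").write_text("meeting at 10\n")


def check_rename(workdir: Path) -> bool:
    # Execution-based check: look at the resulting files, not at agent output.
    old, new = workdir / "notes.txt", workdir / "archive.txt"
    return (not old.exists()) and new.exists() and new.read_text() == "meeting at 10\n"


def run_task(task: Task, agent: Callable[[str, Path], None]) -> bool:
    # Fresh directory per run so every attempt starts from the same state.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.evaluate(workdir)


RENAME_TASK = Task("Rename notes.txt to archive.txt", setup_rename, check_rename)


def scripted_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that performs the rename directly.
    (workdir / "notes.txt").rename(workdir / "archive.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that does nothing, so the check should fail.
    pass


if __name__ == "__main__":
    print(run_task(RENAME_TASK, scripted_agent))
    print(run_task(RENAME_TASK, idle_agent))
```

Because the setup runs in a fresh directory each time, repeated runs of the same task start from an identical state, which is what makes this style of check repeatable.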

Link to document: https://arxiv.org/pdf/2404.07972
