OSExpert
OSExpert-Eval: Benchmarking General Computer-Use Agents on Professional Expertise

OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

Motivation

The Problem

Why do OpenClaw-style agents (General Computer-Use Agents) still fall short?

  • Because their skills remain limited, shallow, and non-generalizable: they can solve familiar patterns, but break down quickly once a query departs from what they have effectively seen before.
  • OSExpert-Eval shows that even strong general-purpose agents remain trapped in test-time scaling and repeated trial-and-error, leading to slow inference, brittle trajectories, and late-stage failures.
  • They especially struggle to generalize across (1) long-horizon tasks, where cascading errors usually occur, (2) unseen or creative user interfaces, where superficial pattern matching is not enough to recover the right procedure, and (3) fine-grained low-level actions, where reliable expert behavior depends on precise, reusable procedural skills rather than one-off guesses.
The Paradigm Shift

Stop manually annotating skills. Let the agent discover them.

  • Instead of asking humans to handcraft an ever-growing library of workflows, OSExpert learns to directly discover skills from the target environment through automatic LLM-driven exploration.
  • The core idea is to replace static demonstrations with verifiable interaction: discover what actually works, keep what is validated, and discard what is not.
  • This is why we introduce GUI-DFS: to systematically uncover reusable unit skills, expose capability boundaries, and make skill acquisition scalable.
  • Once these skills are discovered, the agent can retrieve and compose them at inference time, solving the root problem rather than paying the same trial-and-error cost again and again.
The Discovery

OSExpert turns scalable exploration into reusable professional skills.

  • GUI-DFS exploration: systematically traverse the interface, discover unit functions, and retain only verified capabilities.
  • Self-constructed curriculum: compose discovered unit skills into higher-level procedures that support long-horizon professional workflows.
  • Fine-grained skill construction: invoke curated action primitives during exploration and keep only verified solutions for precise low-level control.
  • Skill-aware efficient inference: plan once, execute with reusable procedural knowledge, and stop early with explicit skill-boundary checks.
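The exploration step above can be sketched as a depth-first traversal over UI states. This is a minimal, hypothetical illustration (not the paper's implementation): the environment is mocked as a graph whose nodes are UI states and whose edges are candidate actions, and `verify` stands in for the LLM-driven check that an action's observed effect matches its intended unit function — only validated capabilities are retained as skills.

```python
def gui_dfs(env, state, verify, visited=None, skills=None):
    """Depth-first traversal that retains only verified unit skills."""
    if visited is None:
        visited, skills = set(), []
    if state in visited:
        return skills
    visited.add(state)
    for action, next_state in env.get(state, []):
        if verify(state, action, next_state):   # keep only what is validated
            skills.append((state, action))
        gui_dfs(env, next_state, verify, visited, skills)
    return skills

# Toy environment: home screen -> menu -> export dialog (illustrative names).
env = {
    "home":   [("open_menu", "menu")],
    "menu":   [("open_export", "export"), ("broken_button", "menu")],
    "export": [("click_save", "home")],
}

# Stand-in verifier: rejects the no-op "broken_button".
verify = lambda s, a, ns: a != "broken_button"

skills = gui_dfs(env, "home", verify)
# skills holds the three verified (state, action) unit functions.
```

In this toy run the unverified `broken_button` edge is explored but never stored, which is the "discover what actually works, keep what is validated" behavior described above.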
OSExpert-Eval overview
Figure 1. Our OSExpert-Eval shows that current computer-use agents remain far from expert-level: they struggle with long-horizon tasks, generalize poorly to unseen UI designs, lack fine-grained control over action sequences, and still fall well short of human expert efficiency.

Abstract

General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To address these limitations, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm to comprehensively explore and verify an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once the exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by recognizing the boundary of their capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a ∼20% performance gain on OSExpert-Eval and closing the efficiency gap to humans by ∼80%.

OSExpert-Eval Benchmark

Task categories

  • Long-horizon compositional workflows
  • Generalization to unseen and creative user interfaces
  • Fine-grained low-level actions
  • Execution efficiency

Scale

  • 113 total tasks
  • Long Horizon: 30 tasks
  • Unseen UI: 50 tasks
  • Fine-Grained: 33 tasks
Task table
Task distribution pie chart
The inner ring shows the three high-level categories, while the outer ring breaks each category down by environment; slice sizes are proportional to the number of tasks.

OSExpert Method

OSExpert learns verifiable skills from bottom-up self-exploration and reuses them for robust, efficient inference.

  • GUI-DFS exploration to discover and verify unit functions.
  • Curriculum construction by composing unit skills into composite procedures.
  • Fine-grained skill construction by invoking curated action primitives during exploration and retaining only verified solutions.
  • Efficiency via single-pass fast planning and a skill-boundary check for early stopping.
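The last bullet can be made concrete with a small sketch. This is an illustrative toy (names like `plan_once` and the step strings are assumptions, not from the paper's codebase): the agent plans once against its verified skill set, expanding each step into stored procedural actions, and the skill-boundary check triggers an early stop as soon as a required step falls outside the verified set, rather than paying repeated trial-and-error cost at inference time.

```python
def plan_once(task_steps, skill_set):
    """Return the full action plan, or None if any step is out of bounds."""
    plan = []
    for step in task_steps:
        skill = skill_set.get(step)
        if skill is None:          # skill-boundary check -> stop early
            return None
        plan.extend(skill)         # reuse stored procedural actions
    return plan

# Toy verified skill set mapping unit skills to low-level action sequences.
skill_set = {
    "open_document": ["click_file", "click_open", "select_path"],
    "insert_image":  ["click_insert", "click_image", "pick_file"],
}

in_bounds  = plan_once(["open_document", "insert_image"], skill_set)
out_bounds = plan_once(["open_document", "render_3d_scene"], skill_set)
# in_bounds expands into one six-action plan; out_bounds is None because
# "render_3d_scene" lies outside the verified skill boundary.
```

The design point is that the boundary check happens before any execution, so an out-of-scope request fails fast instead of burning inference-time compute on doomed retries.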
Key reported gains
~20% success gain on OSExpert-Eval
~80% efficiency-gap reduction toward human experts

Representative Task Examples

Each example includes the initialization state, the natural-language task instruction, and the ground-truth outcome.

Representative examples from OSExpert-Eval
Representative examples from OSExpert-Eval across three task categories. The figure aggregates six examples illustrating the breadth of professional computer-use skills evaluated in our benchmark. Top (Long-Horizon Compositional Workflows): multi-step tasks in LibreOffice Calc and Writer that require composing several unit operations in the correct order, including spreadsheet completion with interface adjustments (e.g., zoom) and document-wide formatting with image insertion. Middle (Fine-Grained Action Execution): precise image-editing tasks in GIMP, including tightly cropping to a specified 200×200 region and performing accurate background removal while preserving object integrity. Bottom (Unseen UI Generalization): tasks in Tableau and MiniWord that test transfer to unfamiliar interfaces and interaction patterns, such as building a world map visualization with sales-based coloring and category filtering, and importing external text followed by image insertion and alignment in a novel editor layout. For each example, we show the initial environment state, the natural-language instruction, and the corresponding ground-truth outcome.

Citation

If you use OSExpert or OSExpert-Eval, please cite:

@misc{liu2026osexpertcomputeruseagentslearning,
      title={OSExpert: Computer-Use Agents Learning Professional Skills via Exploration},
      author={Jiateng Liu and Zhenhailong Wang and Rushi Wang and Bingxuan Li and Jeonghwan Kim and Aditi Tiwari and Pengfei Yu and Denghui Zhang and Heng Ji},
      year={2026},
      eprint={2603.07978},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.07978},
}