# UseDesktop Evals

UseDesktop Evals publishes verifier-backed RL environments for computer-use agents.

The site is the public evidence layer for UseDesktop workflow environments: task prompts, mock app state, grader contracts, verifier assumptions, pass@k model results, failure modes, and run evidence. It is designed for researchers, data buyers, and teams evaluating computer-use agent post-training data.

Core entities:

- RL environment: the app or workflow state a model is situated in.
- Task: the prompt or objective the model must complete.
- Grader: the quantitative scoring contract for the task.
- Run: one model attempt with score, reward, verdict, and trace evidence.
- Model: a tested computer-use agent (CUA) adapter with pass@k and average score summaries.
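
As a rough illustration of how these entities relate, the sketch below models a run and derives the per-model summaries from a list of runs. The field names, the 0.0 to 1.0 score range, and the `summarize` helper are hypothetical and not taken from the UseDesktop schema. The sketch also assumes pass@k means the widely used unbiased estimator `1 - C(n-c, k) / C(n, k)`, where `n` is the number of attempts and `c` the number that passed; the site may compute it differently.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Run:
    """One model attempt on one task (all field names are hypothetical)."""
    model: str
    environment: str
    task: str
    score: float   # grader score; the 0.0-1.0 range is an assumption
    passed: bool   # verdict derived from the grader contract


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which passed."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(runs: list[Run], k: int) -> dict[str, float]:
    """Summarize the runs of one model on one task."""
    n = len(runs)
    c = sum(r.passed for r in runs)
    return {
        "pass_at_k": pass_at_k(n, c, k),
        "average_score": sum(r.score for r in runs) / n,
    }
```

With four attempts of which one passed, this estimator gives pass@1 = 0.25 and pass@2 = 0.5, which shows why a pass@k figure is only comparable across models when `n` and `k` are reported alongside it.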

UseDesktop prioritizes evidence over volume. Workflow data is backed by solvability checks, verifier audits, difficulty calibration, contamination control, and recorded model runs.

Canonical sections:

- `/` introduces the public eval surface and current catalog.
- `/tasks` lists verifier-backed task prompts and grader summaries.
- `/environments` lists public RL environments.
- `/environments/{slug}` shows source workflow, reset behavior, action space, grader, and model result summaries.
- `/environments/{slug}/tasks/{task}` shows a single task's prompt, environment state, grader contract, pass@k results, and failure taxonomy.
- `/models` lists tested computer-use models.
- `/runs/{runId}` shows a single model attempt with score evidence.
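
The parameterized routes above can be filled in mechanically. The following is a minimal sketch, assuming slugs, task identifiers, and run IDs are plain strings that should be percent-encoded as single path segments; the helper names and the example identifiers are hypothetical, and only site-relative paths are built because the host is not stated here.

```python
from urllib.parse import quote


def _segment(value: str) -> str:
    """Percent-encode a value so it is safe as one URL path segment."""
    return quote(value, safe="")


def environment_path(slug: str) -> str:
    """Path for `/environments/{slug}`."""
    return f"/environments/{_segment(slug)}"


def task_path(slug: str, task: str) -> str:
    """Path for `/environments/{slug}/tasks/{task}`."""
    return f"{environment_path(slug)}/tasks/{_segment(task)}"


def run_path(run_id: str) -> str:
    """Path for `/runs/{runId}`."""
    return f"/runs/{_segment(run_id)}"


# Hypothetical identifiers, for illustration only.
print(task_path("example-env", "example-task"))
```

Encoding each segment separately keeps a stray `/` or space in an identifier from changing which route the path resolves to.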

Related UseDesktop sites:

- Main site: https://usedesktop.com
- Blog: https://blog.usedesktop.com
- App: https://app.usedesktop.com/dashboard