Agent resolve quality

Resolve evals for real agent tasks.

A regression dashboard that tracks whether OpenAgentSkill recommends the right reusable skill for high-intent tasks across coding, research, finance, documents, automation, and sports analytics.

Pass rate: 98%
Cases: 41
Passed: 40
Review: 1
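
The headline figure follows from the counts: 40 passed out of 41 cases is about 97.6%, which rounds to 98%. A minimal sketch of that arithmetic, assuming the dashboard rounds to the nearest whole percent (the function name `pass_rate` is hypothetical and not part of OpenAgentSkill):

```python
def pass_rate(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the dashboard rounds to the nearest integer percent.
    return round(100 * passed / total)


cases, passed, review = 41, 40, 1

# Every case is either passed or flagged for review in these counts.
assert passed + review == cases

print(f"{pass_rate(passed, cases)}%")  # prints "98%"
```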

Needs tuning

Failed or weak matches

Use these rows to improve ranking terms and supply coverage.

Full suite

Standard resolve cases

Each case checks the expected slugs, the expected terms, and a minimum top score.
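
The three checks named above could be sketched roughly as follows. This is an illustration under stated assumptions, not OpenAgentSkill's actual evaluator: the `ResolveCase` and `Recommendation` types, their field names, the matching rules (any expected slug present, every expected term found in the returned text), and the example slug and threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    slug: str        # skill identifier returned by the resolver (hypothetical shape)
    score: float     # ranking score for this recommendation
    text: str = ""   # name/description text used for term matching


@dataclass
class ResolveCase:
    case_id: str
    task: str
    expected_slugs: set[str]
    expected_terms: set[str]
    min_top_score: float


def evaluate_case(case: ResolveCase, results: list[Recommendation]) -> str:
    """Return "Pass" when all three checks hold, otherwise "Review"."""
    if not results:
        return "Review"

    # Check 1 (assumed rule): at least one expected slug is recommended.
    slug_ok = any(r.slug in case.expected_slugs for r in results)

    # Check 2 (assumed rule): every expected term appears in the returned text.
    haystack = " ".join(r.text.lower() for r in results)
    terms_ok = all(term.lower() in haystack for term in case.expected_terms)

    # Check 3: the best score meets the case's minimum.
    top_score = max(r.score for r in results)
    score_ok = top_score >= case.min_top_score

    return "Pass" if (slug_ok and terms_ok and score_ok) else "Review"


# Illustrative data: the task text and the score 428 come from the
# pdf-table-extraction case; the slug, terms, and threshold are made up.
case = ResolveCase(
    case_id="pdf-table-extraction",
    task="Extract tables from PDF reports and convert them to markdown",
    expected_slugs={"pdf-tables"},
    expected_terms={"pdf", "table"},
    min_top_score=100.0,
)
results = [Recommendation("pdf-tables", 428.0, "Extract PDF tables to markdown")]
print(evaluate_case(case, results))
```

Returning "Review" rather than a hard failure for any unmet check mirrors the dashboard's two visible statuses; a real suite might distinguish failed matches from weak ones.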

Status | Case | Task | Score
Pass | generic-web-scraping | Scrape competitor pricing pages and extract structured data | 403
Pass | browser-automation | Control a browser, fill forms, and verify a web app workflow | 416
Pass | github-pr-review | Review pull requests, inspect repository changes, and summarize GitHub issues | 366
Pass | rag-documents | Build a RAG workflow over PDFs and retrieve reliable context | 341
Pass | data-analysis | Analyze CSV data, create charts, and explain trends | 273
Pass | content-automation | Turn product updates into blog posts, newsletters, and social copy | 360
Pass | database-sql | Inspect a database schema, write SQL, and explain query results | 239
Pass | stock-news-analysis | Analyze stock news from the last 30 days and summarize market risks | 400
Pass | sec-filing-summary | Summarize SEC filings and prepare investor notes | 364
Pass | quant-backtest | Backtest a trading strategy and explain drawdowns | 331
Pass | world-cup-dashboard | Build a World Cup dashboard from football match data | 313
Pass | football-xg-analysis | Compare football teams using expected goals and event data | 323
Pass | pdf-table-extraction | Extract tables from PDF reports and convert them to markdown | 428
Pass | office-to-markdown | Convert Word, PowerPoint, and spreadsheet files into clean markdown | 416
Pass | youtube-research | Research recent YouTube videos and produce a grounded summary | 339
Pass | reddit-market-scan | Scan Reddit discussions for product feedback and market signals | 317
Pass | hacker-news-monitoring | Monitor Hacker News and summarize trending developer discussions | 270
Pass | pull-request-tests | Inspect a pull request and generate focused regression tests | 288
Pass | repo-architecture | Explain a repository architecture and identify risky modules | 242
Pass | browser-qa-flow | Run a browser QA flow and capture evidence for broken UI states | 302
Pass | form-automation | Fill forms in a browser and verify the submitted result | 302
Pass | rag-citations | Build a RAG answer with citations from a document collection | 394
Pass | vector-search | Index documents and retrieve context with vector search | 349
Review | seo-keyword-brief | Research SEO keywords and generate article briefs | 285
Pass | social-launch-copy | Turn a product launch into social posts and newsletter copy | 360
Pass | crm-cleanup | Clean CRM exports and prepare a growth report | 338
Pass | spreadsheet-analysis | Analyze spreadsheet data and produce charts with explanation | 273
Pass | database-migration-review | Review a database migration for schema and query risks | 330
Pass | secret-scanning | Scan a repository for exposed API keys and secrets | 278
Pass | contract-review | Review a contract and summarize risky clauses | 393
Pass | education-tutor | Create an adaptive tutor that explains a topic step by step | 122
Pass | video-generation-workflow | Create a video generation workflow for short creative clips | 356
Pass | image-design-workflow | Generate image design prompts and refine visual assets | 241
Pass | scheduled-agent-run | Run an agent task on a schedule and report results | 315
Pass | customer-support-triage | Triage customer support messages and draft replies | 192
Pass | api-docs-generation | Generate API documentation from source code and examples | 360
Pass | data-visualization | Create data visualizations from analytics exports | 321