Decision filters

Choose skills by scenario, quality, and trust signals.

14 skills matching "benchmark"

Best blend of quality, stars, freshness, and agent usage

1

Cua

VERIFIED · EXCELLENT · 100

Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).

$ npx skills add trycua/cua
17.0K stars · 73 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
html · ai-agents
by trycua
2

Zvec

VERIFIED · EXCELLENT · 100

A lightweight, lightning-fast, in-process vector database

$ npx skills add alibaba/zvec
9.7K stars · 71 quality
High-confidence pick with strong adoption and healthy maintenance signals.
by alibaba
3

Chinese LLM Benchmark

VERIFIED · EXCELLENT · 100

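NoneLinear (非线智能) - ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Beyond the leaderboard, it also provides a large-model defect library with more than 2 million entries, so the community can study, analyze, and improve large models.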

$ npx skills add jeinlee1991/chinese-llm-benchmark
6.0K stars · 68 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
llm
by jeinlee1991
4

AutoRAG

VERIFIED · EXCELLENT · 100

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

$ npx skills add Marker-Inc-Korea/AutoRAG
4.8K stars · 67 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · rag
by Marker-Inc-Korea
5

Evalscope

VERIFIED · EXCELLENT · 100

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

$ npx skills add modelscope/evalscope
2.8K stars · 66 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · rag
by modelscope
6

Awesome GraphRAG

VERIFIED · EXCELLENT · 100

Awesome-GraphRAG: A curated list of resources (surveys, papers, benchmarks, and open-source projects) on graph-based retrieval-augmented generation.

$ npx skills add DEEP-PolyU/Awesome-GraphRAG
2.4K stars · 65 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
rag
by DEEP-PolyU
7

Awesome LLM Long Context Modeling

VERIFIED · EXCELLENT · 100

📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥

$ npx skills add Xnhyacinth/Awesome-LLM-Long-Context-Modeling
2.1K stars · 65 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
rag
by Xnhyacinth
8

AgentOps

VERIFIED · EXCELLENT · 100

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

$ npx skills add AgentOps-AI/agentops
5.6K stars · 65 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
python · llm
by AgentOps-AI
9

Awesome Web Agents

VERIFIED · EXCELLENT · 98

🔥 A list of tools, frameworks, and resources for building AI web agents

$ npx skills add steel-dev/awesome-web-agents
1.4K stars · 64 quality · Claude Code + Browser agents
High-confidence pick with strong adoption and healthy maintenance signals.
python · browser-automation
by steel-dev
10

OpenOCR

VERIFIED · EXCELLENT · 100

OpenOCR: An open-source toolkit for general-OCR research and applications that integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

$ npx skills add Topdu/OpenOCR
1.4K stars · 64 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · ocr
by Topdu
11

Docext

VERIFIED · EXCELLENT · 98

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

$ npx skills add NanoNets/docext
2.0K stars · 62 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · rag
by NanoNets
12

AgentBench

VERIFIED · EXCELLENT · 97

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

$ npx skills add THUDM/AgentBench
3.4K stars · 59 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
python · llm
by THUDM
13

BEIR

VERIFIED · EXCELLENT · 87

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

$ npx skills add beir-cellar/beir
2.2K stars · 54 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · rag
by beir-cellar
14

WindowsAgentArena

STRONG · 81

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.

$ npx skills add microsoft/WindowsAgentArena
861 stars · 51 quality · Claude Code
Solid option that is likely worth shortlisting for production workflows.
python · ai-agents
by microsoft