Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
$ npx skills add trycua/cua
14 skills matching "benchmark"
Best blend of quality, stars, freshness, and agent usage
A lightweight, lightning-fast, in-process vector database
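$ npx skills add alibaba/zvec
NoneLinear (非线智能) - ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Beyond leaderboards, it also provides a large-model defect library of more than 2 million entries, so the community can study, analyze, and improve large models.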
$ npx skills add jeinlee1991/chinese-llm-benchmark
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
$ npx skills add Marker-Inc-Korea/AutoRAG
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
$ npx skills add modelscope/evalscope
Awesome-GraphRAG: A curated list of resources (surveys, papers, benchmarks, and opensource projects) on graph-based retrieval-augmented generation.
$ npx skills add DEEP-PolyU/Awesome-GraphRAG
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
$ npx skills add Xnhyacinth/Awesome-LLM-Long-Context-Modeling
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
$ npx skills add AgentOps-AI/agentops
🔥 A list of tools, frameworks, and resources for building AI web agents
$ npx skills add steel-dev/awesome-web-agents
OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.
$ npx skills add Topdu/OpenOCR
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
$ npx skills add NanoNets/docext
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
$ npx skills add THUDM/AgentBench
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
$ npx skills add beir-cellar/beir
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
$ npx skills add microsoft/WindowsAgentArena