Decision filters

Choose skills by scenario, quality, and trust signals.

23 skills matching "evaluation"

Best blend of quality, stars, freshness, and agent usage

1

MLflow

VERIFIED · EXCELLENT · 100

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

$ npx skills add mlflow/mlflow
26.1K stars · 74 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by mlflow
2

Promptfoo

VERIFIED · EXCELLENT · 100

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

$ npx skills add promptfoo/promptfoo
21.5K stars · 74 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · rag
by promptfoo
3

Opik

VERIFIED · EXCELLENT · 100

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

$ npx skills add comet-ml/opik
19.4K stars · 73 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by comet-ml
4

WeKnora

VERIFIED · EXCELLENT · 100

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

$ npx skills add Tencent/WeKnora
15.4K stars · 73 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
go · rag
by Tencent
5

Bisheng

VERIFIED · EXCELLENT · 100

BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Features include GenAI workflow, RAG, Agent, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.

$ npx skills add dataelement/bisheng
11.4K stars · 72 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · rag
by dataelement
6

Phoenix

VERIFIED · EXCELLENT · 100

AI Observability & Evaluation

$ npx skills add Arize-ai/phoenix
9.8K stars · 70 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by Arize-ai
7

Chinese LLM Benchmark

VERIFIED · EXCELLENT · 100

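NoneLinear (非线智能) - ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Beyond the leaderboard, it also provides a large-model defect library of more than 2 million entries, so the community can study, analyze, and improve large models.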

$ npx skills add jeinlee1991/chinese-llm-benchmark
6.0K stars · 68 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
llm
by jeinlee1991
8

Helicone

VERIFIED · EXCELLENT · 100

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

$ npx skills add Helicone/helicone
5.7K stars · 68 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llmops
by Helicone
9

Coze Loop

VERIFIED · EXCELLENT · 100

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

$ npx skills add coze-dev/coze-loop
5.5K stars · 68 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
go · llmops
by coze-dev
10

Giskard Oss

VERIFIED · EXCELLENT · 100

🐢 Open-Source Evaluation & Testing library for LLM Agents

$ npx skills add Giskard-AI/giskard-oss
5.4K stars · 68 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by Giskard-AI
11

Agenta

VERIFIED · EXCELLENT · 100

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

$ npx skills add Agenta-AI/agenta
4.1K stars · 67 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llmops
by Agenta-AI
12

Trulens

VERIFIED · EXCELLENT · 100

Evaluation and Tracking for LLM Experiments and AI Agents

$ npx skills add truera/trulens
3.3K stars · 66 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by truera
13

Langwatch

VERIFIED · EXCELLENT · 100

The platform for LLM evaluations and AI agent testing

$ npx skills add langwatch/langwatch
3.3K stars · 66 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llmops
by langwatch
14

Lmnr

VERIFIED · EXCELLENT · 100

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

$ npx skills add lmnr-ai/lmnr
2.9K stars · 66 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llmops
by lmnr-ai
15

RagaAI Catalyst

VERIFIED · EXCELLENT · 100

Python SDK for an agent AI observability, monitoring, and evaluation framework. Includes agent, LLM, and tool tracing, debugging for multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

$ npx skills add raga-ai-hub/RagaAI-Catalyst
16.2K stars · 66 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by raga-ai-hub
16

Openlit

VERIFIED · EXCELLENT · 100

Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground. 🚀💻 Integrates with 50+ LLM Providers, VectorDBs, Agent Frameworks and GPUs.

$ npx skills add openlit/openlit
2.5K stars · 65 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llm
by openlit
17

LLM Engineers Handbook

VERIFIED · EXCELLENT · 100

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

$ npx skills add PacktPublishing/LLM-Engineers-Handbook
5.1K stars · 65 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by PacktPublishing
18

Observal

VERIFIED · EXCELLENT · 98

Observal is an Observability and Evaluation platform for human-in-the-loop agents

$ npx skills add BlazeUp-AI/Observal
1.3K stars · 64 quality · Claude Code + Cursor
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by BlazeUp-AI
19

Intellagent

VERIFIED · EXCELLENT · 100

A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions

$ npx skills add plurai-ai/intellagent
1.2K stars · 63 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by plurai-ai
20

Agent Skills Eval

STRONG · 84

A test runner for agentskills.io-style AI agent skills

$ npx skills add darkrishabh/agent-skills-eval
522 stars · 53 quality · Claude Code + OpenAI Agents
Solid option that is likely worth shortlisting for production workflows.
typescript · ai-agents
by darkrishabh
21

Awesome papers involving LLMs in Social Science.

$ npx skills add ValueByte-AI/Awesome-LLM-in-Social-Science
623 stars · 50 quality · Claude Code
Solid option that is likely worth shortlisting for production workflows.
llm
by ValueByte-AI
22

Contoso Chat

PROMISING · 68

This sample covers the full end-to-end process of creating a RAG application with Prompty and Azure AI Foundry. It includes GPT-4 LLM application code, evaluations, deployment automation with the AZD CLI, GitHub Actions for evaluation and deployment, and intent mapping for multiple LLM tasks.

$ npx skills add Azure-Samples/contoso-chat
761 stars · 43 quality · Claude Code + OpenAI Agents
Useful candidate, but compare it with alternatives before adopting.
bicep · llmops
by Azure-Samples
23

Continuous Eval

NEEDS REVIEW · 53

Data-Driven Evaluation for LLM-Powered Applications

$ npx skills add relari-ai/continuous-eval
516 stars · 38 quality · Claude Code
Inspect the repository carefully before adding it to an agent workflow. Check: Repository looks stale
python · llmops
by relari-ai