Decision filters

Choose skills by scenario, quality, and trust signals.

11 skills matching "llm-eval"

Best blend of quality, stars, freshness, and agent usage

1

MLflow

VERIFIED · EXCELLENT · 100

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

$ npx skills add mlflow/mlflow
26.1K stars · 74 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by mlflow
2

Promptfoo

VERIFIED · EXCELLENT · 100

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

$ npx skills add promptfoo/promptfoo
21.5K stars · 74 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · rag
by promptfoo
3

Opik

VERIFIED · EXCELLENT · 100

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

$ npx skills add comet-ml/opik
19.4K stars · 73 quality · Claude Code + LangChain
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by comet-ml
4

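Chinese LLM Benchmark

VERIFIED · EXCELLENT · 100

NoneLinear (非线智能) - ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Beyond leaderboards, it also provides a library of more than 2 million large-model defects, so the community can study, analyze, and improve large models.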

$ npx skills add jeinlee1991/chinese-llm-benchmark
6.0K stars · 68 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
llm
by jeinlee1991
5

Giskard OSS

VERIFIED · EXCELLENT · 100

🐢 Open-Source Evaluation & Testing library for LLM Agents

$ npx skills add Giskard-AI/giskard-oss
5.4K stars · 68 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by Giskard-AI
6

Agenta

VERIFIED · EXCELLENT · 100

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

$ npx skills add Agenta-AI/agenta
4.1K stars · 67 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · llmops
by Agenta-AI
7

TruLens

VERIFIED · EXCELLENT · 100

Evaluation and Tracking for LLM Experiments and AI Agents

$ npx skills add truera/trulens
3.3K stars · 66 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by truera
8

LLM Engineers Handbook

VERIFIED · EXCELLENT · 100

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

$ npx skills add PacktPublishing/LLM-Engineers-Handbook
5.1K stars · 65 quality · Claude Code
High-confidence pick with strong adoption and healthy maintenance signals.
python · llmops
by PacktPublishing
9

Agent Skills Eval

STRONG · 84

A test runner for agentskills.io-style AI agent skills

$ npx skills add darkrishabh/agent-skills-eval
522 stars · 53 quality · Claude Code + OpenAI Agents
Solid option that is likely worth shortlisting for production workflows.
typescript · ai-agents
by darkrishabh
10

Awesome papers involving LLMs in Social Science.

$ npx skills add ValueByte-AI/Awesome-LLM-in-Social-Science
623 stars · 50 quality · Claude Code
Solid option that is likely worth shortlisting for production workflows.
llm
by ValueByte-AI
11

Continuous Eval

NEEDS REVIEW · 53

Data-Driven Evaluation for LLM-Powered Applications

$ npx skills add relari-ai/continuous-eval
516 stars · 38 quality · Claude Code
Inspect the repository carefully before adding it to an agent workflow. Check: Repository looks stale
python · llmops
by relari-ai