Decision filters

Choose skills by scenario, quality, and trust signals.

2 skills matching "evaluation-framework"

Best blend of quality, stars, freshness, and agent usage

1

Promptfoo

VERIFIED · EXCELLENT · 100

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

$ npx skills add promptfoo/promptfoo
21.5K stars · 74 quality · Claude Code + OpenAI Agents
High-confidence pick with strong adoption and healthy maintenance signals.
typescript · rag
by promptfoo
2

Continuous Eval

NEEDS REVIEW · 53

Data-Driven Evaluation for LLM-Powered Applications

$ npx skills add relari-ai/continuous-eval
516 stars · 38 quality · Claude Code
Inspect the repository carefully before adding it to an agent workflow.
Check: Repository looks stale
python · llmops
by relari-ai