{"slug":"huggingface-evaluation-guidebook","name":"Evaluation Guidebook","description":"Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","tagline":"Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","category":"ml-automation","tags":["machine-learning","automation","ml-media","evaluation","evaluation-metrics","guidebook","large-language-models","llm","tutorial","jupyter notebook"],"author":{"name":"huggingface","verified":true,"url":"https://github.com/huggingface"},"attribution":{"status":"community_indexed","statusLabel":"Community indexed","shortLabel":"COMMUNITY INDEXED","sourceLabel":"GitHub star discovery","sourceDetail":"huggingface/evaluation-guidebook","creatorName":"huggingface","creatorUrl":"https://github.com/huggingface","sourceUrl":"https://github.com/huggingface/evaluation-guidebook","indexedBy":"OpenAgentSkill community index","claimUrl":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook#claim-this-skill","claimCta":"Claim this skill","trustNote":"This listing was indexed from public sources and is not marked official until a maintainer claim is approved.","publicNote":"Attribution links to the public repository or creator profile. Creators can claim the listing to update ownership signals."},"stats":{"stars":2124,"forks":123,"downloads":0,"rating":0,"review_count":0,"quality_score":53.99},"quality":{"score":82,"tier":"strong","label":"Strong","summary":"Solid option that is likely worth shortlisting for production workflows.","signals":[{"label":"GitHub stars","value":"2.1K","tone":"positive"},{"label":"Freshness","value":"7mo ago","tone":"positive"},{"label":"Install ready","value":"Yes","tone":"positive"},{"label":"License","value":"Unknown","tone":"neutral"}],"warnings":[]},"trust":{"version":"trust-score-v4","score":83,"tier":"strong","label":"Strong shortlist","summary":"Good trust signals with a few areas worth checking before rollout.","recommendedAction":"Test in a sandbox workflow and compare its install path with close alternatives.","dimensions":[{"id":"github_adoption","label":"GitHub adoption","score":86,"weight":0.13,"status":"pass","detail":"2.1K GitHub stars"},{"id":"repo_activity","label":"Stars/forks activity","score":77,"weight":0.08,"status":"info","detail":"2.1K stars, 123 forks; issue activity unavailable in current metadata"},{"id":"maintenance","label":"Recent maintenance","score":62,"weight":0.14,"status":"info","detail":"7mo since push"},{"id":"license","label":"License clarity","score":42,"weight":0.09,"status":"warn","detail":"Unknown"},{"id":"documentation","label":"README/SKILL.md completeness","score":90,"weight":0.14,"status":"pass","detail":"Metadata includes enough usage and workflow context"},{"id":"dependency_risk","label":"Dependency/runtime risk","score":90,"weight":0.12,"status":"pass","detail":"no major dependency risk hints in public metadata"},{"id":"installability","label":"Install availability","score":92,"weight":0.1,"status":"pass","detail":"npx skills add huggingface/evaluation-guidebook"},{"id":"install_safety","label":"Install command safety","score":92,"weight":0.1,"status":"pass","detail":"standard package or runtime install path"},{"id":"permission_surface","label":"Permission surface","score":86,"weight":0.07,"status":"pass","detail":"filesystem or document access"},{"id":"repository","label":"Repository evidence","score":86,"weight":0.04,"status":"pass","detail":"https://github.com/huggingface/evaluation-guidebook"},{"id":"review_status","label":"Review status","score":88,"weight":0.05,"status":"pass","detail":"AI review data available"},{"id":"agent_outcomes","label":"Agent Proven outcomes","score":54,"weight":0.13,"status":"info","detail":"No agent outcome data yet"}],"checks":[{"status":"pass","label":"GitHub adoption","detail":"2.1K GitHub stars"},{"status":"info","label":"Stars/forks activity","detail":"2.1K stars, 123 forks; issue activity unavailable in current metadata"},{"status":"info","label":"Recent maintenance","detail":"7mo since push"},{"status":"warn","label":"License clarity","detail":"Unknown"},{"status":"pass","label":"README/SKILL.md completeness","detail":"Metadata includes enough usage and workflow context"},{"status":"pass","label":"Dependency/runtime risk","detail":"no major dependency risk hints in public metadata"},{"status":"pass","label":"Install availability","detail":"npx skills add huggingface/evaluation-guidebook"},{"status":"pass","label":"Install command safety","detail":"standard package or runtime install path"},{"status":"pass","label":"Permission surface","detail":"filesystem or document access"},{"status":"pass","label":"Repository evidence","detail":"https://github.com/huggingface/evaluation-guidebook"},{"status":"pass","label":"Review status","detail":"AI review data available"},{"status":"info","label":"Agent Proven outcomes","detail":"No agent outcome data yet"},{"status":"pass","label":"Ownership","detail":"Listing manually verified"},{"status":"info","label":"OpenAgentSkill usage","detail":"No local usage activity yet"},{"status":"info","label":"Agent outcomes","detail":"No agent outcome data yet"}],"strengths":["Manually verified listing","AI review approved","Install path is available","Repository evidence is available","Meaningful GitHub adoption signal","Install command has no obvious high-risk pattern"],"warnings":["License is unclear","Quality score needs review","License clarity: Unknown"],"evidence":{"stars":"2.1K GitHub stars","repoActivity":"2.1K stars, 123 forks","lastPushed":"7mo since push","license":"Unknown","repository":"https://github.com/huggingface/evaluation-guidebook","install":"npx skills add huggingface/evaluation-guidebook","installSafety":"standard package or runtime install path","permissionSurface":"filesystem or document access","documentation":"Strong README/SKILL.md context","agentOutcomes":"No agent outcome data yet"},"installReadiness":{"ready":true,"command":"npx skills add huggingface/evaluation-guidebook","policy":"human_review_before_install","label":"Human review before install","notes":["Install path is available","Repository evidence is available","License is unclear","No Agent Proven outcome evidence yet","7mo since push"]},"agentCompatibility":["Jupyter Notebook","Machine Learning","Codex","Claude Code","Cursor","OpenAgentSkill CLI"],"riskSummary":{"level":"medium","label":"Review before production","notes":["License is unclear","Quality score needs review","License clarity: Unknown"]},"outcomeEvidence":{"total":0,"successes":0,"failures":0,"notRelevant":0,"successRate":null,"installAttempts":0,"riskBlocked":0,"setupRequired":0,"installSuccessRate":null,"avgOutputQuality":null,"avgTimeToUsefulMs":null,"productionOutcomes":0,"humanReviewRequired":0,"recentSuccessRate":null,"recentFailureRate":null,"uniqueAgents":0,"agentProvenScore":0,"agentProvenLabel":"Needs first agent run","lastOutcomeAt":null,"label":"No agent outcome data yet"},"autoInstall":{"allowed":false,"sandboxRequired":true,"policy":"human_review_before_install","reason":"Human review or sandbox validation is required before automatic installation."},"bestFor":["ml-automation","machine-learning","automation","ml-media","evaluation","evaluation-metrics"],"doNotUseFor":["Production credentials, payments, or irreversible account changes without explicit human review","Sensitive private data before reviewing repository code, license, and permission surface","Automatic installation in a production workspace","Commercial reuse before clarifying license terms"],"knownRisks":["License is unclear","Quality score needs review","License clarity: Unknown"]},"safety":{"score":64,"level":"review_before_install","label":"Review before install","safety_tier":{"tier":"reviewed","label":"Reviewed with permission notes","badge":"REVIEWED","summary":"Usable candidate, but the agent should surface permission and audit notes before installation.","recommended_action":"Require human approval before installing into a real workspace.","auto_install_policy":"review","reasons":["License is unclear","64/100 agent safety score"]},"auto_install_allowed":false,"human_review_required":true,"blocked":false,"audit_risk":"needs_review","permission_hints":[{"id":"network","label":"Network access","reason":"Skill likely fetches remote pages, APIs, repositories, or external services.","severity":"medium"},{"id":"filesystem","label":"Filesystem access","reason":"Skill may read or write project files, documents, generated artifacts, or local workspace state.","severity":"medium"}],"policy_warnings":["License is unclear"],"constraints_applied":{"max_risk":"medium","needs_install_command":true,"min_stars":0}},"safety_gate":{"tier":"reviewed","label":"Reviewed with permission notes","badge":"REVIEWED","auto_install_policy":"review","auto_install_allowed":false,"human_review_required":true,"blocked":false,"recommended_action":"Require human approval before installing into a real workspace.","reasons":["License is unclear","64/100 agent safety score"]},"supply_profile":{"track":{"slug":"data","label":"Data, BI, and analytics","shortLabel":"Data","description":"CSV, SQL, notebooks, dashboards, data pipelines, BI, ETL, and spreadsheet analysis."},"scenario":{"label":"Data analysis","description":"I need my agent to analyze CSV data, produce insights, and explain trends.","useCases":[{"slug":"rag-knowledge","title":"RAG and knowledge"},{"slug":"coding-agents","title":"Coding agents"},{"slug":"browser-automation","title":"Browser automation"}]},"applicableAgents":["Claude Code","CLI","Codex","Cursor","Jupyter Notebook"],"install":{"ready":true,"command":"npx skills add huggingface/evaluation-guidebook","primaryTarget":"CLI","targetCount":4},"githubQuality":{"stars":2124,"starsLabel":"2.1K","forks":123,"license":"Unknown","qualityScore":82,"trustScore":83,"auditScore":80},"maintenance":{"status":"stable","label":"7mo since push","daysSincePush":212,"lastPushedAt":"2025-12-03T14:45:05+00:00"},"risk":{"level":"needs_review","label":"Needs review","requiresReview":true,"notes":["License is unclear","Quality score needs review","License clarity: Unknown","Needs review"]},"coverageTags":["Data","Data analysis","ml-automation","machine-learning","automation","ml-media","evaluation","evaluation-metrics"]},"audit":{"audit_score":80,"risk_level":"needs_review","risk_label":"Needs review","warnings":["License is unclear","Quality score needs review","License clarity: Unknown"]},"decision":{"readiness_score":84,"readiness_label":"Production-ready","headline":"Primary pick for RAG and knowledge","role":"Primary pick","primary_fit":"RAG and knowledge","best_for":["RAG and knowledge workflows","Claude Code teams","teams that value GitHub adoption signals"],"risks":["No OpenAgentSkill engagement data yet"],"next_steps":["Install it in a sandbox agent and run one RAG and knowledge task end to end.","Compare output quality, latency, and failure behavior against at least one alternative.","Promote it into production only after reviewing repository permissions, license, and maintenance signals."]},"agent_readable_metadata":{"version":"openagentskill-agent-metadata-v2","skill":{"slug":"huggingface-evaluation-guidebook","name":"Evaluation Guidebook","description":"Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","category":"ml-automation","url":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook","repository":"https://github.com/huggingface/evaluation-guidebook","github_repo":"huggingface/evaluation-guidebook"},"suited_tasks":["RAG and knowledge workflows","Claude Code teams","teams that value GitHub adoption signals","Chunk documents","Create embeddings","Retrieve and cite relevant passages","Inspect source files","Explain architecture"],"suited_agents":["Jupyter Notebook","Machine Learning","Codex","Claude Code","Cursor","OpenAgentSkill CLI","CLI"],"install":{"command":"npx skills add huggingface/evaluation-guidebook","ready":true,"targets":[{"id":"openagentskill-cli","label":"CLI","kind":"command","value":"npx skills add huggingface/evaluation-guidebook"},{"id":"codex","label":"Codex","kind":"agent-prompt","value":"Install the \"Evaluation Guidebook\" agent skill from https://github.com/huggingface/evaluation-guidebook. Read its SKILL.md or equivalent instructions first, install only the files needed for this workspace, and summarize any required setup before using it. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"},{"id":"claude-code","label":"Claude Code","kind":"agent-prompt","value":"Add \"Evaluation Guidebook\" as a Claude Code skill from https://github.com/huggingface/evaluation-guidebook. Inspect the skill instructions, place the reusable skill files in the appropriate local skills location for this project, and report the activation steps. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"},{"id":"cursor","label":"Cursor","kind":"agent-prompt","value":"Turn \"Evaluation Guidebook\" from https://github.com/huggingface/evaluation-guidebook into a reusable Cursor project rule or agent instruction. Preserve the core workflow, adapt paths to this repo, and keep the rule scoped to tasks where it is relevant. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"}],"handoff_url":"https://www.openagentskill.com/api/skills/huggingface-evaluation-guidebook/install","manifest_url":"https://www.openagentskill.com/api/registry/manifest/huggingface-evaluation-guidebook"},"trust":{"score":83,"label":"Strong shortlist","version":"trust-score-v4","install_policy":"human_review_before_install","evidence":{"stars":"2.1K GitHub stars","repoActivity":"2.1K stars, 123 forks","lastPushed":"7mo since push","license":"Unknown","repository":"https://github.com/huggingface/evaluation-guidebook","install":"npx skills add huggingface/evaluation-guidebook","installSafety":"standard package or runtime install path","permissionSurface":"filesystem or document access","documentation":"Strong README/SKILL.md context","agentOutcomes":"No agent outcome data yet"},"outcome_evidence":{"total":0,"successes":0,"failures":0,"not_relevant":0,"success_rate":null,"recent_success_rate":null,"recent_failure_rate":null,"install_attempts":0,"install_success_rate":null,"risk_blocked":0,"setup_required":0,"avg_output_quality":null,"production_outcomes":0,"last_outcome_at":null,"label":"No agent outcome data yet"},"auto_install":{"allowed":false,"sandbox_required":true,"reason":"Human review or sandbox validation is required before automatic installation."},"best_for":["ml-automation","machine-learning","automation","ml-media","evaluation","evaluation-metrics"],"known_risks":["License is unclear","Quality score needs review","License clarity: Unknown"]},"agent_proven":{"version":"agent-proven-v1","score":0,"tier":"unproven","label":"Needs first agent run","summary":"No agent outcome reports yet. Use Resolve, run one narrow sandbox task, then report the result.","metrics":{"totalOutcomes":0,"successfulOutcomes":0,"failedOutcomes":0,"installAttempts":0,"installSuccessRate":null,"successRate":null,"recentSuccessRate":null,"recentFailureRate":null,"riskBlocked":0,"setupRequired":0,"notRelevant":0,"avgOutputQuality":null,"avgTimeToUsefulMs":null,"productionOutcomes":0,"humanReviewRequired":0,"uniqueAgents":0,"lastOutcomeAt":null},"signals":[],"penalties":["No real agent outcome evidence yet"]},"audit":{"score":80,"risk_level":"needs_review","risk_label":"Needs review","warnings":["License is unclear","Quality score needs review","License clarity: Unknown"]},"safety_gate":{"tier":"reviewed","label":"Reviewed with permission notes","auto_install_policy":"review","auto_install_allowed":false,"human_review_required":true,"blocked":false,"recommended_action":"Require human approval before installing into a real workspace."},"quality":{"score":82,"label":"Strong"},"supply":{"track":"Data, BI, and analytics","scenario":"Data analysis","maintenance":"7mo since push","risk":"Needs review"},"alternative_skills":[],"do_not_use_when":["teams that need a vendor-supported SLA","high-compliance environments without internal security review","No OpenAgentSkill engagement data yet","License is unclear","Quality score needs review","License clarity: Unknown","Production credentials, payments, or irreversible account changes without explicit human review","Sensitive private data before reviewing repository code, license, and permission surface"],"agent_contract":{"task_input":"Use Evaluation Guidebook in an agent workflow","recommended_action":"Require human approval before installing into a real workspace.","install_policy":"review","minimum_review_before_use":["Trust: 83/100 Strong shortlist","Audit: 80/100 Needs review","Safety: 64/100 Review before install","Review repository, license, install command, and permission surface before production use."],"expected_agent_output":{"selected_skill":"huggingface-evaluation-guidebook (Evaluation Guidebook)","install_command":"npx skills add huggingface/evaluation-guidebook","risk_summary":"Needs review; Reviewed with permission notes; Review before production","verification_result":"Report the smallest successful task, files touched, warnings, and any missing setup."}},"outcome_feedback":{"endpoint":"https://www.openagentskill.com/api/agent/outcome","method":"POST","requires_resolve_event_id":true,"event_id_source":"Use install_receipt.outcome_feedback.event_id or feedback.event_id returned by /api/agent/resolve for the current task.","expected_outcomes":["success","failed","not_relevant","blocked_by_risk","setup_required"],"payload_template":{"event_id":"<install_receipt.outcome_feedback.event_id or feedback.event_id from /api/agent/resolve>","skill_slug":"huggingface-evaluation-guidebook","task":"Use Evaluation Guidebook in an agent workflow","agent":"codex","outcome":"success","install_used":true,"risk_blocked":false,"setup_required":false,"task_success":true,"output_quality":4,"error_type":null,"human_review_required":false,"workspace":"sandbox","time_to_useful_ms":120000,"notes":"Report the smallest successful task, setup friction, files touched, and risk notes."}},"endpoints":{"web":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook","api":"https://www.openagentskill.com/api/agent/skills/huggingface-evaluation-guidebook","audit":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook/audit","eval":"https://www.openagentskill.com/api/agent/evals?slug=huggingface-evaluation-guidebook&task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&max_risk=medium","resolve":"https://www.openagentskill.com/api/agent/resolve?task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&agent=codex&max_risk=medium","receipt":"https://www.openagentskill.com/api/agent/receipt?task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&agent=codex&max_risk=medium&format=text","install":"https://www.openagentskill.com/api/skills/huggingface-evaluation-guidebook/install","manifest":"https://www.openagentskill.com/api/registry/manifest/huggingface-evaluation-guidebook"}},"machine_metadata":{"version":"openagentskill-agent-metadata-v2","skill":{"slug":"huggingface-evaluation-guidebook","name":"Evaluation Guidebook","description":"Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","category":"ml-automation","url":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook","repository":"https://github.com/huggingface/evaluation-guidebook","github_repo":"huggingface/evaluation-guidebook"},"suited_tasks":["RAG and knowledge workflows","Claude Code teams","teams that value GitHub adoption signals","Chunk documents","Create embeddings","Retrieve and cite relevant passages","Inspect source files","Explain architecture"],"suited_agents":["Jupyter Notebook","Machine Learning","Codex","Claude Code","Cursor","OpenAgentSkill CLI","CLI"],"install":{"command":"npx skills add huggingface/evaluation-guidebook","ready":true,"targets":[{"id":"openagentskill-cli","label":"CLI","kind":"command","value":"npx skills add huggingface/evaluation-guidebook"},{"id":"codex","label":"Codex","kind":"agent-prompt","value":"Install the \"Evaluation Guidebook\" agent skill from https://github.com/huggingface/evaluation-guidebook. Read its SKILL.md or equivalent instructions first, install only the files needed for this workspace, and summarize any required setup before using it. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"},{"id":"claude-code","label":"Claude Code","kind":"agent-prompt","value":"Add \"Evaluation Guidebook\" as a Claude Code skill from https://github.com/huggingface/evaluation-guidebook. Inspect the skill instructions, place the reusable skill files in the appropriate local skills location for this project, and report the activation steps. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"},{"id":"cursor","label":"Cursor","kind":"agent-prompt","value":"Turn \"Evaluation Guidebook\" from https://github.com/huggingface/evaluation-guidebook into a reusable Cursor project rule or agent instruction. Preserve the core workflow, adapt paths to this repo, and keep the rule scoped to tasks where it is relevant. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!"}],"handoff_url":"https://www.openagentskill.com/api/skills/huggingface-evaluation-guidebook/install","manifest_url":"https://www.openagentskill.com/api/registry/manifest/huggingface-evaluation-guidebook"},"trust":{"score":83,"label":"Strong shortlist","version":"trust-score-v4","install_policy":"human_review_before_install","evidence":{"stars":"2.1K GitHub stars","repoActivity":"2.1K stars, 123 forks","lastPushed":"7mo since push","license":"Unknown","repository":"https://github.com/huggingface/evaluation-guidebook","install":"npx skills add huggingface/evaluation-guidebook","installSafety":"standard package or runtime install path","permissionSurface":"filesystem or document access","documentation":"Strong README/SKILL.md context","agentOutcomes":"No agent outcome data yet"},"outcome_evidence":{"total":0,"successes":0,"failures":0,"not_relevant":0,"success_rate":null,"recent_success_rate":null,"recent_failure_rate":null,"install_attempts":0,"install_success_rate":null,"risk_blocked":0,"setup_required":0,"avg_output_quality":null,"production_outcomes":0,"last_outcome_at":null,"label":"No agent outcome data yet"},"auto_install":{"allowed":false,"sandbox_required":true,"reason":"Human review or sandbox validation is required before automatic installation."},"best_for":["ml-automation","machine-learning","automation","ml-media","evaluation","evaluation-metrics"],"known_risks":["License is unclear","Quality score needs review","License clarity: Unknown"]},"agent_proven":{"version":"agent-proven-v1","score":0,"tier":"unproven","label":"Needs first agent run","summary":"No agent outcome reports yet. Use Resolve, run one narrow sandbox task, then report the result.","metrics":{"totalOutcomes":0,"successfulOutcomes":0,"failedOutcomes":0,"installAttempts":0,"installSuccessRate":null,"successRate":null,"recentSuccessRate":null,"recentFailureRate":null,"riskBlocked":0,"setupRequired":0,"notRelevant":0,"avgOutputQuality":null,"avgTimeToUsefulMs":null,"productionOutcomes":0,"humanReviewRequired":0,"uniqueAgents":0,"lastOutcomeAt":null},"signals":[],"penalties":["No real agent outcome evidence yet"]},"audit":{"score":80,"risk_level":"needs_review","risk_label":"Needs review","warnings":["License is unclear","Quality score needs review","License clarity: Unknown"]},"safety_gate":{"tier":"reviewed","label":"Reviewed with permission notes","auto_install_policy":"review","auto_install_allowed":false,"human_review_required":true,"blocked":false,"recommended_action":"Require human approval before installing into a real workspace."},"quality":{"score":82,"label":"Strong"},"supply":{"track":"Data, BI, and analytics","scenario":"Data analysis","maintenance":"7mo since push","risk":"Needs review"},"alternative_skills":[],"do_not_use_when":["teams that need a vendor-supported SLA","high-compliance environments without internal security review","No OpenAgentSkill engagement data yet","License is unclear","Quality score needs review","License clarity: Unknown","Production credentials, payments, or irreversible account changes without explicit human review","Sensitive private data before reviewing repository code, license, and permission surface"],"agent_contract":{"task_input":"Use Evaluation Guidebook in an agent workflow","recommended_action":"Require human approval before installing into a real workspace.","install_policy":"review","minimum_review_before_use":["Trust: 83/100 Strong shortlist","Audit: 80/100 Needs review","Safety: 64/100 Review before install","Review repository, license, install command, and permission surface before production use."],"expected_agent_output":{"selected_skill":"huggingface-evaluation-guidebook (Evaluation Guidebook)","install_command":"npx skills add huggingface/evaluation-guidebook","risk_summary":"Needs review; Reviewed with permission notes; Review before production","verification_result":"Report the smallest successful task, files touched, warnings, and any missing setup."}},"outcome_feedback":{"endpoint":"https://www.openagentskill.com/api/agent/outcome","method":"POST","requires_resolve_event_id":true,"event_id_source":"Use install_receipt.outcome_feedback.event_id or feedback.event_id returned by /api/agent/resolve for the current task.","expected_outcomes":["success","failed","not_relevant","blocked_by_risk","setup_required"],"payload_template":{"event_id":"<install_receipt.outcome_feedback.event_id or feedback.event_id from /api/agent/resolve>","skill_slug":"huggingface-evaluation-guidebook","task":"Use Evaluation Guidebook in an agent workflow","agent":"codex","outcome":"success","install_used":true,"risk_blocked":false,"setup_required":false,"task_success":true,"output_quality":4,"error_type":null,"human_review_required":false,"workspace":"sandbox","time_to_useful_ms":120000,"notes":"Report the smallest successful task, setup friction, files touched, and risk notes."}},"endpoints":{"web":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook","api":"https://www.openagentskill.com/api/agent/skills/huggingface-evaluation-guidebook","audit":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook/audit","eval":"https://www.openagentskill.com/api/agent/evals?slug=huggingface-evaluation-guidebook&task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&max_risk=medium","resolve":"https://www.openagentskill.com/api/agent/resolve?task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&agent=codex&max_risk=medium","receipt":"https://www.openagentskill.com/api/agent/receipt?task=Use%20Evaluation%20Guidebook%20in%20an%20agent%20workflow&agent=codex&max_risk=medium&format=text","install":"https://www.openagentskill.com/api/skills/huggingface-evaluation-guidebook/install","manifest":"https://www.openagentskill.com/api/registry/manifest/huggingface-evaluation-guidebook"}},"platforms":["Jupyter Notebook","Machine Learning","Claude Code"],"use_cases":[{"slug":"rag-knowledge","title":"RAG and knowledge","url":"https://www.openagentskill.com/use-cases/rag-knowledge"},{"slug":"coding-agents","title":"Coding agents","url":"https://www.openagentskill.com/use-cases/coding-agents"},{"slug":"browser-automation","title":"Browser automation","url":"https://www.openagentskill.com/use-cases/browser-automation"},{"slug":"workflow-automation","title":"Workflow automation","url":"https://www.openagentskill.com/use-cases/workflow-automation"}],"install":"npx skills add huggingface/evaluation-guidebook","install_targets":[{"id":"openagentskill-cli","label":"CLI","title":"OpenAgentSkill CLI","kind":"command","value":"npx skills add huggingface/evaluation-guidebook","description":"Use the registry command when your workflow supports the OpenAgentSkill installer.","copyLabel":"Copy command"},{"id":"codex","label":"Codex","title":"Codex install prompt","kind":"agent-prompt","value":"Install the \"Evaluation Guidebook\" agent skill from https://github.com/huggingface/evaluation-guidebook. Read its SKILL.md or equivalent instructions first, install only the files needed for this workspace, and summarize any required setup before using it. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","description":"Give Codex a repo-aware install prompt when the skill is not available through a local CLI.","copyLabel":"Copy prompt"},{"id":"claude-code","label":"Claude Code","title":"Claude Code skill prompt","kind":"agent-prompt","value":"Add \"Evaluation Guidebook\" as a Claude Code skill from https://github.com/huggingface/evaluation-guidebook. Inspect the skill instructions, place the reusable skill files in the appropriate local skills location for this project, and report the activation steps. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","description":"Use this prompt to ask Claude Code to add the skill and explain the local activation steps.","copyLabel":"Copy prompt"},{"id":"cursor","label":"Cursor","title":"Cursor rule prompt","kind":"agent-prompt","value":"Turn \"Evaluation Guidebook\" from https://github.com/huggingface/evaluation-guidebook into a reusable Cursor project rule or agent instruction. Preserve the core workflow, adapt paths to this repo, and keep the rule scoped to tasks where it is relevant. Skill purpose: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!","description":"Use this when installing as Cursor project rules or reusable agent instructions.","copyLabel":"Copy prompt"}],"repository":"https://github.com/huggingface/evaluation-guidebook","github_repo":"huggingface/evaluation-guidebook","version":"1.0.0","license":"Unknown","updated_at":"2026-06-20T22:55:42.25769+00:00","canonical_key":"huggingface/evaluation-guidebook","recommendation_reasons":["Useful GitHub adoption: 2,124 stars","Install handoff is available","Repository freshness signal is available"],"urls":{"web":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook","api":"https://www.openagentskill.com/api/agent/skills/huggingface-evaluation-guidebook","install_api":"https://www.openagentskill.com/api/skills/huggingface-evaluation-guidebook/install","audit":"https://www.openagentskill.com/skills/huggingface-evaluation-guidebook/audit","repository":"https://github.com/huggingface/evaluation-guidebook"},"meta":{"endpoint":"/api/registry/manifest/{slug}","canonical_agent_endpoint":"/api/agent/skills/huggingface-evaluation-guidebook","agent_friendly":true,"api_version":"1.0","generated_at":"2026-07-03T21:42:16.810Z"}}