{"eval":{"version":"openagentskill-skill-eval-v1","slug":"jeinlee1991-chinese-llm-benchmark","name":"Chinese Llm Benchmark","generated_at":"2026-07-04T01:47:34.317Z","task_input":"Evaluate Chinese Llm Benchmark before installing it in an AI agent workflow","status":"review","score":90,"risk_level":"medium","decision":{"recommendation":"manual_review","reason":"Review the audit page, then allow agent install in a sandboxed workflow.","auto_install_allowed":false,"policy":"review","human_review_required":true},"task_fit":{"score":94,"suited_tasks":["GitHub automation workflows","Claude Code teams","teams that value GitHub adoption signals","Inspect repository metadata","Compare code changes","Write concise engineering summaries","Inspect source files","Explain architecture"],"suited_agents":["LLM","Codex","Claude Code","Cursor","OpenAgentSkill CLI","OpenAI Agents","CLI"]},"install":{"command":"npx skills add jeinlee1991/chinese-llm-benchmark","ready":true,"policy":"review","safety_label":"Review before install","targets":[{"id":"openagentskill-cli","label":"CLI","kind":"command","value":"npx skills add jeinlee1991/chinese-llm-benchmark"},{"id":"codex","label":"Codex","kind":"agent-prompt","value":"Install the \"Chinese Llm Benchmark\" agent skill from https://github.com/jeinlee1991/chinese-llm-benchmark. Read its SKILL.md or equivalent instructions first, install only the files needed for this workspace, and summarize any required setup before using it. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"},{"id":"claude-code","label":"Claude Code","kind":"agent-prompt","value":"Add \"Chinese Llm Benchmark\" as a Claude Code skill from https://github.com/jeinlee1991/chinese-llm-benchmark. Inspect the skill instructions, place the reusable skill files in the appropriate local skills location for this project, and report the activation steps. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"},{"id":"cursor","label":"Cursor","kind":"agent-prompt","value":"Turn \"Chinese Llm Benchmark\" from https://github.com/jeinlee1991/chinese-llm-benchmark into a reusable Cursor project rule or agent instruction. Preserve the core workflow, adapt paths to this repo, and keep the rule scoped to tasks where it is relevant. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"}]},"trust":{"score":89,"label":"Production candidate","version":"trust-score-v4","evidence":{"stars":"6.2K GitHub stars","repoActivity":"6.2K stars, 250 forks","lastPushed":"26d since push","license":"Unknown","repository":"https://github.com/jeinlee1991/chinese-llm-benchmark","install":"npx skills add jeinlee1991/chinese-llm-benchmark","installSafety":"standard package or runtime install path","permissionSurface":"filesystem or document access","documentation":"Strong README/SKILL.md context","agentOutcomes":"No agent outcome data yet"}},"audit":{"score":93,"risk_level":"safe_to_try","risk_label":"Safe to try","warnings":["License is unclear","License clarity: Unknown"]},"safety_gate":{"score":77,"tier":"reviewed","label":"Reviewed","auto_install_policy":"review","blocked":false,"permission_hints":[{"id":"network","label":"Network access","reason":"Skill likely fetches remote pages, APIs, repositories, or external services.","severity":"medium"},{"id":"filesystem","label":"Filesystem access","reason":"Skill may read or write project files, documents, generated artifacts, or local workspace state.","severity":"medium"}],"policy_warnings":["License is unclear"]},"checks":[{"id":"task_fit","label":"Task fit","status":"pass","score":94,"required_for_auto_install":true,"detail":"Task wording matches this skill metadata.","evidence":["Evaluate Chinese Llm Benchmark before installing it in an AI agent workflow","agent-frameworks","GitHub automation workflows; Claude Code teams; teams that value GitHub adoption signals"]},{"id":"install_path","label":"Install path","status":"pass","score":92,"required_for_auto_install":true,"detail":"Install handoff is available.","evidence":["npx skills add jeinlee1991/chinese-llm-benchmark"]},{"id":"install_safety","label":"Install command safety","status":"pass","score":92,"required_for_auto_install":true,"detail":"standard package or runtime install path","evidence":["npx skills add jeinlee1991/chinese-llm-benchmark"]},{"id":"trust_score","label":"Trust score","status":"pass","score":89,"required_for_auto_install":true,"detail":"Strong OpenAgentSkill Trust Score across adoption, recent maintenance, license clarity, documentation, dependency/runtime risk, install safety, permission surface, and install availability.","evidence":["Production candidate","6.2K GitHub stars","Unknown"]},{"id":"audit_score","label":"Audit score","status":"pass","score":93,"required_for_auto_install":true,"detail":"Safe to try","evidence":["License is unclear"]},{"id":"agent_safety_gate","label":"Agent safety gate","status":"warn","score":77,"required_for_auto_install":true,"detail":"Good audit and safety signals with no high-risk permission hints in public metadata.","evidence":["Review the audit page, then allow agent install in a sandboxed workflow.","Safe-to-try audit"]},{"id":"readme_skillmd_completeness","label":"README/SKILL.md completeness","status":"pass","score":90,"required_for_auto_install":false,"detail":"Metadata includes enough usage and workflow context","evidence":["Strong README/SKILL.md context"]},{"id":"license_clarity","label":"License clarity","status":"warn","score":42,"required_for_auto_install":true,"detail":"Unknown","evidence":["Unknown"]},{"id":"recent_maintenance","label":"Recent maintenance","status":"pass","score":100,"required_for_auto_install":false,"detail":"26d since push","evidence":["26d since push"]},{"id":"permission_surface","label":"Permission surface","status":"pass","score":86,"required_for_auto_install":true,"detail":"filesystem or document access","evidence":["Network access: medium","Filesystem access: medium"]},{"id":"alternatives","label":"Alternatives available","status":"pass","score":82,"required_for_auto_install":false,"detail":"Alternative skills are available for comparison.","evidence":["significant-gravitas-autogpt","langchain-ai-langchain","nousresearch-hermes-agent","firecrawl-firecrawl"]}],"blockers":[],"warnings":["Agent safety gate: Good audit and safety signals with no high-risk permission hints in public metadata.","License clarity: Unknown","License is unclear"],"validation_plan":["Inspect repository, README/SKILL.md, license, and recent commits before production use.","Install in an isolated workspace or sandbox with no production secrets available.","Run the smallest representative task and record files touched, commands run, network access, and outputs.","Compare the selected skill against at least one alternative when the eval status is review or failed.","Promote only after the agent reports a successful verification result and unresolved warnings are accepted."],"do_not_use_when":["teams that need a vendor-supported SLA","high-compliance environments without internal security review","No major risk signals from current metadata","License is unclear","License clarity: Unknown","Production credentials, payments, or irreversible account changes without explicit human review","Sensitive private data before reviewing repository code, license, and permission surface","Automatic installation in a production workspace"],"alternatives":[{"slug":"significant-gravitas-autogpt","name":"AutoGPT","url":"https://www.openagentskill.com/skills/significant-gravitas-autogpt","stars":185244,"install_command":"npx skills add Significant-Gravitas/AutoGPT","trust_score":86,"audit_score":92},{"slug":"langchain-ai-langchain","name":"Langchain","url":"https://www.openagentskill.com/skills/langchain-ai-langchain","stars":140782,"install_command":"npx skills add langchain-ai/langchain","trust_score":92,"audit_score":95},{"slug":"nousresearch-hermes-agent","name":"Hermes Agent","url":"https://www.openagentskill.com/skills/nousresearch-hermes-agent","stars":205451,"install_command":"npx skills add NousResearch/hermes-agent","trust_score":92,"audit_score":95},{"slug":"firecrawl-firecrawl","name":"Firecrawl","url":"https://www.openagentskill.com/skills/firecrawl-firecrawl","stars":139273,"install_command":"npx skills add firecrawl/firecrawl","trust_score":91,"audit_score":94}],"machine_metadata":{"version":"openagentskill-agent-metadata-v2","skill":{"slug":"jeinlee1991-chinese-llm-benchmark","name":"Chinese Llm Benchmark","description":"非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。","category":"agent-frameworks","url":"https://www.openagentskill.com/skills/jeinlee1991-chinese-llm-benchmark","repository":"https://github.com/jeinlee1991/chinese-llm-benchmark","github_repo":"jeinlee1991/chinese-llm-benchmark"},"suited_tasks":["GitHub automation workflows","Claude Code teams","teams that value GitHub adoption signals","Inspect repository metadata","Compare code changes","Write concise engineering summaries","Inspect source files","Explain architecture"],"suited_agents":["LLM","Codex","Claude Code","Cursor","OpenAgentSkill CLI","OpenAI Agents","CLI"],"install":{"command":"npx skills add jeinlee1991/chinese-llm-benchmark","ready":true,"targets":[{"id":"openagentskill-cli","label":"CLI","kind":"command","value":"npx skills add jeinlee1991/chinese-llm-benchmark"},{"id":"codex","label":"Codex","kind":"agent-prompt","value":"Install the \"Chinese Llm Benchmark\" agent skill from https://github.com/jeinlee1991/chinese-llm-benchmark. Read its SKILL.md or equivalent instructions first, install only the files needed for this workspace, and summarize any required setup before using it. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"},{"id":"claude-code","label":"Claude Code","kind":"agent-prompt","value":"Add \"Chinese Llm Benchmark\" as a Claude Code skill from https://github.com/jeinlee1991/chinese-llm-benchmark. Inspect the skill instructions, place the reusable skill files in the appropriate local skills location for this project, and report the activation steps. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"},{"id":"cursor","label":"Cursor","kind":"agent-prompt","value":"Turn \"Chinese Llm Benchmark\" from https://github.com/jeinlee1991/chinese-llm-benchmark into a reusable Cursor project rule or agent instruction. Preserve the core workflow, adapt paths to this repo, and keep the rule scoped to tasks where it is relevant. Skill purpose: 非线智能 NoneLinear - ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括374个大模型，覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤senseChat等商用模型， 以及step3.5-flash、kimi-k2.6、ernie4.5、MiniMax-M2.7、deepseek-v4、Qwen3.6、llama4、智谱GLM-5.1、MiMo-V2、LongCat、gemma4、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。"}],"handoff_url":"https://www.openagentskill.com/api/skills/jeinlee1991-chinese-llm-benchmark/install","manifest_url":"https://www.openagentskill.com/api/registry/manifest/jeinlee1991-chinese-llm-benchmark"},"trust":{"score":89,"label":"Production candidate","version":"trust-score-v4","install_policy":"human_review_before_install","evidence":{"stars":"6.2K GitHub stars","repoActivity":"6.2K stars, 250 forks","lastPushed":"26d since push","license":"Unknown","repository":"https://github.com/jeinlee1991/chinese-llm-benchmark","install":"npx skills add jeinlee1991/chinese-llm-benchmark","installSafety":"standard package or runtime install path","permissionSurface":"filesystem or document access","documentation":"Strong README/SKILL.md context","agentOutcomes":"No agent outcome data yet"},"outcome_evidence":{"total":0,"successes":0,"failures":0,"not_relevant":0,"success_rate":null,"recent_success_rate":null,"recent_failure_rate":null,"install_attempts":0,"install_success_rate":null,"risk_blocked":0,"setup_required":0,"avg_output_quality":null,"production_outcomes":0,"last_outcome_at":null,"label":"No agent outcome data yet"},"auto_install":{"allowed":false,"sandbox_required":true,"reason":"Human review or sandbox validation is required before automatic installation."},"best_for":["agent-frameworks","llm-agent","agents","agentic-ai","artificial-intelligence","llm-evaluation"],"known_risks":["License is unclear","License clarity: Unknown"]},"agent_proven":{"version":"agent-proven-v1","score":0,"tier":"unproven","label":"Needs first agent run","summary":"No agent outcome reports yet. Use Resolve, run one narrow sandbox task, then report the result.","metrics":{"totalOutcomes":0,"successfulOutcomes":0,"failedOutcomes":0,"installAttempts":0,"installSuccessRate":null,"successRate":null,"recentSuccessRate":null,"recentFailureRate":null,"riskBlocked":0,"setupRequired":0,"notRelevant":0,"avgOutputQuality":null,"avgTimeToUsefulMs":null,"productionOutcomes":0,"humanReviewRequired":0,"uniqueAgents":0,"lastOutcomeAt":null},"signals":[],"penalties":["No real agent outcome evidence yet"]},"audit":{"score":93,"risk_level":"safe_to_try","risk_label":"Safe to try","warnings":["License is unclear","License clarity: Unknown"]},"safety_gate":{"tier":"reviewed","label":"Reviewed","auto_install_policy":"review","auto_install_allowed":false,"human_review_required":true,"blocked":false,"recommended_action":"Review the audit page, then allow agent install in a sandboxed workflow."},"quality":{"score":100,"label":"Excellent"},"supply":{"track":"Coding and developer agents","scenario":"GitHub automation","maintenance":"26d since push","risk":"Safe to try"},"alternative_skills":[{"slug":"significant-gravitas-autogpt","name":"AutoGPT","url":"https://www.openagentskill.com/skills/significant-gravitas-autogpt","stars":185244,"install_command":"npx skills add Significant-Gravitas/AutoGPT","trust_score":86,"audit_score":92},{"slug":"langchain-ai-langchain","name":"Langchain","url":"https://www.openagentskill.com/skills/langchain-ai-langchain","stars":140782,"install_command":"npx skills add langchain-ai/langchain","trust_score":92,"audit_score":95},{"slug":"nousresearch-hermes-agent","name":"Hermes Agent","url":"https://www.openagentskill.com/skills/nousresearch-hermes-agent","stars":205451,"install_command":"npx skills add NousResearch/hermes-agent","trust_score":92,"audit_score":95},{"slug":"firecrawl-firecrawl","name":"Firecrawl","url":"https://www.openagentskill.com/skills/firecrawl-firecrawl","stars":139273,"install_command":"npx skills add firecrawl/firecrawl","trust_score":91,"audit_score":94}],"do_not_use_when":["teams that need a vendor-supported SLA","high-compliance environments without internal security review","No major risk signals from current metadata","License is unclear","License clarity: Unknown","Production credentials, payments, or irreversible account changes without explicit human review","Sensitive private data before reviewing repository code, license, and permission surface","Automatic installation in a production workspace"],"agent_contract":{"task_input":"Evaluate Chinese Llm Benchmark before installing it in an AI agent workflow","recommended_action":"Review the audit page, then allow agent install in a sandboxed workflow.","install_policy":"review","minimum_review_before_use":["Trust: 89/100 Production candidate","Audit: 93/100 Safe to try","Safety: 77/100 Review before install","Review repository, license, install command, and permission surface before production use."],"expected_agent_output":{"selected_skill":"jeinlee1991-chinese-llm-benchmark (Chinese Llm Benchmark)","install_command":"npx skills add jeinlee1991/chinese-llm-benchmark","risk_summary":"Safe to try; Reviewed; Review before production","verification_result":"Report the smallest successful task, files touched, warnings, and any missing setup."}},"outcome_feedback":{"endpoint":"https://www.openagentskill.com/api/agent/outcome","method":"POST","requires_resolve_event_id":true,"event_id_source":"Use install_receipt.outcome_feedback.event_id or feedback.event_id returned by /api/agent/resolve for the current task.","expected_outcomes":["success","failed","not_relevant","blocked_by_risk","setup_required"],"payload_template":{"event_id":"<install_receipt.outcome_feedback.event_id or feedback.event_id from /api/agent/resolve>","skill_slug":"jeinlee1991-chinese-llm-benchmark","task":"Evaluate Chinese Llm Benchmark before installing it in an AI agent workflow","agent":"codex","outcome":"success","install_used":true,"risk_blocked":false,"setup_required":false,"task_success":true,"output_quality":4,"error_type":null,"human_review_required":false,"workspace":"sandbox","time_to_useful_ms":120000,"notes":"Report the smallest successful task, setup friction, files touched, and risk notes."}},"endpoints":{"web":"https://www.openagentskill.com/skills/jeinlee1991-chinese-llm-benchmark","api":"https://www.openagentskill.com/api/agent/skills/jeinlee1991-chinese-llm-benchmark","audit":"https://www.openagentskill.com/skills/jeinlee1991-chinese-llm-benchmark/audit","eval":"https://www.openagentskill.com/api/agent/evals?slug=jeinlee1991-chinese-llm-benchmark&task=Evaluate%20Chinese%20Llm%20Benchmark%20before%20installing%20it%20in%20an%20AI%20agent%20workflow&max_risk=medium","resolve":"https://www.openagentskill.com/api/agent/resolve?task=Evaluate%20Chinese%20Llm%20Benchmark%20before%20installing%20it%20in%20an%20AI%20agent%20workflow&agent=codex&max_risk=medium","receipt":"https://www.openagentskill.com/api/agent/receipt?task=Evaluate%20Chinese%20Llm%20Benchmark%20before%20installing%20it%20in%20an%20AI%20agent%20workflow&agent=codex&max_risk=medium&format=text","install":"https://www.openagentskill.com/api/skills/jeinlee1991-chinese-llm-benchmark/install","manifest":"https://www.openagentskill.com/api/registry/manifest/jeinlee1991-chinese-llm-benchmark"}},"endpoints":{"web":"https://www.openagentskill.com/skills/jeinlee1991-chinese-llm-benchmark","api":"https://www.openagentskill.com/api/agent/skills/jeinlee1991-chinese-llm-benchmark","eval":"https://www.openagentskill.com/api/agent/evals?slug=jeinlee1991-chinese-llm-benchmark","audit":"https://www.openagentskill.com/skills/jeinlee1991-chinese-llm-benchmark/audit","resolve":"https://www.openagentskill.com/api/agent/resolve?task=Evaluate%20Chinese%20Llm%20Benchmark%20before%20installing%20it%20in%20an%20AI%20agent%20workflow&agent=codex&max_risk=medium"}},"meta":{"endpoint":"/api/agent/evals","mode":"skill_eval","purpose":"Pre-install eval contract for a single skill. Agents should read this before installing a reusable skill.","generated_at":"2026-07-04T01:47:34.317Z"}}