CLAUDE CODE MARKETPLACES

phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

npx skills add https://github.com/github/awesome-copilot --skill phoenix-evals
SKILL.md

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Quick Reference

Workflows

Starting Fresh: observe-tracing-setuperror-analysisaxial-codingevaluators-overview

Building Evaluator: fundamentalscommon-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overviewproduction-guardrailsproduction-continuous

Reference Categories

PrefixDescription
fundamentals-*Types, scores, anti-patterns
observe-*Tracing, sampling
error-analysis-*Finding failures
axial-coding-*Categorizing failures
evaluators-*Code, LLM, RAG evaluators
experiments-*Datasets, running experiments
validation-*Validating evaluator accuracy against human labels
production-*CI/CD, monitoring

Key Principles

PrincipleAction
Error analysis firstCan't automate what you haven't observed
Custom > genericBuild from your failures
Code firstDeterministic before LLM
Validate judges>80% TPR/TNR
Binary > LikertPass/fail, not 1-5
Installs665
GitHub Stars33.0k
LanguagePython
AddedJun 11, 2025
View on GitHub