# Fundamentals
Application-specific tests for AI systems: code first, LLM judges for nuance, humans for ground truth.
## Evaluator Types
| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |
**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.
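A minimal sketch of the code-first rule in plain Python (function names are illustrative, not part of the Phoenix API): deterministic checks stay in code, and only criteria they can't express fall through to an LLM judge.
```python
import json

# Deterministic checks belong in code: fast, cheap, no model calls.
def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

# Criteria these can't capture ("is this answer actually helpful?") go to
# an LLM judge; humans label a sample to calibrate that judge.
```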
## Score Structure
| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | Numeric value in [0, 1] |
| `label` | No* | Categorical label, e.g. `"pass"` / `"fail"` |
| `explanation` | No | Rationale |
*At least one of `score` or `label` is required.
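Illustrative only: the dict below shows the shape of one evaluator result, not necessarily the library's exact class.
```python
# Illustrative shape of a single score record (not a specific Phoenix class).
citation_score = {
    "name": "has_citation",   # evaluator name
    "kind": "code",           # "code", "llm", or "human"
    "score": 1.0,             # 0-1 numeric (optional if label is set)
    "label": "pass",          # categorical label (optional if score is set)
    "explanation": "Output contains a [n]-style citation.",
}
```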
## Binary > Likert
Use pass/fail, not 1-5 scales: criteria are clearer and calibration is easier. Split one fuzzy quality score into several binary checks.
```python
# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),    # Yes/No
    UsesContext(),        # Yes/No
    NoHallucination(),    # Yes/No
]
```
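One way to realize a single binary check is as a pass/fail LLM classifier, reusing the `ClassificationEvaluator` pattern shown below; the prompt text and `{input}`/`{output}` placeholders here are assumptions, not library-provided.
```python
from phoenix.evals import ClassificationEvaluator, LLM

# Sketch of one binary check as a pass/fail classifier.
# Prompt wording and placeholder names are illustrative assumptions.
answers_question = ClassificationEvaluator(
    name="answers_question",
    prompt_template=(
        "Does the response answer the user's question?\n"
        "Question: {input}\nResponse: {output}\n"
        'Reply with "pass" or "fail".'
    ),
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"fail": 0, "pass": 1},
)
```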
## Quick Patterns
### Code Evaluator
```python
import re
from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    # Pass when the output contains a bracketed numeric citation such as [1].
    return bool(re.search(r'\[\d+\]', output))
```
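A quick, decorator-independent sanity check of the pattern itself (the sample strings are made up):
```python
import re

# Spot-check the citation regex on made-up outputs before relying on it.
for text in ["Paris is the capital [1].", "Paris is the capital."]:
    print(text, "->", bool(re.search(r'\[\d+\]', text)))  # True, then False
```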
### LLM Evaluator
```python
from phoenix.evals import ClassificationEvaluator, LLM
evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1},
)
```
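The template is elided above; the sketch below shows one plausible shape, where the `{input}`/`{output}` placeholder names are assumptions that must match the fields mapped into the evaluator.
```python
# Illustrative prompt only; placeholder names ({input}, {output}) are
# assumptions and must match the fields your evaluator receives.
HELPFULNESS_TEMPLATE = """You are judging whether a response is helpful.

Question: {input}
Response: {output}

Answer with exactly one word: "helpful" or "not_helpful"."""
```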
### Run Experiment
```python
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```
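`my_task` wraps the application under test. A minimal sketch, assuming the runner passes each example's input as a dict and that the dataset has a `question` column (both are assumptions; check how your Phoenix version binds dataset fields to task parameters):
```python
def generate_answer(question: str) -> str:
    # Stand-in for the application under test (hypothetical).
    return f"Stub answer to: {question}"

def my_task(input: dict) -> str:
    # Assumes the runner binds the example's input dict to this parameter
    # and that "question" is a dataset column; adjust to your schema.
    return generate_answer(input["question"])
```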