# Experiments: Generating Synthetic Test Data
Creating diverse, targeted test data for evaluation.
## Dimension-Based Approach
Define axes of variation, then generate combinations:
```python
dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
```
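The cross product of these axes gives the candidate test tuples. A minimal sketch using `itertools.product` on the `dimensions` dict above:

```python
from itertools import product

# Every combination of dimension values: 3 * 3 * 3 = 27 tuples here
tuples = list(product(*dimensions.values()))
print(len(tuples))  # 27
print(tuples[0])    # ('billing', 'frustrated', 'simple')
```

If the full cross product grows too large, sampling a subset (e.g. `random.sample(tuples, 50)`) keeps the set manageable while preserving spread across the axes.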
## Two-Step Generation
1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)
```python
# Step 1: create tuples (one per dimension combination)
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: convert each tuple to a natural query
def tuple_to_query(t):
    issue, mood, complexity = t
    prompt = f"""Generate a realistic customer message:
Issue: {issue}, Mood: {mood}, Complexity: {complexity}
Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)  # `llm` is a stand-in for your model client
```
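Running both steps end to end might look like the sketch below; keeping the source tuple next to each generated query lets failures be traced back to the dimension values that produced them (`tuple_to_query` and `llm` as defined above):

```python
# Pair each generated query with the tuple that produced it
dataset = [{"dims": t, "query": tuple_to_query(t)} for t in tuples]
```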
## Target Failure Modes
Dimensions should target known failures from error analysis:
```python
# From error analysis findings: axes drawn from observed failures
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],    # known failure
    "date_format": ["ISO", "US", "EU", "relative"],    # known failure
}
```
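To stress those failure modes directly, crossing just the two failure-prone axes covers every pairing, including likely worst cases such as an ambiguous timezone combined with a relative date. A sketch reusing the dict above:

```python
from itertools import product

# 4 timezones x 4 date formats = 16 pairings, each a targeted test case
failure_tuples = list(product(dimensions["timezone"], dimensions["date_format"]))
```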
## Quality Control
- **Validate**: Check for placeholder text and a minimum length (sketched below)
- **Deduplicate**: Remove near-duplicate queries using embedding similarity (sketched below)
- **Balance**: Ensure coverage across all dimension values
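A minimal sketch of the first two checks. `embed` is a hypothetical helper standing in for whatever embedding client you use, and the placeholder strings, minimum length, and 0.9 similarity threshold are illustrative assumptions:

```python
import numpy as np

PLACEHOLDERS = ("[insert", "lorem ipsum", "<your")  # assumed markers

def validate(query, min_len=20):
    """Reject queries that are too short or contain placeholder text."""
    q = query.lower()
    return len(query) >= min_len and not any(p in q for p in PLACEHOLDERS)

def dedupe(queries, embed, threshold=0.9):
    """Keep a query only if its cosine similarity to every previously
    kept query is below `threshold`. `embed` is assumed to return an
    array of shape (len(queries), dim)."""
    vecs = np.asarray(embed(queries))
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    kept, kept_vecs = [], []
    for query, vec in zip(queries, vecs):
        if all(float(vec @ kv) < threshold for kv in kept_vecs):
            kept.append(query)
            kept_vecs.append(vec)
    return kept
```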
## When to Use
| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |
## Sample Sizes
| Purpose | Examples |
| ------- | -------- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per dimension combination | 10-20 |