explore_features
Analyze interpretability features including dictionary features, attribution graphs, persona vectors, and cross-domain behavioral persistence to understand model mechanisms.
Instructions
Explore interpretability features: dictionary features, attribution graphs, persona vectors, and cross-domain behavioral persistence. Inspired by Anthropic's mechanistic interpretability research.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| mode | Yes | | |
| domain | No | | |
| compare_domain | No | | |
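
As a minimal sketch of how the schema above might be used, the helper below assembles an arguments object in which `mode` is always present and `domain` and `compare_domain` are included only when supplied. The helper name `build_arguments` is hypothetical, the parameters are assumed to be strings (the schema does not state their types), and the sample values are placeholders because the accepted values for each parameter are not documented here.

```python
from typing import Optional


def build_arguments(mode: str,
                    domain: Optional[str] = None,
                    compare_domain: Optional[str] = None) -> dict:
    """Assemble an arguments object for explore_features (hypothetical helper).

    Only `mode` is marked as required in the input schema; the optional
    fields are omitted entirely when they are not supplied.
    """
    if not mode:
        raise ValueError("mode is required")
    args = {"mode": mode}
    if domain is not None:
        args["domain"] = domain
    if compare_domain is not None:
        args["compare_domain"] = compare_domain
    return args


# Placeholder values only: the valid options are not listed in the schema.
example = build_arguments("<mode>", domain="<domain>")
```

Leaving optional keys out, instead of sending them as empty values, keeps the request aligned with the Required column of the table.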
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |