Prompt Caching
Glama supports prompt caching to reduce costs across supported providers and models. Caching behavior, pricing, and configuration differ by provider, as detailed below.
Monitoring
Cache effectiveness can be monitored via the analytics and logs dashboards, or programmatically via the /gateway/v1/completion-requests/:uuid API.
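For programmatic monitoring, the completion-requests endpoint returns the details of a single request. The sketch below is illustrative only: the base URL, the bearer-token auth header, and the presence of cache statistics in a `usage` object are assumptions, not confirmed API details.

```typescript
// Minimal sketch: fetch one completion request and inspect its cache usage.
// The base URL, auth header, and `usage` field shape are assumptions.
async function getCacheUsage(requestId: string): Promise<unknown> {
  const response = await fetch(
    `https://glama.ai/api/gateway/v1/completion-requests/${requestId}`,
    { headers: { Authorization: `Bearer ${process.env.GLAMA_API_KEY}` } },
  );
  if (!response.ok) {
    throw new Error(`Lookup failed: ${response.status} ${response.statusText}`);
  }
  const body = await response.json();
  // Hypothetical field: per-request token usage, including cached tokens.
  console.log(body.usage);
  return body;
}
```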
Provider-Specific Details
OpenAI
- Automatic caching with no configuration needed
- No cost for cache writes
- Cache reads: 50% of original input price
- Minimum prompt size: 1024 tokens
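Because OpenAI caching is prefix-based and automatic, the practical step is to keep the large, reusable part of the prompt (system instructions, tool definitions) identical at the front of every request. Here is a minimal sketch using the `openai` npm package; pointing it at a Glama OpenAI-compatible endpoint via the `baseURL` shown is an assumption.

```typescript
import OpenAI from "openai";

// Assumption: the gateway exposes an OpenAI-compatible endpoint at this URL.
const client = new OpenAI({
  apiKey: process.env.GLAMA_API_KEY,
  baseURL: "https://glama.ai/api/gateway/openai/v1",
});

// Must exceed the 1024-token minimum for automatic caching to apply.
const SYSTEM_PROMPT = "LARGE_REUSABLE_INSTRUCTIONS";

async function ask(question: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      // An identical prefix across calls is what makes the prompt cacheable.
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: question },
    ],
  });
  // Reports how many input tokens were served from the cache at 50% price.
  console.log(completion.usage?.prompt_tokens_details?.cached_tokens);
  return completion.choices[0].message.content;
}
```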
Anthropic Claude
- Requires `cache_control` breakpoints
- Cache writes: 125% of original input price
- Cache reads: 10% of original input price
- Maximum 4 breakpoints with 5-minute expiration
- Best for large text bodies (character cards, RAG data, etc.)
Example Cache Control Usage:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "LARGE_TEXT_CONTENT",
          "cache_control": { "type": "ephemeral" }
        },
        { "type": "text", "text": "Your question here" }
      ]
    }
  ]
}
```
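Sent through an SDK, the same structure looks like the sketch below, which uses the `@anthropic-ai/sdk` package; routing it through the gateway via the `baseURL` shown is an assumption. Anthropic reports cache activity in the response `usage` as `cache_creation_input_tokens` (writes) and `cache_read_input_tokens` (reads), which is the easiest way to confirm a breakpoint is working.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Assumption: the gateway accepts Anthropic-format requests at this URL.
const client = new Anthropic({
  apiKey: process.env.GLAMA_API_KEY,
  baseURL: "https://glama.ai/api/gateway/anthropic",
});

async function askWithCache(question: string) {
  const message = await client.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "LARGE_TEXT_CONTENT",
            // Marks the end of the cacheable prefix.
            cache_control: { type: "ephemeral" },
          },
          { type: "text", text: question },
        ],
      },
    ],
  });
  // The first call pays the 125% write price; repeat calls within the
  // 5-minute window pay the 10% read price for the cached prefix.
  console.log(message.usage.cache_creation_input_tokens);
  console.log(message.usage.cache_read_input_tokens);
  return message;
}
```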
DeepSeek
- Automatic caching with no configuration needed
- Cache writes: Same as original input price
- Cache reads: 10% of original input price
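Since DeepSeek cache writes cost the same as uncached input and reads cost 10%, caching repeated prefixes is pure savings with no write premium. A rough model of the effective per-token input price as a function of cache hit rate:

```typescript
// Effective per-token input price under DeepSeek's cache pricing:
// writes (cache misses) cost 1.0x the base input price, reads cost 0.1x.
function effectiveInputPrice(basePrice: number, cacheHitRate: number): number {
  const missShare = 1 - cacheHitRate; // tokens that miss and are written
  const hitShare = cacheHitRate;      // tokens served from the cache
  return basePrice * (missShare * 1.0 + hitShare * 0.1);
}

// Example: a 70% cache hit rate cuts input cost to 37% of the base price.
console.log(effectiveInputPrice(1.0, 0.7)); // 0.37
```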
Note: Cache pricing and features are subject to change. Check our API documentation for the most current information.