Prompt Caching

Glama supports prompt caching to reduce input costs on supported models. Here's how it works with each provider:

Monitoring

Cache effectiveness can be monitored via the analytics and logs dashboards or the /gateway/v1/completion-requests/:uuid API.
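
For example, a single request record can be fetched and inspected programmatically. A minimal sketch, assuming a https://glama.ai/api base URL, bearer-token authentication with a GLAMA_API_KEY environment variable, and a JSON record whose exact shape is not documented here:

// Fetch one completion-request record to check cache effectiveness.
// The base URL, auth scheme, and response shape are assumptions; verify
// them against the API documentation.
async function getCompletionRequest(uuid: string): Promise<unknown> {
  const res = await fetch(
    `https://glama.ai/api/gateway/v1/completion-requests/${uuid}`,
    { headers: { Authorization: `Bearer ${process.env.GLAMA_API_KEY}` } },
  );
  if (!res.ok) {
    throw new Error(`Failed to fetch completion request: ${res.status}`);
  }
  return res.json();
}

// Inspect the record for cache-related token counts (field names vary by provider).
console.log(await getCompletionRequest("REQUEST_UUID"));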

Provider-Specific Details

OpenAI

  • Automatic caching with no configuration needed (see the sketch after this list)
  • No cost for cache writes
  • Cache reads: 50% of original input price
  • Minimum prompt size: 1024 tokens

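Because caching is automatic, no request changes are needed: resending a prompt that shares a long enough prefix with a recent request is served partly from cache. A minimal sketch, assuming the gateway exposes an OpenAI-compatible endpoint at https://glama.ai/api/gateway/openai/v1 and passes through OpenAI's usage.prompt_tokens_details.cached_tokens field:

import OpenAI from "openai";

// The base URL and env var name are assumptions; see the gateway documentation.
const client = new OpenAI({
  apiKey: process.env.GLAMA_API_KEY,
  baseURL: "https://glama.ai/api/gateway/openai/v1",
});

// A long, stable prefix; it must exceed the 1024-token minimum to be cached.
const systemPrompt = "You are a support agent. " + "POLICY_DOCUMENT_TEXT ".repeat(500);

const response = await client.chat.completions.create({
  model: "gpt-4o", // any OpenAI model that supports caching
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "What is the refund policy?" },
  ],
});

// On a repeat request with the same prefix, cached_tokens reports how many
// prompt tokens were read from cache (billed at 50% of the input price).
console.log(response.usage?.prompt_tokens_details?.cached_tokens);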

Anthropic Claude

  • Requires cache_control breakpoints
  • Cache writes: 125% of original input price
  • Cache reads: 10% of original input price
  • Maximum of 4 breakpoints per request; cached content expires after 5 minutes (the TTL is refreshed each time the cache is read)
  • Best suited for large, stable text bodies (character cards, RAG data, etc.)

Example Cache Control Usage:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "LARGE_TEXT_CONTENT",
          "cache_control": { "type": "ephemeral" }
        },
        {
          "type": "text",
          "text": "Your question here"
        }
      ]
    }
  ]
}

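Anthropic reports cache activity in the response's usage object via cache_creation_input_tokens (writes) and cache_read_input_tokens (reads). Below is a minimal sketch of sending the body above and checking those counters; the base URL, model ID, and pass-through of Anthropic's usage fields are assumptions:

// Send the example body above through the gateway and inspect cache usage.
// The base URL, auth scheme, and model ID are assumptions.
const res = await fetch("https://glama.ai/api/gateway/openai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GLAMA_API_KEY}`,
  },
  body: JSON.stringify({
    model: "claude-3-5-sonnet", // placeholder; use a Claude model ID from the model list
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "LARGE_TEXT_CONTENT",
            cache_control: { type: "ephemeral" },
          },
          { type: "text", text: "Your question here" },
        ],
      },
    ],
  }),
});

const data = await res.json();
// First call: expect cache_creation_input_tokens > 0 (cache write, billed at 125%).
// Repeat call within 5 minutes: expect cache_read_input_tokens > 0 (read, billed at 10%).
console.log(data.usage);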

DeepSeek

  • Automatic caching with no configuration needed (see the sketch after this list)
  • Cache writes: same as original input price
  • Cache reads: 10% of original input price

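As with OpenAI, no request changes are needed; DeepSeek's native API reports cache activity in usage as prompt_cache_hit_tokens and prompt_cache_miss_tokens. A minimal sketch, assuming the same OpenAI-compatible base URL, a placeholder model ID, and that the gateway forwards DeepSeek's usage fields:

import OpenAI from "openai";

// The base URL and env var name are assumptions; see the gateway documentation.
const client = new OpenAI({
  apiKey: process.env.GLAMA_API_KEY,
  baseURL: "https://glama.ai/api/gateway/openai/v1",
});

const completion = await client.chat.completions.create({
  model: "deepseek-chat", // placeholder model ID
  messages: [{ role: "user", content: "A prompt with a previously seen prefix." }],
});

// On a cache hit, the reused prefix is counted in prompt_cache_hit_tokens and
// billed at 10% of the input price; misses are billed at the full input price.
console.log(completion.usage);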

Note: Cache pricing and features are subject to change. Check our API documentation for the most current information.