Prompt Caching

Glama supports prompt caching to reduce input-token costs on supported providers and models. Pricing and behavior differ by provider, as detailed below:

Monitoring

Cache effectiveness can be monitored via the analytics and logs dashboards or the /gateway/v1/completion-requests/:id API.
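
For example, a single request's record can be fetched by ID. A minimal TypeScript sketch, assuming the gateway is reachable at https://glama.ai/api and accepts a Bearer token (only the /gateway/v1/completion-requests/:id path comes from this page; the host and auth scheme are assumptions):

// Hedged sketch: look up one completion request by ID and inspect its
// usage data for cache reads/writes. The base URL and Authorization
// header are assumptions, not documented here.
const requestId = "YOUR_REQUEST_ID"; // placeholder: ID of a prior completion request

const res = await fetch(
  `https://glama.ai/api/gateway/v1/completion-requests/${requestId}`,
  { headers: { Authorization: `Bearer ${process.env.GLAMA_API_KEY}` } },
);
console.log(await res.json()); // look for cached-token counts in the usage data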

Provider-Specific Details

OpenAI

  • Automatic caching with no configuration needed (see the sketch after this list)
  • No cost for cache writes
  • Cache reads: 50% of original input price
  • Minimum prompt size: 1024 tokens
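
As a sketch of what automatic caching looks like in practice: send the same long prompt twice and compare usage. The route, the model identifier, and the usage.prompt_tokens_details.cached_tokens field (OpenAI's own response shape) are assumptions about how the gateway surfaces things, not documented on this page.

// Hedged sketch: with a prompt prefix of at least 1024 tokens, a repeat
// request may have part of its input served from cache at the 50% rate.
// Route, model name, and usage field names are assumptions.
const LONG_SYSTEM_PROMPT = "STABLE_INSTRUCTIONS"; // placeholder: must be >= 1024 tokens to be cacheable

const res = await fetch("https://glama.ai/api/gateway/openai/v1/chat/completions", { // hypothetical route
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.GLAMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-4o", // hypothetical model identifier
    messages: [
      { role: "system", content: LONG_SYSTEM_PROMPT },
      { role: "user", content: "Your question here" },
    ],
  }),
});
const data = await res.json();
// On a repeat request with the same prefix, OpenAI-style responses report reused tokens here:
console.log(data.usage?.prompt_tokens_details?.cached_tokens);

Any tokens reported as cached on the second call are billed at the 50% cache-read rate; there is no extra charge for the initial cache write.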

Anthropic Claude

  • Requires cache_control breakpoints
  • Cache writes: 125% of original input price
  • Cache reads: 10% of original input price
  • Up to 4 breakpoints per request; cached content expires after 5 minutes (the timer resets on each cache read)
  • Best for large text bodies (character cards, RAG data, etc.)

Example Cache Control Usage (messages fragment of a request body; model and other required fields omitted):

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "LARGE_TEXT_CONTENT",
          "cache_control": { "type": "ephemeral" }
        },
        { "type": "text", "text": "Your question here" }
      ]
    }
  ]
}
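
Everything in the prompt up to and including the block marked with cache_control is written to the cache, so place the breakpoint after the large, stable content and keep the changing question outside it. A minimal sketch for sending this body and confirming the breakpoint worked, assuming an Anthropic-style messages route on the gateway and pass-through of Anthropic's usage counters (the route, model identifier, and pass-through are all assumptions):

// Hedged sketch: POST the body above, then inspect Anthropic-style
// cache counters. cache_creation_input_tokens = writes (125% rate),
// cache_read_input_tokens = reads (10% rate). The URL and model name
// are hypothetical examples, not documented Glama values.
const requestBody = {
  model: "anthropic/claude-3.5-sonnet", // hypothetical model identifier
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "LARGE_TEXT_CONTENT",
          cache_control: { type: "ephemeral" },
        },
        { type: "text", text: "Your question here" },
      ],
    },
  ],
};

const res = await fetch("https://glama.ai/api/gateway/anthropic/v1/messages", { // hypothetical route
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.GLAMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(requestBody),
});
const data = await res.json();
console.log(data.usage?.cache_creation_input_tokens, data.usage?.cache_read_input_tokens);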

DeepSeek

  • Automatic caching with no configuration needed (see the sketch below)
  • Cache writes: Same as original input price
  • Cache reads: 10% of original input price
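
Because caching here is automatic, the only observable difference is in usage accounting. DeepSeek's API reports prompt_cache_hit_tokens and prompt_cache_miss_tokens in the usage object; a hedged sketch, assuming those fields pass through the gateway unchanged (the route and model identifier below are illustrative assumptions):

// Hedged sketch: issue the same prompt twice; on the repeat call,
// usage.prompt_cache_hit_tokens should be non-zero if the prefix was
// cached. Field names follow DeepSeek's API; pass-through via the
// gateway is an assumption.
async function askOnce(question: string) {
  const res = await fetch("https://glama.ai/api/gateway/openai/v1/chat/completions", { // hypothetical route
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GLAMA_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "deepseek/deepseek-chat", // hypothetical model identifier
      messages: [{ role: "user", content: question }],
    }),
  });
  const data = await res.json();
  console.log(data.usage?.prompt_cache_hit_tokens, data.usage?.prompt_cache_miss_tokens);
}

await askOnce("SAME_LONG_PROMPT");
await askOnce("SAME_LONG_PROMPT"); // repeat call may report cache hits

Hit tokens are billed at the 10% cache-read rate; miss tokens are billed at the full input price.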

Note: Cache pricing and features are subject to change. Check our API documentation for the most current information.