Glama
Ownership verified

Server Details

The world's first named AI prompt quality score. Score, optimize, and compare LLM prompts before they hit any model. Free tier available. Built on PEEM, RAGAS, G-Eval, and MT-Bench frameworks. x402-native on Base.

Status: Healthy
Transport: Streamable HTTP

Glama MCP Gateway

Connect through Glama MCP Gateway for full control over tool access and complete visibility into every call.

MCP client → Glama → MCP server

Full call logging

Every tool call is logged with complete inputs and outputs, so you can debug issues and audit what your agents are doing.

Tool access control

Enable or disable individual tools per connector, so you decide what your agents can and cannot do.

Managed credentials

Glama handles OAuth flows, token storage, and automatic rotation, so credentials never expire on your clients.

Usage analytics

See which tools your agents call, how often, and when, so you can understand usage patterns and catch anomalies.

100% free. Your data is private.

Tool Definition Quality

Score is being calculated. Check back soon.

Available Tools

3 tools
compare_models

Compare Claude vs GPT-4o on the same prompt. Scored head-to-head by a third model judge. Returns winner, scores, and recommendation. Costs $0.50 USDC via x402.

Parameters (JSON Schema)
prompt (required): The prompt to compare
api_key (required): PQS API key. Get one at pqs.onchainintel.net
vertical (optional): Domain context. Defaults to general.
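The $0.50 USDC charge follows the x402 pattern: an unpaid request is answered with HTTP 402 and a price quote, after which the client settles in USDC on Base and retries. A minimal client-side sketch of that decision, with the settlement step left as a hypothetical stub and the response fields simplified (exact headers and quote fields vary by server):

```python
# Sketch of an x402-aware call wrapper. Assumption: the server answers an
# unpaid request with HTTP 402 and a quoted price; settlement is stubbed out.
from dataclasses import dataclass

@dataclass
class Response:
    status: int
    price_usdc: float = 0.0  # price quoted by the 402 response, if any

def call_with_budget(send, budget_usdc: float) -> Response:
    """send(paid=...) performs one request; retry once after paying, if affordable."""
    resp = send(paid=False)
    if resp.status != 402:
        return resp                      # free call, or already authorized
    if resp.price_usdc > budget_usdc:
        raise RuntimeError(f"quoted ${resp.price_usdc} exceeds budget")
    # settle_payment(resp) would go here -- hypothetical, chain-specific
    return send(paid=True)               # retry with payment attached
```

With compare_models quoting $0.50, a budget of $1.00 lets the retry proceed, while a $0.10 budget aborts before any payment is made.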
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes the process (comparison with a third-model judge), output (winner, scores, recommendation), and a critical behavioral trait (cost of $0.50 USDC via x402). However, it does not cover other potential behaviors like rate limits, error handling, or execution time.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is highly concise and front-loaded, with every sentence adding value: it explains the comparison process, the output, and the cost. There is no wasted text, and the structure efficiently communicates essential information in a compact form.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of a comparative AI evaluation tool with no annotations and no output schema, the description does a good job by covering the process, output, and cost. However, it lacks details on the output format (e.g., structure of scores, what 'recommendation' entails) and any prerequisites beyond the api_key, leaving some gaps for full contextual understanding.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters thoroughly. The description does not add any additional meaning or context beyond what the schema provides for the parameters, such as explaining the significance of the 'vertical' enum choices or how the 'api_key' is used. Baseline 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('compare Claude vs GPT-4o on the same prompt'), the resource (the two AI models), and the mechanism ('scored head-to-head by a third model judge'). It distinguishes itself from sibling tools like 'optimize_prompt' and 'score_prompt' by focusing on comparative evaluation rather than optimization or scoring of a single prompt.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for comparing two specific AI models, but does not explicitly state when to use this tool versus alternatives like 'score_prompt' or 'optimize_prompt'. It mentions the cost, which provides some context, but lacks explicit guidance on scenarios where this comparison is preferred over other methods.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

optimize_prompt

Score AND optimize any LLM prompt using PQS. Returns score, optimized prompt, and 8-dimension breakdown based on PEEM, RAGAS, G-Eval, MT-Bench. Costs $0.025 USDC via x402.

Parameters (JSON Schema)
prompt (required): The prompt to optimize
api_key (required): PQS API key. Get one at pqs.onchainintel.net
vertical (optional): Domain context. Defaults to general.
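The three tools differ mainly in price and capability: score_prompt is free, optimize_prompt costs $0.025, and compare_models costs $0.50 per call. An agent can use those stated prices to pick the cheapest tool that covers what it needs; a sketch, with the capability sets inferred from the tool descriptions on this page:

```python
# Pick the cheapest listed tool whose capabilities cover the request.
# Prices (USDC per call) and capabilities come from the tool descriptions above.
TOOLS = {
    "score_prompt":    {"cost": 0.0,   "does": {"score"}},
    "optimize_prompt": {"cost": 0.025, "does": {"score", "optimize"}},
    "compare_models":  {"cost": 0.50,  "does": {"compare"}},
}

def cheapest_tool(needs: set, budget: float):
    """Return the cheapest tool covering `needs` within `budget`, else None."""
    candidates = [
        (meta["cost"], name)
        for name, meta in TOOLS.items()
        if needs <= meta["does"] and meta["cost"] <= budget
    ]
    return min(candidates)[1] if candidates else None
```

For example, `cheapest_tool({"score"}, 1.0)` picks the free score_prompt even though optimize_prompt also scores, and `cheapest_tool({"optimize"}, 0.01)` returns None because optimize_prompt's $0.025 price exceeds the budget.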
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively adds context beyond the input schema by specifying the return values ('score, optimized prompt, and 8-dimension breakdown'), cost details ('Costs $0.025 USDC via x402'), and the frameworks used ('based on PEEM, RAGAS, G-Eval, MT-Bench'). However, it lacks information on rate limits, error handling, or authentication requirements beyond the API key.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is highly concise and front-loaded, packing essential information into two sentences: the core functionality and the cost. Every sentence earns its place by providing critical details without redundancy, making it efficient for an AI agent to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (scoring and optimization with multiple frameworks) and the absence of annotations and output schema, the description does a good job of covering key aspects like return values and cost. However, it could improve by mentioning potential limitations or the format of the 8-dimension breakdown, leaving some gaps in completeness.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters thoroughly. The description does not add any meaningful semantic details beyond what the schema provides (e.g., it doesn't explain the significance of the 'vertical' parameter or provide examples). Baseline 3 is appropriate as the schema handles parameter documentation adequately.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('score AND optimize') and resource ('any LLM prompt using PQS'), distinguishing it from sibling tools like 'score_prompt' (which only scores) and 'compare_models' (which compares models rather than optimizing prompts). It explicitly mentions the dual functionality of scoring and optimization.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context by mentioning 'any LLM prompt' and the cost, but it does not explicitly state when to use this tool versus alternatives like 'score_prompt' or 'compare_models'. No exclusions or prerequisites are provided, leaving the agent to infer usage based on the need for both scoring and optimization.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

score_prompt

Score any LLM prompt for quality using PQS. Returns a grade (A-F), score out of 40, and percentile. Free — no payment required. Use before sending any prompt to an LLM.

Parameters (JSON Schema)
prompt (required): The prompt to score
vertical (optional): Domain context for scoring. Defaults to general.
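Because score_prompt is free and needs no api_key, it is a natural first call against this server. A minimal sketch of the JSON-RPC request body an MCP client would send over the Streamable HTTP transport, using the standard MCP `tools/call` method (the example prompt text is illustrative):

```python
import json

# JSON-RPC 2.0 envelope for an MCP tools/call request targeting score_prompt.
# The shape follows the MCP spec; the prompt text here is just an example.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "score_prompt",
        "arguments": {
            "prompt": "Summarize this quarterly report in three bullets.",
            "vertical": "general",  # optional; defaults to general
        },
    },
}
body = json.dumps(request)
```

The server's result would carry the grade (A-F), score out of 40, and percentile described above.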
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions the tool is 'free — no payment required' (useful context) and describes the output format (grade, score, percentile), but lacks details on rate limits, error handling, or performance characteristics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, with three concise sentences that each add value: stating the purpose, describing the output, and providing usage guidance with no wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no output schema, no annotations), the description is reasonably complete—covering purpose, output format, and usage context—though it could benefit from more behavioral details like error cases or limitations.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents both parameters thoroughly. The description adds no additional parameter semantics beyond what the schema provides, maintaining the baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('score any LLM prompt for quality using PQS') and resources ('prompt'), and distinguishes it from siblings by specifying its unique function of quality scoring rather than comparison or optimization.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context for when to use the tool ('use before sending any prompt to an LLM'), but does not explicitly mention when not to use it or name alternatives like sibling tools compare_models or optimize_prompt.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
