# Workflow Evaluation System
Tests AI agents performing multi-turn conversations with Apify MCP tools, evaluated by an LLM judge.
---
## Quick Start
**Prerequisites:**
- Node.js installed
- Apify account with API token
- OpenRouter API key
**Run evaluations:**
```bash
# 1. Set environment variables
export APIFY_TOKEN="your_apify_token"
export OPENROUTER_API_KEY="your_openrouter_key"
# 2. Build the MCP server
npm run build
# 3. Run tests
npm run evals:workflow
```
**Common options:**
```bash
# Filter by category
npm run evals:workflow -- --category search
# Run specific test
npm run evals:workflow -- --id search-google-maps
# Show detailed conversation logs
npm run evals:workflow -- --verbose
# Increase timeout for long-running Actors (default: 60s)
npm run evals:workflow -- --tool-timeout 300
# Run tests in parallel (default: 4)
npm run evals:workflow -- --concurrency 8
# Save results to JSON file
npm run evals:workflow -- --output
```
**Exit codes:**
- `0` = All tests passed ✅
- `1` = Any test failed or error occurred ❌
---
## Technical Overview
The harness runs an AI agent through multi-turn conversations in which it executes tasks with Apify MCP server tools; an LLM judge then evaluates each conversation against the test's requirements.
**Core features:**
- Multi-turn conversations with tool calling
- Dynamic tool discovery during execution
- MCP server instructions automatically added to agent system prompt
- LLM-based evaluation against requirements
- Isolated MCP server per test
- Configurable tool call timeout (default: 60 seconds)
- Strict pass/fail (all tests must pass)
## Critical Design Decisions
### 1. MCP Server Isolation Per Test
**Decision:** Each test gets a fresh MCP server instance.
**Why:**
- Tools like `call-actor` create persistent state (datasets, runs) on Apify platform
- State from one test can contaminate subsequent tests
- Each test must start with clean state
**Implementation:**
```typescript
for (const test of tests) {
    const mcpClient = new McpClient();
    try {
        await mcpClient.start(apifyToken);
        // Run test
    } finally {
        await mcpClient.cleanup(); // Always cleanup
    }
}
```
**Trade-off:** ~20-30% slower (1-2s spawn overhead per test) but guarantees isolation.
**Location:** `run-workflow-evals.ts`
### 2. Dynamic Tool Fetching Per Turn
**Decision:** Refresh tools from MCP server after each conversation turn.
**Why:**
- MCP server supports dynamic tool registration at runtime
- `add-actor` tool can register new Actor tools mid-conversation
- LLM must see updated tool list to use new tools
**Implementation:**
```typescript
while (turnNumber < maxTurns) {
    // Call LLM with current tools
    const llmResponse = await llmClient.callLlm(messages, model, tools);

    // Execute tool calls
    for (const toolCall of llmResponse.toolCalls) {
        await mcpClient.callTool(toolCall);
    }

    // Refresh tools for next turn
    tools = mcpToolsToOpenAiTools(mcpClient.getTools());
    turnNumber += 1;
}
```
**Trade-off:** ~10-15% slower (100-200ms per turn) but supports dynamic workflows.
**Location:** `conversation-executor.ts`
### 3. Strict Pass/Fail (No Threshold)
**Decision:** ALL tests must pass for exit code 0. Any failure = exit code 1.
**Why:**
- Clear CI/CD signal
- No ambiguity about which tests are critical
- Quality bar: all functionality must work
**Exit codes:**
- `0`: ALL tests passed
- `1`: ANY test failed or error occurred
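A minimal sketch of the implied exit logic (type and function names here are illustrative, not the actual code):
```typescript
interface TestOutcome {
    verdict: 'PASS' | 'FAIL';
    error: string | null;
}

// Exit 0 only if every test passed and none errored; otherwise exit 1.
function exitCodeFor(results: TestOutcome[]): number {
    const anyFailure = results.some((r) => r.verdict !== 'PASS' || r.error !== null);
    return anyFailure ? 1 : 0;
}
```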
**Location:** `run-workflow-evals.ts`
### 4. Judge Sees Tool Calls, Not Results
**Decision:** Judge sees tool calls with arguments and agent responses, but NOT raw tool results.
**Why:**
- Evaluates agent behavior (tool selection, arguments)
- Tool results are often very long and noisy
- The agent should summarize results; the judge evaluates that summary
**Judge input format:**
```
USER: Find actors for Google Maps
AGENT: [Called tool: search-actors with args: {"keywords":"google maps","limit":5}]
AGENT: I found 5 actors: 1. Google Maps Scraper... 2. ...
```
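A hedged sketch of how such a transcript could be assembled from the OpenAI-style message history (function and field names are assumptions, not the actual `workflow-judge.ts` code):
```typescript
interface ChatMessage {
    role: 'system' | 'user' | 'assistant' | 'tool';
    content?: string;
    tool_calls?: { function: { name: string; arguments: string } }[];
}

// Build the judge transcript: keep user text, tool calls with arguments,
// and assistant text, but drop raw tool results entirely.
function buildJudgeTranscript(messages: ChatMessage[]): string {
    const lines: string[] = [];
    for (const msg of messages) {
        if (msg.role === 'user' && msg.content) {
            lines.push(`USER: ${msg.content}`);
        } else if (msg.role === 'assistant') {
            for (const call of msg.tool_calls ?? []) {
                lines.push(`AGENT: [Called tool: ${call.function.name} with args: ${call.function.arguments}]`);
            }
            if (msg.content) lines.push(`AGENT: ${msg.content}`);
        }
        // 'system' prompts and raw 'tool' results are intentionally omitted.
    }
    return lines.join('\n');
}
```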
**Location:** `workflow-judge.ts`
### 5. LLM Client Shared, MCP Client Isolated
**Decision:** One LLM client shared across tests, MCP client isolated per test.
**Why:**
- LLM client is stateless (OpenRouter/OpenAI SDK)
- No cross-test contamination risk
- Saves initialization overhead
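A hedged sketch of the split, reusing the shapes from the earlier snippets (`LlmClient` and `runTest` are assumed names):
```typescript
// One stateless LLM client is created once for the whole run…
const llmClient = new LlmClient(process.env.OPENROUTER_API_KEY!);

// …while each test gets its own MCP client and server process.
for (const test of tests) {
    const mcpClient = new McpClient();
    try {
        await mcpClient.start(process.env.APIFY_TOKEN!);
        await runTest(test, { mcpClient, llmClient });
    } finally {
        await mcpClient.cleanup();
    }
}
```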
**Location:** `run-workflow-evals.ts`
### 6. Agent vs Judge Models
**Agent:** `anthropic/claude-haiku-4.5` (fast, good at tools)
**Judge:** `x-ai/grok-4.1-fast` (strong reasoning)
Separation allows independent optimization for speed vs evaluation quality.
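A hedged sketch of how the defaults might be declared in `config.ts` (constant names are assumptions; both can be overridden with `--agent-model` / `--judge-model`):
```typescript
// Default models; the CLI flags --agent-model and --judge-model override these.
export const DEFAULT_AGENT_MODEL = 'anthropic/claude-haiku-4.5';
export const DEFAULT_JUDGE_MODEL = 'x-ai/grok-4.1-fast';
```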
**Location:** `config.ts`
### 7. MCP Server Instructions in System Prompt
**Decision:** Automatically append MCP server instructions to agent system prompt.
**Why:**
- MCP servers can provide usage guidelines via the `instructions` field in the initialize response
- Instructions contain important context about tool dependencies and disambiguation
- Agents perform better when they understand tool relationships (e.g., `call-actor` requires two steps)
- Avoids duplicating server instructions in our agent prompt
**Implementation:**
```typescript
// Retrieve instructions after connecting to MCP server
await mcpClient.start(apifyToken);
const serverInstructions = mcpClient.getInstructions();

// Append to agent system prompt
const conversation = await executeConversation({
    userPrompt: testCase.query,
    mcpClient,
    llmClient,
    serverInstructions, // Automatically appended to system prompt
});
```
**Instructions content:**
- Actor concepts and execution workflow
- Tool dependencies (e.g., `call-actor` two-step process)
- Tool disambiguation (e.g., `search-actors` vs `apify/rag-web-browser`)
- Storage types (datasets vs key-value stores)
**Location:** `mcp-client.ts`, `conversation-executor.ts`
## System Components
### Core Files
- `types.ts` - Type definitions
- `config.ts` - Models, prompts, constants
- `mcp-client.ts` - MCP server wrapper (spawn, connect, call, retrieve instructions)
- `llm-client.ts` - OpenRouter wrapper
- `convert-mcp-tools.ts` - MCP → OpenAI tool format
- `conversation-executor.ts` - Multi-turn loop with dynamic tools and server instructions
- `workflow-judge.ts` - Judge evaluation
- `test-cases-loader.ts` - Load/filter test cases
- `output-formatter.ts` - Results formatting
- `run-workflow-evals.ts` - Main CLI entry
## Configuration
### Environment Variables (Required)
```bash
export APIFY_TOKEN="your_apify_token" # Get from https://console.apify.com/account/integrations
export OPENROUTER_API_KEY="your_openrouter_key" # Get from https://openrouter.ai/keys
```
### CLI Options
| Option | Alias | Description | Default |
|--------|-------|-------------|---------|
| `--category <name>` | | Filter tests by category | All categories |
| `--id <id>` | | Run specific test by ID | All tests |
| `--verbose` | | Show detailed conversation logs | `false` |
| `--test-cases-path <path>` | | Custom test cases file path | `test-cases.json` |
| `--agent-model <model>` | | Override agent model | `anthropic/claude-haiku-4.5` |
| `--judge-model <model>` | | Override judge model | `x-ai/grok-4.1-fast` |
| `--tool-timeout <seconds>` | | Tool call timeout | `60` |
| `--concurrency <number>` | `-c` | Number of tests to run in parallel | `4` |
| `--output` | `-o` | Save results to JSON file | `false` |
| `--help` | | Show help message | - |
### Concurrency
The `--concurrency` (or `-c`) option controls how many tests run in parallel.
**Concurrency recommendations:**
- **Default (4)**: Balanced performance for most systems
- **8-12**: High-performance systems with good network bandwidth
- **1**: Debug mode, run tests sequentially
- **Higher values**: May hit API rate limits or resource constraints
**Example:**
```bash
# Run 8 tests in parallel
npm run evals:workflow -- --concurrency 8
npm run evals:workflow -- -c 8
```
**Note:** Each test spawns its own MCP server instance, so higher concurrency uses more system resources.
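A minimal sketch of a concurrency-limited runner, a generic worker-pool pattern rather than the exact code in `run-workflow-evals.ts`:
```typescript
// Run tasks with at most `limit` in flight at any time.
async function runWithConcurrency<T>(
    tasks: (() => Promise<T>)[],
    limit: number,
): Promise<T[]> {
    const results: T[] = new Array(tasks.length);
    let next = 0;

    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    }

    // Spawn `limit` workers that pull tasks until the queue is empty.
    await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
    return results;
}

// Usage sketch: runWithConcurrency(tests.map((t) => () => runTest(t)), 4);
```
Each worker repeatedly pulls the next pending task, so at most `limit` tests (and MCP server processes) are alive at any moment.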
### Tool Timeout
The `--tool-timeout` option sets the maximum time (in seconds) to wait for a single tool call to complete.
**When a tool times out:**
- Error returned: `"MCP error -32001: Request timed out"`
- The LLM receives this error and can decide how to proceed
**Timeout recommendations:**
- **Default (60s)**: Suitable for most tools (search, fetch details)
- **300s (5 min)**: For Actor calls that scrape moderate amounts of data
- **600s (10 min)**: For large-scale scraping operations
- **1s (testing)**: Use for testing timeout behavior
**Example:**
```bash
# Long-running Actor calls
npm run evals:workflow -- --tool-timeout 300
```
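One plausible way the timeout is forwarded, assuming the TypeScript MCP SDK's per-request `timeout` option (in milliseconds); the actual plumbing in `mcp-client.ts` may differ:
```typescript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';

// Sketch: convert the CLI value (seconds) into the SDK's per-request timeout (ms).
async function callToolWithTimeout(
    client: Client,
    name: string,
    args: Record<string, unknown>,
    toolTimeoutSeconds: number,
) {
    return client.callTool(
        { name, arguments: args },
        undefined, // use the default result schema
        { timeout: toolTimeoutSeconds * 1000 },
    );
}
```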
### Saving Results to File
The `--output` (or `-o`) option saves test results to `evals/workflows/results.json` for tracking over time.
**How it works:**
- Results are stored per combination of: `agentModel:judgeModel:testId`
- Running the same test with the same models **overwrites** the previous result
- Running with different model combinations **adds** new entries
- Results are **versioned in git** for historical tracking
**Data structure:**
```json
{
"version": "1.0",
"results": {
"anthropic/claude-haiku-4.5:x-ai/grok-4.1-fast:search-google-maps": {
"timestamp": "2026-01-07T10:45:23.123Z",
"agentModel": "anthropic/claude-haiku-4.5",
"judgeModel": "x-ai/grok-4.1-fast",
"testId": "search-google-maps",
"verdict": "PASS",
"reason": "Agent successfully searched for Google Maps actors",
"durationMs": 5234,
"turns": 3,
"error": null
}
}
}
```
**Each result contains:**
- `timestamp` - ISO timestamp when test was run
- `agentModel` - LLM model used for the agent
- `judgeModel` - LLM model used for judging
- `testId` - Test case identifier
- `verdict` - `PASS` or `FAIL`
- `reason` - Judge reasoning or error message
- `durationMs` - Test duration in milliseconds
- `turns` - Number of conversation turns
- `error` - Error message if execution failed, `null` otherwise
**Examples:**
```bash
# Basic usage - save all test results
npm run evals:workflow -- --output
npm run evals:workflow -- -o
# Save results for specific category
npm run evals:workflow -- --category search --output
# Compare different agent models
npm run evals:workflow -- --agent-model anthropic/claude-haiku-4.5 --output
npm run evals:workflow -- --agent-model openai/gpt-4o --output
# Results file now contains entries for both models
# Compare different judge models
npm run evals:workflow -- --judge-model x-ai/grok-4.1-fast --output
npm run evals:workflow -- --judge-model openai/gpt-4o --output
```
**Partial runs:**
When using filters (`--category`, `--id`), only the filtered tests are updated in the results file. Other entries remain unchanged.
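A minimal sketch of that upsert behavior (the file layout follows the JSON example above; function and type names are illustrative):
```typescript
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

interface StoredResult {
    timestamp: string;
    agentModel: string;
    judgeModel: string;
    testId: string;
    verdict: 'PASS' | 'FAIL';
    reason: string;
    durationMs: number;
    turns: number;
    error: string | null;
}

interface ResultsFile {
    version: string;
    results: Record<string, StoredResult>;
}

// Upsert one result keyed by agentModel:judgeModel:testId; other entries stay untouched.
function saveResult(path: string, result: StoredResult): void {
    const file: ResultsFile = existsSync(path)
        ? JSON.parse(readFileSync(path, 'utf-8'))
        : { version: '1.0', results: {} };
    const key = `${result.agentModel}:${result.judgeModel}:${result.testId}`;
    file.results[key] = result;
    writeFileSync(path, JSON.stringify(file, null, 2));
}
```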
**Version control:**
The `results.json` file is tracked in git, allowing you to:
- See result changes over time in commits
- Compare results across branches
- Track performance regressions in PRs
### Test Case Format
File: `test-cases.json`
```json
[
{
"id": "test-001",
"category": "basic",
"prompt": "User prompt for agent",
"requirements": "What agent must do to pass",
"maxTurns": 10,
"tools": ["actors", "docs"]
}
]
```
**Required fields:**
- `id` - Unique identifier
- `category` - For filtering
- `prompt` - User request
- `requirements` - Success criteria for judge
**Optional:**
- `maxTurns` - Override default (10)
- `tools` - List of tools to enable for this test (e.g., `["actors", "docs", "apify/rag-web-browser"]`). If omitted, all default tools are enabled. Passed to MCP server as `--tools` argument.
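A hedged sketch of the corresponding type; the authoritative definition lives in `types.ts` and may differ in detail:
```typescript
interface WorkflowTestCase {
    id: string;            // Unique identifier
    category: string;      // Used by --category filtering
    prompt: string;        // User request given to the agent
    requirements: string;  // Success criteria passed to the judge
    maxTurns?: number;     // Defaults to 10 when omitted
    tools?: string[];      // Forwarded to the MCP server as --tools; all defaults if omitted
}
```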
## Performance
**Per test overhead:**
- MCP spawn: ~1-2s
- Tool refresh/turn: ~100-200ms
- LLM call/turn: ~1-5s
- Judge evaluation: ~2-4s
**5 tests (2-3 turns each):** ~45s total
**vs shared MCP (previous):** ~37s (18% faster but unsafe)
Trade-off: the slower execution is an acceptable price for correctness and test isolation.
## Key Insights
### MCP Tools Are Stateful
Unlike typical function calling, MCP tools:
- Create persistent state (datasets, runs) on Apify platform
- Can modify tool registry dynamically
- Have side effects affecting subsequent calls
**Implication:** Test isolation critical.
### Dynamic Tool Registration
- `add-actor` dynamically registers new Actor tools
- Tool list NOT static
- Must refresh after tool execution
**Implication:** Cannot cache tools at conversation start.
### Error Propagation
Tool errors passed to LLM in tool result message:
- LLM can retry, use different tool, or explain to user
- No automatic retry by system
**Rationale:** LLM should handle errors intelligently.
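A minimal sketch of how a failed call can be fed back to the LLM as an ordinary tool message (shape as in the conversation-state example below; the exact error wording is an assumption):
```typescript
// Sketch: on failure, the error text becomes the tool result so the LLM can react.
let content: string;
try {
    const result = await mcpClient.callTool(toolCall);
    content = JSON.stringify(result);
} catch (err) {
    content = `Tool call failed: ${err instanceof Error ? err.message : String(err)}`;
}
messages.push({ role: 'tool', tool_call_id: toolCall.id, content });
```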
### Conversation State
OpenAI-compatible message history maintained:
```typescript
[
    { role: 'system', content: '...' },
    { role: 'user', content: '...' },
    { role: 'assistant', tool_calls: [...] },
    { role: 'tool', tool_call_id: '...', content: '...' },
    { role: 'assistant', content: '...' }
]
```
The format must match the OpenAI chat schema exactly so the LLM can follow the conversation context.
## Common Issues
### Tests interfere with each other
**Symptom:** Test 2 fails when run after Test 1 but passes when run alone.
**Solution:** ✅ Isolated MCP instances per test.
### LLM can't use newly added tool
**Symptom:** Agent uses `add-actor` but can't call new tool.
**Solution:** ✅ Dynamic tool fetching per turn.
### Judge too strict/lenient
**Symptom:** Incorrect verdicts.
**Solution:** Tune `JUDGE_PROMPT_TEMPLATE` in `config.ts`.
### Tests timeout (hit maxTurns)
**Symptom:** Conversations don't complete.
**Solutions:**
- Review agent system prompt
- Check tool results are helpful
- Reduce `maxTurns` to fail faster
- Try different LLM model
## Future Enhancements
### Three-LLM Conversational Approach
**Concept:** A more realistic simulation of MCP usage through a chat interface.
**Architecture:**
1. **User LLM** - Given a goal, prompts the MCP Server LLM to accomplish tasks
2. **MCP Server LLM** - Receives prompts from User LLM, uses MCP tools to fulfill requests
3. **Judge LLM** - Evaluates the entire conversation for correctness
**Benefits:**
- Simulates real-world chat interface usage pattern
- Tests natural language interaction between user and MCP-enabled assistant
- More realistic conversation flow with back-and-forth dialogue
- Better evaluation of how users would actually interact with MCP tools
**Current approach vs Future:**
- **Current:** Single LLM directly given task → uses tools → judge evaluates
- **Future:** User LLM with goal → prompts Server LLM → Server LLM uses tools → judge evaluates
**Status:** Current two-LLM approach (agent + judge) is sufficient for validating tool functionality and basic workflows. The three-LLM approach would be valuable for testing conversational UX and more complex multi-turn interactions.
## References
- [MCP Protocol Spec](https://modelcontextprotocol.io/)
- [OpenAI Tool Calling](https://platform.openai.com/docs/guides/function-calling)
- [Apify API](https://docs.apify.com/api/v2)
- [OpenRouter](https://openrouter.ai/)