# Evaluate
Want to evaluate your LLM? We provide an evaluation framework. Follow the steps below to benchmark your large language model on Stata-related tasks.
> Reference: Tan, S., & Feng, M. (2025). How to use StataMCP to improve your social science research? Shanghai Bayes Views Information Technology Co., Ltd.
BibTeX entry:
```bibtex
@techreport{tan2025stataMCP,
  author      = {Tan, Song and Feng, Muyao},
  title       = {Stata-MCP: A research report on AI-assisted empirical research},
  year        = {2025},
  month       = {September},
  day         = {21},
  language    = {English},
  address     = {Shanghai, China},
  institution = {Shanghai Bayes Views Information Technology Co., Ltd.},
  url         = {https://www.statamcp.com/reports/2025/09/21/stata_mcp_a_research_report_on_ai_assisted_empirical_research}
}
```
## Step 1: Set your environment
Set your API key, base URL, and model names as environment variables:
```bash
export OPENAI_API_KEY=<your-api-key>
export OPENAI_BASE_URL=https://api.openai.com/v1
export OPENAI_MODEL=gpt-3.5-turbo
export CHAT_MODEL=gpt-3.5-turbo
export THINKING_MODEL=gpt-5
# For DeepSeek models (alternative)
export DEEPSEEK_API_KEY=<your-deepseek-api-key>
export DEEPSEEK_BASE_URL=<your-deepseek-base-url>
```
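If you prefer to keep credentials out of your scripts, you can read these variables back at runtime and pass them to `AgentRunner` in Step 2. A minimal sketch using only the standard library; the variable names are the ones exported above:
```python
import os

# Read the credentials and model names exported in Step 1.
api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
chat_model = os.getenv("CHAT_MODEL", os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"))

if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; see Step 1")
```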
## Step 2: Run your evaluation task with AgentRunner
We provide a convenient `AgentRunner` class to help you execute tasks and extract results. The AgentRunner supports OpenAI-compatible APIs and can process Stata-related tasks automatically.
### Option A: Using AgentRunner (Recommended)
```python
from stata_mcp.evaluate import AgentRunner, ScoreModel
# Define your evaluation task
YOUR_TASK: str = ...
GIVEN_ANSWER: str = ...
# Initialize and run AgentRunner
runner = AgentRunner(
    model="gpt-3.5-turbo",  # or "deepseek-chat" for DeepSeek models
    api_key="your-api-key",
    base_url="https://api.openai.com/v1"  # or your DeepSeek base URL
)
# Execute the task
result = runner.run(YOUR_TASK)
# Extract conversation history and final answer
HIST_MSG = AgentRunner.get_processer(result)
FINAL_ANSWER = AgentRunner.get_final_result(result)
print(f"Conversation has {len(HIST_MSG)} items")
print(f"Final answer: {FINAL_ANSWER}")
```
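If you want to inspect the intermediate steps before scoring, each item in the conversation history is a role/content pair (the same structure the scoring step in Step 3 relies on). A quick way to preview it, assuming that dict layout:
```python
# Preview each turn: index, role (user / assistant / tool), and a content snippet.
for i, item in enumerate(HIST_MSG):
    snippet = str(item["content"])[:80]
    print(f"{i:02d} [{item['role']}] {snippet}")
```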
### Option B: Manual Agent Setup
```python
# If you prefer to set up the agent manually with the OpenAI Agents SDK
from agents import Agent, Runner

agent = Agent(
    name="Stata Analyst",  # the Agents SDK requires a name; this one is illustrative
    instructions="You are a helpful assistant specialized in Stata analysis.",
    model="gpt-3.5-turbo"
)

# Runner.run_sync executes the agent synchronously; the API key is read from
# the OPENAI_API_KEY environment variable set in Step 1.
result = Runner.run_sync(agent, YOUR_TASK)
print(result.final_output)
# Then extract the conversation history and final answer manually as needed
```
## Step 3: Evaluate with ScoreModel
Once you have the task results, use `ScoreModel` to evaluate the performance:
```python
from stata_mcp.evaluate import ScoreModel
# Convert conversation history to string format (required by ScoreModel)
hist_msg_str = "\n".join([
    f"{item['role']}: {item['content']}"
    for item in HIST_MSG
])
sm = ScoreModel(
    task=YOUR_TASK,
    reference_answer=GIVEN_ANSWER,
    processer=hist_msg_str,  # supports the string format built from the conversation history
    results=FINAL_ANSWER,
    task_id="eval_001"  # optional: set a unique ID for tracking
)
# Get the evaluation score
score = sm.score_it()
print(f"Evaluation Score: {score}")
# The ScoreModel evaluates:
# - Task completion accuracy
# - Quality of analysis
# - Statistical correctness
# - Clarity of explanation
```
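If you want to keep a record of individual runs, one option is to write the score together with its `task_id` to disk. A sketch using the standard library; the file name is arbitrary and the score is stored however `score_it()` returns it:
```python
import json

# Persist the evaluation so runs can be compared later.
record = {"task_id": "eval_001", "task": YOUR_TASK, "score": score}
with open("eval_001.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2, default=str)
```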
## Advanced Usage
### Batch Evaluation
For evaluating multiple tasks:
```python
tasks = [
    {
        "task": "Analyze the relationship between education and income using census data",
        "reference": "Expected analysis includes correlation, regression, and policy implications"
    },
    {
        "task": "Conduct a difference-in-differences analysis of a policy intervention",
        "reference": "Should include pre/post comparison, control group, and statistical significance"
    }
]

runner = AgentRunner(model="gpt-3.5-turbo", api_key="your-api-key")
results = []

for i, task_data in enumerate(tasks):
    result = runner.run(task_data["task"])
    hist_msg = AgentRunner.get_processer(result)
    final_answer = AgentRunner.get_final_result(result)

    sm = ScoreModel(
        task=task_data["task"],
        reference_answer=task_data["reference"],
        processer="\n".join([f"{item['role']}: {item['content']}" for item in hist_msg]),
        results=final_answer,
        task_id=f"batch_eval_{i+1}"
    )
    score = sm.score_it()
    results.append({"task_id": f"batch_eval_{i+1}", "score": score})

print("Batch Evaluation Results:")
for result in results:
    print(f"Task {result['task_id']}: Score = {result['score']}")
```
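To summarize a batch, you can aggregate the per-task scores. This sketch assumes `score_it()` returns something `float()` accepts; adapt the conversion if it returns a structured object:
```python
# Average the batch scores.
numeric_scores = [float(r["score"]) for r in results]
if numeric_scores:
    mean_score = sum(numeric_scores) / len(numeric_scores)
    print(f"Mean score over {len(numeric_scores)} tasks: {mean_score:.2f}")
```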
### Custom Evaluation Criteria
You can extend the evaluation framework with custom metrics:
```python
# The AgentRunner provides structured data that can be used for custom evaluation
conversation_analysis = {
    "total_turns": len(HIST_MSG),
    "tool_usage_count": len([item for item in HIST_MSG if item["role"] == "tool"]),
    "has_stata_commands": any("stata" in item["content"].lower() for item in HIST_MSG),
    "final_answer_length": len(FINAL_ANSWER)
}
# Use these metrics alongside the ScoreModel score
print(f"Conversation Analysis: {conversation_analysis}")
```
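These custom metrics can then be combined with the `ScoreModel` result into a single evaluation record; the field names in this sketch are illustrative, not part of the library:
```python
# Merge the model-based score with the custom conversation metrics.
evaluation_record = {
    "task_id": "eval_001",
    "score": score,            # from ScoreModel.score_it()
    **conversation_analysis,   # custom metrics computed above
}
print(evaluation_record)
```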