mcp-foresea
Integrates Brave Search API to gather web evidence for forecasts.
Reads track record and radar data from GitHub repositories for public endpoints.
Fetches news articles from Google News RSS as evidence for prediction.
Retrieves candidate markets from Reddit for the niche-market radar.
Uses SearXNG as a self-hosted web search engine for evidence retrieval.
Analyzing LLM Rationale
Conference artifact for studying how explicit rationale instructions affect LLM forecasting behavior on Metaculus-style binary forecasting questions. The codebase contains the prompt variants, batch inference runner, generated result tables, and plotting/analysis scripts used for the paper figures. The live Foresea API also supports prediction-market intelligence: typed forecasts, evidence retrieval, and model-vs-market edge analysis for binary and multiple-choice markets.
Live API
Deployed on Google Cloud Run — model gpt-oss-120b, variant variant0_neutral_baseline:
https://foresea.ink
(The URL is printed in the GitHub Actions deploy-step output after the first push to main.)
# Health check
curl https://foresea.ink/health
# Single-record prediction
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by date Y?",
"question_type": "binary",
"description": "Context here.",
"news_articles": [],
"attach_evidence": true,
"evidence_top_k": 5,
"market_platform": "Polymarket",
"market_probability": 0.42,
"variant": "variant0_neutral_baseline"
}'
When attach_evidence is true and no news_articles are supplied, /predict
fetches and ranks current news evidence from GDELT, Google News RSS, and Stooq by
default, injects it into the model prompt, and returns the selected
evidence_articles with the forecast. Supplying news_articles skips automatic
retrieval and uses the caller-provided evidence.
The response includes both the forecast and the evidence used by the model:
{
"question_type": "binary",
"predicted_answer": "Yes",
"confidence": 0.86,
"options": [],
"range_forecast": null,
"rationale": "Model-generated explanation for the forecast.",
"model_rationale": "Model-generated explanation for the forecast.",
"variant": "variant0_neutral_baseline",
"model_key": "gpt-oss-120b",
"evidence_sources": [
{
"source": "Reuters",
"title": "Article headline",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82
}
],
"evidence_articles": [
{
"title": "Article headline",
"summary": "Cleaned article summary.",
"source": "Reuters",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82,
"search_query": "query used for retrieval"
}
],
"evidence_error": null,
"market_analysis": {
"platform": "Polymarket",
"market_url": "https://example.com/market",
"outcome": "Yes",
"market_probability": 0.42,
"model_probability": 0.86,
"edge": 0.44,
"stance": "model_above_market",
"summary": "Foresea is 44 percentage points above the market on Yes."
}
}
Use evidence_sources when a client only needs the source list and links. Use
evidence_articles when a client needs the article-level details that were
attached to the model prompt. rationale and model_rationale are generated by
gpt-oss-120b and explain why the model chose its answer and confidence.
When market_probability is supplied, market_analysis is computed
deterministically from the model probability and the market-implied probability.
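As an illustration of that deterministic step, here is a minimal sketch of how the edge could be derived from the two probabilities. The function names are hypothetical, and the rule that values above 1 are treated as percentages is an assumption based on the "0.42 or 42" note in the request fields; the server's actual normalization may differ.

```python
def normalize_probability(value: float) -> float:
    """Accept 0.42 or 42 and return a probability in [0, 1].

    Assumption: values above 1 are percentages. A value of exactly 1 is
    treated as certainty, not as 1 percent.
    """
    p = value / 100.0 if value > 1.0 else value
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {value}")
    return p


def market_edge(model_probability: float, market_probability: float) -> dict:
    """Compare the model probability with the market-implied probability."""
    market = normalize_probability(market_probability)
    edge = round(model_probability - market, 4)  # edge = model - market
    return {
        "market_probability": market,
        "model_probability": model_probability,
        "edge": edge,
        "points": round(abs(edge) * 100),  # gap in percentage points
    }
```

With the values from the sample response (model 0.86, market 0.42), this yields an edge of 0.44, or 44 percentage points.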
Production Deployment Notes
Production is served from the custom domain:
https://foresea.ink
The Cloud Run service name, project ID, and region are set at deploy time via gcloud run deploy.
Required runtime environment:
SCADS_AI_API_KEY: Secret Manager secret used by hosted model calls.
MODEL_DEVICE=cpu: production Cloud Run runs the CPU image.
CUSTOM_DOMAIN=foresea.ink: redirects *.run.app requests to the public domain.
GOOGLE_CLIENT_ID: Google OAuth web client ID used by /auth/config.
GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET: GitHub OAuth app credentials. The OAuth app's callback URL must be the site origin (e.g. https://foresea.ink/). When unset, the "Continue with GitHub" button is hidden and /auth/github returns 503. Sign-in also works with Google and email/password.
SESSION_SECRET: long random string used to sign browser session JWTs.
The OAuth client must allow these JavaScript origins:
https://foresea.ink
https://www.foresea.ink
https://<cloud-run-service-url>.run.app
To update non-secret environment variables without replacing the existing
SESSION_SECRET, use --update-env-vars:
gcloud run services update <service-name> \
--region <region> \
--project <project-id> \
--update-env-vars MODEL_DEVICE=cpu,CUSTOM_DOMAIN=foresea.ink,GOOGLE_CLIENT_ID='<your-google-client-id>'
Verify the deployed auth config and health endpoint:
curl https://foresea.ink/auth/config
curl https://foresea.ink/health
Scaling and caching
The server is built to scale horizontally on Cloud Run:
Authentication supports Google One-Tap and email/password (/auth/register, /auth/login). Passwords are stored as salted PBKDF2-HMAC-SHA256 hashes; accounts live in Cloud Datastore.
Caching and rate limiting use Redis when REDIS_URL is set, so they are shared across instances; otherwise they fall back to per-instance in-memory state and fail open.
/predict (non-personalised requests), evidence retrieval, and /extract URL fetches are cached; public GETs send Cache-Control.
Var | Default | Description |
REDIS_URL | unset | Memorystore/Redis URL. Shares cache + rate limits across instances. |
| | Cache TTL (s) for non-personalised /predict requests. |
| | Cache TTL (s) for evidence retrieval. |
| | Cache TTL (s) for /extract URL fetches. |
| | Max entries in the in-memory fallback cache. |
| unset | Enable web search as an evidence source. A self-hosted SearXNG is preferred when set, then Tavily, Serper, Brave. Tavily/Serper have free no-card tiers. When none is set, evidence comes from GDELT, Google News, and RSS. |
| unset | Enables NewsAPI as an evidence source. |
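To make the in-memory fallback concrete, the following is a minimal sketch of a per-instance cache with a TTL and a maximum entry count. The class name TTLCache and its interface are hypothetical and are not taken from the Foresea codebase; the injectable clock exists only to make the behaviour easy to test.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """Per-instance cache: entries expire after ttl_s seconds, and the
    oldest entry is evicted once max_entries is exceeded."""

    def __init__(self, ttl_s: float, max_entries: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._clock = clock
        self._items: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self.ttl_s, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the oldest entry
```

Because this state lives in one process, two Cloud Run instances would each hold their own copy, which is why a shared Redis is the option described for multi-instance deployments.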
Live track record
GET /track-record serves the public forecast track record. The heavy tick loop
does not run on Cloud Run: .github/workflows/track-record-tick.yml runs hourly
on GitHub Actions, updates data/track_record_store.json as the source-of-truth
entity store, writes the public aggregate to static/track_record_live.json, and
commits both files back to main. At runtime, Cloud Run fetches the committed
aggregate from raw GitHub, falling back to the bundled file and then the static
backtest in static/track_record.json.
The Action discovers short-to-medium-horizon Polymarket/Kalshi markets in
separate close-date bands (2-7, 7-14, 14-30, 30-60 days by default) and
calls /predict once per newly snapshotted market/model. If /predict is
protected, set the GitHub secret PREDICT_API_KEY; no server-side
/track-record/tick endpoint is required. TRACK_RECORD_TOKEN is optional and
only enables the agent-enrolled market bridge.
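The close-date banding can be sketched as below. This is an illustration only: the function name is hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the source does not say how boundary days are assigned.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default close-date bands in days, as listed above.
BANDS = [(2, 7), (7, 14), (14, 30), (30, 60)]


def close_date_band(close_time: datetime, now: datetime) -> Optional[tuple[int, int]]:
    """Return the (min_days, max_days) band a market falls in, or None.

    Assumption: lower bounds are inclusive and upper bounds exclusive, so a
    market closing in exactly 7 days lands in the 7-14 band.
    """
    days = (close_time - now) / timedelta(days=1)  # fractional days to close
    for low, high in BANDS:
        if low <= days < high:
            return (low, high)
    return None  # closes too soon or too far out for any band
```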
Foresea Radar
GET /radar serves the public niche-market radar used by the web app's Radar
view. .github/workflows/radar-tick.yml runs hourly on GitHub Actions, fetches
Polymarket/Kalshi/Reddit candidates, calls /predict for fresh Foresea
probabilities, writes static/radar.json, and commits it back to main.
Cloud Run only serves that committed JSON artifact, so Radar avoids
out-of-memory failures on Cloud Run without skipping inference.
Radar items include market price, Foresea probability, edge, credibility score,
evidence links, tracking status, and tags such as thin liquidity,
resolution-rule risk, news catalyst, crowded sports market, near-term,
and tracked live. The endpoint is cached with RADAR_TTL (default 900
seconds).
Raise the Cloud Run throughput ceiling (no idle cost while min-instances=0):
gcloud run services update analyzing-llm-rationale --region us-central1 \
--max-instances 20 --concurrency 40 --memory 1Gi
Once max-instances > 1, provision Memorystore for Redis (billable) and set
REDIS_URL so rate limiting and caching stay correct across instances:
gcloud services enable redis.googleapis.com vpcaccess.googleapis.com compute.googleapis.com
gcloud redis instances create foresea-cache --size=1 --region=us-central1 --tier=basic
gcloud compute networks vpc-access connectors create foresea-vpc \
--region=us-central1 --range=10.8.0.0/28
gcloud run services update analyzing-llm-rationale --region us-central1 \
--vpc-connector foresea-vpc \
--update-env-vars REDIS_URL=redis://<instance-host>:6379
Using the API
The public Cloud Run API is the easiest integration target. It accepts forecasting questions and returns a typed forecast, model rationale, and optional evidence articles. It is built for resolvable forecasts, not general Q&A.
Endpoints
GET /health: service health check.
GET /track-record: public live track record, falling back to the static backtest.
GET /track-record/digest: shareable markdown summary of the live track record.
GET /pr-agent: opt-in agent-to-agent outreach packet for Foresea discovery.
POST /predict: public prediction endpoint.
GET /markets/polymarket: fetch a live Polymarket quote (see below).
GET /markets/kalshi: fetch a live Kalshi quote (see below).
POST /agent/analyze: orchestrated end-to-end analysis of a live question (see below).
GET /agent/scan: scan a venue for mispriced markets, ranked by edge (see below).
GET /trading/accounts: authenticated trading-readiness status, no secrets returned.
POST /trading/preview: authenticated dry-run order normalization.
POST /trading/orders: authenticated live order submission with explicit confirmation.
Agent: automated intelligence layer
POST /agent/analyze runs the whole pipeline autonomously: resolve the market
(fetch a live Polymarket/Kalshi price when an identifier is given) → gather
evidence + forecast → price the edge → run any custom skills →
recommend. It returns one structured report.
curl -X POST https://foresea.ink/agent/analyze \
-H "Content-Type: application/json" \
-d '{
"platform": "polymarket",
"slug": "will-the-fed-cut-rates-in-2026",
"skills": [
{"name": "Base rate check", "instruction": "Compare to historical base rates."},
{"name": "Risk", "instruction": "What would most change this forecast?"}
]
}'
Custom skills are your own analysis steps: each runs as an extra model pass
over the question, forecast, and evidence, and comes back as a named section in
the report. Provide a question directly, or a platform + market identifier
(slug/market_id for Polymarket, ticker for Kalshi). Pass history (prior
turns) for multi-turn follow-ups — with history, short follow-ups like "why?" or
"what about June?" are answered in context. BYOK fields (openrouter_api_key,
openrouter_model, provider_base_url) apply here too.
The report includes recommendation (buy_yes/buy_no/hold/no_market_price),
edge, model_probability, market_probability, thesis, evidence_sources,
and pipeline (the ordered steps that ran).
Edge scan — find mispriced markets
GET /agent/scan lists live markets on a venue, forecasts each, and returns the
ones whose model-vs-market gap clears min_edge, ranked by |edge|.
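The filter-and-rank step described above can be sketched as follows. This is a client-side illustration with a hypothetical function name; reading "clears min_edge" as "absolute edge greater than or equal to min_edge" is an assumption, and the forecasting itself is not shown.

```python
def rank_opportunities(markets: list[dict], min_edge: float = 0.1) -> list[dict]:
    """Keep markets whose |edge| clears min_edge, largest gap first.

    Markets without a market price (market_probability is None) are skipped.
    """
    scored = []
    for market in markets:
        if market.get("market_probability") is None:
            continue
        # Round first so float noise (e.g. 0.0999999) does not hide a 0.1 gap.
        edge = round(market["model_probability"] - market["market_probability"], 4)
        if abs(edge) >= min_edge:
            scored.append({**market, "edge": edge})
    return sorted(scored, key=lambda m: abs(m["edge"]), reverse=True)
```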
curl "https://foresea.ink/agent/scan?platform=polymarket&limit=4&min_edge=0.1"
Params: platform (polymarket or kalshi), limit (markets to analyse, max 8),
min_edge (default 0.1), evidence_top_k. Each market runs a full forecast, so
it's bounded by limit and the result is cached briefly. Response: {platform,
scanned, opportunities: [{question, market_url, market_probability,
model_probability, edge, recommendation}]}. In the web app, the desk's
"⚡ Scan Polymarket for mispriced markets" button calls this.
MCP server: let AI agents call Foresea as tools
Foresea exposes a public remote MCP server at:
https://foresea.ink/mcp/
It is advertised for discovery at:
https://foresea.ink/.well-known/mcp/server.json
The remote MCP server is a thin tool layer over the public API. It exposes:
foresea_forecast: calls POST /predict.
foresea_analyze_market: calls POST /agent/analyze.
foresea_scan_markets: calls GET /agent/scan.
foresea_track_record: calls GET /track-record.
foresea_edge_board: calls GET /edge-board, returning live model-vs-market disagreements ranked, each tagged with the resolved track record of gaps that size (by_edge calibration + lead_lag).
foresea_pr_agent: calls GET /pr-agent, returning concise copy and install metadata for agents/catalogs that ask how to describe Foresea.
Resources: foresea://track-record, foresea://pr-agent, and foresea://openapi.json.
PR agent — agent-to-agent distribution
GET /pr-agent?audience=mcp returns an opt-in outreach packet that other agents,
MCP catalogs, and tool directories can quote when introducing Foresea. It includes
the one-liner, install command, MCP/OpenAPI links, talking points, and an explicit
no-spam policy.
For operator-run cold outreach to explicit agent endpoints, prepare a target list
and use the local runner. It dry-runs by default and only sends with --send:
python scripts/pr_agent_outreach.py --targets outreach-targets.json
python scripts/pr_agent_outreach.py --targets outreach-targets.json --send
Target file shape:
{
"targets": [
{
"name": "Example Agent Directory",
"endpoint": "https://agent-directory.example/inbox",
"audience": "catalog",
"headers": {"Authorization": "Bearer ..."}
}
]
}
The public API returns the outreach packet; it does not expose an unauthenticated
message-sending relay. The scheduled GitHub Action
.github/workflows/pr-agent-outreach.yml runs every 5 minutes against
data/pr_outreach_targets.json, sends with --send, and records contacted
targets in data/pr_outreach_state.json so repeated scheduled runs do not
re-contact the same agent. For a literal always-running local process, run:
python scripts/pr_agent_outreach.py \
--targets data/pr_outreach_targets.json \
--state data/pr_outreach_state.json \
--send --watch --interval-s 300
Header values can reference GitHub Actions secrets via environment variables, for
example "Authorization": "$PR_AGENT_TARGET_AUTH".
Seeded automated targets:
AgentNDX (https://agentndx.ai/api/submit): public MCP/A2A/x402 review form.
MCP.Directory (https://mcp.directory/api/submit-server): public JSON submit route.
mcpub (https://mcpub.dev/mcp): public MCP JSON-RPC submit tool.
Additional listing work that is not suitable for the scheduled HTTP sender lives
in data/pr_manual_targets.json. Current manual/GitHub target: mcp.so issue
https://github.com/daodao97/chatmcp/issues/213.
Add Foresea to your agent (10 seconds)
It's a remote, anonymous Streamable-HTTP server — no key, no install. Point any MCP client at the URL:
# Claude Code
claude mcp add --transport http foresea https://foresea.ink/mcp/
// Cursor / Cline / Claude Desktop (mcp.json)
{ "mcpServers": { "foresea": { "url": "https://foresea.ink/mcp/" } } }
# Python: official MCP SDK (3.10+)
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    # Open the remote Streamable-HTTP transport, then an MCP session over it.
    async with streamablehttp_client("https://foresea.ink/mcp/") as (r, w, _):
        async with ClientSession(r, w) as s:
            await s.initialize()
            print(await s.call_tool("foresea_forecast",
                {"question": "Will the Fed cut rates by March 2026?", "market_probability": 0.4}))

asyncio.run(main())
# LangChain (langchain-mcp-adapters): Foresea tools in any LangGraph agent
# (the await below must run inside an async function)
from langchain_mcp_adapters.client import MultiServerMCPClient
client = MultiServerMCPClient({"foresea": {"url": "https://foresea.ink/mcp/", "transport": "streamable_http"}})
tools = await client.get_tools()  # foresea_forecast, foresea_analyze_market, ...
A runnable end-to-end demo (scan → forecast → edge) is in
examples/foresea_agent_demo.py.
Use https://foresea.ink/mcp/ directly in MCP clients that support remote
Streamable HTTP servers. For clients that still require a local stdio command,
run the wrapper locally.
The repo targets Python 3.10+ because the official MCP Python SDK requires it.
To create a repo-local Python 3.11 MCP environment with uv:
uv venv --python 3.11 .venv-mcp
uv pip install --python .venv-mcp/bin/python --no-deps -e .
uv pip install --python .venv-mcp/bin/python "mcp>=1.27.1" requests pyyaml pip
source .venv-mcp/bin/activate
analyze-llm-rationale mcp-server
That lightweight install avoids pulling the full inference dependency stack
(notably Torch/CUDA) when all you need is the MCP wrapper. In a full development
environment, pip install -e ".[mcp]" is also valid.
MCP client config example:
{
"mcpServers": {
"foresea": {
"url": "https://foresea.ink/mcp/"
}
}
}
For a local HTTP MCP endpoint:
.venv-mcp/bin/analyze-llm-rationale mcp-server \
--transport streamable-http \
--host 127.0.0.1 \
--port 8787
Connect MCP clients to http://127.0.0.1:8787/mcp. If a private deployment
requires auth, set FORESEA_API_KEY or pass --api-key; the wrapper forwards it
as X-API-Key.
Quick verification:
.venv-mcp/bin/python - <<'PY'
import importlib.metadata as md
from analyzing_llm_rationale.mcp_server import create_mcp_server
print(md.version("mcp"))
print(create_mcp_server().name)
PY
Fetch live market prices
Pull the current market-implied probability straight from a venue, then feed it
into /predict as market_probability to compute an edge.
# Polymarket — by market slug (or ?id=<numeric id>)
curl "https://foresea.ink/markets/polymarket?slug=will-the-fed-cut-rates-in-2026"
# Kalshi — by market ticker
curl "https://foresea.ink/markets/kalshi?ticker=KXFED-26SEP-C"
Both return a normalised quote:
{
"platform": "Polymarket",
"question": "Will the Fed cut rates in 2026?",
"market_url": "https://polymarket.com/market/...",
"outcome": "Yes",
"probability": 0.54,
"outcomes": [
{"label": "Yes", "probability": 0.54},
{"label": "No", "probability": 0.46}
]
}
probability is null for unpriced/illiquid markets. Quotes are cached briefly
(MARKET_CACHE_TTL, default 30s).
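To show how a quote feeds into /predict, here is a small sketch that maps the normalised quote fields onto the documented request fields. The helper name is hypothetical; it only builds the request body and leaves the HTTP calls to the caller.

```python
from typing import Any


def predict_payload_from_quote(quote: dict[str, Any]) -> dict[str, Any]:
    """Turn a normalised market quote into a /predict request body."""
    payload: dict[str, Any] = {
        "question": quote["question"],
        "market_platform": quote["platform"],
        "market_url": quote.get("market_url"),
        "market_outcome": quote.get("outcome", "Yes"),
    }
    # probability is null for unpriced/illiquid markets; omit it in that
    # case so no edge is computed against a missing price.
    if quote.get("probability") is not None:
        payload["market_probability"] = quote["probability"]
    return payload
```

The resulting dictionary can be sent as the JSON body of POST /predict, for example with requests.post(..., json=payload).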
Trading execution: Polymarket and Kalshi
Foresea can submit guarded prediction-market orders, but live execution is
disabled by default. Keep this separate from /agent/analyze: the agent can
recommend buy_yes/buy_no, but order submission requires a signed-in user,
server-side exchange credentials, FORESEA_ENABLE_TRADING=true, execute=true,
and the exact confirmation phrase PLACE REAL ORDER.
Credentials are read only from the server environment, so use Cloud Run Secret Manager mounts or environment secrets. Do not collect private keys in the browser or store exchange secrets in Datastore.
# Global guardrails
export FORESEA_ENABLE_TRADING=false # must be true for live orders
export FORESEA_MAX_ORDER_NOTIONAL=50 # local cap per order, USD
export FORESEA_ALLOW_MARKET_ORDERS=false # separate gate for IOC/FOK-style orders
# Kalshi authenticated REST (RSA-PSS signing)
export KALSHI_API_KEY_ID=<kalshi-key-id>
export KALSHI_PRIVATE_KEY_FILE=/secrets/kalshi-private-key.pem
export KALSHI_BASE_URL=https://external-api.kalshi.com/trade-api/v2
# Polymarket CLOB SDK
export POLYMARKET_PRIVATE_KEY=<wallet-private-key>
export POLYMARKET_API_KEY=<clob-api-key>
export POLYMARKET_API_SECRET=<clob-api-secret>
export POLYMARKET_API_PASSPHRASE=<clob-api-passphrase>
export POLYMARKET_FUNDER_ADDRESS=<optional-funder-address>
export POLYMARKET_SIGNATURE_TYPE=<optional-signature-type>
Install the optional SDKs in production with:
pip install -e ".[serve,trading]"
The Docker image installs trading, so Cloud Run only needs secrets/env vars.
Check configured venues:
curl https://foresea.ink/trading/accounts \
-H "Authorization: Bearer $FORESEA_SESSION"
Preview a Kalshi order without execution:
curl -X POST https://foresea.ink/trading/preview \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1
}'
Submit a live order only after reviewing the preview:
curl -X POST https://foresea.ink/trading/orders \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1,
"execute": true,
"confirmation": "PLACE REAL ORDER"
}'
For Polymarket, pass the CLOB token_id for the exact outcome, or pass
slug/market_id plus outcome and Foresea will resolve the token id from the
public market record. Limit orders use quantity as shares. Market-buy orders
use max_cost as USD spend when supplied and remain blocked unless
FORESEA_ALLOW_MARKET_ORDERS=true.
Request fields
Required:
question: forecasting question, such as "Will X happen by date Y?", "Who will win X?", "What will X be?", or "When will X happen?".
Optional:
question_type: binary, multiple_choice, numeric, or date. If omitted, the model attempts to infer the type.
options: answer choices for multiple_choice questions.
description: extra context for the question.
resolution_criteria: how the question should resolve or be measured.
categories: list of topic labels.
news_articles: caller-supplied evidence articles. If provided, automatic evidence retrieval is skipped.
attach_evidence: defaults to true. When true and news_articles is empty, the API fetches current evidence from GDELT, Google News RSS, and Stooq.
evidence_top_k: number of evidence articles to attach, capped by the server.
market_platform: prediction market venue such as Polymarket, Kalshi, Manifold, or Metaculus.
market_url: URL for the market being analyzed.
market_outcome: outcome whose market price is supplied. Defaults to Yes for binary markets.
market_probability: current market-implied probability for market_outcome. Use 0.42 or 42; the API normalizes percentages.
variant: prompt variant. Defaults to variant0_neutral_baseline.
created_time, publish_time, resolve_time, days_open: optional forecasting metadata.
openrouter_api_key + openrouter_model: run the forecast on your own model instead of the server default (see "Bring your own model" below).
provider_base_url: optional OpenAI-compatible /chat/completions endpoint to use with your key/model instead of OpenRouter. Must be public HTTPS.
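A client can pre-check the "public HTTPS" requirement on provider_base_url before sending a request. The sketch below is a hypothetical helper, not the server's validator: it rejects non-HTTPS schemes, a short illustrative list of internal hostnames, and IP literals that are loopback, private, or link-local. It does not resolve DNS names, which a real server-side check would also need to do.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames rejected by name; an illustrative list, not the server's.
BLOCKED_HOSTS = {"localhost", "metadata.google.internal"}


def is_public_https_url(url: str) -> bool:
    """Return True for HTTPS URLs whose host is not obviously internal."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme != "https" or not host or host in BLOCKED_HOSTS:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: accepted here, but a full check would resolve it and
        # re-test every resulting address.
        return True
    # is_global is False for loopback, private, and link-local ranges,
    # including the 169.254.169.254 cloud-metadata address.
    return ip.is_global
```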
Bring your own model
By default /predict runs on the server's hosted model. To use your own:
Via OpenRouter: pass openrouter_api_key and openrouter_model (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5). The request is proxied through OpenRouter.
Via any OpenAI-compatible endpoint: also pass provider_base_url (e.g. https://api.openai.com/v1/chat/completions) with the matching openrouter_model (here just the provider's model ID, e.g. gpt-4o) and your key.
For safety, provider_base_url must be public HTTPS; loopback, private,
link-local, and cloud-metadata hosts are rejected. In the web app, the sidebar's
"Use your own model" panel exposes the provider, endpoint, key, and model.
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by 2027?",
"question_type": "binary",
"openrouter_api_key": "YOUR_KEY",
"openrouter_model": "gpt-4o",
"provider_base_url": "https://api.openai.com/v1/chat/completions"
}'
Binary request
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
"question_type": "binary",
"market_platform": "Polymarket",
"market_probability": 42
}'
Multiple-choice request
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Who will win the 2026 Formula 1 drivers championship?",
"question_type": "multiple_choice",
"options": ["Max Verstappen", "Lando Norris", "Charles Leclerc", "Lewis Hamilton", "Other"],
"attach_evidence": false
}'
Numeric request
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "What will US CPI inflation be in December 2026?",
"question_type": "numeric",
"resolution_criteria": "Use the year-over-year CPI-U inflation rate for December 2026."
}'
Request with caller-provided evidence
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will Company X report positive net income in Q4 2026?",
"description": "Resolve using the company earnings release.",
"resolution_criteria": "Yes if reported GAAP net income is positive.",
"attach_evidence": false,
"news_articles": [
{
"title": "Company X raises full-year guidance",
"source": "Example Business News",
"url": "https://example.com/company-x-guidance",
"publish_date": "2026-05-29",
"summary": "Company X raised revenue guidance and reported margin expansion."
}
]
}'
Python client example
import requests

payload = {
    "question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
    "question_type": "binary",
    "attach_evidence": True,
    "evidence_top_k": 3,
    "market_platform": "Polymarket",
    "market_probability": 42,
}
response = requests.post(
    "https://foresea.ink/predict",
    json=payload,
    timeout=180,
)
response.raise_for_status()
prediction = response.json()
print(prediction["predicted_answer"], prediction["confidence"])
print(prediction["model_rationale"])
# market_analysis is only populated when a market price was supplied.
if prediction.get("market_analysis"):
    print(prediction["market_analysis"]["summary"])
for source in prediction["evidence_sources"]:
    print(source["source"], source["url"])
Response fields
question_type: detected or requested type: binary, multiple_choice, numeric, or date.
predicted_answer: "Yes", "No", the top multiple-choice option, or the median numeric/date estimate.
confidence: model confidence as a number from 0 to 1 for binary and multiple-choice forecasts; null for numeric/date forecasts.
options: per-option probabilities for multiple-choice forecasts.
range_forecast: p10, p50, p90, and optional unit for numeric/date forecasts.
rationale: model-generated explanation.
model_rationale: alias for the model-generated explanation, intended for API clients.
evidence_sources: compact source list with article title, URL, publication date, and relevance score.
evidence_articles: full evidence records attached to the prompt.
evidence_error: retrieval error message, or null when evidence retrieval succeeds.
market_analysis: optional comparison against a supplied market price: market_probability, model_probability, edge, stance, and a short summary. edge is model_probability - market_probability.
Repository Contents
src/analyzing_llm_rationale/: packaged inference, provider, validation, and CLI logic.
configs/: model and rationale-variant definitions.
prompts/: system prompt and the nine rationale-variant prompts.
scripts/: evaluation, recovery, SHAP, plotting, and utility scripts.
slurm/: HPC launchers for the variant/temperature sweeps.
results/: model outputs and run metadata.
analysis/: aggregate metric tables and rationale-analysis outputs.
paper/: paper figures, Draw.io sources, PDFs, and qualitative case studies.
tests/: unit tests for the package and metric parsing.
See ARTIFACT_MANIFEST.md for the submission checklist and file-level notes.
Install
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,analysis]"
Use .[dev] for the core runner and tests only. Use .[analysis] when
regenerating plots or SHAP analyses.
Quick Validation
PYTHONPATH=src python -m analyzing_llm_rationale validate-dataset
python -m unittest discover -s tests
ruff check src tests scripts/*.py
PYTHONPATH=src is useful when the repository has not been installed yet or an
older user-local install shadows the working tree.
Primary Entry Point
Run the variant 3 pipeline with the packaged CLI:
analyze-llm-rationale run-batch --variant variant3_reasoning_type
For a remote OpenAI-compatible provider:
export PROVIDER_API_KEY=your_token
analyze-llm-rationale run-batch --variant variant3_reasoning_type --model llama-3.3-70b-instruct
If you do not want to install the package into the environment, invoke it directly:
PYTHONPATH=src python -m analyzing_llm_rationale run-batch --variant variant3_reasoning_type
Useful options:
--variant variant6_step_by_step_reasoning: choose the prompt/output contract.
--model qwen2.5-7b-instruct: choose a configured model definition.
--temperature 0.7: control generation temperature and output directory.
--max-records 10: process only a bounded number of records.
--reprocess-nulls: rerun existing rows with predicted_answer = null.
--drop-article-text: remove raw article text from prompts before inference.
--device auto: select cuda when available, otherwise cpu.
verify-results --variant ...: verify completeness, duplicates, malformed rows, and missing IDs.
validate-dataset: validate the dataset schema before a run.
Foresea Autoresearch
Foresea has a Karpathy-style autoresearch harness for prompt experiments: edit
one candidate prompt, run a fixed benchmark slice, score one metric, and append
an auditable experiment log. The research surface is
autoresearch/candidate_prompt.txt; agent instructions live in
autoresearch/program.md. The default --model gpt-oss-120b uses the
SCADS-hosted OpenAI-compatible endpoint from configs/models.yaml
(SCADS_AI_API_KEY or SCADS_AI_API_KEY.txt).
Run one candidate experiment:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--max-records 50 \
--metric brier_score
Compare against a baseline and promote only if the candidate improves:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--baseline-results-path results/GPT-OSS-120B/temperature_00/results_variant0_neutral_baseline.json \
--promote-to prompts/variant0_neutral_baseline.txt \
--max-records 50 \
--metric brier_score \
--min-delta 0.001
Each run writes analysis/autoresearch/runs/<run_id>/score.json and appends a
machine-readable row to analysis/autoresearch/experiments.jsonl.
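For readers unfamiliar with the metric, the sketch below shows a Brier score for binary forecasts and a promotion rule in the spirit of --min-delta. Both function names are hypothetical, and the exact comparison the harness performs is an assumption; the only fixed fact is that a lower Brier score is better.

```python
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def should_promote(candidate: float, baseline: float, min_delta: float = 0.001) -> bool:
    """Promote only when the candidate's Brier score is lower than the
    baseline's by at least min_delta (lower is better)."""
    return (baseline - candidate) >= min_delta
```

A constant 0.5 forecast scores 0.25 on any binary set, which makes a handy sanity baseline when reading experiment logs.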
Reproducing Core Outputs
Validate an existing result file:
PYTHONPATH=src python -m analyzing_llm_rationale verify-results \
--model qwen2.5-7b-instruct \
--variant variant3_reasoning_type \
--temperature 0.0 \
--temperature-tag temperature_000
Regenerate aggregate metrics from results/:
python scripts/evaluate_metrics.py
Run the DuckDB SQL analytics suite over the real Metaculus-style dataset and saved model outputs:
python scripts/sql_analytics.py \
--db analysis/forecasting_analytics.duckdb \
--ingest --replace \
--output-dir analysis/sql_analytics
This writes a markdown report plus one CSV per query for 10 medium-level SQL problems: model accuracy, best variants, calibration bins, Brier score, consensus/disagreement cases, prompt lift over baseline, temperature sensitivity, overconfident errors, and category difficulty.
Run the LangChain-powered news retrieval wrapper:
PYTHONPATH=src analyze-llm-rationale fetch-and-rank \
--question "Will X happen by date Y?" \
--source gdelt \
--source google-news \
--source stooq \
--top-k 5
The news pipeline uses LangChain for a query-planning step, article
summarization, and embedding-based relevance ranking before inference. Evidence
sources are configurable with --source for the CLI and --evidence-source
when serving the API.
Run or schedule the Prefect DAG for RSS/news fetch, inference, and DuckDB logging:
# One question
python flows/forecasting_flow.py --question-id 124 --top-k 5
# Small batch from the dataset
python flows/forecasting_flow.py --limit 3 --top-k 5
# Daily scheduled deployment at 06:00 UTC
prefect server start
python flows/forecasting_flow.py --deploy --limit 3 --cron "0 6 * * *"
Regenerate paper figures after metrics are present:
python scripts/plot_model_variant_metric_heatmap.py
python scripts/plot_variant_delta_from_v0.py
python scripts/plot_temperature_frontier.py
python scripts/plot_frs_ablation_slopegraph.py
python scripts/plot_uncertainty_language_calibration_disconnect.py
python scripts/plot_shap_importance_attribute_gaps.py
Scripts
Common runner and verification commands:
python scripts/run_variant.py --variant variant5_key_conditions
python scripts/run_variant.py --variant variant3_reasoning_type --temperature 0.7 --temperature-tag temperature_07
python scripts/run_variant.py --variant variant4_credibility --model llama-3.3-70b-instruct
python scripts/verify_results.py --variant variant3_reasoning_type
python download_qwen_model.py
python test_local_inference.py
Repo layout:
scripts/: modular runner entrypoint
slurm/: batch launchers
Auditability:
Each run writes run_metadata_<variant>.json next to the results file.
Metadata includes provider, model key, resolved model identifier, temperature, output fields, and prompt SHA-256 hashes.
Existing malformed results JSON now fails fast instead of being silently ignored.
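As a small illustration of the prompt-hash idea, the sketch below computes a SHA-256 digest per prompt so that a later run can be checked against the exact prompt text. The helper names and the metadata key layout are hypothetical and do not reflect the repository's actual run_metadata schema.

```python
import hashlib


def prompt_sha256(prompt_text: str) -> str:
    """Hex SHA-256 digest of a prompt, hashed as UTF-8 bytes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def prompt_hashes(prompts: dict[str, str]) -> dict[str, str]:
    """Map each prompt name to its digest (hypothetical metadata layout)."""
    return {name: prompt_sha256(text) for name, text in prompts.items()}
```

Any edit to a prompt, even a single whitespace change, produces a different digest, which is what makes the hash useful for auditing.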
Quality checks
python -m unittest discover -s tests
ruff check src tests scripts/*.py
Data, Models, and Secrets
The included dataset is forecasting_qa_news_metaculus_2025-02-01_to_today.metaculus_frs_format.json.
Model access is configured in configs/models.yaml. Open-weight Qwen models run
locally through Hugging Face; hosted models use OpenAI-compatible endpoints and
require API keys through environment variables or local key files.
Never commit key files or tokens. Large local caches (.cache/, envs/, .venv/)
are intentionally ignored and excluded from source archives.
Citation
If this repository supports a publication, cite the artifact with the metadata in
CITATION.cff and cite the upstream datasets/models according to their licenses.