Analyzing LLM Rationale

Conference artifact for studying how explicit rationale instructions affect LLM forecasting behavior on Metaculus-style binary forecasting questions. The codebase contains the prompt variants, batch inference runner, generated result tables, and plotting/analysis scripts used for the paper figures. The live Foresea API also supports prediction-market intelligence: typed forecasts, evidence retrieval, and model-vs-market edge analysis for binary and multiple-choice markets.

Live API

Deployed on Google Cloud Run — model gpt-oss-120b, variant variant0_neutral_baseline:

https://foresea.ink

(The URL is printed in the GitHub Actions deploy-step output after the first push to main.)

# Health check
curl https://foresea.ink/health

# Single-record prediction
curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Will X happen by date Y?",
    "question_type": "binary",
    "description": "Context here.",
    "news_articles": [],
    "attach_evidence": true,
    "evidence_top_k": 5,
    "market_platform": "Polymarket",
    "market_probability": 0.42,
    "variant": "variant0_neutral_baseline"
  }'

When attach_evidence is true and no news_articles are supplied, /predict fetches and ranks current news evidence from GDELT, Google News RSS, and Stooq by default, injects it into the model prompt, and returns the selected evidence_articles with the forecast. Supplying news_articles skips automatic retrieval and uses the caller-provided evidence.

The response includes both the forecast and the evidence used by the model:

{
  "question_type": "binary",
  "predicted_answer": "Yes",
  "confidence": 0.86,
  "options": [],
  "range_forecast": null,
  "rationale": "Model-generated explanation for the forecast.",
  "model_rationale": "Model-generated explanation for the forecast.",
  "variant": "variant0_neutral_baseline",
  "model_key": "gpt-oss-120b",
  "evidence_sources": [
    {
      "source": "Reuters",
      "title": "Article headline",
      "url": "https://example.com/article",
      "publish_date": "2026-05-29T00:00:00Z",
      "relevance_score": 0.82
    }
  ],
  "evidence_articles": [
    {
      "title": "Article headline",
      "summary": "Cleaned article summary.",
      "source": "Reuters",
      "url": "https://example.com/article",
      "publish_date": "2026-05-29T00:00:00Z",
      "relevance_score": 0.82,
      "search_query": "query used for retrieval"
    }
  ],
  "evidence_error": null,
  "market_analysis": {
    "platform": "Polymarket",
    "market_url": "https://example.com/market",
    "outcome": "Yes",
    "market_probability": 0.42,
    "model_probability": 0.86,
    "edge": 0.44,
    "stance": "model_above_market",
    "summary": "Foresea is 44 percentage points above the market on Yes."
  }
}

Use evidence_sources when a client only needs the source list and links. Use evidence_articles when a client needs the article-level details that were attached to the model prompt. rationale and model_rationale are generated by gpt-oss-120b and explain why the model chose its answer and confidence. When market_probability is supplied, market_analysis is computed deterministically from the model probability and the market-implied probability.
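
The deterministic comparison can be illustrated with a small sketch. It assumes edge is model probability minus market probability, as the response-field notes state. The `tolerance` parameter, the `model_below_market` and `model_near_market` labels, and the near-market summary wording are hypothetical and not confirmed by the API.

```python
from typing import Dict, Union


def market_analysis(
    model_probability: float,
    market_probability: float,
    outcome: str = "Yes",
    tolerance: float = 0.005,  # hypothetical dead band around zero edge
) -> Dict[str, Union[str, float]]:
    """Compare a model probability with a market-implied probability."""
    # Round to avoid floating-point noise such as 0.43999999999999995.
    edge = round(model_probability - market_probability, 4)
    points = round(abs(edge) * 100)

    if edge > tolerance:
        stance = "model_above_market"
        summary = f"Foresea is {points} percentage points above the market on {outcome}."
    elif edge < -tolerance:
        stance = "model_below_market"  # assumed label
        summary = f"Foresea is {points} percentage points below the market on {outcome}."
    else:
        stance = "model_near_market"  # assumed label
        summary = f"Foresea is in line with the market on {outcome}."  # assumed wording

    return {
        "outcome": outcome,
        "market_probability": market_probability,
        "model_probability": model_probability,
        "edge": edge,
        "stance": stance,
        "summary": summary,
    }
```

Called with the values from the sample response (`market_analysis(0.86, 0.42)`), the sketch reproduces the edge of 0.44 and the `model_above_market` stance shown above.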

Production Deployment Notes

Production is served from the custom domain:

https://foresea.ink

The Cloud Run service name, project ID, and region are set at deploy time via gcloud run deploy.

Required runtime environment:

  • SCADS_AI_API_KEY: Secret Manager secret used by hosted model calls.

  • MODEL_DEVICE=cpu: production Cloud Run runs the CPU image.

  • CUSTOM_DOMAIN=foresea.ink: redirects *.run.app requests to the public domain.

  • GOOGLE_CLIENT_ID: Google OAuth web client ID used by /auth/config.

  • GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET: GitHub OAuth app credentials. The OAuth app's callback URL must be the site origin (e.g. https://foresea.ink/). When unset, the "Continue with GitHub" button is hidden and /auth/github returns 503. Sign-in also works with Google and email/password.

  • SESSION_SECRET: long random string used to sign browser session JWTs.

The OAuth client must allow these JavaScript origins:

https://foresea.ink
https://www.foresea.ink
https://<cloud-run-service-url>.run.app

To update non-secret environment variables without replacing the existing SESSION_SECRET, use --update-env-vars:

gcloud run services update <service-name> \
  --region <region> \
  --project <project-id> \
  --update-env-vars MODEL_DEVICE=cpu,CUSTOM_DOMAIN=foresea.ink,GOOGLE_CLIENT_ID='<your-google-client-id>'

Verify the deployed auth config and health endpoint:

curl https://foresea.ink/auth/config
curl https://foresea.ink/health

Scaling and caching

The server is built to scale horizontally on Cloud Run:

  • Authentication supports Google One-Tap and email/password (/auth/register, /auth/login). Passwords are stored as salted PBKDF2-HMAC-SHA256 hashes; accounts live in Cloud Datastore.

  • Caching and rate limiting use Redis when REDIS_URL is set, so they are shared across instances; otherwise they fall back to per-instance in-memory state and fail open. /predict (non-personalised requests), evidence retrieval, and /extract URL fetches are cached; public GETs send Cache-Control.

| Var | Default | Description |
| --- | --- | --- |
| REDIS_URL | unset | Memorystore/Redis URL. Shares cache + rate limits across instances. |
| PREDICT_CACHE_TTL | 600 | Cache TTL (s) for non-personalised /predict responses. 0 disables. |
| EVIDENCE_CACHE_TTL | 900 | Cache TTL (s) for evidence retrieval. |
| EXTRACT_CACHE_TTL | 3600 | Cache TTL (s) for /extract URL fetches. |
| LOCAL_CACHE_MAX | 1024 | Max entries in the in-memory fallback cache. |
| SEARXNG_URL / TAVILY_API_KEY / SERPER_API_KEY / BRAVE_API_KEY | unset | Enable web search as an evidence source. A self-hosted SearXNG is preferred when set, then Tavily, Serper, Brave. Tavily/Serper have free no-card tiers. When none is set, evidence comes from GDELT, Google News, and RSS. |
| NEWSAPI_KEY | unset | Enables NewsAPI as an evidence source. |
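
To make the per-instance fallback concrete, here is a minimal sketch of an in-memory cache with a TTL and an entry cap. The class name `LocalTTLCache`, the eviction policy (least recently used), and the injectable clock are assumptions for illustration; the server's actual cache may differ.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class LocalTTLCache:
    """In-memory cache with per-entry expiry and a maximum size."""

    def __init__(self, max_entries: int = 1024,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries      # mirrors LOCAL_CACHE_MAX
        self._clock = clock          # injectable so expiry can be tested
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._items[key]
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value

    def set(self, key: str, value: Any, ttl_s: float) -> None:
        if ttl_s <= 0:
            return  # a TTL of 0 disables caching, as with PREDICT_CACHE_TTL
        self._items[key] = (self._clock() + ttl_s, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used
```

Because this state lives inside one process, two Cloud Run instances would each hold their own copy, which is why a shared REDIS_URL matters once more than one instance is running.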

Live track record

GET /track-record serves the public forecast track record. The heavy tick loop does not run on Cloud Run: .github/workflows/track-record-tick.yml runs hourly on GitHub Actions, updates data/track_record_store.json as the source-of-truth entity store, writes the public aggregate to static/track_record_live.json, and commits both files back to main. At runtime, Cloud Run fetches the committed aggregate from raw GitHub, falling back to the bundled file and then the static backtest in static/track_record.json.
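
The runtime fallback order can be sketched as a chain of loaders tried in sequence. The helper names `first_available` and `load_json_file` are hypothetical, and the remote fetch is left as a caller-supplied function because the exact raw GitHub URL is not given here.

```python
import json
from pathlib import Path
from typing import Any, Callable, Iterable, Optional


def load_json_file(path: str) -> Any:
    """Read and parse a JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def first_available(loaders: Iterable[Callable[[], Any]]) -> Optional[Any]:
    """Return the first non-empty result; skip loaders that raise or return nothing."""
    for loader in loaders:
        try:
            data = loader()
        except Exception:
            # Network errors, missing files, and bad JSON all mean "try the next source".
            continue
        if data:
            return data
    return None


def load_track_record(fetch_remote: Callable[[], Any]) -> Optional[Any]:
    """Try the committed aggregate, then the bundled file, then the static backtest."""
    return first_available([
        fetch_remote,  # caller-supplied fetch of the committed aggregate
        lambda: load_json_file("static/track_record_live.json"),
        lambda: load_json_file("static/track_record.json"),
    ])
```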

The Action discovers short-to-medium-horizon Polymarket/Kalshi markets in separate close-date bands (2-7, 7-14, 14-30, 30-60 days by default) and calls /predict once per newly snapshotted market/model. If /predict is protected, set the GitHub secret PREDICT_API_KEY; no server-side /track-record/tick endpoint is required. TRACK_RECORD_TOKEN is optional and only enables the agent-enrolled market bridge.
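
As an illustration of the close-date bands, the sketch below assigns a market to a band from its close time. The band edges come from the defaults above; treating each band as half-open (lower bound included, upper bound excluded) is an assumption, as is the function name `close_date_band`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence, Tuple

# Default close-date bands in days: 2-7, 7-14, 14-30, 30-60.
DEFAULT_BANDS: Sequence[Tuple[int, int]] = ((2, 7), (7, 14), (14, 30), (30, 60))


def close_date_band(
    close_time: datetime,
    now: datetime,
    bands: Sequence[Tuple[int, int]] = DEFAULT_BANDS,
) -> Optional[Tuple[int, int]]:
    """Return the (low, high) band containing the days until close, or None."""
    days_to_close = (close_time - now).total_seconds() / 86400
    for low, high in bands:
        # Assumed half-open interval so that a boundary day lands in exactly one band.
        if low <= days_to_close < high:
            return (low, high)
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(close_date_band(now + timedelta(days=10), now))
```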

Foresea Radar

GET /radar serves the public niche-market radar used by the web app's Radar view. .github/workflows/radar-tick.yml runs hourly on GitHub Actions, fetches Polymarket/Kalshi/Reddit candidates, calls /predict for fresh Foresea probabilities, writes static/radar.json, and commits it back to main. Cloud Run only serves that committed JSON artifact, so the Radar endpoint avoids out-of-memory failures while its probabilities still come from real inference.

Radar items include market price, Foresea probability, edge, credibility score, evidence links, tracking status, and tags such as thin liquidity, resolution-rule risk, news catalyst, crowded sports market, near-term, and tracked live. The endpoint is cached with RADAR_TTL (default 900 seconds).

Raise the Cloud Run throughput ceiling (no idle cost while min-instances=0):

gcloud run services update analyzing-llm-rationale --region us-central1 \
  --max-instances 20 --concurrency 40 --memory 1Gi

Once max-instances > 1, provision Memorystore for Redis (billable) and set REDIS_URL so rate limiting and caching stay correct across instances:

gcloud services enable redis.googleapis.com vpcaccess.googleapis.com compute.googleapis.com
gcloud redis instances create foresea-cache --size=1 --region=us-central1 --tier=basic
gcloud compute networks vpc-access connectors create foresea-vpc \
  --region=us-central1 --range=10.8.0.0/28
gcloud run services update analyzing-llm-rationale --region us-central1 \
  --vpc-connector foresea-vpc \
  --update-env-vars REDIS_URL=redis://<instance-host>:6379

Using the API

The public Cloud Run API is the easiest integration target. It accepts forecasting questions and returns a typed forecast, model rationale, and optional evidence articles. It is built for resolvable forecasts, not general Q&A.

Endpoints

  • GET /health: service health check.

  • GET /track-record: public live track record, falling back to the static backtest.

  • GET /track-record/digest: shareable markdown summary of the live track record.

  • GET /pr-agent: opt-in agent-to-agent outreach packet for Foresea discovery.

  • POST /predict: public prediction endpoint.

  • GET /markets/polymarket: fetch a live Polymarket quote (see below).

  • GET /markets/kalshi: fetch a live Kalshi quote (see below).

  • POST /agent/analyze: orchestrated end-to-end analysis of a live question (see below).

  • GET /agent/scan: scan a venue for mispriced markets, ranked by edge (see below).

  • GET /trading/accounts: authenticated trading-readiness status, no secrets returned.

  • POST /trading/preview: authenticated dry-run order normalization.

  • POST /trading/orders: authenticated live order submission with explicit confirmation.

Agent: automated intelligence layer

POST /agent/analyze runs the whole pipeline autonomously: resolve the market (fetch a live Polymarket/Kalshi price when an identifier is given) → gather evidence + forecast → price the edge → run any custom skills → recommend. It returns one structured report.

curl -X POST https://foresea.ink/agent/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "polymarket",
    "slug": "will-the-fed-cut-rates-in-2026",
    "skills": [
      {"name": "Base rate check", "instruction": "Compare to historical base rates."},
      {"name": "Risk", "instruction": "What would most change this forecast?"}
    ]
  }'

Custom skills are your own analysis steps — each runs as an extra model pass over the question, forecast, and evidence, and comes back as a named section in the report. Provide a question directly, or a platform + market identifier (slug/market_id for Polymarket, ticker for Kalshi). Pass history (prior turns) for multi-turn follow-ups — with history, short follow-ups like "why?" or "what about June?" are answered in context. BYOK fields (openrouter_api_key, openrouter_model, provider_base_url) apply here too. The report includes recommendation (buy_yes/buy_no/hold/no_market_price), edge, model_probability, market_probability, thesis, evidence_sources, and pipeline (the ordered steps that ran).
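
One way to picture how the four recommendation values relate to the edge is the sketch below. The values themselves are the ones listed above, but the decision rule and the `min_edge` threshold of 0.05 are hypothetical; the service may weigh evidence or liquidity in ways not shown here.

```python
from typing import Optional


def recommend(
    model_probability: float,
    market_probability: Optional[float],
    min_edge: float = 0.05,  # hypothetical threshold, not the server's value
) -> str:
    """Map a model-vs-market gap to one of the report's recommendation values."""
    if market_probability is None:
        # No live price was resolved, so there is nothing to trade against.
        return "no_market_price"
    edge = model_probability - market_probability
    if edge >= min_edge:
        return "buy_yes"   # model thinks Yes is underpriced
    if edge <= -min_edge:
        return "buy_no"    # model thinks Yes is overpriced
    return "hold"
```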

Edge scan — find mispriced markets

GET /agent/scan lists live markets on a venue, forecasts each, and returns the ones whose model-vs-market gap clears min_edge, ranked by |edge|.

curl "https://foresea.ink/agent/scan?platform=polymarket&limit=4&min_edge=0.1"

Params: platform (polymarket or kalshi), limit (markets to analyse, max 8), min_edge (default 0.1), evidence_top_k. Each market runs a full forecast, so it's bounded by limit and the result is cached briefly. Response: {platform, scanned, opportunities: [{question, market_url, market_probability, model_probability, edge, recommendation}]}. In the web app, the desk's "⚡ Scan Polymarket for mispriced markets" button calls this.
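
The filter-and-rank step can be sketched as follows, using the field names from the response shape above. Whether the gap must be greater than or equal to min_edge (as assumed here) or strictly greater is not stated, and `rank_opportunities` is a hypothetical name.

```python
from typing import Any, Dict, List


def rank_opportunities(
    markets: List[Dict[str, Any]], min_edge: float = 0.1
) -> List[Dict[str, Any]]:
    """Keep markets whose |edge| clears min_edge and sort them by |edge|, largest first."""
    kept = []
    for market in markets:
        model_p = market.get("model_probability")
        market_p = market.get("market_probability")
        if model_p is None or market_p is None:
            continue  # unpriced or unforecast markets cannot be scored
        edge = round(model_p - market_p, 4)
        if abs(edge) >= min_edge:  # assumed inclusive comparison
            kept.append({**market, "edge": edge})
    return sorted(kept, key=lambda m: abs(m["edge"]), reverse=True)
```

Ranking by absolute edge means a market the model considers heavily overpriced appears alongside underpriced ones, which matches the buy_yes/buy_no recommendations in the response.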

MCP server: let AI agents call Foresea as tools

Foresea exposes a public remote MCP server at:

https://foresea.ink/mcp/

It is advertised for discovery at:

https://foresea.ink/.well-known/mcp/server.json

The remote MCP server is a thin tool layer over the public API. It exposes:

  • foresea_forecast: calls POST /predict.

  • foresea_analyze_market: calls POST /agent/analyze.

  • foresea_scan_markets: calls GET /agent/scan.

  • foresea_track_record: calls GET /track-record.

  • foresea_edge_board: calls GET /edge-board — live model-vs-market disagreements ranked, each tagged with the resolved track record of gaps that size (by_edge calibration + lead_lag).

  • foresea_pr_agent: calls GET /pr-agent — concise copy and install metadata for agents/catalogs that ask how to describe Foresea.

  • Resources: foresea://track-record, foresea://pr-agent, and foresea://openapi.json.

PR agent — agent-to-agent distribution

GET /pr-agent?audience=mcp returns an opt-in outreach packet that other agents, MCP catalogs, and tool directories can quote when introducing Foresea. It includes the one-liner, install command, MCP/OpenAPI links, talking points, and an explicit no-spam policy.

For operator-run cold outreach to explicit agent endpoints, prepare a target list and use the local runner. It dry-runs by default and only sends with --send:

python scripts/pr_agent_outreach.py --targets outreach-targets.json
python scripts/pr_agent_outreach.py --targets outreach-targets.json --send

Target file shape:

{
  "targets": [
    {
      "name": "Example Agent Directory",
      "endpoint": "https://agent-directory.example/inbox",
      "audience": "catalog",
      "headers": {"Authorization": "Bearer ..."}
    }
  ]
}

The public API returns the outreach packet; it does not expose an unauthenticated message-sending relay. The scheduled GitHub Action .github/workflows/pr-agent-outreach.yml runs every 5 minutes against data/pr_outreach_targets.json, sends with --send, and records contacted targets in data/pr_outreach_state.json so repeated scheduled runs do not re-contact the same agent. For a literal always-running local process, run:

python scripts/pr_agent_outreach.py \
  --targets data/pr_outreach_targets.json \
  --state data/pr_outreach_state.json \
  --send --watch --interval-s 300

Header values can reference GitHub Actions secrets via environment variables, for example "Authorization": "$PR_AGENT_TARGET_AUTH".

Seeded automated targets:

  • AgentNDX (https://agentndx.ai/api/submit) — public MCP/A2A/x402 review form.

  • MCP.Directory (https://mcp.directory/api/submit-server) — public JSON submit route.

  • mcpub (https://mcpub.dev/mcp) — public MCP JSON-RPC submit tool.

Additional listing work that is not suitable for the scheduled HTTP sender lives in data/pr_manual_targets.json. Current manual/GitHub target: mcp.so issue https://github.com/daodao97/chatmcp/issues/213.

Add Foresea to your agent (10 seconds)

It's a remote, anonymous Streamable-HTTP server — no key, no install. Point any MCP client at the URL:

# Claude Code
claude mcp add --transport http foresea https://foresea.ink/mcp/

// Cursor / Cline / Claude Desktop (mcp.json)
{ "mcpServers": { "foresea": { "url": "https://foresea.ink/mcp/" } } }

# Python: official MCP SDK (Python 3.10+)
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    # Open a Streamable HTTP connection to the remote MCP server.
    async with streamablehttp_client("https://foresea.ink/mcp/") as (r, w, _):
        async with ClientSession(r, w) as s:
            await s.initialize()
            print(await s.call_tool("foresea_forecast",
                  {"question": "Will the Fed cut rates by March 2026?", "market_probability": 0.4}))

asyncio.run(main())

# LangChain (langchain-mcp-adapters): Foresea tools in any LangGraph agent
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient

async def load_tools():
    client = MultiServerMCPClient({"foresea": {"url": "https://foresea.ink/mcp/", "transport": "streamable_http"}})
    return await client.get_tools()   # foresea_forecast, foresea_analyze_market, ...

tools = asyncio.run(load_tools())

A runnable end-to-end demo (scan → forecast → edge) is in examples/foresea_agent_demo.py.

Use https://foresea.ink/mcp/ directly in MCP clients that support remote Streamable HTTP servers. For clients that still require a local stdio command, run the wrapper locally.

The repo targets Python 3.10+ because the official MCP Python SDK requires it. To create a repo-local Python 3.11 MCP environment with uv:

uv venv --python 3.11 .venv-mcp

uv pip install --python .venv-mcp/bin/python --no-deps -e .
uv pip install --python .venv-mcp/bin/python "mcp>=1.27.1" requests pyyaml pip

source .venv-mcp/bin/activate
analyze-llm-rationale mcp-server

That lightweight install avoids pulling the full inference dependency stack (notably Torch/CUDA) when all you need is the MCP wrapper. In a full development environment, pip install -e ".[mcp]" is also valid.

MCP client config example:

{
  "mcpServers": {
    "foresea": {
      "url": "https://foresea.ink/mcp/"
    }
  }
}

For a local HTTP MCP endpoint:

.venv-mcp/bin/analyze-llm-rationale mcp-server \
  --transport streamable-http \
  --host 127.0.0.1 \
  --port 8787

Connect MCP clients to http://127.0.0.1:8787/mcp. If a private deployment requires auth, set FORESEA_API_KEY or pass --api-key; the wrapper forwards it as X-API-Key.

Quick verification:

.venv-mcp/bin/python - <<'PY'
import importlib.metadata as md
from analyzing_llm_rationale.mcp_server import create_mcp_server

print(md.version("mcp"))
print(create_mcp_server().name)
PY

Fetch live market prices

Pull the current market-implied probability straight from a venue, then feed it into /predict as market_probability to compute an edge.

# Polymarket — by market slug (or ?id=<numeric id>)
curl "https://foresea.ink/markets/polymarket?slug=will-the-fed-cut-rates-in-2026"

# Kalshi — by market ticker
curl "https://foresea.ink/markets/kalshi?ticker=KXFED-26SEP-C"

Both return a normalised quote:

{
  "platform": "Polymarket",
  "question": "Will the Fed cut rates in 2026?",
  "market_url": "https://polymarket.com/market/...",
  "outcome": "Yes",
  "probability": 0.54,
  "outcomes": [
    {"label": "Yes", "probability": 0.54},
    {"label": "No", "probability": 0.46}
  ]
}

probability is null for unpriced/illiquid markets. Quotes are cached briefly (MARKET_CACHE_TTL, default 30s).
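
A small sketch of the hand-off from a quote to a /predict request body follows. It uses only field names that appear in the quote and in the request fields; the helper name `predict_payload_from_quote` is hypothetical, and leaving the market fields out for unpriced markets is an assumption about sensible client behaviour.

```python
from typing import Any, Dict


def predict_payload_from_quote(
    quote: Dict[str, Any], question_type: str = "binary"
) -> Dict[str, Any]:
    """Build a /predict request body from a normalised market quote."""
    payload: Dict[str, Any] = {
        "question": quote["question"],
        "question_type": question_type,
        "market_platform": quote["platform"],
    }
    if quote.get("market_url"):
        payload["market_url"] = quote["market_url"]
    # probability is null for unpriced or illiquid markets; skip the price then.
    if quote.get("probability") is not None:
        payload["market_outcome"] = quote.get("outcome", "Yes")
        payload["market_probability"] = quote["probability"]
    return payload
```

The resulting dictionary can be sent as the JSON body of POST /predict, for example with `requests.post(..., json=payload)`.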

Trading execution: Polymarket and Kalshi

Foresea can submit guarded prediction-market orders, but live execution is disabled by default. Keep this separate from /agent/analyze: the agent can recommend buy_yes/buy_no, but order submission requires a signed-in user, server-side exchange credentials, FORESEA_ENABLE_TRADING=true, execute=true, and the exact confirmation phrase PLACE REAL ORDER.

Credentials are read only from the server environment, so use Cloud Run Secret Manager mounts or environment secrets. Do not collect private keys in the browser or store exchange secrets in Datastore.

# Global guardrails
export FORESEA_ENABLE_TRADING=false          # must be true for live orders
export FORESEA_MAX_ORDER_NOTIONAL=50         # local cap per order, USD
export FORESEA_ALLOW_MARKET_ORDERS=false     # separate gate for IOC/FOK-style orders

# Kalshi authenticated REST (RSA-PSS signing)
export KALSHI_API_KEY_ID=<kalshi-key-id>
export KALSHI_PRIVATE_KEY_FILE=/secrets/kalshi-private-key.pem
export KALSHI_BASE_URL=https://external-api.kalshi.com/trade-api/v2

# Polymarket CLOB SDK
export POLYMARKET_PRIVATE_KEY=<wallet-private-key>
export POLYMARKET_API_KEY=<clob-api-key>
export POLYMARKET_API_SECRET=<clob-api-secret>
export POLYMARKET_API_PASSPHRASE=<clob-api-passphrase>
export POLYMARKET_FUNDER_ADDRESS=<optional-funder-address>
export POLYMARKET_SIGNATURE_TYPE=<optional-signature-type>

Install the optional SDKs in production with:

pip install -e ".[serve,trading]"

The Docker image installs trading, so Cloud Run only needs secrets/env vars.

Check configured venues:

curl https://foresea.ink/trading/accounts \
  -H "Authorization: Bearer $FORESEA_SESSION"

Preview a Kalshi order without execution:

curl -X POST https://foresea.ink/trading/preview \
  -H "Authorization: Bearer $FORESEA_SESSION" \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "kalshi",
    "ticker": "KXFED-26SEP-C",
    "action": "buy",
    "outcome": "yes",
    "price": 0.42,
    "quantity": 1
  }'

Submit a live order only after reviewing the preview:

curl -X POST https://foresea.ink/trading/orders \
  -H "Authorization: Bearer $FORESEA_SESSION" \
  -H "Content-Type: application/json" \
  -d '{
    "platform": "kalshi",
    "ticker": "KXFED-26SEP-C",
    "action": "buy",
    "outcome": "yes",
    "price": 0.42,
    "quantity": 1,
    "execute": true,
    "confirmation": "PLACE REAL ORDER"
  }'

For Polymarket, pass the CLOB token_id for the exact outcome, or pass slug/market_id plus outcome and Foresea will resolve the token id from the public market record. Limit orders use quantity as shares. Market-buy orders use max_cost as USD spend when supplied and remain blocked unless FORESEA_ALLOW_MARKET_ORDERS=true.
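
The layered guardrails can be pictured as a list of checks that must all pass before a live order goes out. This is a sketch under stated assumptions: the function name `live_order_blockers` is hypothetical, notional is taken as price times quantity for a limit order, the fallback cap of 50 simply mirrors the example export above, and the signed-in-user and exchange-credential checks are omitted.

```python
import os
from typing import Any, Dict, List, Mapping, Optional

CONFIRMATION_PHRASE = "PLACE REAL ORDER"


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Treat only the literal string 'true' (any case) as enabled."""
    return env.get(name, "false").strip().lower() == "true"


def live_order_blockers(
    order: Dict[str, Any], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the reasons a live order would be refused; an empty list means it may proceed."""
    env = os.environ if env is None else env
    blockers: List[str] = []

    if not _flag(env, "FORESEA_ENABLE_TRADING"):
        blockers.append("FORESEA_ENABLE_TRADING is not true")
    if order.get("execute") is not True:
        blockers.append("execute must be true")
    if order.get("confirmation") != CONFIRMATION_PHRASE:
        blockers.append("confirmation phrase does not match")

    # Assumed notional for a limit order: price per share times share count.
    cap = float(env.get("FORESEA_MAX_ORDER_NOTIONAL", "50"))
    notional = float(order["price"]) * float(order["quantity"])
    if notional > cap:
        blockers.append(f"notional {notional:.2f} exceeds cap {cap:.2f}")

    return blockers
```

Returning every failed check, rather than stopping at the first, lets a preview response tell the user everything that still needs to change.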

Request fields

Required:

  • question: forecasting question, such as "Will X happen by date Y?", "Who will win X?", "What will X be?", or "When will X happen?".

Optional:

  • question_type: binary, multiple_choice, numeric, or date. If omitted, the model attempts to infer the type.

  • options: answer choices for multiple_choice questions.

  • description: extra context for the question.

  • resolution_criteria: how the question should resolve or be measured.

  • categories: list of topic labels.

  • news_articles: caller-supplied evidence articles. If provided, automatic evidence retrieval is skipped.

  • attach_evidence: defaults to true. When true and news_articles is empty, the API fetches current evidence from GDELT, Google News RSS, and Stooq.

  • evidence_top_k: number of evidence articles to attach, capped by the server.

  • market_platform: prediction market venue such as Polymarket, Kalshi, Manifold, or Metaculus.

  • market_url: URL for the market being analyzed.

  • market_outcome: outcome whose market price is supplied. Defaults to Yes for binary markets.

  • market_probability: current market-implied probability for market_outcome. Use 0.42 or 42; the API normalizes percentages.

  • variant: prompt variant. Defaults to variant0_neutral_baseline.

  • created_time, publish_time, resolve_time, days_open: optional forecasting metadata.

  • openrouter_api_key + openrouter_model: run the forecast on your own model instead of the server default (see "Bring your own model" below).

  • provider_base_url: optional OpenAI-compatible /chat/completions endpoint to use with your key/model instead of OpenRouter. Must be public HTTPS.
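
The percentage normalization for market_probability can be sketched as below. The exact server rule is not documented here, so this assumes that values above 1 and up to 100 are percentages and that a value of exactly 1 means 100%, not 1%; `normalize_probability` is a hypothetical name.

```python
def normalize_probability(value: float) -> float:
    """Convert a probability given as 0-1 or as a 0-100 percentage into the 0-1 range."""
    p = float(value)
    if p < 0 or p > 100:
        raise ValueError(f"probability out of range: {value!r}")
    if p > 1:
        # Assumed rule: anything above 1 is a percentage such as 42.
        p = p / 100
    return p
```

Clients that want to avoid the ambiguity around small percentages can always send the 0 to 1 form.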

Bring your own model

By default /predict runs on the server's hosted model. To use your own:

  • Via OpenRouter — pass openrouter_api_key and openrouter_model (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5). The request is proxied through OpenRouter.

  • Via any OpenAI-compatible endpoint — also pass provider_base_url (e.g. https://api.openai.com/v1/chat/completions) with the matching openrouter_model (here just the provider's model ID, e.g. gpt-4o) and your key.

For safety, provider_base_url must be public HTTPS; loopback, private, link-local, and cloud-metadata hosts are rejected. In the web app, the sidebar's "Use your own model" panel exposes the provider, endpoint, key, and model.

curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Will X happen by 2027?",
    "question_type": "binary",
    "openrouter_api_key": "YOUR_KEY",
    "openrouter_model": "gpt-4o",
    "provider_base_url": "https://api.openai.com/v1/chat/completions"
  }'
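
The safety rule for provider_base_url can be approximated with the standard library. This is a sketch and not the server's validator: it checks the scheme and literal IP addresses, blocks a short hypothetical list of hostnames, and does not resolve DNS, so a real check would also have to validate the addresses a hostname resolves to.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical block list of well-known internal hostnames.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_public_https_url(url: str) -> bool:
    """Return True when the URL uses HTTPS and does not point at an obviously internal host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    if host in BLOCKED_HOSTNAMES or host.endswith((".localhost", ".internal")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A production check would resolve the name and
        # test every returned address; that step is omitted here.
        return True
    return not (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local      # covers 169.254.169.254 metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```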

Binary request

curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
    "question_type": "binary",
    "market_platform": "Polymarket",
    "market_probability": 42
  }'

Multiple-choice request

curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Who will win the 2026 Formula 1 drivers championship?",
    "question_type": "multiple_choice",
    "options": ["Max Verstappen", "Lando Norris", "Charles Leclerc", "Lewis Hamilton", "Other"],
    "attach_evidence": false
  }'

Numeric request

curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What will US CPI inflation be in December 2026?",
    "question_type": "numeric",
    "resolution_criteria": "Use the year-over-year CPI-U inflation rate for December 2026."
  }'

Request with caller-provided evidence

curl -X POST https://foresea.ink/predict \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Will Company X report positive net income in Q4 2026?",
    "description": "Resolve using the company earnings release.",
    "resolution_criteria": "Yes if reported GAAP net income is positive.",
    "attach_evidence": false,
    "news_articles": [
      {
        "title": "Company X raises full-year guidance",
        "source": "Example Business News",
        "url": "https://example.com/company-x-guidance",
        "publish_date": "2026-05-29",
        "summary": "Company X raised revenue guidance and reported margin expansion."
      }
    ]
  }'

Python client example

import requests

payload = {
    "question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
    "question_type": "binary",
    "attach_evidence": True,
    "evidence_top_k": 3,
    "market_platform": "Polymarket",
    "market_probability": 42,
}

response = requests.post(
    "https://foresea.ink/predict",
    json=payload,
    timeout=180,
)
response.raise_for_status()
prediction = response.json()

print(prediction["predicted_answer"], prediction["confidence"])
print(prediction["model_rationale"])
if prediction.get("market_analysis"):
    print(prediction["market_analysis"]["summary"])
for source in prediction["evidence_sources"]:
    print(source["source"], source["url"])

Response fields

  • question_type: detected or requested type: binary, multiple_choice, numeric, or date.

  • predicted_answer: "Yes", "No", the top multiple-choice option, or the median numeric/date estimate.

  • confidence: model confidence as a number from 0 to 1 for binary and multiple-choice forecasts; null for numeric/date forecasts.

  • options: per-option probabilities for multiple-choice forecasts.

  • range_forecast: p10, p50, p90, and optional unit for numeric/date forecasts.

  • rationale: model-generated explanation.

  • model_rationale: alias for the model-generated explanation, intended for API clients.

  • evidence_sources: compact source list with source name, article title, URL, publication date, and relevance score.

  • evidence_articles: full evidence records attached to the prompt.

  • evidence_error: retrieval error message, or null when evidence retrieval succeeds.

  • market_analysis: optional comparison against a supplied market price: market_probability, model_probability, edge, stance, and a short summary. edge is model_probability - market_probability.

Repository Contents

  • src/analyzing_llm_rationale/: packaged inference, provider, validation, and CLI logic.

  • configs/: model and rationale-variant definitions.

  • prompts/: system prompt and the nine rationale-variant prompts.

  • scripts/: evaluation, recovery, SHAP, plotting, and utility scripts.

  • slurm/: HPC launchers for the variant/temperature sweeps.

  • results/: model outputs and run metadata.

  • analysis/: aggregate metric tables and rationale-analysis outputs.

  • paper/: paper figures, Draw.io sources, PDFs, and qualitative case studies.

  • tests/: unit tests for the package and metric parsing.

See ARTIFACT_MANIFEST.md for the submission checklist and file-level notes.

Install

python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,analysis]"

Use .[dev] for the core runner and tests only. Use .[analysis] when regenerating plots or SHAP analyses.

Quick Validation

PYTHONPATH=src python -m analyzing_llm_rationale validate-dataset
python -m unittest discover -s tests
ruff check src tests scripts/*.py

PYTHONPATH=src is useful when the repository has not been installed yet or an older user-local install shadows the working tree.

Primary Entry Point

Run the variant 3 pipeline with the packaged CLI:

analyze-llm-rationale run-batch --variant variant3_reasoning_type

For a remote OpenAI-compatible provider:

export PROVIDER_API_KEY=your_token
analyze-llm-rationale run-batch --variant variant3_reasoning_type --model llama-3.3-70b-instruct

If you do not want to install the package into the environment, invoke it directly:

PYTHONPATH=src python -m analyzing_llm_rationale run-batch --variant variant3_reasoning_type

Useful options:

  • --variant variant6_step_by_step_reasoning: choose the prompt/output contract.

  • --model qwen2.5-7b-instruct: choose a configured model definition.

  • --temperature 0.7: control generation temperature and output directory.

  • --max-records 10: process only a bounded number of records.

  • --reprocess-nulls: rerun existing rows with predicted_answer = null.

  • --drop-article-text: remove raw article text from prompts before inference.

  • --device auto: select cuda when available, otherwise cpu.

  • verify-results --variant ...: verify completeness, duplicates, malformed rows, and missing IDs.

  • validate-dataset: validate the dataset schema before a run.

Foresea Autoresearch

Foresea has a Karpathy-style autoresearch harness for prompt experiments: edit one candidate prompt, run a fixed benchmark slice, score one metric, and append an auditable experiment log. The research surface is autoresearch/candidate_prompt.txt; agent instructions live in autoresearch/program.md. The default --model gpt-oss-120b uses the SCADS-hosted OpenAI-compatible endpoint from configs/models.yaml (SCADS_AI_API_KEY or SCADS_AI_API_KEY.txt).

Run one candidate experiment:

PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
  --model gpt-oss-120b \
  --candidate-prompt-path autoresearch/candidate_prompt.txt \
  --max-records 50 \
  --metric brier_score

Compare against a baseline and promote only if the candidate improves:

PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
  --model gpt-oss-120b \
  --candidate-prompt-path autoresearch/candidate_prompt.txt \
  --baseline-results-path results/GPT-OSS-120B/temperature_00/results_variant0_neutral_baseline.json \
  --promote-to prompts/variant0_neutral_baseline.txt \
  --max-records 50 \
  --metric brier_score \
  --min-delta 0.001

Each run writes analysis/autoresearch/runs/<run_id>/score.json and appends a machine-readable row to analysis/autoresearch/experiments.jsonl.
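
For readers unfamiliar with the metric, the sketch below shows the standard Brier score for binary forecasts and one plausible reading of the promotion rule. It assumes lower Brier is better and that --min-delta is the minimum improvement required; the function names are hypothetical and the harness's own scoring code may differ.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


def should_promote(candidate: float, baseline: float, min_delta: float = 0.001) -> bool:
    """Promote only when the candidate beats the baseline by at least min_delta."""
    # Lower Brier is better, so improvement is baseline minus candidate.
    return (baseline - candidate) >= min_delta
```

A forecast of 0.5 on every question scores 0.25, which is a useful reference point when reading score.json.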

Reproducing Core Outputs

Validate an existing result file:

PYTHONPATH=src python -m analyzing_llm_rationale verify-results \
  --model qwen2.5-7b-instruct \
  --variant variant3_reasoning_type \
  --temperature 0.0 \
  --temperature-tag temperature_00

Regenerate aggregate metrics from results/:

python scripts/evaluate_metrics.py

Run the DuckDB SQL analytics suite over the real Metaculus-style dataset and saved model outputs:

python scripts/sql_analytics.py \
  --db analysis/forecasting_analytics.duckdb \
  --ingest --replace \
  --output-dir analysis/sql_analytics

This writes a markdown report plus one CSV per query for 10 medium-level SQL problems: model accuracy, best variants, calibration bins, Brier score, consensus/disagreement cases, prompt lift over baseline, temperature sensitivity, overconfident errors, and category difficulty.

Run the LangChain-powered news retrieval wrapper:

PYTHONPATH=src analyze-llm-rationale fetch-and-rank \
  --question "Will X happen by date Y?" \
  --source gdelt \
  --source google-news \
  --source stooq \
  --top-k 5

The news pipeline uses LangChain for a query-planning step, article summarization, and embedding-based relevance ranking before inference. Evidence sources are configurable with --source for the CLI and --evidence-source when serving the API.

Run or schedule the Prefect DAG for RSS/news fetch, inference, and DuckDB logging:

# One question
python flows/forecasting_flow.py --question-id 124 --top-k 5

# Small batch from the dataset
python flows/forecasting_flow.py --limit 3 --top-k 5

# Daily scheduled deployment at 06:00 UTC
prefect server start
python flows/forecasting_flow.py --deploy --limit 3 --cron "0 6 * * *"

Regenerate paper figures after metrics are present:

python scripts/plot_model_variant_metric_heatmap.py
python scripts/plot_variant_delta_from_v0.py
python scripts/plot_temperature_frontier.py
python scripts/plot_frs_ablation_slopegraph.py
python scripts/plot_uncertainty_language_calibration_disconnect.py
python scripts/plot_shap_importance_attribute_gaps.py

Scripts

Common runner and verification commands:

  • python scripts/run_variant.py --variant variant5_key_conditions

  • python scripts/run_variant.py --variant variant3_reasoning_type --temperature 0.7 --temperature-tag temperature_07

  • python scripts/run_variant.py --variant variant4_credibility --model llama-3.3-70b-instruct

  • python scripts/verify_results.py --variant variant3_reasoning_type

  • python download_qwen_model.py

  • python test_local_inference.py

Repo layout:

  • scripts/: modular runner entrypoint

  • slurm/: batch launchers

Auditability:

  • Each run writes run_metadata_<variant>.json next to the results file.

  • Metadata includes provider, model key, resolved model identifier, temperature, output fields, and prompt SHA-256 hashes.

  • Existing malformed results JSON now fails fast instead of being silently ignored.
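
A prompt hash of the kind recorded in the run metadata can be computed with the standard library. This sketch assumes the hash is taken over the UTF-8 bytes of the prompt text; whether the runner hashes the raw file bytes or a normalized string is not stated, and `prompt_sha256` is a hypothetical name.

```python
import hashlib


def prompt_sha256(prompt_text: str) -> str:
    """Return the hex SHA-256 digest of a prompt's UTF-8 encoding."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
```

Comparing this digest with the value in run_metadata_<variant>.json is a quick way to confirm that a results file was produced with the prompt currently checked out.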

Quality checks

python -m unittest discover -s tests
ruff check src tests scripts/*.py

Data, Models, and Secrets

The included dataset is forecasting_qa_news_metaculus_2025-02-01_to_today.metaculus_frs_format.json. Model access is configured in configs/models.yaml. Open-weight Qwen models run locally through Hugging Face; hosted models use OpenAI-compatible endpoints and require API keys through environment variables or local key files.

Never commit key files or tokens. Large local caches (.cache/, envs/, .venv/) are intentionally ignored and excluded from source archives.

Citation

If this repository supports a publication, cite the artifact with the metadata in CITATION.cff and cite the upstream datasets/models according to their licenses.
