Folio

A privacy-first, zero-AI-cost "chat with your documents" assistant. One engine, two front doors: a fully offline command-line app, and an OAuth-secured web app.

Point Folio at a folder and it can read, search, summarize, answer questions about, and reformat the files inside it — and nothing else. Your documents never leave your machine, and the generation runs on your own model, so there is no per-token AI bill.

Folio is built on the Model Context Protocol (MCP) and is designed to exercise its most advanced features in a way that is essential to the use case, not bolted on.

The Aurora web UI answering with live search progress. Demo recorded on a cloud provider for speed — Folio runs identically on local Ollama (just slower).


Why the design is meaningful

| Feature | Why it matters here |
| --- | --- |
| Roots | The assistant can only touch the folders you explicitly grant, enforced on every file operation. The rest of your disk is unreachable. This is privacy by construction. |
| Sampling | The generation is done by the host's own model (local Ollama by default), not the server. A hosted Folio therefore never racks up AI bills, and each person's documents are processed by their own model. |
| Dual transport | The same server runs locally over stdio (the CLI) or remotely over Streamable HTTP (the web app). |
| OAuth 2.1 | The web app identifies users with GitHub sign-in; the server validates every request's bearer token before running a tool. |
| Logging & progress | Long jobs ("search the whole folder") stream live status, so you can see real work happening instead of a frozen spinner. |
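To make the "Logging & progress" idea concrete, here is a minimal sketch of a folder search that reports status after every file. It is not Folio's actual tool: the names `search_files` and `report` are hypothetical, and in a real MCP server the callback would presumably forward to the SDK's progress notification rather than a plain function.

```python
from pathlib import Path
from typing import Callable, Iterable


def search_files(
    files: Iterable[Path],
    needle: str,
    report: Callable[[int, int, str], None],
) -> list[Path]:
    """Return the files containing `needle`, reporting progress per file.

    `report(done, total, message)` is a hypothetical callback, used here
    only to illustrate streaming status during a long job.
    """
    files = list(files)
    hits: list[Path] = []
    for done, path in enumerate(files, start=1):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            text = ""  # unreadable files are skipped but still counted
        if needle.lower() in text.lower():
            hits.append(path)
        report(done, len(files), f"scanned {path.name}")
    return hits
```

A caller can pass `print` wrapped in a lambda to watch the counter advance, which is the behaviour the table describes as "real work happening instead of a frozen spinner".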


Features

  • 🔒 Granted-folder access only — a non-negotiable path guard on every tool.

  • 🔎 Search across files with live progress.

  • 📝 Summarize a file and answer questions grounded in your documents.

  • ✏️ Reformat / edit files in place.

  • 💻 Offline CLI (stdio) — private, $0, works with no internet.

  • 🌐 Web app (FastAPI + GitHub login) — a designed browser UI: upload or pick documents, ask questions, and watch live search progress stream in over SSE.

  • 🔁 Provider-agnostic — local Ollama by default, with Cerebras and OpenRouter as drop-in cloud fallbacks, plus Anthropic and OpenAI as optional bring-your-own-key upgrades in the CLI (switch via two lines in .env).


Architecture

                       ┌───────────────────────────────────────┐
  CLI (local):         │            mcp_server.py              │
   main.py ──stdio────▶│   ONE FastMCP engine:                 │
                       │     • tools   (list/read/search/...)  │
  Web (remote):        │     • resources (roots, files)        │
   Browser ⇄ FastAPI ──┤     • prompts (/summarize, /format)   │
        host           │   roots guard · sampling · logging ·  │
   ──HTTP + OAuth─────▶│   progress · OAuth (HTTP mode)        │
                       └───────────────────────────────────────┘

Both hosts speak to the same server; only the transport differs. In each case the host (the CLI or the FastAPI app) is the MCP client: it runs the agent loop, holds the model keys, and answers the server's sampling / roots / logging / progress callbacks.


Requirements

  • First, build and run the previous repository (repo-here) to set up the local Ollama + litellm tool-calling environment, then pull the default model:

    ollama pull qwen2.5:7b

    (No API key needed for the local path. Cerebras / OpenRouter are optional cloud fallbacks.)

  • Python 3.10+ (developed on 3.13)

  • uv for dependency + run management

Install

uv sync

Configure

Copy the template and fill in values locally (the real .env is git-ignored):

cp .env.example .env

The default configuration uses local Ollama and needs no keys:

LLM_PROVIDER=ollama
LLM_MODEL=ollama_chat/qwen2.5:7b

Switch provider by changing LLM_PROVIDER + LLM_MODEL together (see .env.example for the Cerebras / OpenRouter forms).
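Because the two variables must change together, a startup check can catch a mismatched pair early. The sketch below is an illustration, not Folio's code: `load_llm_config` and `MODEL_PREFIXES` are hypothetical names, and only the `ollama_chat/` and `anthropic/` prefixes appear in this README; the other prefixes are assumptions about how the model strings are written.

```python
import os

# Hypothetical provider -> model-prefix table. Only the ollama and anthropic
# prefixes are taken from this README; the rest are assumed for illustration.
MODEL_PREFIXES = {
    "ollama": "ollama_chat/",
    "cerebras": "cerebras/",
    "openrouter": "openrouter/",
    "anthropic": "anthropic/",
    "openai": "openai/",
}


def load_llm_config(env=None):
    """Read LLM_PROVIDER / LLM_MODEL and verify that they agree."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama")
    model = env.get("LLM_MODEL", "ollama_chat/qwen2.5:7b")
    prefix = MODEL_PREFIXES.get(provider)
    if prefix is None:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    if not model.startswith(prefix):
        raise ValueError(
            f"LLM_MODEL {model!r} does not match provider {provider!r} "
            f"(expected prefix {prefix!r})"
        )
    return provider, model
```

With an empty environment this returns the documented default pair; changing only one of the two variables raises a `ValueError` instead of failing later inside a model call.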


Run the CLI

Grant one or more folders and start chatting:

uv run main.py path/to/your/folder

With no folder it defaults to the bundled sample-docs/. At the > prompt you can:

  • ask questions in plain English (e.g. "how long are backups retained?"),

  • mention a file with @, e.g. @policies/data-retention.md what does this say?,

  • run a command, e.g. /summarize README.md.

Exit with Ctrl+C.
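As a rough picture of how an @mention could be pulled out of a prompt line, consider the sketch below. The README does not say how Folio parses mentions, so the regular expression and the `extract_mentions` name are assumptions made for illustration.

```python
import re

# An "@" at the start of the line or after whitespace, followed by a run of
# non-space characters. The lookbehind keeps e-mail addresses from matching.
MENTION = re.compile(r"(?<!\S)@(\S+)")


def extract_mentions(prompt: str) -> list[str]:
    """Return the paths mentioned with @, minus trailing punctuation."""
    return [m.rstrip(".,;:!?") for m in MENTION.findall(prompt)]
```

For the example above, `extract_mentions("@policies/data-retention.md what does this say?")` yields `["policies/data-retention.md"]`. Any path extracted this way would still need to pass the granted-folder check before it is read.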

Run the web app

The OAuth-secured FastAPI web app runs with:

uv run uvicorn web.app:app --port 8000

Then open http://localhost:8000 and sign in with GitHub. From there you can load the bundled sample documents or upload your own, click a file to ground a question, and watch live search progress as Folio answers. It runs the same MCP engine as the CLI, just over HTTP.
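The live progress the browser sees arrives as Server-Sent Events. The sketch below shows only the wire framing of a single event; the event name `progress` and the payload fields are hypothetical, and in the real app the framing is presumably handled by the SSE library rather than written by hand.

```python
import json


def sse_event(event: str, data: dict) -> str:
    """Frame one Server-Sent Event: an event name, a data line, a blank line."""
    # Compact JSON never contains a raw newline, so it fits on one data line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"
```

A browser-side `EventSource` listener for the `progress` event would receive the JSON string as the event's data and could update a progress bar from it.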


Screenshots

Onboarding: sign in, then load the bundled sample set or upload your own.

Documents sidebar: click a document to drop its exact @mention into the question.

Web UI: live search log and progress stream while Folio works.

CLI: the fully offline CLI host (stdio), grounded in your docs.


Benchmarks & tradeoffs

Folio is provider-agnostic, so which model you point it at is a real tradeoff. These numbers come from running the actual agent loop over a small e-commerce document set (5 grounded questions with known answers), paced to respect free-tier rate limits — see benchmarks/ for the reproducible harness and full results.

| Model | Correct | Median latency/call | Notes |
| --- | --- | --- | --- |
| Cerebras gpt-oss-120b | 5/5 | ~0.5s | fast + accurate |
| Cerebras zai-glm-4.7 | 5/5 | ~0.7s | fast + accurate |
| OpenRouter gpt-oss-120b:free | 3/5* | ~3.1s | *2 misses were free-tier rate-limit 429s, not wrong answers |
| Ollama qwen2.5:7b (local) | 3/5 | ~9.5s | private + $0, but ~15–20× slower and less consistent |

Three takeaways:

  • Speed — Cerebras answers ~15–20× faster per call than the local 7B (~0.5s vs ~9.5s).

  • Accuracy — the bigger cloud models are consistently correct; the small local 7B is inconsistent (it confabulated a non-existent file path and sometimes answered "no information").

  • Free-tier reality — free cloud tiers rate-limit/throttle under load (OpenRouter's free llama-3.3-70b was entirely unusable in a burst). For real throughput, bring your own key.

The honest tradeoff triangle: privacy (local Ollama) ↔ speed + quality (Cerebras) ↔ cost (free, but throttled). Reproduce with uv run python benchmarks/benchmark.py.
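The speed comparison can be rechecked from the median latencies reported in the table. This is plain arithmetic on the published numbers, not part of the benchmark harness; the function name `speedup_vs_local` is made up for the example.

```python
# Median latency per call in seconds, copied from the table above.
MEDIAN_LATENCY_S = {
    "Cerebras gpt-oss-120b": 0.5,
    "Cerebras zai-glm-4.7": 0.7,
    "OpenRouter gpt-oss-120b:free": 3.1,
    "Ollama qwen2.5:7b (local)": 9.5,
}


def speedup_vs_local(model: str) -> float:
    """How many times faster `model` is than the local 7B, by median latency."""
    local = MEDIAN_LATENCY_S["Ollama qwen2.5:7b (local)"]
    return local / MEDIAN_LATENCY_S[model]
```

The two Cerebras models come out at about 19× and 13.6× respectively, which is consistent with the rounded "~15–20×" figure; the free OpenRouter model is roughly 3× faster than local.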

Which provider should I use?

  • Privacy / offline / $0: ollama (local; slower and less consistent, but nothing leaves your machine).

  • Fast + accurate, free: cerebras (near-instant; free tier throttles under heavy use).

  • Maximum quality (paid, CLI only): anthropic or openai with your own key (e.g. LLM_MODEL=anthropic/claude-opus-4-8). The web app never accepts keys; this is a CLI upgrade.


Limitations (honest)

  • Local 7B is slow and inconsistent. qwen2.5:7b is private and free, but it takes seconds to tens of seconds per answer and occasionally misuses tools (confabulates a path, or gives up). For reliable, fast answers, use a cloud provider.

  • Free cloud tiers throttle. Cerebras and OpenRouter free tiers rate-limit under sustained/burst use; the benchmark above was captured with fresh quota — re-running on an exhausted free tier shows worse numbers (a quota artifact, not the models). Bring your own key for real throughput.

  • The web app is a shared/hosted convenience, not the fully-private path. Uploaded documents go to the server (isolated per user, deleted on logout + a TTL sweep). For fully offline / private use, run the CLI with local Ollama.

  • Text documents only. Folio reads text files (Markdown, .txt, .csv, code, …) — no images/audio/video.

  • Anthropic / OpenAI need paid API credits. They are optional CLI upgrades, not required.


Tech stack

  • MCP Python SDK (FastMCP) — the server engine, the client session, both transports, and the OAuth modules.

  • litellm — one OpenAI-shaped API over Ollama, Cerebras, OpenRouter, Anthropic, and OpenAI (routes by the model-string prefix).

  • FastAPI + uvicorn — the async web host; its native async + SSE match the MCP SDK and the live-progress requirement.

  • sse-starlette — streams live log/progress events to the browser over Server-Sent Events.

  • itsdangerous — the web app remembers your GitHub sign-in in a small signed-cookie session; itsdangerous cryptographically signs that cookie so it can't be tampered with (a tamper-evident seal). It's what makes "stay logged in" trustworthy.

  • prompt-toolkit — the interactive CLI prompt, autocompletion, and history.


Security notes

  • The roots guard (is_path_allowed) is enforced in every file tool — the SDK provides the roots mechanism, but Folio enforces the policy.

  • OAuth applies to the HTTP transport only; the local stdio CLI needs none (you launched the process yourself).

  • Secrets live only in the git-ignored .env. .env.example ships blank placeholders.
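A roots guard of the kind described in the first bullet might look like the sketch below. Only the name `is_path_allowed` comes from this README; the signature and body are assumptions, written to show why resolving the path before comparing matters.

```python
from pathlib import Path


def is_path_allowed(candidate, roots) -> bool:
    """True only if `candidate` resolves to a location inside a granted root.

    Hypothetical signature. `resolve()` collapses ".." segments and follows
    symlinks first, so "granted/../secret.txt" cannot slip past the check.
    """
    target = Path(candidate).resolve()
    return any(target.is_relative_to(Path(root).resolve()) for root in roots)
```

Calling a check like this at the top of every file tool, before any read or write, is what turns the roots mechanism from a hint into an enforced policy. An empty list of roots denies everything.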

Project status

Complete: the MCP engine (roots, sampling, logging/progress, dual transport, OAuth), the offline CLI, and the OAuth-secured FastAPI web app. A hosted public deployment is intentionally not provided: a shared demo on free model tiers would burn the operator's quota, and the web app deliberately never accepts a visitor's API key. Run it locally instead; it works fully on your own machine with the steps above.

License & credits

MIT. Built by extending ollama-mcp-chat-cli, an earlier MCP chat-CLI project.
