NLWeb: Microsoft's Protocol for AI-Powered Website Search
Written by Frank Fiegel.
- What Problem Does NLWeb Solve?
- How NLWeb Works
- Quick Start
- REST API
- Integrating with MCP
- Implementation Reality
- Should You Care?
- Bottom Line
Microsoft recently open-sourced NLWeb, a protocol for adding conversational interfaces to websites.1 It leverages Schema.org structured data that many sites already have and includes built-in support for MCP (Model Context Protocol), enabling both human conversations and agent-to-agent communication.
The key idea: NLWeb creates a standard protocol that turns any website into a conversational interface that both humans and AI agents can query naturally.
Want to try NLWeb?
As part of writing this post, I've implemented NLWeb search for MCP servers. Refer to the "Glama NLWeb Server" section below for usage examples.
What Problem Does NLWeb Solve?
Currently, websites have structured data (Schema.org) but no standard way for AI agents or conversational interfaces to access it. Every implementation is bespoke. Traditional search interfaces struggle with context-aware, multi-turn queries.
NLWeb creates a standard protocol for conversational access to web content. What RSS did for syndication, NLWeb aims to do for AI interactions: one implementation serves both human chat interfaces and programmatic agent access.
The key insight: Instead of building custom NLP for every site, NLWeb leverages LLMs' existing understanding of Schema.org to create instant conversational interfaces.
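For instance, a recipe page might already carry markup like this (a generic Schema.org JSON-LD example, not anything NLWeb-specific):

```json
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Roasted Vegetable Tart",
  "totalTime": "PT45M",
  "suitableForDiet": "https://schema.org/VegetarianDiet",
  "recipeIngredient": ["puff pastry", "zucchini", "goat cheese"]
}
```

An LLM already knows what `Recipe`, `totalTime`, and `suitableForDiet` mean, so no per-site NLP work is needed.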
The real power comes from multi-turn conversations that preserve context:
- "Find recipes for dinner parties"
- "Only vegetarian options"
- "That can be prepared in under an hour"
Each query builds on the previous context - something traditional search interfaces struggle with.
How NLWeb Works
Two-Component System
- Protocol Layer: REST API (`/ask` endpoint) and MCP server (`/mcp` endpoint) that accept natural language queries and return Schema.org JSON responses
- Implementation Layer: Reference implementation that orchestrates multiple LLM calls for query processing
Query Processing Pipeline
A single query flows through a multi-stage pipeline and may trigger 50+ targeted LLM calls for:
- Query decontextualization based on conversation history
- Relevancy scoring against site content
- Result ranking with custom prompts per content type
- Optional post-processing (summarization/generation)
The "fast track" optimization launches a parallel path to retrieval (step 3) while pre-processing occurs, but results are blocked until relevancy checks complete2.
Video Explanation
After I wrote this article, I was sent a video that includes a great introduction to NLWeb by R.V. Guha (creator of Schema.org, RSS, and RDF).
Why 50+ LLM Calls?
Instead of using one large prompt to handle everything, NLWeb breaks each query into dozens of small, specific questions:
- "Is this query about recipes?"
- "Does it reference something mentioned earlier?"
- "Is the user asking to remember dietary preferences?"
- "How relevant is this specific result?"
This approach has two major benefits:
- No hallucination - Results only come from your actual database
- Better accuracy - Each LLM call has one clear job it can do well
Think of it like having a team of specialists instead of one generalist.
Even if you don't use NLWeb, this pattern—using many focused LLM calls instead of one complex prompt—is worth borrowing.
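A minimal sketch of what that pattern looks like (the `llm` helper is hypothetical, standing in for any completion API):

```typescript
// Stub: swap in a real completion API (OpenAI, Anthropic, etc.).
const llm = async (prompt: string): Promise<string> => "50";

// One small, focused prompt per result, fanned out in parallel,
// instead of a single sprawling prompt that does everything.
async function rankResults(query: string, results: string[]) {
  const scores = await Promise.all(
    results.map((result) =>
      llm(
        `On a scale of 0-100, how relevant is this item to the query?\n` +
          `Query: ${query}\nItem: ${result}\nAnswer with a single number.`,
      ),
    ),
  );
  return results
    .map((result, i) => ({ result, score: Number(scores[i]) }))
    .sort((a, b) => b.score - a.score);
}
```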
Quick Start
The best way to wrap your head around NLWeb is to try it out.
Microsoft provides a quick start guide for setting up an example NLWeb server with the Behind The Tech RSS feed.
Go to localhost:8000 and you should have a working NLWeb server.
I have also noticed that the repository contains a CLI to simplify configuration, testing, and execution of the application. However, I struggled to get it working.
Once you have the server running, you can ask it questions like this (an illustrative query; the quick-start data comes from the Behind The Tech RSS feed):
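```bash
# Illustrative request; any natural language query works
curl "http://localhost:8000/ask?query=podcast+episodes+about+AI"
```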
which will return a JSON response shaped roughly like this (a sketch; exact field names may differ between versions):
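```json
{
  "query_id": "...",
  "results": [
    {
      "url": "https://...",
      "name": "...",
      "score": 82,
      "description": "...",
      "schema_object": { "@type": "PodcastEpisode" }
    }
  ]
}
```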
Glama NLWeb Server
As part of writing this post, I've built a simple NLWeb server using Node.js. You can use it to query our MCP server directory (the endpoint path and queries in the examples below are illustrative):
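```bash
# Illustrative request; see my note below about where the /ask endpoint may be hosted
curl "https://glama.ai/ask?query=MCP+servers+for+web+scraping"
```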
As far as I can tell, this is the first ever public NLWeb endpoint!
Due to the volume of LLM calls, it takes a few seconds to respond.
or, if you want to continue the conversation:
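```bash
# "prev" passes earlier queries so the follow-up can be decontextualized
curl "https://glama.ai/ask?query=which+of+these+support+SSE&prev=MCP+servers+for+web+scraping"
```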
or, if you want to summarize the results:
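```bash
# mode=summarize returns an overview of the matched results
curl "https://glama.ai/ask?query=MCP+servers+for+web+scraping&mode=summarize"
```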
Useful when you want an overview rather than just a list of results.
or, if you want to generate a response:
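```bash
# mode=generate answers the question from the retrieved results (RAG-style)
curl "https://glama.ai/ask?query=what+can+I+use+to+scrape+websites&mode=generate"
```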
This mode attempts to answer the question using the retrieved results (like traditional RAG).
Things that made it easy to implement:
- We have existing embeddings for every MCP server and a vector store
- We already have a way to make LLM calls
A few questions came to mind as I was implementing this:
- It seems that NLWeb doesn't dictate where the `/ask` endpoint needs to be hosted—does it have to be `https://glama.ai/ask` or can it be `https://glama.ai/nlweb/ask`?
- It wasn't super clear to me which Schema.org data is best suited to describe MCP servers.
Not surprisingly, the slowest part of the pipeline is the LLM calls.
REST API
Currently, NLWeb supports two APIs, at the `/ask` and `/mcp` endpoints. The arguments are the same for both, as is most of the functionality. The `/mcp` endpoint returns answers in a format that MCP clients can use, and it also supports the core MCP methods (`list_tools`, `list_prompts`, `call_tool`, and `get_prompt`).
The `/ask` endpoint supports the following parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| `query` | string | Natural language question |
| `site` | string | Scope to specific data subset |
| `prev` | string | Comma-separated previous queries |
| `decontextualized_query` | string | Skip decontextualization if provided |
| `streaming` | bool | Enable SSE streaming |
| `query_id` | string | Track conversation |
| `mode` | string | `list`, `summarize`, or `generate` |
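Parameters compose; for example, a scoped streaming request might look like this (all values illustrative):

```bash
# -N disables curl's buffering so SSE events print as they arrive
curl -N "http://localhost:8000/ask?query=episodes+about+AI&site=behind-the-tech&streaming=true&query_id=demo-123"
```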
Integrating with MCP
Since NLWeb includes an MCP server by default, you can configure Claude for Desktop to talk to NLWeb.
If you already have the NLWeb server running, this should be as simple as adding an entry like the following to your `~/Library/Application Support/Claude/claude_desktop_config.json` configuration (the command and path below are placeholders; point them at the MCP entry point of your NLWeb checkout):
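```json
{
  "mcpServers": {
    "nlweb": {
      "command": "python",
      "args": ["/path/to/NLWeb/code/mcp_server.py"]
    }
  }
}
```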
Implementation Reality
The documentation suggests you can get a basic prototype running quickly if you have existing Schema.org markup or RSS feeds.
What's actually straightforward:
- Loading RSS feeds or Schema.org data
- Basic search functionality with provided prompts
- Local development with Qdrant
What requires more effort:
- Production deployment at scale
- Optimizing 50+ LLM calls per query (mentioned in docs)
- Custom prompt engineering for your domain
- Maintaining data freshness between vector store and live data
I already had a lot of these components in place, so I was able to get a basic prototype running in an hour. However, to make this production-ready, I'd need to spend a lot more time thinking about the cost of the LLM calls.
Should You Care?
Yes if:
- You have structured data (Schema.org, RSS) already
- You want to enable conversational search beyond keywords
- You need programmatic AI agent access via MCP
- You can experiment with early-stage tech
No if:
- You need battle-tested production code
- You can't handle significant LLM API costs
- Your content isn't well-structured
- You expect plug-and-play simplicity
Bottom Line
NLWeb is more interesting as a strategic direction than as current technology. NLWeb was conceived and developed by R.V. Guha (creator of Schema.org, RSS, and RDF), now a CVP and Technical Fellow at Microsoft3. That's serious pedigree.
The O'Reilly prototype proves it's viable for content-heavy sites. The quick start shows it's approachable for developers. But "prototype in days" doesn't mean "production in weeks."
Think of it as an investment in making your content natively conversational. The technical foundation is solid—REST API, standard formats, proven vector stores. The vision is compelling. The code needs work.
Want to experiment? Clone the repo and try the quick start above.
Footnotes
Written by Frank Fiegel (@punkpeye)