MD Webcrawl MCP

by jmh108

MD MCP Webcrawler Project

A Python-based MCP (https://modelcontextprotocol.io/introduction) web crawler for extracting and saving website content.

Features

  • Extract website content and save it as markdown files
  • Map website structure and links
  • Batch-process multiple URLs
  • Configurable output directory
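
To make the "map website structure and links" feature concrete, here is a minimal sketch that uses only the Python standard library. It is not the project's implementation: `LinkCollector` and `map_links` are hypothetical names, and splitting links into internal and external groups is an assumption about what a link map might contain.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute link targets from anchor tags (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def map_links(base_url: str, html: str) -> Dict[str, List[str]]:
    """Split a page's links into same-site and external groups, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    site = urlparse(base_url).netloc
    result: Dict[str, List[str]] = {"internal": [], "external": []}
    for link in collector.links:
        key = "internal" if urlparse(link).netloc == site else "external"
        if link not in result[key]:
            result[key].append(link)
    return result


# Static HTML stands in for a fetched page so the sketch runs offline.
page = '<a href="/docs">Docs</a> <a href="https://other.example/x">X</a>'
link_map = map_links("https://example.com/", page)
```

A real crawler would fetch each internal link in turn and repeat the mapping, which is where the batch-processing feature comes in.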

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/webcrawler.git
cd webcrawler
  2. Install the dependencies:
pip install -r requirements.txt
  3. Optional: configure environment variables:
export OUTPUT_PATH=./output # Set your preferred output directory

Output

Crawled content is saved in markdown format in the specified output directory.
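
As an illustration of how a page could end up as a markdown file in the output directory, the sketch below derives a filename from the URL and writes the text there. The naming scheme and the helpers `filename_for` and `save_markdown` are assumptions made for this example; the README does not say how the server names its files.

```python
import os
import re
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Derive a filesystem-safe markdown filename from a URL (assumed scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-")
    return (slug or "index") + ".md"


def save_markdown(url: str, markdown: str, output_dir: str) -> Path:
    """Write markdown text under output_dir, creating the directory if needed."""
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename_for(url)
    target.write_text(markdown, encoding="utf-8")
    return target


# OUTPUT_PATH is the variable named in the README; the temporary-directory
# fallback exists only so this demo runs without any configuration.
output_dir = os.environ.get("OUTPUT_PATH", tempfile.mkdtemp())
saved = save_markdown("https://example.com/docs/intro", "# Intro\n", output_dir)
```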

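Configuration

The server can be configured through environment variables:

  • OUTPUT_PATH: default output directory for saved files
  • MAX_CONCURRENT_REQUESTS: maximum number of parallel requests (default: 5)
  • REQUEST_TIMEOUT: request timeout in seconds (default: 30)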
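
The sketch below shows one way settings like these could be read and applied, assuming an asyncio-based crawler. Only the variable names and defaults come from the list above; `fetch_all` and `fake_fetch` are hypothetical, and recording a timed-out page as an empty string is an arbitrary choice for the demo.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, List

# Defaults mirror the values listed above.
MAX_CONCURRENT_REQUESTS = int(os.environ.get("MAX_CONCURRENT_REQUESTS", "5"))
REQUEST_TIMEOUT = float(os.environ.get("REQUEST_TIMEOUT", "30"))


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT_REQUESTS,
    timeout: float = REQUEST_TIMEOUT,
) -> Dict[str, str]:
    """Fetch every URL with bounded parallelism and a per-request timeout."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: Dict[str, str] = {}

    async def worker(url: str) -> None:
        async with semaphore:  # at most max_concurrent fetches run at once
            try:
                results[url] = await asyncio.wait_for(fetch(url), timeout)
            except asyncio.TimeoutError:
                results[url] = ""  # record a timed-out page as empty

    await asyncio.gather(*(worker(u) for u in urls))
    return results


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical)."""
    await asyncio.sleep(0.01)
    return "content of " + url


pages = asyncio.run(
    fetch_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
)
```

The semaphore caps how many requests are in flight, while `asyncio.wait_for` enforces the timeout on each request separately, so one slow page does not stall the rest of the batch.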

Claude Setup

Install with FastMCP:

fastmcp install server.py

Or, for a user-defined setup, run it directly with fastmcp:

"Crawl Server": { "command": "fastmcp", "args": [ "run", "/Users/mm22/Dev_Projekte/servers-main/src/Webcrawler/server.py" ], "env": { "OUTPUT_PATH": "/Users/user/Webcrawl" } }

Development

Live development

fastmcp dev server.py --with-editable .

Debugging

The MCP Inspector (https://modelcontextprotocol.io/docs/tools/inspector) is helpful for debugging.

Examples

Example 1: Extract and save content

mcp call extract_content --url "https://example.com" --output_path "example.md"

Example 2: Create a content index

mcp call scan_linked_content --url "https://example.com" | \
  mcp call create_index --content_map - --output_path "index.md"
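
To illustrate what the indexing step in this pipeline might produce, here is a minimal sketch that renders a markdown index from a content map. The shape of the map (URL to page title) and the `build_index` function are assumptions; the README does not document the actual format exchanged between scan_linked_content and create_index.

```python
from typing import Dict


def build_index(content_map: Dict[str, str]) -> str:
    """Render a markdown index from a {url: title} mapping (assumed shape)."""
    lines = ["# Index", ""]
    for url, title in sorted(content_map.items()):
        # Fall back to the URL itself when a page has no title.
        lines.append("- [{}]({})".format(title or url, url))
    return "\n".join(lines) + "\n"


index_md = build_index({
    "https://example.com/docs": "Docs",
    "https://example.com/about": "",
})
```

Sorting by URL keeps the index stable between runs, which makes diffs of the generated file easier to read.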

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a pull request

License

Distributed under the MIT License. See LICENSE for more information.

Requirements

  • Python 3.7+
  • FastMCP (uv pip install fastmcp)
  • The dependencies listed in requirements.txt