
Data Extractor is a commercial-grade MCP server built on FastMCP. It reads and extracts content, both text and images, from web pages and PDFs and converts it to local Markdown. It is built for long-term deployment in enterprise environments.

🛠️ MCP Server Core Tools (14)

Web Page

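| Tool | Description | Main parameters |
| --- | --- | --- |
| scrape_webpage | Single-page scraping | url, method (auto-selected), extract_config (selector configuration), wait_for_element (CSS selector) |
| scrape_multiple_webpages | Batch page scraping | urls (list), method (applied to all URLs), extract_config (global configuration) |
| scrape_with_stealth | Anti-detection scraping | url, method (selenium/playwright), scroll_page (scroll to load content), wait_for_element |
| fill_and_submit_form | Form automation | url, form_data (selector: value), submit (whether to submit), submit_button_selector |
| extract_links | Specialized link extraction | url, filter_domains (domain filter), exclude_domains (domains to exclude), internal_only (internal links only) |
| extract_structured_data | Structured data extraction | url, data_type (all/contact/social/content/products/addresses) |
| get_page_info | Page information retrieval | url (target URL); returns the title, status code, and metadata |
| check_robots_txt | Crawler rules check | url (domain URL); checks the robots.txt rules |
| convert_webpage_to_markdown | Page-to-Markdown conversion | url, method, extract_main_content (extract main content), embed_images (embed images), formatting_options |
| batch_convert_webpages_to_markdown | Batch Markdown conversion | urls (list), method, extract_main_content, embed_images, embed_options |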
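To make the extract_links parameters more concrete, here is a minimal sketch of how domain filtering along these lines could work. It is not the server's implementation: the function name `filter_links` and the exact matching rule (exact hostname comparison) are assumptions made for illustration; only the parameter names filter_domains, exclude_domains, and internal_only come from the table above.

```python
from urllib.parse import urljoin, urlparse


def filter_links(base_url, hrefs, filter_domains=None,
                 exclude_domains=None, internal_only=False):
    """Hypothetical helper: resolve hrefs against base_url and apply domain filters."""
    base_host = urlparse(base_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        host = parsed.hostname
        if internal_only and host != base_host:
            continue  # keep only links on the same host as the page
        if filter_domains and host not in filter_domains:
            continue  # allow-list: host must be explicitly included
        if exclude_domains and host in exclude_domains:
            continue  # deny-list: host must not be excluded
        kept.append(absolute)
    return kept


internal = filter_links(
    "https://example.com/docs/",
    ["/about", "https://other.org/x", "mailto:a@b.c", "page.html"],
    internal_only=True,
)
print(internal)
```

A real implementation would probably also need to decide how to treat subdomains (for example, whether `www.example.com` counts as internal to `example.com`); this sketch compares hostnames exactly.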

PDF Document

| Tool | Description | Main parameters |
| --- | --- | --- |
| convert_pdf_to_markdown | PDF-to-Markdown conversion | pdf_source (URL/path), method (auto/pymupdf/pypdf), page_range, output_format |
| batch_convert_pdfs_to_markdown | Batch PDF conversion | pdf_sources (list), method, page_range, output_format, include_metadata |

Service Management

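| Tool | Description | Main parameters |
| --- | --- | --- |
| get_server_metrics | Performance metrics monitoring | No parameters; returns request statistics, performance metrics, and cache status |
| clear_cache | Cache management | No parameters; clears all cached data |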


🤝 Contribution

Issues and pull requests to improve this project are welcome.

📄 License

MIT License - see the LICENSE file for details.


Note: Please use this tool responsibly. Comply with each website's terms of use and robots.txt rules, and respect its intellectual property.
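As an illustration of what honoring robots.txt rules involves, the following sketch checks a URL against a robots.txt body using Python's standard `urllib.robotparser`. It is not the code behind check_robots_txt; the helper name `is_allowed`, the sample rules, and the user agent string are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Hypothetical helper: return True if user_agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/public"))
print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/private/data"))
```

To check a live site, the same parser can load the file itself with `set_url(...)` followed by `read()` instead of `parse(...)`.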


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp'
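The same request can be sketched in Python with only the standard library. The URL pattern is taken from the curl command above; the helper names `build_server_url` and `fetch_server_info`, the `Accept` header, and the assumption that the endpoint returns JSON are illustrative and not confirmed by this document.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_url(owner, repo):
    """Hypothetical helper: build the directory URL for one server entry."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_server_info(owner, repo, timeout=10.0):
    """Send the GET request and decode the body, assuming it is JSON."""
    request = Request(
        build_server_url(owner, repo),
        headers={"Accept": "application/json"},
        method="GET",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(build_server_url("ThreeFish-AI", "scrapy-mcp"))
```

Calling `fetch_server_info("ThreeFish-AI", "scrapy-mcp")` performs the network request; the shape of the returned data is not described here, so inspect it before relying on specific fields.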

If you have feedback or need assistance with the MCP directory API, please join our Discord server.