
Data Extractor is a commercial-grade MCP server built on FastMCP. It reads and extracts content from web pages and PDFs, including both text and images, and converts it to Markdown. It is built for long-term deployment in enterprise environments.

🛠️ MCP Server Core Tools (14)

Web Page

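| Tool | Description | Main parameters |
| --- | --- | --- |
| scrape_webpage | Scrape a single page | url, method (auto-selected), extract_config (selector configuration), wait_for_element (CSS selector) |
| scrape_multiple_webpages | Scrape pages in batch | urls (list), method (applied to all pages), extract_config (global configuration) |
| scrape_with_stealth | Anti-detection scraping | url, method (selenium/playwright), scroll_page (scroll to load content), wait_for_element |
| fill_and_submit_form | Form automation | url, form_data (selector: value), submit (whether to submit), submit_button_selector |
| extract_links | Specialized link extraction | url, filter_domains (domains to include), exclude_domains (domains to exclude), internal_only (internal links only) |
| extract_structured_data | Structured data extraction | url, data_type (all/contact/social/content/products/addresses) |
| get_page_info | Page information | url (target URL); returns title, status code, and metadata |
| check_robots_txt | Crawler rule check | url (domain URL); checks robots.txt rules |
| convert_webpage_to_markdown | Page to Markdown | url, method, extract_main_content (extract main content), embed_images (embed images), formatting_options |
| batch_convert_webpages_to_markdown | Batch Markdown conversion | urls (list), method, extract_main_content, embed_images, embed_options |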
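As a rough sketch of how a client might invoke one of the tools listed above, the snippet below builds an MCP `tools/call` JSON-RPC request for `scrape_webpage`. The tool and parameter names come from the tool list; the helper name `build_tool_call`, the example URL, the `"auto"` method value, and the shape of `extract_config` are assumptions for illustration and are not confirmed by the server's documentation.

```python
import json
from typing import Any


def build_tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request, ensure_ascii=False)


# Argument values are hypothetical; only the parameter names come from the tool list.
payload = build_tool_call(
    "scrape_webpage",
    {
        "url": "https://example.com",
        "method": "auto",  # assumed spelling of the auto-select option
        "extract_config": {"title": "h1"},  # assumed shape: field name -> CSS selector
        "wait_for_element": "h1",
    },
)
print(payload)
```

In practice an MCP client library would normally build and send this message for you; the sketch only shows what travels over the wire.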

PDF Document

| Tool | Description | Main parameters |
| --- | --- | --- |
| convert_pdf_to_markdown | PDF to Markdown | pdf_source (URL or file path), method (auto/pymupdf/pypdf), page_range, output_format |
| batch_convert_pdfs_to_markdown | Batch PDF conversion | pdf_sources (list), method, page_range, output_format, include_metadata |
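The `method` parameter of the PDF tools accepts a fixed set of values, so a client may want to validate it before sending a request. The following is a minimal sketch under that assumption; the helper name `pdf_arguments`, the example URL, and the `[first_page, last_page]` format for `page_range` are hypothetical, since the tool list does not specify them.

```python
from typing import Any, Optional

# Method values listed for the PDF tools.
PDF_METHODS = ("auto", "pymupdf", "pypdf")


def pdf_arguments(pdf_source: str, method: str = "auto",
                  page_range: Optional[list[int]] = None) -> dict[str, Any]:
    """Build arguments for convert_pdf_to_markdown, rejecting unknown methods."""
    if method not in PDF_METHODS:
        raise ValueError(f"method must be one of {PDF_METHODS}, got {method!r}")
    arguments: dict[str, Any] = {"pdf_source": pdf_source, "method": method}
    if page_range is not None:
        # Assumed format: [first_page, last_page]; the tool list does not specify it.
        arguments["page_range"] = page_range
    return arguments


args = pdf_arguments("https://example.com/report.pdf", method="pymupdf", page_range=[0, 5])
print(args)
```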

Service Management

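| Tool | Description | Main parameters |
| --- | --- | --- |
| get_server_metrics | Performance metrics monitoring | None; returns request statistics, performance metrics, and cache status |
| clear_cache | Cache management | None; clears all cached data |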


🤝 Contribution

Issues and pull requests that improve this project are welcome.

📄 License

MIT License - see the LICENSE file for details.


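Note: Please use this tool responsibly, comply with each website's terms of use and robots.txt rules, and respect the website's intellectual property.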
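To illustrate what honoring robots.txt involves, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is not the server's `check_robots_txt` implementation; the robots.txt content, the `my-crawler` user agent, and the helper name `is_allowed` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/docs/index.html"))
print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/report.html"))
```

A scraper would run a check like this before requesting a page and skip any URL the rules disallow.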

Security: A (no known vulnerabilities)
License: A (permissive license)
Quality: A (confirmed to work)

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.
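The same directory API request can be sketched in Python with the standard library. The endpoint URL is taken from the curl command above; the helper names, the `Accept` header, and the assumption that the API responds with JSON are illustrative and not confirmed by the page.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner: str, repo: str) -> urllib.request.Request:
    """Build a GET request for one server entry in the MCP directory."""
    url = f"{API_BASE}/{owner}/{repo}"
    # The Accept header is an assumption; the curl example sends none.
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_server_info(owner: str, repo: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(build_server_request(owner, repo), timeout=timeout) as resp:
        return json.load(resp)


request = build_server_request("ThreeFish-AI", "scrapy-mcp")
print(request.get_method(), request.full_url)
```

Calling `fetch_server_info("ThreeFish-AI", "scrapy-mcp")` would perform the network request; the sketch only prints the request it would send.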