Collect Corpus
collect_corpus
Crawl a sitemap to extract clean text from articles and build a corpus for voice analysis.
Instructions
Crawl a sitemap and collect a clean writing corpus from published articles.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| sitemap_url | Yes | URL to XML sitemap (e.g., https://example.com/post-sitemap.xml) | |
| output_name | Yes | Corpus identifier/name (e.g., "richard-baxter") | |
| output_dir | Yes | Directory to store corpus files (e.g., "C:/dev/corpus") | |
| max_articles | No | Maximum number of articles to process | 100 |
| article_pattern | No | Optional regex to filter URLs | |
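
How `max_articles` and `article_pattern` might narrow the set of URLs taken from a sitemap can be sketched as follows. This is an illustration only, not the tool's actual implementation: the function name `select_article_urls` and the sample sitemap are hypothetical, and it assumes the pattern is applied as a regex search against each full URL in sitemap order.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample sitemap, inlined so the sketch needs no network access.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
  <url><loc>https://example.com/blog/second-post/</loc></url>
</urlset>"""


def select_article_urls(sitemap_xml, max_articles=100, article_pattern=None):
    """Return up to max_articles URLs from a sitemap, optionally regex-filtered.

    Hypothetical helper: mirrors the documented parameters, but the real
    tool's filtering rules are an assumption here.
    """
    root = ET.fromstring(sitemap_xml)
    pattern = re.compile(article_pattern) if article_pattern else None

    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if not url:
            continue  # skip empty <loc> entries
        if pattern and not pattern.search(url):
            continue  # URL does not match the optional filter
        urls.append(url)

    # Apply the cap after filtering, preserving sitemap order.
    return urls[:max_articles]


if __name__ == "__main__":
    for url in select_article_urls(SAMPLE_SITEMAP, article_pattern=r"/blog/"):
        print(url)
```

In this sketch the cap is applied after filtering, so `max_articles` counts only URLs that match `article_pattern`. Whether the real tool counts matches or all sitemap entries is not stated in the schema above.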