Glama
tolik-unicornrider

Website Scraper MCP Server

scrape-to-markdown

Extract the main article content from a web page with Mozilla's Readability and convert the resulting HTML to Markdown with Turndown.

Input Schema

Name | Required | Description | Default
url  | Yes      | -           | -
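
The tool takes a single required argument, url. As a rough, dependency-free approximation of that contract, the sketch below checks the argument by hand with the built-in URL constructor. The parseScrapeArgs helper and the ScrapeArgs type are hypothetical and not part of the server, and the server's actual validation may accept or reject slightly different strings.

```typescript
// Hypothetical helper (not part of the server): validates tool arguments by hand.
type ScrapeArgs = { url: string };

function parseScrapeArgs(input: unknown): ScrapeArgs {
  if (typeof input !== "object" || input === null) {
    throw new Error("arguments must be an object");
  }
  const url = (input as Record<string, unknown>).url;
  if (typeof url !== "string") {
    throw new Error("url is required and must be a string");
  }
  try {
    // The URL constructor throws when the string is not an absolute URL.
    new URL(url);
  } catch (_err) {
    throw new Error(`url is not a valid URL: ${url}`);
  }
  return { url };
}

// A well-formed argument object passes through unchanged.
console.log(parseScrapeArgs({ url: "https://example.com/post" }));
```

A string without a scheme, such as "not a url", or a missing url property makes the helper throw instead of returning.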

Implementation Reference

  • The inline handler function for the "scrape-to-markdown" tool that invokes scrapeToMarkdown and returns the markdown content or an error response.
    async ({ url }) => {
      try {
        const markdown = await scrapeToMarkdown(url);
        
        // Return the markdown as the tool result
        return {
          content: [{ type: "text", text: markdown }]
        };
      } catch (error: any) {
        // Handle errors gracefully
        return {
          content: [{ type: "text", text: `Error: ${error.message}` }],
          isError: true
        };
      }
    }
  • Zod schema defining the input parameter: a required valid URL string.
    { url: z.string().url() },
  • src/index.ts:71-90 (registration)
    Registration of the "scrape-to-markdown" tool on the MCP server using server.tool() with name, schema, and handler.
    server.tool(
      "scrape-to-markdown",
      { url: z.string().url() },
      async ({ url }) => {
        try {
          const markdown = await scrapeToMarkdown(url);
          
          // Return the markdown as the tool result
          return {
            content: [{ type: "text", text: markdown }]
          };
        } catch (error: any) {
          // Handle errors gracefully
          return {
            content: [{ type: "text", text: `Error: ${error.message}` }],
            isError: true
          };
        }
      }
    );
  • Main helper function that performs the web scraping: fetches HTML, extracts article with Readability/JSDOM, converts to markdown, and adds code block language hints.
    export async function scrapeToMarkdown(url: string): Promise<string> {
      try {
        // Fetch the HTML content from the provided URL with proper headers
        const response = await fetch(url, {
          headers: {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
          }
        });
        
        if (!response.ok) {
          throw new Error(`Failed to fetch URL: ${response.status}`);
        }
        
        // Read the Content-Type header. It is not used below:
        // response.text() always decodes the body as UTF-8.
        const contentType = response.headers.get('content-type') || '';
        const htmlContent = await response.text();
    
        // Parse the HTML using JSDOM with the URL to resolve relative links
        const dom = new JSDOM(htmlContent, { 
          url,
          pretendToBeVisual: true, // This helps with some interactive content
        });
        
        // Extract the main content using Readability
        const reader = new Readability(dom.window.document);
        const article = reader.parse();
        
        if (!article || !article.content) {
          throw new Error("Failed to parse article content");
        }
        
        // Convert the cleaned article HTML to Markdown using htmlToMarkdown
        let markdown = htmlToMarkdown(article.content);
        
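        // Simple post-processing to add language hints to untagged code fences.
        // The JavaScript rule runs first, so fences that start with class, import, if, for,
        // or while are tagged javascript; only def, from, and with can reach the Python rule.
        markdown = markdown.replace(/```\n(class|function|import|const|let|var|if|for|while)/g, '```javascript\n$1');
        markdown = markdown.replace(/```\n(def|class|import|from|with|if|for|while)(\s+)/g, '```python\n$1$2');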
        
        return markdown;
      } catch (error: any) {
        throw new Error(`Scraping error: ${error.message}`);
      }
    }
  • Supporting utility to convert cleaned HTML to Markdown using TurndownService, removing script tags first.
    export function htmlToMarkdown(html: string): string {
      // Remove script tags and their content before conversion
      const cleanHtml = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
      
      const turndownService = new TurndownService({
        codeBlockStyle: 'fenced',
        emDelimiter: '_'
      });
    
      return turndownService.turndown(cleanHtml);
    }
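
The language-hint step in scrapeToMarkdown can be tried in isolation. The sketch below copies the two replace calls into a standalone function so their effect on small inputs is easy to see. The name addLanguageHints is hypothetical; in the source the replacements are applied inline.

```typescript
// Hypothetical wrapper: the source applies these two replacements inline in scrapeToMarkdown.
function addLanguageHints(markdown: string): string {
  // Tag untagged fences whose first token looks like JavaScript.
  const withJs = markdown.replace(
    /```\n(class|function|import|const|let|var|if|for|while)/g,
    "```javascript\n$1"
  );
  // Tag remaining untagged fences that start with a Python keyword plus whitespace.
  return withJs.replace(
    /```\n(def|class|import|from|with|if|for|while)(\s+)/g,
    "```python\n$1$2"
  );
}

const jsBlock = addLanguageHints("```\nconst x = 1;\n```");
const pyBlock = addLanguageHints("```\ndef f():\n    pass\n```");
const ambiguous = addLanguageHints("```\nimport os\n```");

console.log(jsBlock.startsWith("```javascript")); // true
console.log(pyBlock.startsWith("```python"));     // true
console.log(ambiguous.startsWith("```javascript")); // true: a Python import is tagged as JavaScript
```

The last case shows the heuristic's main limitation: keywords shared by both languages are claimed by the JavaScript rule, so the hints are best treated as approximate.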

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/tolik-unicornrider/mcp_scraper'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.