Fetch YouTube Subtitles

fetch_youtube_subtitles

Extract subtitles or transcripts from YouTube videos in SRT, VTT, TXT, or JSON formats with timestamps and language options.

Instructions

Fetch subtitles/transcripts from YouTube videos. Supports multiple output formats (SRT, VTT, TXT, JSON) and language selection. Returns complete subtitle content with timestamps.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| url | Yes | YouTube video URL or video ID. Supported formats: `https://www.youtube.com/watch?v=xxx`, `https://youtu.be/xxx`, or a direct video ID | — |
| format | No | Output format. SRT: subtitle file format (with sequence numbers); VTT: WebVTT format; TXT: plain text (text only); JSON: structured JSON (with timestamps) | JSON |
| lang | No | Subtitle language code. Examples: `zh-Hans` (Simplified Chinese), `zh-Hant` (Traditional Chinese), `en` (English). Auto-detected if not specified | — |
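
A hypothetical call could pass an arguments object like the following (the video ID is illustrative, not taken from this page):

```typescript
// Example arguments object satisfying the input schema above.
const args = {
  url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ", // required
  format: "SRT", // optional; defaults to "JSON"
  lang: "en",    // optional; auto-detected when omitted
};
console.log(args.url); // https://www.youtube.com/watch?v=dQw4w9WgXcQ
```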

Implementation Reference

  • Core handler function that orchestrates subtitle fetching: validates arguments, extracts the video ID, fetches the transcript via Innertube, converts the raw segments into a TranscriptItem[], formats the output, and builds the response with metadata (or an error message on failure).
      async run(args: { url: string; format?: string; lang?: string }) {
        try {
          // 1️⃣ Parameter validation
          if (!args.url) {
            throw new Error("Parameter 'url' is required");
          }
    
          const format = args.format || "JSON";
          const lang = args.lang || undefined;
    
          // 2️⃣ Extract video ID
          const videoId = extractVideoId(args.url);
    
          // 3️⃣ Initialize YouTube client
          const youtube = await Innertube.create();
          
          // 4️⃣ Get video info
          const info = await youtube.getInfo(videoId);
          
          // 5️⃣ Fetch transcript with optional language
          let transcriptData;
          try {
            transcriptData = await info.getTranscript();
          } catch (error: any) {
            throw new Error(`No subtitle data found: ${error.message}`);
          }
    
          if (!transcriptData || !transcriptData.transcript) {
            throw new Error("No subtitle data found");
          }
    
          const transcript = transcriptData.transcript;
          const segments = transcript.content?.body?.initial_segments;
    
          if (!segments || segments.length === 0) {
            throw new Error("No subtitle segments found");
          }
    
          // 6️⃣ Convert to standard format
          const transcriptItems: TranscriptItem[] = segments.map((seg: any) => ({
            text: seg.snippet?.text || "",
            offset: seg.start_ms || 0,
            duration: (seg.end_ms || 0) - (seg.start_ms || 0),
          }));
    
          // 7️⃣ Format output
          const formattedContent = formatSubtitles(transcriptItems, format);
    
          // 8️⃣ Determine actual language
          let actualLanguage = lang || "auto";
          const transcriptDataAny = transcriptData as any;
          if (transcriptDataAny.transcript_search_panel?.footer?.language_menu) {
            const selectedLang = transcriptDataAny.transcript_search_panel.footer.language_menu.sub_menu_items?.find(
              (item: any) => item.selected
            );
            if (selectedLang) {
              actualLanguage = selectedLang.title || actualLanguage;
            }
          }
    
          // 9️⃣ Return result
          const result = {
            success: true,
            videoId,
            format,
            language: actualLanguage,
            subtitleCount: transcriptItems.length,
            content: formattedContent,
          };
    
          return {
            content: [
              {
                type: "text" as const,
                text: `# YouTube Subtitle Extraction Result
    
    **Video ID**: ${videoId}
    **Video Title**: ${info.basic_info.title || 'N/A'}
    **Format**: ${format}
    **Language**: ${result.language}
    **Subtitle Count**: ${result.subtitleCount}
    
    ---
    
    ${formattedContent}`,
              },
            ],
          };
        } catch (error: any) {
          return {
            content: [
              {
                type: "text" as const,
                text: `❌ Failed to fetch subtitles: ${error.message}
    
    **Possible reasons**:
    - Video has no available subtitles
    - Video is private or restricted
    - Specified language code does not exist
    - Network connection issue
    
    **Tips**:
    - Try without specifying language code (auto-detect)
    - Verify the video URL is correct
    - Check if the video has public subtitles`,
              },
            ],
            isError: true,
          };
        }
      },
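The TranscriptItem type referenced above is not defined in this excerpt; a minimal definition consistent with the step-6 mapping (all times in milliseconds) might look like this:

```typescript
// Hypothetical minimal shape of TranscriptItem, inferred from the handler code.
interface TranscriptItem {
  text: string;     // caption text for the segment
  offset: number;   // start time, in milliseconds
  duration: number; // segment length, in milliseconds
}

// Converting a raw Innertube segment into a TranscriptItem, as in step 6:
const seg = { snippet: { text: "Hello world" }, start_ms: 1000, end_ms: 2500 };
const item: TranscriptItem = {
  text: seg.snippet?.text || "",
  offset: seg.start_ms || 0,
  duration: (seg.end_ms || 0) - (seg.start_ms || 0),
};
console.log(item.duration); // 1500
```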
  • JSON Schema definition for the tool's input parameters: url (required), plus the optional format and lang.
    name: "fetch_youtube_subtitles",
    description:
      "Fetch subtitles/transcripts from YouTube videos. Supports multiple output formats (SRT, VTT, TXT, JSON) and language selection. Returns complete subtitle content with timestamps.",
    parameters: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description:
            "YouTube video URL or video ID. Supported formats: https://www.youtube.com/watch?v=xxx, https://youtu.be/xxx, or direct video ID",
        },
        format: {
          type: "string",
          enum: ["SRT", "VTT", "TXT", "JSON"],
          default: "JSON",
          description:
            "Output format. SRT: subtitle file format (with sequence numbers), VTT: WebVTT format, TXT: plain text (text only), JSON: structured JSON (with timestamps)",
        },
        lang: {
          type: "string",
          description:
            "Subtitle language code (optional). Examples: zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese), en (English). Auto-detect if not specified",
        },
      },
      required: ["url"],
    },
  • src/index.ts:22-51 (registration)
    Tool registration in the stdio MCP server using McpServer.registerTool, providing Zod inputSchema and handler wrapper.
    server.registerTool(
      fetchYoutubeSubtitles.name,
      {
        title: "Fetch YouTube Subtitles",
        description: fetchYoutubeSubtitles.description,
        inputSchema: {
          url: z
            .string()
            .describe(
              "YouTube video URL or video ID. Supported formats: https://www.youtube.com/watch?v=xxx, https://youtu.be/xxx, or direct video ID"
            ),
          format: z
            .enum(["SRT", "VTT", "TXT", "JSON"])
            .default("JSON")
            .optional()
            .describe(
              "Output format. SRT: subtitle file format (with sequence numbers), VTT: WebVTT format, TXT: plain text (text only), JSON: structured JSON (with timestamps)"
            ),
          lang: z
            .string()
            .optional()
            .describe(
              "Subtitle language code (optional). Examples: zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese), en (English). Auto-detect if not specified"
            ),
        },
      },
      async (args) => {
        return await fetchYoutubeSubtitles.run(args);
      }
    );
  • Tool registration in the HTTP MCP server using McpServer.registerTool within createMCPServer function, with Zod schema and handler.
    server.registerTool(
      fetchYoutubeSubtitles.name,
      {
        title: "Fetch YouTube Subtitles",
        description: fetchYoutubeSubtitles.description,
        inputSchema: {
          url: z
            .string()
            .describe(
              "YouTube video URL or video ID. Supported formats: https://www.youtube.com/watch?v=xxx, https://youtu.be/xxx, or direct video ID"
            ),
          format: z
            .enum(["SRT", "VTT", "TXT", "JSON"])
            .default("JSON")
            .optional()
            .describe(
              "Output format. SRT: subtitle file format (with sequence numbers), VTT: WebVTT format, TXT: plain text (text only), JSON: structured JSON (with timestamps)"
            ),
          lang: z
            .string()
            .optional()
            .describe(
              "Subtitle language code (optional). Examples: zh-Hans (Simplified Chinese), zh-Hant (Traditional Chinese), en (English). Auto-detect if not specified"
            ),
        },
      },
      async (args) => {
        const result = await fetchYoutubeSubtitles.run(args);
        return result;
      }
    );
  • Utility function that extracts the YouTube video ID from the supported URL formats or a plain 11-character ID.
    export function extractVideoId(url: string): string {
      const patterns = [
        /(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?#]+)/,
        /^([a-zA-Z0-9_-]{11})$/,
      ];
    
      for (const pattern of patterns) {
        const match = url.match(pattern);
        if (match) {
          return match[1];
        }
      }
    
      throw new Error("Unable to extract video ID from URL");
    }
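All three supported input forms resolve to the same ID; the definition is repeated verbatim below so the example is self-contained (the video ID is illustrative):

```typescript
// extractVideoId, repeated from the reference above for a runnable example.
function extractVideoId(url: string): string {
  const patterns = [
    /(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?#]+)/,
    /^([a-zA-Z0-9_-]{11})$/,
  ];
  for (const pattern of patterns) {
    const match = url.match(pattern);
    if (match) return match[1];
  }
  throw new Error("Unable to extract video ID from URL");
}

console.log(extractVideoId("https://www.youtube.com/watch?v=dQw4w9WgXcQ")); // dQw4w9WgXcQ
console.log(extractVideoId("https://youtu.be/dQw4w9WgXcQ"));                // dQw4w9WgXcQ
console.log(extractVideoId("dQw4w9WgXcQ"));                                 // dQw4w9WgXcQ
```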
  • Helper function that converts a TranscriptItem[] to the specified subtitle format (SRT, VTT, TXT, or JSON) by dispatching to dedicated converters.
    export function formatSubtitles(transcript: TranscriptItem[], format: string): string {
      switch (format.toUpperCase()) {
        case "SRT":
          return convertToSRT(transcript);
        case "VTT":
          return convertToVTT(transcript);
        case "TXT":
          return convertToTXT(transcript);
        case "JSON":
          return convertToJSON(transcript);
        default:
          throw new Error(`Unsupported format: ${format}. Supported formats: SRT, VTT, TXT, JSON`);
      }
    }
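The dedicated converters (convertToSRT and friends) are not shown in this excerpt. A minimal sketch of what convertToSRT could look like, assuming offset and duration are in milliseconds (the TranscriptItem interface here is a hypothetical repetition, not the project's actual code):

```typescript
// Hypothetical TranscriptItem shape, inferred from the handler's mapping step.
interface TranscriptItem {
  text: string;
  offset: number;   // start time, ms
  duration: number; // length, ms
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(ms: number): string {
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const milli = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(milli, 3)}`;
}

// Sketch of an SRT converter: sequence number, timing line, then the text.
function convertToSRT(transcript: TranscriptItem[]): string {
  return transcript
    .map((item, i) => {
      const start = toSrtTime(item.offset);
      const end = toSrtTime(item.offset + item.duration);
      return `${i + 1}\n${start} --> ${end}\n${item.text}`;
    })
    .join("\n\n");
}

console.log(convertToSRT([{ text: "Hello", offset: 0, duration: 1500 }]));
// 1
// 00:00:00,000 --> 00:00:01,500
// Hello
```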
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions output formats and language auto-detection, which are useful, but lacks details on error handling, rate limits, authentication needs, or whether the operation is read-only (implied by 'fetch'). More behavioral context would improve transparency.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with core functionality and efficiently lists key features in a single, well-structured sentence. Every element ('fetch subtitles/transcripts', 'output formats', 'language selection', 'returns content') adds value without redundancy, making it appropriately sized and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no annotations and no output schema, the description adequately covers the tool's purpose and basic features, but lacks completeness for a tool with 3 parameters and potential behavioral complexities. It does not explain return values in detail or address edge cases, leaving gaps in contextual understanding.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all parameters thoroughly. The description adds marginal value by summarizing key parameters ('multiple output formats', 'language selection'), but does not provide additional semantics beyond what's in the schema. Baseline is 3, but the concise mention of parameters in context slightly elevates it.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('fetch subtitles/transcripts'), target resource ('YouTube videos'), and key capabilities ('multiple output formats', 'language selection', 'complete subtitle content with timestamps'). It uses precise verbs and distinguishes what the tool does without redundancy.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage context through features like format and language selection, but provides no explicit guidance on when to use this tool versus alternatives (e.g., for different video platforms or content types). Since no sibling tools are listed, this is less critical, but general best practices are not addressed.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
