get_transcript

Extract YouTube video transcripts with timestamps in your preferred language. Use this tool to obtain captions for analysis, translation, or content creation.

Instructions

Get transcript for a YouTube video with timestamps

Input Schema

| Name    | Required | Description                                        | Default |
|---------|----------|----------------------------------------------------|---------|
| videoId | Yes      | YouTube video ID or full YouTube URL               | (none)  |
| lang    | No       | Preferred language code for captions (default: en) | en      |
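
A minimal example call, assuming the standard MCP `tools/call` request shape (the video URL here is purely illustrative):

```json
{
  "name": "get_transcript",
  "arguments": {
    "videoId": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "lang": "en"
  }
}
```

Because `videoId` accepts either a bare ID or a full URL, the same call works with `"videoId": "dQw4w9WgXcQ"`.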

Implementation Reference

  • The main handler function for the 'get_transcript' tool. It validates input, extracts the YouTube video ID, fetches video details and caption tracks using YouTube's InnerTube API, selects the best caption track, downloads and parses the XML transcript, and formats the output with timestamps, video metadata, and the full transcript.
      private async handleGetTranscript(args: any): Promise<CallToolResult> {
        if (!this.isValidTranscriptArgs(args)) {
          throw new McpError(
            ErrorCode.InvalidParams,
            'Invalid transcript arguments. Required: videoId'
          );
        }
    
        try {
          // Extract video ID from URL or use directly
          const videoId = this.extractVideoId(args.videoId);
          if (!videoId) {
            throw new Error('Invalid YouTube URL or video ID');
          }
    
          // Get video info from InnerTube API
          const videoInfo = await this.getVideoInfo(videoId);
    
          if (!videoInfo.videoDetails) {
            throw new Error('Could not fetch video details');
          }
    
          const { title, author, lengthSeconds } = videoInfo.videoDetails;
          const duration = `${Math.floor(parseInt(lengthSeconds) / 60)}:${(parseInt(lengthSeconds) % 60).toString().padStart(2, '0')}`;
    
          // Extract captions
          const captionTracks = videoInfo.captions?.playerCaptionsTracklistRenderer?.captionTracks;
          if (!captionTracks || captionTracks.length === 0) {
            throw new Error('No captions available for this video');
          }
    
          // Select best caption
          const selectedCaption = this.selectBestCaption(captionTracks, args.lang);
          if (!selectedCaption) {
            throw new Error('No suitable captions found');
          }
    
          const captionType = selectedCaption.kind === 'asr' ? 'auto-generated' : 'manual';
    
          // Fetch transcript content
          const transcriptResponse = await this.axiosInstance.get(selectedCaption.baseUrl);
          const parsedTranscript = this.parseXMLTranscript(transcriptResponse.data);
    
          // Format response
          const formattedTranscript = `# ${title}
    
    **Author:** ${author}  
    **Duration:** ${duration}  
    **Captions:** ${selectedCaption.name?.simpleText || selectedCaption.languageCode} (${captionType})
    
    ## Transcript
    
    ${parsedTranscript.map(segment => `${segment.timestamp} ${segment.text}`).join('\n')}
    
    ---
    *Generated using DeepSRT MCP Server*`;
    
          return {
            content: [
              {
                type: 'text',
                text: formattedTranscript
              }
            ]
          };
    
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Error getting transcript: ${error instanceof Error ? error.message : String(error)}`
              }
            ],
            isError: true
          };
        }
      }
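  • The handler above calls an extractVideoId helper that is not shown. A minimal sketch of what such a helper might look like; the regex and the set of accepted URL shapes are assumptions, not the server's actual implementation:

```typescript
// Hypothetical sketch: accept a bare 11-character video ID, or pull the ID
// out of the common YouTube URL shapes (watch?v=, youtu.be/, /shorts/, /embed/).
function extractVideoId(input: string): string | null {
  // Bare video IDs pass through unchanged.
  if (/^[A-Za-z0-9_-]{11}$/.test(input)) return input;
  const match = input.match(
    /(?:youtube\.com\/(?:watch\?v=|shorts\/|embed\/)|youtu\.be\/)([A-Za-z0-9_-]{11})/
  );
  return match ? match[1] : null;
}
```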
  • Input schema definition for the 'get_transcript' tool, specifying parameters videoId (required) and optional lang.
    inputSchema: {
      type: 'object',
      properties: {
        videoId: {
          type: 'string',
          description: 'YouTube video ID or full YouTube URL',
        },
        lang: {
          type: 'string',
          description: 'Preferred language code for captions (default: en)',
          default: 'en',
        },
      },
      required: ['videoId'],
    },
  • src/index.ts:116-127 (registration)
    Registration of the CallToolRequestHandler that dispatches 'get_transcript' calls to the handleGetTranscript method.
    this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
      if (request.params.name === 'get_summary') {
        return this.handleGetSummary(request.params.arguments);
      } else if (request.params.name === 'get_transcript') {
        return this.handleGetTranscript(request.params.arguments);
      } else {
        throw new McpError(
          ErrorCode.MethodNotFound,
          `Unknown tool: ${request.params.name}`
        );
      }
    });
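  • handleGetTranscript (shown earlier) also relies on a selectBestCaption helper that is not reproduced here. A plausible sketch under assumed policy (prefer an exact language match, prefer manual tracks over auto-generated "asr" tracks, fall back to the first available track); the interface and selection rules are assumptions:

```typescript
// Hypothetical shape of a caption track from the InnerTube player response.
interface CaptionTrack {
  baseUrl: string;
  languageCode: string;
  kind?: string; // "asr" marks auto-generated tracks
}

function selectBestCaption(
  tracks: CaptionTrack[],
  lang: string = 'en'
): CaptionTrack | null {
  if (tracks.length === 0) return null;
  const matches = tracks.filter(t => t.languageCode === lang);
  // Among language matches, a manual track beats an auto-generated one.
  const manualMatch = matches.find(t => t.kind !== 'asr');
  return manualMatch ?? matches[0] ?? tracks[0];
}
```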
  • src/index.ts:93-111 (registration)
    Tool registration in the ListTools response, defining name, description, and input schema for 'get_transcript'.
    {
      name: 'get_transcript',
      description: 'Get transcript for a YouTube video with timestamps',
      inputSchema: {
        type: 'object',
        properties: {
          videoId: {
            type: 'string',
            description: 'YouTube video ID or full YouTube URL',
          },
          lang: {
            type: 'string',
            description: 'Preferred language code for captions (default: en)',
            default: 'en',
          },
        },
        required: ['videoId'],
      },
    },
  • Core helper function that parses YouTube's XML timedtext format into timestamped transcript segments, handling syllable reconstruction and HTML entity decoding.
    private parseXMLTranscript(xmlContent: string): Array<{timestamp: string, text: string}> {
      const result: Array<{timestamp: string, text: string}> = [];
      
      // Handle YouTube's timedtext format
      if (xmlContent.includes('<timedtext')) {
        // Extract the body content
        const bodyMatch = xmlContent.match(/<body>(.*?)<\/body>/s);
        if (!bodyMatch) return result;
        
        const bodyContent = bodyMatch[1];
        
        // Find all <p> tags with their content
        const pTagRegex = /<p[^>]*t="(\d+)"[^>]*>(.*?)<\/p>/gs;
        let match;
        
        while ((match = pTagRegex.exec(bodyContent)) !== null) {
          const startTime = parseInt(match[1]);
          const pContent = match[2];
          
          // Skip empty paragraphs or paragraphs with only whitespace/newlines
          if (!pContent.trim()) {
            continue;
          }
          
          // Extract text from <s> tags within this paragraph
          const sTagRegex = /<s[^>]*>(.*?)<\/s>/g;
          const syllables: string[] = [];
          let sMatch;
          
          while ((sMatch = sTagRegex.exec(pContent)) !== null) {
            let syllable = sMatch[1];
            
            // Decode common HTML entities (&amp; last, so it cannot
            // double-decode sequences like &amp;lt;)
            syllable = syllable
              .replace(/&lt;/g, '<')
              .replace(/&gt;/g, '>')
              .replace(/&quot;/g, '"')
              .replace(/&#39;/g, "'")
              .replace(/&nbsp;/g, ' ')
              .replace(/&amp;/g, '&');
            
            syllables.push(syllable);
          }
          
          // Reconstruct words from syllables
          if (syllables.length > 0) {
            const words: string[] = [];
            let currentWord = '';
            
            for (const syllable of syllables) {
              if (syllable.startsWith(' ')) {
                // This syllable starts a new word
                if (currentWord.trim()) {
                  words.push(currentWord.trim());
                }
                currentWord = syllable; // Keep the leading space for now
              } else {
                // This syllable continues the current word
                currentWord += syllable;
              }
            }
            
            // Don't forget the last word
            if (currentWord.trim()) {
              words.push(currentWord.trim());
            }
            
            // Join words with single spaces
            const fullText = words.join(' ').trim();
            
            // Skip music notation and empty segments
            if (fullText && !/^\[.*\]$/.test(fullText) && fullText !== '♪♪♪') {
              const timestamp = this.formatTimestamp(startTime);
              result.push({ timestamp, text: fullText });
            }
          }
        }
      }
      
      return result;
    }
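  • parseXMLTranscript calls a formatTimestamp helper that is not shown. A minimal sketch, assuming the timedtext "t" attribute is a start time in milliseconds (which is how YouTube's timedtext format encodes it); the bracketed M:SS output format is an assumption:

```typescript
// Hypothetical sketch: convert a millisecond start time into a
// human-readable [M:SS] timestamp for the formatted transcript.
function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `[${minutes}:${String(seconds).padStart(2, '0')}]`;
}
```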
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions 'with timestamps,' which adds some context about the output format, but fails to address critical aspects like rate limits, authentication needs, error handling, or whether the operation is read-only or has side effects.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core purpose without unnecessary words. Every part of the sentence contributes directly to understanding the tool's function, making it highly concise and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no annotations, no output schema), the description is minimally adequate. It covers the basic purpose and output feature (timestamps) but lacks details on behavioral traits, usage guidelines, and output structure, leaving gaps for the agent to navigate.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The input schema has 100% description coverage, clearly documenting both parameters. The description adds no additional parameter semantics beyond what the schema provides, such as examples or constraints. With high schema coverage, the baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Get transcript for a YouTube video with timestamps.' It specifies the verb ('Get'), resource ('transcript'), and key feature ('with timestamps'), but doesn't explicitly differentiate from the sibling tool 'get_summary' beyond the resource type.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like 'get_summary.' It lacks context about use cases, prerequisites, or exclusions, leaving the agent to infer usage based solely on the tool name and purpose.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/DeepSRT/deepsrt-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server