MCP Podcast Scraper

FALLBACK.md•5.74 kB

# YouTube Fallback Feature ## Overview The podcast scraper now includes **automatic YouTube fallback** when RSS audio downloads fail. This ensures maximum reliability when scraping podcast episodes. ## How It Works ``` 1. Try RSS Download First ↓ 2. If RSS fails → Search YouTube automatically ↓ 3. Find best match → Download from YouTube ↓ 4. Transcribe & Save (works the same either way) ``` ## When Fallback Triggers The scraper automatically falls back to YouTube when: - RSS audio URL returns 404 (not found) - Download fails after redirects - Audio file is corrupt or empty - No audio URL is provided at all ## YouTube Search Logic When searching YouTube, the scraper: 1. Searches for `"{podcast name} {episode title}"` 2. Filters results to find the best match: - Contains podcast name in title or uploader - Duration > 5 minutes (filters out clips) - Picks first matching full episode 3. If no perfect match, uses the first long video (>5 min) ## Usage Examples ### Example 1: RSS Works (No Fallback Needed) ```typescript import { scrapeEpisode } from "./src/services/episode-scraper.js"; const result = await scrapeEpisode({ podcastName: "Lenny's Podcast", title: "Alexander Embiricos on AI", pubDate: "2025-12-14", audioUrl: "https://valid-rss-url.com/episode.mp3" }); // Output: source = 'rss' ``` ### Example 2: RSS Fails, YouTube Fallback ```typescript const result = await scrapeEpisode({ podcastName: "Lenny's Podcast", title: "Alexander Embiricos on AI", pubDate: "2025-12-14", audioUrl: "https://broken-url.com/nonexistent.mp3" // This will fail }); // Output: // ❌ RSS download failed: Failed to download: HTTP 404 // 🔄 Attempting YouTube fallback... // 🔍 Searching YouTube for: "Lenny's Podcast Alexander Embiricos on AI" // ✓ Found match: "..." by Lenny's Podcast // source = 'youtube' ``` ### Example 3: No RSS URL (YouTube Only) ```typescript const result = await scrapeEpisode({ podcastName: "Lenny's Podcast", title: "Alexander Embiricos", pubDate: "2025-12-14" // No audioUrl provided }); // Output: // ⚠️ No audio URL provided, trying YouTube... // source = 'youtube' ``` ## Console Output The scraper provides clear visual feedback: ``` ============================================================ 📻 Scraping: Episode Title ============================================================ 📥 Attempting download from RSS feed... ❌ RSS download failed: Failed to download: HTTP 404 🔄 Attempting YouTube fallback... 🔍 Searching YouTube for: "Podcast Name Episode Title" ✓ Found match: "Episode Title" by Podcast Channel (85 min) 📥 Downloading from YouTube... ✓ Successfully downloaded from YouTube 🎙️ Transcribing audio (this may take a few minutes)... ✓ Transcription complete: 100,086 characters 💾 Saving transcript... ✓ Saved to: /path/to/transcript.md ============================================================ ✅ Successfully scraped episode (source: YOUTUBE) ============================================================ ``` ## Testing ### Run the Demo ```bash npm run build npx tsx scripts/demo-fallback.ts ``` This will: 1. Try to download from a broken RSS URL 2. Automatically fallback to YouTube 3. Show you the full process ### Run Full Tests ```bash npx tsx scripts/test-fallback.ts ``` Tests three scenarios: 1. Valid RSS (should use RSS) 2. Invalid RSS (should fallback to YouTube) 3. No RSS URL (should use YouTube directly) ## API Reference ### `scrapeEpisode(episodeInfo)` **Parameters:** ```typescript interface EpisodeInfo { podcastName: string; // Used for YouTube search title: string; // Episode title pubDate: string; // YYYY-MM-DD format audioUrl?: string; // Optional RSS audio URL } ``` **Returns:** ```typescript interface ScrapeResult { transcriptPath: string; // Path to saved transcript transcriptPreview: string; // First 500 characters source: 'rss' | 'youtube'; // Which source was used } ``` **Throws:** - `AudioDownloadError` - Download failed (with original error) - `AudioNotFoundError` - Episode not found on YouTube either ## Error Handling ```typescript try { const result = await scrapeEpisode(episodeInfo); console.log(`Success! Used ${result.source}`); } catch (error) { if (error instanceof AudioNotFoundError) { console.error("Episode not found on RSS or YouTube"); } else if (error instanceof AudioDownloadError) { console.error("Download failed:", error.originalError); } else { console.error("Unknown error:", error); } } ``` ## Configuration No additional configuration needed! The fallback: - ✅ Uses your existing YouTube tools (`yt-dlp`) - ✅ Uses your existing Deepgram API key - ✅ Works with the MCP server automatically - ✅ Saves files in the same format ## Benefits 1. **Reliability**: Never miss an episode due to broken RSS links 2. **Automatic**: No manual intervention required 3. **Transparent**: Clear logging shows which source was used 4. **Backwards Compatible**: Existing code continues to work 5. **Smart Matching**: Filters out clips, finds full episodes ## Limitations - YouTube search requires episode title (can't scrape without it) - YouTube matching uses heuristics (may occasionally pick wrong video) - YouTube downloads are slightly slower than direct RSS - Requires `yt-dlp` to be installed ## Future Enhancements Potential improvements: - Manual confirmation before using YouTube match - Better fuzzy matching for episode titles - Support for other video platforms (Vimeo, etc.) - Caching of YouTube search results - User-configurable search preferences --- **Built with:** TypeScript, yt-dlp, Deepgram API

Loading blob content...

Latest Blog Posts

Don't Use Large Strings as Cache Keys
By punkpeye on January 11, 2026.
markdown
node-js
cache
What are Claude Skills?
By punkpeye on January 10, 2026.
mcp
skills
How to Test MCP Streamable HTTP Endpoints Using cURL
By punkpeye on January 2, 2026.
tutorial
bash

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/wkoleilat-happytitan/mcp-podcast-scraper'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

FALLBACK.md•5.74 kB