Skip to main content
Glama

Hacker News MCP Server

by az9713
README.md•10.3 kB
# šŸ“š YouTube Transcript to Lecture Notes Converter A sophisticated Claude skill for transforming YouTube transcripts into professional, educational lecture notes with both PDF and HTML outputs. ## šŸŽÆ Overview This skill processes raw YouTube transcripts to create comprehensive lecture notes that: - Remove conversational artifacts (fillers, timestamps, etc.) - Intelligently structure content into sections and subsections - Extract and define key concepts - Generate illustrative examples - Produce both PDF and interactive HTML outputs with identical content ## šŸš€ Quick Start ### Installation ```bash # Clone or download the skill files cd lecture-notes-skill # Install dependencies pip install -r requirements.txt # Download NLTK data (automatic on first run) python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')" ``` ### Basic Usage ```python from lecture_notes_converter import LectureNotesConverter # Create converter converter = LectureNotesConverter() # Convert transcript result = converter.convert( transcript_file='transcript.txt', output_dir='output', title='My Lecture Title' ) if result['success']: print(f"PDF: {result['pdf_path']}") print(f"HTML: {result['html_path']}") ``` ### Command Line Usage ```bash # Basic conversion python lecture_notes_converter.py transcript.txt -t "Lecture Title" # With custom output directory python lecture_notes_converter.py transcript.txt -o output/ -t "My Lecture" # With verbose output python lecture_notes_converter.py transcript.txt -v # With custom configuration python lecture_notes_converter.py transcript.txt --config custom_config.json ``` ## šŸ“‹ Features ### 1. **Intelligent Cleaning** - Removes filler words (um, uh, like, you know) - Eliminates timestamps and speaker labels - Fixes punctuation and spacing - Preserves educational content ### 2. **Smart Structuring** - Uses TF-IDF and cosine similarity for topic segmentation - Creates hierarchical sections and subsections - Generates descriptive section titles - Groups sentences into coherent paragraphs ### 3. **Content Enhancement** - Extracts key concepts using TextRank algorithm - Generates contextual definitions - Creates illustrative examples - Adds educational value to raw content ### 4. **Dual Output Formats** #### PDF Output - Professional formatting with custom styles - Hierarchical structure with numbered sections - Concept boxes with definitions - Print-ready layout #### HTML Output - Interactive table of contents sidebar - Smooth scrolling navigation - Responsive design for all devices - Beautiful gradient styling - Active section highlighting ## āš™ļø Configuration ### Default Configuration Structure ```python { 'cleaning': { 'remove_fillers': True, 'filler_threshold': 0.75, 'remove_timestamps': True, 'normalize_whitespace': True }, 'structuring': { 'window_size': 5, 'similarity_threshold': 0.3, 'min_section_sentences': 3, 'max_section_sentences': 50 }, 'enhancement': { 'extract_concepts': True, 'max_concepts_per_section': 5, 'generate_examples': True, 'examples_per_concept': 2 }, 'output': { 'pdf_page_size': 'letter', 'html_theme': 'academic', 'include_toc': True, 'include_metadata': True } } ``` ### Configuration Parameters #### Cleaning Parameters - `remove_fillers`: Remove filler words and verbal tics - `filler_threshold`: Confidence threshold for removal (0.0-1.0) - `remove_timestamps`: Remove timestamp markers - `normalize_whitespace`: Fix spacing issues #### Structuring Parameters - `window_size`: Sentences per window for similarity calculation - `similarity_threshold`: Threshold for segment boundaries (lower = more sections) - `min_section_sentences`: Minimum sentences per section - `max_section_sentences`: Maximum sentences per section #### Enhancement Parameters - `extract_concepts`: Enable concept extraction - `max_concepts_per_section`: Maximum concepts to extract per section - `generate_examples`: Generate examples for concepts - `examples_per_concept`: Number of examples per concept #### Output Parameters - `pdf_page_size`: 'letter' or 'A4' - `html_theme`: 'academic', 'modern', or 'classic' - `include_toc`: Include table of contents - `include_metadata`: Include generation metadata ## 🧮 Mathematical Foundations ### TF-IDF for Keyword Extraction $$TF\text{-}IDF(t,d,D) = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}} \times \log\frac{|D|}{|\{d \in D : t \in d\}|}$$ ### Cosine Similarity for Topic Segmentation $$\text{similarity}(A,B) = \frac{A \cdot B}{||A|| \times ||B||} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$ ### TextRank for Concept Importance $$PR(v_i) = (1-d) + d \times \sum_{v_j \in In(v_i)} \frac{w_{ji}}{\sum_{v_k \in Out(v_j)} w_{jk}} PR(v_j)$$ ## šŸ“Š Example Workflow ### Step 1: Prepare Transcript ```python # Sample transcript with artifacts transcript = """ [00:00:00] Um, hello everyone, welcome to the lecture. [00:00:05] Today we'll discuss, you know, machine learning basics. The first concept is supervised learning... """ ``` ### Step 2: Clean Transcript ```python cleaner = TranscriptCleaner() cleaned = cleaner.clean(transcript, verbose=True) # Output: "Hello everyone, welcome to the lecture. Today we'll discuss machine learning basics. The first concept is supervised learning..." ``` ### Step 3: Structure Content ```python structurer = ContentStructurer() structured = structurer.structure(cleaned) # Creates sections with titles and hierarchical organization ``` ### Step 4: Enhance Content ```python enhancer = ContentEnhancer() enhanced = enhancer.enhance(structured) # Adds concepts, definitions, and examples ``` ### Step 5: Generate Outputs ```python generator = OutputGenerator() pdf_bytes, html_str = generator.generate(enhanced, title="ML Basics") # Produces both PDF and HTML versions ``` ## šŸŽØ Output Examples ### PDF Output Structure ``` Machine Learning Fundamentals Generated on November 15, 2024 | 5 sections | 2,341 words 1. Introduction to Machine Learning [Content paragraphs...] šŸ“¦ Key Concept: Supervised Learning Definition: A type of machine learning where... Example: Consider email spam detection... 1.1 Types of Learning [Subsection content...] 2. Classification Algorithms [Content paragraphs...] ``` ### HTML Output Features - **Left Sidebar**: Clickable table of contents - **Main Content**: Formatted lecture notes - **Interactive Elements**: Smooth scrolling, active section highlighting - **Responsive Design**: Adapts to screen size - **Print-Friendly**: Clean layout when printed ## šŸ”§ Advanced Usage ### Custom Processing Pipeline ```python class CustomProcessor(LectureNotesConverter): def custom_clean(self, text): # Add custom cleaning logic text = super().clean(text) # Additional processing return text def custom_enhance(self, content): # Add custom enhancements content = super().enhance(content) # Additional features return content ``` ### Batch Processing ```python transcripts = ['lecture1.txt', 'lecture2.txt', 'lecture3.txt'] converter = LectureNotesConverter() for transcript in transcripts: result = converter.convert( transcript_file=transcript, output_dir='batch_output' ) print(f"Processed: {transcript}") ``` ### Integration with YouTube ```python # Example: Download and process YouTube transcript import youtube_transcript_api def process_youtube_video(video_id): # Get transcript transcript = youtube_transcript_api.get_transcript(video_id) # Convert to text text = '\n'.join([f"[{t['start']}] {t['text']}" for t in transcript]) # Save and process with open('temp_transcript.txt', 'w') as f: f.write(text) # Convert to lecture notes converter = LectureNotesConverter() return converter.convert('temp_transcript.txt') ``` ## šŸ› Troubleshooting ### Common Issues #### 1. Poor Section Detection **Problem**: Too many or too few sections created **Solution**: Adjust `similarity_threshold` in config ```python config['structuring']['similarity_threshold'] = 0.25 # More sections # or config['structuring']['similarity_threshold'] = 0.35 # Fewer sections ``` #### 2. Over-aggressive Cleaning **Problem**: Important content being removed **Solution**: Lower the `filler_threshold` ```python config['cleaning']['filler_threshold'] = 0.9 # Less aggressive ``` #### 3. Memory Issues with Large Transcripts **Problem**: Out of memory errors **Solution**: Process in chunks ```python # Split large transcript into parts chunk_size = 10000 # characters chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)] ``` ## šŸ“ˆ Performance Metrics ### Quality Metrics - **Completeness**: Percentage of original concepts preserved - **Coherence**: Semantic similarity between adjacent paragraphs - **Structure Score**: Balance and organization of sections ### Typical Performance - Processing speed: ~1000 words/second - Compression ratio: 20-30% reduction in size - Concept extraction: 3-5 key concepts per major section ## šŸ¤ Contributing ### Adding New Features 1. Extend the appropriate class (Cleaner, Structurer, Enhancer, or Generator) 2. Override or add methods 3. Update configuration options 4. Add tests ### Example: Custom Cleaning Pattern ```python class CustomCleaner(TranscriptCleaner): def setup_patterns(self): super().setup_patterns() # Add custom pattern self.patterns.append( (r'\[CUSTOM_MARKER\]', 0.99, 'custom') ) ``` ## šŸ“„ License This skill is provided as-is for educational and professional use. ## šŸ™ Acknowledgments - Uses NLTK for natural language processing - ReportLab for PDF generation - Jinja2 for HTML templating - scikit-learn for machine learning algorithms ## šŸ“ž Support For issues or questions: 1. Check the troubleshooting section 2. Review the example scripts 3. Examine the detailed skill documentation (SKILL.md) --- **Version**: 1.0.0 **Author**: Claude AI Assistant **Last Updated**: November 2024

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/az9713/claude_skill_hn_mcp_server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server