
content-core

test_coverage_branch_report.md • 21.3 kB
# Branch Test Coverage Analysis

## Branch Information

- Branch: main
- Feature: Parallel Audio Transcription with Concurrency Control
- Total files analyzed: 3
- Files with test coverage concerns: 1

## Executive Summary

The parallel audio transcription feature has **excellent unit test coverage** for the core concurrency functionality. The 14 existing tests in `test_audio_concurrency.py` cover configuration, parallel execution, error handling, and edge cases. However, there are **gaps in integration testing** and **missing coverage for helper functions** that should be addressed before production deployment.

**Overall Assessment**: 85% coverage - Very good for critical paths, but needs integration tests.

## Changed Files Analysis

### 1. /Users/luisnovo/dev/projetos/content-core/src/content_core/processors/audio.py

**Changes Made**:
- Added `transcribe_audio_segment()` function with semaphore-based concurrency control
- Modified `extract_audio_data()` to use parallel transcription with configurable concurrency
- Existing functions: `split_audio()` and `extract_audio()` (no changes to these)

**Current Test Coverage**:
- Test file: `/Users/luisnovo/dev/projetos/content-core/tests/unit/test_audio_concurrency.py`
- Coverage status: **Partially covered**
- 14 unit tests covering:
  - Configuration loading and validation (8 tests)
  - Parallel transcription behavior (5 tests)
  - Error handling (1 test)

**Missing Tests**:
- [ ] Integration test for `extract_audio_data()` with real audio files
- [ ] Test for audio segmentation logic (duration > 10 minutes)
- [ ] Test for temporary directory cleanup after transcription
- [ ] Test for metadata return format (audio_files array)
- [ ] Test for content joining from multiple segments
- [ ] Unit tests for `split_audio()` function
- [ ] Unit tests for `extract_audio()` function
- [ ] Test for concurrency with actual ModelFactory and speech_to_text model
- [ ] Test for exception handling in `extract_audio_data()` main try/except block
- [ ] Test for interaction with AudioFileClip (MoviePy) library

**Priority**: **High**

**Rationale**: While the core concurrency mechanism is well-tested, the main entry point `extract_audio_data()` lacks integration tests. This function orchestrates file splitting, parallel transcription, and result aggregation - all critical paths that need end-to-end validation.

### 2. /Users/luisnovo/dev/projetos/content-core/src/content_core/config.py

**Changes Made**:
- Added `get_audio_concurrency()` function with environment variable override and validation

**Current Test Coverage**:
- Test file: `/Users/luisnovo/dev/projetos/content-core/tests/unit/test_audio_concurrency.py`
- Coverage status: **Fully covered**
- 8 tests covering all scenarios

**Missing Tests**:
- None - coverage is complete

**Priority**: **Low**

**Rationale**: This function has excellent test coverage including edge cases, boundary values, and error conditions.

### 3. /Users/luisnovo/dev/projetos/content-core/src/content_core/cc_config.yaml

**Changes Made**:
- Added `extraction.audio.concurrency` configuration with default value of 3

**Current Test Coverage**:
- Indirectly tested through `get_audio_concurrency()` tests
- Coverage status: **Fully covered**

**Missing Tests**:
- None - configuration loading is tested

**Priority**: **Low**

**Rationale**: YAML configuration is adequately validated through config function tests.

## Test Implementation Plan

### High Priority Tests

#### 1. Integration Test for `extract_audio_data()`

- **Test file to create**: `/Users/luisnovo/dev/projetos/content-core/tests/integration/test_audio_processing.py`
- **Test scenarios**:
  - Short audio file (< 10 minutes) - no segmentation needed
  - Long audio file (> 10 minutes) - requires segmentation
  - Verify parallel transcription with multiple segments
  - Verify content joining and metadata structure
  - Verify temporary directory creation and cleanup
- **Example test structure**:

```python
import asyncio
import tempfile
from pathlib import Path
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from content_core.common import ProcessSourceState
from content_core.processors.audio import extract_audio_data


class TestAudioDataExtraction:
    """Integration tests for extract_audio_data function"""

    @pytest.mark.asyncio
    async def test_short_audio_file_no_segmentation(self, fixture_path):
        """Test extraction from audio file shorter than 10 minutes"""
        # Create a short audio file fixture (< 10 minutes)
        audio_file = fixture_path / "short_audio.mp3"
        if not audio_file.exists():
            pytest.skip(f"Fixture file not found: {audio_file}")

        state = ProcessSourceState(file_path=str(audio_file))

        # Mock the ModelFactory to avoid real API calls
        with patch('content_core.processors.audio.ModelFactory') as mock_factory:
            mock_model = MagicMock()
            mock_model.atranscribe = AsyncMock(
                return_value=MagicMock(text="Test transcription")
            )
            mock_factory.get_model.return_value = mock_model

            result = await extract_audio_data(state)

        # Verify result structure
        assert "content" in result
        assert "metadata" in result
        assert "audio_files" in result["metadata"]
        assert len(result["metadata"]["audio_files"]) == 1
        assert "Test transcription" in result["content"]

    @pytest.mark.asyncio
    async def test_long_audio_file_with_segmentation(self, fixture_path):
        """Test extraction from audio file longer than 10 minutes requiring segmentation"""
        audio_file = fixture_path / "long_audio.mp3"
        if not audio_file.exists():
            pytest.skip(f"Fixture file not found: {audio_file}")

        state = ProcessSourceState(file_path=str(audio_file))

        with patch('content_core.processors.audio.ModelFactory') as mock_factory:
            mock_model = MagicMock()
            # Simulate different transcriptions for different segments
            transcriptions = ["Segment 1 text", "Segment 2 text", "Segment 3 text"]
            mock_model.atranscribe = AsyncMock(
                side_effect=[MagicMock(text=t) for t in transcriptions]
            )
            mock_factory.get_model.return_value = mock_model

            result = await extract_audio_data(state)

        # Verify segmentation occurred
        assert "content" in result
        assert "metadata" in result
        assert len(result["metadata"]["audio_files"]) > 1

        # Verify all segments were transcribed and joined
        assert "Segment 1 text" in result["content"]
        assert "Segment 2 text" in result["content"]
        assert "Segment 3 text" in result["content"]

    @pytest.mark.asyncio
    async def test_parallel_transcription_respects_concurrency_limit(self):
        """Test that parallel transcription respects configured concurrency limit"""
        # Create mock audio file with duration > 10 minutes
        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_audio.duration = 1800  # 30 minutes
            mock_clip.return_value = mock_audio

            with patch('content_core.processors.audio.extract_audio'):
                with patch('content_core.processors.audio.ModelFactory') as mock_factory:
                    call_times = []

                    async def track_calls(audio_file):
                        call_times.append(asyncio.get_running_loop().time())
                        await asyncio.sleep(0.1)
                        return MagicMock(text=f"transcript_{audio_file}")

                    mock_model = MagicMock()
                    mock_model.atranscribe = track_calls
                    mock_factory.get_model.return_value = mock_model

                    # patch() replaces a name where it is looked up; adjust the
                    # target if audio.py imports get_audio_concurrency directly.
                    with patch('content_core.config.get_audio_concurrency', return_value=2):
                        state = ProcessSourceState(file_path="/tmp/test.mp3")
                        result = await extract_audio_data(state)

                    # Verify that concurrency was limited
                    assert "content" in result
                    # Note: This test would need more sophisticated timing analysis
                    # to truly verify concurrency limits

    @pytest.mark.asyncio
    async def test_temporary_files_cleanup(self):
        """Test that temporary segmented audio files are properly handled"""
        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_audio.duration = 1800  # 30 minutes (requires segmentation)
            mock_clip.return_value = mock_audio

            created_files = []

            def mock_extract(input_file, output_file, start_time, end_time):
                # Track created files
                created_files.append(output_file)
                Path(output_file).touch()

            with patch('content_core.processors.audio.extract_audio', side_effect=mock_extract):
                with patch('content_core.processors.audio.ModelFactory') as mock_factory:
                    mock_model = MagicMock()
                    mock_model.atranscribe = AsyncMock(
                        return_value=MagicMock(text="test")
                    )
                    mock_factory.get_model.return_value = mock_model

                    state = ProcessSourceState(file_path="/tmp/test.mp3")
                    result = await extract_audio_data(state)

            # Verify temporary directory structure
            assert len(created_files) > 0
            # Verify files were created in temp directory
            for file in created_files:
                assert "tmp" in file.lower() or tempfile.gettempdir() in file

    @pytest.mark.asyncio
    async def test_error_handling_propagation(self):
        """Test that errors in audio processing are properly handled and propagated"""
        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_clip.side_effect = Exception("Failed to load audio file")

            state = ProcessSourceState(file_path="/tmp/nonexistent.mp3")

            with pytest.raises(Exception) as exc_info:
                await extract_audio_data(state)

            assert "Failed to load audio file" in str(exc_info.value)
```

#### 2. Unit Tests for `split_audio()` Function

- **Test file to update**: `/Users/luisnovo/dev/projetos/content-core/tests/unit/test_audio_concurrency.py`
- **Test scenarios**:
  - Split audio file into correct number of segments
  - Verify segment naming convention
  - Test with custom output prefix
  - Test with varying segment lengths
  - Test async execution via thread pool
- **Example test structure**:

```python
from unittest.mock import MagicMock, patch

import pytest


class TestSplitAudio:
    """Unit tests for split_audio function"""

    @pytest.mark.asyncio
    async def test_split_audio_creates_correct_segments(self, tmp_path):
        """Test that audio is split into correct number of segments"""
        from content_core.processors.audio import split_audio

        # Create a mock audio file
        test_audio = tmp_path / "test_audio.mp3"
        test_audio.touch()

        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_audio.duration = 1800  # 30 minutes
            mock_clip.return_value = mock_audio

            with patch('content_core.processors.audio.extract_audio'):
                result = await split_audio(str(test_audio), segment_length_minutes=15)

        # Should create 2 segments (30 min / 15 min segments)
        assert len(result) == 2
        # Every segment should carry one of the two expected suffixes
        assert all("_001.mp3" in f or "_002.mp3" in f for f in result)

    @pytest.mark.asyncio
    async def test_split_audio_naming_convention(self, tmp_path):
        """Test that segment files follow correct naming convention"""
        from content_core.processors.audio import split_audio

        test_audio = tmp_path / "my_podcast.mp3"
        test_audio.touch()

        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_audio.duration = 2400  # 40 minutes
            mock_clip.return_value = mock_audio

            with patch('content_core.processors.audio.extract_audio'):
                result = await split_audio(
                    str(test_audio),
                    segment_length_minutes=10,
                    output_prefix="custom_prefix"
                )

        # Verify custom prefix is used
        assert all("custom_prefix_" in f for f in result)
        # Verify zero-padded numbering
        assert any("_001.mp3" in f for f in result)
```

#### 3. Unit Tests for `extract_audio()` Function

- **Test file to update**: `/Users/luisnovo/dev/projetos/content-core/tests/unit/test_audio_concurrency.py`
- **Test scenarios**:
  - Extract full audio without time bounds
  - Extract audio segment with start and end times
  - Extract audio with only start time
  - Extract audio with only end time
  - Error handling for invalid file paths
- **Example test structure**:

```python
from unittest.mock import MagicMock, patch

import pytest


class TestExtractAudio:
    """Unit tests for extract_audio function"""

    def test_extract_full_audio(self, tmp_path):
        """Test extracting full audio without time constraints"""
        from content_core.processors.audio import extract_audio

        input_file = tmp_path / "input.mp3"
        output_file = tmp_path / "output.mp3"
        input_file.touch()

        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_clip.return_value = mock_audio

            extract_audio(str(input_file), str(output_file))

            mock_audio.write_audiofile.assert_called_once()
            mock_audio.close.assert_called_once()

    def test_extract_audio_segment(self, tmp_path):
        """Test extracting audio segment with start and end times"""
        from content_core.processors.audio import extract_audio

        input_file = tmp_path / "input.mp3"
        output_file = tmp_path / "output.mp3"
        input_file.touch()

        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_audio = MagicMock()
            mock_subclip = MagicMock()
            mock_audio.subclipped.return_value = mock_subclip
            mock_clip.return_value = mock_audio

            extract_audio(str(input_file), str(output_file), start_time=10.0, end_time=20.0)

            mock_audio.subclipped.assert_called_once_with(10.0, 20.0)
            mock_subclip.write_audiofile.assert_called_once()
            mock_subclip.close.assert_called_once()

    def test_extract_audio_error_handling(self, tmp_path):
        """Test error handling when audio extraction fails"""
        from content_core.processors.audio import extract_audio

        with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
            mock_clip.side_effect = Exception("Invalid audio file")

            with pytest.raises(Exception) as exc_info:
                extract_audio("/invalid/path.mp3", "/output/path.mp3")

            assert "Invalid audio file" in str(exc_info.value)
```

### Medium Priority Tests

#### 4. Enhanced Error Handling Tests

- **Test file to update**: `/Users/luisnovo/dev/projetos/content-core/tests/unit/test_audio_concurrency.py`
- **Test scenarios**:
  - Test behavior when all transcriptions fail
  - Test behavior when ModelFactory fails to create model
  - Test behavior with invalid audio format
  - Test semaphore behavior with exceptions
- **Example test structure**:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from content_core.common import ProcessSourceState
from content_core.processors.audio import extract_audio_data, transcribe_audio_segment


class TestEnhancedErrorHandling:
    """Enhanced error handling tests for audio processing"""

    @pytest.mark.asyncio
    async def test_all_transcriptions_fail(self):
        """Test behavior when all transcription attempts fail"""
        mock_model = MagicMock()
        mock_model.atranscribe = AsyncMock(
            side_effect=Exception("API rate limit exceeded")
        )

        semaphore = asyncio.Semaphore(3)
        audio_files = [f"audio_{i}.mp3" for i in range(5)]

        tasks = [
            transcribe_audio_segment(audio_file, mock_model, semaphore)
            for audio_file in audio_files
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # All should be exceptions
        assert all(isinstance(r, Exception) for r in results)
        assert len(results) == 5

    @pytest.mark.asyncio
    async def test_model_factory_failure(self):
        """Test behavior when ModelFactory fails to create speech-to-text model"""
        with patch('content_core.processors.audio.ModelFactory') as mock_factory:
            mock_factory.get_model.side_effect = Exception("Model not configured")

            with patch('content_core.processors.audio.AudioFileClip') as mock_clip:
                mock_audio = MagicMock()
                mock_audio.duration = 300  # 5 minutes
                mock_clip.return_value = mock_audio

                state = ProcessSourceState(file_path="/tmp/test.mp3")

                with pytest.raises(Exception) as exc_info:
                    await extract_audio_data(state)

                assert "Model not configured" in str(exc_info.value)
```

### Low Priority Tests

#### 5. Configuration Override Tests

- **Test file**: Existing tests are adequate
- **Additional scenarios** (nice to have):
  - Test config file loading with audio.concurrency set
  - Test precedence of environment variables over config file

#### 6. Performance Tests (Optional)

- **Test file to create**: `/Users/luisnovo/dev/projetos/content-core/tests/performance/test_audio_performance.py`
- **Test scenarios**:
  - Benchmark transcription speed with different concurrency levels
  - Measure memory usage during parallel processing
  - Test with various audio file sizes

## Summary Statistics

- **Files analyzed**: 3
- **Files with adequate test coverage**: 2 (config.py, cc_config.yaml)
- **Files needing additional tests**: 1 (processors/audio.py)
- **Total test scenarios identified**: 20+
- **Estimated effort**: 4-6 hours for high priority tests, 2-3 hours for medium priority

## Current Test Execution Results

All 14 existing tests pass successfully:

```
tests/unit/test_audio_concurrency.py::TestAudioConcurrencyConfig (8 tests) - PASSED
tests/unit/test_audio_concurrency.py::TestParallelTranscription (5 tests) - PASSED
tests/unit/test_audio_concurrency.py::TestErrorHandling (1 test) - PASSED
```

## Recommendations

### Immediate Actions (Before Merge)

1. **Add integration test for `extract_audio_data()`** - This is the main entry point and orchestrates all audio processing. At minimum, add one integration test that verifies end-to-end functionality with a mock audio file.
2. **Add error handling test for `extract_audio_data()` exceptions** - Test the main try/except block to ensure errors are properly logged and propagated.
3. **Verify the existing integration tests** - The tests `test_extract_content_from_mp3` and `test_extract_content_from_mp4` in `/Users/luisnovo/dev/projetos/content-core/tests/integration/test_extraction.py` should exercise the parallel transcription code path. Confirm they work with the new implementation.

### Short-term Improvements (Next Sprint)

4. **Add unit tests for `split_audio()` and `extract_audio()`** - These helper functions are currently untested but are important for reliability.
5. **Add temporary file cleanup verification** - Ensure temp files created during segmentation don't accumulate.
6. **Test with actual ModelFactory integration** - Create a test that uses real (or properly mocked) ModelFactory to verify the integration point.

### Long-term Enhancements (Future)

7. **Performance benchmarking** - Add tests to measure and track performance improvements from parallelization.
8. **Stress testing** - Test with very long audio files (multiple hours) to verify behavior at scale.
9. **Edge case testing** - Test with corrupted audio files, zero-length files, extremely short segments, etc.

## Conclusion

The parallel audio transcription feature demonstrates **strong engineering practices** with excellent unit test coverage for the core concurrency mechanism. The `get_audio_concurrency()` configuration function has comprehensive tests covering all edge cases and validation logic.

However, the **integration layer needs attention**. The main `extract_audio_data()` function lacks integration tests, and the helper functions `split_audio()` and `extract_audio()` have no test coverage at all.

**Recommendation**: The feature is well-tested at the unit level for concurrency control, but needs integration tests before being considered production-ready. The risk is moderate - the core parallel execution logic is solid, but the file handling and orchestration logic is untested. Add at least 2-3 integration tests as outlined in the "High Priority Tests" section before merging.
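
## Appendix: Sketch for Verifying the Concurrency Limit

The concurrency-limit test in the High Priority section notes that it would need more sophisticated timing analysis. One option that avoids timing altogether is to have the fake model count how many calls are in flight and record the peak. The sketch below is illustrative only: `PeakTrackingModel`, `transcribe_with_limit`, and `run_segments` are hypothetical names, and `transcribe_with_limit` merely assumes that the real `transcribe_audio_segment()` awaits `model.atranscribe()` inside `async with semaphore`, which this report implies but does not show.

```python
import asyncio


class PeakTrackingModel:
    """Fake speech-to-text model that records the peak number of overlapping calls."""

    def __init__(self, delay: float = 0.01) -> None:
        self.delay = delay
        self.in_flight = 0
        self.peak = 0
        self.calls = 0

    async def atranscribe(self, audio_file: str) -> str:
        self.in_flight += 1
        self.calls += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            # Yield to the event loop so other tasks get a chance to overlap.
            await asyncio.sleep(self.delay)
            return f"transcript_{audio_file}"
        finally:
            self.in_flight -= 1


async def transcribe_with_limit(audio_file, model, semaphore):
    """Hypothetical stand-in for transcribe_audio_segment (assumed shape)."""
    async with semaphore:
        return await model.atranscribe(audio_file)


async def run_segments(num_segments: int, limit: int) -> PeakTrackingModel:
    """Transcribe num_segments fake files with at most `limit` calls at once."""
    model = PeakTrackingModel()
    semaphore = asyncio.Semaphore(limit)
    files = [f"audio_{i}.mp3" for i in range(num_segments)]
    await asyncio.gather(
        *(transcribe_with_limit(f, model, semaphore) for f in files)
    )
    return model


def test_peak_concurrency_matches_limit():
    model = asyncio.run(run_segments(num_segments=6, limit=2))
    assert model.calls == 6   # every segment was transcribed
    assert model.peak == 2    # calls overlapped, but never beyond the limit
```

To apply the idea to the real code path, the same fake model could be returned from the patched `ModelFactory.get_model`, with the assertion on `peak` replacing the timing note in the existing test. Counting in-flight calls is deterministic, so it should not become flaky on slow CI machines the way wall-clock comparisons can.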
