# Job URL Analyzer MCP Server

A comprehensive FastAPI-based microservice for analyzing job URLs and extracting detailed company information. Built with modern async Python, this service crawls job postings and company websites to build rich company profiles with data enrichment from external providers.

## ✨ Features

- **🕷️ Intelligent Web Crawling**: Respectful crawling with robots.txt compliance and rate limiting
- **🧠 Content Extraction**: Advanced HTML parsing using Selectolax for fast, accurate data extraction
- **🔗 Data Enrichment**: Pluggable enrichment providers (Crunchbase, LinkedIn, custom APIs)
- **📊 Quality Scoring**: Completeness and confidence metrics for extracted data
- **📝 Markdown Reports**: Beautiful, comprehensive company analysis reports
- **🔍 Observability**: OpenTelemetry tracing, Prometheus metrics, structured logging
- **🚀 Production Ready**: Docker, Kubernetes, health checks, graceful shutdown
- **🧪 Well Tested**: Comprehensive test suite with 80%+ coverage

## 🏗️ Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │───▶│  Orchestrator   │───▶│   Web Crawler   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │ Content Extract │    │    Database     │
                       └─────────────────┘    │  (SQLAlchemy)   │
                                │             └─────────────────┘
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Enrichment    │───▶│    Providers    │
                       │    Manager      │    │ (Crunchbase,etc)│
                       └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │ Report Generator│
                       └─────────────────┘
```

## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- Poetry (for dependency management)
- Docker & Docker Compose (optional)

### Local Development

1. **Clone and Setup**

   ```bash
   git clone https://github.com/subslink326/job-url-analyzer-mcp.git
   cd job-url-analyzer-mcp
   poetry install
   ```

2. **Environment Configuration** *(Optional)*

   ```bash
   # The application has sensible defaults and can run without environment configuration.
   # To customize settings, create a .env file with your configuration.
   # See src/job_url_analyzer/config.py for available settings.
   ```

3. **Database Setup**

   ```bash
   poetry run alembic upgrade head
   ```

4. **Run Development Server**

   ```bash
   poetry run python -m job_url_analyzer.main
   # Server starts at http://localhost:8000
   ```

### Docker Deployment

1. **Development**

   ```bash
   docker-compose up --build
   ```

2. **Production**

   ```bash
   docker-compose -f docker-compose.prod.yml up -d
   ```

## 📡 API Usage

### Analyze Job URL

```bash
curl -X POST "http://localhost:8000/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://company.com/jobs/software-engineer",
    "include_enrichment": true,
    "force_refresh": false
  }'
```

### Response Example

```json
{
  "profile_id": "123e4567-e89b-12d3-a456-426614174000",
  "source_url": "https://company.com/jobs/software-engineer",
  "company_profile": {
    "name": "TechCorp",
    "description": "Leading AI company...",
    "industry": "Technology",
    "employee_count": 150,
    "funding_stage": "Series B",
    "total_funding": 25.0,
    "headquarters": "San Francisco, CA",
    "tech_stack": ["Python", "React", "AWS"],
    "benefits": ["Health insurance", "Remote work"]
  },
  "completeness_score": 0.85,
  "confidence_score": 0.90,
  "processing_time_ms": 3450,
  "enrichment_sources": ["crunchbase"],
  "markdown_report": "# TechCorp - Company Analysis Report\n..."
}
```
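The same request works from any HTTP client. Here is a minimal Python sketch (assuming `httpx` is installed; the endpoint URL and payload fields are the ones shown above):

```python
# Minimal client sketch for the /analyze endpoint.
# Assumes the server is running locally on the default port (see Quick Start).
import asyncio

import httpx


async def analyze(url: str) -> dict:
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:8000/analyze",
            json={
                "url": url,
                "include_enrichment": True,
                "force_refresh": False,
            },
        )
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    result = asyncio.run(analyze("https://company.com/jobs/software-engineer"))
    print(result["company_profile"]["name"], result["completeness_score"])
```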
## ⚙️ Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Enable debug mode | `false` |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8000` |
| `DATABASE_URL` | Database connection string | `sqlite+aiosqlite:///./data/job_analyzer.db` |
| `MAX_CONCURRENT_REQUESTS` | Max concurrent HTTP requests | `10` |
| `REQUEST_TIMEOUT` | HTTP request timeout (seconds) | `30` |
| `CRAWL_DELAY` | Delay between requests (seconds) | `1.0` |
| `RESPECT_ROBOTS_TXT` | Respect robots.txt | `true` |
| `ENABLE_CRUNCHBASE` | Enable Crunchbase enrichment | `false` |
| `CRUNCHBASE_API_KEY` | Crunchbase API key | `""` |
| `DATA_RETENTION_DAYS` | Data retention period (days) | `90` |
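These can be supplied via a `.env` file in the project root (`src/job_url_analyzer/config.py` is the authoritative list). A purely illustrative example that enables Crunchbase enrichment and slows the crawler down:

```bash
# .env — illustrative values only; every variable is optional
DEBUG=false
MAX_CONCURRENT_REQUESTS=5
CRAWL_DELAY=2.0
ENABLE_CRUNCHBASE=true
CRUNCHBASE_API_KEY=your-api-key-here
```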
## 📊 Monitoring

### Metrics Endpoints

- **Health Check**: `GET /health`
- **Prometheus Metrics**: `GET /metrics`

### Key Metrics

- `job_analyzer_requests_total` - Total API requests
- `job_analyzer_analysis_success_total` - Successful analyses
- `job_analyzer_completeness_score` - Data completeness distribution
- `job_analyzer_crawl_requests_total` - Crawl requests by status
- `job_analyzer_enrichment_success_total` - Enrichment success by provider

## 🧪 Testing

### Run Tests

```bash
# Unit tests
poetry run pytest

# With coverage
poetry run pytest --cov=job_url_analyzer --cov-report=html

# Integration tests only
poetry run pytest -m integration

# Skip slow tests
poetry run pytest -m "not slow"
```

## 🚀 Deployment

### Kubernetes

```bash
# Apply manifests
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=job-analyzer
kubectl logs -f deployment/job-analyzer
```

### Production Checklist

- [ ] Environment variables configured
- [ ] Database migrations applied
- [ ] SSL certificates configured
- [ ] Monitoring dashboards set up
- [ ] Log aggregation configured
- [ ] Backup strategy implemented
- [ ] Rate limiting configured
- [ ] Resource limits set

## 🔧 Development

### Project Structure

```
job-url-analyzer/
├── src/job_url_analyzer/      # Main application code
│   ├── enricher/              # Enrichment providers
│   ├── main.py                # FastAPI application
│   ├── config.py              # Configuration
│   ├── models.py              # Pydantic models
│   ├── database.py            # Database models
│   ├── crawler.py             # Web crawler
│   ├── extractor.py           # Content extraction
│   ├── orchestrator.py        # Main orchestrator
│   └── report_generator.py    # Report generation
├── tests/                     # Test suite
├── alembic/                   # Database migrations
├── kubernetes/                # K8s manifests
├── monitoring/                # Monitoring configs
├── docker-compose.yml         # Development setup
├── docker-compose.prod.yml    # Production setup
└── Dockerfile                 # Container definition
```
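### Custom Enrichment Providers

The `enricher/` package holds the pluggable enrichment providers. As a hypothetical sketch only (class and method names below are illustrative, not the project's actual interface — see `src/job_url_analyzer/enricher/` for the real one), a custom provider typically exposes an async lookup that returns extra profile fields:

```python
# Hypothetical provider sketch — names are illustrative, not the real interface.
from abc import ABC, abstractmethod
from typing import Any

import httpx


class EnrichmentProvider(ABC):
    """Base shape for a pluggable enrichment provider."""

    name: str

    @abstractmethod
    async def enrich(self, company_name: str) -> dict[str, Any]:
        """Return extra profile fields for the company, or {} on miss."""


class MyApiProvider(EnrichmentProvider):
    name = "my_api"

    def __init__(self, api_key: str) -> None:
        self._api_key = api_key

    async def enrich(self, company_name: str) -> dict[str, Any]:
        async with httpx.AsyncClient(timeout=10.0) as client:
            resp = await client.get(
                "https://api.example.com/companies",  # hypothetical endpoint
                params={"q": company_name},
                headers={"Authorization": f"Bearer {self._api_key}"},
            )
        if resp.status_code != 200:
            return {}  # fail soft: enrichment is best-effort
        data = resp.json()
        return {
            "employee_count": data.get("employees"),
            "industry": data.get("industry"),
        }
```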

### Code Quality

The project uses:

- **Black** for code formatting
- **Ruff** for linting
- **MyPy** for type checking
- **Pre-commit** hooks for quality gates

```bash
# Setup pre-commit
poetry run pre-commit install

# Run quality checks
poetry run black .
poetry run ruff check .
poetry run mypy src/
```

## 📝 Recent Changes

### Dependency Updates

- **Fixed**: Replaced the non-existent `aiohttp-robotparser` dependency with `robotexclusionrulesparser` for robots.txt parsing
- **Improved**: Setup now works out of the box without requiring `.env` file configuration

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`poetry run pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🆘 Support

- **Documentation**: This README and inline code comments
- **Issues**: GitHub Issues for bug reports and feature requests
- **Discussions**: GitHub Discussions for questions and community

---

**Built with ❤️ using FastAPI, SQLAlchemy, and modern Python tooling.**
