# 🧠 MCP Server (Model Compute Paradigm)
A modular, production-ready FastAPI server built to route and orchestrate multiple AI/LLM-powered models behind a unified, scalable interface. It supports streaming chat, LLM-based routing, and multi-model pipelines (like analyze → summarize → recommend), all asynchronously and fully Dockerized.
## 🎯 Project Score (Production Readiness)
| Capability | Status | Details |
|------------|--------|---------|
| 🧠 Multi-Model Orchestration | ✅ Complete | Dynamic routing between `chat`, `sentiment`, `summarize`, `recommend` |
| 🤖 LLM-Based Task Router | ✅ Complete | GPT-powered routing via `auto` task type |
| 🔀 Async FastAPI + Concurrency | ✅ Complete | Async/await + concurrent task execution with simulated/model API delays |
| 🔄 GPT Streaming Support | ✅ Complete | `text/event-stream` chunked responses for chat endpoints |
| 🧪 Unit + Mocked API Tests | ✅ Complete | Pytest-based test suite with mocked `openai` responses |
| 🐳 Dockerized + Clean Layout | ✅ Complete | Python 3.13 base image, no Conda dependency, production-ready Dockerfile |
| 📦 Metadata-Driven Registry | ✅ Complete | Model metadata loaded from external YAML config |
| 🔁 Rate Limiting & Retry | ⏳ In Progress | Handles 429 retry loop; rate limiting controls WIP |
| 🧪 CI + Docs | ⏳ Next | GitHub Actions + Swagger/Redoc planned |
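The retry row above describes a standard exponential-backoff loop around the model call. The sketch below is illustrative only, assuming the `openai` v1 client; `call_model` and the parameter defaults are hypothetical, not the repo's actual implementation:

```python
import asyncio
import random

from openai import RateLimitError  # raised by the openai v1 client on HTTP 429


async def call_with_retry(call_model, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async model call on 429s with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return await call_model()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # Back off 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.5))
```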
## 🧩 Why This Project? (Motivation)
Modern ML/LLM deployments often involve:

- Multiple task types and model backends (OpenAI, HF, local, REST)
- Routing decisions based on input intent
- Combining outputs of multiple models (e.g., `summarize` + `recommend`)
- Handling 429 retries, async concurrency, and streaming responses
🔧 However, building such an LLM backend API server that is:

- Async + concurrent
- Streamable
- Pluggable (via metadata)
- Testable
- Dockerized

… is non-trivial and not easily found in one single place.
## 💡 What We've Built (Solution)
This repo is a production-ready PoC of an MCP (Model-Compute Paradigm) architecture:

- ✅ FastAPI-based microserver to handle multiple tasks via a `/task` endpoint
- ✅ Task router that can:
  - 🔀 Dispatch to specific model types (`chat`, `sentiment`, `summarize`, `recommend`)
  - 🤖 Use an LLM to infer which task to run (`auto`)
  - 🧠 Run multiple models in sequence (`analyze`)
- ✅ GPT streaming via `text/event-stream`
- ✅ Async/await-enabled architecture for concurrency
- ✅ Clean, modular code for easy extension
- ✅ Dockerized for deployment
- ✅ Tested using Pytest with mocking
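To make the routing modes concrete, here is a hypothetical client call against the `/task` endpoint. The JSON shape (`task` + `input` fields) is an assumption for illustration and may not match the actual request schema:

```python
import asyncio

import httpx


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # "auto" defers task selection to the LLM router; swapping in
        # "chat", "sentiment", "summarize", "recommend", or "analyze"
        # targets a specific model or the sequential pipeline instead.
        resp = await client.post(
            "http://localhost:8000/task",
            json={"task": "auto", "input": "Loved the product, but shipping was slow."},
        )
        print(resp.json())


asyncio.run(main())
```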
## 🛠️ Use Cases
| Use Case | MCP Server Support |
|----------|--------------------|
| Build your own ChatGPT-style API | ✅ `chat` task with streaming |
| Build an intelligent task router | ✅ `auto` task with GPT-powered intent parsing |
| Build AI pipelines (like RAG/RL) | ✅ `analyze` task with sequential execution |
| Swap between OpenAI/HuggingFace APIs | ✅ Via the YAML registry config |
| Add custom models (e.g., OCR, vision) | ✅ Just add a new module + registry entry |
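The last two rows lean on the external YAML registry mentioned in the score table. A hypothetical layout, with illustrative field names rather than the repo's actual schema, might look like:

```yaml
# Hypothetical registry layout -- field names are illustrative.
models:
  chat:
    provider: openai
    model: gpt-4o-mini
    streaming: true
  sentiment:
    provider: huggingface
    model: distilbert-base-uncased-finetuned-sst-2-english
  recommend:
    provider: local
    module: models.recommender
```

Under a scheme like this, swapping OpenAI for HuggingFace (or adding an OCR/vision model) becomes a config edit plus one new module.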
## 🚀 Features
- ✅ Async FastAPI server
- 🧠 Task-based Model Routing (`chat`, `sentiment`, `recommender`, `summarize`)
- 📋 Model Registry from YAML/JSON
- 🔁 Automatic Retry and Rate Limit Handling for APIs
- 🔄 Streaming Responses for Chat
- 🧪 Unit Tests + Mocked API Calls
- 🐳 Dockerized for production deployment
- 📦 Modular structure, ready for CI/CD
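As a closing illustration of the streaming feature, a minimal `text/event-stream` endpoint can be built with FastAPI's `StreamingResponse` and the `openai` v1 async client. The route name and model below are assumptions for the sketch, not the repo's actual code:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@app.get("/chat/stream")
async def chat_stream(prompt: str) -> StreamingResponse:
    async def event_stream():
        # Re-emit each GPT delta as a Server-Sent Event as it arrives.
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```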