Stateful MCP Server on ECS Fargate

by bharisagar

This repository is an end-to-end practical test of a production question:

Can a stateful MCP server survive ECS Fargate force deployments?

The final answer from the live test is:

Yes, but Redis-backed tool state alone is not enough. The MCP Streamable HTTP transport session registry must also stop depending on one container's memory. In this implementation we use stateless Streamable HTTP plus Redis-backed logical session state.

This project was based on the experiment from:

https://github.com/AvinashDalvi89/stateful-mcp-on-ecs-fargate-example

We extended it, deployed it on AWS, reproduced the problem, fixed it, force deployed again, captured AWS Console evidence, and then tore the resources down.

The Problem

ECS Fargate tasks are disposable. During a force deployment or rolling deployment:

  1. ECS starts new replacement tasks.

  2. The new tasks register with the ALB target group.

  3. Old tasks are deregistered and enter draining.

  4. ECS sends SIGTERM to old containers.

  5. Old container memory disappears.

If an MCP server keeps session data only in process memory, the session can break when the client is routed to a new task.

ALB sticky sessions can delay this problem, but they do not solve it. When a target is draining, unhealthy, or removed, the ALB can route the client to a different task.
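The failure mode described above can be reproduced without AWS. The sketch below is a simulation only: `Task`, `handle`, and the dict-based stores are hypothetical stand-ins for illustration, not code from this repository.

```python
class Task:
    """Stand-in for one Fargate task; `store` maps session ID -> session data."""

    def __init__(self, name, store=None):
        self.name = name
        # Default: a private dict that disappears together with the task.
        self.store = {} if store is None else store

    def handle(self, session_id, key, value):
        """Write one value and return how many keys the session now holds."""
        session = self.store.setdefault(session_id, {})
        session[key] = value
        return len(session)


# In-memory state: each task owns a private dict.
old_task = Task("task-a")
old_task.handle("s1", "k1", "v1")
new_task = Task("task-b")  # replacement task after a deployment
in_memory_size = new_task.handle("s1", "k2", "v2")  # earlier write is gone

# External state: both tasks share one store (a dict standing in for Redis).
shared = {}
old_task = Task("task-a", store=shared)
old_task.handle("s1", "k1", "v1")
new_task = Task("task-b", store=shared)
shared_size = new_task.handle("s1", "k2", "v2")  # earlier write survives

print(in_memory_size, shared_size)
```

With private stores the replacement task sees only its own write; with the shared store it sees both.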

What We Tested

The original experiment used:

  • FastMCP server on ECS Fargate.

  • Streamable HTTP endpoint at /mcp.

  • ALB with sticky sessions.

  • ECS rolling deployment.

  • In-memory session store.

  • A client that repeatedly calls MCP tools.

We added:

  • ElastiCache Redis.

  • Redis-backed MCP tool state.

  • Health endpoint proof showing the active session backend.

  • Stateless Streamable HTTP mode.

  • Client-generated logical Mcp-Session-Id.

  • Evidence capture from AWS Console, CloudWatch, ECS, ALB, and test-client logs.

Key Discovery

The first Redis attempt still failed.

Redis moved our application/tool state out of the task, but FastMCP's stateful Streamable HTTP transport still kept active MCP transport sessions in process memory. When traffic moved to a new ECS task, the new task did not recognize the old Mcp-Session-Id and returned:

Session not found

That error happened before our tool handler ran, which means Redis-backed tool state was necessary but not sufficient.
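A minimal simulation of why shared tool state alone did not help. `StatefulTransport` is a hypothetical model of a per-task transport session registry, not FastMCP's actual internals; the shared dict stands in for Redis.

```python
class SessionNotFound(Exception):
    """Raised when a task does not recognize the transport session ID."""


class StatefulTransport:
    def __init__(self, tool_state):
        self.active_sessions = set()  # lives in this task's memory only
        self.tool_state = tool_state  # shared across tasks (Redis stand-in)
        self.handler_calls = 0

    def initialize(self, session_id):
        self.active_sessions.add(session_id)

    def call_tool(self, session_id, key, value):
        # The transport check runs before any tool handler is reached.
        if session_id not in self.active_sessions:
            raise SessionNotFound("Session not found")
        self.handler_calls += 1
        self.tool_state.setdefault(session_id, {})[key] = value


redis_like = {}
task_a = StatefulTransport(redis_like)
task_a.initialize("s1")
task_a.call_tool("s1", "k1", "v1")

task_b = StatefulTransport(redis_like)  # replacement task, empty registry
try:
    task_b.call_tool("s1", "k2", "v2")
    outcome = "ok"
except SessionNotFound as exc:
    outcome = str(exc)

print(outcome, task_b.handler_calls)
```

The tool state written by the first task is still present in the shared store, but the second task rejects the request before its handler can read it.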

The working solution was:

MCP_STATELESS_HTTP=true
+ Redis-backed logical session state
+ client-provided Mcp-Session-Id

Final Architecture

Client
  -> Application Load Balancer
  -> ECS Fargate service
       -> Task A: FastMCP server
       -> Task B: FastMCP server
  -> ElastiCache Redis

Redis runs outside the Fargate task, so its data survives when a task is replaced. This is important.

Do not run Redis as a sidecar inside the same Fargate task for this use case. A sidecar Redis container dies with the task and does not solve deployment replacement.

Final Result

During the successful force-deployment test:

{
  "http_status_counts": {
    "200": 141
  },
  "unique_task_ids": [
    "eb83d8d37aa448758abe33e410d17864",
    "8454af6040484b64b252adf5d0448fff"
  ],
  "first_task_id": "eb83d8d37aa448758abe33e410d17864",
  "last_task_id": "8454af6040484b64b252adf5d0448fff",
  "max_state_size": 70,
  "session_not_found_count": 0,
  "error_rows": 0,
  "session_complete": true
}

This shows:

  • The client crossed from one ECS task to another.

  • The session continued after task replacement.

  • Redis session state grew to 70 keys (max_state_size), consistent with the 70-call client run.

  • All 141 HTTP responses were 200.

  • There were zero Session not found errors.

  • The session completed during ECS deployment replacement.
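A summary of this shape could be derived from the client's JSONL log roughly as follows. This is a sketch, not the repository's parser: the row fields `status`, `task_id`, `state_size`, and `error` are assumed names, the `session_complete` rule is a guess, and the sample rows are synthetic.

```python
import json
from collections import Counter


def summarize(rows, expected_calls):
    """Reduce per-request client rows to a deployment-test summary."""
    task_ids = [r["task_id"] for r in rows if r.get("task_id")]
    errors = [r.get("error") or "" for r in rows]
    max_state = max((r.get("state_size", 0) for r in rows), default=0)
    error_rows = sum(1 for e in errors if e)
    return {
        "http_status_counts": dict(Counter(str(r["status"]) for r in rows)),
        "unique_task_ids": list(dict.fromkeys(task_ids)),  # first-seen order
        "first_task_id": task_ids[0] if task_ids else None,
        "last_task_id": task_ids[-1] if task_ids else None,
        "max_state_size": max_state,
        "session_not_found_count": sum("Session not found" in e for e in errors),
        "error_rows": error_rows,
        # Assumed rule: no errors and state reached the expected call count.
        "session_complete": error_rows == 0 and max_state >= expected_calls,
    }


# Synthetic rows for illustration only; not taken from the evidence files.
sample = "\n".join([
    '{"status": 200, "task_id": "task-a", "state_size": 1}',
    '{"status": 200, "task_id": "task-a", "state_size": 2}',
    '{"status": 200, "task_id": "task-b", "state_size": 3}',
])
rows = [json.loads(line) for line in sample.splitlines() if line.strip()]
summary = summarize(rows, expected_calls=3)
print(json.dumps(summary, indent=2))
```

Seeing two entries in `unique_task_ids` with zero error rows is the signal that the session crossed a task boundary cleanly.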

What Changed In The Code

Redis And Memory Session Stores

src/session_store.py now contains:

  • SessionStore: common interface.

  • InMemorySessionStore: local/demo backend.

  • RedisSessionStore: shared backend for ECS tasks.

  • create_session_store(): selects backend from environment.

Backend selection:

REDIS_URL set     -> RedisSessionStore
REDIS_URL missing -> InMemorySessionStore
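The selection rule above might look like the following sketch. The class bodies are simplified placeholders rather than the contents of src/session_store.py; the `name` attribute, the `"memory"` label, the `client()` helper, and the `env` parameter are assumptions made for illustration. The Redis connection uses redis-py and is created lazily, so the selection logic runs without a Redis server.

```python
import os


class InMemorySessionStore:
    name = "memory"  # assumed label for the local/demo backend

    def __init__(self):
        self._sessions = {}


class RedisSessionStore:
    name = "redis"

    def __init__(self, url, ttl_seconds):
        self.url = url
        self.ttl_seconds = ttl_seconds
        self._client = None  # connect on first use, not at construction

    def client(self):
        if self._client is None:
            import redis  # third-party redis-py package

            self._client = redis.Redis.from_url(self.url, decode_responses=True)
        return self._client


def create_session_store(env=None):
    """Pick the backend from the environment: REDIS_URL set means Redis."""
    env = os.environ if env is None else env
    url = (env.get("REDIS_URL") or "").strip()
    if url:
        ttl = int(env.get("SESSION_TTL_SECONDS", "86400"))
        return RedisSessionStore(url, ttl)
    return InMemorySessionStore()


local_store = create_session_store({})
shared_store = create_session_store({"REDIS_URL": "redis://localhost:6379/0"})
print(local_store.name, shared_store.name)
```

Treating an empty `REDIS_URL` the same as a missing one is a design choice in this sketch; it avoids trying to connect to an empty URL when the variable is defined but blank.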

Stateless Streamable HTTP

src/server.py reads:

MCP_STATELESS_HTTP=true

When enabled, FastMCP starts with:

mcp.http_app(stateless_http=True)

This avoids depending on a per-task in-memory Streamable HTTP transport session registry.
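One way to read the flag is sketched below. `env_flag` and `TRUE_VALUES` are hypothetical helpers, not names from src/server.py, and the set of accepted truthy strings is an assumption.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, env=None):
    """Interpret an environment variable as a boolean."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


# In the server this would feed the call quoted above, for example:
#   app = mcp.http_app(stateless_http=env_flag("MCP_STATELESS_HTTP"))
stateless = env_flag("MCP_STATELESS_HTTP", env={"MCP_STATELESS_HTTP": "true"})
print(stateless)
```

Defaulting to `False` keeps the original stateful behavior unless the deployment opts in explicitly.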

Logical MCP Session ID

In stateless mode, the server does not issue a transport session ID. The test client generates a stable logical session ID:

client-<uuid>

It sends this value on every tool call:

Mcp-Session-Id: client-...

FastMCP exposes that header through ctx.session_id, and our tools use it as the Redis key.
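On the client side, generating and reusing the logical ID could be sketched like this. The function names are hypothetical and the real test client may differ; the `Accept` value reflects what Streamable HTTP clients are expected to send.

```python
import uuid


def new_logical_session_id():
    """Create a stable ID once per logical session, in the client-<uuid> form."""
    return f"client-{uuid.uuid4()}"


def build_headers(session_id):
    """Headers sent with every tool call so any task can find the session."""
    return {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
        "Mcp-Session-Id": session_id,
    }


session_id = new_logical_session_id()
first_call = build_headers(session_id)
later_call = build_headers(session_id)  # same ID, even after task replacement
print(first_call["Mcp-Session-Id"] == later_call["Mcp-Session-Id"])
```

The important property is that the ID is generated once and reused; generating a new one per request would give every call its own empty session.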

Lazy Session Creation

set_session_value() creates the logical Redis session on first write if the key does not exist.

get_session_state() still fails for a session that has never been written, so a read cannot silently succeed against a session that does not exist.
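The write and read semantics can be illustrated with a plain dict in place of Redis. The function names come from the text above, but the signatures, the `store` parameter, and the `UnknownSession` exception are assumptions for this sketch.

```python
class UnknownSession(KeyError):
    """Raised when reading a session that no write has created yet."""


def set_session_value(store, session_id, key, value):
    # First write creates the session; later writes add to it.
    session = store.setdefault(session_id, {})
    session[key] = value
    return len(session)


def get_session_state(store, session_id):
    # Reads never create a session as a side effect.
    if session_id not in store:
        raise UnknownSession(session_id)
    return dict(store[session_id])


store = {}
try:
    get_session_state(store, "client-123")
    read_before_write = "ok"
except UnknownSession:
    read_before_write = "unknown session"

size = set_session_value(store, "client-123", "step", 1)
state = get_session_state(store, "client-123")
print(read_before_write, size, state)
```

Lazy creation matters in stateless mode because there is no initialize step on which the server could create the session up front.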

Health Endpoint

/health now returns the active backend:

{
  "status": "healthy",
  "active_sessions": 0,
  "session_store": "redis"
}

AWS Infrastructure

sam/infrastructure.yaml creates:

  • VPC

  • Public subnets

  • Private subnets

  • NAT gateway

  • Application Load Balancer

  • ALB target group

  • ALB sticky sessions

  • ECR repository

  • CloudWatch log groups

  • ElastiCache Redis

  • Redis security group allowing inbound traffic only from ECS tasks

sam/ecs.yaml creates:

  • ECS cluster

  • ECS task execution role

  • Fargate task definition

  • ECS service

  • ALB service attachment

The container receives:

REDIS_URL=redis://<elasticache-endpoint>:6379/0
SESSION_TTL_SECONDS=86400
MCP_STATELESS_HTTP=true

Repository Layout

.
├── Dockerfile
├── Makefile
├── README.md
├── evidence/
│   ├── README.md
│   ├── screenshots/
│   ├── 08-health.json
│   ├── 12-client-during-force-deploy.jsonl
│   ├── 21-client-stateless-force-deploy.jsonl
│   ├── 25-stateless-client-summary.json
│   └── 28-cloudwatch-tail.txt
├── sam/
│   ├── infrastructure.yaml
│   └── ecs.yaml
├── src/
│   ├── server.py
│   ├── session_store.py
│   ├── tools.py
│   ├── health.py
│   └── shutdown.py
└── test_client/
    └── test_client.py

Prerequisites

  • AWS CLI configured with permissions for ECS, ECR, CloudFormation, EC2, ELB, CloudWatch Logs, IAM, and ElastiCache.

  • Docker Desktop or Docker Engine.

  • Python 3.12+.

  • AWS region used in this test: ap-south-1.

SAM is optional: the Makefile uses sam deploy, but this test was also run directly with aws cloudformation deploy.

Deploy

1. Deploy Infrastructure

make deploy-infra

AWS CLI equivalent:

aws cloudformation deploy \
  --region ap-south-1 \
  --template-file sam/infrastructure.yaml \
  --stack-name mcp-infrastructure \
  --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
  --no-fail-on-empty-changeset

2. Build And Push Image

make build IMAGE_TAG=redis

3. Deploy ECS

make deploy-ecs IMAGE_TAG=redis

4. Verify Health

curl http://<ALB_DNS_NAME>/health

Expected:

{
  "status": "healthy",
  "session_store": "redis"
}
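The same check can be scripted, which is handy before starting a long client run. This is a sketch using only the standard library; `fetch_health` and `is_redis_backed` are hypothetical helper names, and the base URL must be supplied by the caller.

```python
import json
import urllib.request


def fetch_health(base_url, timeout=5):
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def is_redis_backed(payload):
    """True only if the task is healthy and reports the Redis backend."""
    return (
        payload.get("status") == "healthy"
        and payload.get("session_store") == "redis"
    )


# Offline example using the expected payload shown above.
expected = {"status": "healthy", "session_store": "redis"}
print(is_redis_backed(expected))
```

Against a live deployment, `is_redis_backed(fetch_health("http://<ALB_DNS_NAME>"))` would combine the two steps. A healthy task that reports a non-Redis backend should be treated as a failed check, since it would fall back to per-task memory.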

Run The Force Deployment Test

Start the client:

python test_client/test_client.py \
  --endpoint http://<ALB_DNS_NAME>/mcp \
  --calls 70 \
  --delay 2

While the client is running, force a new ECS deployment:

aws ecs update-service \
  --region ap-south-1 \
  --cluster mcp-fargate-cluster \
  --service mcp-fargate-service \
  --force-new-deployment

Wait for service stability:

aws ecs wait services-stable \
  --region ap-south-1 \
  --cluster mcp-fargate-cluster \
  --services mcp-fargate-service

Then inspect:

evidence/21-client-stateless-force-deploy.jsonl
evidence/25-stateless-client-summary.json

Evidence

Evidence is included in evidence/.

Important files:

  • 08-health.json: live /health endpoint showing Redis mode.

  • 12-client-during-force-deploy.jsonl: Redis-only attempt that still hit transport-level session failure.

  • 21-client-stateless-force-deploy.jsonl: final successful stateless+Redis run.

  • 25-stateless-client-summary.json: parsed success summary.

  • 28-cloudwatch-tail.txt: ECS task logs from CloudWatch.

  • screenshots/: AWS Console screenshots.

See evidence/README.md for a detailed evidence map.

AWS Console Evidence Captured

The screenshots show:

  • ECR image pushed.

  • ECS task definition revision.

  • ECS service healthy before deployment test.

  • ECS logs from task.

  • Force new deployment menu.

  • Deployment in progress.

  • ALB target group draining old target.

  • ECR image after rebuild.

  • Revision 2 deployment in progress.

  • Revision 2 deployment success.

  • CloudFormation teardown in progress.

Teardown

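Delete the ECS stack first and wait for the deletion to finish, because the ECS service is attached to the ALB target group that the infrastructure stack owns:

aws cloudformation delete-stack \
  --region ap-south-1 \
  --stack-name mcp-ecs

aws cloudformation wait stack-delete-complete \
  --region ap-south-1 \
  --stack-name mcp-ecs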

Then delete infrastructure:

aws cloudformation delete-stack \
  --region ap-south-1 \
  --stack-name mcp-infrastructure

If CloudFormation cannot delete the ECR repository because it still contains images, list the images and then delete them by digest:

aws ecr list-images \
  --region ap-south-1 \
  --repository-name mcp-fargate-server

aws ecr batch-delete-image \
  --region ap-south-1 \
  --repository-name mcp-fargate-server \
  --image-ids imageDigest=<digest>

In the captured test run, teardown completed after deleting the remaining ECR images.

Production Notes

  • Use ElastiCache with Multi-AZ or MemoryDB for stronger production durability.

  • Enable encryption in transit and Redis authentication for production.

  • Put Redis in private subnets.

  • Allow Redis inbound traffic only from the ECS task security group.

  • Do not rely on ALB sticky sessions as your durability layer.

  • Keep MCP tool operations idempotent where possible.

  • For server-sent event resumability, consider external event storage as a separate concern.

Main Lesson

For MCP on ECS Fargate, there are two different kinds of state:

  1. Application/tool state.

  2. MCP transport/session-manager state.

Moving only application state to Redis can still fail if the transport session manager is stateful and in memory.

This repository demonstrates a practical ECS-safe pattern:

stateless Streamable HTTP + external Redis logical session state