# ClickUp MCP Server - ELK Stack & OpenTelemetry Integration
## Mission
Integrate the CloudFlare Workers-hosted ClickUp MCP Server with a home lab ELK Stack for comprehensive observability, including OpenTelemetry traces, metrics, and R2 audit log ingestion.
## Current Infrastructure Context
### ClickUp MCP Server (CloudFlare Workers)
- **Location**: CloudFlare Workers (remote SaaS)
- **Architecture**: OAuth 2.0 multi-tenant, 72 tools across 12 categories
- **Existing Telemetry**:
- ✅ Response metadata (tool_name, execution_time_ms, rate_limit, retry telemetry)
- ✅ R2 Audit Logging (all user actions logged to CloudFlare R2 bucket)
- ✅ Automatic retry with exponential backoff
- ✅ Rate limiting (100 req/min free, 500 req/min premium)
- ✅ CloudFlare Workers Analytics (basic request metrics)
**Audit Log Structure (R2):**
```json
{
"request_id": "uuid",
"timestamp": "2025-11-14T12:34:56Z",
"user_id": "user_xxx",
"tier": "free|premium",
"tool_name": "clickup_task_create",
"tool_category": "task",
"duration_ms": 450,
"clickup_api_calls": 2,
"retry_attempts": 1,
"status": "success|error",
"error_code": "RATE_LIMIT|...",
"rate_limit": { "remaining": 95, "limit": 100, "reset_at": 1699963200 },
"user_agent": "Claude/Desktop",
"ip_address": "1.2.3.4"
}
```
**R2 Bucket Structure:**
```
AUDIT_BUCKET/
├── audit/
│ ├── 2025-11-14/
│ │ ├── user_xxx/
│ │ │ ├── request-uuid-1.json
│ │ │ ├── request-uuid-2.json
│ └── 2025-11-15/
│ └── user_yyy/
│ └── request-uuid-3.json
```
### Home Lab ELK Stack
- **Location**: On-premise home lab, accessible via Tailnet
- **IP Address**: `100.108.180.29:9200` (rtx-ubuntu Elasticsearch)
- **Components**:
- Elasticsearch (already running)
- Kibana (for visualization)
- Existing dashboards and indices
- **Network**: Tailscale VPN (Tailnet)
- **Experience Level**: User already familiar with ELK operations
### Networking Context
- **Tailnet**: Private Tailscale network connecting home lab
- **CloudFlare Workers**: Public edge workers (no direct access to Tailnet)
- **Challenge**: CloudFlare Workers cannot directly reach Tailnet resources
- **Solution Options**: CloudFlare Tunnels, Reverse Proxy, or Pull-based ingestion
## Architecture Goals
### 1. OpenTelemetry → Elastic APM
**Goal**: Send distributed traces from CloudFlare Workers to home lab Elastic APM
**Key Requirements:**
- CloudFlare Workers → OpenTelemetry → Elastic APM
- Distributed tracing: User request → Worker → ClickUp API
- Custom spans for retry attempts, rate limiting, cache hits
- Metrics: request duration, error rates, tool usage distribution
- Support for both Free and Premium tier analytics
**Architecture Challenge:**
```
CloudFlare Workers (public edge)
↓ Need HTTPS endpoint
↓ Cannot reach Tailnet directly
↓
[How to reach?]
↓
Elastic APM (home lab @ 100.108.180.29:9200)
```
### 2. R2 Audit Logs → Elasticsearch
**Goal**: Ingest structured audit logs from CloudFlare R2 into Elasticsearch for querying/dashboards
**Key Requirements:**
- Periodic ingestion (every 5-15 minutes)
- Structured mapping for efficient queries
- Retention policy (e.g., 90 days)
- Support for time-series analysis
- Index pattern: `clickup-audit-YYYY.MM.DD`
**Architecture Challenge:**
```
CloudFlare R2 Bucket (audit logs)
↓ Need to pull or push?
↓
[Ingestion mechanism?]
↓
Elasticsearch (home lab)
```
## Architecture Options (WITH TRADE-OFFS)
### Option 1: CloudFlare Tunnel → Home Lab (RECOMMENDED ⭐)
**Concept**: Use CloudFlare Tunnel to expose home lab Elastic APM endpoint
```bash
# On home lab (rtx-ubuntu):
cloudflared tunnel create elastic-apm
cloudflared tunnel route dns elastic-apm apm.yourdomain.com
cloudflared tunnel run elastic-apm \
--url http://100.108.180.29:9200 \
--protocol http2
# Result: CloudFlare Workers can POST to https://apm.yourdomain.com
```
**Pros:**
- ✅ No firewall changes needed
- ✅ TLS/HTTPS automatically handled by CloudFlare
- ✅ Workers can directly POST traces
- ✅ No exposed public IP
**Cons:**
- ❌ Requires domain name ($10-15/yr)
- ❌ Additional CloudFlare Tunnel daemon running
- ❌ ~50-100ms latency added
**Cost**: ~$12/yr (domain) + $0 (tunnel is free)
### Option 2: Tailscale Funnel (EXPERIMENTAL)
**Concept**: Expose Elastic APM via Tailscale Funnel (public HTTPS proxy)
```bash
# On home lab:
tailscale serve https:443 / http://100.108.180.29:9200
tailscale funnel 443 on
# Result: Public HTTPS endpoint like: https://rtx-ubuntu.ts.net
```
**Pros:**
- ✅ No domain required
- ✅ No additional daemons (uses existing Tailscale)
- ✅ TLS handled by Tailscale
**Cons:**
- ❌ Experimental feature (beta)
- ❌ Rate limits on Funnel endpoints
- ❌ Public URL (security consideration)
**Cost**: $0
### Option 3: Separate Collector + Pull Ingestion (PRODUCTION-GRADE)
**Concept**: Use intermediate OTLP collector, home lab pulls data
```
CloudFlare Workers
↓ HTTPS
CloudFlare R2 (OTLP logs) + R2 (audit logs)
↓ Pull via Wrangler
Home Lab Filebeat/Logstash
↓
Elasticsearch
```
**Implementation:**
```bash
# Home lab cron job (every 5 minutes)
*/5 * * * * /home/user/scripts/ingest-clickup-logs.sh
# ingest-clickup-logs.sh
#!/bin/bash
# Pull latest audit logs from R2
wrangler r2 object list clickup-audit --prefix "audit/$(date +%Y-%m-%d)" \
| jq -r '.[] | .Key' \
| while read key; do
wrangler r2 object get "clickup-audit/$key" | \
curl -X POST http://100.108.180.29:9200/clickup-audit/_doc \
-H "Content-Type: application/json" \
-d @-
done
```
**Pros:**
- ✅ No inbound connections to home lab
- ✅ Works with strict firewall rules
- ✅ Decouples Workers from home lab availability
- ✅ Can batch process for efficiency
**Cons:**
- ❌ 5-15 minute delay for logs
- ❌ Requires scheduled job
- ❌ Cannot do real-time tracing (OpenTelemetry)
**Cost**: $0
### Option 4: Hybrid (RECOMMENDED FOR YOU) ⭐⭐
**Best of both worlds:**
```
OpenTelemetry Traces:
CloudFlare Workers → CloudFlare Tunnel → Elastic APM (real-time)
Audit Logs:
R2 → Scheduled Pull → Elasticsearch (5-min delay)
```
**Why this is best for your use case:**
- ✅ Real-time distributed tracing for debugging
- ✅ Batch ingestion for audit logs (more efficient)
- ✅ No dependency on home lab uptime for Workers
- ✅ Secure (Tunnel handles auth)
- ✅ Cost-effective
## Implementation Tasks
### Phase A: CloudFlare Tunnel Setup (Est: 1-2 hours)
**Prerequisites:**
- Domain name (or subdomain from existing domain)
- Home lab with Docker or systemd
- CloudFlare account
**Steps:**
1. Install cloudflared on home lab (rtx-ubuntu)
```bash
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb
```
2. Create tunnel
```bash
cloudflared tunnel login
cloudflared tunnel create elastic-apm
cloudflared tunnel route dns elastic-apm apm.yourdomain.com
```
3. Configure tunnel (create config.yml)
```yaml
tunnel: <TUNNEL-ID>
credentials-file: /home/user/.cloudflared/<TUNNEL-ID>.json
ingress:
- hostname: apm.yourdomain.com
service: http://100.108.180.29:9200
originRequest:
noTLSVerify: true
- service: http_status:404
```
4. Run as systemd service
```bash
sudo cloudflared service install
sudo systemctl enable cloudflared
sudo systemctl start cloudflared
```
5. Test endpoint
```bash
curl https://apm.yourdomain.com/_cluster/health
# Should return Elasticsearch cluster health
```
### Phase B: OpenTelemetry Integration in Workers (Est: 2-3 hours)
**File: `src/telemetry/otel.ts`**
```typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
export interface TelemetryConfig {
serviceName: string;
serviceVersion: string;
environment: 'production' | 'development';
elasticApmUrl: string; // https://apm.yourdomain.com
enabled: boolean;
}
export class WorkerTelemetry {
private tracer: Tracer;
private provider: WebTracerProvider;
constructor(config: TelemetryConfig) {
if (!config.enabled) {
// Return noop tracer if disabled
this.tracer = trace.getTracer('noop');
return;
}
// Create resource with service metadata
const resource = Resource.default().merge(
new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: config.serviceName,
[SemanticResourceAttributes.SERVICE_VERSION]: config.serviceVersion,
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: config.environment,
'telemetry.sdk.name': 'opentelemetry',
'telemetry.sdk.language': 'javascript',
'telemetry.sdk.version': '1.0.0'
})
);
// Create OTLP exporter pointing to Elastic APM via CloudFlare Tunnel
const exporter = new OTLPTraceExporter({
url: `${config.elasticApmUrl}/v1/traces`,
headers: {
'Content-Type': 'application/json',
// Add authentication if Elastic APM requires it
// 'Authorization': `Bearer ${env.ELASTIC_APM_TOKEN}`
},
timeoutMillis: 5000
});
// Create provider with batch processor (efficient for edge workers)
this.provider = new WebTracerProvider({ resource });
this.provider.addSpanProcessor(new BatchSpanProcessor(exporter, {
maxQueueSize: 100,
maxExportBatchSize: 10,
scheduledDelayMillis: 500 // Batch every 500ms
}));
this.provider.register();
this.tracer = trace.getTracer(config.serviceName, config.serviceVersion);
}
/**
* Trace an MCP tool execution
*/
async traceTool<T>(
toolName: string,
userId: string,
tier: 'free' | 'premium',
fn: () => Promise<T>
): Promise<T> {
const span = this.tracer.startSpan('mcp.tool.execute', {
attributes: {
'mcp.tool.name': toolName,
'mcp.tool.category': this.extractCategory(toolName),
'user.id': userId,
'user.tier': tier,
'service.name': 'clickup-mcp-server'
}
});
const startTime = Date.now();
try {
const result = await fn();
// Add success metrics
span.setAttributes({
'mcp.execution.duration_ms': Date.now() - startTime,
'mcp.execution.status': 'success'
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
// Record error details
span.recordException(error);
span.setAttributes({
'mcp.execution.duration_ms': Date.now() - startTime,
'mcp.execution.status': 'error',
'mcp.error.code': error.code || 'UNKNOWN',
'mcp.error.message': error.message
});
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
}
/**
* Trace a ClickUp API request (nested span)
*/
async traceClickUpRequest<T>(
method: string,
path: string,
fn: () => Promise<T>
): Promise<T> {
const span = this.tracer.startSpan('clickup.api.request', {
attributes: {
'http.method': method,
'http.url': path,
'http.target': path,
'service.name': 'clickup-api'
}
});
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
span.recordException(error);
span.setAttributes({
'http.status_code': error.status || 500
});
span.setStatus({ code: SpanStatusCode.ERROR });
throw error;
} finally {
span.end();
}
}
/**
* Trace retry attempts
*/
traceRetry(attemptNumber: number, delayMs: number, error: string): void {
const span = this.tracer.startSpan('mcp.retry.attempt', {
attributes: {
'mcp.retry.attempt': attemptNumber,
'mcp.retry.delay_ms': delayMs,
'mcp.retry.error': error
}
});
span.end();
}
private extractCategory(toolName: string): string {
// Extract category from tool name: clickup_task_create → task
const match = toolName.match(/^clickup_([^_]+)_/);
return match ? match[1] : 'unknown';
}
/**
* Flush all pending spans (call on worker shutdown)
*/
async flush(): Promise<void> {
await this.provider.forceFlush();
}
}
```
**Integration into BaseClickUpService:**
```typescript
// src/services/clickup/base.ts
import { WorkerTelemetry } from '../telemetry/otel.js';
export class BaseClickUpService {
protected telemetry?: WorkerTelemetry;
setTelemetry(telemetry: WorkerTelemetry) {
this.telemetry = telemetry;
}
protected async makeRequest<T>(fn: () => Promise<T>): Promise<T> {
// Existing retry logic...
// Wrap with telemetry if available
if (this.telemetry) {
return await this.telemetry.traceClickUpRequest(
requestMethod,
requestPath,
() => this.executeWithRetry(fn, `${requestMethod} ${requestPath}`)
);
}
return await this.executeWithRetry(fn, `${requestMethod} ${requestPath}`);
}
}
```
**Worker Entry Point:**
```typescript
// src/worker.ts
import { WorkerTelemetry } from './telemetry/otel.js';
export default {
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
// Initialize telemetry
const telemetry = new WorkerTelemetry({
serviceName: 'clickup-mcp-server',
serviceVersion: '1.0.0',
environment: env.ENVIRONMENT || 'production',
elasticApmUrl: env.ELASTIC_APM_URL, // https://apm.yourdomain.com
enabled: env.ENABLE_TELEMETRY === 'true'
});
// Inject into services
clickUpServices.task.setTelemetry(telemetry);
try {
// Wrap entire request in trace
return await telemetry.traceTool(
toolName,
userId,
userTier,
async () => {
// Your existing request handling
return handleMcpRequest(request, env);
}
);
} finally {
// Flush traces before worker dies
ctx.waitUntil(telemetry.flush());
}
}
};
```
### Phase C: R2 Audit Log Ingestion (Est: 1-2 hours)
**File: `/home/user/scripts/ingest-clickup-audit-logs.sh`**
```bash
#!/bin/bash
set -euo pipefail
# Configuration
R2_BUCKET="clickup-audit"
ES_HOST="http://100.108.180.29:9200"
ES_INDEX_PREFIX="clickup-audit"
WRANGLER_CONFIG="/home/user/.wrangler/config.json"
STATE_FILE="/var/lib/clickup-audit-ingest/last_sync"
LOG_FILE="/var/log/clickup-audit-ingest.log"
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "Starting audit log ingestion"
# Get last sync timestamp (or default to 1 hour ago)
if [ -f "$STATE_FILE" ]; then
LAST_SYNC=$(cat "$STATE_FILE")
else
LAST_SYNC=$(date -u -d '1 hour ago' +'%Y-%m-%dT%H:%M:%SZ')
fi
log "Last sync: $LAST_SYNC"
# Get current date for prefix
TODAY=$(date -u +'%Y-%m-%d')
# List new objects in R2 bucket
NEW_OBJECTS=$(wrangler r2 object list "$R2_BUCKET" \
--prefix "audit/$TODAY/" \
--config "$WRANGLER_CONFIG" \
| jq -r --arg last "$LAST_SYNC" \
'.[] | select(.uploaded > $last) | .Key')
if [ -z "$NEW_OBJECTS" ]; then
log "No new audit logs found"
exit 0
fi
log "Found $(echo "$NEW_OBJECTS" | wc -l) new audit logs"
# Process each object
echo "$NEW_OBJECTS" | while read -r key; do
log "Processing: $key"
# Download object from R2
audit_data=$(wrangler r2 object get "$R2_BUCKET/$key" \
--config "$WRANGLER_CONFIG" \
--pipe)
# Extract date for index name
log_date=$(echo "$audit_data" | jq -r '.timestamp' | cut -d'T' -f1)
index_name="${ES_INDEX_PREFIX}-${log_date}"
# Index into Elasticsearch
curl -s -X POST "$ES_HOST/$index_name/_doc" \
-H "Content-Type: application/json" \
-d "$audit_data" \
> /dev/null
if [ $? -eq 0 ]; then
log "✓ Indexed: $key → $index_name"
else
log "✗ Failed to index: $key"
fi
done
# Update sync timestamp
date -u +'%Y-%m-%dT%H:%M:%SZ' > "$STATE_FILE"
log "Ingestion complete"
```
**Elasticsearch Index Template:**
```bash
# Create index template for audit logs
curl -X PUT "http://100.108.180.29:9200/_index_template/clickup-audit" \
-H "Content-Type: application/json" \
-d '{
"index_patterns": ["clickup-audit-*"],
"template": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"index.lifecycle.name": "clickup-audit-lifecycle",
"index.lifecycle.rollover_alias": "clickup-audit"
},
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"request_id": { "type": "keyword" },
"user_id": { "type": "keyword" },
"tier": { "type": "keyword" },
"tool_name": { "type": "keyword" },
"tool_category": { "type": "keyword" },
"duration_ms": { "type": "long" },
"clickup_api_calls": { "type": "integer" },
"retry_attempts": { "type": "integer" },
"status": { "type": "keyword" },
"error_code": { "type": "keyword" },
"rate_limit": {
"properties": {
"remaining": { "type": "integer" },
"limit": { "type": "integer" },
"reset_at": { "type": "long" }
}
},
"user_agent": { "type": "text" },
"ip_address": { "type": "ip" }
}
}
}
}'
```
**Cron Job Setup:**
```bash
# Add to crontab (every 5 minutes)
crontab -e
# Add line:
*/5 * * * * /home/user/scripts/ingest-clickup-audit-logs.sh
```
### Phase D: Kibana Dashboards (Est: 1 hour)
**Dashboard 1: Tool Usage Overview**
- Line chart: Requests per tool over time
- Pie chart: Tool category distribution
- Bar chart: Top 10 most used tools
- Metric: Total requests today
- Metric: Unique users today
**Dashboard 2: Performance Monitoring**
- Line chart: p50, p95, p99 latency over time
- Heatmap: Latency by tool and hour
- Bar chart: Slowest tools
- Metric: Average request duration
- Metric: Error rate
**Dashboard 3: User Analytics**
- Line chart: DAU/MAU trend
- Pie chart: Free vs Premium users
- Table: Top users by request count
- Bar chart: Premium feature usage
**Dashboard 4: Error Analysis**
- Line chart: Error rate over time
- Bar chart: Errors by type
- Table: Recent errors with context
- Metric: Retry success rate
**Saved Searches:**
```
# Failed requests in last hour
status:error AND timestamp:[now-1h TO now]
# Slow requests (>1s)
duration_ms:>1000
# Rate limit events
error_code:RATE_LIMIT
# Premium user activity
tier:premium
# Tool-specific errors
tool_name:clickup_task_create AND status:error
```
## Security Considerations
### 1. CloudFlare Tunnel Authentication
```yaml
# config.yml - Add authentication
ingress:
- hostname: apm.yourdomain.com
service: http://100.108.180.29:9200
originRequest:
noTLSVerify: true
# Add basic auth or JWT validation
httpHostHeader: "apm.yourdomain.com"
```
### 2. Elasticsearch Security
```bash
# Enable Elasticsearch authentication
# /etc/elasticsearch/elasticsearch.yml
xpack.security.enabled: true
xpack.security.http.ssl.enabled: false
# Create dedicated user for ingestion
curl -X POST "http://100.108.180.29:9200/_security/user/clickup-ingest" \
-H "Content-Type: application/json" \
-d '{
"password": "STRONG_PASSWORD_HERE",
"roles": ["clickup_ingest_role"]
}'
# Create role with write permissions
curl -X POST "http://100.108.180.29:9200/_security/role/clickup_ingest_role" \
-H "Content-Type: application/json" \
-d '{
"indices": [{
"names": ["clickup-audit-*"],
"privileges": ["write", "create_index"]
}]
}'
```
### 3. Rate Limiting on Tunnel
```yaml
# Protect against abuse
ingress:
- hostname: apm.yourdomain.com
service: http://100.108.180.29:9200
originRequest:
# CloudFlare automatically rate limits
# But add explicit limits if needed
connectTimeout: 10s
tlsTimeout: 10s
```
## Testing & Validation
### Test 1: CloudFlare Tunnel Connectivity
```bash
# From any machine (not home lab)
curl https://apm.yourdomain.com/_cluster/health
# Should return: {"status":"green",...}
```
### Test 2: OpenTelemetry Trace Flow
```bash
# Make a test request to Workers
curl -X POST https://your-worker.workers.dev/mcp \
-H "Authorization: Bearer $JWT_TOKEN" \
-d '{"method":"tools/list"}'
# Check Kibana APM UI
# Should see trace: User Request → Worker → ClickUp API
```
### Test 3: R2 Audit Log Ingestion
```bash
# Manually run ingestion script
sudo /home/user/scripts/ingest-clickup-audit-logs.sh
# Query Elasticsearch
curl "http://100.108.180.29:9200/clickup-audit-$(date +%Y.%m.%d)/_search?size=1" | jq
# Should return recent audit log
```
### Test 4: End-to-End Workflow
```bash
# 1. Create a task via MCP
# 2. Check CloudFlare Workers logs for OTel export
# 3. Check Kibana APM for distributed trace
# 4. Wait 5 minutes for R2 ingestion cron
# 5. Query Elasticsearch for audit log
# 6. View Kibana dashboard for metrics
```
## Troubleshooting Guide
### Issue: CloudFlare Tunnel not connecting
```bash
# Check tunnel status
cloudflared tunnel info elastic-apm
# Check logs
sudo journalctl -u cloudflared -f
# Common fix: Restart tunnel
sudo systemctl restart cloudflared
```
### Issue: OpenTelemetry traces not appearing in Kibana
```bash
# Check Worker logs for export errors
wrangler tail
# Verify APM endpoint is reachable
curl https://apm.yourdomain.com/v1/traces \
-X POST \
-H "Content-Type: application/json" \
-d '{"test":"data"}'
# Check Elastic APM server logs
# (Elastic APM server must be configured separately)
```
### Issue: R2 audit logs not ingesting
```bash
# Check cron job is running
grep "ingest-clickup-audit-logs" /var/log/syslog
# Manually run script with verbose output
bash -x /home/user/scripts/ingest-clickup-audit-logs.sh
# Verify R2 bucket access
wrangler r2 object list clickup-audit --prefix "audit/"
# Check Elasticsearch index exists
curl "http://100.108.180.29:9200/_cat/indices/clickup-audit-*"
```
### Issue: High latency from Workers to home lab
```bash
# Measure tunnel latency
time curl https://apm.yourdomain.com/_cluster/health
# If >200ms, consider:
# 1. Use CloudFlare Argo (paid feature for faster routing)
# 2. Increase OTel batch delay (less frequent exports)
# 3. Switch to pull-based ingestion for non-critical traces
```
## Performance Optimization
### 1. Batch OpenTelemetry Exports
```typescript
// Increase batch size and delay
new BatchSpanProcessor(exporter, {
maxQueueSize: 200, // Double the queue
maxExportBatchSize: 20, // Larger batches
scheduledDelayMillis: 1000 // Export every 1s instead of 500ms
})
```
### 2. Elasticsearch Index Lifecycle Management
```bash
# Create ILM policy (90-day retention)
curl -X PUT "http://100.108.180.29:9200/_ilm/policy/clickup-audit-lifecycle" \
-H "Content-Type: application/json" \
-d '{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d"
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}'
```
### 3. Efficient R2 Ingestion
```bash
# Optimize: Only sync last 24 hours
# Add to ingestion script:
YESTERDAY=$(date -u -d '1 day ago' +'%Y-%m-%d')
TODAY=$(date -u +'%Y-%m-%d')
# Sync both days (in case of overnight runs)
for date in $YESTERDAY $TODAY; do
wrangler r2 object list "$R2_BUCKET" --prefix "audit/$date/" | ...
done
```
## Cost Analysis
**Estimated Monthly Costs:**
| Item | Cost |
|------|------|
| Domain (amortized) | $1/mo |
| CloudFlare Tunnel | $0 (free) |
| CloudFlare Workers | $0-5 (within free tier for moderate usage) |
| R2 Storage (1GB audit logs) | $0.015/mo |
| R2 Class B Operations (ingestion) | ~$0.05/mo |
| Electricity (home lab) | ~$10-20/mo (existing) |
| **Total** | **~$11-26/mo** |
**Comparison to SaaS APM:**
- Elastic Cloud APM: $95/mo
- Honeycomb: $100/mo (after free tier)
- Datadog APM: $31/host/mo
- **Savings**: ~$70-90/mo by self-hosting
## Success Criteria
- [ ] CloudFlare Tunnel established and accessible
- [ ] OpenTelemetry traces appearing in Kibana APM
- [ ] R2 audit logs ingesting every 5 minutes
- [ ] Kibana dashboards created and populated
- [ ] End-to-end trace: User → Worker → ClickUp API visible
- [ ] 90-day retention policy active
- [ ] Authentication secured
- [ ] Documentation complete
## Next Steps After Implementation
1. **Create Alerts**: Set up Kibana alerts for:
- Error rate >5%
- p95 latency >1s
- Rate limit events
- Authentication failures
2. **Optimize Queries**: Create saved searches for common debugging scenarios
3. **Add Custom Metrics**: Extend telemetry to track:
- Cache hit rates
- Premium feature usage
- Cost per user (ClickUp API calls)
4. **Capacity Planning**: Monitor storage growth and adjust retention policy
5. **Integration with Grafana**: If you prefer Grafana, connect to Elasticsearch as data source
## Reference Resources
- **CloudFlare Tunnel Docs**: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/
- **OpenTelemetry JS**: https://opentelemetry.io/docs/instrumentation/js/
- **Elastic APM**: https://www.elastic.co/guide/en/apm/guide/current/index.html
- **Wrangler R2 CLI**: https://developers.cloudflare.com/r2/api/workers/workers-api-reference/
- **Elasticsearch Index Templates**: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html
## Questions to Clarify
1. **Domain**: Do you already have a domain? Or need to register one?
2. **Elastic APM Server**: Is Elastic APM server already running on 100.108.180.29? Or need to set it up?
3. **Network Security**: Are there firewall rules on rtx-ubuntu that need adjustment?
4. **Retention**: Is 90 days good for audit logs? Or need longer?
5. **Real-time vs Batch**: For OpenTelemetry, is real-time critical? Or can we batch more aggressively?
---
## Implementation Order
**Week 1:**
1. Set up CloudFlare Tunnel (2 hours)
2. Test connectivity from public internet (30 min)
3. Verify Elastic APM server is ready (1 hour)
**Week 2:**
4. Implement OpenTelemetry in Workers (3 hours)
5. Test end-to-end tracing (1 hour)
6. Deploy to production Workers (30 min)
**Week 3:**
7. Create R2 ingestion script (2 hours)
8. Set up cron job (30 min)
9. Create Elasticsearch index template (30 min)
10. Test ingestion manually (1 hour)
**Week 4:**
11. Build Kibana dashboards (2 hours)
12. Create saved searches and alerts (1 hour)
13. Documentation and training (1 hour)
**Total Time Estimate**: ~15-18 hours
---
**Status**: Ready for implementation. All architecture decisions documented. Proceed with Phase A (CloudFlare Tunnel Setup).