# kivv Automation Worker

Daily automation worker for paper discovery, summarization, and storage.

## Overview

The automation worker runs daily at **6 AM UTC** via a Cloudflare cron trigger. It processes all active users, fetches papers from arXiv matching their topics, generates AI summaries, and stores the results in the database.

### Key Features

- **Checkpointed Execution**: Resumes from the last successful user on failure
- **Budget Enforcement**: Stops at the $1/day circuit breaker
- **Two-Stage AI**: Haiku triage (fast) + Sonnet summaries (quality)
- **Rate Limiting**: Respects arXiv (1 req/3s) and Anthropic (5 req/s) limits
- **Error Resilience**: Continues processing other users/papers on errors
- **Multi-User Support**: Processes all active users independently

## Architecture

### Workflow

```
1. Load checkpoint (resume if exists)
2. Fetch all active users from database
3. For each user:
   a. Get enabled topics
   b. Build combined arXiv query (OR all topics)
   c. Search arXiv for papers (last 24 hours)
   d. For each paper:
      - Check if exists (skip duplicates)
      - Triage with Haiku (relevance score)
      - Summarize with Sonnet (if relevant)
      - Store in database with user association
   e. Save checkpoint after each user
   f. Check budget (stop if >= $1)
4. Mark checkpoint as completed
5. Cleanup old checkpoints
```
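In TypeScript, the loop above might look like the following sketch. This is an illustration, not the worker's actual code: `getActiveUsers`, `processUser`, and `newCheckpoint` are hypothetical helpers, `loadCheckpoint`/`saveCheckpoint` are sketched under Checkpoint Structure below, and `Env` is the bindings interface shown under Configuration.

```typescript
// Sketch of the daily loop (hypothetical helpers; see notes above).
const DAILY_BUDGET_USD = 1.0;

async function runAutomation(env: Env): Promise<void> {
  const today = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  const checkpoint =
    (await loadCheckpoint(env.CACHE, today)) ?? newCheckpoint(today);

  const users = await getActiveUsers(env.DB);
  const resumeAfter = checkpoint.last_user_id ?? 0;

  // Resume: skip users already handled by a previous (failed) run today.
  for (const user of users.filter((u) => u.id > resumeAfter)) {
    try {
      const result = await processUser(env, user); // search, triage, summarize, store
      checkpoint.users_processed += 1;
      checkpoint.papers_found += result.found;
      checkpoint.papers_summarized += result.summarized;
      checkpoint.total_cost += result.cost;
      checkpoint.last_user_id = user.id;
    } catch (err) {
      // User-level failure: record it and move on to the next user.
      checkpoint.errors.push(`user ${user.id}: ${String(err)}`);
    }

    await saveCheckpoint(env.CACHE, checkpoint); // checkpoint after every user

    if (checkpoint.total_cost >= DAILY_BUDGET_USD) {
      console.log("[BUDGET] Daily budget ($1.00) exceeded, stopping automation");
      return; // circuit breaker; the saved checkpoint allows a later resume
    }
  }

  checkpoint.completed = true;
  await saveCheckpoint(env.CACHE, checkpoint);
}
```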
### Checkpoint Structure

Stored in KV under the key `checkpoint:automation:{YYYY-MM-DD}`:

```typescript
interface Checkpoint {
  date: string;              // YYYY-MM-DD
  users_processed: number;   // Count of users completed
  papers_found: number;      // Total papers found from arXiv
  papers_summarized: number; // Total papers successfully summarized
  total_cost: number;        // Total AI cost in USD
  errors: string[];          // Array of error messages
  last_user_id?: number;     // Last successfully processed user ID
  completed: boolean;        // True when all users processed
}
```

**TTL**: 7 days (auto-expires)
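A minimal persistence sketch for this structure, assuming the `CACHE` KV binding from wrangler.toml; the function names are illustrative, not necessarily those in the codebase:

```typescript
// Illustrative checkpoint persistence via Workers KV (binding: CACHE).
const CHECKPOINT_TTL_SECONDS = 7 * 24 * 60 * 60; // 7 days, then KV auto-expires

function checkpointKey(date: string): string {
  return `checkpoint:automation:${date}`; // e.g. checkpoint:automation:2025-11-30
}

async function loadCheckpoint(kv: KVNamespace, date: string): Promise<Checkpoint | null> {
  // "json" tells KV to parse the stored value; a miss returns null.
  return kv.get<Checkpoint>(checkpointKey(date), "json");
}

async function saveCheckpoint(kv: KVNamespace, cp: Checkpoint): Promise<void> {
  await kv.put(checkpointKey(cp.date), JSON.stringify(cp), {
    expirationTtl: CHECKPOINT_TTL_SECONDS,
  });
}
```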
## Configuration

### wrangler.toml

```toml
# Cron trigger: Daily at 6 AM UTC
[triggers]
crons = ["0 6 * * *"]

# Bindings
[[d1_databases]]
binding = "DB"
database_name = "kivv-db"
database_id = "1e80f2bf-462d-4d51-8002-a4cf26013933"

[[kv_namespaces]]
binding = "CACHE"
id = "7f6b7437931c4c268c27d01a4169101b"

[[r2_buckets]]
binding = "PAPERS"
bucket_name = "kivv-papers"
```

### Environment Variables

Set via `wrangler secret put`:

```bash
# Required
wrangler secret put CLAUDE_API_KEY

# Optional (for manual trigger)
wrangler secret put CRON_SECRET
```
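These bindings arrive in the Worker as the `env` parameter. A plausible `Env` typing for them (the actual interface in the codebase may differ slightly):

```typescript
// Worker bindings as declared in wrangler.toml above.
interface Env {
  DB: D1Database;         // [[d1_databases]] — kivv-db
  CACHE: KVNamespace;     // [[kv_namespaces]] — checkpoints
  PAPERS: R2Bucket;       // [[r2_buckets]] — kivv-papers
  CLAUDE_API_KEY: string; // wrangler secret put CLAUDE_API_KEY
  CRON_SECRET?: string;   // optional; guards the manual /run endpoint
}
```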

## Endpoints

### 1. Cron Trigger (Automatic)

- **Trigger**: Daily at 6 AM UTC
- **Authentication**: Automatic via Cloudflare
- **Action**: Runs the full automation workflow

### 2. Health Check

```
GET /health
```

**Response**:

```json
{
  "status": "ok",
  "service": "kivv-automation",
  "timestamp": "2025-11-30T06:00:00.000Z"
}
```

### 3. Manual Trigger

```
POST /run
Authorization: Bearer {CRON_SECRET}
```

**Response** (success):

```json
{
  "success": true,
  "message": "Automation completed successfully",
  "timestamp": "2025-11-30T06:00:00.000Z"
}
```

**Response** (error):

```json
{
  "success": false,
  "error": "Budget exceeded",
  "timestamp": "2025-11-30T06:00:00.000Z"
}
```
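Wiring for these endpoints might look like the sketch below, assuming the `Env` interface above and the hypothetical `runAutomation` orchestrator from the Workflow section:

```typescript
// Sketch of the Worker entrypoint: cron handler plus the HTTP endpoints above.
export default {
  // Invoked by the "0 6 * * *" cron trigger.
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext) {
    console.log(`[CRON] Starting daily automation at ${new Date().toISOString()}`);
    ctx.waitUntil(runAutomation(env)); // hypothetical orchestrator (see Workflow)
  },

  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const now = new Date().toISOString();

    if (url.pathname === "/health") {
      return Response.json({ status: "ok", service: "kivv-automation", timestamp: now });
    }

    if (url.pathname === "/run" && request.method === "POST") {
      // The manual trigger requires a Bearer token matching CRON_SECRET.
      const auth = request.headers.get("Authorization");
      if (!env.CRON_SECRET || auth !== `Bearer ${env.CRON_SECRET}`) {
        return new Response("Unauthorized", { status: 401 });
      }
      try {
        await runAutomation(env);
        return Response.json({
          success: true,
          message: "Automation completed successfully",
          timestamp: now,
        });
      } catch (err) {
        return Response.json(
          { success: false, error: String(err), timestamp: now },
          { status: 500 },
        );
      }
    }

    return new Response("Not found", { status: 404 });
  },
} satisfies ExportedHandler<Env>;
```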
## Database Operations

### Papers Table

```sql
INSERT INTO papers (
  arxiv_id, title, authors, abstract, categories,
  published_date, pdf_url, summary, summary_generated_at,
  summary_model, relevance_score, content_hash,
  collected_for_user_id, created_at
) VALUES (...)
```

**Deduplication**: A unique constraint on `arxiv_id` prevents duplicates.

### User Paper Status Table

```sql
INSERT INTO user_paper_status (
  user_id, paper_id, explored, bookmarked, created_at
) VALUES (...)
```

Creates the initial status entry for each user-paper pair.
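An illustrative insert path using D1 prepared statements; the column list mirrors the SQL above, while `StoredPaper` and the UNIQUE-violation check are assumptions for the sketch:

```typescript
// Illustrative D1 insert. Instead of querying first, it relies on the
// UNIQUE constraint on arxiv_id and treats a violation as "duplicate".
interface StoredPaper {
  arxivId: string; title: string; authors: string; abstract: string;
  categories: string; publishedDate: string; pdfUrl: string;
  summary: string; summaryGeneratedAt: string; summaryModel: string;
  relevanceScore: number; contentHash: string; userId: number;
}

async function storePaper(db: D1Database, p: StoredPaper): Promise<boolean> {
  try {
    await db
      .prepare(
        `INSERT INTO papers (
           arxiv_id, title, authors, abstract, categories,
           published_date, pdf_url, summary, summary_generated_at,
           summary_model, relevance_score, content_hash,
           collected_for_user_id, created_at
         ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14)`,
      )
      .bind(
        p.arxivId, p.title, p.authors, p.abstract, p.categories,
        p.publishedDate, p.pdfUrl, p.summary, p.summaryGeneratedAt,
        p.summaryModel, p.relevanceScore, p.contentHash,
        p.userId, new Date().toISOString(),
      )
      .run();
    return true; // newly stored
  } catch (err) {
    // Assumption for the sketch: D1 reports SQLite constraint violations
    // with "UNIQUE" in the message; a duplicate arxiv_id is not an error.
    if (String(err).includes("UNIQUE")) return false;
    throw err; // anything else is a real failure
  }
}
```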
## Cost Optimization

### Budget Breakdown

- **Haiku Triage**: ~$0.00025 per paper
- **Sonnet Summary**: ~$0.006 per paper (only for relevant papers)
- **Daily Budget**: $1.00 maximum

### Example Costs

| Papers | Relevant (70%) | Haiku Cost | Sonnet Cost | Total  |
|--------|----------------|------------|-------------|--------|
| 100    | 70             | $0.025     | $0.420      | $0.445 |
| 200    | 140            | $0.050     | $0.840      | $0.890 |
| 225    | 157            | $0.056     | $0.942      | $0.998 |

**Savings**: Triaging a paper with Haiku costs ~4% of a full Sonnet summary, so each paper filtered out by triage is ~96% cheaper than summarizing it outright.
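The table's totals follow directly from the per-paper rates; this small estimator, using the approximate rates above (not exact API pricing), reproduces the arithmetic:

```typescript
// Rough cost model matching the table above: every paper gets a Haiku
// triage (~$0.00025); only relevant papers get a Sonnet summary (~$0.006).
const HAIKU_TRIAGE_USD = 0.00025;
const SONNET_SUMMARY_USD = 0.006;

function estimateDailyCost(papers: number, relevantRate = 0.7): number {
  const triage = papers * HAIKU_TRIAGE_USD;
  const summaries = Math.floor(papers * relevantRate) * SONNET_SUMMARY_USD;
  return triage + summaries;
}

// estimateDailyCost(225) ≈ 0.056 + 0.942 ≈ $0.998 — just under the $1 ceiling.
```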
## Error Handling

### User-Level Errors

- **Behavior**: Log the error, save the checkpoint, continue with the next user
- **Logged**: Added to `checkpoint.errors[]`
- **Examples**: User has no enabled topics, API timeout

### Paper-Level Errors

- **Behavior**: Log the error, continue with the next paper
- **Not Logged**: Not added to the checkpoint (too granular)
- **Examples**: Summarization failed, database insert error

### Circuit Breakers

1. **Budget Exceeded**: Stop at $1.00, save the checkpoint
2. **Rate Limits**: Enforced automatically by the clients (arXiv, Anthropic)

## Monitoring

### Logs

All logs include structured context:

```
[CRON] Starting daily automation at 2025-11-30T06:00:00.000Z
[AUTOMATION] Processing 2 users
[USER:jeff] Processing 3 topics
[USER:jeff] Search query: (cat:cs.AI AND safety) OR (cat:cs.LG)
[USER:jeff] Found 25 papers from arXiv
[PAPER:2311.12345] Stored with summary (relevance: 0.85)
[USER:jeff] Completed: 25 found, 18 summarized, $0.1234 cost
[CHECKPOINT] Saved: 1 users, $0.1234 cost
[BUDGET] Daily budget ($1.00) exceeded, stopping automation
[AUTOMATION] ===== SUMMARY =====
[AUTOMATION] Users processed: 2
[AUTOMATION] Papers found: 50
[AUTOMATION] Papers summarized: 35
[AUTOMATION] Total cost: $0.4567
[AUTOMATION] Errors: 0
```

### Cloudflare Dashboard

- View cron trigger history
- See execution logs and errors
- Monitor worker performance

## Testing

### Run Tests

```bash
bun test tests/integration/automation.test.ts
```

### Test Coverage

- ✅ Checkpoint creation and loading
- ✅ Resume from `last_user_id`
- ✅ Budget tracking and circuit breaker
- ✅ Error handling (user-level and paper-level)
- ✅ Authentication (cron header and Bearer token)
- ✅ User processing and topic queries
- ✅ Paper deduplication

## Deployment

### Initial Deployment

```bash
# Navigate to the automation directory
cd automation

# Deploy to Cloudflare
wrangler deploy

# Set secrets
wrangler secret put CLAUDE_API_KEY
wrangler secret put CRON_SECRET  # Optional
```

### Update Deployment

```bash
# After code changes
wrangler deploy
```

### View Logs

```bash
# Tail live logs
wrangler tail

# Or view logs in the dashboard:
# https://dash.cloudflare.com -> Workers & Pages -> kivv-automation -> Logs
```

## Troubleshooting

### Checkpoint Not Resuming

- **Symptom**: Worker restarts from the beginning after a failure
- **Cause**: Checkpoint not saved or expired
- **Solution**: Check the KV namespace binding, verify the checkpoint TTL (7 days)

### Budget Exceeded Too Quickly

- **Symptom**: Worker stops after processing only a few users
- **Cause**: Too many papers passing the relevance triage (threshold too low)
- **Solution**: Lower per-topic `max_papers_per_day`, or raise the relevance threshold (e.g. to 0.8)

### Papers Not Appearing in Database

- **Symptom**: Papers found but not stored
- **Cause**: Papers failing relevance triage, or database errors
- **Solution**: Check the logs for triage scores, verify the database schema and bindings

### arXiv Rate Limit Errors

- **Symptom**: HTTP 503 responses from arXiv
- **Cause**: Too many requests too quickly
- **Solution**: The ArxivClient already enforces 1 req/3s plus jitter (see the sketch below); check for duplicate searches
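For reference, the pacing the ArxivClient applies is roughly equivalent to the limiter below; the real client lives in `../shared/` and its details (including the jitter window) may differ:

```typescript
// Minimal pacing sketch: at most one request every 3 seconds, plus a small
// random jitter so retries don't align in lockstep.
class RequestPacer {
  private nextAllowedAt = 0;

  constructor(
    private minIntervalMs = 3_000, // 1 request / 3 s
    private maxJitterMs = 500,     // assumed jitter window for this sketch
  ) {}

  /** Resolves once it is safe to issue the next request. */
  async wait(): Promise<void> {
    const delay = Math.max(0, this.nextAllowedAt - Date.now());
    this.nextAllowedAt = Date.now() + delay + this.minIntervalMs;
    await new Promise((resolve) =>
      setTimeout(resolve, delay + Math.random() * this.maxJitterMs),
    );
  }
}

// Usage (sequential callers):
//   const pacer = new RequestPacer();
//   await pacer.wait();
//   const res = await fetch("https://export.arxiv.org/api/query?search_query=...");
```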
## Future Enhancements

- [ ] **Email Notifications**: Daily summary report to users
- [ ] **PDF Download**: Store papers in the R2 bucket
- [ ] **Advanced Filtering**: Category-specific relevance thresholds
- [ ] **Retry Logic**: Exponential backoff for transient failures
- [ ] **Analytics**: Cost breakdown per topic, user engagement metrics
- [ ] **Smart Scheduling**: Process users at different times based on activity

## Related Documentation

- [Main README](../README.md) - Project overview
- [MCP Server](../mcp-server/README.md) - MCP tools and RSS feed
- [Database Schema](../shared/schema.sql) - D1 table structure
- [Shared Utilities](../shared/) - ArxivClient, SummarizationClient