# Experiment Design: AgentFS & FPF Benchmark
> **Context:** `BC:AgentFS-Benchmark-2026`
> **Type:** Experimental Design (Method + Plan + Metrics)
> **Status:** Draft

This document defines the "AgentFS vs Files" ablation study using FPF patterns to ensure the experiment is **executable** (unambiguous steps & metrics) and **relatable** (grounded in specific engineering tasks).
---
## 1. Experimental Context (`U.BoundedContext`)
We define a local context to lock the meaning of our terms and metrics.
* **Context ID:** `BC:AgentFS-Benchmark-2026`
* **Invariants:**
* All tasks must be performed on the same hardware/model class.
* "Time" is wall-clock time from prompt to final commit.
* "Help" is defined as any query to an external doc/LLM about the *framework mechanics* (not domain logic).
---
## 2. Metrics Definition (A.18 CSLC)
We define the **Characteristic/Scale/Level/Coordinate (CSLC)** standard for this experiment. This ensures "Quality" and "Speed" are not vague feelings but typed measurements.
| ID | Characteristic (`U.Characteristic`) | Scale (`U.Scale`) | Type | Unit / Range | Polarity | Definition |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **M1** | `TimeToCompletion` | `Seconds` | Ratio | `s` | Lower is better | Wall-clock duration from task start to valid commit/save. |
| **M2** | `HelpQuestionCount` | `Count` | Ratio | `1` | Lower is better | Number of explicit "How do I..." queries regarding FPF/AgentFS mechanics. |
| **M3** | `SubjectiveWorkload` | `NASA-TLX-Raw` | Interval | `0-100` | Lower is better | Self-reported cognitive load after task completion (unweighted NASA-TLX). |
| **M4** | `OutputQuality` | `Rubric-5pt` | Ordinal | `1-5` | Higher is better | Blind evaluation: 1=Broken, 3=Functional, 5=Idiomatic & Robust. |
| **M5** | `ConformancePass` | `Binary` | Nominal | `{Pass, Fail}` | Pass is better | Binary check against specific constraints (e.g., "Schema is 3NF", "Audit trail exists"). |
| **M6** | `ApplicabilityCoverage` | `Ratio-0-1` | Ratio | `0.0-1.0` | Higher is better | Fraction of sub-tasks completed without bypassing the assigned framework. |
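To show how the CSLC rows above can be carried as typed values rather than loose strings, here is a minimal Python sketch. The class and field names (`Metric`, `ScaleType`, `higher_is_better`) are illustrative assumptions, not part of the FPF vocabulary; only a few rows are shown.

```python
from dataclasses import dataclass
from enum import Enum

class ScaleType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"

@dataclass(frozen=True)
class Metric:
    """One CSLC row: a characteristic measured on a typed scale."""
    metric_id: str
    characteristic: str
    scale_type: ScaleType
    unit: str
    higher_is_better: bool  # encodes the Polarity column

    def better(self, a: float, b: float) -> bool:
        """True if coordinate `a` is preferable to `b` under this polarity."""
        return a > b if self.higher_is_better else a < b

# Subset of the table above (M1-M4), registered by ID.
METRICS = {
    "M1": Metric("M1", "TimeToCompletion", ScaleType.RATIO, "s", False),
    "M2": Metric("M2", "HelpQuestionCount", ScaleType.RATIO, "1", False),
    "M3": Metric("M3", "SubjectiveWorkload", ScaleType.INTERVAL, "0-100", False),
    "M4": Metric("M4", "OutputQuality", ScaleType.ORDINAL, "1-5", True),
}
```

Keeping polarity on the metric itself means downstream comparisons never have to remember whether "145.5 s" beats "300 s".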
---
## 3. Experimental Procedure (`U.MethodDescription`)
**Method:** `Method:Run-AgentFS-Ablation-v1`
**Roles Required:** `Subject#ParticipantRole`, `Reviewer#ObserverRole`.
### Step 1: Condition Assignment (Randomized Crossover)
Each participant is assigned a randomized ordering of all four conditions (crossover design), so every participant works under every condition and order effects can be counterbalanced across participants.
* **C0 (Control):** Raw Files (JSON/Markdown). No AgentFS, no Schema.
* **C1 (Hybrid):** AgentFS treating artifacts as Files (Blob storage).
* **C2 (Native):** AgentFS + FPF-Native Schema (Typed tables, Indexes).
* **C3 (Stress):** C2 + Turso MVCC (Multi-writer concurrency - *Optional/Advanced*).
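One way to counterbalance the condition order is a balanced Latin square (a Williams design), which works cleanly here because the number of conditions is even. This is a sketch of one standard construction, not a prescribed part of the method; the function names are illustrative.

```python
CONDITIONS = ["C0", "C1", "C2", "C3"]

def balanced_latin_square(items):
    """Balanced Latin square for an even number of conditions: every
    condition appears exactly once in every position, and each condition
    immediately follows every other condition exactly once across rows."""
    n = len(items)
    # Column offsets follow the pattern 0, 1, n-1, 2, n-2, ...
    offsets = [0]
    for k in range(1, n):
        offsets.append(k // 2 + 1 if k % 2 else n - k // 2)
    return [[items[(i + d) % n] for d in offsets] for i in range(n)]

def assign_sequence(participant_index, items=CONDITIONS):
    """Condition order for the participant's slot in the square."""
    square = balanced_latin_square(items)
    return square[participant_index % len(items)]
```

With four conditions, participants cycle through four orderings; e.g. the first participant runs `C0, C1, C3, C2`.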
### Step 2: Task Execution Suite (The "Work")
For each condition, perform the following tasks (`U.TaskSignature`):
* **T1 (Plan→Work Seam):**
* *Goal:* Create a "Planned Baseline" for a simple feature (e.g., "Add User Profile").
* *Constraint:* Must distinguish "Plan" (Intent) from "Work" (Actuals).
* *Measure:* M1, M2.
* **T2 (Iteration & Evolution):**
* *Goal:* Apply a change request (e.g., "Add 'Bio' field") that updates the baseline.
* *Constraint:* Maintain history/audit trail (no destructive overwrites of history).
* *Measure:* M1, M5 (Audit trail check).
* **T3 (Retrieval & Justification):**
* *Goal:* Answer "Why did we add the 'Bio' field?" using *only* stored system state.
* *Measure:* M4 (Quality of answer), M1.
### Step 3: Evaluation
* Reviewer blindly scores outputs using `Rubric-5pt` (M4).
* Compute `ApplicabilityCoverage` (M6) based on whether the participant had to "eject" to raw shell/manual hacks to finish.
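The M6 computation is simple enough to pin down in code. The sketch below assumes a hypothetical per-sub-task log recording whether the participant stayed inside the assigned framework; the record shape is an assumption, not part of the `U.Work` schema.

```python
def applicability_coverage(subtask_log):
    """M6: fraction of sub-tasks completed without 'ejecting' to raw
    shell/manual hacks. `subtask_log` maps sub-task id -> True if the
    participant stayed inside the assigned framework (assumed shape)."""
    if not subtask_log:
        return 0.0
    stayed = sum(1 for in_framework in subtask_log.values() if in_framework)
    return stayed / len(subtask_log)
```

For example, a run where T3 required dropping to the shell scores `applicability_coverage({"T1": True, "T2": True, "T3": False})`, i.e. 2/3.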
---
## 4. Planned Baseline (`SlotFillingsPlanItem` - A.15.3)
This artifact makes the experiment **executable**. It defines exactly *what* is being run, pinning versions to ensure reproducibility (Parity).
```fpf
SlotFillingsPlanItem := ⟨
kind = SlotFillingsPlanItem,
bounded_context_ref = U.BoundedContextRef(BC:AgentFS-Benchmark-2026),
path_slice_id = PathSliceId(P2W:bench-run-001),
Γ_time_selector = point(2026-01-10T12:00:00Z), // Explicit time, no "latest" magic
planned_fillings = [
// 1. The Tooling Stack (Pinning the "How")
⟨
slot_kind = ToolVersionSlot,
planned_filler = ByValue("agent-stack=v0.4.2; sqlite=3.45; turso-lib=0.1.0-beta")
⟩,
// 2. The Task Suite (Pinning the "What")
⟨
slot_kind = TaskSuiteSlot,
planned_filler = ByRef(TaskSuiteRef(BENCH:tasks-v1 @edition(E1)))
⟩,
// 3. The Measurement Standard (Pinning the "Ruler")
⟨
slot_kind = MetricSetSlot,
planned_filler = ByRef(MetricSetRef(BENCH:metrics-v1 @edition(E1) /* See §2 above */))
⟩,
// 4. The Quality Standard (Pinning the "Target")
⟨
slot_kind = RubricSlot,
planned_filler = ByRef(RubricRef(BENCH:quality-rubric-v1 @edition(E1)))
⟩
]
⟩
```
---
## 5. Execution Log Template (`U.Work`)
When the experiment runs, results are recorded as **Work** containing **Measures**.
```fpf
// Example Record for T1 under C2
U.Work {
id: "work-run-101",
performedBy: "UserAlice#ParticipantRole:BC:AgentFS-Benchmark-2026",
method: "Method:Run-AgentFS-Ablation-v1 / Step:T1",
condition: "C2 (AgentFS+Schema)",
// The Results (A.18 Coordinates)
measures: [
Measure(M1:TimeToCompletion, 145.5, s),
Measure(M2:HelpQuestionCount, 3, 1),
Measure(M3:SubjectiveWorkload, 45, 0-100),
Measure(M5:ConformancePass, Pass)
],
// Artifacts produced
outputs: [
EpistemeRef("agentfs://schema/user_profile_v1"),
EpistemeRef("agentfs://audit/plan_v1")
]
}
```
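Once several such work records exist, per-condition summaries for the ratio/interval metrics (M1-M3) reduce to a group-by and mean. A minimal sketch, assuming the `U.Work` records have been flattened into dicts with a `condition` key and a `measures` map; that flattening is an assumption, not a defined FPF serialization.

```python
from collections import defaultdict
from statistics import mean

def summarize_metric(runs, metric_id):
    """Average one metric per condition across a list of flattened work
    records, e.g. {"condition": "C2", "measures": {"M1": 145.5}}."""
    by_condition = defaultdict(list)
    for run in runs:
        if metric_id in run["measures"]:
            by_condition[run["condition"]].append(run["measures"][metric_id])
    return {cond: mean(vals) for cond, vals in by_condition.items()}
```

Note that M4 (ordinal) and M5 (nominal) should be reported as distributions or pass rates rather than means.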