# Experiment Design: AgentFS & FPF Benchmark
> **Context:** `BC:AgentFS-Benchmark-2026`
> **Type:** Experimental Design (Method + Plan + Metrics)
> **Status:** Draft

This document defines the "AgentFS vs Files" ablation study using FPF patterns to ensure the experiment is **executable** (unambiguous steps & metrics) and **relatable** (grounded in specific engineering tasks).
---
## 1. Experimental Context (`U.BoundedContext`)
We define a local context to lock the meaning of our terms and metrics.
* **Context ID:** `BC:AgentFS-Benchmark-2026`
* **Invariants:**
* All tasks must be performed on the same hardware/model class.
* "Time" is wall-clock time from prompt to final commit.
* "Help" is defined as any query to an external doc/LLM about the *framework mechanics* (not domain logic).
---
## 2. Metrics Definition (A.18 CSLC)
We define the **Characteristic/Scale/Level/Coordinate (CSLC)** standard for this experiment. This ensures "Quality" and "Speed" are not vague feelings but typed measurements.
| ID | Characteristic (`U.Characteristic`) | Scale (`U.Scale`) | Type | Unit / Range | Polarity | Definition |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **M1** | `TimeToCompletion` | `Seconds` | Ratio | `s` | Lower is better | Wall-clock duration from task start to valid commit/save. |
| **M2** | `HelpQuestionCount` | `Count` | Ratio | `1` | Lower is better | Number of explicit "How do I..." queries regarding FPF/AgentFS mechanics. |
| **M3** | `SubjectiveWorkload` | `NASA-TLX-Raw` | Interval | `0-100` | Lower is better | Self-reported cognitive load after task completion (unweighted NASA-TLX). |
| **M4** | `OutputQuality` | `Rubric-5pt` | Ordinal | `1-5` | Higher is better | Blind evaluation: 1=Broken, 3=Functional, 5=Idiomatic & Robust. |
| **M5** | `ConformancePass` | `Binary` | Nominal | `{Pass, Fail}` | Pass is better | Binary check against specific constraints (e.g., "Schema is 3NF", "Audit trail exists"). |
| **M6** | `ApplicabilityCoverage` | `Ratio-0-1` | Ratio | `0.0-1.0` | Higher is better | Fraction of sub-tasks completed without bypassing the assigned framework. |
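To show how the CSLC rows above can be carried as typed values rather than loose strings, here is a minimal Python sketch. The class and field names (`Metric`, `ScaleType`, `higher_is_better`) are illustrative assumptions, not part of the FPF vocabulary; only a few rows are shown.

```python
from dataclasses import dataclass
from enum import Enum

class ScaleType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"

@dataclass(frozen=True)
class Metric:
    """One CSLC row: a characteristic measured on a typed scale."""
    metric_id: str
    characteristic: str
    scale_type: ScaleType
    unit: str
    higher_is_better: bool  # encodes the Polarity column

    def better(self, a: float, b: float) -> bool:
        """True if coordinate `a` is preferable to `b` under this polarity."""
        return a > b if self.higher_is_better else a < b

# Subset of the table above (M1-M4), registered by ID.
METRICS = {
    "M1": Metric("M1", "TimeToCompletion", ScaleType.RATIO, "s", False),
    "M2": Metric("M2", "HelpQuestionCount", ScaleType.RATIO, "1", False),
    "M3": Metric("M3", "SubjectiveWorkload", ScaleType.INTERVAL, "0-100", False),
    "M4": Metric("M4", "OutputQuality", ScaleType.ORDINAL, "1-5", True),
}
```

Keeping polarity on the metric itself means downstream comparisons never have to remember whether "145.5 s" beats "300 s".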
---
## 3. Experimental Procedure (`U.MethodDescription`)
**Method:** `Method:Run-AgentFS-Ablation-v1`
**Roles Required:** `Subject#ParticipantRole`, `Reviewer#ObserverRole`.
### Step 1: Condition Assignment (Randomized Crossover)
Each participant is assigned a randomized ordering of all four conditions (crossover design), so every participant works under every condition and order effects can be counterbalanced across participants.
* **C0 (Control):** Raw Files (JSON/Markdown). No AgentFS, no Schema.
* **C1 (Hybrid):** AgentFS treating artifacts as Files (Blob storage).
* **C2 (Native):** AgentFS + FPF-Native Schema (Typed tables, Indexes).
* **C3 (Stress):** C2 + Turso MVCC (Multi-writer concurrency - *Optional/Advanced*).
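One way to counterbalance the condition order is a balanced Latin square (a Williams design), which works cleanly here because the number of conditions is even. This is a sketch of one standard construction, not a prescribed part of the method; the function names are illustrative.

```python
CONDITIONS = ["C0", "C1", "C2", "C3"]

def balanced_latin_square(items):
    """Balanced Latin square for an even number of conditions: every
    condition appears exactly once in every position, and each condition
    immediately follows every other condition exactly once across rows."""
    n = len(items)
    # Column offsets follow the pattern 0, 1, n-1, 2, n-2, ...
    offsets = [0]
    for k in range(1, n):
        offsets.append(k // 2 + 1 if k % 2 else n - k // 2)
    return [[items[(i + d) % n] for d in offsets] for i in range(n)]

def assign_sequence(participant_index, items=CONDITIONS):
    """Condition order for the participant's slot in the square."""
    square = balanced_latin_square(items)
    return square[participant_index % len(items)]
```

With four conditions, participants cycle through four orderings; e.g. the first participant runs `C0, C1, C3, C2`.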
### Step 2: Task Execution Suite (The "Work")
For each condition, perform the following tasks (`U.TaskSignature`):
* **T1 (Plan→Work Seam):**
* *Goal:* Create a "Planned Baseline" for a simple feature (e.g., "Add User Profile").
* *Constraint:* Must distinguish "Plan" (Intent) from "Work" (Actuals).
* *Measure:* M1, M2.
* **T2 (Iteration & Evolution):**
* *Goal:* Apply a change request (e.g., "Add 'Bio' field") that updates the baseline.
* *Constraint:* Maintain history/audit trail (no destructive overwrites of history).
* *Measure:* M1, M5 (Audit trail check).
* **T3 (Retrieval & Justification):**
* *Goal:* Answer "Why did we add the 'Bio' field?" using *only* stored system state.
* *Measure:* M4 (Quality of answer), M1.
### Step 3: Evaluation
* Reviewer blindly scores outputs using `Rubric-5pt` (M4).
* Compute `ApplicabilityCoverage` (M6) based on whether the participant had to "eject" to raw shell/manual hacks to finish.
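The M6 computation is simple enough to pin down in code. The sketch below assumes a hypothetical per-sub-task log recording whether the participant stayed inside the assigned framework; the record shape is an assumption, not part of the `U.Work` schema.

```python
def applicability_coverage(subtask_log):
    """M6: fraction of sub-tasks completed without 'ejecting' to raw
    shell/manual hacks. `subtask_log` maps sub-task id -> True if the
    participant stayed inside the assigned framework (assumed shape)."""
    if not subtask_log:
        return 0.0
    stayed = sum(1 for in_framework in subtask_log.values() if in_framework)
    return stayed / len(subtask_log)
```

For example, a run where T3 required dropping to the shell scores `applicability_coverage({"T1": True, "T2": True, "T3": False})`, i.e. 2/3.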
---
## 4. Planned Baseline (`SlotFillingsPlanItem` - A.15.3)
This artifact makes the experiment **executable**. It defines exactly *what* is being run, pinning versions to ensure reproducibility (Parity).
```fpf
SlotFillingsPlanItem := ⟨
kind = SlotFillingsPlanItem,
bounded_context_ref = U.BoundedContextRef(BC:AgentFS-Benchmark-2026),
path_slice_id = PathSliceId(P2W:bench-run-001),
Γ_time_selector = point(2026-01-10T12:00:00Z), // Explicit time, no "latest" magic
planned_fillings = [
// 1. The Tooling Stack (Pinning the "How")
⟨
slot_kind = ToolVersionSlot,
planned_filler = ByValue("agent-stack=v0.4.2; sqlite=3.45; turso-lib=0.1.0-beta")
⟩,
// 2. The Task Suite (Pinning the "What")
⟨
slot_kind = TaskSuiteSlot,
planned_filler = ByRef(TaskSuiteRef(BENCH:tasks-v1 @edition(E1)))
⟩,
// 3. The Measurement Standard (Pinning the "Ruler")
⟨
slot_kind = MetricSetSlot,
planned_filler = ByRef(MetricSetRef(BENCH:metrics-v1 @edition(E1) /* See §2 above */))
⟩,
// 4. The Quality Standard (Pinning the "Target")
⟨
slot_kind = RubricSlot,
planned_filler = ByRef(RubricRef(BENCH:quality-rubric-v1 @edition(E1)))
⟩
]
⟩
```
---
## 5. Execution Log Template (`U.Work`)
When the experiment runs, results are recorded as **Work** containing **Measures**.
```fpf
// Example Record for T1 under C2
U.Work {
id: "work-run-101",
performedBy: "UserAlice#ParticipantRole:BC:AgentFS-Benchmark-2026",
method: "Method:Run-AgentFS-Ablation-v1 / Step:T1",
condition: "C2 (AgentFS+Schema)",
// The Results (A.18 Coordinates)
measures: [
Measure(M1:TimeToCompletion, 145.5, s),
Measure(M2:HelpQuestionCount, 3, 1),
Measure(M3:SubjectiveWorkload, 45, 0-100),
Measure(M5:ConformancePass, Pass)
],
// Artifacts produced
outputs: [
EpistemeRef("agentfs://schema/user_profile_v1"),
EpistemeRef("agentfs://audit/plan_v1")
]
}
```
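Once several such work records exist, per-condition summaries for the ratio/interval metrics (M1-M3) reduce to a group-by and mean. A minimal sketch, assuming the `U.Work` records have been flattened into dicts with a `condition` key and a `measures` map; that flattening is an assumption, not a defined FPF serialization.

```python
from collections import defaultdict
from statistics import mean

def summarize_metric(runs, metric_id):
    """Average one metric per condition across a list of flattened work
    records, e.g. {"condition": "C2", "measures": {"M1": 145.5}}."""
    by_condition = defaultdict(list)
    for run in runs:
        if metric_id in run["measures"]:
            by_condition[run["condition"]].append(run["measures"][metric_id])
    return {cond: mean(vals) for cond, vals in by_condition.items()}
```

Note that M4 (ordinal) and M5 (nominal) should be reported as distributions or pass rates rather than means.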