Run an ablation-style, randomized crossover benchmark with 3–4 conditions (Files, AgentFS-as-files, AgentFS + FPF-native schema, optional Turso MVCC) over a fixed task suite. Measure (1) time to completion in seconds, (2) the number of clarifying/help questions, (3) output quality via a blinded rubric plus conformance checks, and (4) error rate/throughput under parallel writes. I’m using FPF Spec patterns A.15.3 SlotFillingsPlanItem (planned baseline with explicit Γ_time, no “latest”) and A.18 CSLC (Characteristic/Scale/Level/Coordinate) to define the metrics.
Assumptions
You can run identical prompts/tasks across conditions on the same model+hardware; your “complexity” proxy is help/clarification count + subjective workload (HF-Loop framing).
Model
Independent variables: the storage/state layer (Files vs AgentFS) and the structure (unstructured vs FPF-native schema), plus an optional write-concurrency mode (SQLite single-writer vs Turso MVCC). AgentFS provides a single SQLite file with a filesystem, KV store, and toolcall audit trail; Turso MVCC allows concurrent writes but aborts write-write conflicts at commit. Turso reports up to 4× write throughput in some multi-thread + compute workloads and claims elimination of SQLITE_BUSY (early testing; treat as provisional; accessed 2026-01-10).
Options
Ablation ladder (recommended): C0 Files; C1 AgentFS storing FPF artifacts as files; C2 AgentFS + FPF-native schema (typed tables + indexes); C3 = C2 + Turso MVCC for parallel-writer stress (note MVCC preview limitations such as no CREATE INDEX; accessed 2026-01-10).
Pick
Do C0/C1/C2 for “speed/complexity/quality/applicability”; add C3 only if you truly need multi-writer (multi-agent / multi-thread) and can live with MVCC-preview constraints. AgentFS is explicitly ALPHA (dev/testing only; accessed 2026-01-10), so treat all results as “experiment-grade,” not production-grade.
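A minimal sketch of the condition matrix as a frozen config (field names and values are illustrative, not AgentFS/Turso API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One arm of the ablation ladder; all fields are pinned up front."""
    cid: str              # C0..C3
    storage: str          # "files" | "agentfs"
    schema: str           # "unstructured" | "fpf_native"
    concurrency: str      # "single_writer" | "mvcc"

# Illustrative ladder; the exact backend flags are assumptions, not real API surface.
CONDITIONS = [
    Condition("C0", "files",   "unstructured", "single_writer"),
    Condition("C1", "agentfs", "unstructured", "single_writer"),
    Condition("C2", "agentfs", "fpf_native",   "single_writer"),
    Condition("C3", "agentfs", "fpf_native",   "mvcc"),  # only if multi-writer is required
]
```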
Tests
Task suite (minimum that actually exercises the claimed benefits):
T1 Plan→Work seam: create a planned baseline, then execute and record variance (forces “no backfill”).
T2 Iteration: apply a small new constraint, update baseline correctly (new PlanItem edition), keep audit trail intact.
T3 Retrieval: answer “why did we choose X?” using stored state only (tests auditability/queryability).
T4 Concurrency stress (only if C3): N=8 writer threads for 60 s, each doing small read+compute+write transactions; record throughput and conflict/error rates (Turso’s win-zone per their benchmark narrative).
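For T4, a harness sketch in Python against plain SQLite (the single-writer case); the DB path, key space, and error handling are assumptions, and the C3 variant would swap in a Turso/MVCC connection with its own conflict signalling:

```python
import random
import sqlite3
import threading
import time

DB = "bench.db"                  # assumed path; for C3, swap in a Turso/MVCC connection
N_WRITERS, DURATION_S = 8, 60

def setup() -> None:
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v INTEGER)")
    con.commit()
    con.close()

def writer(stats: dict, lock: threading.Lock) -> None:
    # isolation_level=None -> autocommit; transactions are managed explicitly below.
    con = sqlite3.connect(DB, timeout=0.1, isolation_level=None)
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        k = f"key-{random.randrange(100)}"
        try:
            con.execute("BEGIN IMMEDIATE")                       # small read+compute+write txn
            row = con.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
            v = (row[0] if row else 0) + 1                       # the "compute" step
            con.execute(
                "INSERT INTO kv(k, v) VALUES(?, ?) "
                "ON CONFLICT(k) DO UPDATE SET v = excluded.v", (k, v))
            con.execute("COMMIT")
            outcome = "commits"
        except sqlite3.OperationalError:                         # SQLITE_BUSY / locked ≈ conflict
            con.rollback()                                       # no-op if no txn is open
            outcome = "conflicts"
        with lock:
            stats[outcome] += 1
    con.close()

if __name__ == "__main__":
    setup()
    stats, lock = {"commits": 0, "conflicts": 0}, threading.Lock()
    threads = [threading.Thread(target=writer, args=(stats, lock)) for _ in range(N_WRITERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"throughput={stats['commits'] / DURATION_S:.1f} txn/s, conflicts={stats['conflicts']}")
```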
Metrics (define in CSLC so you don’t average nonsense)
TimeToCompletion: ratio scale; unit s; measure wall-clock from task start to “final answer committed.”
HelpQuestionCount: count scale; unit 1; count explicit “how do I do FPF X?” / “what is Γ_time?” type queries.
SubjectiveWorkload: ordinal/interval (pick one and stick to it); unit 1; e.g., NASA-TLX 0–100 or a 5-level workload scale; motivated by HF-Loop/cognitive overload risk.
SkillUsabilityScore (U.Metric): ordinal 1–5; unit 1; measures “zero-shot enactment,” rated on (1) discovery success (matching U.ServiceClause without clarification) and (2) interface compliance (satisfying U.Method.interface without error).
QualityRubricScore: ordinal 1–5; judged blind across conditions (median + distribution, not mean).
ConformancePass: nominal {pass, fail}; based on a short checklist you freeze up front; report pass rate.
ApplicabilityCoverage: ratio 0–1; fraction of tasks where you can complete without dropping FPF invariants (time explicit, baseline not backfilled, variance recorded in Work).
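A sketch of the metric set as a pinned, machine-readable registry — an illustrative encoding of the CSLC definitions above, intended to be frozen as BENCH:metrics-v1 (names and notes only; nothing here is FPF-normative):

```python
from dataclasses import dataclass
from enum import Enum

class Scale(Enum):
    """A.18 CSLC scale types used by this benchmark."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    COUNT = "count"
    RATIO = "ratio"

@dataclass(frozen=True)
class MetricDef:
    name: str
    scale: Scale
    unit: str
    note: str

# Admissible statistics follow the scale: medians for ordinal, means only for ratio/count.
METRICS_V1 = [
    MetricDef("TimeToCompletion",      Scale.RATIO,   "s", "wall-clock, start to final commit"),
    MetricDef("HelpQuestionCount",     Scale.COUNT,   "1", "explicit clarification/help queries"),
    MetricDef("SubjectiveWorkload",    Scale.ORDINAL, "1", "NASA-TLX 0-100 or 5-level scale"),
    MetricDef("SkillUsabilityScore",   Scale.ORDINAL, "1", "zero-shot enactment, 1-5"),
    MetricDef("QualityRubricScore",    Scale.ORDINAL, "1", "blind rubric, report median"),
    MetricDef("ConformancePass",       Scale.NOMINAL, "1", "frozen checklist, pass/fail"),
    MetricDef("ApplicabilityCoverage", Scale.RATIO,   "1", "fraction of tasks w/o dropped invariants"),
]
```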
Procedure (tight enough to replicate)
Freeze a planned baseline (“what versions/settings/tasks count”) in a SlotFillingsPlanItem with explicit Γ_time (no “latest/current”).
Randomize condition order per task (Latin square if you care about learning effects).
Run each (task, condition) pair K times (K≥3) to estimate variance; log everything (AgentFS toolcall log helps).
Blind-evaluate outputs for quality + conformance.
Analysis: paired deltas vs C0 (median ΔTime, ΔHelpCount) + non-inferiority on quality (e.g., “quality not worse by >1 rubric level on median”).
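A sketch of the counterbalancing and paired-delta analysis; the data shapes (`results` keyed by (task, condition) with K repeat values) are assumptions:

```python
import random
import statistics

def latin_square_orders(conditions: list, n_rows: int, seed: int = 0) -> list:
    """Cyclic Latin square over condition order, one row per run slot."""
    rng = random.Random(seed)
    base = conditions[:]
    rng.shuffle(base)
    k = len(base)
    return [[base[(i + j) % k] for j in range(k)] for i in range(n_rows)]

def paired_median_delta(results: dict, baseline: str = "C0", treatment: str = "C2"):
    """results: {(task_id, condition): [metric values over K repeats]} -> median per-task delta."""
    deltas = []
    for (task, cond), values in results.items():
        if cond != treatment:
            continue
        base_vals = results.get((task, baseline))
        if base_vals:
            deltas.append(statistics.median(values) - statistics.median(base_vals))
    return statistics.median(deltas) if deltas else None
```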
Risks
Learning effects can dominate the “complexity” signal unless you counterbalance order; AgentFS alpha status and Turso MVCC preview limitations can confound results (e.g., MVCC’s “no CREATE INDEX” restriction undermines the typed-tables-plus-indexes schema when carried into C3).
Next
Create the baseline PlanItem + metric definitions first; run a 2-task pilot to debug instrumentation, then expand to the full suite and lock the rubric before collecting “real” data.
FPF-shaped baseline skeleton (fill in your actual refs/versions)

```
SlotFillingsPlanItem := ⟨
  kind = SlotFillingsPlanItem,
  bounded_context_ref = U.BoundedContextRef(BC:AgentFS×FPF-Benchmark),
  path_slice_id = PathSliceId(P2W:bench-v1),
  Γ_time_selector = point(t0),   // no implicit “latest”
  planned_fillings = [
    ⟨slot_kind = ToolVersionSlot, planned_filler = ByValue("agent-model=X; agentfs=…; sqlite/turso=…")⟩,
    ⟨slot_kind = TaskSuiteSlot,   planned_filler = ByRef(TaskSuiteRef(BENCH:tasks-v1@edition(E1)))⟩,
    ⟨slot_kind = MetricSetSlot,   planned_filler = ByRef(MetricSetRef(BENCH:metrics-v1@edition(E1)))⟩,
    ⟨slot_kind = RubricSlot,      planned_filler = ByRef(RubricRef(BENCH:quality-rubric-v1@edition(E1)))⟩
  ]
⟩
```
## Observability & Telemetry (FPF-aligned)
To satisfy **A.15.1 (U.Work)** and **G.12 (Lawful Telemetry)** without building a custom observability stack, we will use **OpenTelemetry (OTel)** with a strict attribute schema.
### 1. Principle: Spans as U.Work
Every method execution (Task or Tool) is a `U.Work` occurrence. The OTel `Span` is the carrier.
### 2. Schema: Work 4D Mapping
We map FPF's 4-dimensional anchors to OTel attributes. These are **MANDATORY** for all experiment traces.
| FPF Anchor (A.15.1) | OTel Attribute | Value / Format |
| :--- | :--- | :--- |
| **Identity** | `trace_id` / `span_id` | Standard OTel W3C TraceContext |
| **Window** | `start_time` / `end_time` | Standard OTel timestamp (nanoseconds) |
| **Spec** | `fpf.spec_ref` | URI of the `MethodDescription` (e.g., `method:PlanTask@v1`) |
| **Performer** | `fpf.performer_ref` | URI of the `RoleAssignment` (e.g., `role:Assistant@run-1`) |
| **Context** | `fpf.context_ref` | `U.BoundedContext` URI (e.g., `ctx:AgentFS-Experiment-C1`) |
| **System** | `fpf.system_ref` | Identity of the runtime system (e.g., `sys:MacBookPro-M3`) |
| **Pins** | `fpf.edition_pins` | JSON string: `{"method_v": "1.0", "prompt_v": "A"}` |
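A minimal instrumentation sketch using the OpenTelemetry Python API (assumes a TracerProvider/exporter has been configured, e.g., as sketched under §4 below; `run_task` is a hypothetical task driver):

```python
from opentelemetry import trace

def run_task() -> None:
    """Hypothetical task driver; replace with the actual condition runner."""
    ...

tracer = trace.get_tracer("fpf.bench")

with tracer.start_as_current_span("task:T1:plan-to-work") as span:
    # Identity and Window come from the span itself (trace_id/span_id, start/end timestamps).
    span.set_attribute("fpf.spec_ref", "method:PlanTask@v1")
    span.set_attribute("fpf.performer_ref", "role:Assistant@run-1")
    span.set_attribute("fpf.context_ref", "ctx:AgentFS-Experiment-C1")
    span.set_attribute("fpf.system_ref", "sys:MacBookPro-M3")
    span.set_attribute("fpf.edition_pins", '{"method_v": "1.0", "prompt_v": "A"}')
    run_task()
```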
### 3. Metric Telemetry (G.11/G.12)
Metrics are derived strictly from spans to ensure "Lawful Telemetry" (no side-channel numbers).
* **TimeToCompletion (`Γ_time`)**
* **Source**: Duration of the root span for the Task.
* **Metric Name**: `fpf.experiment.duration_ms`
* **Type**: Histogram (Explicit buckets compliant with A.18 Scales).
* **HelpQuestionCount**
* **Source**: Count of child spans with `event="tool_use"` and `tool="ask_clarification"`.
* **Metric Name**: `fpf.experiment.help_requests`
* **Type**: Counter (Sum).
* **Outcome & Quality**
* **Source**: Attributes on the root span.
* **Attribute**: `fpf.outcome.class` ∈ {`Success`, `Failure`, `Aborted`}
* **Attribute**: `fpf.outcome.rubric_score` (1-5, set by blinding reviewer)
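A sketch of the two derived instruments with the OpenTelemetry Python metrics API; explicit histogram bucket boundaries would be configured via an SDK View, and `record_run` is a hypothetical helper, not part of any library:

```python
from opentelemetry import metrics

# Assumes a MeterProvider is configured; instrument names mirror the schema above.
meter = metrics.get_meter("fpf.bench")

duration_ms = meter.create_histogram(
    "fpf.experiment.duration_ms", unit="ms",
    description="Root-span duration per task run")
help_requests = meter.create_counter(
    "fpf.experiment.help_requests", unit="1",
    description="Count of ask_clarification tool-use spans")

def record_run(duration_s: float, n_help: int, condition: str, outcome: str) -> None:
    """Hypothetical helper: record one (task, condition) run's derived metrics."""
    attrs = {"fpf.context_ref": condition, "fpf.outcome.class": outcome}
    duration_ms.record(duration_s * 1000, attributes=attrs)
    help_requests.add(n_help, attributes=attrs)
```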
### 4. Persistence (EvidenceGraph)
* **Condition C0 (Files)**: Use `ConsoleSpanExporter` or basic OTel file exporter (JSON).
* **Condition C1/C2 (AgentFS)**: Write spans as immutable JSON artifacts into `telemetry/` folder. This directory acts as the **EvidenceGraph** carrier (G.6).
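A setup sketch for the C0/C1 exporters; routing `ConsoleSpanExporter` output to a file under `telemetry/` is an assumption about how the EvidenceGraph carrier is laid out, not an AgentFS API:

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Write each finished span as a JSON record to telemetry/spans.jsonl.
# For C1/C2 the same path would live inside the AgentFS mount (assumed layout).
os.makedirs("telemetry", exist_ok=True)
out_stream = open("telemetry/spans.jsonl", "a")

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=out_stream)))
trace.set_tracer_provider(provider)
```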
Note: I could not access the NotebookLM link (it appears auth-gated from here). If you export/share the underlying text, it can be pinned as an evidence/carrier in the baseline the same way as the other inputs.