Run an ablation-style, randomized crossover benchmark with 3–4 conditions (Files, AgentFS-as-files, AgentFS + FPF-native schema, optional Turso MVCC) over a fixed task suite. Measure (1) time to completion in seconds, (2) the number of clarifying/help questions, (3) output quality via a blinded rubric plus conformance checks, and (4) error rate/throughput under parallel writes. I’m using FPF Spec patterns A.15.3 SlotFillingsPlanItem (planned baseline with explicit Γ_time, no “latest”) and A.18 CSLC (Characteristic/Scale/Level/Coordinate) to define the metrics.
Assumptions
You can run identical prompts/tasks across conditions on the same model+hardware; your “complexity” proxy is help/clarification count + subjective workload (HF-Loop framing).
Model
Independent variables: the storage/state layer (Files vs AgentFS) and the structure (unstructured vs FPF-native schema), plus an optional write-concurrency mode (SQLite single-writer vs Turso MVCC). AgentFS provides a single SQLite file with a filesystem, KV store, and toolcall audit trail; Turso MVCC allows concurrent writes but aborts write-write conflicts at commit. Turso reports up to 4× write throughput in some multi-thread + compute workloads and claims elimination of SQLITE_BUSY (early testing; treat as provisional; accessed 2026-01-10).
Options
Ablation ladder (recommended): C0 Files; C1 AgentFS storing FPF artifacts as files; C2 AgentFS + FPF-native schema (typed tables + indexes); C3 = C2 + Turso MVCC for parallel-writer stress (note MVCC preview limitations such as no CREATE INDEX; accessed 2026-01-10).
Pick
Do C0/C1/C2 for “speed/complexity/quality/applicability”; add C3 only if you truly need multi-writer (multi-agent / multi-thread) and can live with MVCC-preview constraints. AgentFS is explicitly ALPHA (dev/testing only; accessed 2026-01-10), so treat all results as “experiment-grade,” not production-grade.
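A minimal sketch of the condition matrix as a frozen config (field names and values are illustrative, not AgentFS/Turso API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One arm of the ablation ladder; all fields are pinned up front."""
    cid: str              # C0..C3
    storage: str          # "files" | "agentfs"
    schema: str           # "unstructured" | "fpf_native"
    concurrency: str      # "single_writer" | "mvcc"

# Illustrative ladder; the exact backend flags are assumptions, not real API surface.
CONDITIONS = [
    Condition("C0", "files",   "unstructured", "single_writer"),
    Condition("C1", "agentfs", "unstructured", "single_writer"),
    Condition("C2", "agentfs", "fpf_native",   "single_writer"),
    Condition("C3", "agentfs", "fpf_native",   "mvcc"),  # only if multi-writer is required
]
```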
Tests
Task suite (minimum that actually exercises the claimed benefits):
T1 Plan→Work seam: create a planned baseline, then execute and record variance (forces “no backfill”).
T2 Iteration: apply a small new constraint, update baseline correctly (new PlanItem edition), keep audit trail intact.
T3 Retrieval: answer “why did we choose X?” using stored state only (tests auditability/queryability).
T4 Concurrency stress (only if C3): N=8 writer threads for 60 s, each doing small read+compute+write transactions; record throughput and conflict/error rates (Turso’s win-zone per their benchmark narrative).
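For T4, a harness sketch in Python against plain SQLite (the single-writer case); the DB path, key space, and error handling are assumptions, and the C3 variant would swap in a Turso/MVCC connection with its own conflict signalling:

```python
import random
import sqlite3
import threading
import time

DB = "bench.db"                  # assumed path; for C3, swap in a Turso/MVCC connection
N_WRITERS, DURATION_S = 8, 60

def setup() -> None:
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v INTEGER)")
    con.commit()
    con.close()

def writer(stats: dict, lock: threading.Lock) -> None:
    # isolation_level=None -> autocommit; transactions are managed explicitly below.
    con = sqlite3.connect(DB, timeout=0.1, isolation_level=None)
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        k = f"key-{random.randrange(100)}"
        try:
            con.execute("BEGIN IMMEDIATE")                       # small read+compute+write txn
            row = con.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
            v = (row[0] if row else 0) + 1                       # the "compute" step
            con.execute(
                "INSERT INTO kv(k, v) VALUES(?, ?) "
                "ON CONFLICT(k) DO UPDATE SET v = excluded.v", (k, v))
            con.execute("COMMIT")
            outcome = "commits"
        except sqlite3.OperationalError:                         # SQLITE_BUSY / locked ≈ conflict
            con.rollback()                                       # no-op if no txn is open
            outcome = "conflicts"
        with lock:
            stats[outcome] += 1
    con.close()

if __name__ == "__main__":
    setup()
    stats, lock = {"commits": 0, "conflicts": 0}, threading.Lock()
    threads = [threading.Thread(target=writer, args=(stats, lock)) for _ in range(N_WRITERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"throughput={stats['commits'] / DURATION_S:.1f} txn/s, conflicts={stats['conflicts']}")
```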
Metrics (define in CSLC so you don’t average nonsense)
TimeToCompletion: ratio scale; unit s; measure wall-clock from task start to “final answer committed.”
HelpQuestionCount: count scale; unit 1; count explicit “how do I do FPF X?” / “what is Γ_time?” type queries.
SubjectiveWorkload: ordinal/interval (pick one and stick to it); unit 1; e.g., NASA-TLX 0–100 or a 5-level workload scale; motivated by HF-Loop/cognitive overload risk.
SkillUsabilityScore (U.Metric): ordinal 1–5; unit 1; measures “zero-shot enactment,” rated on (1) discovery success (matching U.ServiceClause without clarification) and (2) interface compliance (satisfying U.Method.interface without error).
QualityRubricScore: ordinal 1–5; judged blind across conditions (median + distribution, not mean).
ConformancePass: nominal {pass, fail}; based on a short checklist you freeze up front; report pass rate.
ApplicabilityCoverage: ratio 0–1; fraction of tasks where you can complete without dropping FPF invariants (time explicit, baseline not backfilled, variance recorded in Work).
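A sketch of the metric set as a pinned, machine-readable registry — an illustrative encoding of the CSLC definitions above, intended to be frozen as BENCH:metrics-v1 (names and notes only; nothing here is FPF-normative):

```python
from dataclasses import dataclass
from enum import Enum

class Scale(Enum):
    """A.18 CSLC scale types used by this benchmark."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    COUNT = "count"
    RATIO = "ratio"

@dataclass(frozen=True)
class MetricDef:
    name: str
    scale: Scale
    unit: str
    note: str

# Admissible statistics follow the scale: medians for ordinal, means only for ratio/count.
METRICS_V1 = [
    MetricDef("TimeToCompletion",      Scale.RATIO,   "s", "wall-clock, start to final commit"),
    MetricDef("HelpQuestionCount",     Scale.COUNT,   "1", "explicit clarification/help queries"),
    MetricDef("SubjectiveWorkload",    Scale.ORDINAL, "1", "NASA-TLX 0-100 or 5-level scale"),
    MetricDef("SkillUsabilityScore",   Scale.ORDINAL, "1", "zero-shot enactment, 1-5"),
    MetricDef("QualityRubricScore",    Scale.ORDINAL, "1", "blind rubric, report median"),
    MetricDef("ConformancePass",       Scale.NOMINAL, "1", "frozen checklist, pass/fail"),
    MetricDef("ApplicabilityCoverage", Scale.RATIO,   "1", "fraction of tasks w/o dropped invariants"),
]
```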
Procedure (tight enough to replicate)
Freeze a planned baseline (“what versions/settings/tasks count”) in a SlotFillingsPlanItem with explicit Γ_time (no “latest/current”).
Randomize condition order per task (Latin square if you care about learning effects).
Run each (task, condition) pair K times (K≥3) to estimate variance; log everything (AgentFS toolcall log helps).
Blind-evaluate outputs for quality + conformance.
Analysis: paired deltas vs C0 (median ΔTime, ΔHelpCount) + non-inferiority on quality (e.g., “quality not worse by >1 rubric level on median”).
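A sketch of the counterbalancing and paired-delta analysis; the data shapes (`results` keyed by (task, condition) with K repeat values) are assumptions:

```python
import random
import statistics

def latin_square_orders(conditions: list, n_rows: int, seed: int = 0) -> list:
    """Cyclic Latin square over condition order, one row per run slot."""
    rng = random.Random(seed)
    base = conditions[:]
    rng.shuffle(base)
    k = len(base)
    return [[base[(i + j) % k] for j in range(k)] for i in range(n_rows)]

def paired_median_delta(results: dict, baseline: str = "C0", treatment: str = "C2"):
    """results: {(task_id, condition): [metric values over K repeats]} -> median per-task delta."""
    deltas = []
    for (task, cond), values in results.items():
        if cond != treatment:
            continue
        base_vals = results.get((task, baseline))
        if base_vals:
            deltas.append(statistics.median(values) - statistics.median(base_vals))
    return statistics.median(deltas) if deltas else None
```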
Risks
Learning effects can dominate the “complexity” signal unless you counterbalance order; AgentFS alpha status and Turso MVCC preview limitations can confound results (e.g., MVCC’s “no CREATE INDEX” restriction undermines the typed-tables-plus-indexes schema when carried into C3).
Next
Create the baseline PlanItem + metric definitions first; run a 2-task pilot to debug instrumentation, then expand to the full suite and lock the rubric before collecting “real” data.
FPF-shaped baseline skeleton (fill in your actual refs/versions)

```
SlotFillingsPlanItem := ⟨
  kind = SlotFillingsPlanItem,
  bounded_context_ref = U.BoundedContextRef(BC:AgentFS×FPF-Benchmark),
  path_slice_id = PathSliceId(P2W:bench-v1),
  Γ_time_selector = point(t0),   // no implicit “latest”
  planned_fillings = [
    ⟨slot_kind = ToolVersionSlot, planned_filler = ByValue("agent-model=X; agentfs=…; sqlite/turso=…")⟩,
    ⟨slot_kind = TaskSuiteSlot,   planned_filler = ByRef(TaskSuiteRef(BENCH:tasks-v1@edition(E1)))⟩,
    ⟨slot_kind = MetricSetSlot,   planned_filler = ByRef(MetricSetRef(BENCH:metrics-v1@edition(E1)))⟩,
    ⟨slot_kind = RubricSlot,      planned_filler = ByRef(RubricRef(BENCH:quality-rubric-v1@edition(E1)))⟩
  ]
⟩
```
## Observability & Telemetry (FPF-aligned)
To satisfy **A.15.1 (U.Work)** and **G.12 (Lawful Telemetry)** without building a custom observability stack, we will use **OpenTelemetry (OTel)** with a strict attribute schema.
### 1. Principle: Spans as U.Work
Every method execution (Task or Tool) is a `U.Work` occurrence. The OTel `Span` is the carrier.
### 2. Schema: Work 4D Mapping
We map FPF's 4-dimensional anchors to OTel attributes. These are **MANDATORY** for all experiment traces.
| FPF Anchor (A.15.1) | OTel Attribute | Value / Format |
| :--- | :--- | :--- |
| **Identity** | `trace_id` / `span_id` | Standard OTel W3C TraceContext |
| **Window** | `start_time` / `end_time` | Standard OTel timestamp (nanoseconds) |
| **Spec** | `fpf.spec_ref` | URI of the `MethodDescription` (e.g., `method:PlanTask@v1`) |
| **Performer** | `fpf.performer_ref` | URI of the `RoleAssignment` (e.g., `role:Assistant@run-1`) |
| **Context** | `fpf.context_ref` | `U.BoundedContext` URI (e.g., `ctx:AgentFS-Experiment-C1`) |
| **System** | `fpf.system_ref` | Identity of the runtime system (e.g., `sys:MacBookPro-M3`) |
| **Pins** | `fpf.edition_pins` | JSON string: `{"method_v": "1.0", "prompt_v": "A"}` |
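A minimal instrumentation sketch using the OpenTelemetry Python API (assumes a TracerProvider/exporter has been configured, e.g., as sketched under §4 below; `run_task` is a hypothetical task driver):

```python
from opentelemetry import trace

def run_task() -> None:
    """Hypothetical task driver; replace with the actual condition runner."""
    ...

tracer = trace.get_tracer("fpf.bench")

with tracer.start_as_current_span("task:T1:plan-to-work") as span:
    # Identity and Window come from the span itself (trace_id/span_id, start/end timestamps).
    span.set_attribute("fpf.spec_ref", "method:PlanTask@v1")
    span.set_attribute("fpf.performer_ref", "role:Assistant@run-1")
    span.set_attribute("fpf.context_ref", "ctx:AgentFS-Experiment-C1")
    span.set_attribute("fpf.system_ref", "sys:MacBookPro-M3")
    span.set_attribute("fpf.edition_pins", '{"method_v": "1.0", "prompt_v": "A"}')
    run_task()
```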
### 3. Metric Telemetry (G.11/G.12)
Metrics are derived strictly from spans to ensure "Lawful Telemetry" (no side-channel numbers).
* **TimeToCompletion (`Γ_time`)**
* **Source**: Duration of the root span for the Task.
* **Metric Name**: `fpf.experiment.duration_ms`
* **Type**: Histogram (Explicit buckets compliant with A.18 Scales).
* **HelpQuestionCount**
* **Source**: Count of child spans with `event="tool_use"` and `tool="ask_clarification"`.
* **Metric Name**: `fpf.experiment.help_requests`
* **Type**: Counter (Sum).
* **Outcome & Quality**
* **Source**: Attributes on the root span.
* **Attribute**: `fpf.outcome.class` ∈ {`Success`, `Failure`, `Aborted`}
* **Attribute**: `fpf.outcome.rubric_score` (1-5, set by blinding reviewer)
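A sketch of the two derived instruments with the OpenTelemetry Python metrics API; explicit histogram bucket boundaries would be configured via an SDK View, and `record_run` is a hypothetical helper, not part of any library:

```python
from opentelemetry import metrics

# Assumes a MeterProvider is configured; instrument names mirror the schema above.
meter = metrics.get_meter("fpf.bench")

duration_ms = meter.create_histogram(
    "fpf.experiment.duration_ms", unit="ms",
    description="Root-span duration per task run")
help_requests = meter.create_counter(
    "fpf.experiment.help_requests", unit="1",
    description="Count of ask_clarification tool-use spans")

def record_run(duration_s: float, n_help: int, condition: str, outcome: str) -> None:
    """Hypothetical helper: record one (task, condition) run's derived metrics."""
    attrs = {"fpf.context_ref": condition, "fpf.outcome.class": outcome}
    duration_ms.record(duration_s * 1000, attributes=attrs)
    help_requests.add(n_help, attributes=attrs)
```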
### 4. Persistence (EvidenceGraph)
* **Condition C0 (Files)**: Use `ConsoleSpanExporter` or basic OTel file exporter (JSON).
* **Condition C1/C2 (AgentFS)**: Write spans as immutable JSON artifacts into `telemetry/` folder. This directory acts as the **EvidenceGraph** carrier (G.6).
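A setup sketch for the C0/C1 exporters; routing `ConsoleSpanExporter` output to a file under `telemetry/` is an assumption about how the EvidenceGraph carrier is laid out, not an AgentFS API:

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Write each finished span as a JSON record to telemetry/spans.jsonl.
# For C1/C2 the same path would live inside the AgentFS mount (assumed layout).
os.makedirs("telemetry", exist_ok=True)
out_stream = open("telemetry/spans.jsonl", "a")

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=out_stream)))
trace.set_tracer_provider(provider)
```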
Note: I could not access the NotebookLM link (it appears auth-gated from here). If you export/share the underlying text, it can be pinned as an evidence/carrier in the baseline the same way as the other inputs.