# RLM Realignment Strategy
Date: 2026-02-05
Owner: Aleph maintainers
Status: Proposed execution plan
## 1) Goal
Realign Aleph with Recursive Language Model (RLM) research on two axes:
1. Presentation: clearly distinguish canonical RLM behavior from Aleph-specific platform features.
2. Ability: strengthen measurable RLM behavior (decompose, recurse, aggregate, verify) with explicit evaluation and regression checks.
## 2) Research Baseline (What "RLM-aligned" Means)
Using the RLM paper and its official implementation as the source of truth, canonical RLM behavior is to:
1. Treat prompt/context as an external environment object, not as raw prompt tokens.
2. Use a REPL-style symbolic loop: inspect, decompose, execute, observe.
3. Recursively call sub-LMs for decomposition/aggregation when useful.
4. Iterate until a final answer is produced.
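The loop above can be sketched in a few lines. Everything here (`run_rlm`, the scripted stand-in model) is a hypothetical illustration of the canonical pattern, not the `aleph/core.py` API:

```python
# Minimal sketch of the canonical RLM loop (hypothetical names, not the
# aleph/core.py API). The model sees the context only through a REPL-style
# environment and signals completion with a FINAL(...) marker.
import re

def run_rlm(context: str, model_step, max_iters: int = 8) -> str:
    env = {"ctx": context}     # context is an environment object, not prompt tokens
    observation = f"ctx loaded ({len(context)} chars)"
    for _ in range(max_iters):
        action = model_step(observation)              # model proposes code or FINAL(...)
        match = re.match(r"FINAL\((.*)\)\Z", action, re.DOTALL)
        if match:
            return match.group(1)                     # final answer terminates the loop
        try:
            observation = repr(eval(action, {}, env))  # execute, then observe
        except Exception as exc:
            observation = f"error: {exc!r}"
    raise RuntimeError("iteration budget exhausted")

# A scripted stand-in model: first peek at the context, then answer.
steps = iter(["ctx[:20]", "FINAL(the needle is 42)"])
answer = run_rlm("needle=42 " + "filler " * 100, lambda obs: next(steps))
```

The real runtime layers recursion (`sub_query`, `sub_aleph`) and budgets on top of this skeleton; the sketch only shows the inspect/execute/observe/finalize cycle.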
Important nuances from the research:
1. RLM is an inference framework, not a product surface by itself.
2. RLM can outperform long-context baselines on information-dense tasks, but performance is workload-dependent.
3. Paper limitations explicitly call out open areas: optimal execution strategy, async sub-calls, and recursion-depth policy.
## 3) Current Repo Audit
### 3.1 What is already strongly aligned
1. Core closed-loop RLM runtime exists:
- `aleph/core.py` implements iterative root loop + REPL + `FINAL(...)` protocol.
- `aleph/core.py` supports `sub_query` and recursive `sub_aleph`.
2. Externalized context + symbolic interaction exists:
- Context is loaded into `ctx` and explored via helpers/search/code execution.
3. Recursion budget controls exist:
- Depth, iterations, wall time, and sub-query budgets are enforced.
4. Recursion behavior has test coverage:
- Nested recursion and sub-query backends are tested in `tests/test_double_recursion.py` and `tests/test_sub_query.py`.
### 3.2 Where presentation currently drifts
1. The boundary between modes is under-specified:
- Docs position Aleph broadly as RLM-based, but do not consistently separate:
- canonical "Aleph core loop mode" (`aleph run`)
- MCP tool-server orchestration mode (RLM-inspired, host-model-driven).
2. Documentation inconsistency reduces credibility:
- `DEVELOPMENT.md` has stale backend priority and stale budget schema examples.
- `docs/CONFIGURATION.md` timeout defaults are outdated versus runtime defaults.
3. Some guidance is phrased as universal when it is backend-dependent:
- The prompt and docs imply fixed sub-query capacity heuristics, while the runtime actually enforces configurable truncation limits.
### 3.3 Where ability currently drifts from research rigor
1. No first-class benchmark harness reproducing paper-style task families in CI.
2. No published "research profile" config that reproduces paper-like control settings for fair comparison.
3. No recurring regression report tracking RLM-specific quality dimensions over time (decomposition quality, recursion efficiency, cost variance).
## 4) Strategy
Run workstreams A (presentation) and B (ability) in parallel; workstream C provides the evidence that gates both.
### Workstream A: Presentation Realignment
#### A1. Introduce explicit product taxonomy
Add one canonical section to top-level docs:
1. Core RLM Mode (paper-aligned loop): `aleph run`, `alef`, `Aleph.complete(...)`.
2. RLM Infrastructure Mode (MCP external memory server): tool-driven orchestration from host assistants.
3. Platform Extensions (not in canonical RLM): swarm workflows, remote MCP orchestration, workspace/action tooling.
#### A2. Establish a "Claims Ledger"
Create `docs/RLM_CLAIMS_LEDGER.md` mapping every major claim to one of:
1. Research-backed (with citation)
2. Code-backed (with repo file reference)
3. Aspirational (roadmap only)
No claim should remain unclassified.
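As an illustration, a ledger row could look like this (the format is a suggestion, not a settled schema; the example claims are taken from sections 2 and 3 above):

```
| Claim                                              | Class           | Evidence                       |
| -------------------------------------------------- | --------------- | ------------------------------ |
| Depth/iteration/wall-time budgets are enforced     | Code-backed     | `aleph/core.py`                |
| RLM can outperform long-context baselines on       | Research-backed | RLM paper (arXiv:2512.24601)   |
| information-dense tasks                            |                 |                                |
| Async sub-calls                                    | Aspirational    | roadmap only                   |
```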
#### A3. Remove doc drift and ambiguity
Synchronize:
1. `README.md`
2. `docs/CONFIGURATION.md`
3. `DEVELOPMENT.md`
4. `docs/prompts/aleph.md`
Rules:
1. Defaults must match runtime constants.
2. Backend priority must match `aleph/sub_query/__init__.py`.
3. Capacity guidance must mention truncation/config dependence.
4. "RLM" label should be used precisely: canonical loop vs inspired tooling.
### Workstream B: Ability Realignment
#### B1. Add an RLM benchmark harness
Create a lightweight benchmark module with deterministic synthetic tasks mirroring the paper's task structure:
1. Constant-information retrieval task (S-NIAH-style)
2. Linear aggregation task (OOLONG-style)
3. Pairwise aggregation task (OOLONG-Pairs-style)
4. Code context reasoning task (CodeQA-style)
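For the first family, a deterministic needle-in-a-haystack fixture could be generated along these lines (module and function names are hypothetical, not a committed interface):

```python
# Sketch of a deterministic S-NIAH-style fixture generator
# (illustrative only; names are hypothetical, not a committed module).
import random

def make_niah_task(seed: int, n_filler: int = 500) -> dict:
    rng = random.Random(seed)                  # seeded => reproducible fixture
    needle_key = f"key-{rng.randrange(10_000)}"
    needle_val = str(rng.randrange(10_000))
    lines = [f"record {i}: {rng.randrange(10_000)}" for i in range(n_filler)]
    lines.insert(rng.randrange(n_filler), f"{needle_key} = {needle_val}")
    return {
        "context": "\n".join(lines),
        "question": f"What is the value of {needle_key}?",
        "answer": needle_val,
    }

task = make_niah_task(seed=0)
```

Seeding makes runs byte-for-byte reproducible, which is what lets the nightly CI subset diff cleanly against previous reports.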
Output JSON report fields:
1. accuracy
2. total tokens
3. wall time
4. recursion depth used
5. sub-query count
6. cost estimate (if provider supports it)
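A report record covering these fields might be sketched as follows (all field names are illustrative, not a committed schema):

```python
# Sketch of the benchmark report record (field names illustrative).
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RLMBenchReport:
    task: str
    accuracy: float
    total_tokens: int
    wall_time_s: float
    max_recursion_depth: int
    sub_query_count: int
    cost_usd: Optional[float] = None   # None when the provider reports no cost

    def to_json(self) -> str:
        # sort_keys keeps reports diff-friendly across runs
        return json.dumps(asdict(self), sort_keys=True)

report = RLMBenchReport("s-niah", 0.98, 12_345, 41.2, 2, 7)
```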
#### B2. Add a "research profile" runtime preset
Add a named config profile (example: `--profile rlm-paper-like`) that pins:
1. recursion depth
2. iteration cap
3. sub-query cap
4. prompt template variant
5. model split policy (root vs sub-model)
Purpose: make "paper-like mode" reproducible and auditable.
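One way to pin these settings (the profile name matches the example above, but the field names and values are assumptions, not the shipped config schema):

```python
# Sketch of a named runtime profile (field names and values are
# assumptions, not the shipped config schema).
RLM_PAPER_LIKE = {
    "max_recursion_depth": 1,       # illustrative values only
    "max_iterations": 30,
    "max_sub_queries": 50,
    "prompt_template": "rlm-paper-v1",
    "root_model": "large",          # model split policy: root vs sub-model
    "sub_model": "small",
}

def resolve_profile(name: str) -> dict:
    profiles = {"rlm-paper-like": RLM_PAPER_LIKE}
    return dict(profiles[name])     # copy so callers cannot mutate the preset
```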
#### B3. Add regression gates for RLM behavior
Extend the test suite with behavior-level checks:
1. decomposition quality (does the model/toolchain split the input and aggregate results correctly on fixed fixtures?)
2. recursion efficiency (does budget use stay within expected bounds on those fixtures?)
3. stability (how much do results vary under repeated runs with controlled temperature?)
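The stability gate could be expressed as an agreement check over repeated runs; the runner and threshold here are placeholders, not committed values:

```python
# Sketch of a stability regression gate: repeated runs of a fixed task
# should agree above a threshold (runner and threshold are placeholders).
from collections import Counter

def stability_score(run_once, n_runs: int = 5) -> float:
    results = [run_once() for _ in range(n_runs)]
    mode_count = Counter(results).most_common(1)[0][1]
    return mode_count / n_runs       # fraction agreeing with the modal answer

def check_stability(run_once, threshold: float = 0.8) -> bool:
    return stability_score(run_once) >= threshold

# Deterministic stub runner: perfectly stable, so the gate passes.
stable = check_stability(lambda: "42")
```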
### Workstream C: Evidence and Reporting
#### C1. Publish periodic evaluation snapshots
Add `docs/reports/rlm-eval-YYYY-MM-DD.md` containing:
1. benchmark outcomes
2. deltas versus previous run
3. regression explanations
4. confidence and known blind spots
#### C2. Tie docs to evidence
Any top-level performance statement in `README.md` must cite either:
1. the paper, or
2. a local benchmark report,
and must include the run date.
## 5) Execution Plan (Phased)
### Phase 0 (1-2 days): Correctness of messaging
1. Publish taxonomy section in `README.md`.
2. Fix stale defaults/priority mismatches in `DEVELOPMENT.md` and `docs/CONFIGURATION.md`.
3. Add `docs/RLM_CLAIMS_LEDGER.md` scaffold.
Exit criteria:
1. No contradictory defaults between docs and runtime for sub-query/backend/timeouts.
2. Every major README claim tagged as research-backed, code-backed, or aspirational.
### Phase 1 (3-5 days): Measurement baseline
1. Implement benchmark harness and fixture set.
2. Add CI job that runs a reduced benchmark subset nightly.
3. Commit first `docs/reports/rlm-eval-*.md`.
Exit criteria:
1. Reproducible benchmark report exists.
2. README claims reference concrete report or paper.
### Phase 2 (1-2 weeks): Capability hardening
1. Add research profile preset.
2. Add regression tests for decomposition/aggregation and recursion-budget behavior.
3. Improve sub-query policy controls (parallelism policy, truncation observability, retry behavior) behind explicit config.
Exit criteria:
1. RLM-specific regression gates exist in CI.
2. Profile-based reproducibility for paper-like runs is documented.
## 6) Success Metrics
Presentation metrics:
1. Zero stale-default mismatches between docs and runtime constants.
2. 100% of top-level claims linked to the paper or to local benchmark reports.
3. Clear user understanding of mode boundaries in docs review (qualitative check).
Ability metrics:
1. Benchmark suite committed and reproducible.
2. Stable pass/fail thresholds for synthetic RLM task families.
3. Trendline report shows non-regressing decomposition/aggregation quality.
## 7) Risks and Mitigations
1. Risk: Overfitting to synthetic benchmarks.
- Mitigation: include both synthetic structure tests and real-world corpus fixtures.
2. Risk: Benchmark cost/runtime explosion.
- Mitigation: tiered benchmark modes (smoke/nightly/full).
3. Risk: Docs become stale again.
- Mitigation: add CI check that validates documented defaults against code constants.
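The mitigation for risk 3 could start as a simple check that documented defaults match code constants. The doc pattern and the constants table below are assumptions about how defaults get written down, not the repo's actual conventions:

```python
# Sketch of a docs-vs-code drift check (the doc pattern and the constants
# table are assumptions, not the repo's actual conventions).
import re

RUNTIME_DEFAULTS = {"sub_query_timeout_s": 120, "max_recursion_depth": 3}

def find_drift(doc_text: str, defaults: dict) -> list:
    """Return (name, documented, actual) triples where the doc disagrees."""
    drift = []
    for name, actual in defaults.items():
        for match in re.finditer(rf"`{name}`\s*=\s*(\d+)", doc_text):
            documented = int(match.group(1))
            if documented != actual:
                drift.append((name, documented, actual))
    return drift

doc = "Defaults: `sub_query_timeout_s` = 60, `max_recursion_depth` = 3."
stale = find_drift(doc, RUNTIME_DEFAULTS)
```

A CI job would run this over `docs/` and fail on any non-empty drift list, keeping Phase 0's exit criterion enforced over time.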
## 8) Immediate Next Actions
1. Approve this plan as the execution baseline.
2. Start Phase 0 with doc corrections and claims ledger.
3. Open tracking issues for Phase 1 benchmark harness and Phase 2 regression gates.
## 9) Sources
1. Recursive Language Models paper (arXiv): https://arxiv.org/abs/2512.24601
2. Official RLM codebase: https://github.com/alexzhang13/rlm
3. Author article on RLM framing and limitations: https://towardsdatascience.com/recursive-language-models-new-rules-for-agentic-ai/