**Stop summarizing your agent's memory.** Every co…
Stop summarizing your agent's memory. Every compaction call burns a model round-trip, rewrites your prefix so the provider prompt cache goes cold, and quietly drops the exact identifiers your agent needs. Fold it deterministically instead.
The Infinite Context Warp Engine. Keep long function-calling agent sessions under the context window without LLM summarization calls and without ending the session — while keeping provider prompt caches hot — and page folded content back in the moment the agent touches it again.
Deterministic. Zero-LLM. Pure CPU, zero I/O, byte-identical output for identical inputs. Provider-agnostic: Anthropic content blocks, OpenAI tool_calls , and Gemini parts .
Extracted from a production multi-agent system, where it folds context continuously across every model and long-running agent workloads.
The core engine passes 380+ deterministic tests across rolling fold, recall, freeze, and integration.
Every number below is measured, not estimated — production cache rates from the Claude provider usage ledger, reproducible live against Claude ( ANTHROPIC_API_KEY=… npx tsx examples/benchmark-live.ts , real model + real summarizer) and offline with exact o200k_base BPE token counts ( npx tsx examples/benchmark.ts , deterministic, no key).
Provenance note: this public package is production-derived. It is the portable distribution of an engine that runs live inside a private multi-agent system, so it deliberately uses generic WARP_* environment names, package-neutral examples, raw-history recovery wording, and tool-agnostic voice mining. The byte-identical invariant is local to this package — identical inputs produce identical folded views — and is not a claim of bit-for-bit parity with any private integration layer.
The numbers that matter are from the production multi-agent system this engine powers — real Claude workloads running the fold/freeze engine continuously across hundreds of turns, measured from the provider's own usage ledger (cache-read tokens ÷ total input tokens):
Production Claude workload Measured turns Cache-read hit Fresh input Cache-read input
Opus 4.8 agent 691 89.6% 32.9M tok 292.6M tok
Opus agent 510 93.2% 32.6M tok 602.5M tok
~90% of all input tokens are served from cache across these high-turn Claude workloads — that is the byte-identical frozen-fold prefix doing its job, turn after turn, at $0.30/MTok cache reads instead of $3.00/MTok fresh input (Sonnet rates). A re-summarizing compactor rewrites the prefix and can never sustain this; truncation slides the window and breaks it. This is the entire economic argument, measured live.
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/benchmark-live.ts # default claude-haiku-4-5
Real Claude calls every turn with Anthropic cache_control breakpoints, a real Claude summarizer (told to preserve every identifier — a fair fight), and the provider's own cache_read_input_tokens / cache_creation_input_tokens . A short 16-turn demo understates the production cache rate (caching needs a ≥1024-token prefix and CWD's advantage compounds over long sessions) — but it shows the mechanism on real telemetry, with CWD reading from cache while truncation and summarization rebuild their prefix.
npx tsx examples/benchmark.ts — a 16-turn outage-debugging session, exact o200k_base BPE token counts (a portable proxy; Claude's tokenizer isn't public), claude-haiku-4-5 list pricing. This is the CI smoke test; the cache column is a turn-over-turn byte-prefix proxy and the summarizer is a transparent deterministic stand-in (it drops ids buried past its head cutoff — the failure mode the Coordinate Closet exists to avoid).
Strategy Input Cost Cache Hit (prefix proxy) Extra LLM Calls Fact Retention
Truncation (rolling window) $0.0172 28% 0 44% (7/16)
LLM Summarization (stand-in) $0.0228 43% 6 44% (7/16)
Context Warp Drive$0.0066 60% 0 94% (15/16)
CWD is cheapest (−71% vs summarization, −62% vs truncation at Claude-haiku rates), makes zero extra model calls, and beats truncation decisively on retention. (A well-prompted real summarizer can match retention at higher cost — CWD's durable edge is cost + zero calls + determinism + a hot cache.) The engine is provider-agnostic: set WARP_BENCH_MODEL (and WARP_BENCH_PRICE_* for an unlisted model) to benchmark against any model, including OpenAI.
Every long agent session hits the same wall: the context window fills up. The usual answers are bad:
Truncation drops the middle of your history — the agent forgets what it was doing.
LLM summarization ("compaction") costs a model call, adds latency, is non-deterministic, and busts your provider prompt cache every time it rewrites the prefix.
Context Warp Drive does neither. It deterministically folds old turns into compact structural skeletons (one line per tool call + retained reasoning), conserves the salient exact identifiers (UUIDs, SHAs, paths, ports) in a budget-scored Coordinate Closet, freezes the folded prefix so it's reused byte-identical while the provider cache is warm, and pages folded content back in automatically when the agent re-touches a path. No model calls. No truncation. Cache stays hot.
Not published on npm yet. Install from source today:
git clone https://github.com/dogtorjonah/context-warp-drive.git
cd context-warp-drive
npm install # runs prepare -> builds dist/ automatically
npm install better-sqlite3
The core ( context-warp-drive/fold ) has zero runtime dependencies. better-sqlite3 is an optional peer needed only by the reference episodic store.
Local tarball / future npm install dist/ is gitignored, so build before consuming the package from another project. For a local package install:
npm run build # explicit fallback npm pack
npm install /path/to/context-warp-drive/context-warp-drive- * .tgz
After the first npm publish, installation becomes:
npm install context-warp-drive
Paste this:
** Add context-warp-drive from the source checkout or local tarball, then wrap our function-calling message history with FoldSession.prepare() before each model call. Preserve raw history separately; send only the prepared messages view to the provider. Use cacheHot and stats for logging.
Then add the provider cache knob:
Provider What to do
Claude / Anthropic Use prepareAnthropicCachedRequest() from context-warp-drive/providers/anthropic with messages , sealedBoundary , system , and tools . It marks the relay-style breakpoints: tools, stable system head, sealed fold/rebirth boundary, and rolling tail. Default TTL is Anthropic's 5-minute cache shape; pass ttl: '1h' only when you want the paid 1-hour cache and merge the returned requestOptions / anthropicBeta into your SDK or fetch call. Log usage.cache_read_input_tokens and usage.cache_creation_input_tokens .
OpenAI No cache marker is required. Keep static tools/system/context first, pass the prepared messages , optionally reuse a stable prompt_cache_key , and log usage.prompt_tokens_details.cached_tokens .
Gemini Implicit caching is automatic on Gemini 2.5+ when prefixes match. For a large static document/corpus, create an explicit Gemini cache separately and pass it as cachedContent ; keep the folded conversation after that stable prefix. Log usage_metadata .
Gemini CLI Use context-warp-drive/providers/gemini-cli to fold the CLI-owned JSONL view, preserving the metadata header and rewriting with $set.messages + $set.lastUpdated .
Context Warp Drive keeps the prefix byte-identical. The provider SDK call still owns provider-specific cach…
本条由桃子采集流水线(启发式模式)自动整理,原文见文末信源。