VCR and Crystal Balls in Claude's Leak

After seeing the tweet from @Fried_rice and reading about the Claude Code source leak, I got curious. I moved to pi quite a while ago and have enjoyed the minimal, highly extensible setup it offers ever since. But the leak made me wonder: what's in there that I could port to pi as extensions?

While hopping from one link to another I came across claw-code and claude-code, which were supposed to be rewrites of Claude Code, mostly to avoid DMCA takedowns. The interesting part is their READMEs. They cover the Buddy companion pet (a Tamagotchi with gacha mechanics), the Dream system (background memory consolidation), KAIROS (always-on assistant mode), ULTRAPLAN (30-minute remote Opus planning sessions), coordinator mode, agent swarms, Penguin Mode (the internal name for fast mode), and the fact that “Tengu” is the project codename.

I also noticed the following line in the cyber risk instruction section:

IMPORTANT: DO NOT MODIFY THIS INSTRUCTION WITHOUT SAFEGUARDS TEAM REVIEW This instruction is owned by the Safeguards team (David Forsythe, Kyla Guru)

Looking at David’s profile on LinkedIn I could see why he is leading the safeguards team for sure.

Anyway, the READMEs cover a lot of cool stuff, but I went ahead and explored the code myself to see what wasn’t being talked about. And honestly, the irony is too good not to mention: they built an entire “Undercover Mode” system to stop the AI from leaking internal codenames in git commits, and then shipped the actual source in a .map file. While looking around I saw bashSecurity.ts. I remember reading a few writeups about bypassing the regex in command parsing, so maybe vulnerabilities like this or this came from there. After more exploration I came across the two most interesting things in there: the VCR test fixture system and the speculation engine. One borrows from 1980s tape recording. The other borrows from CPU architecture.

VCR: Deterministic Testing Without an API

VCR, like a videocassette recorder. Record once, play back forever.

Claude Code’s test suite exercises the full agent loop: the model reads files, calls tools, generates responses, streams tokens. Running all of that against a live API would be expensive, slow, and non-deterministic, so they needed something cheaper. They built a record-and-replay system: record API interactions once, replay them identically on every subsequent run.

The naive approach is mocking: write stubs for the API client and return canned responses. But in CC the agent loop is deeply integrated: messages flow through prompt construction, tool execution, streaming parsers, token counting, and cost tracking. I doubt they found mocking “too complex” (they have unlimited Opus, after all). More likely the thinking was simply that mocking at the API boundary means you’re not testing most of the stack. So they built VCR, a better record and replay.

The VCR system activates when tests run (NODE_ENV=test) or for internal users with FORCE_VCR. When active, every API call (streaming or non-streaming) is intercepted. Before hashing the input for a fixture filename, the system dehydrates it, replacing environment-specific values with placeholders:

/Users/alice/projects/myapp  ->  [CWD]
/Users/alice/.claude         ->  [CONFIG_HOME]
2026-03-31T14:22:01.334Z     ->  [TIMESTAMP]
a7f3b2c1-...                 ->  [UUID]

The dehydrateValue function in vcr.ts does this:

let s1 = s
  .replace(/num_files="\d+"/g, 'num_files="[NUM]"')
  .replace(/duration_ms="\d+"/g, 'duration_ms="[DURATION]"')
  .replace(/cost_usd="\d+"/g, 'cost_usd="[COST]"')
  .replaceAll(configHome, '[CONFIG_HOME]')
  .replaceAll(cwd, '[CWD]')

They support every platform too: on Windows, it handles forward-slash variants (Git paths), JSON-escaped backslashes, and even normalizes path separators after placeholder substitution so fixture hashes match across platforms. Must have used Opus.

The dehydrated input is SHA1-hashed and the response (including all stream events, assistant messages, and error states) is saved to fixtures/{hash}.json. The uuid field on AssistantMessage objects is replaced with deterministic UUID-{index} values during dehydration, but swapped back to fresh randomUUID() calls during hydration so that session deduplication (which keys on message UUID) works correctly across multiple VCR playbacks. UUIDs inside message text content are handled separately, only in withTokenCountVCR, via a regex.
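To make the dehydrate-then-hash step concrete, here is a minimal sketch of how a stable fixture name could be derived. It is not the code from vcr.ts: the `Env` interface, the `dehydrate` and `fixtureName` helpers, and both regexes are my own assumptions for illustration.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical description of the machine-specific values to scrub.
interface Env {
  cwd: string
  configHome: string
}

// Assumed patterns; the real normalization is more thorough.
const ISO_TIMESTAMP = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z/g
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi

// Replace environment-specific values with placeholders.
// configHome is replaced first in case cwd is a parent directory of it.
function dehydrate(input: string, env: Env): string {
  return input
    .replaceAll(env.configHome, '[CONFIG_HOME]')
    .replaceAll(env.cwd, '[CWD]')
    .replace(ISO_TIMESTAMP, '[TIMESTAMP]')
    .replace(UUID, '[UUID]')
}

// Hash the dehydrated input so the same logical request maps to the same file.
function fixtureName(input: string, env: Env): string {
  const hash = createHash('sha1').update(dehydrate(input, env)).digest('hex')
  return `fixtures/${hash}.json`
}

// Two developers, two machines, same logical request.
const alice: Env = { cwd: '/Users/alice/projects/myapp', configHome: '/Users/alice/.claude' }
const bob: Env = { cwd: '/home/bob/work/myapp', configHome: '/home/bob/.claude' }

const fromAlice = fixtureName('read /Users/alice/projects/myapp/src/a.ts at 2026-03-31T14:22:01.334Z', alice)
const fromBob = fixtureName('read /home/bob/work/myapp/src/a.ts at 2025-01-01T00:00:00.000Z', bob)
console.log(fromAlice === fromBob) // true
```

Both inputs dehydrate to the same string, so both machines look up the same fixture file.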

On subsequent runs, the same dehydration is applied to the input, the same hash is computed, and if a fixture file exists, the response is loaded from disk. Responses are hydrated back, replacing [CWD] with the actual working directory, [CONFIG_HOME] with the actual config path.

function hydrateValue(s: unknown): unknown {
  if (typeof s !== 'string') return s
  return s
    .replaceAll('[NUM]', '1')
    .replaceAll('[DURATION]', '100')
    .replaceAll('[CONFIG_HOME]', getClaudeConfigHomeDir())
    .replaceAll('[CWD]', getCwd())
}

There is a withStreamingVCR wrapper that handles async generators. It buffers all stream events during recording, saves them as a fixture, and replays them as a synchronous yield on playback. The caller can’t tell the difference between a live stream and a replayed one. Even the token counting API call (withTokenCountVCR) gets the same treatment, with extra normalization for CWD slugs (the slash-to-dash project path that appears in auto-memory paths), UUIDs, and timestamps. System prompts embed the working directory in multiple forms, and messages carry fresh UUIDs per run, so without this normalization, every test run would produce a new hash and fixtures would never hit.
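The record-then-replay behaviour for streams can be sketched roughly like this. I'm borrowing the `withStreamingVCR` name from the source, but the signature, the in-memory `fixtures` map (standing in for the fixtures directory), and the choice to pass events through while recording are all assumptions on my part.

```typescript
// In-memory stand-in for fixtures/{hash}.json (an assumption to keep this self-contained).
const fixtures = new Map<string, unknown[]>()

// Wrap an async generator factory: record on the first call, replay afterwards.
async function* withStreamingVCR<T>(
  key: string,
  live: () => AsyncGenerator<T>,
): AsyncGenerator<T> {
  const recorded = fixtures.get(key) as T[] | undefined
  if (recorded) {
    // Playback: yield the buffered events back to back, no API involved.
    yield* recorded
    return
  }
  const buffer: T[] = []
  for await (const event of live()) {
    buffer.push(event)
    yield event
  }
  // Save only after the stream completes, so a partial stream never becomes a fixture.
  fixtures.set(key, buffer)
}

// Hypothetical "API" that counts how often it is actually called.
let liveCalls = 0
async function* fakeApi(): AsyncGenerator<string> {
  liveCalls++
  yield 'Hel'
  yield 'lo'
}

async function collect<T>(stream: AsyncGenerator<T>): Promise<T[]> {
  const out: T[] = []
  for await (const item of stream) out.push(item)
  return out
}
```

The caller consumes both paths with the same `for await` loop, which is the whole point: nothing downstream needs to know whether the events came from the network or from disk.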

And in CI, if a fixture is missing, the test just fails:

if ((env.isCI || process.env.CI) && !isEnvTruthy(process.env.VCR_RECORD)) {
  throw new Error(
    `Fixture missing: ${filename}. Re-run tests with VCR_RECORD=1, then commit the result.`
  )
}

So the entire test suite runs without any API calls. Fixtures are committed to the repo. A new developer clones the repo, runs tests, everything passes without an API key.

The VCR pattern really sounded “new” to me, but it turns out people have been doing this for a long time. The VCR Ruby gem has been doing HTTP cassette recording since 2010. But the dehydration/hydration layer is what makes it work for LLM testing. The challenge with LLM fixtures isn’t recording responses, it’s making the input hashing stable across environments. Two developers on different machines, with different usernames, different CWDs, and different timestamps on their messages, need to produce the same fixture hash for the same logical test. The multi-layer normalization (paths, UUIDs, timestamps, Windows path variants, CWD slugs) is the non-obvious engineering that makes it actually work. If you’re building an agent and your tests hit a live API, this pattern is worth stealing.

Speculation: Branch Prediction for Coding Agents

While reading this code I immediately turned into this meme. I was like: branch prediction in Claude Code?

CPUs have branch prediction: the processor guesses which way a conditional branch will go and starts executing instructions speculatively along the predicted path. If the prediction was correct, you’ve saved the latency of waiting for the branch to resolve. If it’s wrong, you throw away the speculative work and resume from the branch point. The key insight is that the prediction is right often enough to make speculative execution worth the wasted work when it’s wrong.

Claude Code does the same thing, but for coding agent turns. After the model finishes responding to the user, it does two things in the background. First, it predicts what the user will type next. A forked agent (sharing the main conversation’s prompt cache, so input tokens are nearly free) generates a “prompt suggestion.” The suggestion prompt is pretty simple:

Look at the user’s recent messages and original request. Predict what THEY would type. THE TEST: Would they think “I was just about to type that”?

This suggestion appears as ghost text in the input box, like autocomplete in a code editor but for natural language prompts.

Second, if speculation is enabled, Claude Code immediately starts running that predicted prompt in a sandboxed environment, without waiting for the user to accept it. Now obviously speculative execution needs isolation. If the model writes files during speculation and the user rejects the suggestion, those writes need to vanish. CC solves this with an application-level copy-on-write overlay:

overlay directory: ~/.claude/tmp/speculation/{pid}/{id}/

When the speculative agent tries to write a file, the original is first copied from the real filesystem to the overlay directory, and the write happens only in the overlay. Subsequent reads check the overlay first, falling back to the real filesystem. This is conceptually similar to how Docker’s overlayfs works, though Docker does it at the kernel level and CC does it with plain copyFile/mkdir calls. The agent thinks it’s working in the real filesystem, but all mutations are captured in a scratch directory.
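The copy-on-write behaviour described above can be sketched in a few lines. This is my own illustration, not the leaked implementation: the `Overlay` class and its method names are hypothetical, it uses synchronous fs calls for brevity, and it ignores things a real version must handle (paths escaping the root, deletes, symlinks).

```typescript
import { copyFileSync, existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { dirname, join, relative, resolve } from 'node:path'

// Hypothetical application-level overlay: writes land in overlayDir, reads prefer it.
class Overlay {
  private readonly root: string
  private readonly overlayDir: string

  constructor(root: string, overlayDir: string) {
    this.root = root
    this.overlayDir = overlayDir
  }

  // Map a path under the real root to its shadow copy inside the overlay.
  private shadow(file: string): string {
    return join(this.overlayDir, relative(this.root, resolve(this.root, file)))
  }

  read(file: string): string {
    const shadow = this.shadow(file)
    // Overlay first, real filesystem as the fallback.
    return readFileSync(existsSync(shadow) ? shadow : resolve(this.root, file), 'utf8')
  }

  write(file: string, content: string): void {
    const shadow = this.shadow(file)
    const real = resolve(this.root, file)
    mkdirSync(dirname(shadow), { recursive: true })
    // Copy-on-write: bring the original into the overlay once, then mutate only the copy.
    if (!existsSync(shadow) && existsSync(real)) copyFileSync(real, shadow)
    writeFileSync(shadow, content)
  }
}

// Usage: the real file stays untouched while the agent sees its own edit.
const root = mkdtempSync(join(tmpdir(), 'real-'))
const overlay = new Overlay(root, mkdtempSync(join(tmpdir(), 'spec-')))
writeFileSync(join(root, 'a.txt'), 'original')
overlay.write('a.txt', 'speculative edit')
console.log(overlay.read('a.txt'))                     // speculative edit
console.log(readFileSync(join(root, 'a.txt'), 'utf8')) // original
```

For a whole-file write the initial copy is redundant, but it matters for partial edits, where the tool patches the copied original in place.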

Obviously speculation can’t run forever though. It stops at what they call “boundaries,” moments where the agent would need a decision the user hasn’t made. Any non-read-only shell command stops it (there’s a checkReadOnlyConstraints function that validates against an allowlist, so grep and cat are fine, but rm or npm install stop speculation). File writes stop it too, unless the user has set permissions to auto-accept. Any unrecognized tool stops it. And there are also hard caps: MAX_SPECULATION_TURNS = 20 and MAX_SPECULATION_MESSAGES = 100.
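The boundary rules above boil down to one decision function. Here is a rough sketch of it. Only the two MAX_SPECULATION constants come from the source; the `ToolCall` and `Progress` shapes, the tool names, the allowlist contents, and `boundaryReason` itself are made up for illustration, and the command check is deliberately crude (a real one needs proper shell parsing, which is exactly where regex-based checks tend to get bypassed).

```typescript
const MAX_SPECULATION_TURNS = 20
const MAX_SPECULATION_MESSAGES = 100

// Hypothetical allowlist; the real one is certainly longer and more careful.
const READ_ONLY_COMMANDS = new Set(['cat', 'grep', 'ls', 'head', 'tail', 'wc'])

interface ToolCall {
  tool: string
  command?: string
}

interface Progress {
  turns: number
  messages: number
  autoAcceptEdits: boolean
}

function isReadOnlyCommand(command: string): boolean {
  // Conservative: reject anything containing shell operators or substitutions.
  if (/[;&|<>`$()]/.test(command)) return false
  const program = command.trim().split(/\s+/)[0] ?? ''
  return READ_ONLY_COMMANDS.has(program)
}

// Returns null when speculation may continue, otherwise the reason it must stop.
function boundaryReason(call: ToolCall, progress: Progress): string | null {
  if (progress.turns >= MAX_SPECULATION_TURNS) return 'max turns reached'
  if (progress.messages >= MAX_SPECULATION_MESSAGES) return 'max messages reached'
  switch (call.tool) {
    case 'read':
      return null
    case 'bash':
      return isReadOnlyCommand(call.command ?? '') ? null : 'non-read-only command'
    case 'write':
    case 'edit':
      return progress.autoAcceptEdits ? null : 'file write needs approval'
    default:
      return `unrecognized tool: ${call.tool}`
  }
}

const fresh: Progress = { turns: 1, messages: 4, autoAcceptEdits: false }
console.log(boundaryReason({ tool: 'bash', command: 'grep -rn TODO src' }, fresh)) // null
console.log(boundaryReason({ tool: 'bash', command: 'npm install' }, fresh))       // non-read-only command
```

Defaulting to "stop" for anything unknown is the important design choice: a new tool has to be explicitly classified before speculation is allowed to run it.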

If the user accepts the suggestion, Claude copies overlay files to the real filesystem, injects the speculated messages into the conversation as if they happened normally, merges the speculative file-read cache into the main session’s cache, and cleans up the overlay. If the user types anything else (any keystroke at all), the AbortController fires, the overlay is rm -rf’d, and state resets to idle. This is the funniest part to me: you could want exactly what Claude did, but instead of accepting the suggestion you typed “shi” by mistake and went “oh no, Claude was right”, and by then all of that background work is already gone.

And to keep the CPU-like behaviour going, the prediction continues: while the user is reviewing the first speculation’s results (but hasn’t accepted yet), CC starts generating the NEXT prompt suggestion based on the speculated outcome. If the user accepts, the pipelined suggestion is immediately promoted to the input box, and a new speculation starts on that suggestion. This is instruction pipelining. While stage n is waiting for commit (user acceptance), stage n+1 is already in flight, and stage n+2 is ready to start. In the best case, the agent is always one step ahead.
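The two endings (accept or abort) can be sketched as a tiny state machine around an AbortController. Treat this as an illustration only: the `Speculation` class, its states, and the directory layout are my own assumptions, and it skips the message injection and read-cache merge entirely.

```typescript
import { cpSync, existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

type State = 'idle' | 'running' | 'accepted'

// Hypothetical handle for one in-flight speculation.
class Speculation {
  state: State = 'running'
  readonly controller = new AbortController()
  readonly realDir: string
  readonly overlayDir: string

  constructor(realDir: string, overlayDir: string) {
    this.realDir = realDir
    this.overlayDir = overlayDir
  }

  // User accepted the ghost text: promote overlay files, then clean up.
  accept(): void {
    if (this.state !== 'running') return
    cpSync(this.overlayDir, this.realDir, { recursive: true })
    this.cleanup('accepted')
  }

  // Any other keystroke: cancel in-flight work and drop every speculative write.
  abort(): void {
    if (this.state !== 'running') return
    this.controller.abort()
    this.cleanup('idle')
  }

  private cleanup(next: State): void {
    rmSync(this.overlayDir, { recursive: true, force: true })
    this.state = next
  }
}

// Usage: the same speculative edit with two different endings.
function speculate(): Speculation {
  const real = mkdtempSync(join(tmpdir(), 'real-'))
  const overlay = mkdtempSync(join(tmpdir(), 'spec-'))
  writeFileSync(join(real, 'a.txt'), 'original')
  writeFileSync(join(overlay, 'a.txt'), 'speculative edit')
  return new Speculation(real, overlay)
}

const accepted = speculate()
accepted.accept()
console.log(readFileSync(join(accepted.realDir, 'a.txt'), 'utf8')) // speculative edit

const rejected = speculate()
rejected.abort()
console.log(readFileSync(join(rejected.realDir, 'a.txt'), 'utf8')) // original
```

The speculative agent would pass `controller.signal` to its API calls and tool runs, so one keystroke cancels everything that is still in flight.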

The entire system rides on prompt cache sharing. The suggestion agent, the speculation agent, and the main conversation agent all share identical CacheSafeParams: same system prompt, same tools, same model, same message prefix. The fork just appends one user message at the end, which means the input cost for speculation is almost entirely cache hits.

The code is pretty serious about this. Comments warn against ANY parameter change:

DO NOT override any API parameter that differs from the parent request. The fork piggybacks on the main thread’s prompt cache by sending identical cache-key params. […] PR #18143 tried effort:’low’ and caused a 45x spike in cache writes (92.7% -> 61% hit rate).

Speculation uses the same model as the main loop. If you’re running Opus, speculation runs Opus. But the incremental cost is mostly output tokens. A 20-turn speculation cycle might cost $0.10-0.25. If the prediction is right and the user accepts, that saved them 30-60 seconds of waiting. If the prediction is wrong, nobody notices.

The whole system is really cool, and since they have hard caps you can assume it isn’t burning a lot of tokens. But the weirdest thing for me was that there’s no learning at all. No adaptation, no history of past acceptance rates. The suggestion prompt is purely context-driven, generated fresh from the current conversation state every time. There’s no “this user accepts test-running suggestions 80% of the time” signal feeding back into the prediction. The analytics track tengu_speculation events with outcome (accepted/aborted/error), time saved, and tool use counts, but none of that feeds back into suggestion generation. They have this whole multi-layered MEMORY system, but none of it is used here. I’m not sure whether there are concerns about user profiling or this is just a beta-X version, I don’t know. I feel like this whole branch prediction would get way better if the global and/or project ~~AGENTS.md~~ CLAUDE.md, along with the project/global MEMORY.md, were somehow merged into it. Or maybe the cache economics that make speculation cheap are the same economics that make personalization hard.

What I Actually Took

The whole point of this exercise was to find things I could port to pi. After going through the codebase I ended up building a few extensions but the one that I think matters most is the file read dedup cache.

In their FileReadTool.ts, Claude Code tracks the mtime of every file it reads. When the model asks to read the same file again with the same offset/limit, it checks if the file has been modified since the last read. If it hasn’t, instead of sending the full file content again, it returns a stub: File unchanged since last read. The content from the earlier Read tool_result in this conversation is still current — refer to that instead of re-reading. Their telemetry shows ~18% of Read calls are same-file collisions. For my use case (bug bounty, where I’m reading the same 15 source files hundreds of times across a session and those files change maybe once a month), that number is probably way higher. The important detail is that after an edit or write, the file’s mtime changes, so the dedup check fails and the next read returns fresh content. You don’t want the model working with stale data after it just edited something.

I built this as a pi extension that intercepts tool_call events for the read tool, maintains an in-memory mtime cache, and returns a block stub on cache hits. It clears on edits, writes, session start, compaction, and branch switches. Simple, maybe 80 lines of actual logic, and it’s already showing solid hit rates in real sessions.
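The core of the mtime check fits in a small class. To be clear, this is a simplified sketch of the idea and not my actual extension or pi's extension API: the `ReadDedupCache` name, the line-based offset/limit handling, and the shortened stub text are all assumptions made for the example.

```typescript
import { mkdtempSync, readFileSync, statSync, utimesSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Shortened version of the stub message, for illustration.
const STUB = 'File unchanged since last read.'

class ReadDedupCache {
  // key (path + offset + limit) -> mtimeMs observed at the last full read
  private readonly seen = new Map<string, number>()

  read(path: string, offset = 0, limit = Infinity): string {
    const key = `${path}:${offset}:${limit}`
    const mtime = statSync(path).mtimeMs
    // Same range, same mtime: the earlier tool result is still current.
    if (this.seen.get(key) === mtime) return STUB
    this.seen.set(key, mtime)
    const lines = readFileSync(path, 'utf8').split('\n')
    return lines.slice(offset, offset + limit).join('\n')
  }

  // Call on session start, compaction, branch switches, and similar resets.
  clear(): void {
    this.seen.clear()
  }
}

// Usage: second read is a stub, a read after an edit is fresh again.
const file = join(mkdtempSync(join(tmpdir(), 'dedup-')), 'a.ts')
writeFileSync(file, 'const x = 1')
const cache = new ReadDedupCache()
const first = cache.read(file)  // full content
const second = cache.read(file) // stub
writeFileSync(file, 'const x = 2')
// Force a visibly newer mtime so the demo doesn't depend on timestamp resolution.
utimesSync(file, new Date(), new Date(Date.now() + 5_000))
const third = cache.read(file)  // fresh content
```

Clearing on compaction matters because the stub points the model at an earlier tool result; once that result has been summarized away, the stub would refer to content that is no longer in context.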

Closing Thoughts

Look, the engineering in Claude Code is genuinely impressive. The VCR system, the speculation engine, the graduated context management, the forked agent pattern for cache sharing, all of it is well thought out. But a big part of why these systems exist and work the way they do is because Anthropic owns both sides. They own the model, the API, and the client. They have this cache_edits API that lets them precisely remove tool results from the server-side cache without re-uploading the conversation. The prompt cache sharing that makes speculation economically viable relies on Anthropic’s specific caching semantics.

A lot of this could probably be ported to third-party tools fairly easily, but it might not work as well. Either the tool ends up becoming a Claude Code variant because it relies so heavily on the Anthropic API, or the features don’t work as intended because the tool is trying to support several API providers.

That said, I don’t see myself leaving pi. The extensibility, the minimal setup (even if some Cursor user sees minimal-mode.ts and laughs at you without checking what it does ;) ), the fact that I can build exactly what I need as an extension without waiting for a feature to ship: that’s worth more to me than any amount of clever background optimization. Claude Code is cool. But I’ll keep stealing the ideas and building my own versions, thanks.
