feat(llm): cache-policy auto-placement (#26786)
packages/llm/README.md (new file, 130 lines)
@@ -0,0 +1,130 @@
# @opencode-ai/llm

Schema-first LLM core for opencode. One typed request, response, event, and tool language; provider quirks live in adapters, not in calling code.

```ts
import { Effect } from "effect"
import { LLM, LLMClient } from "@opencode-ai/llm"
import { OpenAI } from "@opencode-ai/llm/providers"

const model = OpenAI.model("gpt-4o-mini", { apiKey: process.env.OPENAI_API_KEY })

const request = LLM.request({
  model,
  system: "You are concise.",
  prompt: "Say hello in one short sentence.",
  generation: { maxTokens: 40 },
})

const program = Effect.gen(function* () {
  const response = yield* LLMClient.generate(request)
  console.log(response.text)
})
```

Run `LLMClient.stream(request)` instead of `generate` when you want incremental `LLMEvent`s. The event stream is provider-neutral — same shape across OpenAI Chat, OpenAI Responses, Anthropic Messages, Gemini, Bedrock Converse, and any OpenAI-compatible deployment.
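A minimal streaming sketch, hedged: it assumes `LLMClient.stream` returns a `Stream` of `LLMEvent`s directly, that `LLMEvent` is exported from the package root, and that text deltas carry a `text` field. Check the `LLMEvent` schema for the canonical payload shapes.

```ts
import { Effect, Stream } from "effect"
import { LLM, LLMClient, LLMEvent } from "@opencode-ai/llm"

// `model` as defined in the quick-start example above.
const streamed = LLMClient.stream(
  LLM.request({ model, prompt: "Stream a haiku about caching." }),
).pipe(
  // Keep only text deltas; the typed guard narrows the event type.
  Stream.filter(LLMEvent.is.text),
  Stream.runForEach((event) => Effect.sync(() => process.stdout.write(event.text))),
)
```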
## Public API

- **`LLM.request({...})`** — build a provider-neutral `LLMRequest`. Accepts ergonomic inputs (`system: string`, `prompt: string`) that normalize into the canonical Schema classes.
- **`LLM.generate` / `LLM.stream`** — re-exported from `LLMClient` for one-import use.
- **`LLM.user(...)` / `LLM.assistant(...)` / `LLM.toolMessage(...)`** — message constructors.
- **`LLM.toolCall(...)` / `LLM.toolResult(...)` / `LLM.toolDefinition(...)`** — tool-related parts.
- **`LLMClient.prepare(request)`** — compile a request through protocol body construction, validation, and HTTP preparation without sending. Useful for inspection and testing.
- **`LLMEvent.is.*`** — typed guards (`is.text`, `is.toolCall`, `is.requestFinish`, …) for filtering streams.
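Putting a few of these together, a sketch that builds a multi-turn, tool-equipped request with the message constructors and inspects the compiled provider body via `prepare` without sending anything. The tool-definition shape mirrors the package's tests; `get_weather` is a made-up example tool.

```ts
import { Effect } from "effect"
import { LLM, LLMClient } from "@opencode-ai/llm"

// `model` as in the quick-start example.
const inspect = Effect.gen(function* () {
  const request = LLM.request({
    model,
    system: "You are a weather assistant.",
    tools: [
      {
        name: "get_weather",
        description: "Look up current weather for a city",
        inputSchema: { type: "object", properties: { city: { type: "string" } } },
      },
    ],
    messages: [
      LLM.user("What's the weather in Oslo?"),
      LLM.assistant("Let me check."),
      LLM.user("Thanks!"),
    ],
  })

  // Compiled, validated provider body; handy in tests and for debugging.
  const prepared = yield* LLMClient.prepare(request)
  console.log(prepared.body)
})
```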
## Caching

Prompt caching is unified across providers. Mark content with a `CacheHint` and each protocol translates it to its wire format (`cache_control` on Anthropic, `cachePoint` on Bedrock; OpenAI's implicit caching needs no markers).

### Auto placement

The simplest path is `cache: "auto"` on the request:

```ts
LLM.request({
  model,
  system,
  messages,
  tools,
  cache: "auto",
})
```

`"auto"` places three breakpoints — last tool definition, last system part, latest user message. The last-user-message boundary is the load-bearing detail: in a tool-use loop, a single user turn expands into many assistant/tool round-trips, all sharing that prefix. Caching at that boundary lets every intra-turn API call hit.

On OpenAI and Gemini `"auto"` is a no-op (their wire formats don't accept inline markers — both use implicit caching). On Anthropic and Bedrock it emits provider-native cache markers.

### Granular policy

```ts
cache: {
  tools?: boolean,
  system?: boolean,
  messages?: "latest-user-message" | "latest-assistant" | { tail: number },
  ttlSeconds?: number, // ≥ 3600 → 1h on Anthropic/Bedrock; else 5m
}
```
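For example, to cache tool definitions and the system prompt for an hour and drop breakpoints on the last two messages instead of just the latest user turn (the values mirror the package's own tests):

```ts
LLM.request({
  model,
  system,
  messages,
  tools,
  cache: {
    tools: true,
    system: true,
    messages: { tail: 2 },
    ttlSeconds: 3600, // ≥ 3600 → 1h markers on Anthropic/Bedrock
  },
})
```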
### Manual hints

Inline `CacheHint` on any text / system / tool / tool-result part overrides automatic placement. The auto policy preserves manual hints; it only fills gaps.

```ts
LLM.request({
  model,
  system: [
    { type: "text", text: "stable system prompt", cache: { type: "ephemeral" } },
  ],
  ...
})
```

### Provider behavior table

| Protocol | `cache: "auto"` |
|---|---|
| Anthropic Messages | emits up to 3 `cache_control` markers (4-breakpoint cap enforced) |
| Bedrock Converse | emits up to 3 `cachePoint` blocks (4-breakpoint cap enforced) |
| OpenAI Chat / Responses | no-op (implicit caching above 1024 tokens) |
| Gemini | no-op (implicit caching on 2.5+; explicit `CachedContent` is out-of-band) |

Normalized cache usage is read back into `response.usage.cacheReadInputTokens` and `cacheWriteInputTokens` across every provider.
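So a quick cache-hit check can read those two fields off any response, whichever provider served it (reusing `request` from the quick start):

```ts
const checkCache = Effect.gen(function* () {
  const response = yield* LLMClient.generate(request)
  console.log("cache read:", response.usage.cacheReadInputTokens)
  console.log("cache write:", response.usage.cacheWriteInputTokens)
})
```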
## Providers

Each provider exports a `model(...)` helper that records identity, protocol, capabilities, auth, and defaults.

```ts
import { Anthropic } from "@opencode-ai/llm/providers"

const model = Anthropic.model("claude-sonnet-4-6", {
  apiKey: process.env.ANTHROPIC_API_KEY,
})
```

Included providers: OpenAI, Anthropic, Google (Gemini), Amazon Bedrock, Azure OpenAI, Cloudflare, GitHub Copilot, OpenRouter, xAI, plus generic OpenAI-compatible helpers for DeepSeek, Cerebras, Groq, Fireworks, Together, etc.

## Provider options & HTTP overlays

Three escape hatches in order of stability:

1. **`generation`** — portable knobs (`maxTokens`, `temperature`, `topP`, `topK`, penalties, seed, stop).
2. **`providerOptions: { <provider>: {...} }`** — typed-at-the-facade provider-specific knobs (OpenAI `promptCacheKey`, Anthropic `thinking`, Gemini `thinkingConfig`, OpenRouter routing).
3. **`http: { body, headers, query }`** — last-resort serializable overlays merged into the final HTTP request. Reach for this only when a stable typed path doesn't yet exist.

Model-level defaults are overridden by request-level values for each axis.
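A sketch combining the three layers on one request. The `anthropic` key name and the `thinking` payload shape are illustrative assumptions; the typed `ProviderOptions` facade defines the real fields.

```ts
LLM.request({
  model,
  prompt: "Summarize the design doc.",
  // 1. Portable generation knobs.
  generation: { maxTokens: 1024, temperature: 0.2 },
  // 2. Provider-specific knobs, typed at the facade (shape assumed here).
  providerOptions: {
    anthropic: { thinking: { type: "enabled", budget_tokens: 2048 } },
  },
  // 3. Last resort: raw overlays merged into the outgoing HTTP request.
  http: { headers: { "x-request-tag": "docs-example" } },
})
```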
## Routes

Adding a new model or deployment is usually 5–15 lines using `Route.make({ protocol, transport, ... })`. The four orthogonal pieces are protocol (body construction + stream parsing), transport (endpoint + auth + framing + encoding), defaults, and capabilities. See `AGENTS.md` for the architectural detail.

## Effect

This package is built on Effect. Public methods return `Effect` or `Stream`; provide `LLMClient.layer` (the default registers every shipped route) for runtime dispatch. The example at `example/tutorial.ts` is a runnable walkthrough.
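To run the quick-start `program` from the top of this README, provide the layer and execute it. A sketch, assuming the default `LLMClient.layer` value needs no extra configuration.

```ts
import { Effect } from "effect"
import { LLMClient } from "@opencode-ai/llm"

// `program` from the quick-start example above.
const main = program.pipe(Effect.provide(LLMClient.layer))

Effect.runPromise(main).catch(console.error)
```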
## See also

- `AGENTS.md` — architecture, route construction, contributor guide
- `example/tutorial.ts` — runnable end-to-end walkthrough
- `test/provider/*.test.ts` — fixture-first protocol tests; `*.recorded.test.ts` files cover live cassettes
packages/llm/src/cache-policy.ts (new file, 120 lines)
@@ -0,0 +1,120 @@
// Apply an `LLMRequest.cache` policy by injecting `CacheHint`s onto the parts
// the policy designates. Runs once at compile time, before the per-protocol
// body builder, so the existing inline-hint lowering path handles the rest.
//
// The default `"auto"` shape places one breakpoint at the last tool definition,
// one at the last system part, and one at the latest user message. This
// matches what production agent harnesses (LangChain's caching middleware,
// kern-ai's 10x cost-reduction playbook) converge on for tool-use loops: the
// latest user message stays put while a single turn explodes into many
// assistant/tool round-trips, so caching at that boundary lets every
// intra-turn API call hit the prefix.
//
// Manual `cache: CacheHint` placements on individual parts are preserved —
// this function only fills gaps the caller left empty.
import { CacheHint, type CachePolicy, type CachePolicyObject } from "./schema/options"
import { LLMRequest, Message, ToolDefinition, type ContentPart } from "./schema/messages"

const AUTO: CachePolicyObject = {
  tools: true,
  system: true,
  messages: "latest-user-message",
}

const NONE: CachePolicyObject = {}

// Resolution rules:
// - undefined → "none" (opt-in default so the policy never changes wire
//   shape for existing callers; downstream code can flip to
//   `cache: "auto"` once they audit the placement choices).
// - "auto" → the recommended policy: tools + system + latest user msg.
// - "none" → no auto placement; manual `CacheHint`s still flow.
// - object form → exactly what the caller asked for.
const resolve = (policy: CachePolicy | undefined): CachePolicyObject => {
  if (policy === undefined || policy === "none") return NONE
  if (policy === "auto") return AUTO
  return policy
}

// Protocols whose wire format ignores inline cache markers (OpenAI's implicit
// prefix caching, Gemini's implicit + out-of-band CachedContent). Skip the
// whole policy pass for these — emitting hints would be harmless but pointless.
const RESPECTS_INLINE_HINTS = new Set(["anthropic-messages", "bedrock-converse"])

const makeHint = (ttlSeconds: number | undefined): CacheHint =>
  ttlSeconds !== undefined ? new CacheHint({ type: "ephemeral", ttlSeconds }) : new CacheHint({ type: "ephemeral" })

const markLastTool = (
  tools: ReadonlyArray<ToolDefinition>,
  hint: CacheHint,
): ReadonlyArray<ToolDefinition> => {
  if (tools.length === 0) return tools
  const last = tools.length - 1
  if (tools[last]!.cache) return tools
  return tools.map((tool, i) => (i === last ? new ToolDefinition({ ...tool, cache: hint }) : tool))
}

const markLastSystem = (system: LLMRequest["system"], hint: CacheHint): LLMRequest["system"] => {
  if (system.length === 0) return system
  const last = system.length - 1
  if (system[last]!.cache) return system
  return system.map((part, i) => (i === last ? { ...part, cache: hint } : part))
}

const lastIndexOfRole = (messages: ReadonlyArray<Message>, role: Message["role"]): number =>
  messages.findLastIndex((m) => m.role === role)

// Mark the last text part of `messages[index]`. If no text part exists, mark
// the last content part regardless of type — that's the breakpoint position
// in tool-result-only messages too.
const markMessageAt = (
  messages: ReadonlyArray<Message>,
  index: number,
  hint: CacheHint,
): ReadonlyArray<Message> => {
  if (index < 0 || index >= messages.length) return messages
  const target = messages[index]!
  if (target.content.length === 0) return messages
  const lastTextIndex = target.content.findLastIndex((part) => part.type === "text")
  const markAt = lastTextIndex >= 0 ? lastTextIndex : target.content.length - 1
  const existing = target.content[markAt]!
  if ("cache" in existing && existing.cache) return messages
  const nextContent = target.content.map((part, i) =>
    i === markAt ? ({ ...part, cache: hint } as ContentPart) : part,
  )
  const next = new Message({ ...target, content: nextContent })
  // Single pass over `messages`, substituting the one updated entry. Long
  // conversations call this on every request, so avoid `.map()` here — its
  // closure dispatch and identity copies show up in profiling.
  const result = messages.slice()
  result[index] = next
  return result
}

const markMessages = (
  messages: ReadonlyArray<Message>,
  strategy: NonNullable<CachePolicyObject["messages"]>,
  hint: CacheHint,
): ReadonlyArray<Message> => {
  if (messages.length === 0) return messages
  if (strategy === "latest-user-message") return markMessageAt(messages, lastIndexOfRole(messages, "user"), hint)
  if (strategy === "latest-assistant") return markMessageAt(messages, lastIndexOfRole(messages, "assistant"), hint)
  const start = Math.max(0, messages.length - strategy.tail)
  let next = messages
  for (let i = start; i < messages.length; i++) next = markMessageAt(next, i, hint)
  return next
}

export const applyCachePolicy = (request: LLMRequest): LLMRequest => {
  if (!RESPECTS_INLINE_HINTS.has(request.model.route)) return request
  const policy = resolve(request.cache)
  if (!policy.tools && !policy.system && !policy.messages) return request

  const hint = makeHint(policy.ttlSeconds)
  const tools = policy.tools ? markLastTool(request.tools, hint) : request.tools
  const system = policy.system ? markLastSystem(request.system, hint) : request.system
  const messages = policy.messages ? markMessages(request.messages, policy.messages, hint) : request.messages

  if (tools === request.tools && system === request.system && messages === request.messages) return request
  return LLMRequest.update(request, { tools, system, messages })
}
@@ -8,6 +8,7 @@ import type { Transport, TransportRuntime } from "./transport"
 import { WebSocketExecutor } from "./transport"
 import type { Service as WebSocketExecutorService } from "./transport/websocket"
 import type { Protocol } from "./protocol"
+import { applyCachePolicy } from "../cache-policy"
 import * as ProviderShared from "../protocols/shared"
 import * as ToolRuntime from "../tool-runtime"
 import type { Tools } from "../tool"
@@ -400,7 +401,7 @@ export function make<Body, Prepared, Frame, Event, State>(
   // validated provider body plus transport-private prepared data, but does not
   // execute transport.
   const compile = Effect.fn("LLM.compile")(function* (request: LLMRequest) {
-    const resolved = resolveRequestOptions(request)
+    const resolved = applyCachePolicy(resolveRequestOptions(request))
     const route = registeredRoute(resolved.model.route)
     if (!route) return yield* noRoute(resolved.model)
@@ -1,6 +1,6 @@
 import { Schema } from "effect"
 import { JsonSchema, MessageRole, ProviderMetadata } from "./ids"
-import { CacheHint, GenerationOptions, HttpOptions, ModelRef, ProviderOptions } from "./options"
+import { CacheHint, CachePolicy, GenerationOptions, HttpOptions, ModelRef, ProviderOptions } from "./options"

 const isRecord = (value: unknown): value is Record<string, unknown> =>
   typeof value === "object" && value !== null && !Array.isArray(value)
@@ -206,6 +206,7 @@ export class LLMRequest extends Schema.Class<LLMRequest>("LLM.Request")({
   providerOptions: Schema.optional(ProviderOptions),
   http: Schema.optional(HttpOptions),
   responseFormat: Schema.optional(ResponseFormat),
+  cache: Schema.optional(CachePolicy),
   metadata: Schema.optional(Schema.Record(Schema.String, Schema.Unknown)),
 }) {}

@@ -223,6 +224,7 @@ export namespace LLMRequest {
     providerOptions: request.providerOptions,
     http: request.http,
     responseFormat: request.responseFormat,
+    cache: request.cache,
     metadata: request.metadata,
   })
@@ -200,3 +200,35 @@ export class CacheHint extends Schema.Class<CacheHint>("LLM.CacheHint")({
   type: Schema.Literals(["ephemeral", "persistent"]),
   ttlSeconds: Schema.optional(Schema.Number),
 }) {}
+
+// Auto-placement policy for prompt caching. The protocol-neutral lowering step
+// reads this and injects `CacheHint`s at the configured boundaries; the
+// per-protocol body builders then translate those hints into wire markers as
+// usual. `"auto"` is the recommended default for agent loops — it places one
+// breakpoint at the last tool definition, one at the last system part, and one
+// at the latest user message. The combination of provider invalidation
+// hierarchy (tools → system → messages) and Anthropic/Bedrock's 20-block
+// lookback means three trailing breakpoints reliably cover the static prefix.
+//
+// Pass `"none"` to opt out entirely (the legacy behavior). Pass the granular
+// object form to override individual choices.
+export const CachePolicyObject = Schema.Struct({
+  tools: Schema.optional(Schema.Boolean),
+  system: Schema.optional(Schema.Boolean),
+  messages: Schema.optional(
+    Schema.Union([
+      Schema.Literal("latest-user-message"),
+      Schema.Literal("latest-assistant"),
+      Schema.Struct({ tail: Schema.Number }),
+    ]),
+  ),
+  ttlSeconds: Schema.optional(Schema.Number),
+})
+export type CachePolicyObject = Schema.Schema.Type<typeof CachePolicyObject>
+
+export const CachePolicy = Schema.Union([
+  Schema.Literal("auto"),
+  Schema.Literal("none"),
+  CachePolicyObject,
+])
+export type CachePolicy = Schema.Schema.Type<typeof CachePolicy>
packages/llm/test/cache-policy.test.ts (new file, 262 lines)
@@ -0,0 +1,262 @@
import { describe, expect, test } from "bun:test"
import { Effect } from "effect"
import { CacheHint, LLM } from "../src"
import { LLMClient } from "../src/route"
import * as AnthropicMessages from "../src/protocols/anthropic-messages"
import * as BedrockConverse from "../src/protocols/bedrock-converse"
import * as Gemini from "../src/protocols/gemini"
import * as OpenAIChat from "../src/protocols/openai-chat"
import { applyCachePolicy } from "../src/cache-policy"
import { it } from "./lib/effect"

const anthropicModel = AnthropicMessages.model({
  id: "claude-sonnet-4-5",
  baseURL: "https://api.anthropic.test/v1/",
  headers: { "x-api-key": "test" },
})

const bedrockModel = BedrockConverse.model({
  id: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  credentials: { region: "us-east-1", accessKeyId: "fixture", secretAccessKey: "fixture" },
})

const openaiModel = OpenAIChat.model({
  id: "gpt-4o-mini",
  baseURL: "https://api.openai.test/v1/",
  headers: { authorization: "Bearer test" },
})

const geminiModel = Gemini.model({
  id: "gemini-2.5-flash",
  baseURL: "https://generativelanguage.test/v1beta/",
  headers: { "x-goog-api-key": "test" },
})

describe("applyCachePolicy", () => {
  it.effect("undefined cache leaves the request untouched (opt-in default)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "You are concise.",
          prompt: "hi",
        }),
      )

      expect(prepared.body).toMatchObject({
        system: [{ type: "text", text: "You are concise.", cache_control: undefined }],
      })
    }),
  )

  it.effect("'auto' marks the last tool, last system part, and latest user message on Anthropic", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys A",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          messages: [
            LLM.user("first user"),
            LLM.assistant("assistant reply"),
            LLM.user("latest user message"),
          ],
          cache: "auto",
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: { type: "ephemeral" } }],
        system: [{ type: "text", text: "Sys A", cache_control: { type: "ephemeral" } }],
        messages: [
          { role: "user", content: [{ type: "text", text: "first user" }] },
          { role: "assistant", content: [{ type: "text", text: "assistant reply" }] },
          {
            role: "user",
            content: [{ type: "text", text: "latest user message", cache_control: { type: "ephemeral" } }],
          },
        ],
      })
    }),
  )

  it.effect("'auto' is a no-op on OpenAI (implicit caching protocol)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: openaiModel,
          system: "Sys",
          prompt: "hi",
          cache: "auto",
        }),
      )

      const body = prepared.body as { messages: Array<{ content: unknown }> }
      // OpenAI doesn't accept cache_control on messages — policy must skip.
      const flat = JSON.stringify(body)
      expect(flat).not.toContain("cache_control")
      expect(flat).not.toContain("cachePoint")
    }),
  )

  it.effect("'auto' is a no-op on Gemini (out-of-band caching protocol)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: geminiModel,
          system: "Sys",
          prompt: "hi",
          cache: "auto",
        }),
      )

      const flat = JSON.stringify(prepared.body)
      expect(flat).not.toContain("cache_control")
      expect(flat).not.toContain("cachePoint")
    }),
  )

  it.effect("'auto' on Bedrock emits cachePoint markers in the right places", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: bedrockModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          messages: [LLM.user("first user"), LLM.assistant("reply"), LLM.user("latest user")],
          cache: "auto",
        }),
      )

      expect(prepared.body).toMatchObject({
        toolConfig: {
          tools: [{ toolSpec: { name: "t1" } }, { cachePoint: { type: "default" } }],
        },
        system: [{ text: "Sys" }, { cachePoint: { type: "default" } }],
        messages: [
          { role: "user", content: [{ text: "first user" }] },
          { role: "assistant", content: [{ text: "reply" }] },
          { role: "user", content: [{ text: "latest user" }, { cachePoint: { type: "default" } }] },
        ],
      })
    }),
  )

  it.effect("'none' disables auto placement even when manual hints exist", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          prompt: "hi",
          cache: "none",
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: undefined }],
        system: [{ type: "text", text: "Sys", cache_control: undefined }],
      })
    }),
  )

  it.effect("granular object form: tools-only marks just tools", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          prompt: "hi",
          cache: { tools: true },
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: { type: "ephemeral" } }],
        system: [{ type: "text", text: "Sys", cache_control: undefined }],
      })
    }),
  )

  it.effect("auto policy preserves manual CacheHints on other parts", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: [
            { type: "text", text: "first system", cache: new CacheHint({ type: "ephemeral", ttlSeconds: 3600 }) },
            { type: "text", text: "last system" },
          ],
          prompt: "hi",
          cache: "auto",
        }),
      )

      const body = prepared.body as { system: Array<{ text: string; cache_control?: unknown }> }
      expect(body.system[0]?.cache_control).toEqual({ type: "ephemeral", ttl: "1h" })
      expect(body.system[1]?.cache_control).toEqual({ type: "ephemeral" })
    }),
  )

  it.effect("ttlSeconds in the policy flows through to wire markers", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          prompt: "hi",
          cache: { system: true, ttlSeconds: 3600 },
        }),
      )

      expect(prepared.body).toMatchObject({
        system: [{ type: "text", text: "Sys", cache_control: { type: "ephemeral", ttl: "1h" } }],
      })
    }),
  )

  it.effect("messages: { tail: 2 } marks the last 2 message boundaries", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          messages: [LLM.user("u1"), LLM.assistant("a1"), LLM.user("u2"), LLM.assistant("a2")],
          cache: { messages: { tail: 2 } },
        }),
      )

      const body = prepared.body as { messages: Array<{ content: Array<{ cache_control?: unknown }> }> }
      expect(body.messages[0]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[1]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[2]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
      expect(body.messages[3]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
    }),
  )

  it.effect("'latest-assistant' marks the last assistant message", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          messages: [LLM.user("u1"), LLM.assistant("a1"), LLM.user("u2")],
          cache: { messages: "latest-assistant" },
        }),
      )

      const body = prepared.body as { messages: Array<{ content: Array<{ cache_control?: unknown }> }> }
      expect(body.messages[0]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[1]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
      expect(body.messages[2]?.content[0]?.cache_control).toBeUndefined()
    }),
  )

  test("returns the same request reference when policy is a no-op (pure function)", () => {
    const request = LLM.request({
      model: anthropicModel,
      prompt: "hi",
    })
    expect(applyCachePolicy(request)).toBe(request)
  })
})
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -28,7 +28,12 @@ const recorded = recordedTests({
   provider: "anthropic",
   protocol: "anthropic-messages",
   requires: ["ANTHROPIC_API_KEY"],
-  options: { redactor: Redactor.defaults({ requestHeaders: { allow: ["content-type", "anthropic-version"] } }) },
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: {
+    dispatch: "sequential",
+    redactor: Redactor.defaults({ requestHeaders: { allow: ["content-type", "anthropic-version"] } }),
+  },
 })

 describe("Anthropic Messages cache recorded", () => {

@@ -35,6 +35,9 @@ const recorded = recordedTests({
   provider: "amazon-bedrock",
   protocol: "bedrock-converse",
   requires: ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: { dispatch: "sequential" },
 })

 describe("Bedrock Converse cache recorded", () => {

@@ -8,7 +8,7 @@ import { recordedTests } from "../recorded-test"

 const model = Gemini.model({
   id: "gemini-2.5-flash",
-  apiKey: process.env.GEMINI_API_KEY ?? "fixture",
+  apiKey: process.env.GOOGLE_GENERATIVE_AI_API_KEY ?? process.env.GEMINI_API_KEY ?? "fixture",
 })

 // Gemini does implicit prefix caching on 2.5+ models above ~1024 tokens. The
@@ -28,7 +28,10 @@ const recorded = recordedTests({
   prefix: "gemini-cache",
   provider: "google",
   protocol: "gemini",
-  requires: ["GEMINI_API_KEY"],
+  requires: ["GOOGLE_GENERATIVE_AI_API_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: { dispatch: "sequential" },
 })

 describe("Gemini cache recorded", () => {

@@ -29,6 +29,9 @@ const recorded = recordedTests({
   provider: "openai",
   protocol: "openai-responses",
   requires: ["OPENAI_API_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction, not the cold-miss one.
+  options: { dispatch: "sequential" },
 })

 describe("OpenAI Responses cache recorded", () => {