feat(llm): cache-policy auto-placement (#26786)
packages/llm/README.md (new file, 130 lines)
@@ -0,0 +1,130 @@
# @opencode-ai/llm

Schema-first LLM core for opencode. One typed request, response, event, and tool language; provider quirks live in adapters, not in calling code.

```ts
import { Effect } from "effect"
import { LLM, LLMClient } from "@opencode-ai/llm"
import { OpenAI } from "@opencode-ai/llm/providers"

const model = OpenAI.model("gpt-4o-mini", { apiKey: process.env.OPENAI_API_KEY })

const request = LLM.request({
  model,
  system: "You are concise.",
  prompt: "Say hello in one short sentence.",
  generation: { maxTokens: 40 },
})

const program = Effect.gen(function* () {
  const response = yield* LLMClient.generate(request)
  console.log(response.text)
})
```

Run `LLMClient.stream(request)` instead of `generate` when you want incremental `LLMEvent`s. The event stream is provider-neutral — same shape across OpenAI Chat, OpenAI Responses, Anthropic Messages, Gemini, Bedrock Converse, and any OpenAI-compatible deployment.
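A minimal streaming sketch, hedged: it assumes `LLMClient.stream` returns a `Stream` of `LLMEvent`s directly, that `LLMEvent` is exported from the package root, and that text deltas carry a `text` field. Check the `LLMEvent` schema for the canonical payload shapes.

```ts
import { Effect, Stream } from "effect"
import { LLM, LLMClient, LLMEvent } from "@opencode-ai/llm"

// `model` as defined in the quick-start example above.
const streamed = LLMClient.stream(
  LLM.request({ model, prompt: "Stream a haiku about caching." }),
).pipe(
  // Keep only text deltas; the typed guard narrows the event type.
  Stream.filter(LLMEvent.is.text),
  Stream.runForEach((event) => Effect.sync(() => process.stdout.write(event.text))),
)
```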
## Public API

- **`LLM.request({...})`** — build a provider-neutral `LLMRequest`. Accepts ergonomic inputs (`system: string`, `prompt: string`) that normalize into the canonical Schema classes.
- **`LLM.generate` / `LLM.stream`** — re-exported from `LLMClient` for one-import use.
- **`LLM.user(...)` / `LLM.assistant(...)` / `LLM.toolMessage(...)`** — message constructors.
- **`LLM.toolCall(...)` / `LLM.toolResult(...)` / `LLM.toolDefinition(...)`** — tool-related parts.
- **`LLMClient.prepare(request)`** — compile a request through protocol body construction, validation, and HTTP preparation without sending. Useful for inspection and testing.
- **`LLMEvent.is.*`** — typed guards (`is.text`, `is.toolCall`, `is.requestFinish`, …) for filtering streams.
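Putting a few of these together, a sketch that builds a multi-turn, tool-equipped request with the message constructors and inspects the compiled provider body via `prepare` without sending anything. The tool-definition shape mirrors the package's tests; `get_weather` is a made-up example tool.

```ts
import { Effect } from "effect"
import { LLM, LLMClient } from "@opencode-ai/llm"

// `model` as in the quick-start example.
const inspect = Effect.gen(function* () {
  const request = LLM.request({
    model,
    system: "You are a weather assistant.",
    tools: [
      {
        name: "get_weather",
        description: "Look up current weather for a city",
        inputSchema: { type: "object", properties: { city: { type: "string" } } },
      },
    ],
    messages: [
      LLM.user("What's the weather in Oslo?"),
      LLM.assistant("Let me check."),
      LLM.user("Thanks!"),
    ],
  })

  // Compiled, validated provider body; handy in tests and for debugging.
  const prepared = yield* LLMClient.prepare(request)
  console.log(prepared.body)
})
```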
## Caching

Prompt caching is unified across providers. Mark content with a `CacheHint` and each protocol translates it to its wire format (`cache_control` on Anthropic, `cachePoint` on Bedrock; OpenAI's implicit caching needs no markers).

### Auto placement

The simplest path is `cache: "auto"` on the request:

```ts
LLM.request({
  model,
  system,
  messages,
  tools,
  cache: "auto",
})
```

`"auto"` places three breakpoints — last tool definition, last system part, latest user message. The last-user-message boundary is the load-bearing detail: in a tool-use loop, a single user turn expands into many assistant/tool round-trips, all sharing that prefix. Caching at that boundary lets every intra-turn API call hit.

On OpenAI and Gemini `"auto"` is a no-op (their wire formats don't accept inline markers — both use implicit caching). On Anthropic and Bedrock it emits provider-native cache markers.

### Granular policy

```ts
cache: {
  tools?: boolean,
  system?: boolean,
  messages?: "latest-user-message" | "latest-assistant" | { tail: number },
  ttlSeconds?: number, // ≥ 3600 → 1h on Anthropic/Bedrock; else 5m
}
```
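For example, to cache tool definitions and the system prompt for an hour and drop breakpoints on the last two messages instead of just the latest user turn (the values mirror the package's own tests):

```ts
LLM.request({
  model,
  system,
  messages,
  tools,
  cache: {
    tools: true,
    system: true,
    messages: { tail: 2 },
    ttlSeconds: 3600, // ≥ 3600 → 1h markers on Anthropic/Bedrock
  },
})
```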
### Manual hints

Inline `CacheHint` on any text / system / tool / tool-result part overrides automatic placement. The auto policy preserves manual hints; it only fills gaps.

```ts
LLM.request({
  model,
  system: [
    { type: "text", text: "stable system prompt", cache: { type: "ephemeral" } },
  ],
  ...
})
```

### Provider behavior table

| Protocol | `cache: "auto"` |
|---|---|
| Anthropic Messages | emits up to 3 `cache_control` markers (4-breakpoint cap enforced) |
| Bedrock Converse | emits up to 3 `cachePoint` blocks (4-breakpoint cap enforced) |
| OpenAI Chat / Responses | no-op (implicit caching above 1024 tokens) |
| Gemini | no-op (implicit caching on 2.5+; explicit `CachedContent` is out-of-band) |

Normalized cache usage is read back into `response.usage.cacheReadInputTokens` and `cacheWriteInputTokens` across every provider.
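So a quick cache-hit check can read those two fields off any response, whichever provider served it (reusing `request` from the quick start):

```ts
const checkCache = Effect.gen(function* () {
  const response = yield* LLMClient.generate(request)
  console.log("cache read:", response.usage.cacheReadInputTokens)
  console.log("cache write:", response.usage.cacheWriteInputTokens)
})
```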
## Providers

Each provider exports a `model(...)` helper that records identity, protocol, capabilities, auth, and defaults.

```ts
import { Anthropic } from "@opencode-ai/llm/providers"

const model = Anthropic.model("claude-sonnet-4-6", {
  apiKey: process.env.ANTHROPIC_API_KEY,
})
```

Included providers: OpenAI, Anthropic, Google (Gemini), Amazon Bedrock, Azure OpenAI, Cloudflare, GitHub Copilot, OpenRouter, xAI, plus generic OpenAI-compatible helpers for DeepSeek, Cerebras, Groq, Fireworks, Together, etc.

## Provider options & HTTP overlays

Three escape hatches in order of stability:

1. **`generation`** — portable knobs (`maxTokens`, `temperature`, `topP`, `topK`, penalties, seed, stop).
2. **`providerOptions: { <provider>: {...} }`** — typed-at-the-facade provider-specific knobs (OpenAI `promptCacheKey`, Anthropic `thinking`, Gemini `thinkingConfig`, OpenRouter routing).
3. **`http: { body, headers, query }`** — last-resort serializable overlays merged into the final HTTP request. Reach for this only when a stable typed path doesn't yet exist.

Model-level defaults are overridden by request-level values for each axis.
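A sketch combining the three layers on one request. The `anthropic` key name and the `thinking` payload shape are illustrative assumptions; the typed `ProviderOptions` facade defines the real fields.

```ts
LLM.request({
  model,
  prompt: "Summarize the design doc.",
  // 1. Portable generation knobs.
  generation: { maxTokens: 1024, temperature: 0.2 },
  // 2. Provider-specific knobs, typed at the facade (shape assumed here).
  providerOptions: {
    anthropic: { thinking: { type: "enabled", budget_tokens: 2048 } },
  },
  // 3. Last resort: raw overlays merged into the outgoing HTTP request.
  http: { headers: { "x-request-tag": "docs-example" } },
})
```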
## Routes

Adding a new model or deployment is usually 5–15 lines using `Route.make({ protocol, transport, ... })`. The four orthogonal pieces are protocol (body construction + stream parsing), transport (endpoint + auth + framing + encoding), defaults, and capabilities. See `AGENTS.md` for the architectural detail.

## Effect

This package is built on Effect. Public methods return `Effect` or `Stream`; provide `LLMClient.layer` (the default registers every shipped route) for runtime dispatch. The example at `example/tutorial.ts` is a runnable walkthrough.
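To run the quick-start `program` from the top of this README, provide the layer and execute it. A sketch, assuming the default `LLMClient.layer` value needs no extra configuration.

```ts
import { Effect } from "effect"
import { LLMClient } from "@opencode-ai/llm"

// `program` from the quick-start example above.
const main = program.pipe(Effect.provide(LLMClient.layer))

Effect.runPromise(main).catch(console.error)
```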
## See also

- `AGENTS.md` — architecture, route construction, contributor guide
- `example/tutorial.ts` — runnable end-to-end walkthrough
- `test/provider/*.test.ts` — fixture-first protocol tests; `*.recorded.test.ts` files cover live cassettes
packages/llm/src/cache-policy.ts (new file, 120 lines)
@@ -0,0 +1,120 @@
// Apply an `LLMRequest.cache` policy by injecting `CacheHint`s onto the parts
// the policy designates. Runs once at compile time, before the per-protocol
// body builder, so the existing inline-hint lowering path handles the rest.
//
// The default `"auto"` shape places one breakpoint at the last tool definition,
// one at the last system part, and one at the latest user message. This
// matches what production agent harnesses (LangChain's caching middleware,
// kern-ai's 10x cost-reduction playbook) converge on for tool-use loops: the
// latest user message stays put while a single turn explodes into many
// assistant/tool round-trips, so caching at that boundary lets every
// intra-turn API call hit the prefix.
//
// Manual `cache: CacheHint` placements on individual parts are preserved —
// this function only fills gaps the caller left empty.
import { CacheHint, type CachePolicy, type CachePolicyObject } from "./schema/options"
import { LLMRequest, Message, ToolDefinition, type ContentPart } from "./schema/messages"

const AUTO: CachePolicyObject = {
  tools: true,
  system: true,
  messages: "latest-user-message",
}

const NONE: CachePolicyObject = {}

// Resolution rules:
// - undefined → "none" (opt-in default so the policy never changes wire
//   shape for existing callers; downstream code can flip to
//   `cache: "auto"` once they audit the placement choices).
// - "auto" → the recommended policy: tools + system + latest user msg.
// - "none" → no auto placement; manual `CacheHint`s still flow.
// - object form → exactly what the caller asked for.
const resolve = (policy: CachePolicy | undefined): CachePolicyObject => {
  if (policy === undefined || policy === "none") return NONE
  if (policy === "auto") return AUTO
  return policy
}

// Protocols whose wire format ignores inline cache markers (OpenAI's implicit
// prefix caching, Gemini's implicit + out-of-band CachedContent). Skip the
// whole policy pass for these — emitting hints would be harmless but pointless.
const RESPECTS_INLINE_HINTS = new Set(["anthropic-messages", "bedrock-converse"])

const makeHint = (ttlSeconds: number | undefined): CacheHint =>
  ttlSeconds !== undefined ? new CacheHint({ type: "ephemeral", ttlSeconds }) : new CacheHint({ type: "ephemeral" })

const markLastTool = (
  tools: ReadonlyArray<ToolDefinition>,
  hint: CacheHint,
): ReadonlyArray<ToolDefinition> => {
  if (tools.length === 0) return tools
  const last = tools.length - 1
  if (tools[last]!.cache) return tools
  return tools.map((tool, i) => (i === last ? new ToolDefinition({ ...tool, cache: hint }) : tool))
}

const markLastSystem = (system: LLMRequest["system"], hint: CacheHint): LLMRequest["system"] => {
  if (system.length === 0) return system
  const last = system.length - 1
  if (system[last]!.cache) return system
  return system.map((part, i) => (i === last ? { ...part, cache: hint } : part))
}

const lastIndexOfRole = (messages: ReadonlyArray<Message>, role: Message["role"]): number =>
  messages.findLastIndex((m) => m.role === role)

// Mark the last text part of `messages[index]`. If no text part exists, mark
// the last content part regardless of type — that's the breakpoint position
// in tool-result-only messages too.
const markMessageAt = (
  messages: ReadonlyArray<Message>,
  index: number,
  hint: CacheHint,
): ReadonlyArray<Message> => {
  if (index < 0 || index >= messages.length) return messages
  const target = messages[index]!
  if (target.content.length === 0) return messages
  const lastTextIndex = target.content.findLastIndex((part) => part.type === "text")
  const markAt = lastTextIndex >= 0 ? lastTextIndex : target.content.length - 1
  const existing = target.content[markAt]!
  if ("cache" in existing && existing.cache) return messages
  const nextContent = target.content.map((part, i) =>
    i === markAt ? ({ ...part, cache: hint } as ContentPart) : part,
  )
  const next = new Message({ ...target, content: nextContent })
  // Single pass over `messages`, substituting the one updated entry. Long
  // conversations call this on every request, so avoid `.map()` here — its
  // closure dispatch and identity copies show up in profiling.
  const result = messages.slice()
  result[index] = next
  return result
}

const markMessages = (
  messages: ReadonlyArray<Message>,
  strategy: NonNullable<CachePolicyObject["messages"]>,
  hint: CacheHint,
): ReadonlyArray<Message> => {
  if (messages.length === 0) return messages
  if (strategy === "latest-user-message") return markMessageAt(messages, lastIndexOfRole(messages, "user"), hint)
  if (strategy === "latest-assistant") return markMessageAt(messages, lastIndexOfRole(messages, "assistant"), hint)
  const start = Math.max(0, messages.length - strategy.tail)
  let next = messages
  for (let i = start; i < messages.length; i++) next = markMessageAt(next, i, hint)
  return next
}

export const applyCachePolicy = (request: LLMRequest): LLMRequest => {
  if (!RESPECTS_INLINE_HINTS.has(request.model.route)) return request
  const policy = resolve(request.cache)
  if (!policy.tools && !policy.system && !policy.messages) return request

  const hint = makeHint(policy.ttlSeconds)
  const tools = policy.tools ? markLastTool(request.tools, hint) : request.tools
  const system = policy.system ? markLastSystem(request.system, hint) : request.system
  const messages = policy.messages ? markMessages(request.messages, policy.messages, hint) : request.messages

  if (tools === request.tools && system === request.system && messages === request.messages) return request
  return LLMRequest.update(request, { tools, system, messages })
}
@@ -8,6 +8,7 @@ import type { Transport, TransportRuntime } from "./transport"
 import { WebSocketExecutor } from "./transport"
 import type { Service as WebSocketExecutorService } from "./transport/websocket"
 import type { Protocol } from "./protocol"
+import { applyCachePolicy } from "../cache-policy"
 import * as ProviderShared from "../protocols/shared"
 import * as ToolRuntime from "../tool-runtime"
 import type { Tools } from "../tool"
@@ -400,7 +401,7 @@ export function make<Body, Prepared, Frame, Event, State>(
   // validated provider body plus transport-private prepared data, but does not
   // execute transport.
   const compile = Effect.fn("LLM.compile")(function* (request: LLMRequest) {
-    const resolved = resolveRequestOptions(request)
+    const resolved = applyCachePolicy(resolveRequestOptions(request))
     const route = registeredRoute(resolved.model.route)
     if (!route) return yield* noRoute(resolved.model)
@@ -1,6 +1,6 @@
 import { Schema } from "effect"
 import { JsonSchema, MessageRole, ProviderMetadata } from "./ids"
-import { CacheHint, GenerationOptions, HttpOptions, ModelRef, ProviderOptions } from "./options"
+import { CacheHint, CachePolicy, GenerationOptions, HttpOptions, ModelRef, ProviderOptions } from "./options"

 const isRecord = (value: unknown): value is Record<string, unknown> =>
   typeof value === "object" && value !== null && !Array.isArray(value)
@@ -206,6 +206,7 @@ export class LLMRequest extends Schema.Class<LLMRequest>("LLM.Request")({
   providerOptions: Schema.optional(ProviderOptions),
   http: Schema.optional(HttpOptions),
   responseFormat: Schema.optional(ResponseFormat),
+  cache: Schema.optional(CachePolicy),
   metadata: Schema.optional(Schema.Record(Schema.String, Schema.Unknown)),
 }) {}

@@ -223,6 +224,7 @@ export namespace LLMRequest {
     providerOptions: request.providerOptions,
     http: request.http,
     responseFormat: request.responseFormat,
+    cache: request.cache,
     metadata: request.metadata,
   })
@@ -200,3 +200,35 @@ export class CacheHint extends Schema.Class<CacheHint>("LLM.CacheHint")({
   type: Schema.Literals(["ephemeral", "persistent"]),
   ttlSeconds: Schema.optional(Schema.Number),
 }) {}
+
+// Auto-placement policy for prompt caching. The protocol-neutral lowering step
+// reads this and injects `CacheHint`s at the configured boundaries; the
+// per-protocol body builders then translate those hints into wire markers as
+// usual. `"auto"` is the recommended default for agent loops — it places one
+// breakpoint at the last tool definition, one at the last system part, and one
+// at the latest user message. The combination of provider invalidation
+// hierarchy (tools → system → messages) and Anthropic/Bedrock's 20-block
+// lookback means three trailing breakpoints reliably cover the static prefix.
+//
+// Pass `"none"` to opt out entirely (the legacy behavior). Pass the granular
+// object form to override individual choices.
+export const CachePolicyObject = Schema.Struct({
+  tools: Schema.optional(Schema.Boolean),
+  system: Schema.optional(Schema.Boolean),
+  messages: Schema.optional(
+    Schema.Union([
+      Schema.Literal("latest-user-message"),
+      Schema.Literal("latest-assistant"),
+      Schema.Struct({ tail: Schema.Number }),
+    ]),
+  ),
+  ttlSeconds: Schema.optional(Schema.Number),
+})
+export type CachePolicyObject = Schema.Schema.Type<typeof CachePolicyObject>
+
+export const CachePolicy = Schema.Union([
+  Schema.Literal("auto"),
+  Schema.Literal("none"),
+  CachePolicyObject,
+])
+export type CachePolicy = Schema.Schema.Type<typeof CachePolicy>
packages/llm/test/cache-policy.test.ts (new file, 262 lines)
@@ -0,0 +1,262 @@
import { describe, expect, test } from "bun:test"
import { Effect } from "effect"
import { CacheHint, LLM } from "../src"
import { LLMClient } from "../src/route"
import * as AnthropicMessages from "../src/protocols/anthropic-messages"
import * as BedrockConverse from "../src/protocols/bedrock-converse"
import * as Gemini from "../src/protocols/gemini"
import * as OpenAIChat from "../src/protocols/openai-chat"
import { applyCachePolicy } from "../src/cache-policy"
import { it } from "./lib/effect"

const anthropicModel = AnthropicMessages.model({
  id: "claude-sonnet-4-5",
  baseURL: "https://api.anthropic.test/v1/",
  headers: { "x-api-key": "test" },
})

const bedrockModel = BedrockConverse.model({
  id: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  credentials: { region: "us-east-1", accessKeyId: "fixture", secretAccessKey: "fixture" },
})

const openaiModel = OpenAIChat.model({
  id: "gpt-4o-mini",
  baseURL: "https://api.openai.test/v1/",
  headers: { authorization: "Bearer test" },
})

const geminiModel = Gemini.model({
  id: "gemini-2.5-flash",
  baseURL: "https://generativelanguage.test/v1beta/",
  headers: { "x-goog-api-key": "test" },
})

describe("applyCachePolicy", () => {
  it.effect("undefined cache leaves the request untouched (opt-in default)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "You are concise.",
          prompt: "hi",
        }),
      )

      expect(prepared.body).toMatchObject({
        system: [{ type: "text", text: "You are concise.", cache_control: undefined }],
      })
    }),
  )

  it.effect("'auto' marks the last tool, last system part, and latest user message on Anthropic", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys A",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          messages: [
            LLM.user("first user"),
            LLM.assistant("assistant reply"),
            LLM.user("latest user message"),
          ],
          cache: "auto",
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: { type: "ephemeral" } }],
        system: [{ type: "text", text: "Sys A", cache_control: { type: "ephemeral" } }],
        messages: [
          { role: "user", content: [{ type: "text", text: "first user" }] },
          { role: "assistant", content: [{ type: "text", text: "assistant reply" }] },
          {
            role: "user",
            content: [{ type: "text", text: "latest user message", cache_control: { type: "ephemeral" } }],
          },
        ],
      })
    }),
  )

  it.effect("'auto' is a no-op on OpenAI (implicit caching protocol)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: openaiModel,
          system: "Sys",
          prompt: "hi",
          cache: "auto",
        }),
      )

      const body = prepared.body as { messages: Array<{ content: unknown }> }
      // OpenAI doesn't accept cache_control on messages — policy must skip.
      const flat = JSON.stringify(body)
      expect(flat).not.toContain("cache_control")
      expect(flat).not.toContain("cachePoint")
    }),
  )

  it.effect("'auto' is a no-op on Gemini (out-of-band caching protocol)", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: geminiModel,
          system: "Sys",
          prompt: "hi",
          cache: "auto",
        }),
      )

      const flat = JSON.stringify(prepared.body)
      expect(flat).not.toContain("cache_control")
      expect(flat).not.toContain("cachePoint")
    }),
  )

  it.effect("'auto' on Bedrock emits cachePoint markers in the right places", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: bedrockModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          messages: [LLM.user("first user"), LLM.assistant("reply"), LLM.user("latest user")],
          cache: "auto",
        }),
      )

      expect(prepared.body).toMatchObject({
        toolConfig: {
          tools: [{ toolSpec: { name: "t1" } }, { cachePoint: { type: "default" } }],
        },
        system: [{ text: "Sys" }, { cachePoint: { type: "default" } }],
        messages: [
          { role: "user", content: [{ text: "first user" }] },
          { role: "assistant", content: [{ text: "reply" }] },
          { role: "user", content: [{ text: "latest user" }, { cachePoint: { type: "default" } }] },
        ],
      })
    }),
  )

  it.effect("'none' disables auto placement even when manual hints exist", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          prompt: "hi",
          cache: "none",
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: undefined }],
        system: [{ type: "text", text: "Sys", cache_control: undefined }],
      })
    }),
  )

  it.effect("granular object form: tools-only marks just tools", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          tools: [{ name: "t1", description: "t1", inputSchema: { type: "object", properties: {} } }],
          prompt: "hi",
          cache: { tools: true },
        }),
      )

      expect(prepared.body).toMatchObject({
        tools: [{ name: "t1", cache_control: { type: "ephemeral" } }],
        system: [{ type: "text", text: "Sys", cache_control: undefined }],
      })
    }),
  )

  it.effect("auto policy preserves manual CacheHints on other parts", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: [
            { type: "text", text: "first system", cache: new CacheHint({ type: "ephemeral", ttlSeconds: 3600 }) },
            { type: "text", text: "last system" },
          ],
          prompt: "hi",
          cache: "auto",
        }),
      )

      const body = prepared.body as { system: Array<{ text: string; cache_control?: unknown }> }
      expect(body.system[0]?.cache_control).toEqual({ type: "ephemeral", ttl: "1h" })
      expect(body.system[1]?.cache_control).toEqual({ type: "ephemeral" })
    }),
  )

  it.effect("ttlSeconds in the policy flows through to wire markers", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          system: "Sys",
          prompt: "hi",
          cache: { system: true, ttlSeconds: 3600 },
        }),
      )

      expect(prepared.body).toMatchObject({
        system: [{ type: "text", text: "Sys", cache_control: { type: "ephemeral", ttl: "1h" } }],
      })
    }),
  )

  it.effect("messages: { tail: 2 } marks the last 2 message boundaries", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          messages: [LLM.user("u1"), LLM.assistant("a1"), LLM.user("u2"), LLM.assistant("a2")],
          cache: { messages: { tail: 2 } },
        }),
      )

      const body = prepared.body as { messages: Array<{ content: Array<{ cache_control?: unknown }> }> }
      expect(body.messages[0]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[1]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[2]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
      expect(body.messages[3]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
    }),
  )

  it.effect("'latest-assistant' marks the last assistant message", () =>
    Effect.gen(function* () {
      const prepared = yield* LLMClient.prepare(
        LLM.request({
          model: anthropicModel,
          messages: [LLM.user("u1"), LLM.assistant("a1"), LLM.user("u2")],
          cache: { messages: "latest-assistant" },
        }),
      )

      const body = prepared.body as { messages: Array<{ content: Array<{ cache_control?: unknown }> }> }
      expect(body.messages[0]?.content[0]?.cache_control).toBeUndefined()
      expect(body.messages[1]?.content[0]?.cache_control).toEqual({ type: "ephemeral" })
      expect(body.messages[2]?.content[0]?.cache_control).toBeUndefined()
    }),
  )

  test("returns the same request reference when policy is a no-op (pure function)", () => {
    const request = LLM.request({
      model: anthropicModel,
      prompt: "hi",
    })
    expect(applyCachePolicy(request)).toBe(request)
  })
})
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -28,7 +28,12 @@ const recorded = recordedTests({
   provider: "anthropic",
   protocol: "anthropic-messages",
   requires: ["ANTHROPIC_API_KEY"],
-  options: { redactor: Redactor.defaults({ requestHeaders: { allow: ["content-type", "anthropic-version"] } }) },
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: {
+    dispatch: "sequential",
+    redactor: Redactor.defaults({ requestHeaders: { allow: ["content-type", "anthropic-version"] } }),
+  },
 })

 describe("Anthropic Messages cache recorded", () => {

@@ -35,6 +35,9 @@ const recorded = recordedTests({
   provider: "amazon-bedrock",
   protocol: "bedrock-converse",
   requires: ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: { dispatch: "sequential" },
 })

 describe("Bedrock Converse cache recorded", () => {

@@ -8,7 +8,7 @@ import { recordedTests } from "../recorded-test"

 const model = Gemini.model({
   id: "gemini-2.5-flash",
-  apiKey: process.env.GEMINI_API_KEY ?? "fixture",
+  apiKey: process.env.GOOGLE_GENERATIVE_AI_API_KEY ?? process.env.GEMINI_API_KEY ?? "fixture",
 })

 // Gemini does implicit prefix caching on 2.5+ models above ~1024 tokens. The
@@ -28,7 +28,10 @@ const recorded = recordedTests({
   prefix: "gemini-cache",
   provider: "google",
   protocol: "gemini",
-  requires: ["GEMINI_API_KEY"],
+  requires: ["GOOGLE_GENERATIVE_AI_API_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction.
+  options: { dispatch: "sequential" },
 })

 describe("Gemini cache recorded", () => {

@@ -29,6 +29,9 @@ const recorded = recordedTests({
   provider: "openai",
   protocol: "openai-responses",
   requires: ["OPENAI_API_KEY"],
+  // Two identical requests in one cassette — match by recording order so the
+  // second call replays the cached-hit interaction, not the cold-miss one.
+  options: { dispatch: "sequential" },
 })

 describe("OpenAI Responses cache recorded", () => {