Stagehand uses advanced caching strategies to reduce latency and token costs. This includes prompt caching for repeated content and conversation history compression for long-running agents.

Overview

Caching strategies in Stagehand:
  • Prompt caching - Cache system prompts and static content
  • Image compression - Reduce token usage in conversation history
  • Conversation management - Maintain context while minimizing tokens
  • Provider-specific optimizations - Leverage native caching features

Prompt Caching

Anthropic Prompt Caching

Anthropic supports prompt caching via cache_control blocks. Stagehand applies this automatically to system prompts and accessibility trees. How it works:
// System prompt with caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // Cache this content
      },
    ],
  },
  // User messages...
];
Benefits:
  • System prompts are cached across requests
  • Reduces input token costs by ~90% for cached content
  • Cache persists for 5 minutes of inactivity
  • Particularly effective for accessibility trees
Accessibility Tree Caching

Location: Various act/extract implementations
const ariaTree = await page.getAriaTree();

const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: `Accessibility tree:\n${ariaTree}`,
        cache_control: { type: "ephemeral" }, // Cache the tree
      },
      {
        type: "text",
        text: instruction,
      },
    ],
  },
];
Token Savings Example:
// First request: 5000 input tokens (cache write)
// Subsequent requests with cache hit: billed as ~500 tokens (90% reduction)
// Cache reads are billed at ~10% of the uncached input rate

OpenAI Prompt Caching

OpenAI does not expose explicit cache-control markers; its prompt caching is applied automatically to repeated prompt prefixes. Stagehand therefore optimizes requests by (see the sketch after this list):
  • Reusing system prompts across calls
  • Minimizing message history
  • Structuring requests for potential future caching support
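
Because this caching keys on an exact prefix match, keeping the static system prompt byte-identical and placing it before any variable content maximizes hits. A minimal sketch using the same Responses API shown elsewhere in this doc (the prompt constant and model name are illustrative, not Stagehand's actual values):

import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical static prompt — must stay byte-identical across requests
// so the cached prefix matches exactly.
const STATIC_SYSTEM_PROMPT = "You are a browser automation agent...";

async function runStep(instruction: string, ariaTree: string) {
  return client.responses.create({
    model: "gpt-4o", // illustrative model choice
    input: [
      // Stable prefix first: eligible for automatic caching
      { role: "system", content: STATIC_SYSTEM_PROMPT },
      // Variable content last, so it never invalidates the cached prefix
      {
        role: "user",
        content: `Accessibility tree:\n${ariaTree}\n\n${instruction}`,
      },
    ],
  });
}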

Google Prompt Caching

Google’s caching is handled automatically by the model. Stagehand optimizes by (see Google Content Reuse below for a concrete example):
  • Structuring system instructions consistently
  • Reusing conversation history format
  • Minimizing changes to cached content

Image Compression

Anthropic Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts

Strategy:
  • Keep first 2 images in conversation at full quality
  • Compress all subsequent images to 25% quality
  • Reduces token usage while maintaining context
Implementation:
import sharp from "sharp";

// Async: sharp's JPEG encoding returns a Promise.
export async function compressConversationImages(
  items: ResponseInputItem[],
  keepFirstN = 2,
): Promise<void> {
  let imageCount = 0;

  for (const item of items) {
    if ("role" in item && item.role === "user") {
      const content = item.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "image") {
            imageCount++;
            if (imageCount > keepFirstN) {
              // Compress this image
              const base64Data = block.source.data;
              const buffer = Buffer.from(base64Data, "base64");
              const compressed = await sharp(buffer)
                .jpeg({ quality: 25 })
                .toBuffer();
              block.source.data = compressed.toString("base64");
            }
          }
        }
      }
    }
  }
}
Usage in CUA:
// In AnthropicCUAClient.ts
const nextInputItems: ResponseInputItem[] = [...inputItems];

// Compress images before adding new messages (the helper is async)
await compressConversationImages(nextInputItems);

nextInputItems.push(assistantMessage);
nextInputItems.push(userToolResultsMessage);
Token Savings:
// Full quality image: ~1500 tokens
// 25% quality image: ~400 tokens
// Savings: ~73% per compressed image

Google Image Compression

Location: packages/core/lib/v3/agent/utils/imageCompression.ts

Implementation:
// Lives in the same file as compressConversationImages above (shares the sharp import).
export async function compressGoogleConversationImages(
  items: Content[],
  keepFirstN = 2,
): Promise<{ items: Content[]; compressed: boolean }> {
  let imageCount = 0;
  let compressed = false;

  for (const item of items) {
    if (item.role === "user" && item.parts) {
      for (const part of item.parts) {
        if (part.inlineData?.mimeType === "image/png") {
          imageCount++;
          if (imageCount > keepFirstN) {
            // Compress to JPEG 25%
            const buffer = Buffer.from(part.inlineData.data, "base64");
            const compressedBuffer = await sharp(buffer)
              .jpeg({ quality: 25 })
              .toBuffer();
            part.inlineData.data = compressedBuffer.toString("base64");
            part.inlineData.mimeType = "image/jpeg";
            compressed = true;
          }
        }
      }
    }
  }

  return { items, compressed };
}
Usage:
// In GoogleCUAClient.ts:executeStep()
const compressedResult = await compressGoogleConversationImages(
  this.history,
  2, // Keep first 2 images
);
const compressedHistory = compressedResult.items;

const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig,
});

Conversation History Management

CUA Conversation History

All CUA clients maintain conversation history to preserve context.

Anthropic Pattern:
private async executeStep(
  inputItems: ResponseInputItem[],
  logger: (message: LogLine) => void,
): Promise<{ /* ... */ }> {
  // Get model response
  const result = await this.getAction(inputItems);
  
  // Build next input items
  const nextInputItems: ResponseInputItem[] = [...inputItems];
  
  // Compress images
  await compressConversationImages(nextInputItems);
  
  // Add assistant message
  nextInputItems.push(assistantMessage);
  
  // Add tool results
  if (toolResults.length > 0) {
    nextInputItems.push(userToolResultsMessage);
  }
  
  return { nextInputItems, /* ... */ };
}
Google Pattern:
private history: Content[] = [];

async executeStep(logger: (message: LogLine) => void) {
  // Compress history before request
  const compressedResult = await compressGoogleConversationImages(this.history, 2);
  const compressedHistory = compressedResult.items;
  
  // Get response
  const response = await this.client.models.generateContent({
    contents: compressedHistory,
    // ...
  });
  
  // Add to history
  this.history.push(sanitizedContent);
  
  if (functionResponses.length > 0) {
    this.history.push({
      role: "user",
      parts: functionResponses,
    });
  }
}
OpenAI Pattern:
private reasoningItems: Map<string, ResponseItem> = new Map();

async executeStep(
  inputItems: ResponseInputItem[],
  previousResponseId: string | undefined,
) {
  // Use previous_response_id for history
  const requestParams = {
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId,
  };
  
  const response = await this.client.responses.create(requestParams);
  
  // Track reasoning items
  for (const item of response.output) {
    if (item.type === "reasoning") {
      this.reasoningItems.set(item.id, item);
    }
  }
  
  return { responseId: response.id };
}

History Truncation Strategies

Keep recent messages:
function truncateHistory(
  history: ResponseInputItem[],
  maxMessages = 10,
): ResponseInputItem[] {
  // Always keep system message
  const systemMessages = history.filter((m) => "role" in m && m.role === "system");
  const otherMessages = history.filter((m) => !("role" in m) || m.role !== "system");
  
  // Keep last N messages
  const recentMessages = otherMessages.slice(-maxMessages);
  
  return [...systemMessages, ...recentMessages];
}
Token-based truncation:
function truncateByTokens(
  history: ResponseInputItem[],
  maxTokens = 100000,
): ResponseInputItem[] {
  const systemMessages = history.filter((m) => "role" in m && m.role === "system");
  const otherMessages = history
    .filter((m) => !("role" in m) || m.role !== "system")
    .reverse();
  
  let tokenCount = estimateTokens(systemMessages);
  const keptMessages: ResponseInputItem[] = [];
  
  for (const message of otherMessages) {
    const messageTokens = estimateTokens([message]);
    if (tokenCount + messageTokens > maxTokens) break;
    
    keptMessages.unshift(message);
    tokenCount += messageTokens;
  }
  
  return [...systemMessages, ...keptMessages];
}
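Both truncation sketches assume an estimateTokens helper that is not shown above (and is not part of Stagehand's public API); a minimal character-count heuristic, at roughly four characters per token, might look like:

// Hypothetical helper for the truncation examples above.
function estimateTokens(items: ResponseInputItem[]): number {
  // JSON length is a crude but serviceable proxy for prompt size.
  const chars = items.reduce(
    (sum, item) => sum + JSON.stringify(item).length,
    0,
  );
  return Math.ceil(chars / 4);
}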

Provider-Specific Optimizations

Anthropic Cache Control

// Mark content for caching
const messages = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: longSystemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
  },
];

// First request: Full token count
// Subsequent requests: Cache hit (10% cost)

Google Content Reuse

// Structure content consistently for better caching
this.generateContentConfig = {
  temperature: 1,
  topP: 0.95,
  topK: 40,
  maxOutputTokens: 8192,
  tools: [{
    computerUse: { environment: this.environment },
  }],
};

// Reuse config across requests
const response = await this.client.models.generateContent({
  model: this.modelName,
  contents: compressedHistory,
  config: this.generateContentConfig, // Consistent config
});

OpenAI Response Chaining

// Use previous_response_id to chain requests
let previousResponseId: string | undefined;

for (let step = 0; step < maxSteps; step++) {
  const response = await this.client.responses.create({
    model: this.modelName,
    input: inputItems,
    previous_response_id: previousResponseId, // Link to previous
  });
  
  previousResponseId = response.id;
}

Performance Monitoring

Track Token Usage

let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCachedTokens = 0;

while (!completed && currentStep < maxSteps) {
  const result = await this.executeStep(inputItems, logger);
  
  totalInputTokens += result.usage.input_tokens;
  totalOutputTokens += result.usage.output_tokens;
  
  if (result.usage.cached_input_tokens) {
    totalCachedTokens += result.usage.cached_input_tokens;
  }
  
  currentStep++;
}

console.log("Token usage:", {
  input: totalInputTokens,
  output: totalOutputTokens,
  cached: totalCachedTokens,
  savings: `${((totalCachedTokens / totalInputTokens) * 100).toFixed(1)}%`,
});

Log Compression Results

const before = estimateSize(inputItems);
await compressConversationImages(inputItems);
const after = estimateSize(inputItems);

logger({
  category: "caching",
  message: `Compressed images: ${before}KB → ${after}KB (${((1 - after / before) * 100).toFixed(1)}% reduction)`,
  level: 2,
});
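
The estimateSize helper here is likewise assumed rather than shown; one way to sketch it is to measure the serialized payload:

// Hypothetical helper: approximate payload size in kilobytes.
function estimateSize(items: ResponseInputItem[]): number {
  return Math.round(Buffer.byteLength(JSON.stringify(items), "utf8") / 1024);
}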

Best Practices

  1. Use prompt caching: Mark static content with cache_control
  2. Compress images: Keep first 2 at full quality, compress rest
  3. Truncate history: Don’t let the conversation history grow unbounded
  4. Monitor token usage: Track input/output/cached tokens
  5. Structure consistently: Consistent structure improves caching
  6. Batch operations: Fewer requests = better cache utilization
  7. Use appropriate models: Route repetitive, cache-friendly steps to faster, cheaper models

Cost Optimization

Example savings with caching:
// Without caching:
// 10 requests × 5000 input tokens = 50,000 tokens
// Cost: $0.15 (at $3/1M tokens)

// With prompt caching (4000 tokens cached):
// Request 1: 5000 input tokens = $0.015
// Requests 2-10: 1000 new + 4000 cached billed at 10% (400 effective) = 1400 effective tokens each
// Cost: $0.015 + (9 × $0.0042) = $0.053
// Savings: 65%
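
The same arithmetic as a small helper (the rate and the 10% cache-read multiplier mirror the example above; actual pricing varies by model):

const INPUT_RATE_PER_MTOK = 3.0; // $3 per 1M input tokens, as above
const CACHE_READ_MULTIPLIER = 0.1; // cache hits billed at ~10%

function requestCost(newTokens: number, cachedTokens: number): number {
  const effectiveTokens = newTokens + cachedTokens * CACHE_READ_MULTIPLIER;
  return (effectiveTokens / 1_000_000) * INPUT_RATE_PER_MTOK;
}

// 10 requests, 4000 of 5000 tokens served from cache after the first:
const total = requestCost(5000, 0) + 9 * requestCost(1000, 4000);
// => 0.015 + 9 × 0.0042 ≈ $0.053, vs $0.15 uncached (~65% savings)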
With image compression:
// Full quality: 10 images × 1500 tokens = 15,000 tokens
// Compressed: 2 full (3000 tokens) + 8 compressed (400 tokens each) = 6,200 tokens
// Savings: 59%

References

  • Image Compression: packages/core/lib/v3/agent/utils/imageCompression.ts
  • Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts:351
  • Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts:357
  • OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts:420