Adaptive Token Limit Management #3828

@sarojrout

Description

Is your feature request related to a problem? Please describe.

Yes. The current max_tokens is hardcoded to 8192 (e.g., in the Claude model at src/google/adk/models/anthropic_llm.py:298). This causes several related problems:

  1. Truncated Responses: When responses exceed 8192 tokens, they get cut off mid-sentence, leaving users with incomplete answers
  2. Token Waste: When responses are shorter, we waste allocated tokens and increase costs unnecessarily
  3. No Adaptation: There's no way to dynamically adjust based on response quality, context window availability, or user needs
  4. Manual Workarounds Required: Developers must manually count tokens and switch prompt engineering techniques, which is error-prone and doesn't scale

Real-World Impact

Personal Experience: While building an agentic system that queries databases to provide recommendations, I encountered this issue frequently. The system would retrieve data and generate responses that sometimes exceeded 8192 tokens, causing critical information to be truncated.

The Manual Workaround I Had to Implement:

  • Manually count tokens before sending requests
  • Switch between different prompt engineering techniques based on estimated token count
  • Use shorter, more concise prompts when approaching limits
  • Sometimes split responses across multiple invocations
  • Monitor token usage and adjust prompts reactively

This same issue also occurred when using OpenAI models in other projects - responses would sometimes generate more tokens than expected, requiring similar manual intervention.

The Problem This Creates:

  • Inconsistent Quality: Different prompt styles based on token count lead to inconsistent user experience
  • Development Overhead: Constant monitoring and manual adjustment is time-consuming
  • Production Risk: Truncated responses in production can lead to incomplete or incorrect information being delivered to users
  • Cost Inefficiency: Fixed limits waste tokens when responses are shorter, and fail when they're longer

Example Problem:

from google.adk import Agent
from google.adk.models.anthropic_llm import Claude

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),  # max_tokens=8192 fixed
    instruction="Generate complete code solutions",
)
# If code generation needs 10000 tokens, it gets truncated at 8192

Describe the solution you'd like

Add an Adaptive Token Limit Management feature that automatically adjusts max_tokens based on response quality and context.

Proposed API:

from pydantic import BaseModel


class AdaptiveTokenConfig(BaseModel):
    """Adaptive token limit configuration."""
    enabled: bool = False
    min_tokens: int = 512
    max_tokens: int = 8192
    initial_tokens: int = 2048
    increase_factor: float = 1.5  # Multiply the limit if the response was truncated
    decrease_factor: float = 0.8  # Shrink the limit if the response used far fewer tokens
    truncation_detection: bool = True  # Auto-detect incomplete responses

Usage:

from google.adk import Agent, App
from google.adk.models.anthropic_llm import Claude
from google.adk.agents.run_config import RunConfig, AdaptiveTokenConfig

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),
    instruction="Generate complete code solutions"
)

app = App(
    name="my_app",
    root_agent=agent
)

# Use with adaptive tokens
run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(
        enabled=True,
        initial_tokens=2048,
        max_tokens=16384  # Allow up to 16k if needed
    )
)

Behavior:

  1. Start with initial_tokens (e.g., 2048) for the first invocation
  2. Detect truncation: Check whether the response ends mid-sentence, has incomplete function calls, or hits the MAX_TOKENS finish reason
  3. Auto-increase: If truncated, multiply by increase_factor (e.g., 2048 → 3072 → 4608 → ...)
  4. Auto-decrease: If the response uses < 50% of the allocated tokens, reduce by decrease_factor for the next invocation
  5. Respect bounds: Never go below min_tokens or above max_tokens
  6. Session-aware: Track the limit across invocations in the same session (see the sketch below)
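
A minimal sketch of this adjustment rule, reusing the AdaptiveTokenConfig fields from the proposed API above; the helper name and the was_truncated / tokens_used inputs are hypothetical and exist only to illustrate how the factors and bounds would interact:

def next_max_tokens(
    config: AdaptiveTokenConfig,
    current_limit: int,
    was_truncated: bool,
    tokens_used: int,
) -> int:
    """Compute the max_tokens budget for the next invocation."""
    if was_truncated:
        # Response hit the limit: grow the budget for the next attempt.
        proposed = int(current_limit * config.increase_factor)
    elif tokens_used < current_limit * 0.5:
        # Response used less than half the budget: shrink it to cut waste.
        proposed = int(current_limit * config.decrease_factor)
    else:
        proposed = current_limit
    # Respect the configured bounds.
    return max(config.min_tokens, min(config.max_tokens, proposed))

Starting from initial_tokens=2048, a truncated response would raise the limit to 3072 and then 4608, while a run of short responses would walk it back down toward min_tokens.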

Describe alternatives you've considered

  1. Manual Adjustment: Users could manually set max_tokens per invocation, but:

    • Requires deep knowledge of token usage patterns
    • Doesn't adapt to changing needs
    • Adds cognitive overhead
  2. Fixed Higher Limits: Simply increasing the default to 16384:

    • Wastes tokens when not needed
    • Doesn't solve truncation for very long responses
    • Increases costs unnecessarily
  3. External Tracking: Building token management outside ADK:

    • Loses integration with execution flow
    • Harder to enforce at the right points
    • Duplicates functionality

Additional context

Use Cases:

  • Long-form content generation: Articles, reports that may exceed 8192 tokens
  • Code generation: Complete functions/systems that need full responses
  • Multi-step reasoning: Explanations that require full context
  • Cost optimization: Reduce token waste in production deployments

Implementation Notes:

  • Can be implemented as an extension to RunConfig
  • Backward compatible (opt-in via enabled=True)
  • Should integrate with the existing ContextCacheConfig for cost savings
  • Can track truncation via FinishReason.MAX_TOKENS in the response (a sketch follows below)
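
A hedged sketch of that truncation check: FinishReason.MAX_TOKENS comes from the google.genai types that ADK already uses, but how the finish reason is surfaced on the model response is an assumption here, so the helper takes only the values it needs:

from typing import Optional

from google.genai import types


def looks_truncated(finish_reason: Optional[types.FinishReason], text: str) -> bool:
    """Heuristic check for an incomplete response."""
    if finish_reason == types.FinishReason.MAX_TOKENS:
        return True
    stripped = text.rstrip()
    # Fallback heuristic when no finish reason is available: a response that
    # stops without terminal punctuation is likely cut off mid-sentence.
    return bool(stripped) and stripped[-1] not in ".!?\"')]}"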

Related Code:

  • Current implementation: src/google/adk/models/anthropic_llm.py:298 (max_tokens: int = 8192)
  • RunConfig: src/google/adk/agents/run_config.py

Priority:

High - This directly addresses a critical production issue that forces developers to implement manual workarounds. The problem affects:

  • Production Reliability: Truncated responses can deliver incomplete or incorrect information to end users
  • Developer Productivity: Manual token counting and prompt switching is time-consuming and error-prone
  • Cost Optimization: Fixed limits waste tokens when responses are shorter, and fail when they're longer
  • Scalability: Manual interventions don't scale as systems grow in complexity

This feature would eliminate the need for manual token management and prompt engineering workarounds, allowing developers to focus on building better agentic systems rather than managing token limits.
