Description
Is your feature request related to a problem? Please describe.
Yes. The current max_tokens is hardcoded to 8192 (e.g., in the Claude model at src/google/adk/models/anthropic_llm.py:298). This causes several related problems:
- Truncated Responses: When responses exceed 8192 tokens, they get cut off mid-sentence, leaving users with incomplete answers
- Token Waste: When responses are shorter, we waste allocated tokens and increase costs unnecessarily
- No Adaptation: There's no way to dynamically adjust based on response quality, context window availability, or user needs
- Manual Workarounds Required: Developers must manually count tokens and switch prompt engineering techniques, which is error-prone and doesn't scale
Real-World Impact
Personal Experience: While building an agentic system that queries databases to provide recommendations, I encountered this issue frequently. The system would retrieve data and generate responses that sometimes exceeded 8192 tokens, causing critical information to be truncated.
The Manual Workaround I Had to Implement:
- Manually count tokens before sending requests
- Switch between different prompt engineering techniques based on estimated token count
- Use shorter, more concise prompts when approaching limits
- Sometimes split responses across multiple invocations
- Monitor token usage and adjust prompts reactively
This same issue also occurred when using OpenAI models in other projects - responses would sometimes generate more tokens than expected, requiring similar manual intervention.
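For illustration, here is a minimal sketch of the reactive prompt-switching part of that workaround; the threshold, prompt strings, and function name are illustrative assumptions, not ADK APIs:

```python
# Rough sketch of the manual workaround described above: monitor how close
# the previous response came to the fixed limit and switch to a more
# constrained prompt style for the next request. All names and the 90%
# threshold are illustrative; this is not ADK API.
FIXED_MAX_TOKENS = 8192


def choose_prompt(task: str, last_output_tokens: int) -> str:
    verbose = f"Provide a complete, detailed answer with full code for: {task}"
    concise = f"Answer concisely; include only essential code for: {task}"
    # If the previous response nearly hit (or hit) the limit, assume the next
    # one may be truncated and fall back to the terser prompt style.
    if last_output_tokens >= 0.9 * FIXED_MAX_TOKENS:
        return concise
    return verbose
```

This kind of reactive switching has to be re-implemented per project, which is exactly the overhead described above.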
The Problem This Creates:
- Inconsistent Quality: Different prompt styles based on token count lead to inconsistent user experience
- Development Overhead: Constant monitoring and manual adjustment is time-consuming
- Production Risk: Truncated responses in production can lead to incomplete or incorrect information being delivered to users
- Cost Inefficiency: Fixed limits waste tokens when responses are shorter, and fail when they're longer
Example Problem:
```python
agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),  # max_tokens=8192 fixed
    instruction="Generate complete code solutions",
)
# If code generation needs 10000 tokens, it gets truncated at 8192
```
Describe the solution you'd like
I propose adding an Adaptive Token Limit Management feature that automatically adjusts max_tokens based on response quality and context.
Proposed API:
```python
class AdaptiveTokenConfig(BaseModel):
    """Adaptive token limit configuration."""

    enabled: bool = False
    min_tokens: int = 512
    max_tokens: int = 8192
    initial_tokens: int = 2048
    increase_factor: float = 1.5  # Multiply if response truncated
    decrease_factor: float = 0.8  # Reduce if response much shorter
    truncation_detection: bool = True  # Auto-detect incomplete responses
```
Usage:
```python
from google.adk import Agent, App
from google.adk.models.anthropic_llm import Claude
from google.adk.agents.run_config import RunConfig, AdaptiveTokenConfig

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),
    instruction="Generate complete code solutions",
)

app = App(
    name="my_app",
    root_agent=agent,
)

# Use with adaptive tokens
run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(
        enabled=True,
        initial_tokens=2048,
        max_tokens=16384,  # Allow up to 16k if needed
    )
)
```
Behavior:
- Start with initial_tokens (e.g., 2048) for the first invocation
- Detect truncation: Check if the response ends mid-sentence, has incomplete function calls, or hits the MAX_TOKENS finish reason
- Auto-increase: If truncated, multiply by increase_factor (e.g., 2048 → 3072 → 4608 → ...), as sketched after this list
- Auto-decrease: If the response uses < 50% of allocated tokens, reduce by decrease_factor for the next invocation
- Respect bounds: Never go below min_tokens or above max_tokens
- Session-aware: Track across invocations in the same session
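To make the behavior above concrete, here is a minimal sketch of the adjustment rule in terms of the proposed AdaptiveTokenConfig; the function name and the way the current limit would be tracked per session are illustrative assumptions, not an existing ADK API:

```python
# Sketch only (not ADK code): applies the increase/decrease rules above,
# using the AdaptiveTokenConfig proposed in "Proposed API".
def next_token_limit(
    config: "AdaptiveTokenConfig",
    current_limit: int,
    was_truncated: bool,
    tokens_used: int,
) -> int:
    """Return the max_tokens budget to use for the next invocation."""
    if was_truncated:
        # Hit the limit: grow the budget (e.g., 2048 -> 3072 -> 4608 ...).
        proposed = int(current_limit * config.increase_factor)
    elif tokens_used < 0.5 * current_limit:
        # Used less than half of the budget: shrink it for the next call.
        proposed = int(current_limit * config.decrease_factor)
    else:
        proposed = current_limit
    # Respect bounds: never below min_tokens, never above max_tokens.
    return max(config.min_tokens, min(proposed, config.max_tokens))
```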
Describe alternatives you've considered
- Manual Adjustment: Users could manually set max_tokens per invocation (see the sketch after this list), but this:
  - Requires deep knowledge of token usage patterns
  - Doesn't adapt to changing needs
  - Adds cognitive overhead
- Fixed Higher Limits: Simply increasing the default to 16384:
  - Wastes tokens when not needed
  - Doesn't solve truncation for very long responses
  - Increases costs unnecessarily
- External Tracking: Building token management outside ADK:
  - Loses integration with the execution flow
  - Harder to enforce at the right points
  - Duplicates functionality
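For context on the first alternative, the closest manual option today is pinning a fixed per-agent output cap through generate_content_config, sketched below; this assumes the underlying model backend honors max_output_tokens, which (per the hardcoded limit this issue describes) the Claude wrapper does not, and the value still cannot adapt between invocations:

```python
# Sketch of the "Manual Adjustment" alternative: a fixed per-agent cap via
# generate_content_config. Assumes the model backend honors
# max_output_tokens; the hardcoded Claude path described above does not,
# and the cap still cannot adapt from one invocation to the next.
from google.adk import Agent
from google.genai import types

agent = Agent(
    name="code_generator",
    model="gemini-2.0-flash",
    instruction="Generate complete code solutions",
    generate_content_config=types.GenerateContentConfig(
        max_output_tokens=16384,
    ),
)
```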
Additional context
Use Cases:
- Long-form content generation: Articles, reports that may exceed 8192 tokens
- Code generation: Complete functions/systems that need full responses
- Multi-step reasoning: Explanations that require full context
- Cost optimization: Reduce token waste in production deployments
Implementation Notes:
- Can be implemented as an extension to RunConfig
- Backward compatible (opt-in via enabled=True)
- Should integrate with the existing ContextCacheConfig for cost savings
- Can track truncation via FinishReason.MAX_TOKENS in the response (see the sketch below)
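As a rough illustration of the last note, truncation could be detected from the google.genai FinishReason attached to the model response; how ADK would surface that value is left open here, so the helper below just takes the finish reason directly:

```python
# Sketch only: FinishReason.MAX_TOKENS comes from google.genai; how ADK
# exposes the finish reason on its response object is an open design detail.
from typing import Optional

from google.genai import types


def is_truncated(finish_reason: Optional[types.FinishReason]) -> bool:
    """Treat a MAX_TOKENS finish reason as a truncated response."""
    return finish_reason == types.FinishReason.MAX_TOKENS
```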
Related Code:
- Current implementation: src/google/adk/models/anthropic_llm.py:298 (max_tokens: int = 8192)
- RunConfig: src/google/adk/agents/run_config.py
Priority:
High - This directly addresses a critical production issue that forces developers to implement manual workarounds. The problem affects:
- Production Reliability: Truncated responses can deliver incomplete or incorrect information to end users
- Developer Productivity: Manual token counting and prompt switching is time-consuming and error-prone
- Cost Optimization: Fixed limits waste tokens when responses are shorter, and fail when they're longer
- Scalability: Manual interventions don't scale as systems grow in complexity
This feature would eliminate the need for manual token management and prompt engineering workarounds, allowing developers to focus on building better agentic systems rather than managing token limits.