Description
Is your feature request related to a problem? Please describe.
Yes. The current max_tokens is hardcoded to 8192 (e.g., in the Claude model at src/google/adk/models/anthropic_llm.py:298). This causes several related problems:
- Truncated Responses: When responses exceed 8192 tokens, they get cut off mid-sentence, leaving users with incomplete answers
- Token Waste: When responses are shorter, we waste allocated tokens and increase costs unnecessarily
- No Adaptation: There's no way to dynamically adjust based on response quality, context window availability, or user needs
- Manual Workarounds Required: Developers must manually count tokens and switch prompt engineering techniques, which is error-prone and doesn't scale
Real-World Impact
Personal Experience: While building an agentic system that queries databases to provide recommendations, I encountered this issue frequently. The system would retrieve data and generate responses that sometimes exceeded 8192 tokens, causing critical information to be truncated.
The Manual Workaround I Had to Implement:
- Manually count tokens before sending requests
- Switch between different prompt engineering techniques based on estimated token count
- Use shorter, more concise prompts when approaching limits
- Sometimes split responses across multiple invocations
- Monitor token usage and adjust prompts reactively
This same issue also occurred when using OpenAI models in other projects - responses would sometimes generate more tokens than expected, requiring similar manual intervention.
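For illustration, here is a minimal sketch of the reactive prompt-switching part of that workaround; the threshold, prompt strings, and function name are illustrative assumptions, not ADK APIs:

```python
# Rough sketch of the manual workaround described above: monitor how close
# the previous response came to the fixed limit and switch to a more
# constrained prompt style for the next request. All names and the 90%
# threshold are illustrative; this is not ADK API.
FIXED_MAX_TOKENS = 8192


def choose_prompt(task: str, last_output_tokens: int) -> str:
    verbose = f"Provide a complete, detailed answer with full code for: {task}"
    concise = f"Answer concisely; include only essential code for: {task}"
    # If the previous response nearly hit (or hit) the limit, assume the next
    # one may be truncated and fall back to the terser prompt style.
    if last_output_tokens >= 0.9 * FIXED_MAX_TOKENS:
        return concise
    return verbose
```

This kind of reactive switching has to be re-implemented per project, which is exactly the overhead described above.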
The Problem This Creates:
- Inconsistent Quality: Different prompt styles based on token count lead to inconsistent user experience
- Development Overhead: Constant monitoring and manual adjustment is time-consuming
- Production Risk: Truncated responses in production can lead to incomplete or incorrect information being delivered to users
- Cost Inefficiency: Fixed limits waste tokens when responses are shorter, and fail when they're longer
Example Problem:
```python
agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),  # max_tokens=8192 fixed
    instruction="Generate complete code solutions",
)
# If code generation needs 10000 tokens, it gets truncated at 8192
```
Describe the solution you'd like
I propose adding an Adaptive Token Limit Management feature that automatically adjusts max_tokens based on response quality and context.
Proposed API:
```python
class AdaptiveTokenConfig(BaseModel):
    """Adaptive token limit configuration."""

    enabled: bool = False
    min_tokens: int = 512
    max_tokens: int = 8192
    initial_tokens: int = 2048
    increase_factor: float = 1.5  # Multiply if response truncated
    decrease_factor: float = 0.8  # Reduce if response much shorter
    truncation_detection: bool = True  # Auto-detect incomplete responses
```
Usage:
```python
from google.adk import Agent, App
from google.adk.models.anthropic_llm import Claude
from google.adk.agents.run_config import RunConfig, AdaptiveTokenConfig

agent = Agent(
    name="code_generator",
    model=Claude(model="claude-3-5-sonnet-v2@20241022"),
    instruction="Generate complete code solutions",
)

app = App(
    name="my_app",
    root_agent=agent,
)

# Use with adaptive tokens
run_config = RunConfig(
    adaptive_token_config=AdaptiveTokenConfig(
        enabled=True,
        initial_tokens=2048,
        max_tokens=16384,  # Allow up to 16k if needed
    )
)
```
Behavior:
- Start with initial_tokens (e.g., 2048) for the first invocation
- Detect truncation: Check if the response ends mid-sentence, has incomplete function calls, or hits the MAX_TOKENS finish reason
- Auto-increase: If truncated, multiply by increase_factor (e.g., 2048 → 3072 → 4608 → ...), as sketched after this list
- Auto-decrease: If the response uses < 50% of allocated tokens, reduce by decrease_factor for the next invocation
- Respect bounds: Never go below min_tokens or above max_tokens
- Session-aware: Track across invocations in the same session
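To make the behavior above concrete, here is a minimal sketch of the adjustment rule in terms of the proposed AdaptiveTokenConfig; the function name and the way the current limit would be tracked per session are illustrative assumptions, not an existing ADK API:

```python
# Sketch only (not ADK code): applies the increase/decrease rules above,
# using the AdaptiveTokenConfig proposed in "Proposed API".
def next_token_limit(
    config: "AdaptiveTokenConfig",
    current_limit: int,
    was_truncated: bool,
    tokens_used: int,
) -> int:
    """Return the max_tokens budget to use for the next invocation."""
    if was_truncated:
        # Hit the limit: grow the budget (e.g., 2048 -> 3072 -> 4608 ...).
        proposed = int(current_limit * config.increase_factor)
    elif tokens_used < 0.5 * current_limit:
        # Used less than half of the budget: shrink it for the next call.
        proposed = int(current_limit * config.decrease_factor)
    else:
        proposed = current_limit
    # Respect bounds: never below min_tokens, never above max_tokens.
    return max(config.min_tokens, min(proposed, config.max_tokens))
```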
Describe alternatives you've considered
- Manual Adjustment: Users could manually set max_tokens per invocation (see the sketch after this list), but this:
  - Requires deep knowledge of token usage patterns
  - Doesn't adapt to changing needs
  - Adds cognitive overhead
- Fixed Higher Limits: Simply increasing the default to 16384:
  - Wastes tokens when not needed
  - Doesn't solve truncation for very long responses
  - Increases costs unnecessarily
- External Tracking: Building token management outside ADK:
  - Loses integration with the execution flow
  - Harder to enforce at the right points
  - Duplicates functionality
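For context on the first alternative, the closest manual option today is pinning a fixed per-agent output cap through generate_content_config, sketched below; this assumes the underlying model backend honors max_output_tokens, which (per the hardcoded limit this issue describes) the Claude wrapper does not, and the value still cannot adapt between invocations:

```python
# Sketch of the "Manual Adjustment" alternative: a fixed per-agent cap via
# generate_content_config. Assumes the model backend honors
# max_output_tokens; the hardcoded Claude path described above does not,
# and the cap still cannot adapt from one invocation to the next.
from google.adk import Agent
from google.genai import types

agent = Agent(
    name="code_generator",
    model="gemini-2.0-flash",
    instruction="Generate complete code solutions",
    generate_content_config=types.GenerateContentConfig(
        max_output_tokens=16384,
    ),
)
```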
Additional context
Use Cases:
- Long-form content generation: Articles, reports that may exceed 8192 tokens
- Code generation: Complete functions/systems that need full responses
- Multi-step reasoning: Explanations that require full context
- Cost optimization: Reduce token waste in production deployments
Implementation Notes:
- Can be implemented as an extension to RunConfig
- Backward compatible (opt-in via enabled=True)
- Should integrate with the existing ContextCacheConfig for cost savings
- Can track truncation via FinishReason.MAX_TOKENS in the response (see the sketch below)
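As a rough illustration of the last note, truncation could be detected from the google.genai FinishReason attached to the model response; how ADK would surface that value is left open here, so the helper below just takes the finish reason directly:

```python
# Sketch only: FinishReason.MAX_TOKENS comes from google.genai; how ADK
# exposes the finish reason on its response object is an open design detail.
from typing import Optional

from google.genai import types


def is_truncated(finish_reason: Optional[types.FinishReason]) -> bool:
    """Treat a MAX_TOKENS finish reason as a truncated response."""
    return finish_reason == types.FinishReason.MAX_TOKENS
```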
Related Code:
- Current implementation: src/google/adk/models/anthropic_llm.py:298 (max_tokens: int = 8192)
- RunConfig: src/google/adk/agents/run_config.py
Priority:
High - This directly addresses a critical production issue that forces developers to implement manual workarounds. The problem affects:
- Production Reliability: Truncated responses can deliver incomplete or incorrect information to end users
- Developer Productivity: Manual token counting and prompt switching is time-consuming and error-prone
- Cost Optimization: Fixed limits waste tokens when responses are shorter, and fail when they're longer
- Scalability: Manual interventions don't scale as systems grow in complexity
This feature would eliminate the need for manual token management and prompt engineering workarounds, allowing developers to focus on building better agentic systems rather than managing token limits.