Python: Add auto_retry.py sample for rate limit handling #4223

Draft
Copilot wants to merge 2 commits into main from copilot/add-rate-limiting-handling

Conversation

Contributor

Copilot AI commented Feb 24, 2026

Motivation and Context

Every model inference API is rate-limited. Without built-in retry support in the framework, every consumer must write the same boilerplate. This sample shows two clean, copy-paste-ready patterns using tenacity so developers can focus on agent logic instead.

Description

Adds python/samples/02-agents/auto_retry.py demonstrating two approaches to automatic retry on RateLimitError (HTTP 429):

Approach 1 – Client subclass
Override get_response() in a subclass of any chat client. Non-streaming calls are wrapped in a tenacity.AsyncRetrying loop so each retry re-issues the full HTTP request (not a re-await of a stale coroutine). Streaming falls back to the parent with a note that streaming retry requires more delicate handling.

```python
import logging

from agent_framework.azure import AzureOpenAIChatClient
from openai import RateLimitError
from tenacity import (
    AsyncRetrying,
    before_sleep_log,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger(__name__)


class AzureOpenAIChatClientWithRetry(AzureOpenAIChatClient):
    retry_attempts: int = 3

    def get_response(self, *args, **kwargs):
        # Streaming needs more delicate handling; defer to the parent as-is.
        if kwargs.get("stream", False):
            return super().get_response(*args, **kwargs)

        async def _with_retry():
            async for attempt in AsyncRetrying(
                stop=stop_after_attempt(self.retry_attempts),
                wait=wait_exponential(multiplier=1, min=4, max=10),
                retry=retry_if_exception_type(RateLimitError),
                reraise=True,
                before_sleep=before_sleep_log(logger, logging.WARNING),
            ):
                with attempt:
                    # Explicit super() form: the zero-argument form does not
                    # work inside a nested function. Each pass re-issues the
                    # full request via a fresh coroutine.
                    return await super(AzureOpenAIChatClientWithRetry, self).get_response(*args, **kwargs)

        return _with_retry()
```
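The control flow above can also be illustrated without the framework or tenacity. The sketch below is a hypothetical, dependency-free version (`RateLimitError` here is a local stand-in for `openai.RateLimitError`); its key point matches the sample: every attempt calls a factory to obtain a fresh coroutine instead of re-awaiting a stale one.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Local stand-in for openai.RateLimitError (HTTP 429)."""


async def retry_on_rate_limit(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    min_wait: float = 4.0,
    max_wait: float = 10.0,
) -> T:
    """Retry a coroutine factory on RateLimitError with capped exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            # Calling the factory creates a *new* coroutine each time,
            # so every retry re-issues the full request.
            return await make_call()
        except RateLimitError:
            if attempt == attempts:
                raise  # out of attempts: surface the 429 to the caller
            wait = max(min_wait, min(2.0 ** attempt, max_wait))
            await asyncio.sleep(wait)
    raise AssertionError("unreachable")


# Demo: a fake call that returns 429 twice, then succeeds.
calls = {"n": 0}


async def flaky_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"
```

Driving it with `asyncio.run(retry_on_rate_limit(flaky_call))` succeeds on the third attempt after two simulated 429s.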

Approach 2 – Chat middleware (two styles)

Class-based: `RateLimitRetryMiddleware(ChatMiddleware)` wraps `await call_next()` in `AsyncRetrying`.
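A sketch of the class-based style is below. `ChatMiddleware`, `ChatContext`, and `RateLimitError` are defined as minimal local stand-ins so it runs in isolation; the real hook name and signature come from agent_framework and may differ, and a plain loop stands in for tenacity's `AsyncRetrying`.

```python
import asyncio
from typing import Awaitable, Callable


class RateLimitError(Exception):
    """Local stand-in for openai.RateLimitError (HTTP 429)."""


class ChatContext:
    """Minimal stand-in for agent_framework's ChatContext."""


class ChatMiddleware:
    """Minimal stand-in for agent_framework's ChatMiddleware base class."""

    async def process(self, context: ChatContext, call_next: Callable[[], Awaitable[None]]) -> None:
        await call_next()


class RateLimitRetryMiddleware(ChatMiddleware):
    """Retries call_next() on RateLimitError with capped exponential backoff."""

    def __init__(self, attempts: int = 3, min_wait: float = 4.0, max_wait: float = 10.0) -> None:
        self.attempts = attempts
        self.min_wait = min_wait
        self.max_wait = max_wait

    async def process(self, context: ChatContext, call_next: Callable[[], Awaitable[None]]) -> None:
        for attempt in range(1, self.attempts + 1):
            try:
                await call_next()  # re-invokes the downstream pipeline each attempt
                return
            except RateLimitError:
                if attempt == self.attempts:
                    raise  # out of attempts: surface the 429 to the caller
                wait = max(self.min_wait, min(2.0 ** attempt, self.max_wait))
                await asyncio.sleep(wait)
```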

Function-based: a `@chat_middleware`-decorated function with an inner async function decorated by `@retry`, which calls `call_next()`:

```python
@chat_middleware
async def rate_limit_retry_middleware(context: ChatContext, call_next: Callable[[], Awaitable[None]]) -> None:
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(...),
           retry=retry_if_exception_type(RateLimitError), reraise=True)
    async def _call_next_with_retry() -> None:
        await call_next()

    await _call_next_with_retry()
```
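For intuition about the `wait_exponential(multiplier=1, min=4, max=10)` setting used throughout: per tenacity's documented example it waits roughly `multiplier * 2**n` seconds before the n-th retry, clamped into `[min, max]`. The small reimplementation below is an illustration of that schedule, not tenacity itself.

```python
def backoff(attempt: int, multiplier: float = 1.0, min_wait: float = 4.0, max_wait: float = 10.0) -> float:
    """Clamped exponential backoff: multiplier * 2**attempt, kept within [min_wait, max_wait]."""
    return max(min_wait, min(multiplier * 2.0 ** attempt, max_wait))


# The first few waits ramp from the floor to the ceiling:
print([backoff(n) for n in range(1, 5)])  # [4.0, 4.0, 8.0, 10.0]
```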

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.
Original prompt

This section details the original issue you should resolve

<issue_title>Python: Add typical rate limiting handling for model clients</issue_title>
<issue_description>Every model inference API is rate limited these days, so any practical use of the method AzureOpenAIChatClient.create_agent needs to deal with retry logic, leading to boilerplate code.

I would like Agent Framework to solve this typical pattern, helping the developer focus on the rest. A simple implementation using the tenacity library would be as follows (but streaming needs more delicate handling):

```python
class AzureOpenAIChatClientWithRetry(AzureOpenAIChatClient):
    """Azure OpenAI Chat Client with built-in retry logic for handling rate limits."""

    retry_attempts = 3
    """Number of retry attempts for rate limit errors."""

    @staticmethod
    def _before_sleep_log(retry_state: RetryCallState) -> None:
        """Log when rate limiting is reached and retry is about to sleep."""
        attempt_number = retry_state.attempt_number
        wait_time = retry_state.next_action.sleep if retry_state.next_action else 0
        logger.warning(
            "Rate limiting reached. Attempt %d failed. Retrying in %.2f seconds...",
            attempt_number,
            wait_time,
        )

    @override
    @retry(
        stop=stop_after_attempt(retry_attempts),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        retry=retry_if_exception_type(RateLimitError),
        reraise=True,
        before_sleep=_before_sleep_log
    )
    def get_response(self, *args, **kwargs):
        """Get response with retry on rate limit errors (429 status code only)."""
        return super().get_response(*args, **kwargs)
```
</issue_description>

<agent_instructions>Create a new sample in python/samples/02-agents called `auto_retry.py` and build a retry setup in two ways: 1) the way outlined in this issue, with a class wrapper that can take any ChatClient and wraps `get_response` with the `@retry` decorator, and 2) with a chat middleware that wraps the call to `call_next`.</agent_instructions>

## Comments on the Issue (you are @copilot in this section)

<comments>
<comment_new><author>@eavanvalkenburg</author><body>
Given how simple this is to write, I don't think we need to do this uniformly, would you be willing to create a sample out of this @santiagxf ?</body></comment_new>
<comment_new><author>@eavanvalkenburg</author><body>
The reason we do not want to include this directly in is because there is no single standard (even though tenacity does come close) and so we won't want to get into a situation that means we have to support a number of variants, while the base implementation with any of them is quite simple.</body></comment_new>
</comments>


Co-authored-by: eavanvalkenburg <13749212+eavanvalkenburg@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add rate limiting handling for model clients" to "Python: Add auto_retry.py sample for rate limit handling" on Feb 24, 2026.


Development

Successfully merging this pull request may close these issues.

Python: Add typical rate limiting handling for model clients
