fix: add per-domain RequestThrottler for 429 backoff#1762
MrAliHasan wants to merge 2 commits into apify:master
Conversation
Add a new RequestThrottler component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that would keep track of the per-domain queues and their delays.
Do you want to try it?
    @staticmethod
    def _parse_retry_after_header(value: str | None) -> timedelta | None:
This has no business being in BasicCrawler. Better put it in the _utils module.
Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
    def _raise_for_session_blocked_status_code(
        self,
        session: Session | None,
        status_code: int,
Maybe this could just receive the entire HttpResponse object?
I considered this, but I kept the current signature. AbstractHttpCrawler passes Crawlee’s HttpResponse while PlaywrightCrawler passes Playwright’s native Response, which are different types. Since the method only needs status_code, url, and Retry-After, extracting those at the call site keeps the abstraction simpler. Happy to revisit if you’d prefer a unified response layer.
        # Check if this domain is currently rate-limited (429 backoff).
        if self._request_throttler.is_throttled(request.url):
            self._logger.debug(
                f'Request to {request.url} delayed - domain is rate-limited '
                f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
            )
            await request_manager.reclaim_request(request)
            return
If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.
I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.
Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled, eliminating busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), the AutoscaledPool scales UP instead of down, creating a "death spiral." This happens because:

1. 429 → SessionError → session retires → request retried
2. is_system_idle returns True
3. _autoscale() sees idle CPU → increases concurrency

The existing _snapshot_client only tracks Apify storage API rate limits, not target website 429s.

Solution
Following @Pijukatel's suggestion, I created a dedicated RequestThrottler component that handles 429 backoff per domain; the AutoscaledPool is completely untouched.

Key features:

- Per-domain tracking: rate limiting on example.com doesn't affect other-site.com
- Exponential backoff: 2s → 4s → 8s → ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests

How it works
1. BasicCrawler.__run_task_function checks RequestThrottler.is_throttled(url) before processing
2. When a 429 is detected in _raise_for_session_blocked_status_code, the domain is recorded
3. On a successful request (RequestState.DONE), the backoff counter resets

Files changed
- src/crawlee/_request_throttler.py
- src/crawlee/crawlers/_basic/_basic_crawler.py
- src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py
- src/crawlee/crawlers/_playwright/_playwright_crawler.py
- tests/unit/test_request_throttler.py

Tests
Future work
This is a focused first step toward the full RequestAnalyzer that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).