fix: add per-domain RequestThrottler for 429 backoff#1762
MrAliHasan wants to merge 2 commits into apify:master
Conversation
Add a new RequestThrottler component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that would keep track of the per-domain queues and their delays.
Do you want to try it?
    @staticmethod
    def _parse_retry_after_header(value: str | None) -> timedelta | None:
This has no business being in BasicCrawler. Better put it in the _utils module.
Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
    def _raise_for_session_blocked_status_code(
        self,
        session: Session | None,
        status_code: int,
Maybe this could just receive the entire HttpResponse object?
I considered this, but I kept the current signature. AbstractHttpCrawler passes Crawlee’s HttpResponse while PlaywrightCrawler passes Playwright’s native Response, which are different types. Since the method only needs status_code, url, and Retry-After, extracting those at the call site keeps the abstraction simpler. Happy to revisit if you’d prefer a unified response layer.
        # Check if this domain is currently rate-limited (429 backoff).
        if self._request_throttler.is_throttled(request.url):
            self._logger.debug(
                f'Request to {request.url} delayed - domain is rate-limited '
                f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
            )
            await request_manager.reclaim_request(request)
            return
If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.
I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.
Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled, eliminating busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), the AutoscaledPool scales UP instead of down, creating a "death spiral." This happens because:

1. 429 → SessionError → session retires → request retried
2. is_system_idle returns True
3. _autoscale() sees idle CPU → increases concurrency

The existing _snapshot_client only tracks Apify storage API rate limits, not target website 429s.

Solution
Following @Pijukatel's suggestion, I created a dedicated RequestThrottler component that handles 429 backoff per domain; the AutoscaledPool is completely untouched.

Key features:

- Per-domain tracking: rate limiting on example.com doesn't affect other-site.com
- Exponential backoff: 2s → 4s → 8s → ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests

How it works
1. BasicCrawler.__run_task_function checks RequestThrottler.is_throttled(url) before processing
2. When a 429 is detected in _raise_for_session_blocked_status_code, the domain is recorded
3. On a successful request (RequestState.DONE), the backoff counter resets

Files changed
- src/crawlee/_request_throttler.py
- src/crawlee/crawlers/_basic/_basic_crawler.py
- src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py
- src/crawlee/crawlers/_playwright/_playwright_crawler.py
- tests/unit/test_request_throttler.py

Tests
Future work
This is a focused first step toward the full RequestAnalyzer that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).