
fix: add per-domain RequestThrottler for 429 backoff #1762

Open

MrAliHasan wants to merge 2 commits into apify:master from MrAliHasan:fix/request-throttler-429-backoff

Conversation

@MrAliHasan

Fixes #1437

Problem

When target websites return HTTP 429 (Too Many Requests), the AutoscaledPool scales UP instead of down — creating a "death spiral." This happens because:

  1. 429 responses trigger SessionError → session retires → request retried
  2. Less CPU work during retries → is_system_idle returns True
  3. _autoscale() sees idle CPU → increases concurrency
  4. More concurrent requests → more 429s → repeat

The existing _snapshot_client only tracks Apify storage API rate limits, not target website 429s.

Solution

Following @Pijukatel's suggestion, I created a dedicated RequestThrottler component that handles 429 backoff per domain — the AutoscaledPool is completely untouched.

Key features:

  • Per-domain tracking — rate limiting on example.com doesn't affect other-site.com
  • Exponential backoff — 2s → 4s → 8s → ... capped at 60s
  • Retry-After header support — parses both integer seconds and HTTP-date formats
  • Throttled requests are reclaimed — they go back to the queue, not dropped
  • Backoff resets on success — consecutive 429 count resets when a request succeeds
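The schedule and reset behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual RequestThrottler code; the class and method names here are made up:

```python
from __future__ import annotations

from datetime import timedelta
from urllib.parse import urlparse


class DomainBackoff:
    """Illustrative per-domain 429 bookkeeping (hypothetical, not crawlee's API)."""

    BASE = timedelta(seconds=2)
    CAP = timedelta(seconds=60)

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}  # consecutive 429s per domain

    @staticmethod
    def _domain(url: str) -> str:
        return urlparse(url).hostname or url

    def record_429(self, url: str, retry_after: timedelta | None = None) -> timedelta:
        """Record a 429 for this URL's domain and return the delay to apply."""
        domain = self._domain(url)
        self._counts[domain] = self._counts.get(domain, 0) + 1
        # A server-provided Retry-After takes priority over the exponential schedule.
        if retry_after is not None:
            return min(retry_after, self.CAP)
        # 2s -> 4s -> 8s -> ... capped at 60s.
        return min(self.BASE * 2 ** (self._counts[domain] - 1), self.CAP)

    def record_success(self, url: str) -> None:
        # A successful request resets the consecutive-429 counter for its domain only.
        self._counts.pop(self._domain(url), None)
```

Note how domains are independent: a growing counter for example.com never influences the delay computed for other-site.com.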

How it works

  1. BasicCrawler.__run_task_function checks RequestThrottler.is_throttled(url) before processing
  2. If the domain is throttled, the request is reclaimed (returned to queue for later)
  3. When a 429 is detected in _raise_for_session_blocked_status_code, the domain is recorded
  4. On successful request (RequestState.DONE), the backoff counter resets

Files changed

  • src/crawlee/_request_throttler.py — NEW: per-domain 429 tracker
  • src/crawlee/crawlers/_basic/_basic_crawler.py — throttle check, 429 recording, success reset, Retry-After parsing
  • src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py — passes URL + Retry-After header
  • src/crawlee/crawlers/_playwright/_playwright_crawler.py — passes URL + Retry-After header
  • tests/unit/test_request_throttler.py — NEW: 13 unit tests

Tests

  • 13 new tests covering: domain independence, exponential backoff, max delay cap, Retry-After priority, success reset, expiry, edge cases
  • 8 existing autoscaling tests pass with zero regressions

Future work

This is a focused first step toward the full RequestAnalyzer that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).

Add a new RequestThrottler component that handles HTTP 429 (Too Many
Requests) responses on a per-domain basis, preventing the autoscaling
death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched - throttling happens
transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
@vdusek (Collaborator) commented Feb 23, 2026

Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.

@janbuchar (Collaborator) left a comment
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.

On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.

I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that keeps track of the per-domain queues and their delays.

Do you want to try it?
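The per-domain-queue idea suggested here could be sketched like this. It is an illustration of the scheduling concept only; this is not crawlee's RequestManager interface, and all names in it are hypothetical:

```python
from __future__ import annotations

import asyncio
import time
from urllib.parse import urlparse


class ThrottlingScheduler:
    """Keeps a FIFO per domain plus an 'earliest allowed' timestamp per domain."""

    def __init__(self) -> None:
        self._queues: dict[str, list[str]] = {}
        self._ready_at: dict[str, float] = {}

    @staticmethod
    def _domain(url: str) -> str:
        return urlparse(url).hostname or url

    def add(self, url: str) -> None:
        self._queues.setdefault(self._domain(url), []).append(url)

    def delay(self, domain: str, seconds: float) -> None:
        # Called after a 429, or to honor a robots.txt crawl-delay.
        self._ready_at[domain] = time.monotonic() + seconds

    async def fetch_next(self) -> str | None:
        """Return the next request, sleeping (not busy-waiting or re-enqueueing)
        while every non-empty domain is throttled."""
        while True:
            now = time.monotonic()
            ready = [d for d, q in self._queues.items() if q and self._ready_at.get(d, 0.0) <= now]
            if ready:
                return self._queues[ready[0]].pop(0)
            waits = [self._ready_at.get(d, now) - now for d, q in self._queues.items() if q]
            if not waits:
                return None  # every queue is drained
            await asyncio.sleep(max(min(waits), 0.0))
```

A real ThrottlingRequestManager would presumably back each domain with RequestQueue.open(alias=...) rather than in-memory lists; the point of the sketch is the control flow, where the consumer sleeps until the earliest domain becomes eligible instead of writing requests back to the queue.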

Comment on lines 1591 to 1592
@staticmethod
def _parse_retry_after_header(value: str | None) -> timedelta | None:
Collaborator:

This has no business being in BasicCrawler. Better put it in the _utils module.

Author:

Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
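A Retry-After parser along the lines discussed might look like the sketch below. It assumes RFC 9110 semantics for the header (delta-seconds or HTTP-date); the function name mirrors the one mentioned in the thread, but this is not necessarily the signature crawlee ended up with:

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after_header(value: str | None) -> timedelta | None:
    """Parse a Retry-After header into a delay, or None if absent/unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return timedelta(seconds=int(value))
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    # A date in the past means no further waiting is required.
    return max(target - datetime.now(timezone.utc), timedelta(0))
```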

def _raise_for_session_blocked_status_code(
self,
session: Session | None,
status_code: int,
Collaborator:

Maybe this could just receive the entire HttpResponse object?

Author:

I considered this, but I kept the current signature. AbstractHttpCrawler passes Crawlee’s HttpResponse while PlaywrightCrawler passes Playwright’s native Response, which are different types. Since the method only needs status_code, url, and Retry-After, extracting those at the call site keeps the abstraction simpler. Happy to revisit if you’d prefer a unified response layer.

Comment on lines 1401 to 1408
# Check if this domain is currently rate-limited (429 backoff).
if self._request_throttler.is_throttled(request.url):
self._logger.debug(
f'Request to {request.url} delayed - domain is rate-limited '
f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
)
await request_manager.reclaim_request(request)
return
Collaborator:

If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.

I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.

Author:

Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.

@MrAliHasan (Author):

Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
I’ll refactor this into a ThrottlingRequestManager implementation so that the throttling logic lives in the request scheduling layer rather than in BasicCrawler.
I’ll push an updated version soon. Appreciate the guidance.

Move per-domain throttling from execution layer (BasicCrawler.__run_task_function)
to scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all
  domains are throttled — eliminates busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt
  crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396


Development

Successfully merging this pull request may close these issues.

Fix autoscaled pool scaling behavior on 429 Too Many Requests

3 participants