Summary
❗ Note: This is a starting point for discussion, not a final decision. The team should review and adjust before any implementation begins. ❗
Crawlee JS defines a structured exception hierarchy that controls retry behavior and crawler lifecycle, while Crawlee Python is missing several key exception types. Before implementing anything, this needs to be discussed with the whole team to agree on the target state — the final structure is not decided yet.
This issue is for discussion and planning, not immediate implementation.
Current State Comparison
JS Error Hierarchy (packages/core/src/errors.ts)
Error (native)
├── NonRetryableError # Never retried
│ └── CriticalError # Shuts down the crawler
│ ├── MissingRouteError # No route found — fatal
│ ├── ContextPipelineCleanupError # Cleanup failure — fatal
│ └── BrowserLaunchError # Browser launch failure — fatal
├── RetryRequestError # Always retried (overrides maxRequestRetries)
│ └── SessionError # Triggers session rotation
├── ContextPipelineInterruptedError
├── ContextPipelineInitializationError
├── RequestHandlerError
└── CookieParseError
Python Error Hierarchy (src/crawlee/errors.py)
Exception
├── UserDefinedErrorHandlerError
│ └── UserHandlerTimeoutError
├── SessionError # ✅ Parity
│ └── ProxyError # Python ahead (JS has no dedicated ProxyError)
├── ServiceConflictError
├── HttpStatusCodeError
│ └── HttpClientStatusCodeError
├── RequestHandlerError [Generic] # Python ahead (wraps with crawling context)
├── ContextPipelineInitializationError # ✅ Parity
├── ContextPipelineFinalizationError # ✅ Parity (named differently)
├── ContextPipelineInterruptedError # ✅ Parity
├── RequestCollisionError
└── AbortError (internal)
Gap Analysis
| Exception | JS | Python | Status |
|---|---|---|---|
| `RetryRequestError` | ✅ Always retried, overrides `maxRequestRetries` | ❌ Missing | Gap |
| `NonRetryableError` | ✅ Never retried | ❌ Missing | Gap |
| `CriticalError` | ✅ Shuts down crawler | ❌ Missing | Gap |
| `MissingRouteError` | ✅ Extends `CriticalError`, thrown by `Router` | ❌ Missing | Gap |
| `BrowserLaunchError` | ✅ Extends `CriticalError` | ❌ Missing | Gap |
| `CookieParseError` | ✅ Dedicated type | ❌ Missing | Gap |
| `SessionError` | ✅ Extends `RetryRequestError` | ✅ Standalone | Parity (different base) |
| `ProxyError` | ❌ Part of `SessionError` | ✅ Extends `SessionError` | Python ahead |
| `RequestHandlerError` | ✅ Simple wrapper | ✅ Generic with crawling context | Python ahead |
What's Missing and Why It Matters
1. RetryRequestError — Force unlimited retries
In JS, throwing RetryRequestError in a handler overrides maxRequestRetries and forces the request to be retried. Python has no equivalent — users cannot signal "keep retrying this request" from within a handler.
In JS, SessionError extends RetryRequestError, which means session errors are also always retried (with a separate maxSessionRotations limit). In Python, SessionError already has special handling, but there's no general-purpose "always retry" error.
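A minimal sketch of what this could look like on the Python side. All names below are hypothetical — `RetryRequestError` does not exist in Crawlee Python, and the `SessionError` re-parenting shown here is only the JS arrangement, not a decided design:

```python
# Hypothetical sketch: RetryRequestError forces a retry regardless of
# max_request_retries, and SessionError inherits that behavior (as in JS).
# None of these classes/signatures exist in Crawlee Python yet.


class RetryRequestError(Exception):
    """Always retried, overriding max_request_retries."""


class SessionError(RetryRequestError):
    """Blocked session; also always retried (subject to a separate
    max_session_rotations limit in the real crawler)."""


def should_retry(error: Exception, retry_count: int, max_request_retries: int) -> bool:
    """Decide whether a failed request goes back to the queue."""
    if isinstance(error, RetryRequestError):
        return True  # the handler explicitly asked to keep retrying
    return retry_count < max_request_retries


# A request that has already exhausted its normal retry budget:
assert should_retry(RetryRequestError(), retry_count=9, max_request_retries=3)
assert should_retry(SessionError(), retry_count=9, max_request_retries=3)
assert not should_retry(ValueError(), retry_count=3, max_request_retries=3)
```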
2. NonRetryableError — Skip retries entirely
In JS, throwing NonRetryableError marks the request as failed immediately without any retries. Python has no way for users to signal from a handler that an error should not be retried.
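The intended semantics can be sketched in a few lines (hypothetical names — `NonRetryableError` does not exist in Crawlee Python yet):

```python
# Hypothetical sketch: raising NonRetryableError from a handler fails the
# request immediately, with zero retries.


class NonRetryableError(Exception):
    """Fail the request immediately, without any retries."""


def remaining_retries(error: Exception, max_request_retries: int) -> int:
    """How many retries a request gets after this error."""
    return 0 if isinstance(error, NonRetryableError) else max_request_retries


assert remaining_retries(NonRetryableError('permanently gone'), 3) == 0
assert remaining_retries(TimeoutError(), 3) == 3
```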
3. CriticalError — Shut down the crawler
In JS, CriticalError extends NonRetryableError and causes the entire crawler to abort. This is used for unrecoverable situations (e.g., no route found, browser won't launch). Python has no equivalent — unrecoverable errors don't trigger a clean crawler shutdown.
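A rough sketch of the decision logic this would imply, assuming the JS inheritance is mirrored (neither class exists in Crawlee Python yet, and the outcome names are illustrative only):

```python
# Hypothetical sketch: CriticalError extends NonRetryableError, so it is
# never retried, and it additionally tells the crawler to shut down.


class NonRetryableError(Exception):
    """Never retried; the request fails, the crawl continues."""


class CriticalError(NonRetryableError):
    """Never retried and aborts the whole crawler run."""


def handle_failure(error: Exception) -> str:
    if isinstance(error, CriticalError):
        return 'abort_crawler'  # e.g. no route found, browser won't launch
    if isinstance(error, NonRetryableError):
        return 'fail_request'   # skip retries, keep crawling
    return 'retry'


assert handle_failure(CriticalError()) == 'abort_crawler'
assert handle_failure(NonRetryableError()) == 'fail_request'
assert handle_failure(RuntimeError()) == 'retry'
```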
4. MissingRouteError — Router fails loudly
In JS, if no route matches a request label and there's no default handler, a MissingRouteError (extending CriticalError) is thrown, shutting down the crawler immediately. This makes misconfigured routers fail fast and visibly. In Python, this situation is handled differently (no dedicated error type).
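A fail-fast router along these lines could be sketched as follows. The names mirror the JS behavior; neither `MissingRouteError` nor this `Router` shape exists in Crawlee Python:

```python
# Hypothetical sketch: when no handler matches the request label and no
# default handler is registered, the router raises MissingRouteError
# (a CriticalError) so the crawler aborts instead of silently skipping.


class CriticalError(Exception):
    """Shuts down the crawler."""


class MissingRouteError(CriticalError):
    """No matching route and no default handler."""


class Router:
    def __init__(self):
        self._handlers = {}  # label (or None for the default) -> handler

    def add(self, label, handler):
        self._handlers[label] = handler

    def resolve(self, label):
        handler = self._handlers.get(label, self._handlers.get(None))
        if handler is None:
            raise MissingRouteError(f'No route for label {label!r} and no default handler.')
        return handler


router = Router()
router.add('DETAIL', lambda: 'detail page')
assert router.resolve('DETAIL')() == 'detail page'

try:
    router.resolve('LISTING')  # no route, no default -> fatal
except MissingRouteError:
    pass  # misconfiguration surfaces immediately
```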
5. BrowserLaunchError / CookieParseError — Domain-specific errors
Lower priority, but useful for users to catch and handle specific failure modes.
Discussion Points
Before implementation, we need to agree on:
- Should we mirror the JS hierarchy exactly, or adapt it for Python idioms?
  - JS: `SessionError` extends `RetryRequestError` — should Python do the same, or keep `SessionError` standalone with special handling?
  - Python already has `ProxyError` extends `SessionError`, which JS lacks — do we keep this?
- Naming conventions — Python uses both `*Error` and `*Exception` in the standard library. Should we stick with `*Error` for consistency with JS?
- What about Python-specific exceptions we already have? `UserDefinedErrorHandlerError`, `HttpStatusCodeError`, `ServiceConflictError`, `RequestCollisionError` — these don't exist in JS. Should they stay as-is?
- Integration with `BasicCrawler` error handling logic:
  - Adding `RetryRequestError` and `NonRetryableError` requires changes to the retry logic in `BasicCrawler._handle_request_function()`.
  - `CriticalError` needs integration with `AutoscaledPool` to trigger shutdown.
- Crawlee JS v4 direction — The v4 branch has the same error hierarchy as v3. Should we wait for v4 to stabilize, or align with the current state?
Proposed Target Hierarchy (for discussion)
Exception
├── RetryRequestError # NEW: Always retried
│ └── SessionError # CHANGED: Re-parent under RetryRequestError
│ └── ProxyError # KEEP: Python-specific
├── NonRetryableError # NEW: Never retried
│ └── CriticalError # NEW: Shuts down crawler
│ ├── MissingRouteError # NEW: No route found
│ └── BrowserLaunchError # NEW: Browser launch failure
├── UserDefinedErrorHandlerError # KEEP
│ └── UserHandlerTimeoutError # KEEP
├── ServiceConflictError # KEEP
├── HttpStatusCodeError # KEEP
│ └── HttpClientStatusCodeError # KEEP
├── RequestHandlerError [Generic] # KEEP
├── ContextPipelineInitializationError # KEEP
├── ContextPipelineFinalizationError # KEEP
├── ContextPipelineInterruptedError # KEEP
├── RequestCollisionError # KEEP
├── CookieParseError # NEW
└── AbortError (internal) # KEEP