
Align exception hierarchy with Crawlee JS #1739

@vdusek

Description

Summary

❗ Note: This is a starting point for discussion, not a final decision. The team should review and adjust before any implementation begins.

Crawlee JS defines a structured exception hierarchy that controls retry behavior and crawler lifecycle, while Crawlee Python is missing several key exception types. Before implementing anything, this needs to be discussed with the whole team to agree on the target state — the final structure is not decided yet.

This issue is for discussion and planning, not immediate implementation.

Current State Comparison

JS Error Hierarchy (packages/core/src/errors.ts)

```
Error (native)
├── NonRetryableError                      # Never retried
│   └── CriticalError                      # Shuts down the crawler
│       ├── MissingRouteError              # No route found — fatal
│       ├── ContextPipelineCleanupError    # Cleanup failure — fatal
│       └── BrowserLaunchError             # Browser launch failure — fatal
├── RetryRequestError                      # Always retried (overrides maxRequestRetries)
│   └── SessionError                       # Triggers session rotation
├── ContextPipelineInterruptedError
├── ContextPipelineInitializationError
├── RequestHandlerError
└── CookieParseError
```

Python Error Hierarchy (src/crawlee/errors.py)

```
Exception
├── UserDefinedErrorHandlerError
│   └── UserHandlerTimeoutError
├── SessionError                            # ✅ Parity
│   └── ProxyError                          # Python ahead (JS has no dedicated ProxyError)
├── ServiceConflictError
├── HttpStatusCodeError
│   └── HttpClientStatusCodeError
├── RequestHandlerError [Generic]           # Python ahead (wraps with crawling context)
├── ContextPipelineInitializationError      # ✅ Parity
├── ContextPipelineFinalizationError        # ✅ Parity (named differently)
├── ContextPipelineInterruptedError         # ✅ Parity
├── RequestCollisionError
└── AbortError (internal)
```

Gap Analysis

| Exception | JS | Python | Status |
| --- | --- | --- | --- |
| `RetryRequestError` | ✅ Always retried, overrides `maxRequestRetries` | ❌ Missing | Gap |
| `NonRetryableError` | ✅ Never retried | ❌ Missing | Gap |
| `CriticalError` | ✅ Shuts down crawler | ❌ Missing | Gap |
| `MissingRouteError` | ✅ Extends `CriticalError`, thrown by `Router` | ❌ Missing | Gap |
| `BrowserLaunchError` | ✅ Extends `CriticalError` | ❌ Missing | Gap |
| `CookieParseError` | ✅ Dedicated type | ❌ Missing | Gap |
| `SessionError` | ✅ Extends `RetryRequestError` | ✅ Standalone | Parity (different base) |
| `ProxyError` | ❌ Part of `SessionError` | ✅ Extends `SessionError` | Python ahead |
| `RequestHandlerError` | ✅ Simple wrapper | ✅ Generic with crawling context | Python ahead |

What's Missing and Why It Matters

1. RetryRequestError — Force unlimited retries

In JS, throwing RetryRequestError in a handler overrides maxRequestRetries and forces the request to be retried. Python has no equivalent — users cannot signal "keep retrying this request" from within a handler.

In JS, SessionError extends RetryRequestError, which means session errors are also always retried (with a separate maxSessionRotations limit). In Python, SessionError already has special handling, but there's no general-purpose "always retry" error.
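A minimal sketch of what this could look like in Python. All names here are hypothetical (Crawlee Python has no `RetryRequestError` today), and `should_retry` is an illustrative stand-in for the crawler's internal retry decision, not an existing API:

```python
class RetryRequestError(Exception):
    """Hypothetical: always retried, regardless of max_request_retries."""


class SessionError(RetryRequestError):
    """Hypothetical re-parenting: session errors become always-retried,
    bounded by a separate max_session_rotations limit instead."""


def should_retry(error: Exception, retry_count: int, max_request_retries: int) -> bool:
    """Sketch of the retry decision with the new error type in place."""
    if isinstance(error, RetryRequestError):
        return True  # bypasses the normal retry budget entirely
    return retry_count < max_request_retries
```

With this in place, a handler could raise `RetryRequestError` to keep a request alive past the configured retry limit, mirroring the JS behavior.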

2. NonRetryableError — Skip retries entirely

In JS, throwing NonRetryableError marks the request as failed immediately without any retries. Python has no way for users to signal from a handler that an error should not be retried.
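A sketch of the intended user-facing usage, with all names hypothetical and the handler reduced to a plain function for illustration:

```python
class NonRetryableError(Exception):
    """Hypothetical: the request fails immediately, without any retries."""


def handle(status_code: int) -> str:
    # Hypothetical handler fragment: a 404 is permanent, so signal
    # "do not retry" instead of burning through max_request_retries.
    if status_code == 404:
        raise NonRetryableError('Page does not exist; do not retry.')
    return 'processed'
```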

3. CriticalError — Shut down the crawler

In JS, CriticalError extends NonRetryableError and causes the entire crawler to abort. This is used for unrecoverable situations (e.g., no route found, browser won't launch). Python has no equivalent — unrecoverable errors don't trigger a clean crawler shutdown.
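The relationship could be sketched as follows (hypothetical names; `crawler_should_abort` stands in for whatever check the crawler loop would actually run):

```python
class NonRetryableError(Exception):
    """Hypothetical: never retried."""


class CriticalError(NonRetryableError):
    """Hypothetical: aborts the whole crawler run, not just one request."""


def crawler_should_abort(error: Exception) -> bool:
    # Sketch: a CriticalError is also non-retryable (by inheritance),
    # but additionally tells the crawler to shut down cleanly.
    return isinstance(error, CriticalError)
```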

4. MissingRouteError — Router fails loudly

In JS, if no route matches a request label and there's no default handler, a MissingRouteError (extending CriticalError) is thrown, shutting down the crawler immediately. This makes misconfigured routers fail fast and visibly. In Python, this situation is handled differently (no dedicated error type).

5. BrowserLaunchError / CookieParseError — Domain-specific errors

Lower priority, but useful for users to catch and handle specific failure modes.
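For example, a dedicated `CookieParseError` would let users catch exactly this failure mode rather than a generic `ValueError`. A toy sketch (the parser and error class are both hypothetical):

```python
class CookieParseError(Exception):
    """Hypothetical: raised when a cookie string cannot be parsed."""


def parse_cookie(raw: str) -> tuple[str, str]:
    # Hypothetical parser: splits 'name=value' and raises the dedicated
    # error type on malformed input.
    name, sep, value = raw.partition('=')
    if not sep or not name:
        raise CookieParseError(f'Malformed cookie: {raw!r}')
    return name.strip(), value.strip()
```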

Discussion Points

Before implementation, we need to agree on:

1. Should we mirror the JS hierarchy exactly, or adapt it to Python idioms?
   - In JS, `SessionError` extends `RetryRequestError`. Should Python do the same, or keep `SessionError` standalone with its current special handling?
   - Python already has `ProxyError` extending `SessionError`, which JS lacks. Do we keep it?
2. Naming conventions. The Python standard library uses both `*Error` and `*Exception`. Should we stick with `*Error` for consistency with JS?
3. What about the Python-specific exceptions we already have?
   - `UserDefinedErrorHandlerError`, `HttpStatusCodeError`, `ServiceConflictError`, and `RequestCollisionError` don't exist in JS. Should they stay as-is?
4. Integration with `BasicCrawler` error handling logic.
   - Adding `RetryRequestError` and `NonRetryableError` requires changes to the retry logic in `BasicCrawler._handle_request_function()`.
   - `CriticalError` needs integration with `AutoscaledPool` to trigger shutdown.
5. Crawlee JS v4 direction. The v4 branch has the same error hierarchy as v3. Should we wait for v4 to stabilize, or align with the current state now?
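To make point 4 concrete, the dispatch order in the retry logic could be prototyped roughly as follows. Everything here is a hypothetical sketch, not the actual `BasicCrawler` implementation:

```python
class RetryRequestError(Exception):
    """Hypothetical: always retried."""


class NonRetryableError(Exception):
    """Hypothetical: never retried."""


class CriticalError(NonRetryableError):
    """Hypothetical: shuts down the crawler."""


def classify_failure(error: Exception, retry_count: int, max_request_retries: int) -> str:
    """Hypothetical dispatch order for the retry logic in point 4."""
    if isinstance(error, CriticalError):
        return 'shutdown'  # abort the whole crawler (e.g. via AutoscaledPool)
    if isinstance(error, NonRetryableError):
        return 'fail'      # mark the request failed immediately
    if isinstance(error, RetryRequestError):
        return 'retry'     # retry regardless of the budget
    return 'retry' if retry_count < max_request_retries else 'fail'
```

Note the ordering: `CriticalError` must be checked before `NonRetryableError`, since it is a subclass and carries the stronger consequence.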

Proposed Target Hierarchy (for discussion)

```
Exception
├── RetryRequestError                       # NEW: Always retried
│   └── SessionError                        # CHANGED: Re-parent under RetryRequestError
│       └── ProxyError                      # KEEP: Python-specific
├── NonRetryableError                       # NEW: Never retried
│   └── CriticalError                       # NEW: Shuts down crawler
│       ├── MissingRouteError               # NEW: No route found
│       └── BrowserLaunchError              # NEW: Browser launch failure
├── UserDefinedErrorHandlerError            # KEEP
│   └── UserHandlerTimeoutError             # KEEP
├── ServiceConflictError                    # KEEP
├── HttpStatusCodeError                     # KEEP
│   └── HttpClientStatusCodeError           # KEEP
├── RequestHandlerError [Generic]           # KEEP
├── ContextPipelineInitializationError      # KEEP
├── ContextPipelineFinalizationError        # KEEP
├── ContextPipelineInterruptedError         # KEEP
├── RequestCollisionError                   # KEEP
├── CookieParseError                        # NEW
└── AbortError (internal)                   # KEEP
```
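The NEW and CHANGED parts of the tree above would amount to roughly these additions to `src/crawlee/errors.py` (a sketch for discussion, not a final API; docstrings paraphrase the tree's annotations):

```python
class RetryRequestError(Exception):
    """Always retried, regardless of max_request_retries."""


class SessionError(RetryRequestError):
    """Re-parented under RetryRequestError; triggers session rotation."""


class NonRetryableError(Exception):
    """Never retried; the request fails immediately."""


class CriticalError(NonRetryableError):
    """Shuts down the whole crawler."""


class MissingRouteError(CriticalError):
    """No route matched the request label and no default handler exists."""


class BrowserLaunchError(CriticalError):
    """The browser could not be launched."""


class CookieParseError(Exception):
    """A cookie string could not be parsed."""
```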

Labels: enhancement, t-tooling