[FEATURE REQ]: Modify timeout policies for HTTP-based retry paths for a request with Per-Partition Automatic Failover and / or Gateway V2 enabled. #47419

@jeet1995

Background

For accounts configured with Gateway V2.0 (Thin Proxy), the CosmosClient instance should adhere to a stricter latency-based SLA, or fail over across regions more quickly, for point operations. Gateway-bound requests currently use the following timeout policy.

For hot-path requests (Document requests, or supporting metadata requests such as PartitionKeyRange, Collection, QueryPlan and Address requests), HttpTimeoutPolicyControlPlaneRead uses timeouts of 500ms, 1s and 10s, with a max delay of 1s. This can breach a latency-based SLA: a request can accumulate >=12.5s of delay before it is retried cross-regionally, triggers a failover, or bubbles up as an exception to the downstream caller.
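
The cumulative figure can be reproduced with simple arithmetic. A minimal sketch, assuming the three values are timeouts for attempts made in sequence and that a single 1s delay is incurred between attempts; the function name and the placement of the delay are illustrative and not taken from the SDK:

```python
def worst_case_seconds(timeouts, delays):
    """Upper bound on time spent if every attempt times out.

    timeouts: per-attempt timeouts in seconds (assumed to run sequentially).
    delays:   delays in seconds applied between attempts (assumed).
    """
    return sum(timeouts) + sum(delays)


# Values quoted in this issue: 500ms, 1s, 10s, plus a 1s delay.
current = worst_case_seconds([0.5, 1.0, 10.0], [1.0])
print(current)  # 12.5
```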

Goal

Selectively reduce these timeouts for point reads (see Azure/azure-cosmos-dotnet-v3#5497 and Azure/azure-cosmos-dotnet-v3#5482). For point reads, the timeouts become 1s, 6s, 6s. For non-point reads, they become 6s, 6s, 10s to accommodate long-running Query operations (the Cosmos DB backend is allowed up to 5s for certain long-running operations).
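
One way to picture the proposed selection is a lookup keyed by operation kind. This is only a sketch: `OperationKind`, `PROPOSED_TIMEOUTS` and `timeouts_for` are hypothetical names, not SDK types, and only the timeout values come from this issue.

```python
from enum import Enum


class OperationKind(Enum):
    # Hypothetical classification; the SDK's own request types may differ.
    POINT_READ = "point_read"
    NON_POINT_READ = "non_point_read"  # e.g. a Query operation


# Timeouts in seconds, as proposed in this issue.
PROPOSED_TIMEOUTS = {
    OperationKind.POINT_READ: (1.0, 6.0, 6.0),
    OperationKind.NON_POINT_READ: (6.0, 6.0, 10.0),
}


def timeouts_for(kind):
    """Return the proposed timeout schedule for the given operation kind."""
    return PROPOSED_TIMEOUTS[kind]


print(timeouts_for(OperationKind.POINT_READ))      # (1.0, 6.0, 6.0)
print(timeouts_for(OperationKind.NON_POINT_READ))  # (6.0, 6.0, 10.0)
```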

The timeout modification should also apply to writes but writes

Acceptance Criteria

  • Test by injecting response delays (for Document requests) against either a ThinClient-enabled account or a Per-Partition Automatic Failover-enabled account. The feedback loop (a failover, a cross-region retry, or an error code returned to the caller) should be much quicker for point reads.
  • Perform a DR drill to weed out regressions across three types of accounts:
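
For the delay-injection criterion, the expected feedback time can be estimated before running against a real account. The sketch below is a pure simulation under stated assumptions: the server always responds after a fixed injected delay, attempts run in sequence, and delays between attempts are ignored. `time_to_outcome` is a hypothetical helper, not an SDK or test-framework API.

```python
def time_to_outcome(timeouts, injected_delay):
    """Simulate attempts against a server that responds after injected_delay seconds.

    Returns (elapsed_seconds, succeeded). No real sleeping is done; the
    elapsed time is computed arithmetically.
    """
    elapsed = 0.0
    for timeout in timeouts:
        if injected_delay <= timeout:
            # The response arrives within this attempt's timeout.
            return elapsed + injected_delay, True
        elapsed += timeout  # this attempt timed out; move on to the next
    return elapsed, False  # all attempts exhausted


# Proposed point-read timeouts with a 3s injected response delay:
# the first attempt times out at 1s, the second succeeds 3s later.
print(time_to_outcome((1.0, 6.0, 6.0), 3.0))  # (4.0, True)
```

A `False` second element means every attempt timed out, which is the point at which a failover, a cross-region retry, or an error to the caller would be expected.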

Pending questions (follow up with Cosmos .NET SDK team)

  • Get more clarity on DR drill specifics. Ideally, region offline and node down (PPAF) cases can be tested across various account combinations (PPAF + ThinProxy).
  • Should timeouts for writes be wrapped as 503 to force a potential cross-region retry, given that retrying such writes can break idempotency guarantees?
