Follow-up: Increase blob retry window to handle slow ActiveStorage uploads #25

@wahyuwidgetworks

Context

Phase 1.5 image delivery RCA identified that intermittent image failures are caused by a race condition: Chatwoot fires the message_created webhook before the ActiveStorage blob upload job (Sidekiq) completes.

Current mitigation (Fix 3):

  • 2 retries (3 attempts in total) with 1s/2s backoff (~3s total retry window)
  • Handles most cases, where the blob commits within 3 seconds
  • If the blob takes >3s to commit, all retries fail → a yellow error note is posted to the agent

Observed in production:

  • Some blob uploads exceed the 3s retry window
  • Particularly affects larger images or when Sidekiq workers are slow

Proposed Enhancement

Increase retry parameters in chatwoot.service.ts:

  • MAX_BLOB_RETRIES: 2 → 4 (or 5)
  • BLOB_RETRY_DELAY_MS: Consider exponential backoff (1s, 2s, 4s, 8s)

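With 4 retries this would extend the retry window from ~3s to ~15s (1s + 2s + 4s + 8s), covering more edge cases. A fifth retry on the same doubling schedule would add a 16s wait, for ~31s in total.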
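A minimal sketch of the proposed schedule, assuming the delay doubles from a 1s base on each retry. The constant names come from this issue. `backoffDelays`, `retryWindowMs`, `fetchWithRetry`, and the `fetchBlob` callback are hypothetical names for illustration and are not taken from chatwoot.service.ts.

```typescript
// Proposed values from this issue; the real constants live in chatwoot.service.ts.
const MAX_BLOB_RETRIES = 4;
const BLOB_RETRY_DELAY_MS = 1000;

// Delay before each retry, doubling every time: 1s, 2s, 4s, 8s for the values above.
function backoffDelays(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// Total time spent waiting between attempts if every attempt fails.
function retryWindowMs(retries: number, baseMs: number): number {
  return backoffDelays(retries, baseMs).reduce((sum, d) => sum + d, 0);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls fetchBlob once, then retries up to `retries` times while it returns null.
// `fetchBlob` is a stand-in (assumption) for whatever downloads the attachment
// and reports a 404 as null.
async function fetchWithRetry<T>(
  fetchBlob: () => Promise<T | null>,
  retries: number = MAX_BLOB_RETRIES,
  baseMs: number = BLOB_RETRY_DELAY_MS,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const blob = await fetchBlob();
    if (blob !== null) return blob;
    // Only wait if another attempt follows.
    if (attempt < retries) await wait(baseMs * 2 ** attempt);
  }
  return null; // caller posts the error note to the agent
}
```

`retryWindowMs(2, 1000)` gives the current ~3s window and `retryWindowMs(4, 1000)` the proposed ~15s one. Passing `wait` as a parameter is only there so the helper can be tested without real sleeps.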

Trade-offs

Pro:

  • Reduces intermittent image delivery failures
  • Better UX for agents (fewer false failures)

Con:

  • Longer webhook processing time if the blob genuinely doesn't exist (up to ~15s of waiting before the error note is posted)
  • May mask underlying Sidekiq performance issues

Related

  • Root cause analysis: Prospek/docs/whisper/image-delivery-rca-v2.md (Section 2, RC4)
  • Code: chatwoot.service.ts:1238-1282
  • Current branch: fix/image-delivery-reliability (commit 8037754c)

Priority

P2 - Enhancement (Phase 1.5 mitigates most cases; this extends coverage for edge cases)

After Phase 1.5 closes, monitor real-world blob 404 rates in production. If the failure rate remains >1%, prioritize this enhancement.
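One way the >1% check could be tallied is sketched below. `BlobFetchStats` is a hypothetical in-memory counter, not existing code; a real deployment would more likely read these numbers from its logging or metrics system.

```typescript
// Hypothetical tally of image deliveries (assumption: one record per
// message_created webhook that carries an image attachment).
class BlobFetchStats {
  private total = 0;
  private failed = 0;

  // ok=false means every retry ended in a blob 404.
  record(ok: boolean): void {
    this.total += 1;
    if (!ok) this.failed += 1;
  }

  // Fraction of deliveries that failed; 0 when nothing has been recorded.
  failureRate(): number {
    return this.total === 0 ? 0 : this.failed / this.total;
  }

  // True when the failure rate is strictly above the threshold (1% in this issue).
  exceeds(threshold = 0.01): boolean {
    return this.failureRate() > threshold;
  }
}
```

Calling `record(false)` at the point where the error note is posted and `record(true)` on a successful fetch would be enough to drive `exceeds()`.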

Labels

bug (Something isn't working)
