Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Persist inbound stripe webhooks #2972

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

ancorcruz
Copy link
Contributor

@ancorcruz ancorcruz commented Dec 17, 2024

Context

Currently, when our application receives webhooks from third-party services (e.g., Stripe, GoCardless, Adyen), their payloads are immediately queued in Sidekiq for background processing. While this approach works in most cases, it introduces a critical reliability issue, especially in self-hosted environments where Sidekiq may lose jobs if workers are abruptly terminated or overloaded.

This unreliability has led to issues like pending transactions that are never resolved despite receiving valid webhooks. For example, a reported issue involved a wallet credit purchase where the payment settled on Stripe, the webhook was received, but the transaction remained pending indefinitely until manually resolved.

To address this, we aim to ensure that no webhook payload is lost and processing remains consistent, even under adverse conditions.

Description

This PR introduces a robust mechanism for handling inbound webhooks by persisting them in the database before processing. Key changes include:

Database Persistence:

  • Webhooks are now stored in a new inbound_webhooks table immediately upon receipt. This ensures all webhook payloads are safely persisted in a reliable system before processing.

Background Processing:

  • A new InboundWebhooks::CreateService is introduced to persist the webhook and enqueue it for processing.
  • A new InboundWebhooks::ProcessJob Sidekiq worker processes the persisted webhooks, invoking the InboundWebhooks::ProcessService
  • A new InboundWebhooks::ProcessService processes the persisted webhooks invoking the appropriate payment provider service.

Retry Mechanisms:

  • Failed webhooks are marked as such and can be retried manually or automatically via a clock job.
  • A cleanup job is introduced to remove processed webhook records older than 1 month.

Controller Update:

  • The WebhooksController actions (stripe, adyen, and gocardless) now use the InboundWebhooks::CreateService and respond with 200 OK once the webhook is persisted successfully.

Backward Compatibility:

  • No webhooks will be missed during the transition. The change is designed to integrate seamlessly into both cloud and self-hosted environments.

Key Changes

  • Added InboundWebhook ActiveRecord model to persist webhook payloads with fields such as source, event_type, payload, status.
  • Implemented InboundWebhooks::CreateService to handle webhook persistence and Sidekiq job enqueueing.
  • Added InboundWebhooks::ProcessJob to process persisted webhooks from Sidekiq queue.
  • Added InboundWebhooks::ProcessService to call the respective provider's handle incoming webhook service.
  • Updated PaymentProviders::Stripe::HandleIncomingWebhookService to process webhooks from the database instead of relying solely on direct Sidekiq queues.
  • Updated WebhooksController#stripe action to integrate with the new service and workflow.
  • Introduced a clock job for:
  1. Retrying pending webhooks older than 2 hours.
  2. Cleaning up processed webhooks older than 90 days.

Benefits

Reliability: Ensures no webhook is lost, even if Sidekiq workers fail.
Resilience: Provides a retry mechanism for failed webhook processing.
Transparency: Makes webhook processing traceable with statuses (pending, processing, processed, failed).

@ancorcruz ancorcruz added the 🐞 Bug Something isn't working label Dec 17, 2024
@ancorcruz ancorcruz self-assigned this Dec 17, 2024
@ancorcruz ancorcruz force-pushed the fix/persisted-webhooks branch 2 times, most recently from c083548 to d3e6956 Compare December 17, 2024 16:49
@ancorcruz ancorcruz force-pushed the fix/persisted-webhooks branch from 8fb4b83 to eb955bc Compare December 19, 2024 16:09
Copy link
Collaborator

@vincent-pochet vincent-pochet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

app/jobs/inbound_webhooks/process_job.rb Outdated Show resolved Hide resolved
event_type:
)

after_commit do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we are not relying on transaction right? So I guess this after_commit block is not mandatory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ensures the job is not added to the queue and tried to be processed until the DB transaction is committed. This will avoid race condition between sidekiq and postgres where sidekiq picks the job but it is unable to find the resource in DB.

Copy link
Collaborator

@vincent-pochet vincent-pochet Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but here InboundWebhooks::CreateService is called directly from the controller and no transaction is created in it so the job will always be called after the InboundWebhook creation transaction.
That's why I was asking. No a major issue anyway ;)


scope :retriable, -> { reprocessable.or(old_pending) }
scope :reprocessable, -> { processing.where("processing_at <= ?", WEBHOOK_PROCESSING_WINDOW.ago) }
scope :old_pending, -> { pending.where("created_at <= ?", WEBHOOK_PROCESSING_WINDOW.ago) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without indexes on these columns, this will be a full table scan always 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add two combined indices [status, processing_at] and [status, created_at] should do it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 Bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants