Can anyone share ideas on how you’d build an automated retry system for scheduled workflows that didn’t run because of downtime?
Consider:
What if a downtime occurs when the retry system is supposed to execute?
ChatGPT’s suggestion:
To create an automated retry system for scheduled workflows that don’t run due to downtime in Bubble, consider a multi-layered approach that uses logging, periodic checks, and retry logic with a focus on resilience. Here’s a strategy that could address this effectively:
- Logging Workflow Schedules and Completion Status
When you schedule a workflow (or workflows on a list), log each schedule in a separate “Workflow Log” table. This table should capture:
Workflow ID
Type of workflow
Scheduled time
Completion status (e.g., “Pending,” “Completed,” “Failed,” or “Retried”)
Last attempted execution time
In each workflow, include a final step that sets its log entry to “Completed” once it finishes successfully. This way, you have a record to identify workflows that didn’t run or didn’t complete.
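As a rough sketch of what one “Workflow Log” record could hold, here is the field list above written as a Python data class. Bubble is no-code, so this only illustrates the shape of the data; the class name, field names, and the sample values are assumptions, not anything Bubble defines.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(str, Enum):
    # The four statuses named in the list above.
    PENDING = "Pending"
    COMPLETED = "Completed"
    FAILED = "Failed"
    RETRIED = "Retried"


@dataclass
class WorkflowLogEntry:
    # Hypothetical field names mirroring the "Workflow Log" columns.
    workflow_id: str
    workflow_type: str
    scheduled_at: datetime
    status: Status = Status.PENDING
    last_attempted_at: Optional[datetime] = None

    def mark_completed(self, now: datetime) -> None:
        """Final step of a workflow: record a successful run."""
        self.status = Status.COMPLETED
        self.last_attempted_at = now


# Illustrative values only.
entry = WorkflowLogEntry(
    workflow_id="wf-001",
    workflow_type="send_invoice",
    scheduled_at=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
)
entry.mark_completed(datetime(2024, 1, 1, 9, 0, 5, tzinfo=timezone.utc))
```

In Bubble itself this would be a data type with the same fields, and `mark_completed` would be a “Make changes to a thing” step at the end of the workflow.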
- Implementing a Monitoring Workflow
Set up a periodic check (e.g., every hour) that scans the “Workflow Log” table to find workflows with a status of “Pending” or “Failed” and a scheduled time in the past.
This monitoring workflow should retry any workflows that were missed, marking them as “Retried” once rescheduled.
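The scan itself is a simple filter. The sketch below shows the idea in Python, with log entries as plain dictionaries. The function names, the `reschedule` callback, and the 5-minute grace period (so a workflow that is merely running a little late is not retried) are all assumptions added for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_missed(log, now, grace=timedelta(minutes=5)):
    """Return entries that are past due and still Pending or Failed."""
    return [
        e for e in log
        if e["status"] in ("Pending", "Failed")
        and e["scheduled_at"] + grace <= now
    ]


def run_monitor(log, now, reschedule):
    """One monitoring pass: reschedule each missed entry and mark it Retried."""
    missed = find_missed(log, now)
    for e in missed:
        reschedule(e["workflow_id"])  # hypothetical hook that re-queues the workflow
        e["status"] = "Retried"
        e["last_attempted_at"] = now
    return [e["workflow_id"] for e in missed]


def at(hour):
    # Helper for the sample data below.
    return datetime(2024, 1, 1, hour, 0, tzinfo=timezone.utc)


log = [
    {"workflow_id": "a", "status": "Pending", "scheduled_at": at(10)},    # missed
    {"workflow_id": "b", "status": "Completed", "scheduled_at": at(10)},  # done
    {"workflow_id": "c", "status": "Pending", "scheduled_at": at(13)},    # not due yet
    {"workflow_id": "d", "status": "Failed", "scheduled_at": at(11)},     # missed
]
rescheduled = []
retried = run_monitor(log, at(12), rescheduled.append)
```

One thing to watch: an entry marked “Retried” no longer matches the scan, so if the retried run is also lost to downtime, nothing picks it up again unless the retry resets the status or the scan also covers stale “Retried” entries.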
- Building the Retry Logic
Immediate Retry Upon Detection: When the monitoring workflow identifies a missed workflow, it can attempt to reschedule it. However, instead of just one retry, consider building a series of incremental retry attempts.
Retry Attempts: For each failed workflow, set a cap on retry attempts (e.g., 3 attempts with increasing backoff, such as retrying after 5 minutes, then 15 minutes, then 1 hour).
Flagging for Manual Intervention: After the maximum retry attempts, flag the workflow as requiring manual intervention. This can trigger a notification to an admin or team member to review the issue.
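The capped retries with growing delays can be sketched as a small lookup table. This is a minimal Python illustration under a few assumptions: the `retry_count` and `next_retry_at` fields, the `notify` callback, and the “Needs manual review” status are hypothetical names, not part of the original status list.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay before attempt 1, 2, and 3, using the intervals from the text.
BACKOFF = [timedelta(minutes=5), timedelta(minutes=15), timedelta(hours=1)]
MAX_ATTEMPTS = len(BACKOFF)


def next_retry_at(retry_count: int, now: datetime) -> Optional[datetime]:
    """Return when to retry next, or None once the cap is reached."""
    if retry_count >= MAX_ATTEMPTS:
        return None
    return now + BACKOFF[retry_count]


def handle_failure(entry: dict, now: datetime, notify) -> None:
    """Schedule the next retry, or flag the entry and alert an admin."""
    when = next_retry_at(entry["retry_count"], now)
    if when is None:
        entry["status"] = "Needs manual review"  # assumed extra status
        notify(entry["workflow_id"])
    else:
        entry["retry_count"] += 1
        entry["next_retry_at"] = when


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
entry = {"workflow_id": "wf-001", "status": "Failed", "retry_count": 0}
alerts = []
for _ in range(4):  # three retries get scheduled, the fourth failure is flagged
    handle_failure(entry, now, alerts.append)
```

Keeping the delays in a list makes the cap and the schedule one setting: changing the list changes both.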
- Handling Downtime During the Monitoring Workflow
Since the monitoring workflow itself could be affected by downtime, consider using redundant checks. You could set up multiple periodic workflows at staggered intervals (e.g., one every hour and another every 6 hours) to catch any workflows that might be missed.
Persistent “Final Check” Workflow: At a less frequent interval (e.g., once every 24 hours), set up a “final check” workflow that reviews all logs with a “Pending” or “Failed” status. This final check can act as a failsafe, ensuring that no missed workflows linger indefinitely.
- Using Conditional “Catch-Up” Logic
Add logic within each workflow itself to perform a quick check against the “Workflow Log” before running. If the workflow was already completed, it can exit early to prevent duplication. This is helpful in scenarios where multiple retries might accidentally trigger the same workflow.
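This early-exit check is what makes retries safe to repeat. A minimal Python sketch, assuming a dictionary log entry and a hypothetical `action` callable that stands in for the workflow’s real work:

```python
def run_once(entry: dict, action) -> bool:
    """Run the action unless the log already says Completed.

    Returns True if the action ran, False if it was skipped.
    """
    if entry["status"] == "Completed":
        return False  # a duplicate trigger: exit early
    action()
    entry["status"] = "Completed"
    return True


calls = []
entry = {"workflow_id": "wf-001", "status": "Pending"}
first = run_once(entry, lambda: calls.append("charged"))
second = run_once(entry, lambda: calls.append("charged"))  # skipped
```

Note that the check and the update are two separate steps here, so two retries that start at the same instant could both pass the check. For anything like billing, it is worth also making the action itself safe to repeat.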
- Tracking Retry History for Insights
In the “Workflow Log,” maintain a retry count and a timestamp of each retry attempt. This will help you monitor any patterns in failures and investigate root causes.
- Consider an External Ping for Extended Downtime
For critical workflows, consider setting up an external monitoring service (like a cron job or uptime monitoring service) to ping your Bubble app at regular intervals. If it detects downtime, it can automatically notify your team to check on critical workflows.
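An external check can be a very small script run from cron on a machine outside Bubble. The sketch below uses only Python’s standard library; the function names, the `alert` callback, and the injectable `opener` parameter (there so the check can be tested without a network) are assumptions for illustration.

```python
import urllib.request


def is_up(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), and timeouts,
        # all of which are OSError subclasses.
        return False


def check_and_alert(url: str, alert, opener=urllib.request.urlopen) -> bool:
    """Ping the app once and call alert(message) if it looks down."""
    up = is_up(url, opener=opener)
    if not up:
        alert(f"{url} did not respond; check critical workflows")
    return up
```

A cron entry could then call something like `check_and_alert("https://example.com/", print)` every few minutes, with the URL replaced by your app’s address and `print` replaced by whatever sends your team a message.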
Example Workflow Execution Plan
Let’s consider an example where you schedule a workflow at a specific time:
- Initial Schedule: Log the workflow in the “Workflow Log” table with a “Pending” status.
- Execution Check: When the scheduled workflow begins, it first checks the “Workflow Log.” If the status is “Pending,” it runs and marks the status as “Completed.”
- Monitoring Workflow: Every hour, this workflow scans for any workflows with a past scheduled time and a “Pending” status. If found, it increments the retry count, attempts to re-run the workflow, and updates the last attempted execution time.
- Retry Limit and Manual Intervention: After 3 failed attempts, the monitoring workflow flags the record for manual intervention, notifying an admin or triggering a separate alert workflow.
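To see the four steps play out, here is a small Python simulation of a workflow that keeps failing. It is only a sketch: the `run_workflow` and `notify_admin` callbacks, the field names, and the “Needs manual review” status are hypothetical stand-ins for whatever your app uses.

```python
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 3


def monitor_pass(entry, now, run_workflow, notify_admin):
    """One hourly pass over a single log entry."""
    if entry["status"] != "Pending" or entry["scheduled_at"] > now:
        return  # nothing overdue to do
    entry["retry_count"] += 1
    entry["last_attempted_at"] = now
    if run_workflow(entry["workflow_id"]):
        entry["status"] = "Completed"
    elif entry["retry_count"] >= MAX_RETRIES:
        entry["status"] = "Needs manual review"  # assumed extra status
        notify_admin(entry["workflow_id"])


# Step 1: the workflow is logged as Pending when it is scheduled.
entry = {
    "workflow_id": "wf-001",
    "scheduled_at": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    "status": "Pending",
    "retry_count": 0,
    "last_attempted_at": None,
}

# Steps 3 and 4: four hourly passes while every re-run fails.
alerts = []
first_pass = entry["scheduled_at"] + timedelta(hours=1)
for hour in range(4):
    monitor_pass(
        entry,
        first_pass + timedelta(hours=hour),
        run_workflow=lambda workflow_id: False,  # simulate a failing re-run
        notify_admin=alerts.append,
    )
```

After the third failed pass the entry is flagged and the admin is notified once; the fourth pass finds nothing left to do.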
Resilience for Large-Scale Operations
As your app scales, this retry mechanism can help ensure that workflows critical to billing, refunds, and data processing remain reliable. The scheduled retry system will catch and retry missed workflows automatically, and the log will make it easy to review any issues in your backend.
This approach should allow you to keep track of critical workflows, recover missed schedules, and handle downtime without having to manually monitor every workflow.