What Is Bubble Doing to Prevent Scheduled Workflow Failures During Downtime?

Hello,

I recently encountered an issue with scheduled API workflows during the system downtime on November 13 at 16:39 EST, which raised some concerns about the stability of backend operations, particularly for sensitive workflows involving billing and refunds. This downtime affected backend workflows that were supposed to run at that time: they didn’t execute at all. Instead, they disappeared from the scheduler and never performed their intended actions, such as modifying data or triggering additional workflows.

I was advised to contact support if I ever encountered this problem, but since my case happened during testing, there’s no immediate need to open a ticket. My concern, however, is what will happen once my app is live and scaling. It would be incredibly challenging to manually track, among hundreds or thousands of scheduled workflows, every one that didn’t run and simply disappeared from the scheduler because it was due to fire during a downtime event.

I’m hoping to get some insights into Bubble’s approach to these situations. Specifically:

  • What is the expected behavior for scheduled workflows during downtimes? Does it depend on the type of incident?
  • Are there any measures in place or in development to ensure critical workflows either resume or retry automatically after downtime?

The Bubble manual says the following:

Does this imply that “Schedule API workflow on a List” ensures all workflows are executed, whereas “Schedule API Workflow” does not guarantee this?

I hope someone from the Bubble team can give an official response to this. It would help us plan for future scaling and ensure the stability of essential workflows, especially for billing and other critical operations.

Share your thoughts, even if you’re not from the Bubble team!
Thanks in advance for any guidance or recommendations!

3 Likes

The only way to eliminate this problem at the moment is to run Bubble on a dedicated server (and even that is not totally guaranteed). I sometimes encounter this type of failure, and it’s very difficult to keep up. Firstly, Bubble should send us an e-mail immediately after detecting such a situation, or expose a ‘server not available’ status, so that we can stop any pending execution if it’s not too late. What’s more, all my schedules are now logged so that I can diagnose them, and even restart them if necessary.

4 Likes

This is the kind of thing you can get an answer for from support :slight_smile:

SAWOL (Schedule API Workflow on a List) means that the recursion won’t fail.

If you use a recursive workflow, you need error handling by default, because if any one iteration stops, the entire loop stops. With SAWOL, even if one workflow has an error, the rest will still run.

1 Like

Thank you for the clarification, but that’s not quite the point we’re discussing. The main focus here is understanding what happens to any scheduled workflow, whether scheduled via a recursive workflow or not, if it’s set to run at the precise moment a downtime occurs.
The question is about the behavior and reliability of scheduled workflows that are already in the queue during a downtime.

I mentioned “Schedule on a List” in this discussion not because of its recursive capabilities, but because the manual indicates that, in the event of server downtime, this type of workflow will pause and then resume when resources are available again.

I appreciate you sharing your insights though!

Yes, because you can’t ‘pause’ a recursive workflow…

The point is, if you have a recursive workflow, or any mission-critical workflow, you should build error handling in. We take it for granted that Bubble (generally) works, but it’s normal for apps in any framework to be built to handle things going wrong.

Things like this make your app more robust. When you schedule a workflow (or multiple), create a log for each one which contains the workflow ID. Pass that to the workflow itself, and mark the log as complete when the workflow is complete. If any logs are incomplete, you know something didn’t run.
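
Roughly, the pattern looks like this. Here’s a minimal sketch in Python (Bubble itself is no-code, so the in-memory `workflow_log` store and these function names are illustrative stand-ins, not Bubble APIs):

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-in for a Bubble "Workflow Log" data type.
workflow_log: dict[str, dict] = {}

def log_schedule(workflow_name: str, run_at: datetime) -> str:
    """Create a 'Pending' log entry before scheduling the workflow."""
    log_id = str(uuid.uuid4())
    workflow_log[log_id] = {
        "workflow": workflow_name,
        "scheduled_for": run_at,
        "status": "Pending",
    }
    return log_id  # pass this ID as a parameter to Schedule API Workflow

def mark_complete(log_id: str) -> None:
    """Last step of the scheduled workflow: mark its own log entry complete."""
    workflow_log[log_id]["status"] = "Completed"
    workflow_log[log_id]["completed_at"] = datetime.now(timezone.utc)

def missed(now: datetime) -> list[str]:
    """Any log still 'Pending' past its scheduled time never ran."""
    return [log_id for log_id, entry in workflow_log.items()
            if entry["status"] == "Pending" and entry["scheduled_for"] < now]
```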

3 Likes

Sure, we could just ask support everything and not use the forum. I just think this would be valuable information for the whole community to have. It’s also an opportunity for the community to share feedback so the Bubble team can keep improving the product.

Thanks for your input

1 Like

These are some good insights, thank you for sharing.

1 Like

What I understand is that the “Schedule API Workflow on a List” action specifically has the capability to pause during a downtime and continue scheduling workflows later, when the server is back up.

However, if the workflows were already scheduled, they behave just as if they had been scheduled by a single “Schedule API Workflow” and, in my experience, don’t run at all if a downtime occurs at their scheduled time. (This last part is what I’m looking for Bubble to clarify; it might depend on the type of incident. We might all have our theories, but I can’t find an official pronouncement about it.)

1 Like

Does anyone have ideas on how you’d build an automated retry system for scheduled workflows that didn’t run because of a downtime?

Consider:

What if a downtime occurs when the retry system is supposed to execute?

ChatGPT’s suggestion:

To create an automated retry system for scheduled workflows that don’t run due to downtime in Bubble, consider a multi-layered approach that uses logging, periodic checks, and retry logic with a focus on resilience. Here’s a strategy that could address this effectively:

  1. Logging Workflow Schedules and Completion Status

When you schedule a workflow (or workflows on a list), log each schedule in a separate “Workflow Log” table. This table should capture:

  • Workflow ID

  • Type of workflow

  • Scheduled time

  • Completion status (e.g., “Pending,” “Completed,” “Failed,” or “Retried”)

  • Last attempted execution time

In each workflow, include a step to update this log as “Completed” once it finishes successfully. This way, you have a record to identify workflows that didn’t run or complete.
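
As a concrete shape for that table, here’s a sketch using a Python dataclass (the field names are illustrative, not a Bubble schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class WorkflowLog:
    """One row per scheduled workflow; field names are illustrative."""
    workflow_id: str                       # ID passed into the scheduled workflow
    workflow_type: str                     # e.g., "billing" or "refund"
    scheduled_time: datetime
    status: str = "Pending"                # Pending | Completed | Failed | Retried
    last_attempt: Optional[datetime] = None
    retry_count: int = 0                   # used for capped retries later
```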

  2. Implementing a Monitoring Workflow

Set up a periodic check (e.g., every hour) that scans the “Workflow Log” table to find workflows with a status of “Pending” or “Failed” and a scheduled time in the past.

This monitoring workflow should retry any workflows that were missed, marking them as “Retried” once rescheduled.
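
In code terms, that periodic scan might look like this (a Python sketch reusing the `WorkflowLog` shape above; `reschedule` stands in for Bubble’s Schedule API Workflow action):

```python
from datetime import datetime, timezone

def monitor(logs: list, reschedule) -> None:
    """Hourly scan: re-run anything Pending/Failed whose time has passed."""
    now = datetime.now(timezone.utc)
    for log in logs:
        if log.status in ("Pending", "Failed") and log.scheduled_time < now:
            reschedule(log.workflow_id)    # attempt the missed workflow again
            log.status = "Retried"
            log.last_attempt = now
```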

  3. Building the Retry Logic

  • Immediate retry upon detection: When the monitoring workflow identifies a missed workflow, it can attempt to reschedule it. However, instead of just one retry, consider building a series of incremental retry attempts.

  • Retry attempts: For each failed workflow, set a cap on retry attempts (e.g., 3 attempts with exponential backoff, such as retrying at intervals of 5 minutes, 15 minutes, and 1 hour).

  • Flagging for manual intervention: After the maximum retry attempts, flag the workflow as requiring manual intervention. This can trigger a notification to an admin or team member to review the issue.
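
A sketch of that capped, exponential-backoff decision (Python; the cap and intervals are the example values above, not fixed recommendations):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_ATTEMPTS = 3
BACKOFF = [timedelta(minutes=5), timedelta(minutes=15), timedelta(hours=1)]

def next_retry_time(log) -> Optional[datetime]:
    """Return when to reschedule a missed workflow, or None once the cap
    is hit and the log should be flagged for manual intervention."""
    if log.retry_count >= MAX_ATTEMPTS:
        return None                          # flag for an admin instead
    delay = BACKOFF[log.retry_count]         # 5 min, then 15 min, then 1 h
    log.retry_count += 1
    return datetime.now(timezone.utc) + delay
```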

  4. Handling Downtime During the Monitoring Workflow

Since the monitoring workflow itself could be affected by downtime, consider using redundant checks. You could set up multiple periodic workflows at staggered intervals (e.g., one every hour and another every 6 hours) to catch any workflows that might be missed.

Persistent “Final Check” Workflow: At a less frequent interval (e.g., once every 24 hours), set up a “final check” workflow that reviews all logs with a “Pending” or “Failed” status. This final check can act as a failsafe, ensuring that no missed workflows linger indefinitely.
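
As a sketch, the staggered schedule is just configuration (a Python sketch; the intervals are illustrative, so tune them to your WU cost tolerance):

```python
from datetime import timedelta

# Staggered, redundant monitors: if one misses its own slot during an
# outage, a less frequent one still catches the backlog.
MONITORS = [
    {"name": "hourly_check",      "every": timedelta(hours=1)},
    {"name": "six_hour_check",    "every": timedelta(hours=6)},
    {"name": "daily_final_check", "every": timedelta(hours=24)},  # failsafe
]
```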

  5. Using Conditional “Catch-Up” Logic

Add logic within each workflow itself to perform a quick check against the “Workflow Log” before running. If the workflow was already completed, it can exit early to prevent duplication. This is helpful in scenarios where multiple retries might accidentally trigger the same workflow.
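
A sketch of that early-exit guard as the first step of the workflow (Python, illustrative only):

```python
def run_guarded(log, do_work) -> None:
    """First step inside the workflow itself: exit early if a retry
    already completed this work, so duplicates are (mostly) prevented."""
    if log.status == "Completed":
        return                    # another attempt already finished it
    do_work()
    log.status = "Completed"
```

Note that this check-then-run is not atomic; that caveat comes up again later in this thread.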

  6. Tracking Retry History for Insights

In the “Workflow Log,” maintain a retry count and a timestamp of each retry attempt. This will help you monitor any patterns in failures and investigate root causes.

  7. Consider an External Ping for Extended Downtime

For critical workflows, consider setting up an external monitoring service (like a cron job or uptime monitoring service) to ping your Bubble app at regular intervals. If it detects downtime, it can automatically notify your team to check on critical workflows.
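
Such a probe can live entirely outside Bubble, e.g., in a small cron job. A Python sketch (the URL is a placeholder for whatever lightweight page your app exposes):

```python
import urllib.request

def ping(url: str = "https://yourapp.bubbleapps.io/version-live") -> bool:
    """External uptime probe: alert your team if this returns False."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False  # covers connection errors and HTTP errors
```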

Example Workflow Execution Plan

Let’s consider an example where you schedule a workflow at a specific time:

  1. Initial Schedule: Log the workflow in the “Workflow Log” table with a “Pending” status.

  2. Execution Check: When the scheduled workflow begins, it first checks the “Workflow Log.” If the status is “Pending,” it runs and marks the status as “Completed.”

  3. Monitoring Workflow: Every hour, this workflow scans for any workflows with a past scheduled time and a “Pending” status. If found, it increments the retry count, attempts to re-run the workflow, and updates the last attempted execution time.

  4. Retry Limit and Manual Intervention: After 3 failed attempts, the monitoring workflow flags the record for manual intervention, notifying an admin or triggering a separate alert workflow.

Resilience for Large-Scale Operations

As your app scales, this retry mechanism can help ensure that workflows critical to billing, refunds, and data processing remain reliable. The scheduled retry system will catch and retry missed workflows automatically, and the log will make it easy to review any issues in your backend.

This approach should allow you to keep track of critical workflows, recover missed schedules, and handle downtime without having to manually monitor every workflow.

1 Like

Yeah, that could work.

The absence of a database locking feature in Bubble creates a risk of the monitoring system triggering the same workflow twice: if it checks the workflow status at the exact moment the workflow is already executing but hasn’t yet been marked “Completed,” the overlap could result in duplicate executions. That may not be a major concern at a small scale, but it becomes increasingly problematic as your app scales. Without a locking mechanism to ensure that only one process can modify or access the status at any given time, this monitoring system could introduce new, potentially more significant issues than the original problem of retrying missed workflows, particularly for workflows critical to business operations.
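
Outside Bubble, the classic fix is an atomic “claim” (compare-and-set) before executing, so only one caller can win. A Python sketch of the pattern, using a lock as a stand-in for a database-level atomic update, which Bubble does not expose:

```python
import threading

_lock = threading.Lock()  # stands in for a database-level atomic update

def try_claim(log: dict) -> bool:
    """Atomically move Pending -> Running; only one caller can succeed.
    Bubble has no equivalent primitive, which is exactly the gap described."""
    with _lock:
        if log["status"] != "Pending":
            return False       # someone else already claimed or finished it
        log["status"] = "Running"
        return True

# Both the scheduled run and any retry call try_claim() first;
# the loser exits without touching the data.
```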

2 Likes

I have the same concerns you do regarding critical backend workflows. It would be great if Bubble were able to catch which scheduled tasks were missed during an outage and restart them.

1 Like

Vote for this feature here

Don’t overengineer this. For anything critical, on that same table, create a “Date checked” and a “Date scheduled” field. Every so often, according to your WU cost tolerance, run a Schedule API Workflow on a List to find cases where there’s a discrepancy between the “Date checked” and “Date scheduled” fields.
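
In data terms, that discrepancy search would look something like this (a Python sketch; the field names mirror the two fields suggested above):

```python
from datetime import datetime, timezone

def find_missed(rows: list) -> list:
    """The search behind that Schedule API Workflow on a List: anything
    whose scheduled date has passed but was never checked off."""
    now = datetime.now(timezone.utc)
    return [r for r in rows
            if r["date_scheduled"] < now and r.get("date_checked") is None]
```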

Thank you for replying. That monitoring scheduled workflow might also fail to execute during an outage, which would cause it not to run and not reschedule itself, breaking the recursion. But I understand your intention, and I’m truly a fan of not overengineering. I wish there were a simple and reliable solution for this; that’s where I think Bubble could help.

Not recursive: it’s a Schedule API Workflow on a List, so it runs concurrently.

Oh, so you meant manually triggering the monitoring system periodically. Yes, that could work. However, I don’t think it’s a scalable long-term solution; an automated system would be more appropriate.

Wanted to get some more detailed information from the support team directly, so I reached out to them asking them to provide more information about what happens to workflows scheduled during downtime. Here’s the response I received:

I completely understand your concern regarding scheduled workflows, especially in light of the recent downtime. I want to clarify that Bubble does attempt to re-run any scheduled workflows that occur during brief database or main cluster outages. When there is a short downtime, for example, the database goes offline for a few seconds, our procedure retries any queries that were affected during the crash. However, in rare instances, a very small number of these queries might not be retried successfully.

That being said, if the database or main cluster experiences an extended outage (longer than a few seconds), the workflow will likely fail. In such cases, please don’t hesitate to reach out to our team. We’ll do our best to provide more details on which workflows were affected and help troubleshoot further.

Our engineering team is actively working to improve this process, and we are keeping a running list of all occurrences to maintain transparency with our users.

That said, I want to emphasize that this situation is extremely rare, and our retry procedure for queries after brief outages is generally very reliable. If you experience any issues following a more major outage, please don’t hesitate to reach out to us.

Conclusion: for important workflows, the types of failsafes discussed here are crucial. We’ll definitely get going on that for our apps.

3 Likes

Thanks for sharing. While the recent outage was brief (as they said), I did experience scheduled workflows failing to execute entirely. This highlights a critical issue: Bubble still lacks a truly reliable infrastructure solution for this, and they didn’t provide you with a clear roadmap for how they are addressing it.

These are the kinds of improvements that would help Bubble evolve into a more robust platform capable of supporting enterprise-grade applications at scale. Reliable infrastructure is essential for users building and scaling such apps.

In addition, I’d say Bubble also lacks a way to access the scheduled-workflow list programmatically, for example if you need to schedule at most one workflow per user and must check whether one already exists before scheduling another.

Also, “Schedule API Workflow on a List” does not have the same use cases as a recursive workflow, since things are not run in sequence. Advising people to use one instead of the other is not a solution in many cases.

1 Like