Recursive Workflow Failure

So last night I set up a recursive workflow to run on over 5,000 entries in development mode. The workflow was set up correctly, because before I logged off the computer it had run successfully around 600 times.

I fully expected to log in today and see, if not all of the entries completed, at least a large chunk. Instead, I see only 617 were completed.

So I needed to dig into the server logs to see what caused this failure.

This is the last registered workflow, which started running at 1:04:55AM.

So I looked at the capacity usage, because of course, if I was maxing out capacity, that would be an obvious reason for the recursive workflow to fail.

But that chart clearly shows that at the time the last workflow was triggered (roughly 1:05AM) there were no max_out_times (0) between 1AM and 1:15AM.

(Capacity chart: Max Out readings at 1AM and 1:15AM.)

Checking other graphs in the server logs makes things even more confusing as to why this recursive workflow failed.

Could it be that for some reason the number of page views was so high that something went wrong?

No, because there were 0 page views between 1AM and 1:15AM

Checking the workflow runs, you can see things were going well, and then there was a drop-off.

Why such a dramatic decrease? Shouldn’t the recursive workflow run at the same interval, so that there is a similar number of workflow runs throughout the period it is running?

What other goodies do we get in the server logs to help explain issues like this? Server capacity would be a good place to check; even though I wasn’t maxing out capacity, my usage might help explain it.

Okay, so usage against available capacity was 11.44% at 1AM. Could that really cause the issue? I hope not; using only 11.44% of server capacity shouldn’t cause the app to fail. Then at 1:15AM it was 0.04%, because by that time the failure had already taken place.

So I figured there must have been something wrong with my data, since something off with the data could have been the reason for the failure.

When I logged in today the number of entries stood at 617. Right before I began writing this post I started the recursive workflow again, and now the number of entries is 748. So, no problem with my data.

So, what caused this failure?

Could the main cluster have had some downtime? That must be it.

Oh, no downtime recorded. So what, then, would be the reason this recursive workflow failed to continue?

@eve, any idea who could explain this? I asked in the past on a similar thread and got no response.

@DavidS, is there any particular department at Bubble with members who could help explain these types of issues, or look into them? I am sure it is not really a bug that could be ‘reproduced’, so I feel like the normal bug report would not be a worthwhile channel.

@allenyang, I’ve seen that you have posted in the forum about performance issues; is this the type of issue you may be able to shed some light on?

I’ve experienced this enough times to be very concerned about it.

Hello!

Thanks for posting about this issue; we’re sorry to hear that you’ve run into trouble with your recursive flows! Logging a bug report is the best way to get this issue addressed. Providing our team with time stamps of your logs and details about the process (as you have done in this thread) is generally enough to give our team a place to get started, though do keep in mind that issues that our team cannot reliably reproduce are ultimately very difficult to track down. We do understand that this has been frustrating for you in the past – it’s just as frustrating for us, too!

Thanks in advance for filing the report. :slight_smile:

That’s why the workflows will only run while you’re looking at the screen.

Just kidding :joy:

Man, this is such a thorough topic. I really hope you get some answers from the team about this, because looking at the data, everything seems right.

Until you get a better solution from the team: some people here in the forum do mention creating auxiliary workflows, which check progress and, in some cases, restart the loop in case of failure, though that is a bit dangerous in my view.

Edit: Yup, indeed you got an answer in ninja time. The team beat me to it :yum:

Curious here as well, @boston85719. One question I would have is: where in the workflow are you rescheduling it?

If you are rescheduling it at the end and any of the actions fails, then it wouldn’t schedule the next iteration. But that would be impossible to know without having built a manual error-checking flow, as @vini_brito mentioned.
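Roughly, the error-checking idea works like this. Below is a minimal sketch in plain Python rather than Bubble actions, just to show the shape of it; the names (JobState, watchdog, reschedule_loop) are hypothetical, not anything Bubble provides. In Bubble terms the pieces would be a count from a search, something small to remember the previous count, and a Schedule API Workflow action on the first step. The checker runs on its own schedule, compares the current processed count against the count recorded at the previous check, and only restarts the loop if progress has stalled short of the total, starting from the first unprocessed entry.

```python
from dataclasses import dataclass

@dataclass
class JobState:
    last_seen_count: int   # entries processed as of the previous check
    total_expected: int    # entries the loop should eventually create

def watchdog(state: JobState, current_count: int, reschedule_loop) -> JobState:
    """Run on its own schedule (say every 15 minutes), independently of the loop."""
    done = current_count >= state.total_expected
    stalled = current_count == state.last_seen_count
    if not done and stalled:
        # Restart from the first unprocessed entry only, so nothing gets
        # created twice even if the loop is restarted more than once.
        reschedule_loop(start_at=current_count)
    return JobState(last_seen_count=current_count,
                    total_expected=state.total_expected)

# Example: the loop created 617 of 5,000 entries overnight and then stalled.
state = JobState(last_seen_count=617, total_expected=5000)
state = watchdog(state, current_count=617,
                 reschedule_loop=lambda start_at: print(f"restarting at entry {start_at}"))
```

Because the checker is scheduled separately, a dropped iteration in the main loop doesn’t take it down with it, although it does still depend on the same scheduler.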

So this is a four-step recursive workflow.

Step 1 is to create the retailer-listing by taking the data from another data type. At the end of this workflow it triggers the next step with a 2-second waiting period.

Then step 2, with a one-second wait before triggering the next step. I figured the wait isn’t strictly necessary, since the trigger shouldn’t fire until the first event in the workflow is complete, and the next step doesn’t rely on any finished data entries from this step anyway.

3rd Step

And finally the fourth step, in which the final event is the recursive portion: triggering the first API workflow in the series again.

This fourth step in the series is actually the last workflow that started running, at 1:04:55AM, which leads me to believe the final event that triggers the first API workflow in the series did not occur for some reason.
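To make the shape of the chain concrete, here is a rough sketch of it in ordinary Python rather than Bubble actions. The function names, the remaining counter, and the schedule helper are illustrative stand-ins for the Schedule API Workflow action (a real scheduler queues tasks instead of nesting calls like this), and the waits on steps 3 and 4 are assumed for the illustration since I only noted the 2-second and 1-second waits above.

```python
import time

def schedule(step_fn, delay_seconds, remaining):
    # Stand-in for scheduling the next backend workflow after a short wait.
    time.sleep(delay_seconds)
    step_fn(remaining)

def step_1(remaining):
    # Create the retailer-listing from the source data type (omitted here),
    # then trigger step 2 with a 2-second waiting period.
    schedule(step_2, 2, remaining)

def step_2(remaining):
    schedule(step_3, 1, remaining)   # 1-second wait before step 3

def step_3(remaining):
    schedule(step_4, 1, remaining)   # wait assumed for illustration

def step_4(remaining):
    # The recursive portion: the LAST action reschedules step 1.
    # If an earlier action in this step errors, or the scheduler drops this
    # one task, no further iteration is ever created and the loop ends
    # silently at whatever entry it reached.
    if remaining > 1:
        schedule(step_1, 1, remaining - 1)

step_1(3)   # run the chain for a handful of entries
```

Which is exactly why the last logged run being step 4 at 1:04:55AM points at that final scheduling action, or the scheduler’s handling of it, as the place where the loop broke.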

What would a manual error-checking workflow setup entail? I could only imagine that the error-checking workflow would fail at the same time, and I really don’t have a clue what the setup of such a thing would look like in Bubble.

But the auxiliary workflow @vini_brito mentioned seems like a dangerous ‘backup’ plan, as I could imagine it causing data redundancy, and it would still be susceptible to these same types of failures as well.

It just makes me nervous about trying to build a business that relies on data integrity, where things like invoices and inventory updates could fail with no real understanding of why the failures take place, especially when they can’t be reproduced and therefore aren’t supported by Bubble support.

As @eve mentioned, issues that the team cannot reliably reproduce are ultimately very difficult to track down.

Before posting this, as I stated, I started the recursive workflow again; at this point the total number of entries is 4,493. So it is pretty clear I can’t even reproduce the failure, and with no indication from the server logs of a possible cause, it seems impossible to figure out.

The same was true of my first post on this subject a couple of months ago. I had the failure during the night, came back the next day, ran the recursive workflow again, and it worked.

So these seem to be the intermittent, non-reproducible, unexplainable failures that are dreadful because I can never really mitigate them, short of the risky backup plan of auxiliary workflows; and I really shouldn’t need a ‘duplicate’ workflow in case of failure.

Even so, I guess I will file a bug report, as it seems like the only option if I want an explanation and a resolution that avoids this in the future.

Got a reply from support:

I investigated the logs and it seems that at 2:04:55 pm EDT on 07/13/20 (which I assume is the 1 am in your timezone), the workflow scheduled another one to run, but that one was never run by our scheduler and seemingly disappeared. I have reported a bug to our engineering team for further investigation this week.

So it is great that they could look into the workflows with a little more insight than I have and see that the scheduled workflow seemingly disappeared.

Hopefully they will dive into the root cause of it and make any adjustments necessary.


That’s unfortunately something I’ve also experienced, and reported more than a year ago. The recursive workflow just stopped. It only happened a few times, but it did. Capacity was well below 100%.

