Mysterious Workload/Workflow Issues

So we have been seeing an enormous number of Workload Units being consumed by one specific step in one specific workflow.

Initially, we thought it was because we were doing multiple searches of large datasets (60,000 entries at the time of writing) in that one step. Obviously room for improvement, right?

So we optimized, used data triggers to dynamically gather data to reference instead of searching for entire lists, etc…

We eventually got the search down to one single search with filters/conditions. However, this really doesn’t seem to have helped.

Below are screenshots of our app metrics. Yes, you're reading that correctly: 10,000-20,000 WU in a single minute is not uncommon if 5-10 users are actively using this one feature/step at a time.



And here’s the single step that app metrics tells us is using the vast majority of the Workload Units:

The singular datatype search we managed to condense the step down to must be the problem, right?

The “Lead” datatype is being searched. The Lead datatype has 275,000 entries in total, but the filters/constraints bring that down to fewer than 10,000, and then we just grab a :random item from that list. We are not using any “:count” endings for the lists/searches, which I know can drastically increase Workload Units…

The strange thing is that the workload units have already drastically decreased from the optimization we’ve done (not nearly enough, but definitely decreased). The one issue that still remains is the total number of workflow runs. We have a loop step at the end with a 3-second delay, a hard constraint, and also a “result of step…” in one of the fields (which I believe forces it to wait for that step to finish?).

To me, it seems that step 1 (which has the initial search) is not returning a result and therefore looping thousands of times until it does. The rest of the steps in that workflow have a condition (do not continue if result of step 1 is empty) which may be why they also aren’t running thousands of times, but if step 1 is actually returning an empty result we should DEFINITELY see that in the server logs?

We’ve also added a step to change a specific field in the datatype entry that step 1 creates IF the result of step 1 is empty. That way we can see new entries being created in the database if the result of step 1 is empty - but there’s nothing being created, which I would assume means that this is not the issue…

How many results would be returned from Search for Ringing Calls :each Item’s Lead’s unique id? And how heavy are the Ringing Calls and how heavy are the Leads?

That one expression will need to retrieve all fields for every single Ringing Call, and retrieve all fields of all of those Leads, just to get the unique IDs you’re looking for. (Normally relationships aren’t formed using unique id, but with the actual Bubble Thing, for this reason.)
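To make the data-weight point concrete, here is a rough Python sketch (not Bubble's actual engine; record shapes and field counts are invented) of how matching on unique id forces every full Lead record to be loaded, versus storing the Lead reference directly on the Ringing Call:

```python
# Toy records: a Lead has ~100 fields; a Ringing Call has a handful.
lead_fields = {f"field_{i}": "value" for i in range(100)}
leads = {f"lead_{n}": dict(lead_fields, unique_id=f"lead_{n}") for n in range(10)}

ringing_calls = [{"lead_ref": f"lead_{n}", "status": "ringing"} for n in range(5)]

# Nested-search style: fetch each full Lead record just to read its unique id.
fields_loaded_nested = 0
excluded_ids = []
for call in ringing_calls:
    lead = leads[call["lead_ref"]]      # full ~100-field Lead record retrieved
    fields_loaded_nested += len(lead)   # 101 fields each (100 + unique_id)
    excluded_ids.append(lead["unique_id"])

# Direct-reference style: the Ringing Call already stores the Lead reference,
# so no Lead record needs to be fetched at all.
excluded_ids_direct = [call["lead_ref"] for call in ringing_calls]
fields_loaded_direct = 0

print(fields_loaded_nested, fields_loaded_direct)  # 505 0
```

Same exclusion list either way; the nested version just pays for every field of every Lead to get it.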

Try removing that constraint and seeing if anything changes.


As pointed out your nested searches are gobbling up your WUs. Can’t you grab those list values as a WF parameter instead?

Good question. That’s one of the optimizations we did. The “Ringing Calls” datatype is only ever 5-20 entries max at any given time (it might go up to 100 if we scaled massively).

What we’re doing there is just a data trigger that creates a Ringing Calls entry when the call is initiated, and then deletes the entry when the call is connected. So that search is literally only covering a handful of entries at any given time.

Ringing Calls datatype - 4 fields
Lead datatype - 100 ish fields

I feel the most important thing to notice here is the sheer number of workflows being run. This workflow is running thousands of times, but only 50-100 “campaign call logs” are actually being created. This tells me there’s an issue with the actual “Create campaign call log” step, in which it’s looping a ton but not producing a result… which in turn causes the large Workload Unit usage.

We’ve done tests to see if we can generate some sort of indication if Step 1 is returning an empty result or field, but we aren’t getting anything.

The only nested search there, for Ringing Calls, covers only a handful of entries (5-10 entries in total at any given point in time).

That size of a list wouldn’t gobble WU’s would it? The lead datatype itself has about 100 fields though - not sure if that would change things?

The size of the returned data does affect WU usage, so having 100 fields does result in increased WUs.

You should create a lighter version of the Leads datatype or use API calls to return just the data you need.

Would the large amount of fields in the leads datatype cause an issue with the search itself and cause the workflow to run thousands of times per minute with a majority of searches returning empty results? That’s the primary issue it seems.

For example during a peak usage hour yesterday, there were 22,000 workflow runs in which the search step used about 8,500 workload units.

This is telling me the workload units per run aren’t actually that much, but it’s the number of workflow runs that is the issue:


To add more context to that, there were only 100-200 campaign call logs datatype entries created during that hour. The way the workflow is set up, it should be creating a campaign call log datatype entry for every 3-4 workflow runs. This tells me workflows are running, but nothing is produced from a majority of them.

UPDATE

Okay, so we’ve definitely made progress here (see the updated search structure below; we deployed at 1pm and there’s a very noticeable difference in Workload Unit usage). However, I still think there is a ton more optimization that can be done, specifically by potentially creating a “lite” version of the Leads datatype (which is currently at 80,000 records and has about 100 fields).


The “Ringing Calls” datatype is extremely light (5 fields) and there are only ever 5-10 entries at any given time (entries are deleted automatically after 45 seconds).

The real trick now will be figuring out how to create a lite version of the Leads datatype which only has the fields required for the search and dynamically updates those fields based on its parent lead’s data. Potentially just purely through data triggers? However, I’m thinking that those data triggers alone will then start consuming an enormous amount of workload units, because they will need to fire whenever one of the fields that the lite Leads datatype uses changes…
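For what it's worth, here is a hedged Python sketch of that data-trigger idea. The field names and trigger shape are assumptions, but it shows the condition you'd want on the trigger so it only writes to the lite copy when one of the mirrored fields actually changed (untracked edits cost nothing):

```python
# Hypothetical field names -- just the subset the search actually filters on.
SEARCH_FIELDS = {"campaign", "dnc", "status"}

lite_leads = {}  # unique_id -> lite record


def on_lead_changed(lead_before, lead_now):
    """Trigger body: sync the lite copy only if a watched field changed."""
    changed = {f for f in SEARCH_FIELDS if lead_before.get(f) != lead_now.get(f)}
    if not changed:
        return False  # trigger condition is false: no write, no WU spent
    lite_leads[lead_now["unique_id"]] = {f: lead_now[f] for f in SEARCH_FIELDS}
    return True


before = {"unique_id": "L1", "campaign": "A", "dnc": False,
          "status": "new", "notes": "x"}

edit_untracked = dict(before, notes="y")       # only an untracked field changed
print(on_lead_changed(before, edit_untracked))  # False: lite copy untouched

edit_tracked = dict(before, status="called")   # a mirrored field changed
print(on_lead_changed(before, edit_tracked))   # True: lite copy updated
```

The lite record ends up with only the three search fields, so the heavy 100-field parent is never touched by the search itself.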

Anyone have any creative ideas to create a “lite” version of a datatype that dynamically updates?

From the screenshot, I’d say the biggest problem areas are the search constraints with “doesn’t contain” and “is not in”; they cause a full table scan, and there are three of them.

So the search has to examine this number of rows:

nbr leads × nbr campaigns
+ nbr leads × nbr campaigns
+ nbr leads × nbr ringing calls

Could the pool of available leads be determined ahead of the search?

Also, “:random item” is not very efficient, although on just 100 rows it may not be a problem here.

(Edit) The DNC constraint is now the opposite of the first screenshot, is that intended?

(Edit) see below for partial retraction. Note to self: brain function is low at 2am :crazy_face:

Good catch on the DNC. I didn’t even notice. Thankfully, we only had the first screenshot’s search structure live for a few minutes haha…

Interesting about the “doesn’t contain” and “is not in”. What is the best-practice alternative to that? For the “unique id isn’t in Search for Ringing…” constraint, could we instead just do a “minus list” on the list of leads, rather than having it as a nested search?

The search’s only real function is to find the pool of available leads. The issue is that the pool is extremely dynamic, which is why we need to run a search every single time we initiate this workflow (every time we want to make a call, we need to find the next lead to call). It’s not a static list, unfortunately.

Yeah, :random item is the only possible way we could figure out to avoid the race condition issue in Bubble. Before, we were taking the first item in the list, and this caused a massive amount of race conditions, resulting in the same lead being found over and over and over again, which 100x’d the workflow runs/units. :random item does the job perfectly, and we’re also going to test dropping the list down to 30-50 items to see if that has a positive impact on the workload units.
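A toy Python model of that race (purely illustrative; Bubble's real concurrency is more involved): several concurrent runs all see the same candidate list before any claim is written back, so :first item hands every run the same lead, while :random item spreads them across the pool:

```python
import random

random.seed(7)  # fixed seed so the sketch is deterministic

candidates = list(range(100))  # the ~100 leads the search returns

concurrent_runs = 10  # runs that all search before any result is saved
first_picks = [candidates[0] for _ in range(concurrent_runs)]
random_picks = [random.choice(candidates) for _ in range(concurrent_runs)]

# ":first item" gives every concurrent run the identical lead...
print(len(set(first_picks)))       # 1
# ...while ":random item" almost always spreads the runs out.
print(len(set(random_picks)) > 1)  # True
```

With :first item, every colliding run "claims" the same lead, so each collision becomes another retry, which is exactly the run multiplication described above.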

We’re also going to try this version of the search (see screenshot below) to see if it reduces workload units. The theory is that generating a smaller list initially to filter (Campaign’s Lead Group’s Leads) instead of doing a Search for Leads may massively decrease WUs… but we could be wrong.

Haha, whoops! In my sleepy state I misread the search constraints. Those two “doesn’t contain Campaign” constraints are filtering Lead by a single value, so they aren’t multiplying the scanned rows.

Assuming Bubble applies search constraints in an optimal order:

Lead rows 275k → constrain to 10k → times 5-10 ringing calls
gives a lower bound of 50k scans and upper bound of 100k scans,
reduced an unknown amount over time as indexes are automatically built.
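The bound arithmetic, spelled out (illustrative only; it assumes the constrained list is scanned once per ringing call, before any indexing gains):

```python
# 275k leads constrained down to ~10k by the other filters, then multiplied
# by the 5-10 Ringing Calls the "is not in" constraint checks per lead.
constrained_leads = 10_000
lower = constrained_leads * 5    # 5 ringing calls at any given time
upper = constrained_leads * 10   # 10 ringing calls at any given time
print(lower, upper)  # 50000 100000
```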

Since ringing calls size is so low, perhaps remove it from the constraints where it is a multiplier, and add to the post-search, after the filter, using minus list.

:random item can be replaced with the more efficient :item #, giving it a random 2-digit number (using the formula to create a random string). But I expect minor gains there.
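Both suggestions combined in a small Python sketch (all names illustrative, not Bubble's API): run the heavy search without the Ringing Calls constraint, subtract the tiny ringing-call list afterwards, then fetch a single item by random index:

```python
import random

random.seed(3)  # fixed seed so the sketch is deterministic

searched_leads = [f"lead_{n}" for n in range(100)]  # first 100 search results
ringing_lead_ids = {"lead_3", "lead_7"}             # only ever 5-10 entries

# ":minus list" equivalent -- cheap because the excluded list is tiny.
available = [l for l in searched_leads if l not in ringing_lead_ids]

# ":item #" with a random 1-based index, instead of ":random item".
idx = random.randint(1, len(available))
picked = available[idx - 1]

print(len(available))              # 98
print(picked in ringing_lead_ids)  # False: ringing leads can never be picked
```

The expensive search no longer pays the per-lead "is not in" multiplier; the exclusion happens once, against a list of 5-10 entries.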

Can you explain more about the number of workflow runs? is the peak 22k runs in an hour? Does it make sense from the number of active users etc?

(Edited the numbers)

I also thought of moving the ringing calls to a :minus list. I’ve added that to the next iteration to test.

It’s hard to quantify whether the workflow runs per hour (yes that is 22k per hour at peak usage) makes sense for the number of active users because this is a new feature, but one thing is for sure - it’s not scalable unless we can get the WU’s under control.

Would changing the “Search for…” to the structure screenshotted above (campaign’s Lead Group’s Leads: filtered…) move the needle or am I wrong in assuming that?

It really seems as though the biggest impact would be to create a lite Lead datatype somehow instead of referencing the continually-growing Lead datatype.

Also - is there no way to get a breakdown of workflow runs per workflow in the given time range? I’m only seeing data based on capacity and WU’s.

I believe we have already confirmed that there is no mysterious workflow loop occurring, though. I created a datatype called “Workflow Runs Debug”, and in the workflow I added a step to create a new entry for that datatype with zero conditions. The idea was to compare the # of entries created vs. the # of Campaign Call Logs created (what step 1 in the workflow is meant to create). They were the same, which I believe indicates that there are no wasted workflow runs from empty results somewhere, and confirms that the WU issue is actually just the inefficiency of the search in Step 1.
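That debug check can be sketched like this (hypothetical names; the point is just the count comparison):

```python
# Unconditional per-run counter vs. the conditional "real" output.
workflow_runs_debug = []   # "Workflow Runs Debug" entries, zero conditions
campaign_call_logs = []    # entries created only when step 1 finds a lead


def run_workflow(lead_found):
    workflow_runs_debug.append(1)      # logged on every single run
    if lead_found is not None:         # step 1 returned a result
        campaign_call_logs.append(lead_found)


for n in range(5):
    run_workflow(f"lead_{n}")          # every run finds a lead

# Equal counts => no runs are silently producing empty results.
print(len(workflow_runs_debug) == len(campaign_call_logs))  # True
```

If the loop were spinning on empty results, the debug count would be far larger than the call-log count; matching counts rule that out.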

Here’s the next search structure we’ll be deploying Mon/Tues to compare:

So you did :slight_smile: Apologies I didn’t notice … I think it will make a measurable difference.

It’s hard to predict, because it depends on Bubble’s implementation. You may find the two methods are equivalent, or it may be the case that :filtered misses out on indexing gains, or the upfront reduced record list gains overwhelm other deficiencies.

The data size and processing size of the app may make it a candidate for assistance from a Bubble engineer, they have monitoring tools to help decide on efficiencies.

I agree with @ihsanzainal84’s suggestion.


Slight change for the next iteration: we realized that if we take the first 100 results from the search without first “minusing” the Ringing Calls list, we may end up with an empty result at scale, in the scenario that all 100 lead results found are in the Ringing Calls entries (that would be big scale, but might as well plan for it now).

Also going to play around with the # of “items until”. Not sure if bringing it lower than the current 100 will decrease WUs, but worth a shot. We only need 1 result per WF run, but we need a large enough list that :random item has a low enough probability of pulling the same lead twice, so we don’t end up getting the same result back to back.
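As a rough guide to that tradeoff (ignoring list churn between runs), the chance of a back-to-back repeat from :random item is about 1/N for a list of N candidates, so shrinking the list from 100 raises the repeat odds:

```python
# Approximate back-to-back repeat probability for a list of N candidates.
for n in (100, 50, 30):
    print(f"list of {n}: {1 / n:.1%} chance of a repeat")
# prints 1.0%, 2.0%, 3.3%
```

So dropping "items until" from 100 to 30 roughly triples the repeat probability; whether the WU savings are worth it is the thing to measure.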
