I’ve been doing some data migration for a client project and learnt something today which I want to share.
SUMMARY
I am linking a large data type (1.4M records) to a smaller one (100 records). The large data type has a text field which I use to search the smaller one, find the matching record, and then link the two by updating a new field on the large data type.
CHALLENGES
First of all, the Bulk option in the Data tab cannot handle that many records at a time, so I wrote a backend workflow and chunked the data into 100K batches using a SAWOL (schedule API workflow on a list), which I manually triggered 13 times. A bit of a pain, but it worked nevertheless.
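For anyone curious, here's a rough sketch of what that chunking amounts to (plain Python, purely illustrative - the helper names are made-up stand-ins for Bubble's search and SAWOL steps, since in practice this is a no-code backend workflow):

```python
# Illustrative only: split ~1.4M records into 100K chunks, each chunk
# handed to one "schedule API workflow on a list" (SAWOL) run.
BATCH_SIZE = 100_000

def migrate_in_batches(fetch_unmigrated, schedule_workflow_on_list):
    """fetch_unmigrated(limit) and schedule_workflow_on_list(batch) are
    hypothetical stand-ins for a Bubble search and a SAWOL step."""
    batch_number = 0
    while True:
        batch = fetch_unmigrated(limit=BATCH_SIZE)
        if not batch:
            break  # nothing left to migrate
        batch_number += 1
        schedule_workflow_on_list(batch)  # in my case, triggered by hand each time
        print(f"Scheduled batch {batch_number} ({len(batch)} records)")
```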
LEARNING
While this data manipulation was going on, I tried to upload data into another two small data types (about 400 records in one and 1,200 in the other). The upload would not process: I tried multiple browsers, and while the file upload step would complete, the Processing step wouldn't even begin. It took me a while to realize that processing did not start because a 100K migration batch was running. I accidentally (and painfully) found out that when a 100K batch wasn't running, the upload would work normally.
SOLUTION
The simplest solution would be to offer a warning message indicating that the database is too busy and processing is delayed until a later time. Of course, the ideal, longer-term solution is for the database to get a load of steroids (and some love) so it can process multiple large dataset operations at the same time.
Hope this finding helps someone save time and frustration.
Yes, if you are doing anything intensive (even a text search), doing 100k tranches at once will bring everything to a halt. Usually it’s better just to do 28 batches of 50k. It is also probably better not to do it manually and instead use a recursive WF (which allows you to spread it out without becoming impatient).
However, in this case, it seems like SAWOL on each record (in the large dataset) isn't the best way to go. Assuming I'm reading it right, there are on average 14k matches in the larger dataset for each record in the smaller dataset. Therefore, I would just run it from the small list: for each of the 100 records, find up to 9,900 records in the large dataset that have that record's text field and update them in one fell swoop. This would decrease the number of searches required to update the 1.4 million records by roughly 99.986%.
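To make that concrete, here's a rough sketch of the two approaches (illustrative Python only - the search/update helpers are hypothetical stand-ins for Bubble searches and "make changes to a list"):

```python
# Approach A: run from the large list - one search per large record (~1.4M searches).
def link_from_large(large_records, find_small_by_text, link):
    for record in large_records:                       # ~1.4M iterations
        small = find_small_by_text(record.search_text)  # one search each
        link(record, small)

# Approach B: run from the 100 small records instead. Each matches ~14k
# large records, and a bulk update handles up to ~9,900 at a time, so each
# small record needs only a couple of passes (~200 searches total).
def link_from_small(small_records, find_unlinked_large, bulk_link):
    for small in small_records:                        # 100 iterations
        while True:
            matches = find_unlinked_large(text=small.search_text, limit=9_900)
            if not matches:
                break
            bulk_link(matches, small)                  # one fell swoop
```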
As much as Bubble advocates say that the database can handle this amount of data, I have never had a good experience with very large databases in Bubble. I believe migrating to a more robust solution like Supabase or Xano is a better path than insisting on using the Bubble database.
Yes, I agree. For this level of data you'd be best using a 3rd-party database:
- it will be cheaper for bulk updates because they don't charge WUs
- it will be significantly faster to update bulk data
- you can more easily view/edit the data (Bubble's backend data tables are sooo limited!)
I hope that by linking 1.4 million records to 100 records you don't mean storing the records as a list on the 100… that would be quite inefficient. You'd be better off either creating a connector data type or storing a reference to one of the 100 records on each of the 1.4 million records.
A recursive workflow would also be much better here, as API workflows on large lists can be quite unreliable. If it is done recursively, you just repeat it until the recursion completes - adding a temporary field to track which records have been processed is a good way to manage it.
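Roughly the shape of that recursion, if it helps (illustrative Python; in Bubble it's an API workflow that reschedules itself, and the "processed" field is the temporary tracking field mentioned above):

```python
# Recursive-workflow sketch: process a chunk, flag it, reschedule yourself
# until the search for unprocessed records comes back empty.
CHUNK_SIZE = 50

def recursive_link(find_unprocessed, link_and_flag, reschedule_self):
    batch = find_unprocessed(limit=CHUNK_SIZE)  # only records without the flag
    if not batch:
        return                                  # recursion complete
    for record in batch:
        link_and_flag(record)                   # link it and set the temporary "processed" field
    reschedule_self()                           # Bubble equivalent: Schedule API Workflow (this workflow)
```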
Interesting take. I'm not sure whether that would result in more searches or fewer. Rather than reducing the number of searches, I preferred to keep the search volume (bytes returned from the server) low.
If only I could test both approaches and see exactly which is better; it looks like a large amount of data either way.
On another note, I have found SAWOL to be more reliable than a recursive WF with the interval left to Bubble.
Yes, the client is now asking to do this.
Heck no! That many records as a list would be a disaster.
SAWOL seems to be quite reliable at 100K records, and I am happy to avoid the pain of writing recursive workflows.
As I said, it will reduce searches by roughly 99.986%, so it will definitely REDUCE the number of searches (back of the napkin: ~200 searches vs 1.4M - and even assuming a small record two standard deviations out matches under 30k large records, it still reduces searches by 99%+ either way).
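Written out (assuming ~2 passes per small record, since 14k average matches at up to 9,900 per pass):

```python
searches_per_record = 1_400_000   # one search per large record
searches_from_small = 100 * 2     # ~2 passes per small record ≈ 200 searches
reduction = 1 - searches_from_small / searches_per_record
print(f"{reduction:.4%}")         # ≈ 99.9857%
```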
You can do whatever you prefer, obviously, but if you want to be able to upload data or use the app during such updates, then reducing the number of searches is the way to go, and it will save you big time on WU as well. The number and type of searches are what matter from a performance perspective; in this situation, the data returned by the server is still relatively tiny either way.
I disagree. Bubble can easily manage a few million records. If each record is substantial, breaking up the dataset could be beneficial, but Bubble can handle that data volume and then some. Your update seemed to be a one-time batch; if such updates were frequent, couldn't be scheduled during downtime, and were resource-intensive, then Bubble might not be optimal.
More importantly, if Xano/Supabase is the right solution → the right move is to abandon Bubble entirely.
Any argument for staying with Bubble, given its current state and direction, that isn't based on resisting change or the sunk-cost fallacy only applies if you'll use Bubble's greatest strength - a fully customizable backend and frontend in one solution. If you need to stop using either Bubble's backend or its frontend for an app, it's time to move that application off Bubble completely.
There is no pain in writing a recursive workflow. You can trigger it manually on any given record and it executes just the same.
This math doesn't add up. I ran some numbers based on the record size, and whether you search the large dataset first or the small one, the end result is that the WU cost is very close - almost identical. This is because the surface area of the searches in both methods is exactly the same. I don't think searches, or WU, will drop by nearly 100% with the alternate method.
I disagree. I wouldn’t move a large, complex app to another platform solely for database performance reasons. Rebuilding on another platform entails rebuild costs and regression risk, which I would never advise anyone to undertake. Don’t throw the baby out with the bathwater.
I am fascinated by performance engineering and would love to have a friendly chat on zoom to go over the use case, and the numbers. I am based in Toronto (EST) and would be more than happy to arrange a “Database Performance - Learning by example” session.
The number of searches is only a small part of the cost. A search is 0.3 WU, but updating a thing is 0.5 WU, running a server-side action is 0.6 WU, scheduling a workflow is 0.1 WU, etc.
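As a rough illustration (assuming, for simplicity, that a per-record run pays for the schedule, the server-side workflow run, one search and one update - real WU bills also include per-item and per-character charges that this ignores):

```python
# Per-action rates from above
WU_SEARCH, WU_UPDATE, WU_ACTION, WU_SCHEDULE = 0.3, 0.5, 0.6, 0.1

per_record_cost = WU_SCHEDULE + WU_ACTION + WU_SEARCH + WU_UPDATE  # 1.5 WU
search_share = WU_SEARCH / per_record_cost
print(f"the search is ~{search_share:.0%} of each record's cost")  # ~20%
```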
Obviously it's not the whole cost, but by essentially removing the search costs you're removing ~20% of the cost and minimizing server resources. All in ~5 mins, with no downside.
Not what I said. I said Bubble can easily handle millions of records, BUT IF you're going to switch to an external DB, which entails rebuilding connections etc., then ditch Bubble altogether, as you're probably going to have to do so within a few years anyway.
Again, I'm not talking about the surface area, merely the NUMBER of searches decreasing. And it makes a major impact on WU (see below).
What you're saying makes sense @georgecollier, but as far as I'm concerned, WU isn't really a mathematical exercise, as there are so many intangibles. I'm definitely not interested in getting into the weeds of WU, but here's a quick example.
Uploaded 4k records to the large dataset with a search text field (one of "eat", "sleep", "play", "repeat") and a bulk wf text field (either "small" or "large", 2k apiece).
Created 4 small dataset records (eat, sleep, play, repeat).
Created two WFs to link each large dataset record to a small dataset record:
– Link by Large: SAWOL on all large dataset records not linked to a small dataset where bulk wf = "large" (2k records)
– Link by Small: SAWOL on all small dataset records
— for each small dataset record, find all large dataset records where the linked Small Dataset field is empty, bulk wf = "small", and search text = this small dataset record's search text (also 2k records)
So first of all, let me say that the WU numbers are all over the map when I run a different number of records (let's say 600 instead of 2k).
But even the instance where the WU counts for the Link by Large WF and the Link by Small WF were closest still had a significant gap (more than I expected, and more than one would expect based on @georgecollier's breakdown - which is technically correct, but in real life…). In that instance, the Link by Large WF used 2,811 WUs to update 2k records, while the Link by Small WF used 1,038 WUs to update the same 2k records - 63% fewer workflow units!
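For reference, that gap works out as:

```python
link_by_large_wu = 2_811
link_by_small_wu = 1_038
print(f"{1 - link_by_small_wu / link_by_large_wu:.0%} fewer WUs")  # ≈ 63%
```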
Can you flesh out how you found the WU costs to be similar? They shouldn't be similar based on Bubble's WU metrics, and while those are often hard to replicate in real life, I have found WU to decrease substantially, as noted above.