Multiple serious issues with bulk data processing

As you may have seen from my previous post, I am currently uploading a lot of data from an external database into Bubble as a migration to a “nocode” tool.

Because the migrated tables are relational (unlike Bubble), it is very difficult to build efficient queries, so we needed to populate some extra data.

Over the course of this, I have had to raise a number of bugs on this bulk process that Bubble offers. In short, I have just not been able to do what I needed to do in any kind of reasonable timeframe.

  1. “Fetching List” is constantly displayed when trying to run a bulk action
  2. Unable to search in data tab on large amounts of data - you get “an error occurred”
  3. Bulk operations confirmed as “1 per second” no matter what plan. 250,000 rows of data will therefore take around 70 hours to process (quick arithmetic below).
  4. Field “is empty” does not trigger when the field is a reference to another thing, and that thing has been deleted. You now cannot select those rows as they are both “not empty” and don’t have anything in them.
  5. Same 20 rows are being selected when doing items until # 20 with a filter
  6. Cancelling “ALL” workflows does not work unless you pause them first
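
For anyone who wants to check the maths on point 3, the 70-hour figure is nothing more exotic than the confirmed rate multiplied out:

```python
rows = 250_000
rate_per_second = 1            # the confirmed bulk-operation throughput, on any plan

hours = rows / rate_per_second / 3600
print(f"{hours:.1f} hours")    # ~69.4 hours, i.e. roughly three days of continuous processing
```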

Yes, there are a lot of rows, but nowhere does it say “don’t use the data tab if you have a large table” or “don’t use bulk actions”, etc.

Bulk data operations in Bubble are not fit for purpose. You simply have no way at all to process operations on large amounts of data.

This does NOT mean that Bubble cannot handle volumes of data; it does that well. What it does not do is allow you to manipulate that data from inside the tool.

@malcolm and @josh need to be upfront about the limitations here, and about what, if anything, they intend to do about them. I raised issues with the data tab 5 years ago (which are still not fixed) and nothing really seems to have changed since then, apart from the really good CSV uploader.

These are seriously limiting issues with the platform - and they go well beyond the “we do appreciate this feedback on the process” that I received from @eve last week.

As it stands at the moment, I could never recommend that anyone use Bubble for large data operations, despite what their technical team will tell you. Look at Xano (or similar) instead.

A sad few weeks for me here, I really thought Bubble was better than this :frowning:

14 Likes

Sadly, we can confirm the same issues, and this is something that Bubble has never done well. I would hope this will get more attention as more and more mature products are built on Bubble and these problems become more evident.

If a service like Xano offers a faster backend than Bubble native, you know you have a problem.

2 Likes

Yeah, we also never recommend using Bubble for large data processing (or even medium).
The best way is to process the data outside of Bubble and then display it via API/SQL.
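
To give a rough (and simplified) idea of what that looks like: the heavy lifting happens in your own database, and Bubble just reads the results through the API Connector. Everything below - the table, columns and route - is made up for illustration:

```python
# Minimal sketch: expose pre-aggregated results from an external SQL database
# so Bubble only has to display them. Table and column names are hypothetical.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/report/books-per-author")
def books_per_author():
    conn = sqlite3.connect("warehouse.db")
    rows = conn.execute(
        "SELECT author_name, COUNT(*) AS book_count "
        "FROM books GROUP BY author_name ORDER BY book_count DESC LIMIT 100"
    ).fetchall()
    conn.close()
    return jsonify([{"author": name, "books": count} for name, count in rows])

if __name__ == "__main__":
    app.run(port=8000)
```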

1 Like

As a non-technical person, can one of you kindly give a real-world example or two of large data processing that Bubble does not do well, so that I can know whether it applies to the app I am creating?

  1. Migrate some data into Bubble using the (excellent) CSV importer, then run something to create Bubble-style object relations between the two tables. So half a million “book” rows in one table and 200,000 “author” rows in another; use a field on one of them to build the Author(s) <> Books relation (see the sketch after this list).

  2. Try to delete 10,000 rows of data

  3. Anything that goes through large amounts of data and updates counts/totals in one go. Try to keep them updated as you go rather than relying on “batch” work.
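
To give a sense of scale on point 1: done outside Bubble, the author matching is a simple in-memory join over the two CSV files and finishes in seconds, whereas inside Bubble every one of the 500,000 book rows has to be touched by a workflow at roughly one per second. A rough sketch, with made-up file and column names:

```python
# Sketch of the Author(s) <> Books matching done outside Bubble.
# CSV file names and column names are hypothetical.
import csv

# Index authors by name (200,000 rows fits comfortably in memory)
with open("authors.csv", newline="", encoding="utf-8") as f:
    author_id_by_name = {row["name"]: row["unique_id"] for row in csv.DictReader(f)}

# Attach the matching author id to each of the ~500,000 book rows
with open("books.csv", newline="", encoding="utf-8") as src, \
     open("books_linked.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["author_id"])
    writer.writeheader()
    for row in reader:
        row["author_id"] = author_id_by_name.get(row["raw_author"], "")
        writer.writerow(row)
```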

1 Like

This is interesting. At what point does it become unusable, 5000 rows?

Volume itself is not an issue in Bubble. Half a million rows can work quite happily even on a lower tier.

The issues crop up when running workflows on those rows in a short period of time.

Bulk Delete is definitely an issue though, and I have had problems even at 1,000 rows. Don’t try to delete them; add a flag and ignore them.

2 Likes

Ok thanks. I have over 450,000 rows of data and a few fields are nested.

In a few months time I will have to query that data to provide some insights for my company.

@NigelG my initial thinking is to have a separate page in my app and create a reporting dashboard type page.

But I’ve also been thinking that if it is slow and an absolute pain in the ass, I would just download everything as a CSV and use a 3rd-party tool like Excel.

What would you do?

Yeah, it takes a few hours just to bulk delete 5,000 items O.o

You can use the ‘copy from live to dev’ as a trick.

Ahh, good idea. We are not live yet but will note this. Thanks Nigel

1 Like

Here are the techniques we have used to data process on the Bubble server. Using your examples:

  1. For the book importing, we create empty author and book things, where book contains the fields raw author (as a text) and book author (as a reference to author). We then add a trigger on any change to raw author that looks up the author and sets book author to the best match. We upload all the authors first; once that is complete, we upload all the books.
  2. For bulk deleting we go old school/low level, using a deleted flag and behind-the-scenes garbage collection. We hide any records that have the deleted flag in the UI/UX and have triggers that run in the background to eventually get around to deleting them. This is basically how on-disk delete operations work in classic RDBMSes, except those are much faster.
  3. For statistical processing at scale we move off of Bubble and use Map/Reduce techniques on external services like AWS Lambda. All Bubble has to do is respond through its native Data API (rough sketch below).
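
To make point 3 concrete, here is roughly what the external side looks like: a script pages through the Data API with cursor pagination and keeps a running aggregate, so no workflows run inside Bubble at all. The app name, data type, field and token below are placeholders, and the response shape follows the Data API format as we understand it:

```python
# Sketch: aggregate a numeric field over a large Bubble table from outside Bubble,
# paging the native Data API. App name, type ("order"), field ("amount") and the
# API token are placeholders.
import requests

BASE = "https://yourapp.bubbleapps.io/api/1.1/obj/order"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

total = 0.0
cursor = 0
while True:
    resp = requests.get(BASE, headers=HEADERS, params={"cursor": cursor, "limit": 100})
    resp.raise_for_status()
    body = resp.json()["response"]

    # "map" step: pull the field we care about out of each record in this page
    total += sum(float(item.get("amount", 0) or 0) for item in body["results"])

    # "reduce" step is just the running total; stop when nothing remains
    if body.get("remaining", 0) <= 0:
        break
    cursor += body["count"]

print(f"total amount: {total}")
```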

Maybe @malcolm can provide an engineering explanation as to why Triggers seem to run much more robustly than scheduled APIs for bulk data processing? I suspect the call stack for Triggers sits DB side and is much larger and more stably supported, and that the call stack for Workflows sits Fibers side and is smaller and more brittle.

It is worth noting that the internet is not nearly as fast as we perceive it.

XKCD File Transfer

2 Likes

Thanks @aaronsheldon that is really interesting.

We still managed to get triggers to fail at scale.

Yes, that is definitely the way to go here. In our case we had 100,000s of rows generated in test by API calls and nested recursive workflows. There was duplicate data, so we really needed to clear it down regularly.

2 Likes

Love this :slight_smile:

Several years ago, a large financial company with very tight InfoSec told me they had shut down file transfers of any type by any method.

Took me 20 seconds to go to file pizza and transfer off a dummy “secret” work doc.

2 Likes

Wow, triggers failing at scale. And they are the better alternative…?!

That said, I elided much of the discipline around our design pattern for triggers:

  1. We moved everything off of cascading scheduled workflows, absolutely everything.
  2. We insulate trigger code within a scheduled workflow; the only thing our triggers do is call exactly one scheduled workflow.
  3. We only pass “thing now” from the trigger to the scheduled workflow. This appears to commit the database transaction of the trigger - no reversies, no take-backs.
  4. Our triggers are tightly conditioned on “thing before’s some field is not thing now’s some field”; we specifically detect and run against one field change at a time.
  5. When possible we time-delay the scheduled workflow to space consecutive triggers out. Note that we still have multiple triggers running in parallel; this just introduces a rate limiter into triggers cascading to other triggers.
  6. Any field in a thing is only ever written once by the triggers.

Overall we use triggers in a “calculated field” design pattern. It took months of refactoring to get to this highly disciplined approach, but it was absolutely necessary because we have to interface with a number of archaic legacy systems that do horrible things like send us emails of PDFs of faxes of images of documents.
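
If it helps to see the pattern outside of Bubble terms, here is the same discipline sketched in ordinary code (nothing below is Bubble syntax; the names and the toy scheduler are purely illustrative):

```python
# Sketch of the "calculated field" trigger discipline in ordinary code.
from dataclasses import dataclass

@dataclass
class Thing:
    first_name: str
    last_name: str
    full_name: str = ""          # the derived ("calculated") field

pending_jobs = []                # stand-in for Bubble's scheduled-workflow queue

def schedule(job, thing, delay_seconds):
    """Rules 2, 3 and 5: the trigger hands off exactly one delayed job and nothing else."""
    pending_jobs.append((delay_seconds, job, thing))

def on_thing_changed(before: Thing, now: Thing):
    """Trigger body, tightly conditioned on a single field change (rule 4)."""
    if before.last_name != now.last_name:
        schedule(recompute_full_name, now, delay_seconds=5)

def recompute_full_name(thing: Thing):
    """Scheduled workflow: the only place the derived field is ever written (rule 6)."""
    thing.full_name = f"{thing.first_name} {thing.last_name}"

# Simulate one field change and drain the queue
before, now = Thing("Ada", "Byron"), Thing("Ada", "Lovelace")
on_thing_changed(before, now)
for _delay, job, thing in pending_jobs:
    job(thing)
print(now.full_name)             # "Ada Lovelace"
```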

NORM Normal File Format

…healthcare IT is why we can’t have nice things…

Backups

1 Like

Just to add 2 cents to the convo here.

I say take your data off Bubble and harness AWS or Firebase for large data. Heck, why not auth too? The price for this extra data storage/retrieval capacity? A few dollars a month (or less) for most apps that aren’t huge.

The initial investment is not cheap, but you can get away from the limitations and really let the data flood in and be processed very quickly.
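
As a rough illustration of the AWS route (the table name, key and CSV columns below are invented), bulk loads go through batched writes instead of one workflow per row:

```python
# Sketch: bulk-loading rows into DynamoDB instead of running one Bubble workflow per row.
# Table name, key and CSV columns are hypothetical; boto3 batches the writes for us.
import csv
import boto3

table = boto3.resource("dynamodb").Table("books")

with open("books.csv", newline="", encoding="utf-8") as f, table.batch_writer() as batch:
    for row in csv.DictReader(f):
        batch.put_item(Item={
            "book_id": row["id"],        # partition key
            "title": row["title"],
            "raw_author": row["raw_author"],
        })
```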

1 Like

Bumping this as it has been over a year since @NigelG brought this up – it is still a major problem.
I’m in the process right now of running a Bulk operation on 4,700 records to update one field and it is painful - it still locks up the UI and prevents you from doing anything else. Most of the time I use a recurring scheduled API workflow… which is slow itself; however, I figured I’d give the Bulk operation another shot. Mistake. Time to go grab coffee… and perhaps lunch… maybe dinner… and I will keep my laptop on overnight. :upside_down_face: