Best Practices for Merging Duplicates in Bubble After API Imports

Hi everyone :waving_hand:

I’m working on a Bubble app that handles tens of thousands of objects (think of them as user profiles, places, etc.).
These objects are automatically created via an external API import, based on fields like name and city, but without checking for duplicates during creation — this is to avoid performance issues and workflow overload.

As a result, we now have many duplicates in the database — objects with the same name (sometimes with slight spelling differences), each linked to other data types (like events, bookings, etc.).

Our goal is not to change the way we import, but rather to handle deduplication afterward, using a cleanup workflow (roughly sketched in code after the list):
• Identify duplicates based on a computed unique_key (e.g. normalized name + city)
• Keep one object (preferably the oldest)
• Reassign all linked objects to this “main” profile
• Delete the duplicates
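
To make the intent concrete, here is the kind of logic we're aiming for, sketched in plain Python. Field names like name, city, created_date and _id are just placeholders, and the actual reads and writes would still have to happen in Bubble (or against an export):

```python
import unicodedata
from collections import defaultdict

def unique_key(name: str, city: str) -> str:
    """Normalize name + city into a dedup key (lowercase, accents stripped)."""
    def norm(s: str) -> str:
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if not unicodedata.combining(c))
        return " ".join(s.lower().split())
    return f"{norm(name)}|{norm(city)}"

def build_merge_plan(records):
    """Group records by unique_key, keep the oldest, map every duplicate to it."""
    groups = defaultdict(list)
    for r in records:
        groups[unique_key(r["name"], r["city"])].append(r)

    keepers, reassign, to_delete = [], {}, []
    for items in groups.values():
        items.sort(key=lambda r: r["created_date"])  # oldest first
        main, dupes = items[0], items[1:]
        keepers.append(main)
        for d in dupes:
            reassign[d["_id"]] = main["_id"]  # relink events/bookings to the main profile
            to_delete.append(d["_id"])
    return keepers, reassign, to_delete

# Example with made-up rows:
records = [
    {"_id": "a1", "name": "Café Lumière", "city": "Paris", "created_date": "2023-01-02"},
    {"_id": "b2", "name": "cafe lumiere", "city": "Paris", "created_date": "2023-05-10"},
    {"_id": "c3", "name": "Blue Bar",     "city": "Lyon",  "created_date": "2023-03-15"},
]
keepers, reassign, to_delete = build_merge_plan(records)
print(reassign)   # {'b2': 'a1'}
print(to_delete)  # ['b2']
```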

Questions:
1. What are the best practices in Bubble for managing duplicates at scale (10k+ entries)?
2. Can we use CSV imports to update existing objects using a custom field like unique_key instead of Bubble’s internal ID?
3. What’s the most optimized approach to merge duplicates (API workflow on a list? Manual batch processing? Something else?) that won’t cause a workload spike?

Huge thanks in advance for any insights :folded_hands:

Bubble is not a great tool for bulk processing data. But 10k is not a huge number.

In the old world you’d reach for fuzzy matching in your database, soundex, etc. (not available in Bubble’s DB). In the new world, why not use AI? You could probably hand your whole CSV file to Gemini and tell it to de-dupe it (or just do it in Excel).
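
If you do take it outside Bubble, even a small script gets you most of the way before (or instead of) involving an LLM. A hedged sketch using only Python's standard library, assuming the exported CSV has name and city columns:

```python
import csv
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Rough fuzzy match on normalized strings (stdlib only, no soundex needed)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() >= threshold

# Column names ("name", "city") are assumptions about the exported CSV.
with open("profiles.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Block by city first so the pairwise comparison stays manageable at 10k rows.
by_city = defaultdict(list)
for r in rows:
    by_city[r["city"].strip().lower()].append(r)

for city_rows in by_city.values():
    for i, a in enumerate(city_rows):
        for b in city_rows[i + 1:]:
            if similar(a["name"], b["name"]):
                print(f'Possible duplicate in {a["city"]}: "{a["name"]}" ~ "{b["name"]}"')
```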

Bubble would be a great tool for building the UI to manage the AI prompts and to orchestrate external services, but its workflows are clumsy for bulk data processing like iterating over lists.

@axel5 I just released a plugin called Data Jedi

It has a server-side action to compare two lists, purpose-built for the increasingly common pattern of daily API data dumps, where a list saved in Bubble needs to be compared with a new incoming list. The action processes the two lists, comparing them on an id field you specify. A checkbox lets you either apply changes to the items with no errors while logging the ones with errors in detail, or fail the whole action if a single error exists.

It returns four lists of objects for use in subsequent workflow actions: unchanged items, modified items, new items, and an updated list. The updated list is the combination of the unchanged, modified, and new items, so you can save the full updated list back to the DB and run any other processes on the other three lists as separate data sets.
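
To illustrate the shape of that comparison (a sketch of the logic described above only, not the plugin's actual code, with "id" standing in for whichever key field you configure):

```python
def compare_lists(saved, incoming, id_field="id"):
    """Compare a saved list against an incoming list on an id field.
    Returns (unchanged, modified, new, updated) -- purely an illustration."""
    saved_by_id = {item[id_field]: item for item in saved}

    unchanged, modified, new = [], [], []
    for item in incoming:
        old = saved_by_id.get(item[id_field])
        if old is None:
            new.append(item)
        elif old == item:
            unchanged.append(item)
        else:
            modified.append(item)

    updated = unchanged + modified + new  # full list to save back to the DB
    return unchanged, modified, new, updated

saved = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
incoming = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bobby"}, {"id": 3, "name": "Cara"}]
unchanged, modified, new, updated = compare_lists(saved, incoming)
print(len(unchanged), len(modified), len(new), len(updated))  # 1 1 1 3
```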

Do they have a unique identifier key, like ‘id’, or anything else that can be used to determine their uniqueness?

I would suggest slightly changing the way you import. I’m not sure how you do it now, but if you give a bit of insight into it, I’ll explain how you could use my plugin to import in much the same way you do currently, with added benefits like not needing to run any cleanup or deduplication workflows afterward.


You have your answer already: the best practice is to use a custom unique key instead of Bubble’s internal ID. This is by far the best approach.

I echo the others’ comments. Bubble is not going to be a great tool for this.

Doing a database search for every record imported by API to determine duplicates (either at the time of import or later) is going to be really expensive from a WU (workload unit) perspective. I’m thinking 5k-10k WU per execution for 10k records, based on similar things we’ve done. Ouch.

In those cases I only needed to do this once, or every 3-6 months, so I manually manipulated the data in a spreadsheet and used the import/update functions in the Bubble database tab, with no heavy WU implications.
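
For what it's worth, that spreadsheet step is easy to script if Excel gets unwieldy at 10k+ rows. A minimal pandas sketch, assuming a Bubble CSV export with "Creation Date" and "unique id" columns plus some key column to dedupe on (column names here are assumptions about your export):

```python
import pandas as pd

# Ideally an ID from the source system; otherwise a normalized name+city key.
KEY = "remote_id"

df = pd.read_csv("profiles_export.csv")
df = df.sort_values("Creation Date")                     # oldest first
keepers = df.drop_duplicates(subset=KEY, keep="first")   # one row kept per key
dupes = df[~df["unique id"].isin(keepers["unique id"])]

keepers.to_csv("keepers.csv", index=False)    # feed back via the Data tab's import/update
dupes.to_csv("duplicates.csv", index=False)   # review, relink, then delete in Bubble
```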

Also, you need a remote key, as you’ve said in your post: preferably the ID from the remote database, to ensure reliability. Fuzzy mapping of data (e.g. concatenating name and city) is a PAIN in any context.