I know this question goes a little bit into many Bubblers' uncomfy zone - however, I wanted to ask it anyway.
My application aggregates keywords from various sources, which I save to the database in a backend workflow. Since those sources might contain overlapping information, roughly 5%-10% of the data ends up being duplicates.
At a later point in the flow I would like to identify and delete those duplicates. Can anybody here suggest a smart way of doing so?
What I know I could do is check, each time I save a keyword, whether it already exists. Yes, that could potentially work, but a) it's WU-expensive and b) it does not play well with my current setup, where I count the keywords and take further actions based on that count. If I stop saving some of the expected and already-counted keywords, I run into trouble elsewhere. Therefore I'd like to find another solution, even one that also comes with high WU usage - that is okay for now.
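For context, the "check on save" idea amounts to one extra search with an "Only when" condition before every save. Here is a minimal Python sketch of that logic (the data and the save_keyword helper are hypothetical, purely for illustration; it also shows why skipped saves throw off a count-based setup):

```python
# Stand-in for "Only when: Search for Keywords (name = this keyword):count is 0"
saved_keywords = set()

def save_keyword(name):
    """Save the keyword only if it does not exist yet (the 'check on save' idea)."""
    if name in saved_keywords:   # the extra lookup on every save -> extra WU
        return False             # save is skipped, which throws off the downstream count
    saved_keywords.add(name)
    return True

for kw in ["price", "cost", "price"]:
    save_keyword(kw)

print(saved_keywords)  # {'price', 'cost'} - the duplicate was never written
```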
Something I thought of in theory: duplicate the list of keywords I already have, then compare the original list with the copy and take further actions… but my gut says there is probably a much smarter approach.
After thinking more about this problem, the quoted approach does not even work, since the save-keyword workflow runs multiple times in parallel while saving keywords from different sources. Therefore an "Only when" condition might not even be a reliable duplicate check, because the exact same keyword could be saved at the exact same moment.
Has anybody here ever had the same problem and solved it somehow in the backend?
This list is then immediately grouped by its identifier (in my case the keyword name itself) and aggregated by count (sorry, not visible in the screenshot, but it's just one click):
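Conceptually, that grouping step just counts how many saved entries share the same keyword name. Outside of Bubble it would look roughly like this (a minimal Python sketch; the keyword list is made up for illustration):

```python
from collections import Counter

# Hypothetical result of a "Search for Keywords", reduced to the keyword names
saved_keywords = ["price", "cost", "price", "budget", "cost", "price"]

# Group by the identifier (the keyword name itself) and aggregate by count
counts = Counter(saved_keywords)

# Only groups with a count above 1 actually contain duplicates
duplicates = {name: count for name, count in counts.items() if count > 1}
print(duplicates)  # {'price': 3, 'cost': 2}
```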
After this, the workflow schedules itself again as many times as there are items on the list. The index parameter is incremented by 1 each time it reschedules, and the condition on the API Workflow level is index <= maxIndex. That is also how you know when the last iteration occurs (where I'd suggest checking index >= maxIndex).
Be aware that this only removes one duplicate per keyword. If you have more than one duplicate of the same keyword, make sure to catch those as well.
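To catch every extra copy in a single pass, each iteration would need to delete count - 1 items of its group instead of just one. A rough Python sketch of that per-iteration logic (the grouped list, the in-memory "database" and the 1-based index mirror the recursive workflow, but they are made-up stand-ins):

```python
def process_iteration(index, grouped_keywords, database):
    """One pass of the recursive workflow: keep one copy of the keyword at
    position `index` and delete all remaining copies."""
    name, count = grouped_keywords[index - 1]   # index is 1-based, like the workflow parameter
    for _ in range(count - 1):                  # delete count - 1 copies, not just one
        database.remove(name)
    return index < len(grouped_keywords)        # False once index >= maxIndex (last iteration)

# Hypothetical grouped search result and keyword table
grouped = [("price", 3), ("cost", 2), ("budget", 1)]
db = ["price", "cost", "price", "budget", "cost", "price"]

index = 1
while process_iteration(index, grouped, db):
    index += 1                                  # the workflow rescheduling itself with index + 1
print(db)  # ['budget', 'cost', 'price'] - exactly one copy of each keyword left
```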
This approach does not run into any race-condition issues - however, on a WU-based plan it could come at a higher price than other methods.
Another solution could be to make use of an external service API - e.g. a model from OpenAI could do the job, and depending on the amount of data this could even be cheaper.
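If you go that route, the idea would be to send the keyword list out, let the model return a deduplicated version, and then reconcile your database with the result. A minimal sketch using the OpenAI Python client - the model name, the prompt and the "reply with a JSON array" convention are my own assumptions, not something tested for this use case:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

keywords = ["price", "cost", "price", "budget", "cost"]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any capable model should work
    messages=[
        {"role": "system",
         "content": "You deduplicate keyword lists. Reply with a JSON array of strings only."},
        {"role": "user", "content": json.dumps(keywords)},
    ],
)

deduplicated = json.loads(response.choices[0].message.content)
print(deduplicated)  # e.g. ["price", "cost", "budget"]
```

Whether this is actually cheaper than the recursive workflow depends on the list size and your WU pricing, so it is worth measuring both.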