I know this question goes a little bit into many Bubblers' uncomfy zone - however, I wanted to ask it anyway.
My application aggregates keywords from various sources, which I save to the database in a backend workflow. Since those sources might contain overlapping information, roughly 5%-10% of the data ends up being duplicates.
At a later point in the flow I would like to identify and delete those duplicates. Can anybody here suggest a smart way of doing so?
What I know I could do is check, each time I save a keyword, whether it already exists. Yes, that could potentially work, but a) it's WU-expensive and b) it does not play well with my current setup, where I count the keywords and take further actions based on that count. If I stop saving some of the expected and already-counted keywords, I run into trouble elsewhere. Therefore I'd like to find another solution, even one that also comes with high WU usage - that is okay for now.
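For context, the "check on save" idea amounts to one extra search with an "Only when" condition before every save. Here is a minimal Python sketch of that logic (the data and the save_keyword helper are hypothetical, purely for illustration; it also shows why skipped saves throw off a count-based setup):

```python
# Stand-in for "Only when: Search for Keywords (name = this keyword):count is 0"
saved_keywords = set()

def save_keyword(name):
    """Save the keyword only if it does not exist yet (the 'check on save' idea)."""
    if name in saved_keywords:   # the extra lookup on every save -> extra WU
        return False             # save is skipped, which throws off the downstream count
    saved_keywords.add(name)
    return True

for kw in ["price", "cost", "price"]:
    save_keyword(kw)

print(saved_keywords)  # {'price', 'cost'} - the duplicate was never written
```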
Something I thought of in theory: duplicate the list of keywords I already have, then compare the original list with the copy and take further actions… but my gut says there is probably a much smarter approach.
After thinking more about this problem, the quoted approach does not even work, since the save-keyword workflow runs multiple times in parallel while saving keywords from different sources. Therefore an "Only when" condition might not even be a reliable duplicate check, because the exact same keyword could be saved at the exact same moment.
Has anybody here ever had the same problem and solved it somehow in the backend?
This list is then immediately grouped by its identifier (in my case the keyword name itself) and aggregated by count (sorry, not visible in the screenshot, but it's just one click):
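Conceptually, that grouping step just counts how many saved entries share the same keyword name. Outside of Bubble it would look roughly like this (a minimal Python sketch; the keyword list is made up for illustration):

```python
from collections import Counter

# Hypothetical result of a "Search for Keywords", reduced to the keyword names
saved_keywords = ["price", "cost", "price", "budget", "cost", "price"]

# Group by the identifier (the keyword name itself) and aggregate by count
counts = Counter(saved_keywords)

# Only groups with a count above 1 actually contain duplicates
duplicates = {name: count for name, count in counts.items() if count > 1}
print(duplicates)  # {'price': 3, 'cost': 2}
```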
After this, the workflow schedules itself again as many times as there are items on the list. The index parameter is incremented by 1 each time it reschedules, and the condition on the API Workflow level is index <= maxIndex. That is also how you know when the last iteration occurs (where I'd suggest checking index >= maxIndex).
Be aware that this only removes one duplicate per keyword. If you have more than one duplicate of the same keyword, make sure to catch those as well.
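To catch every extra copy in a single pass, each iteration would need to delete count - 1 items of its group instead of just one. A rough Python sketch of that per-iteration logic (the grouped list, the in-memory "database" and the 1-based index mirror the recursive workflow, but they are made-up stand-ins):

```python
def process_iteration(index, grouped_keywords, database):
    """One pass of the recursive workflow: keep one copy of the keyword at
    position `index` and delete all remaining copies."""
    name, count = grouped_keywords[index - 1]   # index is 1-based, like the workflow parameter
    for _ in range(count - 1):                  # delete count - 1 copies, not just one
        database.remove(name)
    return index < len(grouped_keywords)        # False once index >= maxIndex (last iteration)

# Hypothetical grouped search result and keyword table
grouped = [("price", 3), ("cost", 2), ("budget", 1)]
db = ["price", "cost", "price", "budget", "cost", "price"]

index = 1
while process_iteration(index, grouped, db):
    index += 1                                  # the workflow rescheduling itself with index + 1
print(db)  # ['budget', 'cost', 'price'] - exactly one copy of each keyword left
```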
This approach does not run into any race-condition issues - however, on a WU-based plan it could come at a higher price than other methods.
Another solution could be to make use of an external service API - e.g. a model from OpenAI could do the job, and depending on the amount of data this could even be cheaper.
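If you go that route, the idea would be to send the keyword list out, let the model return a deduplicated version, and then reconcile your database with the result. A minimal sketch using the OpenAI Python client - the model name, the prompt and the "reply with a JSON array" convention are my own assumptions, not something tested for this use case:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

keywords = ["price", "cost", "price", "budget", "cost"]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any capable model should work
    messages=[
        {"role": "system",
         "content": "You deduplicate keyword lists. Reply with a JSON array of strings only."},
        {"role": "user", "content": json.dumps(keywords)},
    ],
)

deduplicated = json.loads(response.choices[0].message.content)
print(deduplicated)  # e.g. ["price", "cost", "budget"]
```

Whether this is actually cheaper than the recursive workflow depends on the list size and your WU pricing, so it is worth measuring both.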