So I’m building a product comparison website. Think Trivago, but for tech products. This means I’m going to be dealing with tens of thousands of products (within the niche I’m focusing on). So my general question is whether it’s more efficient to store all of these products as “things” in my database, or to try to run results purely off APIs.
Things to consider:
Page speed for the user
Costs (especially future costs as Bubble is likely to move to some type of DB restriction on plans)
I wish I could just run the site purely from APIs, but:
Bubble doesn’t allow multiple API sources in a single repeating group.
The same product will show up multiple times, since multiple retailers sell it.
Some APIs have call limits, others don’t. If I understand correctly, if 1,000 people visit my site in a single day, that’s already 1,000 calls that day.
Different APIs have different value fields. Amazon might say “productprice”, Best Buy might say “regularPrice”, and Walmart might say “price”. With a DB entry, all three of those values can feed into a single “price” field.
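To make that normalization concrete, here’s a minimal sketch in plain Python (Bubble doesn’t expose code, so treat this as pseudocode for the mapping step; the retailer field names are the ones from the example above, not verified against the real APIs):

```python
# Map each retailer's own price field onto a single normalized "price" field.
# Field names are taken from the example above; treat them as placeholders
# for whatever the real API responses actually contain.

FIELD_MAP = {
    "amazon": "productprice",
    "bestbuy": "regularPrice",
    "walmart": "price",
}

def normalize(retailer: str, raw: dict) -> dict:
    """Return a record with one consistent 'price' key, whatever the source."""
    return {
        "retailer": retailer,
        "price": raw.get(FIELD_MAP[retailer]),
    }

print(normalize("bestbuy", {"regularPrice": 499.99}))
# {'retailer': 'bestbuy', 'price': 499.99}
```

However you implement it, doing this translation once at ingest time (rather than in every repeating group) is what makes the front end simple.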
I also like the idea of each product having its own dedicated page. So let’s say the PS5 has a product page, and on that page you can see pricing pulled in from retailers like Amazon, Best Buy, Walmart, etc. I’d also like to keep a price history, which seems easy if I’m constantly updating the price values in my DB.
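The price-history part is worth sketching, because the key design decision is appending a row per refresh instead of overwriting the price. Here’s a sketch using SQLite as a stand-in for the Bubble DB (the table layout and the model number are illustrative assumptions):

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in for the Bubble DB: one row per (product, retailer, timestamp),
# so every refresh APPENDS to the history instead of overwriting the price.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_history (
        model_number TEXT,
        retailer     TEXT,
        price        REAL,
        fetched_at   TEXT
    )
""")

def record_price(model_number: str, retailer: str, price: float) -> None:
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (model_number, retailer, price, datetime.now(timezone.utc).isoformat()),
    )

record_price("CFI-1215A", "amazon", 499.99)   # model number just for illustration
record_price("CFI-1215A", "walmart", 489.00)

# Lowest recorded price, for the dedicated product page:
lowest = conn.execute(
    "SELECT MIN(price) FROM price_history WHERE model_number = ?",
    ("CFI-1215A",),
).fetchone()[0]
print(lowest)  # 489.0
```

The same append-only shape gives you the history chart for free: it’s just the rows for one model number ordered by `fetched_at`.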
I think you get the point of what I’m trying to accomplish. Would you suggest storing the data internally, or is there a way I’m not thinking of that makes this type of site possible running purely off API catalogs?
I do believe you just argued yourself into building out a DB populated by crawling APIs, if for no bigger reason than data normalization and standardization, which will make front-end UI/UX development so much easier.
I’d say the best approach to this needs designing based on the capabilities of the API(s) you are using and what you want to achieve. Some thoughts:
How do you handle newly added products and discontinued/deleted products?
Do you care that price & availability change frequently? (These could change many times a day.)
If you are calling multiple APIs, how do you handle duplicate products?
Is GTIN accurate in all your data sources? I’m assuming it’s the only unique product ID you have when using multiple APIs (product catalogs) that can contain the same products as each other. My experience is that GTINs often are not accurate, and that may give you a lot of headaches if you are trying to de-dupe across multiple product catalogs.
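The de-duping question above can be sketched as a merge keyed on GTIN with a fallback when it’s missing. This is a sketch under assumptions: the record shapes are invented, not any real retailer API format, and the fallback key (normalized model number) is one possible choice:

```python
# De-dupe across catalogs: prefer GTIN when present, fall back to a
# normalized model number. Record shapes here are illustrative only.

def dedupe_key(record: dict) -> str:
    gtin = record.get("gtin")
    if gtin:                # GTIN is the ideal cross-catalog ID...
        return f"gtin:{gtin}"
    # ...but it is often missing or wrong, so fall back to model number,
    # normalized so "CFI-1215A" and "cfi1215a" collide.
    return "model:" + record["model_number"].upper().replace("-", "")

def merge_catalogs(*catalogs):
    merged = {}
    for catalog in catalogs:
        for record in catalog:
            merged.setdefault(dedupe_key(record), []).append(record)
    return merged

amazon  = [{"gtin": "0711719541028", "model_number": "CFI-1215A", "price": 499.99}]
walmart = [{"gtin": "0711719541028", "model_number": "cfi1215a", "price": 489.00}]
merged = merge_catalogs(amazon, walmart)
print(len(merged))  # 1 -- both listings collapse onto the same key
```

The painful part in practice is exactly what’s noted above: when one source has a wrong GTIN, the two listings land under different keys, so you end up needing fuzzy matching or manual review on top of this.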
Making some assumptions about all of the above, I would probably go with:
One source of truth (i.e. one API/product catalog) for all the products. You could store that in the DB, but you’ll need to think about additions & deletions for new & old products, or else rewrite the DB every night (not great when you are talking tens of thousands of products).
Fetch price & availability using API calls in the frontend (you don’t want to be running a recursive API workflow every N minutes to fetch X thousand prices & availabilities and store them in the DB).
Ideally what you want is a 3rd-party provider who handles most of this stuff for you and just provides a clean API. I’m sure they exist, but I assume they’ll be quite expensive, so I think you’ll have to live with a few imperfections whichever DIY route you take.
These are great thoughts! Really appreciate you taking the time to reply. Here are my thoughts:
At the moment I’m not 100% sure. The “easy” route is to build a community feature into the site where users can flag products whose info is incorrect or that no longer exist. I’m sure there are various ways a more automated approach could be built in.
Very much! What I’m building is basically like a Trivago but for PC setup shopping. In general, even refreshing the pricing once a day is fine, but ideally I’d love to get it down to refreshing the pricing every 10-30 minutes, which I know could be expensive and taxing on the server.
This is the primary reason I’m leaning towards saving these products as items in my database. A single product will be sold on various retailer APIs, but the one thing they all have in common is a “model number”. As far as I’m concerned, you can’t have multiple API sources feed into a single repeating group, and even if a way were found, de-duping by model number in real time seems like it could lead to headaches. I’d love to be wrong on this.
A good example of what I’m wanting to accomplish is what you see on this site. When you click into a product it will take you to a dedicated page for that product and show you up-to-date prices from multiple sources. You can see the same thing on a site like this as well, which has an easier time because they are JUST focused on Amazon pricing.
I was originally going this route, almost a “hybrid model”: the DB holds just the “model number” and maybe the product name (to avoid duplicates and to be able to create a slug for a dedicated page), and everything else is always fetched from an API. I suppose my biggest concerns are potential limits with the APIs and how searching would work. Say I wanted “show me monitors under $100”. If the pricing were all real-time, I imagine the RG would go through all of the model numbers, find the ones that are monitors, run separate API calls for each retailer to fetch the latest pricing, determine the lowest price, and then show that product with its lowest price. Users could then click into the dedicated product page to see all of the other prices for comparison. Sounds complicated, but perhaps I’d need to see it.
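The “monitors under $100” flow described above can be sketched like this (a sketch under assumptions: `fetch_price` is a hypothetical stand-in for the real retailer calls, and the model numbers and prices are canned test data):

```python
# For each model number in the DB, fetch the current price from every
# retailer, keep the lowest, and filter by the price limit.

PRICES = {  # canned responses standing in for live retailer API calls
    ("VG245H", "amazon"): 99.00,
    ("VG245H", "bestbuy"): 104.99,
    ("S2421H", "amazon"): 129.99,
}

def fetch_price(model_number: str, retailer: str):
    """Hypothetical stand-in for one retailer API call."""
    return PRICES.get((model_number, retailer))

def monitors_under(model_numbers, retailers, limit):
    results = []
    for model in model_numbers:
        quotes = [p for r in retailers
                  if (p := fetch_price(model, r)) is not None]
        if quotes and min(quotes) <= limit:
            results.append((model, min(quotes)))
    return results

print(monitors_under(["VG245H", "S2421H"], ["amazon", "bestbuy"], 100))
# [('VG245H', 99.0)]
```

Note the cost this sketch makes visible: one search is (models × retailers) API calls, which is exactly why caching the latest price in the DB is tempting.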
I have looked into it, but it’s frustrating because a big part of what you’re paying for is access to the APIs, and I already have access because I’m an affiliate with all the major retailers; most people don’t qualify for these feeds. I’ve also thought about housing all of the data in a 3rd-party service like Firebase or Airtable, but then I worry I’d run into the same problem of not knowing how to use THEIR API builders to pipe all those data points into one nice spreadsheet/database. Plus all of the requests: if 5,000 people visit my site and the average user makes 5 different searches while researching pricing, isn’t that 25,000 calls in a single day? Seems like the cost would be massive.