Ideas for URL Scraping in Bubble?

Hey all,

Need some help here. How can I build a feature where users submit a URL through an “Upload URL” button, and the app scrapes and stores the content?

I’m looking for general advice on scraping plugins or APIs that can handle various content types, meaning PDFs, sitemaps, YouTube URLs, and LinkedIn pages.

What are the recommended tools or methods for this within Bubble?

Thanks in advance!

I use a Google Cloud Function for this. It is more or less just a GET request to the URL, followed by extracting the website’s text content from the response. I also worked with https://urlbox.com/ once, and it worked okay.
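For anyone wondering what "a GET request that extracts the text content" could look like, here is a minimal sketch using only the Python standard library. It is not the actual Cloud Function from the post above; the names `html_to_text` and `fetch_text` and the User-Agent string are made up for illustration, and pages that render their content with JavaScript will not work with a plain GET.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>, <style>, and <noscript> contents."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the text nodes of an HTML document, one per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """GET the URL and return its text content (hypothetical helper)."""
    # Assumption: some sites reject requests without a User-Agent header.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return html_to_text(html)
```

A function like `fetch_text` could sit behind an HTTP endpoint and be called from Bubble through the API Connector, returning the text as the response body.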


I have been using the OpenAI agents API for scraping, and it has been a good experience!

My users paste the URL, I send the HTML body of the page in the API call to my agent, and my agent answers back in JSON, which I use to create an entry in my database.

You can easily get the HTML body of any page by just making a GET API call to the URL you want to scrape :slight_smile:
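The flow described above (HTML in, JSON out, database entry created) can be sketched roughly like this. This is only an illustration under assumptions: the field names in `FIELDS`, the prompt wording, and the `ask_model` callable are all hypothetical stand-ins, and the actual call to the model is left out so that no particular API signature is implied.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical fields for the database entry; adjust to your own data type.
FIELDS = ("title", "summary")


def build_prompt(url: str, html: str, max_chars: int = 20000) -> str:
    """Build the text sent to the model, truncating very long pages."""
    return (
        "Extract the following fields from the page and answer with a single "
        f"JSON object with the keys {', '.join(FIELDS)}.\n"
        f"URL: {url}\n"
        f"HTML:\n{html[:max_chars]}"
    )


def parse_entry(reply: str) -> Dict[str, Optional[str]]:
    """Parse the model's JSON reply and keep only the expected fields."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object in the model reply")
    return {field: data.get(field) for field in FIELDS}


def scrape_to_entry(url: str, html: str, ask_model: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """ask_model is whatever function sends a prompt to your agent and returns its text reply."""
    return parse_entry(ask_model(build_prompt(url, html)))
```

Keeping only the known fields means an unexpected key in the reply will not end up in the database entry, and a reply that is not valid JSON fails loudly instead of creating a half-empty record.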

Thanks, did urlbox.com work for entire sitemaps?

Hey, thanks. So the flow would be:

1/ The user uploads URLs.
2/ You scrape each URL’s HTML (do you do this for entire sitemaps?).
3/ You send it to an OpenAI assistant.
4/ You get a JSON response covering the entire URL’s content.
5/ You store it in Bubble’s dataset.

Is this correct?
My doubt is about step 2, since this should work for PDFs, YouTube URLs, LinkedIn URLs, and entire sitemaps (i.e., an entire blog).
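One way to think about step 2 is to route each submitted URL to a different handler depending on what it points at. Below is a minimal sketch of that routing, assuming simple heuristics based on the hostname and file extension; the function name `classify_url` and the category labels are made up for illustration, and a real implementation would probably also check the response's Content-Type header.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Guess the content type of a URL: youtube, linkedin, pdf, sitemap, or html."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    if host in ("youtu.be", "youtube.com") or host.endswith(".youtube.com"):
        return "youtube"
    if host == "linkedin.com" or host.endswith(".linkedin.com"):
        return "linkedin"
    if path.endswith(".pdf"):
        return "pdf"
    # Assumption: sitemaps are XML files with "sitemap" somewhere in the path.
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"
    return "html"
```

Each label can then map to its own scraping step, for example a PDF text extractor for "pdf" and a plain GET for "html".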


I haven’t used it to scrape sitemaps, but yes, it would be the same process.
If the sitemap is too big, you might need to split it into smaller pieces, though.
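Splitting a big sitemap into smaller pieces could look something like the sketch below, which uses only the Python standard library. The helper names `sitemap_urls` and `chunked` and the batch size are assumptions for illustration; the namespace is the one from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> value found in a sitemap XML document.

    Note: a sitemap index also uses <loc>, but there the values point
    at other sitemaps rather than at pages.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch can then be scraped and sent to the agent separately, which keeps individual requests small.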

You don’t need to store the JSON in a Bubble dataset; you can use the data in the JSON to create an entry directly. I explained how in a guide here in the forum.

You can search here in the forum: How to: Use Open AI Assistants functions to create an entry in your database

I would add the link directly, but I get banned every time I try to share a link to a guide I made in the forum… :roll_eyes: random rules

