Prevent scrapers from scraping certain pages? Cloudflare?!

Hello,

I currently have user profile pages on my site in the format: website.com/members/USERNAME. It works somewhat like Upwork, where each user has their own public profile.

My concern is that scrapers could crawl these profile pages to harvest contact details and other sensitive information, which could then be used to copy the site or steal data.

I’m already using Cloudflare, but from what I understand, Bubble doesn’t fully support the Cloudflare proxy. Does that mean Cloudflare can’t really be used here to block this type of scraping?

What’s the best way to tackle this issue? Ideally, I’d like to implement some form of rate limiting and bot protection to keep my users’ information safe.


Bubble exposes robots.txt in your app settings. You can start with that.
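
For reference, a rule set for the profile pages could look roughly like what the helper below prints. This is only a minimal sketch in TypeScript: the `/members/` path comes from the original post, and the function name `buildRobotsTxt` is made up for illustration. The printed text is what would be pasted into the robots.txt setting.

```typescript
// Hypothetical helper: builds a robots.txt body that asks all crawlers
// to stay away from the given path prefixes. Compliance is voluntary.
function buildRobotsTxt(disallowedPrefixes: string[]): string {
  const lines = ["User-agent: *"];
  for (const prefix of disallowedPrefixes) {
    lines.push(`Disallow: ${prefix}`);
  }
  return lines.join("\n") + "\n";
}

// Prints the rules for the profile pages from the original post.
console.log(buildRobotsTxt(["/members/"]));
```

Keep in mind that a rule like this also asks search engines not to crawl the profiles, so it trades SEO for privacy, and it only affects crawlers that choose to honor it.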

What do you mean? The Cloudflare proxy offers protection at the domain level, so as long as requests hit your proxy first, your app is protected. The exception is if your origin (Bubble app IPs) is already exposed and scrapers send requests directly to it.

Hey @ihsanzainal84, thanks for your comment.

But most scrapers people build are told to ignore robots.txt altogether, so that will stop some, yes, but sadly not most.

In Bubble (from my understanding and testing) you cannot have the proxy enabled on the A records. If you do, you get:

We found bad DNS records for test.example.com. Our infrastructure does not support IPV6 lookups: AAAA record 2606:4700:3108::ac42:2b84, AAAA record 2606:4700:3108::ac42:287c These records are pointing at a Cloudflare account that does not belong to Bubble: A record 172.66.40.124, A record 172.66.43.132.

And to my understanding the proxy has to be enabled for Cloudflare to protect you with the services they offer. With the proxy off (DNS-only mode), this is what you lose:

  • CDN (Content Delivery Network): Your traffic no longer passes through Cloudflare’s edge servers, so caching, performance optimizations, and global distribution won’t apply.
  • DDoS protection & Web Application Firewall (WAF): Since Cloudflare isn’t sitting in front of your server anymore, malicious traffic goes directly to your origin.
  • SSL/TLS termination (Cloudflare’s Universal SSL): Visitors won’t get Cloudflare’s managed HTTPS certificate. You’ll need to serve HTTPS directly from your origin server.
  • Rate limiting, bot management, firewall rules: These only apply to proxied traffic. With DNS-only mode, they won’t protect you.
  • IP masking: Your origin server’s real IP will be exposed publicly.

What still works:

  • DNS resolution: Cloudflare will still resolve your domain, but nothing else will pass through Cloudflare’s network.

This is my understanding of it, but I might be wrong?
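
To make the "rate limiting" item above concrete, here is a minimal sketch of a fixed-window limiter in TypeScript. It is an in-memory illustration of the general idea only, not how Cloudflare implements its rate limiting rules; the class name, the limits, and the use of an IP address as the client key are all assumptions.

```typescript
// Fixed-window rate limiter keyed by a client identifier (for example an IP).
// In-memory only: an illustration of the idea, not Cloudflare's implementation.
class FixedWindowLimiter {
  private counters = new Map<string, { windowStart: number; count: number }>();
  private maxRequests: number;
  private windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the client is over the limit.
  allow(clientId: string, nowMs: number = Date.now()): boolean {
    const entry = this.counters.get(clientId);
    // No entry yet, or the previous window has expired: start a new window.
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      this.counters.set(clientId, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.maxRequests) {
      return false;
    }
    entry.count += 1;
    return true;
  }
}

// Hypothetical limit: at most 3 profile views per 60 seconds per client.
const limiter = new FixedWindowLimiter(3, 60000);
console.log(limiter.allow("203.0.113.7"));
```

A limiter like this only helps if every request passes through the place where it runs, which is exactly why the proxy question matters.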

Ahh okay. Yes, that’s correct. Bubble already uses Cloudflare, so you’ll need to set up a reverse proxy, but that’s going to be quite a headache to set up.

An “easier” way is to point your Bubble app to a subdomain and have your users interact with the root domain, which is CF proxied. Use a Worker to proxy requests to that subdomain.
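
A minimal sketch of what such a Worker might look like, in TypeScript. The hostname `app.example.com` is a placeholder for the subdomain the Bubble app is connected to, and the function names are made up for illustration. It has not been verified against Bubble, which may need extra handling (cookies, redirects, websockets) that is not shown here.

```typescript
// Assumption: the Bubble app is served from this subdomain (placeholder name).
const ORIGIN_HOST = "app.example.com";

// Rewrites an incoming URL so it targets the origin subdomain,
// keeping the path and query string intact.
function rewriteToOrigin(incomingUrl: string, originHost: string): string {
  const url = new URL(incomingUrl);
  url.hostname = originHost;
  return url.toString();
}

// Worker-style handler: forwards the request to the origin subdomain
// and returns whatever the origin responds with.
async function handleRequest(request: Request): Promise<Response> {
  const target = rewriteToOrigin(request.url, ORIGIN_HOST);
  // Passing the original request as init copies its method, headers and body.
  return fetch(new Request(target, request));
}

// In a Cloudflare Worker (module syntax) the handler would be exported as:
// export default { fetch: handleRequest };
```

With this in place, a visitor requesting a profile on the root domain would be served the same path from the subdomain, while rate limiting and bot rules apply on the proxied root domain.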

Yeah that sucks… Hmm, that sounds interesting, but how does that work with SEO, indexing etc.?

So if I understand correctly, the Bubble app is hosted on, for example, test.domain.com

And then in Cloudflare I use Workers to point domain.com to test.domain.com?

Correct-o.

This is where things get a little tricky.

  • You need to make sure the appropriate crawlers are whitelisted to crawl the subdomain, since that’s where the Bubble content is.
  • To avoid exposing the subdomain, you have to add canonical tags to every page of your Bubble app that you want crawled, pointing to the root (where the proxy lives). This tells search engines to display the root domain address instead.

Note that your subdomain might still get displayed instead.
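
As an illustration of the canonical tag idea, the sketch below builds the tag for a given page path. It is TypeScript, with `https://domain.com` standing in for the proxied root domain and `canonicalTag` being a made-up helper name. How the tag gets into each Bubble page's header is not covered here.

```typescript
// Assumption: the public (proxied) root domain; replace with your own.
const PUBLIC_ORIGIN = "https://domain.com";

// Builds the canonical <link> tag for a page path such as "/members/USERNAME".
// The tag always points at the root domain, never at the subdomain.
function canonicalTag(path: string, publicOrigin: string = PUBLIC_ORIGIN): string {
  const href = new URL(path, publicOrigin).toString();
  return `<link rel="canonical" href="${href}">`;
}

console.log(canonicalTag("/members/USERNAME"));
```

The same path on the subdomain and on the root domain would then carry an identical canonical tag, which is the hint search engines use to pick the root domain address.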

I’ve been doing some learning to shift my apps to CF managed domains, which is why I know this stuff. I haven’t actually tried anything related to SEO since my clients don’t need it.

Everything public on the internet by definition is for everybody…

BUT

If you don’t have a sitemap exposed, I doubt anyone would waste resources on brute forcing a profile page, because there is no predictable pattern for the item slug.

Plus, Bubble doesn’t really deliver much content for a curl GET request anyway. They would have to run headless browsers to scrape the data, and this data is already voluntarily available.

I have a similar setup and my slugs are random. If I don’t send you a link or it isn’t posted somewhere - there is close to 0 chance of figuring it out.
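
A rough back-of-the-envelope version of that claim, as a TypeScript sketch. The numbers are hypothetical (12-character slugs over a-z and 0-9, 10,000 existing profiles), and the math assumes slugs are uniformly random. It does not apply to slugs based on usernames, which can be guessed from word lists.

```typescript
// Number of distinct slugs for a given alphabet size and slug length.
function slugSpace(alphabetSize: number, length: number): number {
  return Math.pow(alphabetSize, length);
}

// Average number of random guesses before one lands on any existing profile,
// assuming uniformly random slugs and independent guesses.
function expectedGuessesPerHit(
  existingProfiles: number,
  alphabetSize: number,
  length: number
): number {
  return slugSpace(alphabetSize, length) / existingProfiles;
}

// Hypothetical numbers: 12-character slugs over a-z and 0-9, 10,000 profiles.
const guesses = expectedGuessesPerHit(10000, 36, 12);
console.log(guesses.toExponential(2));
```

Under those assumptions the average is on the order of 10^14 guesses per hit, which is far beyond what a rate-limited scraper could attempt.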

And about the data type being public - this is true but you can still protect it with “Current user isn’t empty”. Anonymous users are also users.

In this regard, I am not sure what Bubble considers as a user but I guess it has to be an actual browser opening the page, which adds another layer of complexity to scrape.

CF - you can proxy if you point a CNAME to some Bubble subdomain (forgot which) and ditch the A records. But Bubble is a tens-of-millions SaaS, so they must have some adequate enterprise Cloudflare protection enabled already.

Thank you! I think i need to read more about that for sure hahah

Hey @akamarski, thanks for your detailed comment! Very interesting, but with, for example, Firecrawl and Google dorking they can get most of the URLs that have been indexed, right? With, for example, site:website.com/u/?

Yes, true. So you think that they already offer some kind of anti-bot and DDoS protection?

If they are indexed, they should be public by definition, no? If you don’t allow indexing in robots.txt, it will be really hard to find unindexed slugs, and in that case they have to spin up a headless Chrome just to load one page.

When we use the bubble A Record way of connecting the domain, all traffic is proxied via their enterprise edge. They probably protect internal services as well as our apps.

Or you can use the CNAME method to use your own Cloudflare and use this: “Declare your AIndependence: block AI bots, scrapers and crawlers with a single click” (Cloudflare blog post).
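
For a sense of the simplest form of bot blocking, the TypeScript sketch below checks a request's User-Agent header against a few crawler names that AI companies publish. This is only an illustration: Cloudflare's feature relies on its own detection rather than a short list like this, the list is not exhaustive, and a scraper can send any User-Agent it likes.

```typescript
// User-agent substrings published by some AI crawlers (not exhaustive).
const AI_BOT_TOKENS = ["GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider"];

// Returns true if the User-Agent header contains one of the known tokens.
// Matching is case-insensitive; a missing header is treated as "not a known bot".
function looksLikeAiBot(userAgent: string | null): boolean {
  if (!userAgent) {
    return false;
  }
  const ua = userAgent.toLowerCase();
  return AI_BOT_TOKENS.some((token) => ua.includes(token.toLowerCase()));
}

console.log(looksLikeAiBot("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"));
```

A check like this catches only crawlers that identify themselves honestly, so it complements rate limiting rather than replacing it.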

Thank you! This seems like a good way to start. Sadly there are 0 tutorials out there on things like this, and I thought trying to prevent this was standard practice on a Bubble app?

If you don’t mind me asking, is there some good documentation on “using the CNAME method” that I can check out?