Enable AI Crawlers on Bubble pages

Hi,

I have blog posts on my website that I want AI agents to crawl. I don’t think that’s happening now. Is it because all the content is rendered through JavaScript?

Is there anything I can do so that my pages can be crawled? 🙂


Unlikely

Optimize them for AEO (answer engine optimization) and check out the robots.txt rules for OpenAI’s crawlers.

As far as I understand it, there is nothing you need to do to be crawled; if anything, there is something you can do to not be crawled.

When I gave one of my articles to ChatGPT and asked it to fetch all the information (title, description, meta tags, schema, etc.), it failed and gave me the response below.

---------------- response from ChatGPT ----------------

I wasn’t able to open that exact page directly—likely because of how it’s built on Bubble.io.

Bubble apps often render much of their content dynamically via JavaScript, meaning:

  • If the page doesn’t serve pre-rendered HTML (no server-side rendering or prerendering service like prerender.io), most AI crawlers and search engine bots that don’t execute JavaScript will not see the main content.

  • If Bubble’s SEO settings are properly configured and prerendering is enabled, then search engines and AI crawlers that respect robots.txt and can fetch the prerendered HTML will be able to read it.

  • Without prerendering, only crawlers with full browser rendering (like Googlebot’s modern rendering engine) could see the text, but lighter-weight AI crawlers might index nothing beyond the title and metadata.

  • Most AI crawlers that analyze raw HTML responses without executing JavaScript (e.g., older bots, or tools that don’t render pages like a browser) likely cannot access meta tags or the main content of this Bubble.io page.

  • More advanced crawlers—like modern search engines (e.g., Googlebot)—that execute JavaScript might see the content and meta tags if the page loads them dynamically.

  • But for many AI tools and basic crawlers, the page may appear “empty” or severely limited.
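(For context, what a non-JS crawler “sees” is just the raw HTML response. A quick way to check what that contains is a rough sketch like the one below, in Python with the requests package; the URL and phrase are placeholders.)

```python
# Rough check: does the article's main content appear in the raw HTML,
# i.e. without executing any JavaScript? URL and PHRASE are placeholders.
import requests

URL = "https://example.com/blog/my-article"      # replace with your blog post URL
PHRASE = "a sentence from the article body"      # replace with text you expect in the post

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
print("Status:", resp.status_code)
print("HTML length:", len(resp.text))
print("Article text in raw HTML:", PHRASE.lower() in resp.text.lower())
print("JSON-LD present:", "application/ld+json" in resp.text)
```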

Interesting. When I’ve asked it just to look at the site, it seems to be okay, but when I made the same specific requests you did, it sometimes came up empty on the meta tags and always came up empty on the schema. I took a look at prerender.io, and they have a quick graphic that seems to explain it.

Then ChatGPT mentioned this

[screenshot of ChatGPT’s reply]

Looks like it may or may not be a Bubble issue versus an AI crawler issue. After pressing ChatGPT on the subject, it responded like this:

I then ran some API call tests: a GET request to my blog post URL with the data type set to text and a User-Agent header set to different AI bots and Google. All of them returned the correct results, as if the page is accessible to the different bots. So the bots can crawl and fetch the HTML, and the robots.txt tests show they are all allowed except for /version-test/.
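(For anyone who wants to reproduce this outside of Bubble, a rough equivalent is sketched below in Python with the requests package; the URL is a placeholder and the bot user-agent strings are approximate, so check each vendor’s docs for the current values.)

```python
# Fetch the same post with several crawler User-Agent strings and compare
# status codes and body sizes. URL is a placeholder; UA strings are
# illustrative, not guaranteed to match the vendors' current ones.
import requests

URL = "https://example.com/blog/my-article"  # replace with your blog post URL

USER_AGENTS = {
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "curl": "curl/8.5.0",
}

for name, ua in USER_AGENTS.items():
    r = requests.get(URL, headers={"User-Agent": ua, "Accept": "text/html"}, timeout=15)
    print(f"{name:10} -> HTTP {r.status_code}, {len(r.text)} chars")
```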

Here is the final response from ChatGPT after I ran all the tests it gave me for confirming whether or not my blog page could be crawled, indexed, and cited by AI bots.

Got you. Short version: your page is fine. The mismatch is about me vs. real crawlers.

Why I said I couldn’t “see it”:

  • In this chat I don’t crawl the web by default. Unless browsing is turned on, I can’t fetch live URLs—I can only read what you paste. So when I said I couldn’t see the structured data/metadata, that was about my session’s limitations, not your site.

What your tests prove:

  • robots.txt: User-agent: * with only /version-test/ disallowed = all bots allowed on the live site (Googlebot, Bingbot, GPTBot, CCBot, etc.).

  • Fetching the article HTML with different UAs returns the page and includes JSON-LD BlogPosting + normal meta tags = machine-readable.

  • Schema/markup validator passes = structured data is valid.

So can ChatGPT/GPTBot “find” and cite it?

  • GPTBot (the crawler used for training): allowed by your robots and will be able to fetch the page. (Whether/when it’s used in model training is separate and not instant.)

  • ChatGPT during a conversation: it only cites pages when browsing is used in the session (or when given the direct URL). It doesn’t “know” new pages unless it browses to them or they’re already in its training/search index. That’s why I couldn’t cite it here—I wasn’t actually browsing.

Bottom line

  • Your site is crawlable and your post is structured and cite-ready.

  • My earlier “can’t see it” = chat environment limitation, not a problem with your page.

If you want extra confirmation beyond curl:

  • Check server logs for hits from Googlebot, Bingbot, GPTBot, etc.

  • Submit the URL in Search Console and Bing Webmaster Tools (fetch as bot / URL inspection).

  • Optional: curl -I -A "GPTBot" against the article to confirm a 200 OK and no X-Robots-Tag: noindex.
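(If you’d rather script that last check than run curl by hand, a rough Python equivalent is below; the URL is a placeholder.)

```python
# Rough equivalent of `curl -I -A "GPTBot" <url>`: confirm a 200 OK and make
# sure no `X-Robots-Tag: noindex` header comes back. URL is a placeholder.
import requests

URL = "https://example.com/blog/my-article"  # replace with your article URL

r = requests.head(URL, headers={"User-Agent": "GPTBot"}, allow_redirects=True, timeout=15)
print("Status:", r.status_code)
print("X-Robots-Tag:", r.headers.get("X-Robots-Tag", "(not set)"))
```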

If I search for “how to cook fish for dogs” on Google Search, my article is the top result.
But when I open AI Mode on Google and search for the same thing, my article never appears in the citations on the right. The same happens if I search on ChatGPT, Claude, or any other AI tool.

This is what I was talking about when I said Bubble pages are not crawled by AI agents/bots.


Okay, that might not be the most technical way of checking.

As I mentioned

You can also do this from your Bubble app. If you do not want to run those API calls to check, there is a website (I found it on Google) that will run these checks for you. It is basically just checking robots.txt.

For my Bubble-built site, everything looks accessible.

So, I do not believe there is anything you need to do to enable AI crawlers on your Bubble pages; it is more that you would need to do something to DISABLE AI crawlers.

I could be wrong, but based on all the tools and information I’ve found and the tests I’ve run, Bubble pages are accessible to AI crawlers.

This website shows that everything is allowed, but ChatGPT is still unable to read the content from my site - not even the schema or meta tags. I also tested it with static websites, and it wasn’t able to read those either.

It returned the following error: “Error: Error from browse service: Error calling browse service: 500”.

I’d like to know if anyone else has been able to get their website content successfully picked up by AI models.

@georgecollier @boston85719 @randomanon

I think so

The page above has zero structured data on it

The page above has blog posting structured data on it and passes the schema markup validator test

When I run the API calls to test against robots.txt, the pages are all accessible to any crawler.

I tried it for one of the articles you mentioned, and this is the response I got.

What I meant above was: when you search in AI Mode on Google for the topic of one of your existing top-ranking articles, do you still get the information from your website, and is it found in the reference list?

I used the same prompt you did and got the same sort of response.

Here’s what I could (and could not) pull from a fresh crawl of your URL right now:

Crawl summary

  • URL tested: https://talatgan.com/blog/en/an-in-depth-look-at-the-pineapple-express-cannabis-strain (with and without trailing slash).
    Result: the fetch returns no readable HTML (“Total lines: 0”).

  • Title detected (from the document response header):
    An In-depth Look at the Pineapple Express Cannabis Strain: Origins, Effects, and More | Talat Gan.

  • I also tried other pages on the same site (to rule out a one-off issue) and the homepage. They also return empty markup to the crawler.

Meta tags & schema

  • Because the server returns no HTML markup, I cannot access:

    • <meta> tags (description, robots, viewport, etc.)

    • Open Graph / Facebook tags (og:*)

    • Twitter Card tags (twitter:*)

    • Canonical link

    • Structured data (e.g., application/ld+json Article schema)

In short: there’s no parseable head/body coming back from the page to extract these items.

What this likely means (and how to fix)

While I can’t see server headers from here, the behavior is typical of one of these situations:

  • Bot protection / anti-scraping (e.g., a challenge page served to non-browser clients).

  • Client-side rendering only (content built entirely by JavaScript without server-side rendering or pre-rendering), so crawlers that don’t run JS see nothing.

  • Firewall / WAF rule blocking unknown user-agents.

Quick checks you (or your dev) can run:

  1. Load the URL with curl -I to verify status code and any WAF headers (e.g., Cloudflare).

  2. Test Google’s “URL inspection” in Search Console (Live Test) to confirm what Googlebot sees.

  3. If using a JS framework (Next/Nuxt/React/Vue), enable SSR or add a pre-render/fallback for critical pages so meta tags and JSON-LD are in the initial HTML.

  4. Ensure OG/Twitter/LD+JSON are emitted server-side (not injected after hydration).

  5. If using a WAF, allowlist common crawler UAs you trust (Googlebot, etc.) and avoid blanket blocks.

If you want, I can also draft a minimal head + Article JSON-LD template you can drop into the page for robust previews and rich results.
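(For what it’s worth, you can dump what is actually in the raw HTML head yourself rather than relying on ChatGPT’s fetcher. Below is a rough sketch in Python using requests and the standard html.parser; the URL is a placeholder.)

```python
# Dump the meta tags, canonical link, and JSON-LD blocks found in the raw HTML.
# URL is a placeholder; this only reads the server response, no JavaScript runs.
import requests
from html.parser import HTMLParser

URL = "https://example.com/blog/my-article"  # replace with the article URL

class HeadDump(HTMLParser):
    jsonld_blocks = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or a.get("property")):
            print("meta:", a.get("name") or a.get("property"), "=", a.get("content"))
        elif tag == "link" and a.get("rel") == "canonical":
            print("canonical:", a.get("href"))
        elif tag == "script" and a.get("type") == "application/ld+json":
            self.jsonld_blocks += 1

html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=15).text
dumper = HeadDump()
dumper.feed(html)
print("JSON-LD blocks found:", dumper.jsonld_blocks)
```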

I’m not really sure what the issue is, though. I was not able to get ChatGPT (GPT-5) to open the Notion documentation page I gave it, despite the page being available to anybody with the link. I think there is some internal issue with how ChatGPT, at least, is able to “browse the internet”.

Speaking with ChatGPT on the subject, it suggested a curl call to test the accessibility of the tags and the og:/twitter: stuff, and when I ran that call everything was accessible: meta title, description, og:, twitter:, and Facebook tags, as well as the JSON-LD. It gave me this as the reason for the discrepancy:

You’re right: your page does return title, meta, OG/Twitter, and JSON-LD. The reason I reported “no readable HTML” earlier is because the fetcher I use to crawl (not curl) got a different response from your server—effectively an empty/skeleton document. That’s on me to explain better.

Why this happens (most probable):

  • Bot/WAF variance by User-Agent or IP. Many CDNs/WAFs (Cloudflare, etc.) serve a blank/challenge or skeletal markup to unknown headless clients. Your curl used a “real” browser UA and was allowed; my crawler likely presented a generic UA from a data-center IP and got nothing.

  • Content negotiation quirks. Some setups change output based on headers like Accept, Accept-Encoding, HTTP/2 vs 1.1, or missing Accept-Language. Headless fetchers can trip these.

  • Geo / rate / fingerprint rules. IP reputation, region, or heuristics sometimes lead to stripped responses.

Then I questioned it some more and got two different API call tests to run to mimic different bots or browsers.

I got results that seemed to indicate all bots and browsers can access the page.

It then gave me this series of tests, since it says it couldn’t fetch the page because of some environmental variance.

# Bubble-only Crawler Variance Diagnostic (copy/paste)

This is a **Bubble-only** harness to verify whether your page returns different HTML (head/meta/JSON-LD) to different **environments / user agents**. It uses **one parametric API call**, a **CrawlerProfile option set**, and a **RepeatingGroup** to show, per profile, the body size and whether key tags are present.

---

## Step 1 — Create a parametric API call in **API Connector**

**New API**  
- **Name:** `Diag_Fetch`  
- **Use as:** `Data`  
- **Data type:** `text`  
- **Method:** `GET`  
- **URL:** make it a parameter: `https://talatgan.com/blog/en/an-in-depth-look-at-the-pineapple-express-cannabis-strain` (expose as `url`)

**Headers (ALL as parameters so you can vary them at runtime):**
- `User-Agent` → parameter `ua`
- `Accept` → parameter `accept` (default: `text/html`)
- `Accept-Language` → parameter `al` (default: *blank*)
- `Accept-Encoding` → parameter `ae` (default: `gzip, deflate, br`)

**Initialize** the call once using a normal browser UA (e.g. Chrome) so Bubble saves it. You should see raw HTML in the response body.

---

## Step 2 — Define a **CrawlerProfile** Option Set

**Option Set name:** `CrawlerProfile`  
**Attributes (all text):** `ua`, `accept`, `al`, `ae`

**Create these options (copy the values exactly):**

- **Browser**
  - `ua`: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36`
  - `accept`: `text/html`
  - `al`: `en-US,en;q=0.8`
  - `ae`: `gzip, deflate, br`

- **Curl**
  - `ua`: `curl/7.88.1`
  - `accept`: `*/*`
  - `al`: *(blank)*
  - `ae`: *(blank)*

- **PythonReq**
  - `ua`: `python-requests/2.31.0`
  - `accept`: `*/*`
  - `al`: *(blank)*
  - `ae`: *(blank)*

- **Googlebot**
  - `ua`: `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)`
  - `accept`: `text/html`
  - `al`: `en-US,en;q=0.8`
  - `ae`: `gzip, deflate, br`

- **Twitterbot**
  - `ua`: `Twitterbot/1.0`
  - `accept`: `text/html`
  - `al`: `en-US`
  - `ae`: `gzip, deflate, br`

- **Slackbot**
  - `ua`: `Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)`
  - `accept`: `text/html`
  - `al`: `en-US`
  - `ae`: `gzip, deflate, br`

- **EmptyUA**
  - `ua`: *(blank)*
  - `accept`: `*/*`
  - `al`: *(blank)*
  - `ae`: *(blank)*

*(Add more later if you want, e.g. `facebookexternalhit/1.1`, `WhatsApp/2.23.10`)*
  
---

## Step 3 — Build a one-page visual checker

**Page UI**
1. Add a **RepeatingGroup**:  
   - **Type of content:** `CrawlerProfile`  
   - **Data source:** `All CrawlerProfiles`  

2. Inside each cell add:
   - **Text (Profile name):**  
     `Current cell's CrawlerProfile`  

   - **Text (Body size):**  
     Data source: `Get data from external API Diag_Fetch` with params:  
     - `url = "https://talatgan.com/blog/en/an-in-depth-look-at-the-pineapple-express-cannabis-strain"`  
     - `ua = Current cell's CrawlerProfile's ua`  
     - `accept = Current cell's CrawlerProfile's accept`  
     - `al = Current cell's CrawlerProfile's al`  
     - `ae = Current cell's CrawlerProfile's ae`  
     
     Then append:  
     `:body:number of characters`  
     
     *(Shows how big the returned HTML is for that profile.)*

   - **Text (Signals — OK/FAIL lines)** using **`:extract with Regex :count`** on the **same `:body`**:  

     **OG title present?**  
     `:body:extract with Regex (?i)<meta\s+(?:property|name)=["']og:title["'][^>]*> :count`

     **Meta description present?**  
     `:body:extract with Regex (?i)<meta\s+name=["']description["'][^>]*> :count`

     **Canonical present?**  
     `:body:extract with Regex (?i)<link\s+rel=["']canonical["'][^>]*> :count`

     **JSON-LD present?**  
     `:body:extract with Regex (?is)<script[^>]*type=["']application/ld\+json["'][^>]*>.*?</script> :count`

     **Display logic:** for each line, show **`OK`** if `:count > 0`, else **`FAIL`**.  
     *(You can concatenate with static text like `OG title: ` then the OK/FAIL condition.)*

   - *(Optional)* **Preview snippet**  
     `:body:truncated to 300`  
     *(Quick eyeball of the head chunk.)*

---

## Step 4 — How to read the grid

- If **Browser** row shows a **large character count** and all signals = **OK**, but **Curl / PythonReq / EmptyUA** rows show **tiny counts** and signals = **FAIL**, your CDN/WAF is **varying by environment** (UA/headers/IP). That explains any “empty crawl” from non-browser clients.  
- If **Googlebot/Twitterbot/Slackbot** rows show **FAIL**, previews and/or indexing might be impacted—fix this urgently at the edge.  
- If **all rows** show **large counts** and **OK**, your page is consistent; the earlier miss was due to specifics of a single external fetcher (e.g., IP reputation, TLS quirks).

---

## Step 5 — Quick fixes if variance is confirmed

- **Edge/WAF rule:** allow `GET` to `/blog/*` without challenges for unknown/low-rate clients. Keep bot defenses on POST/PUT or other paths.  
- **Serve the same `<head>` to every UA.** (You already emit meta server-side—good.)  
- **Stabilize social images:** prefer a fixed 1200×630 asset and include `og:image:width` / `og:image:height`.  
- **Housekeeping tweaks** (not required for this test but recommended):
  - Add `<link rel="canonical" …>`
  - Use `og:type="article"` + `article:published_time` / `article:modified_time`
  - Switch `twitter:image:src` → `twitter:image`, `twitter:site_name` → `twitter:site`
  - JSON-LD: `@context="https://schema.org"`, `mainEntityOfPage` = canonical URL, `isFamilyFriendly` boolean, absolute `https://` URLs, remove empty `keywords`.

---

### Notes
- The **only** difference between profiles is the request headers; all calls are executed **from Bubble’s servers**, so this directly tests the “how my crawler hit it” scenario.  
- You can duplicate the RG to target **other URLs** by changing the `url` parameter dynamically.  
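(A rough non-Bubble equivalent of the same grid, for anyone who would rather run it from a terminal: a sketch in Python with the requests package, using the same profiles and the same regex signals as the harness above, against the same example URL.)

```python
# Non-Bubble version of the variance grid: fetch the page with each "crawler
# profile" and report body size plus the same four regex signals.
# A sketch assuming Python 3 with `requests`; URL copied from the harness above.
import re
import requests

URL = "https://talatgan.com/blog/en/an-in-depth-look-at-the-pineapple-express-cannabis-strain"

PROFILES = {
    "Browser":   {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
                  "Accept": "text/html", "Accept-Language": "en-US,en;q=0.8"},
    "Curl":      {"User-Agent": "curl/7.88.1", "Accept": "*/*"},
    "PythonReq": {"User-Agent": "python-requests/2.31.0", "Accept": "*/*"},
    "Googlebot": {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
                  "Accept": "text/html", "Accept-Language": "en-US,en;q=0.8"},
    "Twitterbot":{"User-Agent": "Twitterbot/1.0", "Accept": "text/html", "Accept-Language": "en-US"},
    "Slackbot":  {"User-Agent": "Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)",
                  "Accept": "text/html", "Accept-Language": "en-US"},
    "EmptyUA":   {"User-Agent": "", "Accept": "*/*"},
}

SIGNALS = {
    "og:title":    r'(?i)<meta\s+(?:property|name)=["\']og:title["\'][^>]*>',
    "description": r'(?i)<meta\s+name=["\']description["\'][^>]*>',
    "canonical":   r'(?i)<link\s+rel=["\']canonical["\'][^>]*>',
    "json-ld":     r'(?is)<script[^>]*type=["\']application/ld\+json["\'][^>]*>.*?</script>',
}

for name, headers in PROFILES.items():
    body = requests.get(URL, headers=headers, timeout=20).text
    checks = ", ".join(f"{s}: {'OK' if re.search(p, body) else 'FAIL'}" for s, p in SIGNALS.items())
    print(f"{name:10} {len(body):>8} chars | {checks}")
```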

It finally gave me three actionable steps we do have control over:

1) Don’t block the bots (robots.txt + WAF)

Goal: Explicitly allow the AI/user-agents most assistants rely on, and expose your sitemap.

The first one comes across more as “make sure they are allowed”, so it gave steps to add explicit allowances for the AI bots. The other two steps were:

  2. Make your sitemap great (and visible)
  3. Push URLs to indexes used by AIs (IndexNow)

At some point I’ll probably run the test pasted in Markdown above. I have a strange feeling all the results will show that everything is accessible, and that the only thing that might really work is the final set of three suggestions, the first being the most likely one to get a positive result.
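(For reference, the IndexNow submission itself is just one HTTP request. Here is a rough sketch with requests; the host, key, and URLs are placeholders, and the key also has to be hosted as a text file at the keyLocation you declare.)

```python
# Minimal IndexNow submission: tells participating search engines that a URL
# has been added or updated. Host, key, and URLs are placeholders; the key
# must also be served as a text file at the keyLocation on your own domain.
import requests

payload = {
    "host": "example.com",
    "key": "your-indexnow-key",                                   # the key you generated
    "keyLocation": "https://example.com/your-indexnow-key.txt",
    "urlList": ["https://example.com/blog/my-article"],
}

r = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    headers={"Content-Type": "application/json; charset=utf-8"},
    timeout=15,
)
print(r.status_code)  # 200/202 generally means the submission was accepted
```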

This did not work for me; I had the robots.txt below:

User-agent: *
Disallow: /version-test/

User-agent: GPTBot
Disallow: /user_profile/
Allow: /articles/

User-agent: OAI-SearchBot
Disallow: /user_profile/
Allow: /articles/

User-agent: OAI-SearchBot/1.0
Disallow: /user_profile/
Allow: /articles/

User-agent: ClaudeBot
Disallow: /user_profile/
Allow: /articles/

User-agent: PerplexityBot
Disallow: /user_profile/
Allow: /articles/

User-agent: CCBot
Disallow: /user_profile/
Allow: /articles/

I’ll try to use IndexNow to see if that helps.
