Impossible to fetch more than 50k records via API?

Had an issue this weekend where I wasn’t able to fetch beyond 50k records using the cursor/pagination parameters available to us in the Data API. Not sure if this is a new limitation as I haven’t run into the issue before (and was pretty sure I’d recently exceeded 50k calls… a number of times).

Essentially, if I call either of my test or live Data API endpoints with a cursor value greater than 50,000, I get a 200 OK response but only empty brackets [] in place of the result I was expecting (which is confusing, as this shouldn’t be a 200 response if I’m not supposed to be able to get that information). I’ve tried getting around the limitation without any luck.

There is nothing on the forum about this as far as Google can show me, though I now see the manual states:
[image: excerpt from the Bubble manual]

How new is this limitation? And if I can’t fetch every object in my database via the API, then how is one supposed to copy those objects for use in 3rd party databases, search indices or backups? Forcing the use of workflows to accomplish this would be unfortunate given how slow they are / the limitations they impose :frowning:

@eve

1 Like

That is a worry …

Just to clarify …

If you do a GET with cursor = 1, limit = 50,000 it works?

If you do a GET with cursor = 50,001, limit = 50,000 it doesn’t?

The one, the only, the living legend NigelG has entered the chat! Haha, thank you for your message.

It is a super worry :frowning:

To answer your question: limit, as far as I am aware, cannot be more than 100, since Bubble just won’t return more than that number of records per call. So the method of extracting all records was historically to increment cursor by 100 and make subsequent calls for as long as there were still records to fetch (which could be deduced from the remaining key/value pair provided in the response).
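For reference, that historical loop looks roughly like this (the app URL, data type and key are placeholders):

import requests

# Naive cursor-increment crawl: walk the cursor in steps of 100 until
# 'remaining' hits zero. URL, data type and key are placeholders.
url = "https://YOUR-APP-NAME.bubbleapps.io/version-live/api/1.1/obj/YOUR_DATA_TYPE"
api_key = "YOUR_API_KEY"

cursor = 0
while True:
  resp = requests.get(url, params={"cursor": cursor, "limit": 100, "api_token": api_key})
  resp.raise_for_status()
  body = resp.json()["response"]
  # ... do something with body["results"] ...
  cursor += body["count"]
  if body["remaining"] == 0:
    break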

Here’s a Python snippet for fetching a set of records beyond the 50,000th record

import os
import requests
import json

BUBBLE_API_KEY = os.getenv("BUBBLE_API_KEY", "ENTER_YOUR_KEY")
DATA_TYPE = "YOUR_DATA_TYPE"
BASE_URL = 'https://YOUR-APP-NAME.bubbleapps.io/version-live/api/1.1/obj/'

url = BASE_URL + DATA_TYPE
search_options = [{'key': 'Lots of spaces here? ye',
                   'constraint_type': 'equals',
                   'value': 'here is my data'}]
payload = {"cursor": "50000",
           "limit": 100,
           "api_token": BUBBLE_API_KEY,
           # "constraints": json.dumps(search_options)
          }
response = requests.get(url, params=payload)

if response.status_code != 200:
  print(response.json())   # surface the error body
else:
  chunk = response.json()['response']
  print(chunk)

Currently, that simply will not return any result. The docs say the API will return up to 10 million records for Bubble users on Dedicated plans, but presumably it then fails to return any records beyond that (while, I imagine, still giving an HTTP status 200).

In any event, this is our data and I want to be able to query all of it via API - as the lord intended. If any other cloud provider in existence was like, “hey, come build your business on our database… but don’t get too big, because if you do, we won’t let you export all of your data”, who would use that service? Charge me to retrieve it if you must - that’s normal - but don’t force me to go from paying $29 per month to several thousand just to do that, while still capping me at a million records.

Am I missing something? Or is the above just misdirected frustration since I don’t know the trick?

1 Like

While waiting for others to chime in, here’s an argument for why API data fetch limits should not exist (or should require only a nominal charge).

We’re paying for our database records

According to recent pricing talks with @grace.hong (and as generally disclosed by Josh here), Bubble is going to start charging for the underlying memory usage of our records. I told Grace that was not only appropriate but “it’s about time”, as, from their perspective, they previously had an unbounded cost in how large our databases could be.

The charge will be a few dollars per gig per month, which affords several thousand database records depending on structure… A-Okay

Fetching 50k records does not exhaust the API rate allowance

Bubble affords 1,000 API calls per minute, and we can boost that three times a month for free or pay a nominal amount to increase the rate. But fetching 50,000 records (at 100 records per fetch) requires only 500 API calls - nowhere near the limit even if they were all made within a single minute.

The alternatives we’re left with are more expensive for everyone, Bubble included.

Most services we’d like to sync data to accept bulk inserts (some can handle 100 or 1,000 records at a time). But if we can’t GET records past the first 50k, the only programmatic mechanisms available are:

  • One-at-a-time: We create a backend workflow that is able to export one record to our 3rd party db. The requirement of one-at-a-time is there because we’ll then have to run a “Bulk” job which runs that backend workflow for each record of interest. Now, instead of making only 500 HTTP connections to complete that sync, we’re forced to make no less than 50,000 connections - way more taxing on Bubble’s infrastructure and ours. And this isn’t even fully programmatic because a person has to click the “Bulk” button in the data tab to trigger said re-sync.
  • All-at-a-time: We schedule a search that loads all 50k+ (could be 700k) records into a backend workflow. We can’t pump that many records into our 3rd party DB at once so the workflow has to slice off the bit it can handle then recursively carry/schedule the remainder into a new instance of the same workflow. Carrying that remainder from workflow to workflow until it’s exhausted is also more taxing on Bubble’s infrastructure - both in memory and in CPU - than if we could just do the same via vanilla API.

None of those solutions can be baked into a plugin, which is something that needs to be possible in order to easily share such automations with other Bubblers.

1 Like

Just wanted to chime in to help! :slight_smile: This behavior is actually an existing limitation of our Data API, and it’s set in place to preserve the reliability and performance of Bubble apps. In order to retrieve data past the first 50k entries, the workaround you can use is to filter the query. For instance, if you’re sorting by the created or modified date for 100k entries, you can add a constraint to filter and return results from 50k-100k.

3 Likes

Grace, really appreciate your message :slight_smile: Thank you!

Is this to say we use the constraints parameter to search for, say, records that were created between a start_date and an interim_date, where interim_date is just start_date + some_time_offset? Then we (see the sketch after this list):

  1. fetch as many records as we can (calibrating the interim_date so that the number of records between start_date and interim_date is less than 50k) 100 at a time,
  2. increment start_date and interim_date by some_time_offset
  3. repeat steps 1 - 2 until interim_date exceeds the date the last record was created.
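A rough sketch of what I mean (the ‘Created Date’ field name, URL and window size are assumptions on my part):

import datetime
import json
import requests

url = "https://YOUR-APP-NAME.bubbleapps.io/version-live/api/1.1/obj/YOUR_DATA_TYPE"
api_key = "YOUR_API_KEY"
window = datetime.timedelta(days=7)          # some_time_offset, tuned so each window holds < 50k records
start_date = datetime.datetime(2020, 1, 1)   # creation date of the oldest record (assumed)
final_date = datetime.datetime.utcnow()      # date the last record was created

while start_date < final_date:
  interim_date = start_date + window
  constraints = [
    {"key": "Created Date", "constraint_type": "greater than", "value": start_date.isoformat() + "Z"},
    {"key": "Created Date", "constraint_type": "less than", "value": interim_date.isoformat() + "Z"},
  ]
  cursor = 0
  while True:   # step 1: fetch the window 100 records at a time
    body = requests.get(url, params={
      "cursor": cursor,
      "limit": 100,
      "api_token": api_key,
      "constraints": json.dumps(constraints),
    }).json()["response"]
    # ... do something with body["results"] ...
    cursor += body["count"]
    if body["remaining"] == 0:
      break
  start_date = interim_date   # steps 2-3: slide the window forward and repeat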

Is that what you’re suggesting? And if a plugin automated this for 4,000 Bubble apps (ranging in DB size from 1k to 1 million records), would Bubble allow it, keeping in mind that it would only be an occasional event for each app?

It generally wouldn’t make sense to do this on a large scale rather than using the Data API, but with the limitation in place it may. You can set up a workflow endpoint that works similarly to the Data API, where you send a ‘from’ and ‘count’ parameter to the endpoint and it returns a list of records and an ‘is_more’ boolean for whether you can retrieve more records.

As you noted, it seems this would take considerably more capacity vs a direct call to the database, but it does work.
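Roughly, the client side of that would look like this (the endpoint name, auth header and response shape here are just for illustration):

import requests

# Hypothetical backend-workflow endpoint that accepts 'from'/'count' and
# returns a list of records plus an 'is_more' flag - names are illustrative.
url = "https://YOUR-APP-NAME.bubbleapps.io/version-live/api/1.1/wf/export_records"
api_key = "YOUR_API_KEY"

start = 0
count = 100
while True:
  resp = requests.post(url, json={"from": start, "count": count},
                       headers={"Authorization": f"Bearer {api_key}"})
  body = resp.json()["response"]
  # ... handle body["records"] ...
  if not body["is_more"]:
    break
  start += count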

5 Likes

Really appreciate the input, Eli :slight_smile: I think this is another way to skin the cat but agreed, as you say, not as efficient as a direct API call. Seems like another example of how limiting the API at 50k records to preserve the reliability and performance of Bubble apps, unless I’m missing something, actually has the opposite effect.

Am implementing my pseudo code from above now. Happy to share the snippet once I get it working if anyone’s interested.

3 Likes

I’ve also run into the issue with 50k records. For what it’s worth, my solution was to set up a second DB in Xano to export the records to. I did a one-time import of all the records from Bubble to Xano and run a daily sync cron from Bubble to Xano for all the new records. The Xano API is much more flexible and powerful, so running large exports from there isn’t a problem. This works well for logs etc., but it’s much trickier to implement if your database is dynamically changing.

1 Like

This makes sense :slight_smile: Strange part is I was pretty sure I didn’t have this limitation on a (non-dedicated plan) project I did 2 months ago syncing over 100k records. :man_shrugging:

Sort of feels like you need to build a “cursor” table into Bubble with various limits in it, so that you can then set up the filter dynamically.

1 Like

A reasonable third option :slight_smile: Though problematic, as one also needs to handle gaps in the cursor when a middle record has been deleted.

Which brings us to our 4th and, I believe, best option for those working in Python - introducing my best Bubble data fetch script yet:

All our datas are belong to us

The following creates a Python generator that fetches all of your records via the Data API. It works by first checking how many records you have, then incrementing through them using one of two approaches:

  • If the DB has fewer than 50k records, we use the naive cursor-increment approach, which simply walks the cursor across every record in your DB (up to 50k).
  • If the DB has more than 50k records, we slice the database into chunks using date constraints and then walk the cursor through those chunks 45k records at a time. See this pseudo code for details.

Every data fetch has a built-in pause so as to not overload your Bubble app, as well as a retry mechanism to gracefully stand by until your DB comes back online (if it were momentarily unreachable for whatever reason). All of these variables are configurable, so you can, for example, increase the pause between data pulls. Here it is:

import os
import requests
import json
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import datetime
from dateutil import parser
import time

url = 'https://YOUR_BUBBLE_APP_NAME.bubbleapps.io/version-live/api/1.1/obj/DATA_TYPE_OF_INTEREST'
api_key = 'YOUR_APP_KEY'

def crawl_bubble_table(url, api_key, pause_in_seconds = 0.07, start_cursor=0):
  hard_api_limit = 50000
  soft_api_limit = 45000
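  # Assumption: 50k is the Data API's cursor cap; ending each date-window at
  # 45k leaves headroom so a chunk never straddles the hard limit.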

  retry_strategy = Retry(
    total=6,
    backoff_factor=4,
    status_forcelist=[429, 500, 502, 503, 504],
  )
  adapter = HTTPAdapter(max_retries=retry_strategy)

  def fetch_records(session = None, cursor=0, limit = 100, constraints=None, sort_field=None, descending=False):
    payload = {"cursor": str(cursor),
                "limit": limit,
                "api_token": api_key,
                "constraints": json.dumps(constraints),
                "sort_field": sort_field,
                "descending": descending,
              }
    if session:
      response = session.get(url, params=payload)
    else:
      response = requests.get(url, params=payload)

    if response.status_code != 200:
      return response.status_code, {}
    elif response.json()['response']['results'] == []:
      return 404, {}
    else:
      return response.status_code, response.json()['response']

  def extract_python_datetime(_datetime):
    _datetime = parser.parse(_datetime)
    return _datetime.replace(tzinfo=None)

  def get_start_date_and_db_size():
    status, result = fetch_records(limit=1, sort_field= 'Created Date')
    start_date = result['results'][0]['Created Date']
    records_remaining = result['remaining']
    return extract_python_datetime(start_date), records_remaining

  def get_end_date():
    status, result = fetch_records(limit=1, sort_field= 'Created Date', descending='true')
    return extract_python_datetime(result['results'][0]['Created Date'])

  def add_millisecond(_datetime):
    _datetime = _datetime + datetime.timedelta(milliseconds=1)
    return _datetime, _datetime.isoformat() + 'Z'

  def subtract_millisecond(_datetime):
    _datetime = _datetime - datetime.timedelta(milliseconds=1)
    return _datetime, _datetime.isoformat() + 'Z'

  def ensure_tail_records_not_in_next_session(result):
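    # Drop every record sharing the chunk's last 'Created Date' (and add them
    # back to 'remaining') so the next date-window, which starts strictly after
    # that date, re-fetches them - avoiding duplicates and gaps at the boundary.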
    all_records = result['results']
    last_date = all_records[-1]['Created Date']
    returnable_records_mask = [key['Created Date']!=last_date for key in all_records]
    records_count = sum(returnable_records_mask)
    ignored_records_count = result['count'] - records_count
    returnable_records = [item for keep, item in zip(returnable_records_mask, all_records) if keep]
    returnable_result = {
      'count': records_count,
      'cursor': result['cursor'],
      'remaining': result['remaining'] + ignored_records_count,
      'results': returnable_records
    }
    return returnable_result

  status, result = fetch_records()
  if status != 200:
    yield status, result
  else:
    start_date, records_remaining = get_start_date_and_db_size()
    cursor = start_cursor
    if records_remaining < hard_api_limit:
      # Download chunks using naive cursor increment
      with requests.Session() as s:
        s.mount("https://", adapter)
        s.mount("http://", adapter)
        while records_remaining > 0:
          time.sleep(pause_in_seconds)
          status, result = fetch_records(cursor=cursor)
          records_remaining = result['remaining']
          cursor += result['count']
          yield status, result
    else:
      # Download chunks using date slicing increment
      end_date = get_end_date()
      end_date_plus_millisecond, end_date_plus_millisecond_string = add_millisecond(end_date)
      incremental_start_date, incremental_start_date_string = subtract_millisecond(start_date)
      # database_age_in_seconds = (end_date - start_date).seconds
      # records_per_second = (records_remaining + 1) / database_age_in_seconds
      cycling_sessions = True
      while cycling_sessions:
        search_constraints = [{
          'key': 'Created Date',
          'constraint_type': 'greater than',
          'value': incremental_start_date_string,
        }, {
          'key': 'Created Date',
          'constraint_type': 'less than',
          'value': end_date_plus_millisecond_string,
        }]
        with requests.Session() as s:
          s.mount("https://", adapter)
          s.mount("http://", adapter)
          while cursor < soft_api_limit and records_remaining > 0:
            time.sleep(pause_in_seconds)
            status, result = fetch_records(cursor=cursor, constraints=search_constraints, sort_field= 'Created Date')
            records_remaining = result['remaining']
            cursor += result['count']

            if cursor >= soft_api_limit and records_remaining > 0:
              # Last loop between sessions (right before next big crawl)
              trimmed_result = ensure_tail_records_not_in_next_session(result)
              # records_remaining = trimmed_result['remaining']
              # cursor += trimmed_result['count']
              yield status, trimmed_result
            else:
              yield status, result
        
        incremental_end_date = extract_python_datetime(result['results'][-1]['Created Date'])
        # records_remaining = True
        if end_date == incremental_end_date and result['remaining'] == 0:
          cycling_sessions = False
        else:
          cursor = 0
          incremental_start_date_string = trimmed_result['results'][-1]['Created Date']
          time.sleep(pause_in_seconds * 3)

And here’s how we use it:

crawl = crawl_bubble_table(url=url, api_key=api_key, pause_in_seconds=0.1)

for status, result in crawl:
  print(result['remaining'], result['results'])
  # Do something with result['results'] here

This generator yields two values, the status and the result of each call to our database. As long as the status is 200, we’ll have data in the result (otherwise result = {}). result is the standard Bubble response object containing the following keys, in order of usefulness:

  • results: your data stored as a list of Python dictionaries.
  • remaining: the number of remaining results to be fetched not including the current results.
  • cursor: the cursor value locating this group of results… not generally useful.
  • count: the number of results (dictionaries) in this fetch (almost always 100)… not generally useful.

Cheers,

8 Likes

I have a similar interest here, in that we will be importing ecommerce records for up to 12 months to deliver the solution. I wonder if an ETL tool like Kettle or Talend would make more sense rather than a long-running API call. You could create a database in bit.io, ETL the data there, and then sync bit.io to Bubble. Bit.io is $0.17/GB/month for idle data and $0.01/million rows queried. So an 800GB database is about $150/mo - an added cost, but bit.io connects via the API Connector and then you don’t have storage limitations. Not sure what Bubble’s storage cost went to. I am looking at bit.io as a storage site for detail data for up to 3 years for an ecommerce site (shows buying trends for consumers) and then processing that and delivering it to Bubble in aggregate.

This isn’t a long-running API call but instead hundreds of API calls chained together to respectfully download data from one’s own Bubble app. Not sure if that’s a meaningful difference for your use case but - in any case - if one is building cloud infrastructure that requires regular access to Bubble’s full DB records (say, to sync those to a 3rd party service like bit.io), then this is (so far) the best way I’ve found of doing that… otherwise one is exporting records one by one (the above approach is batch export) or using Bubble’s compute units to get the job done (less performant on a per-exported-record basis).

1 Like

Just wondering if you considered a data warehousing strategy that implements differential updates. So you track the changes in your Bubble app via database triggers and push those changes out to an external database. I’m building a strategy along those lines and was wondering if you had given consideration to that approach and what downsides it might have.

1 Like

Hey bubble.trouble, that’s a great point. In practice, most applications may only need to do a full table or database download once (or a few times a year) and then - if mirroring to a 3rd party data store - incrementally mirror additional/edited records.

As for me, I needed this for a plugin I’ve built which automates this mirroring capability for my users. Indeed, this plugin also has actions for incrementally updating their 3rd party datastore.
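The incremental part boils down to constraining on a modified-date field since the last successful sync - roughly like this (the ‘Modified Date’ field name and the stored timestamp are assumptions for illustration):

import json
import requests

# Incremental mirror: pull only records changed since the last successful sync.
url = "https://YOUR-APP-NAME.bubbleapps.io/version-live/api/1.1/obj/YOUR_DATA_TYPE"
api_key = "YOUR_API_KEY"
last_sync = "2022-01-01T00:00:00.000Z"   # persisted from the previous run

constraints = [{"key": "Modified Date",
                "constraint_type": "greater than",
                "value": last_sync}]
cursor = 0
while True:
  body = requests.get(url, params={
    "cursor": cursor,
    "limit": 100,
    "api_token": api_key,
    "constraints": json.dumps(constraints),
    "sort_field": "Modified Date",
  }).json()["response"]
  # ... upsert body["results"] into the 3rd party datastore ...
  cursor += body["count"]
  if body["remaining"] == 0:
    break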

1 Like

We’re currently beta testing a database backup tool built specifically for Bubble by an active dev on this forum and it’s v handy.

On a schedule, whilst our platform is quiet, it comes and makes a full off-site backup solely via the API, no workflows. Zero setup required.

2 Likes

@exception-rambler that sounds great! It seems like there are several of us who have seen similar needs. It will be interesting to see how it compares, but definitely something that is needed. I’m excited to see it!

3 Likes

Howdy, any chance you can give an update on how the build of the off-site data backup plugin is coming along?

While studying the API limitations I had already thought to create the backup database in Azure. If the plug-in is already available, I’d love to see it.

From @lindsay_knowcode

1 Like