Editor so slow it is almost unusable

Well, it’s back online here in Australia, and faster than before.

Live is also back for me in Budapest. Editor still not working.

Mmmm, I take it back. Only a few seconds of glory and now it’s gone.
Back to the beer.

http://status.bubble.is/

Seems we’ve got a bit of a yoyo thing going on…

Hey, very sorry for the outages. Investigating them now. They were slightly below the thresholds that automatically wake us up: I was getting notifications, but they were automatically resolving before the time cutoff at which I get a phone call, which is why we weren’t giving status updates earlier. I may tweak the thresholds based on this incident, though, since it sounds like it was severe enough to warrant an emergency response.

Anyway, looks like whatever went wrong is no longer going on, but we will hopefully have a root cause diagnosis / fix later today. Will update this thread when I finish analyzing the incident.

Re: the remarks above on the thread about 24/7 support and Bubble’s growth trajectory, for some context: 4 months ago we were a 2-person team; we are now a 5-person team, and very soon a 6-person team. So, we see the importance of Bubble and are hiring and training people as fast as we can. We are trying to hire people in groups of 1 or 2, not 10 or 20, because Bubble is very complicated from a technical standpoint, so training new hires is very involved… that’s the main bottleneck on our growth right now. So, yes: our team is lagging behind our community, and we are trying to fix that lag. I don’t have any brilliant ideas here other than to hire great people and train them as well as I can – I think we just have to keep pushing forward until we have a team that’s big enough for the product.

19 Likes

On the positive side … if you can measure product/market fit by how loudly people scream when the product is taken away (even for a short while)…I think you have this nailed :slight_smile:

10 Likes

@josh

Thanks for the update, Josh! While we do get frustrated with the downtime, especially with its increased frequency of late, we also appreciate that you are a small company experiencing very rapid growth, and that what you are offering the community is of tremendous value for what it costs. In short, we chose to put our eggs in one basket, in a company that’s still young, knowing there would be a risk of hiccups. So don’t get me wrong: you deserve our patience in this.

Whatever you can do in terms of communicating problems, tweaking thresholds, and of course preventing problems in the first place is immensely appreciated.

Lastly, to avoid drowning you in messages and extra work: should we report these outages in the bug form? Or will you know anyway, and we are just wasting your time?

1 Like

Please feel free to send bug reports and comment on the forum. With something like this, we generally will already know there’s a problem (today I woke up to a bunch of alert notifications), but it’s helpful to us to know what an outage looks like to our users, because that can provide more information that helps investigation, and it covers gaps in our automated alerting + metrics.

1 Like

Will do!

Let’s not forget this thread is about the editor being slow.

@josh In what way would it be beneficial for us to gather information from the Chrome developer console regarding 1. memory leaks, 2. slow API/XHR responses, and 3. slow functions?

The community is willing to help uncover the core editor issues; it may save you tons of debugging time.
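For example – the names and thresholds below are just placeholders – we could paste something like this into the DevTools console and leave it running while we work in the editor, to log slow requests, main-thread stalls, and heap growth:

```ts
// Log any network request slower than 2 s (threshold is arbitrary).
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 2000) {
      console.warn(`slow request: ${entry.name} took ${Math.round(entry.duration)} ms`);
    }
  }
}).observe({ entryTypes: ["resource"] });

// Log any script task that blocks the main thread for over 100 ms
// (Chrome reports these as "longtask" entries).
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.warn(`long task: ${Math.round(entry.duration)} ms on the main thread`);
  }
}).observe({ entryTypes: ["longtask"] });

// Chrome-only: sample JS heap size every 30 s to spot steady growth.
setInterval(() => {
  const mem = (performance as any).memory;
  if (mem) console.log(`heap: ${(mem.usedJSHeapSize / 1048576).toFixed(1)} MB`);
}, 30000);
```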

2 Likes

Hi Josh,

Maybe change the thresholds/gauges on the status.bubble.is site so we can go there to see whether an issue is happening. Most of the time it’s all green while we are seeing obvious errors. That would give us the transparency we need (without having to “bug” the team), and you folks could get on with fixing issues while we, armed with good status data, manage the outage from a productivity standpoint.

So situations like the one today generally do show up on status.bubble.is: they may not show up in the green dot at the top of the page, but if you look at the system metrics, you can see that something is wrong.

Other times, there may be something broken with a single feature of Bubble, which may be a very big deal to a few of our users but not to the majority. For that, I think a bug report is the best way of handling it: that kind of bug is very hard to detect automatically, which means that often, by the time we know there’s a problem, we’re already debugging and deploying a fix. It’s also often hard to know how many users a given bug affects, and if we reported every single piece of functionality that breaks, the status page might become too noisy to be useful.

Hey @gurun, I’m a little skeptical about the community helping us profile the editor, because for a report to be useful, it’s not enough to identify the line of code that’s a hot spot: we need to understand why it is a hot spot, in terms of how the editor works, which is hard to do without intimate knowledge of the editor code base. A report of “hey, we spend a lot of CPU time on line 3422” doesn’t save us much debugging time, because most of the work is figuring out why we hit that line so many times, and whether we should be hitting it that many times.

That said, if the community is able to do real diagnosis, it would be useful:

  1. Memory leaks – these are extremely annoying to debug, so a good report here would be very useful. Generally, reproduction cases look like “do stuff in the editor for 20 minutes and watch the memory usage go up,” which is very hard to work with since the editor has a lot going on; being able to identify what is allocating memory and why that memory cannot be reclaimed is definitely useful.

  2. Slow API / XHR responses: I’m not as sure there are useful things to report here, because slowness here generally comes down to the quantity of data and the number of requests being sent, which is a function of application size. The news that sending megabytes of data takes a while isn’t very actionable for us. If you find a case where a response that does not send much data reproducibly takes a very long time, that would be interesting, but whenever I’ve looked into slow-editor complaints I haven’t seen bottlenecks like that.

  3. I’m already aware that we have several functions that scale linearly with the number of elements / events / actions on a page, which is a bottleneck for editor speed in big applications. Generally with that kind of thing, identifying the problem is pretty easy, and the effort is in fixing it (there are usually ways of making things scale sub-linearly with the number of elements, but they often require major algorithm changes – see the sketch after this list). I would be interested in reports of functions that take human-visible amounts of time regardless of the size of the page, though the last time I profiled the editor I didn’t find any.
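To illustrate the kind of “major algorithm change” I mean in point 3 – this is a generic sketch, not actual editor code – the typical move is replacing a per-lookup linear scan over everything on the page with an index that is maintained incrementally:

```ts
interface PageElement { id: string; name: string }

// Hypothetical stand-in for "all elements on the page".
const pageElements: PageElement[] = [
  { id: "button-1", name: "Save" },
  { id: "group-2", name: "Header" },
];

// O(n) per lookup: rescans every element each time it is called.
function findElementLinear(id: string): PageElement | undefined {
  return pageElements.find((el) => el.id === id);
}

// Sub-linear alternative: build an index once, so each lookup is O(1).
// Keeping the index correct as elements are added, renamed, and deleted
// is the hard part in practice, which is why changes like this tend to
// require invasive rework rather than a one-line fix.
const elementIndex = new Map(pageElements.map((el) => [el.id, el]));
const found = elementIndex.get("button-1"); // O(1) instead of O(n)
```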

4 Likes

Agreed, bugs are bugs, and those that skirt the threshold are even worse. Can you change the threshold or add more services?

Hi @josh, thanks for the update. I am sure you had a horrid morning.

I know we all get frustrated and the forum is the only place we can turn for information to see if others are experiencing the same problems.

I am glad it’s back up and running, and I look forward to finding out the root cause. As BubbleBoy and I were discussing, this may be the push a lot of us need to move to dedicated plans.

1 Like

Wow, I can’t believe that 4 months ago Bubble.is was just a 2-person team. It’s amazing that 2 people built all of this until so recently. That means even more brilliant things will be possible when the team is bigger…

3 Likes

So, to follow up on this and give you guys some transparency into how we’re handling the outage earlier today, here’s where I am with my investigation.

Two things went wrong here:

  1. There was a sharp increase in the average time to process server requests, severe enough to lead to downtime. It looks like this affected everything, including both the editor and user apps.

  2. We weren’t woken up by our automatic alerting system, even though this was a major outage.

For 1, unfortunately, I’m not 100% sure what caused the increase in average time. I have two theories, but neither is testable after the fact given the data we have. The theories are:

A) The performance of the queue that manages allocating CPU time to different apps may have degraded under load. I know there was a large increase in the number of queued items during the outage, and I identified a bug that could cause the queue to slow down as more items were added to it (one common class of such bug is sketched below). Unfortunately, while we measure how long it takes to run each item in the queue, we don’t currently measure the overhead of the queue itself, so I don’t know for sure whether this was the cause.

B) All connections to the database that stores applications may have been in use by long-running queries, choking off the ability to fetch application data. There was a large increase in the average time to acquire a connection during the outage. Unfortunately, our metrics around the application database connection pool aren’t great: we’ve had relatively few problems with it in the past compared to the database that stores user data, so the monitoring and protections we have around it aren’t nearly as robust as for our user database. So, I don’t know for sure whether this was the problem either.
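To make theory A concrete – I won’t claim this is the exact bug, and the sketch below is illustrative rather than our actual queue code – a classic way for a queue to slow down as items are added is dequeuing with Array.shift(), which re-indexes the whole backlog on every dequeue, so total work grows quadratically with queue depth:

```ts
// Naive queue: shift() is O(n), so draining n items costs O(n^2).
class SlowQueue<T> {
  private items: T[] = [];
  enqueue(item: T) { this.items.push(item); }
  dequeue(): T | undefined { return this.items.shift(); } // re-indexes the array
}

// Fix: track a head pointer so dequeue is O(1) amortized.
class FastQueue<T> {
  private items: T[] = [];
  private head = 0;
  enqueue(item: T) { this.items.push(item); }
  dequeue(): T | undefined {
    if (this.head >= this.items.length) return undefined;
    const item = this.items[this.head++];
    // Periodically drop the consumed prefix so memory is reclaimed.
    if (this.head > 1024 && this.head * 2 > this.items.length) {
      this.items = this.items.slice(this.head);
      this.head = 0;
    }
    return item;
  }
}
```

This kind of bug is invisible under normal load because the queue never gets deep enough for the quadratic cost to matter, which is exactly why we also need metrics on the queue’s own overhead.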

For 2, I identified a couple issues with our alerting system that caused us not to be woken up:

  • Our alerting for average page load times was set to emit warnings rather than emergencies. That’s because it’s fairly new (we only introduced page speed metrics recently), so we weren’t sure how accurate it was. However, it alerted correctly on this outage and hasn’t been giving false positives, so it makes sense to have it emit emergency alerts going forward.

  • Our downtime alerting is designed to handle prolonged outages, not intermittent ones. We got several alerts from it, but they were cancelled shortly afterwards because the system briefly recovered. We don’t have our alerting system wake us up immediately (or we’d get burned out by false alarms), but because the alerts kept flickering on and off, they never stayed on long enough to trigger a wake-up call. A human watching the pattern of alerts would have realized that the problem wasn’t really solved, but our automated systems aren’t smart enough to do that (a sketch of one possible smarter approach follows this list).

  • We have an automated monitoring system that attempts to resolve issues without human intervention by taking actions such as restarting servers. I believe it’s likely it would have resolved the issue had the system activated. However, this system did not detect the downtime, because it’s configured to check the server health, but not the health of apps running on the server.
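For what it’s worth, the flap detection I have in mind would look roughly like this – a sketch only, not our actual alerting code: rather than requiring one continuous alert to stay open past a time cutoff, count how many alerts fired within a rolling window and escalate when short outages cluster:

```ts
// Escalate when alerts keep flickering on and off: treat N alerts inside
// a rolling window as one sustained incident. Thresholds are placeholders.
const WINDOW_MS = 15 * 60 * 1000; // 15-minute window
const MAX_FLAPS = 3;              // 3 separate alerts => one incident

const recentAlerts: number[] = [];

// Hypothetical escalation hook; stands in for an actual phone call.
function pageOnCallEngineer(reason: string): void {
  console.error(`WAKE UP: ${reason}`);
}

function onAlertFired(now: number = Date.now()): void {
  recentAlerts.push(now);
  // Keep only the alerts that fall inside the window.
  while (recentAlerts.length > 0 && now - recentAlerts[0] > WINDOW_MS) {
    recentAlerts.shift();
  }
  if (recentAlerts.length >= MAX_FLAPS) {
    pageOnCallEngineer(`${recentAlerts.length} alerts in ${WINDOW_MS / 60000} min`);
  }
}
```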

So, I am taking the following actions:

  • I’m adding two new metrics to monitor overhead in the CPU allocation queue
  • I’m fixing the bug in the queue that could potentially cause it to slow down with a lot of items in it
  • I’m adding metrics around the connection pool to our application database
  • I’m going to limit the number of connections to the database that can be used by a single app, so that an issue with one app can’t block other apps from accessing the database (a sketch follows this list)
  • I am upgrading our page speed alerting to emit emergencies
  • I am investigating whether it’s possible for our uptime monitoring to be smart enough to treat a bunch of brief periods of downtime as a single incident
  • I am changing our automated recovery system to check app health as well as server health
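For the per-app connection cap, the shape I have in mind is roughly this – illustrative names, not our actual pool code:

```ts
// Cap concurrent database connections per app so one runaway app
// cannot exhaust the shared pool for everyone else.
const MAX_CONNECTIONS_PER_APP = 5; // placeholder limit
const inUse = new Map<string, number>(); // appId -> connections checked out

async function acquireForApp<T>(appId: string, run: () => Promise<T>): Promise<T> {
  const current = inUse.get(appId) ?? 0;
  if (current >= MAX_CONNECTIONS_PER_APP) {
    // Fail fast instead of letting this app starve all the others.
    throw new Error(`app ${appId} is at its connection limit`);
  }
  inUse.set(appId, current + 1);
  try {
    return await run(); // run() would check out a pooled connection and query
  } finally {
    inUse.set(appId, (inUse.get(appId) ?? 1) - 1);
  }
}
```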
11 Likes

Hi Josh,
During the error, our app pages were loading most of the time (even if slowly), but when it came to database queries with repeating groups etc., one of every two queries was failing. Think of two dropdowns which pull data from two separate search queries: the first one was loading 9 times out of 10, but the second was freezing. Another example was a repeating group with pagination: the first page was returning, but when we tried to navigate to other pages of the repeating group, the page froze. I know this does not tell you much, but I hope it helps.

Same experience as @cem - pages would load, but dropdowns and repeating groups wouldn’t pull data.

I experienced one issue which made us very aware of a security concern. During the loading of a page, the app loads a client profile into a state of the page. The page then shows a Repeating Group with data related to that profile. Let’s say the client is a shop, and the RG is a list of invoices.

The problem was that, because of the server trouble, the client profile was never loaded, and the state remained empty. The Repeating Group did load, however, and because the client state remained empty, it showed ALL invoices instead of just that client’s invoices.

Now, this is not a Bubble security hole per se – secure systems need much stronger protection in place, which I’m aware of, and privacy rules would probably have prevented this. Still, as many Bubble users are non-developers, such a precaution may not be obvious, so I thought I’d post it here as a warning: simply adding a constraint to an RG to keep information confidential cannot be considered secure.
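To spell the pattern out in pseudo-code – a generic sketch, not Bubble internals – a client-side constraint is just a filter the page asks for, so when its input never loads, nothing stops the query from matching everything. The enforcement has to live on the server:

```ts
interface Invoice { id: string; clientId: string }

// What a client-side constraint amounts to: if clientId never loaded,
// the filter silently matches every invoice in the database.
function fetchInvoicesInsecure(all: Invoice[], clientId?: string): Invoice[] {
  return clientId ? all.filter((inv) => inv.clientId === clientId) : all;
}

// Server-side enforcement (what privacy rules give you): refuse to return
// anything unless the requester is authorized for a specific client.
function fetchInvoicesSecure(all: Invoice[], requesterClientId: string | undefined): Invoice[] {
  if (!requesterClientId) throw new Error("not authorized: no client identity");
  return all.filter((inv) => inv.clientId === requesterClientId);
}
```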

2 Likes