Hi all,
This is another in our series of monthly community updates (link to previous one).
This month was a mixture of wins and setbacks. We made some progress that we’re very excited about, but we’re also feeling acutely the tradeoff between shipping things fast and managing reliability, which is informing a number of our current priorities.
Changes we made this month
On the performance front, we had another big release, this time focused on our “Make change to a list of things…” and “Copy a list of things…” actions. It involves better backend logic for batch processing, and accomplishes three things: 1) decreasing the amount of capacity these actions use, 2) decreasing the total time they take to run on the server, so workflows complete faster, and 3) paving the way for future performance improvements that will benefit other data modification actions as well. In terms of numbers, on a professional-plan benchmark app we use, copying 99 things went from 26.5 seconds to 6.0 seconds, and, even more dramatically, now uses only 5% of the capacity it used to. We’re excited by the progress and look forward to more improvements in the future.
(Note that even with this change, we still recommend the “Make change to a list of things…” action only for working with relatively small lists, on the order of 1 - 100 items. For large-scale data processing, we still recommend using recursive API workflows or the editor’s Bulk tool.)
The other very visible change this month was the release of our new homepage design! We’re happy with how it came out, and are enjoying the change of scenery on the pages we visit every day. We’re still updating our other websites and assets (such as this forum) to bring them in line with the new look, and expect those changes to roll out over the next couple of weeks.
On the educational front, we launched a 10-video Bubble Crash Course that’s perfect for new Bubblers looking to master the basics. We’re also continuing to see very high demand for our bootcamps, which we’re very excited about.
Our community continues to amaze: we’ve published 20 new “App of the Day” blog posts featuring the incredible apps you are all building.
On the product front, we didn’t have too many releases this month (although we have a couple of things close to release: see the “What we’re currently working on” section below). We made a small improvement to the Bubble-built Segment plugin to support sending events to Segment from the server in addition to from the client, which is useful for high-importance, one-off events that you want to guarantee make it to Segment. Sending from the client uses less capacity, but events can get blocked or dropped depending on the user’s web browser or internet connectivity.
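For those curious what server-side event delivery looks like outside of Bubble, here’s a minimal sketch using Segment’s Node.js library (analytics-node). The write key, user ID, and event name are placeholders, and this is just an illustration of the general pattern; the plugin handles this wiring for you.

```typescript
// Minimal sketch: sending a high-importance event to Segment from the server
// rather than the browser. The write key, user id, and event name are placeholders.
import Analytics from "analytics-node";

const analytics = new Analytics("YOUR_SEGMENT_WRITE_KEY");

// Unlike client-side tracking, this call runs on the server, so it can't be
// blocked by ad blockers or dropped when the user's connection flakes out.
analytics.track({
  userId: "user_123",
  event: "Subscription Purchased",
  properties: { plan: "professional", amount: 115 },
});

// Flush before the process exits so queued events actually reach Segment.
analytics.flush((err) => {
  if (err) console.error("Failed to deliver events to Segment", err);
});
```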
We’ve rolled out a new process for doing retrospective reviews on our customer success team. Every Wednesday, we now review every customer interaction where we got a less-than-good rating to see if there was a way we could have handled the situation better. It seems like it may be paying off: 91% of the users who rated our support in July said it was great, up 8% from the previous month.
On the team front, we’re excited to have a few new faces: Yaw and Ethan on the growth team, and Jonah on the engineering team. And Theodor will be joining the engineering team on Monday. We’re also happy to have our engineering interns Zoe and Sweta back with us for part of the summer. If you contact support@bubble.io, you might see their names: we have all new hires, regardless of their role, spend a couple of weeks on the success team, because we want everyone at Bubble to understand from firsthand experience what we’re all about, namely solving problems for our users and helping them succeed.
This month in numbers
- Total customers who reached out to us through bug reports or support@bubble.io: 1,633 (up 8% from last 30 days)
- Total received messages: 2,828 (down 1% from last 30 days)
- Average response time to messages: 5h 18m counting only our work hours; 13h 9m in absolute terms
- Total bug reports: 616 (down 8% from last month)
- Time to resolve bug reports escalated to the engineering team: for bugs resolved in the last 4 weeks, it took on average 4.0 days for engineers to investigate and deploy a fix or find a workaround for the customer.
Things on our minds
We, along with many of our users who do business in Europe, are concerned about the implications of the recent Schrems II judgment striking down the EU-U.S. Privacy Shield. Whenever an industry-wide legal change like this occurs (especially a sudden, court-initiated one), it creates a massive amount of uncertainty and confusion as lawyers all over the world attempt to interpret the often-ambiguous path forward. Like you, we’re trying to figure out what to do about it. You can follow our updates on this forum thread: our current plan of action is to implement Model Clauses, which we’re working on with our legal team.
Several users have asked about hosting in Europe as a response to this. Although we can use AWS to set up servers in Europe (which we offer to customers on dedicated plans), our team is mostly based in the United States, and we rely on US-based subprocessors. Attempting to completely eliminate our dependencies on the US would involve much more than just changing the physical location of our servers: in order to maintain and support them without sending data to the US, we would likely need to establish local offices and restructure chunks of our infrastructure. Long term, our aspiration is to be a truly global company that has no dependencies on any particular jurisdiction, but as a short-term response, this would be challenging to make practical at our current size. So our first avenue of investigation is working with our legal team to find a path forward. This is a situation we share with all US-based tech companies that aren’t multinational corporations, so I’m optimistic that the industry will collectively reach an equilibrium.
On the reliability front, we’ve had some struggles this month, some as a result of long-term issues that came to the forefront, and others as a result of ongoing development work we’re doing.
One major source of problems was a bug we’ve had for years that, under certain circumstances, can result in the creation of multiple elements that share the same underlying unique identifier. This was a relatively infrequent issue that had been causing occasional situations where user apps got stuck in a corrupt state: we were aware that this sometimes occurred, but weren’t sure why. Recently, some changes we made to the new application template led to this occurring much more frequently, turning what used to be a once-in-a-while problem into a major source of bug reports and issues. We went through a few iterations of trying to fix it: a couple of years ago, we wrote code that attempted to automatically repair apps in this situation, but we discovered, thanks to the recent influx of bug reports, that this code often caused more problems than it solved, and was a major cause of “Some elements in my app disappeared mysteriously!” bug reports. We went through a couple of iterations of trying to improve this auto-fixer code, some of which made the situation worse, and eventually got to what I think is a good equilibrium: a) we solved the main cause of this situation occurring in the first place, b) we put in a protection mechanism that blocks this situation from occurring due to other causes by throwing an error at the point in time it would happen, instead of letting the app get corrupted, and c) we updated the auto-fixer so that it fixes this when it’s safe to do so, and outputs debugging information that a human can use to fix it manually when it is not safe to fix automatically. We’ve seen the number of incidents related to this drop over the past week, so I believe we’re on the way to recovery, although there might still be affected apps out there whose owners haven’t noticed the problem yet.
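To give a flavor of what a protection mechanism like (b) can look like, here’s a rough sketch (not our actual code, and all names are hypothetical) of a save-time guard that refuses to persist an app whose elements share a unique identifier, rather than letting the corruption through:

```typescript
// Illustrative sketch only: a save-time guard that throws if two elements in
// an app share the same unique id, instead of silently persisting a corrupt
// state. Element and saveApp are hypothetical names, not Bubble internals.
interface Element {
  uniqueId: string;
  type: string;
}

function assertNoDuplicateIds(elements: Element[]): void {
  const seen = new Map<string, Element>();
  for (const el of elements) {
    const existing = seen.get(el.uniqueId);
    if (existing) {
      // Failing loudly here is the whole point: a visible error at save time
      // is much easier to recover from than an app quietly corrupted on disk.
      throw new Error(
        `Duplicate element id ${el.uniqueId} (${existing.type} vs ${el.type}); refusing to save`
      );
    }
    seen.set(el.uniqueId, el);
  }
}

function saveApp(elements: Element[]): void {
  assertNoDuplicateIds(elements);
  // ...persist the app only after the check passes...
}
```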
The other major source of problems has been our work on our new asset generation system. As a reminder, this is a major project we’ve undertaken over the last couple of months that’s meant to address reliability problems: specifically, other ways apps can get stuck in corrupt states, and issues with version control merges. It also paves the way for a number of performance improvements, because it makes it easier for us to compile assets; this includes moving the issue-checker server-side instead of running it in the web browser, which would be a major editor performance win. We’re still very excited about this project, but unfortunately this month it was a case study in temporarily making things worse in order to make things better: because it’s a major change, it moves a bunch of things around, which disrupts the equilibrium of the system and gives rise to new bugs. We believe these are mostly growing pains and should be solvable, but for the most part, this month was spent fixing problems introduced by this work rather than actually making forward progress on the project, which is disappointing to us and frustrating to our users. The problems we’ve been battling this month fall into two major categories:
a) Page load errors and missing-element errors when running user applications. These problems occur because of the way the new system determines when it needs to regenerate assets as user applications change: it has a tracking mechanism that records which parts of the app it needs to read in order to figure out the results of computations. Unfortunately, there are ways in which old code that wasn’t designed with this system in mind can sabotage the computation, leading to miscalculations that can break proper rendering of applications. We’re systematically tracking down each of these bugs and fixing them, and at this point, we think we’ve narrowed it down to a handful of rare cases that every once in a while break a user application. Luckily, there’s an easy workaround for this kind of bug: making an edit to the element that isn’t rendering properly will almost always make the issue disappear, at least for the set of rarer bugs that are still out there. (The symptom of this issue is that an element won’t be drawn on the page, and there will be a low-level system error whose message contains ‘pushes’, e.g. “Cannot read property ‘pushes’ of undefined”, in the browser’s JavaScript console.)
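For readers curious how a stale read can break rendering, here’s a simplified sketch of the general dependency-tracking idea described above (all names are hypothetical, and the real system is far richer): a computation records which parts of the app it read, and an edit only triggers regeneration when it touches one of those recorded keys, so any read that bypasses the tracker leaves a stale result looking valid.

```typescript
// Simplified illustration of read-tracked, incremental asset regeneration.
// All names here are hypothetical; this is not Bubble's actual implementation.
type AppData = Record<string, unknown>;

class TrackedComputation<T> {
  private readKeys = new Set<string>();
  private cached: T | undefined;

  constructor(private compute: (read: (key: string) => unknown) => T) {}

  // Run the computation, recording every key it reads along the way.
  run(app: AppData): T {
    this.readKeys.clear();
    const read = (key: string) => {
      this.readKeys.add(key);
      return app[key];
    };
    const result = this.compute(read);
    this.cached = result;
    return result;
  }

  // Recompute only when an edited key is one this result actually depends on.
  // If some code path reads app data without going through `read`, edits to
  // that data won't invalidate the cache, which is how stale, broken assets
  // sneak through.
  onEdit(app: AppData, editedKey: string): T {
    const cached = this.cached;
    if (cached !== undefined && !this.readKeys.has(editedKey)) {
      return cached;
    }
    return this.run(app);
  }
}

// Hypothetical usage: this page's HTML depends only on the app's "header" key.
const pageHtml = new TrackedComputation((read) => `<h1>${read("header")}</h1>`);
```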
b) Performance issues loading pages in run mode: a number of users have reported occasional, hard-to-reproduce cases of pages sometimes taking over 30 seconds to show anything at all (or timing out with a Cloudflare error). This happens when our asset-building system has to completely rebuild all the assets for a page. Normally, this shouldn’t occur when editing an app, because our system only rebuilds based on the changes that were actually made. However, we discovered recently that we weren’t keeping the generated assets around for as long as we thought we were, leading to total rebuilds for pages that were only mildly edited, or not edited at all. It took us a while to realize this was going on, because we were expecting the very first load of a page after a major change to be a little slower under the new system, which obscured the fact that the problem went beyond that. (This problem of things being slow the very first time you load a page after major changes will probably be around for a while, since it’s one of the tradeoffs of the new architecture; we have a few things we’re working on to make the effect less noticeable, and we have long-term plans to eliminate the problem entirely.) Once we caught on to what the issue was, we rolled out a solution, which involves a) adjusting the way we hold onto compiled assets to keep them from disappearing prematurely, and b) building a backup system that stores them near-permanently, so that if they disappear from the primary system we can recover without doing a rebuild. We rolled this out earlier this week. Unfortunately, the changes only apply to newly-generated assets, so there might be one more slow load per page per app before this problem disappears: we expect it to go away gradually over the next couple of weeks, rather than be solved instantly.
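As a rough mental model of the fix in (a) and (b), think of it as a two-tier cache: a fast primary store whose entries can expire, backed by a near-permanent secondary store, so an expired entry costs a cheap copy rather than a full rebuild. The sketch below uses hypothetical names and in-memory maps in place of our real storage:

```typescript
// Illustrative two-tier cache for compiled page assets (hypothetical names).
// Tier 1 is fast but entries may expire; tier 2 is near-permanent. A miss in
// tier 1 with a hit in tier 2 costs a copy, not a full rebuild.
type PageKey = string;
type CompiledAssets = string;

class AssetCache {
  private primary = new Map<PageKey, CompiledAssets>(); // fast, may evict
  private backup = new Map<PageKey, CompiledAssets>();  // near-permanent

  constructor(private rebuild: (page: PageKey) => CompiledAssets) {}

  get(page: PageKey): CompiledAssets {
    const hot = this.primary.get(page);
    if (hot !== undefined) return hot;

    const cold = this.backup.get(page);
    if (cold !== undefined) {
      this.primary.set(page, cold); // repopulate the fast tier
      return cold;
    }

    // Only reach the expensive full rebuild when both tiers miss,
    // e.g. right after a change that genuinely invalidated the assets.
    const built = this.rebuild(page);
    this.primary.set(page, built);
    this.backup.set(page, built);
    return built;
  }

  invalidate(page: PageKey): void {
    this.primary.delete(page);
    this.backup.delete(page);
  }
}
```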
In addition to the above issues, we also had a number of bugs caused by us rolling out bad code and having to immediately revert it after getting bug reports. Each time this happens, we build an automated test to protect against that bug ever happening again, and we’re steadily building up our test coverage. We’re also working on a project (see below) to mitigate these occurrences.
So, all in all, it’s been a tough month on the reliability front. The total number of open bug reports is higher than it was at the beginning of the month, and our response times slipped a bit. That said, our new system of having engineers respond directly to users while prioritizing getting as many users unblocked as quickly as possible still seems to be working, and while our numbers aren’t as good as last month’s, we haven’t fallen completely behind.
What we’re currently working on
On the engineering team:
- We’re now basically finished with our proof-of-concept for moving apps between different databases in our main cluster behind the scenes. The next step is to test and implement the actual migration: we plan to spread our main cluster apps from one huge database cluster onto a number of smaller clusters, which should be a big performance and reliability win.
- Also on the infrastructure front, we’ve begun work on building the capacity to split the main cluster into a fast track that gets code immediately, and a slow track that gets code only after it has been in production long enough that we’re confident it’s stable. While a lot of our users want the latest and greatest version of Bubble, users who are at the point of scaling up their applications would often prefer to trade instant availability of new code for more stability. This is one of the reasons that users upgrade to dedicated plans, but we think there’s a middle ground: users who are far enough along that stability is really important to them, but who aren’t yet at the point of investing in dedicated. We plan to offer this as an opt-in feature on our higher-tier main cluster plans, as a way of mitigating the impact of bad code rollouts and transient production issues on our customers with more established businesses.
- We’ve entered our internal QA phase for a feature that allows bulk importing and exporting of translations from the Settings -> Languages tab.
- We’ve also entered the QA phase of a Slack plugin that will allow logging into apps with Slack, creating Slack bots that are controlled by apps, and automating Slack actions.
- In June, we had almost completed an overhaul of the code that powers the Input element, which should lead to increased reliability and better behavior on Android. We temporarily put this work on pause in July for resourcing reasons, but we plan to resume it this week and expect it to roll out shortly.
- We fixed a few more bugs in our Google Optimize integration that came up during alpha testing, and are now almost ready to launch it to beta.
- We’re in the initial development stages of a Zapier integration that will allow native two-way integration between Bubble apps and Zapier.
- We’ve started work on fully customizable page URLs: instead of https://my-app.com/my-page/the-title-of-my-thing-1596308125537x153766304138575070, we’ll support https://my-app.com/my-page/the-title-of-my-thing.
- We’re kicking off work next week on exact-match database searching. Currently, when you build an operator that checks if “Current User’s text contains Some Text”, we do a precise match on “Some Text”, whereas if you do a “Search for Users where text contains Some Text”, we use Postgres’s full-text-search implementation, which does fuzzy matching rather than looking for that exact phrase (the sketch after this list illustrates the difference). This is confusing and often not what the app developer wants, so we’re building out the option of doing precise matching in searches (we’ll keep the old full-text-search functionality for people who want the fuzzy matching, but we’ll rename it to avoid confusion).
- As mentioned above, we didn’t make much progress on the transition to our new asset-building system this month because our time was spent fixing bugs with it, but we hope to move it further along in the coming month: our next big milestone is the rollout of server-side issue checking. We see this project as being on the critical path for resolving some of the bugs with our version control feature.
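To make the “contains” distinction in the search item above concrete, here’s roughly how the two behaviors differ at the SQL level. These are illustrative queries only, with made-up table and column names, not our actual schema:

```typescript
// Illustrative only: how "contains" can mean two different things in Postgres.
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* env vars

// Precise match: only rows whose text literally contains the exact phrase.
async function exactContains(phrase: string) {
  return pool.query(
    "SELECT id FROM users WHERE bio ILIKE '%' || $1 || '%'",
    [phrase]
  );
}

// Full-text search: tokenizes and normalizes words, so it matches rows that
// contain the individual terms in any order or inflection, which feels
// "fuzzy" compared to an exact phrase match.
async function fullTextContains(phrase: string) {
  return pool.query(
    "SELECT id FROM users WHERE to_tsvector('english', bio) @@ plainto_tsquery('english', $1)",
    [phrase]
  );
}
```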
The two big multi-month projects we’ve been working on are still ongoing:
- The complete redesign of our editor has moved forward: the Workflow tab is now done, and we’ve broken ground on the Plugins and Styles tabs. Two of our newer engineers are now far enough along in their training to start serious project work, and they are pitching in to get this project over the finish line.
- The ground-up overhaul of all our educational and reference content, including the reference, manual, and video tutorials, is continuing: the release of the Bubble Crash Course, mentioned above, is the first major user-facing milestone.
We’ve also made progress on our commitment to making Bubble accessible as a resource for underrepresented communities: we’ve put the team in place, and we’re now in the program-design phase. We hope to have more announcements on this front later this month.
Finally, we’re working on a couple of internal projects to improve team productivity:
- We’re replacing the software that we use to manage support@bubble.io and our bug report form with a more flexible tool.
- We’re continuing to build out our internal analytics capabilities, along with a training course to help our team get up to speed faster on using them: this will help us make better prioritization decisions about where to spend our time.
- We’re in the middle of developing an internal training curriculum for new engineers joining the team; we have part of it written already, and our newest engineering hires are currently taking it. We’re excited about this, because it should let us hire and ramp up engineering talent significantly faster: engineering time is one of the biggest bottlenecks on our ability to do things quickly.
We’re also continuing to hire: we’re searching for an additional member of our Success team, and, with many top college students choosing to take a gap year because of COVID, we’re trying to recruit some of the best and brightest young engineers among them.
Thank you for reading this (lengthy) update, and for all the support and enthusiasm!
Best,
Josh and Emmanuel