[New Feature] New Aggregation Operators

Hi Everyone!

We wanted to let you know that we have added two new operators for Lists in dynamic expressions: “approximate count” and “cached aggregation”. Both of these operators are performance optimizers that explicitly allow you to trade accuracy for speed and lower capacity usage.

For some additional background, these two features were inspired by issues we faced on the bubble.io homepage, where we display the total number of Bubblers on the platform. With frequent page visits and millions of users to count, the volume of database calls used to calculate this number was putting a lot of stress on our system and slowing down our homepage. To work around this, we decided to trade a perfectly accurate number for one that was close enough but faster to return and consumed fewer resources. This led to the development of the ‘approximate count’ and ‘cached aggregation’ operators.

  1. :approximate count

As the name suggests, this operator returns an ‘approximate’ count of the number of items in a list. It uses some database internals to produce the estimate; in our internal tests on large tables, it runs orders of magnitude faster than the standard count command and has been within 2% of the actual count.
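
To make the idea concrete, here is a minimal Python sketch of the kind of database capability involved. This is an illustration, not our exact implementation: PostgreSQL, for example, maintains a planner row estimate for each table that can be read without scanning any rows. The connection and table name are placeholders.

```python
# Illustrative sketch only; not necessarily what Bubble does internally.
# PostgreSQL keeps a per-table row estimate in its statistics catalog,
# which can be read in constant time instead of scanning the table.
import psycopg2

def approximate_count(conn, table_name):
    """Read the planner's row estimate for a table from pg_class.

    The estimate is refreshed by autovacuum/ANALYZE, so it can lag the
    true count slightly: the accuracy-for-speed trade described above.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
            (table_name,),
        )
        return cur.fetchone()[0]
```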

  2. :cached aggregation

Caching is a concept that allows you to efficiently reuse previously retrieved or computed data. The new cached aggregation operator allows you to strategically cache expensive aggregation operations (depending on the type, this could be count, approximate count, max, min, median, average, sum, product). Specifically, it tells our servers that an aggregation’s result can be saved and reused across all page visits for a period of time, which we set to 30 seconds. This means, for example, that your application would only calculate an expensive (slow) aggregation at most 2 times a minute, instead of for every page visit, reducing your capacity usage and delivering a speedier (but potentially outdated) result for your customers.
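
To make the caching behavior concrete, here is a minimal sketch of the idea in Python. The function names and cache key are stand-ins for illustration, not our actual implementation; only the 30-second window comes from the description above.

```python
import time

CACHE_TTL_SECONDS = 30      # the window described above
_cache = {}                 # cache_key -> (result, computed_at)

def cached_aggregation(cache_key, compute_aggregation):
    """Return a cached aggregation result, recomputing at most once per TTL.

    `compute_aggregation` stands in for any expensive operation
    (count, approximate count, max, min, median, average, sum, product).
    """
    now = time.time()
    hit = _cache.get(cache_key)
    if hit is not None and now - hit[1] < CACHE_TTL_SECONDS:
        return hit[0]       # reuse the saved result (possibly slightly stale)
    result = compute_aggregation()
    _cache[cache_key] = (result, now)
    return result
```

With this shape, a thousand page visits in the same 30-second window trigger at most one real database computation; everyone else gets the saved value.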

Here are some example screenshots of what it looks like to use these new operators:


As always, we welcome your feedback and any questions you might have!

Happy Bubbling!

37 Likes

Amazing. I’m really loving the new product features for power users!

Thanks
ZubairLK

3 Likes

Agreed – Love these “power user” features. Much appreciated!

1 Like

Amazing.

The second one, cached aggregations, is very useful for me. It would be nice to be able to specify the caching time myself, for example 1 hour or 24 hours.

4 Likes

Hi, in one of my applications I expected better results:
Count: 11 seconds
Approximate count: 8 seconds
Cached aggregation with Approximate count: 8 seconds

Hi pintonuno,
It does look like your application saw a performance improvement. Are you sure the remaining 8 seconds the query took were due to the approximate count operator and not some other unrelated operation? If so, that is definitely a lot slower than the operator was in all of our testing, so if you think something’s not behaving correctly, please file a bug report here.

I should also mention that while the approximate count operator should essentially always provide a speed improvement over the count operator, the cached aggregation operator only speeds up the queries that follow the first one within the 30-second cache window.

3 Likes

Hi @matthew.slesinski!

Thanks for the update!

I can see the “approximate count” being very useful for ‘marketing’ metrics, where accuracy is not that important. In those cases, I’d trade having the exact metric for not stressing the server every time.

→ Have you considered having ‘approximate XXX’ so it could be applied to other operations like sum, product, average, etc.?

Introducing the ‘cache’ is even better! I’ve also seen that it can be applied to all the available operations, even to the ‘approximate count’, which is great.

→ And I agree with @lstk.kb: I’d like to be able to control the expiry time. I’d add an ‘expiry time’ field in seconds; that would alleviate the server stress even more.

I think that these features can be of great help in making applications scalable.

Thank you for your work, you are improving the platform a lot!

3 Likes

Hi @jmalmeida

I’m glad to see you getting some good use out of these operators! To respond to your ideas:

Sadly, an “approximate XXX” operator, where “XXX” is anything other than “count”, isn’t an option in the way it was for “count”, because we implemented “approximate count” by hooking into an internal tool for our database that estimates the number of entries that get returned for a given query. That tool doesn’t estimate results for any other aggregation operator.
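
For the curious, the analogous capability in PostgreSQL (again, just an illustration, not necessarily the tool we use) is the planner's row estimate from EXPLAIN, which only ever estimates how many rows a query returns and says nothing about the values in those rows:

```python
import json
import psycopg2

def estimated_rows(conn, query, params=None):
    """Ask the planner how many rows a query would return, without running it.

    EXPLAIN estimates only the *number* of rows, not their sum, average,
    etc., which is why this trick can't back an 'approximate sum'.
    Usage: estimated_rows(conn, "SELECT * FROM users WHERE plan = %s", ("free",))
    """
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + query, params)
        plan = cur.fetchone()[0]
        if isinstance(plan, str):   # some drivers return the JSON as text
            plan = json.loads(plan)
        return plan[0]["Plan"]["Plan Rows"]
```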

As for the idea of being able to control the cache timeframe that you and @lstk.kb proposed: I can see where you both are coming from with that idea. But we determined it wouldn’t actually be useful, because the actual amount of time that the results are cached for doesn’t really matter. The point of the operator is so that if your app gets a ton of visitors requesting an aggregate result in the same short period of time, it drastically reduces the number of times the database needs to calculate that result. If the database needs to calculate that result once every 30 seconds or once every 24 hours, it wouldn’t really make too much of a difference, because the load on the database would be well dispersed relative to the time it takes to calculate the result. But there’s a very large difference between needing to calculate it, for example, 1000 times in 30 seconds vs. 1 time in 30 seconds, since many of those requests would probably be overlapping.
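
The arithmetic behind that argument, as a quick back-of-the-envelope sketch in Python (the visit rate is hypothetical, purely for illustration):

```python
# Database computations per day for one aggregation, assuming a steady
# (hypothetical) 1000 requests every 30 seconds.
requests_per_day = 1000 * (86_400 // 30)   # 2,880,000 requests/day

no_cache = requests_per_day                # one computation per request
ttl_30s  = 86_400 // 30                    # 2,880 computations/day
ttl_24h  = 1                               # 1 computation/day

print(no_cache / ttl_30s)  # 1000x saved just by the cache existing at all
print(ttl_30s / ttl_24h)   # a further 2,880x, but on a load that was
                           # already negligible at 2,880 computations/day
```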

I hope this addresses your thoughts on these new operators, and please continue to share your ideas with us!

3 Likes

Hey @matthew.slesinski,

I totally understand the difference in stress with that comparison.

And at the same time I can’t help but think that a longer cache time would mean fewer operations, less power consumption, less hardware usage, reduced network congestion, etc. Multiplied by millions of applications, that is no small gain (even from an ecological point of view). In the end, we would all win. 1 operation/30 seconds vs 1 operation/24h is 2,880 times fewer. But hey, you guys have the data!

Don’t mind me too much, I’m thinking out loud.

And I definitely see your point, thanks for your response!

3 Likes

Quick question.

Is approximate count “accurate” at the bottom end of the scale? So if there are no results, would it always return zero? Or is one approximately zero?

@matthew.slesinski

@NigelG

It’s hard to say for sure. As I mentioned in my original post, the approximation implementation uses a capability provided by our database, so we don’t have complete knowledge of how it behaves. In our testing, it has seemed non-deterministic, in that even if two database tables have identical data, it won’t necessarily return the same approximate count for them. Of course, this has made it harder to get a completely reliable sense of how it behaves through testing, so I can really only answer your question by saying that it’s context dependent.

I should also stress that this operator is intended for cases where you want to count a lot of things. It is conceivable that the approximation’s error exceeds 2% for small table/count sizes. For instance, in your example, I can’t definitively say it wouldn’t return 1 when the actual count is 0, especially if the table is actually quite large but there are a lot of filters/privacy rules. And if it did, it would technically be off by an infinite/undefined amount, even though the magnitude of the error would be much smaller than a 2% error on a much larger table/count.

If accuracy for small numbers is important to you, I’d recommend testing the use case, possibly with mock data, and confirming it behaves as you’d want often enough. If it doesn’t, you can use conditionals so that when the result of the approximate count is smaller than X, where X is a moderately large number, the data source uses an actual count instead of an approximate one. (If you use this approach, it’s important to base the conditional on the approximate count value, not the actual count; otherwise you would be calling the actual count operator every time the result is requested, no matter what!) This way you’d guarantee accuracy for small counts, but only call the actual count operator in the cases where it isn’t expensive.
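
Sketched in Python, that conditional looks like the following (the threshold and the two count callables are placeholders for the corresponding Bubble data sources):

```python
SMALL_COUNT_THRESHOLD = 10_000   # "X": hypothetical, tune for your data

def safe_count(approximate_count, exact_count):
    """Cheap approximate count first; fall back to an exact count only
    when the estimate is small enough that exactness is cheap anyway.

    Note that the branch tests the approximate value. Testing the exact
    count would force the expensive query on every request.
    """
    estimate = approximate_count()
    if estimate < SMALL_COUNT_THRESHOLD:
        return exact_count()     # small result: exact count is affordable
    return estimate              # large result: the ~2% error is acceptable
```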

I’m sorry I can’t give a more definitive answer, but I hope this helps.

4 Likes

Thanks :slight_smile:

It was more about the accuracy of zero vs any other number. 1 vs 2 doesn’t matter; 0 vs 10,000 does.

So maybe in this case a filter would be better than checking for a count being <= 0?

But thanks for the explanation.

Thanks Matthew. My understanding is that all of the : options for sum, count, etc. were client-side expressions, and I have always used “group by” instead to perform aggregation, as I believed this ran directly on the database for performance. Am I wrong in this? And with :cached aggregation, is this now a server-side process?

Let’s say I have an RG and each cell has to fetch a number of aggregated values to populate different text boxes for each main record. I am aware that this is pretty heavy on the db with web sockets etc. Will :cached aggregation help with this at the expense of live updates every 30 seconds (which we can live with)?

Thanks.

approximate count is a cool update. I have previously been using a temp ‘stats’ table that gets updated nightly. I thought about updating it on demand when a new record is added to one of the tables, but real-time data for these stats isn’t a big deal.

Hi Gazinhio,

All of the aggregation operators (count, approximate count, sum, product, median, max, min, average, and any others I’m forgetting) run on the server in a bunch of situations. In fact, I think they’re likelier to run on the server than not. The main situations where we run them client side are when the search expression is complicated and can’t be turned into a database query, when there is an advanced filter applied, or when we can rely on client-side optimizations (such as when the underlying search has already been run and the results are in the client’s cache) so we don’t need to request much new information.

In the example you gave, there aren’t enough specifics for me to say whether the aggregations would run client or server side. But one way to think about it is that if the data you want for the text boxes is already on the items associated with each RG cell (e.g. each RG cell’s item has a field that is a list of values, and you want a count of that list), we will probably perform the operation client side. A contrasting example where we would perform the operation server side is if each RG cell’s item has a field that is a list of things, none of those things are already loaded on the client, and you want a sum of each of those things’ X field.

I should also be clear: the cached aggregation operator shouldn’t reduce the number of live update responses from the server, even if it makes those updates a lot less expensive to calculate on the server side. If underlying data changes, the server will still notice, recalculate the cached aggregation, and send it to the client. But that recalculation will be a lot faster because it will rely on the cached value instead of going to the database (if it’s within the 30 second window of the original computation), and then the client will just ignore that response because it will see that the value sent from the server is the same as the value already being displayed.

Also, a word of advice: there are two different kinds of performance concerns to consider when thinking about data requests: how intense a load is put on the database, and how many network requests go between your app’s client and our servers. The intensity of the load on the database is fairly self-explanatory in the cases where you have a very large query/computation. My point, though, is that when the query/computation isn’t too large, client-side computation isn’t uniformly less performant than a server-side one.

What makes client-side computations expensive is most often the round-trip time of requests to the server for a lot of data, as well as the overhead of having more/larger data to handle. So if the client needs information that it can only get from the database, it is faster to have the database do the aggregation work. This is not so much because the database is faster than the client, but because once the database has done the computation, the data that needs to be sent to the client is much smaller, which means less handling work for the client and possibly fewer requests/responses sent over the network. But if the client already has most of the data it needs to perform a calculation, it’s definitely more performant for it to do that calculation right away, since that saves the round trip to the server.

So in general, if you’re worried about the performance of a computation: we have a lot of logic that tries to do computations in the smartest/fastest way, and the main limitation on that logic is when we can’t make the right assumptions about which type of query is best to run. So a good rule of thumb when choosing how to define an expression is to go for the most direct way to express what you want. For example, if you want to apply a filter that can be specified by putting a constraint on the search, do that instead of using an advanced filter. If you want to get the max value for a field of a type of thing, use the max operator instead of sorting the search for that thing and taking the first item. If you want to perform an aggregation, call that aggregation directly instead of putting unnecessary operators like group by in between. The more closely the expression describes what you want, using the most specific operators we provide, the less likely our logic is to make an assumption it shouldn’t and do something less efficient. I hope this gives you some food for thought!
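
As a tiny Python analogy of the max-vs-sort point, both lines below produce the same answer, but the direct operator states the goal and does strictly less work:

```python
values = [3, 1, 4, 1, 5]

max(values)                       # direct: one pass over the data, O(n)
sorted(values, reverse=True)[0]   # indirect: a full O(n log n) sort,
                                  # just to take one item off the top
```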

3 Likes

@matthew.slesinski
I recently used the “approximate count” operator, and in a place where the number was supposed to be approximately 1000, it showed the number to be 11300.

This is a high inaccuracy level for our usage. People on our team actually got scared looking at that number and panicked.

Is this level of deviation expected? If the deviation is of this degree, then that would render these operators useless. We may as well hide that number if this “approximate” operator doesn’t even give us an accurate sense of scale. I mean, for a number that is supposed to be ~1000, I can understand an inaccuracy of 20-30%. But an inaccuracy of 1000% is pretty useless.

Is this operator more helpful for larger numbers?

@mghatiya
Just to be clear, when you use “:count” it shows 1000… and then for the same calculation, if you use “:approximate count” it shows 11300?

Yes, that’s what we saw.

1 Like

Hey @mghatiya
What you’re describing definitely sounds like an issue, outside the bounds of both what we saw in testing and what we expected from the operator. This is the first time I’ve seen/heard of this operator having results that are off by more than ~2% when the underlying count is more than 100. I can imagine how seeing such a different result than expected could be very scary. I’d encourage you to file a bug report with all the details of what you’re seeing. Some things I’m personally curious about for your use case are:

  • How consistently/deterministically can you reproduce this large inaccuracy?
  • How long have you been seeing this for?
  • If you file a bug report we’ll be able to see where in the app you have this approximate count specified, but in the meantime, do you have any filters applied to the underlying search?
  • Where else in your app do you do a search or aggregation on this type of data/thing, and is the inaccurate result reproduced both before and after you view the table for this data type in the App Data tab of the editor?
  • Have you done (or observed) any significant/large operations occur on this data type prior to observing this inaccuracy?

In the meantime, some workarounds I would suggest to mitigate this issue are:

  • You can switch to using a cached aggregation around a normal count operation. Also, a count of ~1000 is not very expensive, all things considered, so if you don’t expect it to get much larger for a while, there’s not much need to avoid a normal count operation.
  • You could have a conditional such that if the approximate count result is less than X (where X is meaningfully larger than 11300), it triggers a normal count operation instead.
  • Sometimes, with database internals like what we use behind the approximate count operator, triggering different but related operations on the same table can affect the table’s statistics in ways that make the approximate count more accurate, so you could try playing around with other queries/aggregations on that data type.
  • If this table doesn’t change super often, you could periodically (e.g. with a scheduled workflow) run and store the result of a normal count operation, and whenever you run an approximate count, check if the result is off by more than X%; if so, re-run and store the normal count operation and use that instead (see the sketch after this list).
  • If you only just started seeing this weird result, it’s possible that our database will update its internal statistics at some point in the near future (within a day or so, possibly). This could mean the issue resolves itself (though of course if it does, that makes it harder to investigate what happened and prevent it in the future, which makes it even more important that any bug report you file includes any potentially relevant information about operations done recently on that data type).
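
Here is a Python sketch of the drift-check workaround from the fourth bullet (the tolerance, the stored value, and the two count callables are all placeholders; in Bubble, the stored value would live in a thing updated by a scheduled workflow):

```python
import time

DRIFT_TOLERANCE = 0.10                 # "X%": hypothetical 10% tolerance
_stored = {"count": None, "at": 0.0}   # stands in for a value in your database

def _refresh(exact_count):
    """Re-run the exact count and store it, along with when it was taken."""
    _stored["count"] = exact_count()
    _stored["at"] = time.time()
    return _stored["count"]

def robust_count(approximate_count, exact_count):
    """Serve the cheap estimate, but re-anchor on an exact count whenever
    the estimate drifts too far from the last stored exact value."""
    estimate = approximate_count()
    baseline = _stored["count"]
    if baseline is None or baseline == 0:
        return _refresh(exact_count)
    if abs(estimate - baseline) / baseline > DRIFT_TOLERANCE:
        return _refresh(exact_count)
    return estimate
```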

I hope this gives you some ideas to work with and stuff to try.

1 Like

Can someone give me a good use case for approximate count?

Why would I want approximate information?