#28477 closed Cleanup/optimization (fixed)
Strip unused annotations from count queries
Reported by: | Tom Forbes | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Reupen Shah | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
The query below produces a SQL statement that includes the Count('chapters'), despite not not being used in any filter operations.
Book.objects.annotate(Count('chapters')).count()
It produces the same results as:
Book.objects.count()
Django could be more intelligent about what annotations to include in the query produced by queryset.count()
, stripping out any annotations that are not referenced by filters, other annotations or ordering. This should speed up calls to count() with complex annotations.
There seems to be precedent for this: select_related calls are ignored with count() queries.
Attachments (1)
Change History (29)
comment:1 by , 7 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 7 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 7 years ago
comment:5 by , 6 years ago
Has patch: | set |
---|
comment:6 by , 6 years ago
Patch needs improvement: | set |
---|
The PR is still marked [WIP] and there are test failures.
comment:7 by , 6 years ago
I have also run into problems with QuerySet.count()
being very slow on annotated query sets. Django uses a subquery for the count but injects a group by into the subquery. This typically causes the database server to deduplicate all matched rows using the group by columns which is, in general, extremely slow when there are a large number of matched rows.
For example, consider the following model:
class Person(models.Model): """Person model.""" first_name = models.TextField() last_name = models.TextField() country = models.TextField(null=True, blank=True)
and query set:
from django.db.models.functions import Concat from django.db.models import Value queryset = Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name'))
queryset.count()
generates the following query under PostgreSQL:
SELECT COUNT(*) FROM (SELECT "support_person"."id" AS Col1, CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name" FROM "support_person" GROUP BY "support_person"."id", CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name"))) subquery
list(queryset)
generates:
SELECT "support_person"."id", "support_person"."first_name", "support_person"."last_name", "support_person"."country", CONCAT("support_person"."first_name", CONCAT(' ', "support_person"."last_name")) AS "full_name" FROM "support_person"
I am not entirely sure why the subquery for the count needs to be any different from the query used when the query set itself is evaluated. There are some relevant comments in the source code here: https://github.com/django/django/blob/5deb7a86e8b54d052a3b1dbed1ae7142d362b1c5/django/db/models/sql/query.py#L404-L414
This has all been tested under Django 2.1.7.
comment:8 by , 6 years ago
This is somewhat related to #30158 where the compiler is not smart enough to determine if it can exclude subquery annotations from GROUP BY
on aggregation.
comment:9 by , 6 years ago
The behaviour is actually a bit more bizarre than I thought.
With the same person model, here are four count variations and the generated queries:
1.
Person.objects.count()
SELECT COUNT(*) AS "__count" FROM "people_person"
2.
Person.objects.values('pk').count()
SELECT COUNT(*) AS "__count" FROM "people_person"
3.
Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()
SELECT COUNT(*) FROM ( SELECT "people_person"."id" AS Col1, CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person" GROUP BY "people_person"."id", CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) ) subquery
4.
Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).values('pk').count()
SELECT COUNT(*) FROM (SELECT "people_person"."id" AS Col1 FROM "people_person") subquery
So there's a simple workaround for the case I gave in my last comment in calling .values('pk')
before .count()
. However, that's not much help if Django or other libraries are the ones calling .count()
.
On the plus side, I have a better understanding of what the problem is from looking at the Django source code.
comment:10 by , 6 years ago
Cc: | added |
---|
comment:11 by , 6 years ago
Just a note to say that the behaviour I was describing was rectified in https://github.com/django/django/pull/11062.
The third case from above now executes the following SQL query:
SELECT COUNT(*) FROM ( SELECT CONCAT("people_person"."first_name", CONCAT(' ', "people_person"."last_name")) AS "full_name" FROM "people_person" ) subquery
Although the original ticket was about eliminating unneeded annotations completely, the result above is good enough for me, as the group by has been eliminated which was the real performance killer.
by , 4 years ago
Attachment: | ProductAndServiceUsage.json.zip added |
---|
comment:12 by , 2 years ago
Owner: | changed from | to
---|
comment:13 by , 2 years ago
Patch needs improvement: | unset |
---|
I submitted a revised PR that strip unused annotations not only for .count()
for but any usage of .aggregate
.
To take example from comment:7 and comment:9 the resulting query is now
Person.objects.annotate(full_name=Concat('first_name', Value(' '), 'last_name')).count()
SELECT COUNT(*) FROM support_person
The adjusted logic goes along these lines.
Instead of systematically performing a subquery wrapping when there are existing annotations only do so when the pre-existing annotation are aggregate or window functions. In both cases, pre-existing aggregation/window or not, strip annotations that are not referenced by the aggregates.
comment:14 by , 2 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:22 by , 19 months ago
Is there a way to opt out of this behavior from application code? After upgrading from 3.2 to 4.2 some of our pagination stopped working correctly and it was traced back to .count()
not providing the correct results, likely from the updates in this ticket.
Here is an example of the failure (using PostgreSQL JSON query):
>>> qs = MyModel.objects.annotate(table_element=Func("data", Value("$.MyArray[*]"),function="jsonb_path_query",output_field=JSONField())).filter(pk=1) >>> qs.count() 1 >>> len(qs) 2
This assumes a very simple model MyModel
with a JSON field data
that is has an instance (pk of 1) that has something the data field set to:
{ "MyArray": [{"id": 1, "name": "test"}, {"id": 2, "name": "test2"}] }
The issue is now the count
is not factoring in the annotation which will actually increase the number of rows returned in the queryset (for this example due to the jsonb_path_query which returns a set). It is only counting the number of rows of MyModel
which due to the pk
filter will only have one row returned.
Is there anyway to force the count operation to expand the query and include the annotation?
I tried to filter on the count of the jsonb path query, but it failed:
>>> qs = MyModel.objects.annotate(my_count=Count(Func("data", Value("$.MyArray[*]"),function="jsonb_path_query",output_field=JSONField()))).filter(pk=1, my_count__gt=0) >>> qs.count() Exception: django.db.utils.NotSupportedError: set-returning functions are not allowed in HAVING
comment:23 by , 19 months ago
Dustin, I think this answer is your way out.
Basically the optimization should be disabled for any set-returning function but since Django only has single native one,Subquery
, the optimization is only disabled through duck typing when the subquery = True
attribute is set.
In order to truly solve this issue I think we should introduce a new documented Expression.set_returning: bool
(better name welcome!) flag that defaults to False
but is set to True
for Subquery
.
The root of this problem is that the ORM simply doesn't support functions that return rows in a generic way (your assignment of output_field=JSONField()
is wrong but there's no way to say output_field=Set(JSONField())
). Instead it branches out using getattr(expr, "subquery", False)
in all cases that it makes the most sense to support them (e.g. __in
looups).
Same can be done for
exists()