Opened 5 years ago
Last modified 21 months ago
#30685 assigned Cleanup/optimization
Optimize QuerySet.count() with distinct()
Reported by: | Adam Sołtysik | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 2.2 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
I have a PostgreSQL table with 100 000 records and 15 columns.
A simple .count()
query results in a fast SQL (execution time in seconds on the left):
(0.015) SELECT COUNT(*) AS "__count" FROM "table";
When we add .distinct()
, a subquery is created with all columns SELECTed:
(0.178) SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (15 columns) FROM "table") subquery;
When instead of .distinct()
we write .distinct('id')
and add .order_by('id', 'col1', 'col2')
, the subquery is additionally ORDERed:
(0.151) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1, ... (15 columns) FROM "table" ORDER BY "table"."id" ASC, "table"."col1" ASC, "table"."col2" ASC) subquery;
Funny thing is that without .distinct('id')
we can write .order_by('non_existing_column')
and it works without any exception.
After adding .values('id')
and an empty .order_by()
, the query is as fast as it can be with DISTINCT:
(0.053) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1 FROM "table") subquery;
I think that the subquery for count()
should never contain additional columns nor be ordered (and the same probably goes for other aggregations).
Change History (16)
comment:1 by , 5 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
comment:2 by , 5 years ago
I don't think it's a bug, but I think there is a place for optimization. I can't see a reason why all columns are SELECTed in a subquery just to perform a simple count. I've attached the SQLs generated and what I would expect.
So, for any model, a query like this:
Model.objects.distinct().count()
should produce an SQL like this:
SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1 FROM "table") subquery;
instead of:
SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (all other columns) FROM "table") subquery;
which, depending on the number of columns and size of the table, can be several times slower (at least on PostgreSQL).
If you think this is not worth the effort, then OK, I won't insist.
comment:3 by , 5 years ago
I think the issue is that there's no way for the distinct()
call to know that you're going to follow up with a count()
— and that code to special case for that isn't going to worth the effort.
If you know you only want to select a single field tell the ORM via `only()`. This'll give you want you need, I think: Model.objects.only('id').distinct().count()
.
(Maybe the ORM experts can tell you more, but I suspect that be more or less the end of the story...)
comment:4 by , 5 years ago
It should feasible to adapt sql.Query.get_count()
to perform these optimizations by setting a .values
mask when no columns are specified.
You might have to tweak the suquery pushdown logic a bit as that's what kicks in when distinct
is used but it should be feasible.
comment:5 by , 5 years ago
Resolution: | invalid |
---|---|
Status: | closed → new |
Triage Stage: | Unreviewed → Accepted |
OK, fine, thank you for the input Simon. Let’s accept on that basis. 👍
comment:6 by , 5 years ago
Thank you for accepting. I have added .values('pk').order_by()
before my .count()
queries, but I'll be glad to be able to get rid of it.
Just to add a bit more information. The "problem" is not only with .distinct()
, also any .annotate()
makes count()
produce a subquery with the redundant annotations inside. For example, a query like this results in SQL that is 4 times slower for my data than the bare count:
Model.objects.annotate(id2=F('id')).count()
(0.057) SELECT COUNT(*) FROM (SELECT "table"."id" AS Col1, "table"."id" AS "id2" FROM "table" GROUP BY "table"."id") subquery;
Adding .values('pk')
of course solves this as well.
comment:7 by , 5 years ago
Adding .values('pk') of course solves this as well.
Yes… and for the same reason, right? 🙂
Fancy taking on a PR as per Simon’s suggestion? (We’re happy to advise…)
comment:8 by , 5 years ago
Summary: | Suboptimal QuerySet.distinct().count() → Optimize QuerySet.count() with distinct() or annotate(). |
---|
Actually, wouldn't it be as simple as adding these two lines after obj = self.clone()
here: https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514
obj.set_values(['pk']) obj.clear_ordering(force_empty=True)
I tried it and it seems to be working (didn't run tests, though). But I think I don't understand the second part of Simon's advice.
comment:9 by , 5 years ago
Summary: | Optimize QuerySet.count() with distinct() or annotate(). → Optimize QuerySet.count() with distinct() |
---|
The annotation
part is tracked in #28477 so I'll rename this ticket to only mention distinct
.
I'm afraid we can't simply clear the values and ordering, else quite bit of code in get_aggregate
would be rendundant.
One case that these changes don't consider is the fact that .values('foo').distinct()
could influence the .count()
e.g.
SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM book)
Is not equivalent to
SELECT COUNT(*) FROM (SELECT DISTINCT book_id FROM book)
I suggest you open a PR with your changes so CI can report what exactly that breaks.
I'm getting three failures locally on SQLite; one related to the aforementioned .values().distinct()
issue, one related to .datetimes().distinct()
and one related to order_by
spawning a multiple valued relationship.
comment:10 by , 5 years ago
Hi :)
Is this ticket free now ? Do you mind me start playing with an implemention of it
comment:11 by , 5 years ago
I probably won't find time for a PR in the nearest time, so as far as I'm concerned, feel free to take it :)
comment:13 by , 5 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:14 by , 3 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:15 by , 2 years ago
comment:16 by , 21 months ago
Has patch: | set |
---|---|
Owner: | set to |
Patch needs improvement: | set |
Status: | new → assigned |
Hi Adam,
From the description here it just looks like the expected behaviour, as per the various `Note` callouts in the `distinct()` documentation.
(If you really think there's a bug here, can I ask you to narrow it down to something more specific? What are the exact models and ORM calls in play. What is the exact SQL generated? Vs What do you expect and why? Thanks.)