Opened 5 years ago

Last modified 22 months ago

#30685 assigned Cleanup/optimization

Optimize QuerySet.count() with distinct()

Reported by: Adam Sołtysik Owned by: Simon Charette
Component: Database layer (models, ORM) Version: 2.2
Severity: Normal Keywords:
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

I have a PostgreSQL table with 100 000 records and 15 columns.

A simple .count() query results in a fast SQL (execution time in seconds on the left):

(0.015) SELECT COUNT(*) AS "__count" FROM "table";

When we add .distinct(), a subquery is created with all columns SELECTed:

(0.178) SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (15 columns) FROM "table") subquery;

When instead of .distinct() we write .distinct('id') and add .order_by('id', 'col1', 'col2'), the subquery is additionally ORDERed:

(0.151) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1, ... (15 columns) FROM "table" ORDER BY "table"."id" ASC, "table"."col1" ASC, "table"."col2" ASC) subquery;

Funny thing is that without .distinct('id') we can write .order_by('non_existing_column') and it works without any exception.

After adding .values('id') and an empty .order_by(), the query is as fast as it can be with DISTINCT:

(0.053) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id") "table"."id" AS Col1 FROM "table") subquery;

I think that the subquery for count() should never contain additional columns nor be ordered (and the same probably goes for other aggregations).

Change History (16)

comment:1 by Carlton Gibson, 5 years ago

Resolution: invalid
Status: newclosed

Hi Adam,

From the description here it just looks like the expected behaviour, as per the various `Note` callouts in the `distinct()` documentation.

(If you really think there's a bug here, can I ask you to narrow it down to something more specific? What are the exact models and ORM calls in play. What is the exact SQL generated? Vs What do you expect and why? Thanks.)

comment:2 by Adam Sołtysik, 5 years ago

I don't think it's a bug, but I think there is a place for optimization. I can't see a reason why all columns are SELECTed in a subquery just to perform a simple count. I've attached the SQLs generated and what I would expect.

So, for any model, a query like this:

Model.objects.distinct().count()

should produce an SQL like this:

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1 FROM "table") subquery;

instead of:

SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (all other columns) FROM "table") subquery;

which, depending on the number of columns and size of the table, can be several times slower (at least on PostgreSQL).

If you think this is not worth the effort, then OK, I won't insist.

comment:3 by Carlton Gibson, 5 years ago

I think the issue is that there's no way for the distinct() call to know that you're going to follow up with a count() — and that code to special case for that isn't going to worth the effort.

If you know you only want to select a single field tell the ORM via `only()`. This'll give you want you need, I think: Model.objects.only('id').distinct().count().

(Maybe the ORM experts can tell you more, but I suspect that be more or less the end of the story...)

comment:4 by Simon Charette, 5 years ago

It should feasible to adapt sql.Query.get_count() to perform these optimizations by setting a .values mask when no columns are specified.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

You might have to tweak the suquery pushdown logic a bit as that's what kicks in when distinct is used but it should be feasible.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L421-L457

comment:5 by Carlton Gibson, 5 years ago

Resolution: invalid
Status: closednew
Triage Stage: UnreviewedAccepted

OK, fine, thank you for the input Simon. Let’s accept on that basis. 👍

comment:6 by Adam Sołtysik, 5 years ago

Thank you for accepting. I have added .values('pk').order_by() before my .count() queries, but I'll be glad to be able to get rid of it.

Just to add a bit more information. The "problem" is not only with .distinct(), also any .annotate() makes count() produce a subquery with the redundant annotations inside. For example, a query like this results in SQL that is 4 times slower for my data than the bare count:

Model.objects.annotate(id2=F('id')).count()
(0.057) SELECT COUNT(*) FROM (SELECT "table"."id" AS Col1, "table"."id" AS "id2" FROM "table" GROUP BY "table"."id") subquery;

Adding .values('pk') of course solves this as well.

comment:7 by Carlton Gibson, 5 years ago

Adding .values('pk') of course solves this as well.

Yes… and for the same reason, right? 🙂

Fancy taking on a PR as per Simon’s suggestion? (We’re happy to advise…)

comment:8 by Adam Sołtysik, 5 years ago

Summary: Suboptimal QuerySet.distinct().count()Optimize QuerySet.count() with distinct() or annotate().

Actually, wouldn't it be as simple as adding these two lines after obj = self.clone() here: https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

obj.set_values(['pk'])
obj.clear_ordering(force_empty=True)

I tried it and it seems to be working (didn't run tests, though). But I think I don't understand the second part of Simon's advice.

comment:9 by Simon Charette, 5 years ago

Summary: Optimize QuerySet.count() with distinct() or annotate().Optimize QuerySet.count() with distinct()

The annotation part is tracked in #28477 so I'll rename this ticket to only mention distinct.

I'm afraid we can't simply clear the values and ordering, else quite bit of code in get_aggregate would be rendundant.

One case that these changes don't consider is the fact that .values('foo').distinct() could influence the .count()

e.g.

SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM book)

Is not equivalent to

SELECT COUNT(*) FROM (SELECT DISTINCT book_id FROM book)

I suggest you open a PR with your changes so CI can report what exactly that breaks.

I'm getting three failures locally on SQLite; one related to the aforementioned .values().distinct() issue, one related to .datetimes().distinct() and one related to order_by spawning a multiple valued relationship.

comment:10 by Ivaylo Donchev, 5 years ago

Hi :)
Is this ticket free now ? Do you mind me start playing with an implemention of it

comment:11 by Adam Sołtysik, 5 years ago

I probably won't find time for a PR in the nearest time, so as far as I'm concerned, feel free to take it :)

comment:12 by Ivaylo Donchev, 5 years ago

Okay, thanks :) . Will play with it these days.

comment:13 by Ivaylo Donchev, 5 years ago

Owner: changed from nobody to Ivaylo Donchev
Status: newassigned

comment:14 by Mariusz Felisiak, 4 years ago

Owner: Ivaylo Donchev removed
Status: assignednew

comment:15 by Simon Charette, 2 years ago

Now that #28477 has been fixed the location where this should be handled should be made clearer as the elidable annotation part of the problem has been solved.

This issue has a lot of overlap with #25230 so I wonder if both should be merged (possibly have #25230 merged into this one?)

comment:16 by Simon Charette, 22 months ago

Has patch: set
Owner: set to Simon Charette
Patch needs improvement: set
Status: newassigned
Note: See TracTickets for help on using tickets.
Back to Top