Opened 9 years ago
Last modified 21 months ago
#25230 assigned Cleanup/optimization
Change to Query.get_count() causes big performance hit
Reported by: | dexity | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.8 |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Recent pull merge https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified generates SQL which causes severe performance hit when using distinct parameter.
Instead of getting something like:
SELECT DISTINCT COUNT('*') FROM some_model;
it generates SQL query which looks:
SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;
Related ticket: https://code.djangoproject.com/ticket/23875
Change History (11)
comment:1 by , 9 years ago
Description: | modified (diff) |
---|
comment:2 by , 9 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:3 by , 9 years ago
The above queries are identical because the distinct isn't actually doing anything. It won't be doing anything if we can guarantee there is some unique column in the query. For example, when there are no joins and the primary key of the main table is included in the query, then the distinct is redundant. We could be even smarter and detect when there are no multivalued joins in the query (and the pk is present).
Not sure how much trouble we should push to fixing this - the original distinct in the query is redundant, so it could be removed from the query in the reporter's application, at least for the count query.
comment:4 by , 9 years ago
For our database the newly generated query (with subquery) takes 30-40 sec to perform, the original query took 0.1 sec. Huge difference!
comment:6 by , 9 years ago
I do get a non negligible slowdown on PostgreSQL with a fully analyzed table containing ~30M rows.
database=# EXPLAIN ANALYZE SELECT COUNT('*') FROM (SELECT DISTINCT * FROM schema.table) subquery; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=6213779.77..6213779.78 rows=1 width=0) (actual time=88610.804..88610.804 rows=1 loops=1) -> Unique (cost=5290067.41..5858505.79 rows=28421919 width=117) (actual time=53265.590..86513.151 rows=28176350 loops=1) -> Sort (cost=5290067.41..5361122.20 rows=28421919 width=117) (actual time=53265.588..76718.379 rows=28176350 loops=1) Sort Key: table.col0, table.col1, table.col2, table.col2, table.col3, table.col4, table.col5 Sort Method: external merge Disk: 1296744kB -> Seq Scan on table (cost=0.00..522350.19 rows=28421919 width=117) (actual time=0.840..15830.723 rows=28176350 loops=1) Total runtime: 88695.934 ms (7 rows) database=# EXPLAIN ANALYZE SELECT COUNT(*) FROM schema.table; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=593404.99..593405.00 rows=1 width=0) (actual time=19139.072..19139.073 rows=1 loops=1) -> Seq Scan on table (cost=0.00..522350.19 rows=28421919 width=0) (actual time=5.683..17387.798 rows=28176350 loops=1) Total runtime: 19139.108 ms (3 rows)
Should we escalate this to release blocker?
follow-up: 10 comment:7 by , 9 years ago
akaariai's argument against fixing this makes sense to me.
The distinct in Model.objects.distinct().count()
is redundant, because the pk
enforces that each row will be distinct anyway. For the cases where distinct
will actually change the result, then a subquery is required.
So even if we decide to fix this, I don't think it warrants release blocker status. Fix the queryset by removing the redundant distinct
. If we can demonstrate that changing the queryset to be correct is non-obvious, then I think that might push the case for implementing a fix.
Really though, distinct is very rarely useful, and is indicative of a poorly constructed query (in sql or in django). Obviously there are useful distinct queries, but they are the exception in my experience.
comment:8 by , 9 years ago
We use MySQL. Model.objects.distinct().count()
is a simple query to make the problem obvious. There might be intermediate filters which apparently make distinct()
more relevant.
comment:9 by , 9 years ago
I'm not against fixing this. We could in general remove distinct from queries when we can prove it is redundant.
But, I don't think this is that high priority issue. I probably will get to fixing this at some point, and I'm definitely willing to review patches.
comment:10 by , 9 years ago
Replying to jarshwah:
Really though, distinct is very rarely useful, and is indicative of a poorly constructed query (in sql or in django). Obviously there are useful distinct queries, but they are the exception in my experience.
It becomes necessary if you use backreferences for FKs in your filter; SQL is stupid and multiplies results in that case.
comment:11 by , 21 months ago
Has patch: | set |
---|---|
Owner: | changed from | to
Patch needs improvement: | set |
Status: | new → assigned |
The exact line causing the performance regression is: https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233#diff-0edd853580d56db07e4020728d59e193R339
The
or self.distinct
component now forces a subquery. Might need Anssi to confirm if that part is needed or not, and whether we can be smarter in a restricted set of use cases.