#17144 closed Bug (fixed)
Excessive GROUP BY clauses during aggregations.
Reported by: | christian_oudard | Owned by: | Anssi Kääriäinen |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.3 |
Severity: | Normal | Keywords: | |
Cc: | Michael Curry, kmike84@…, YenTheFirst, baryshev@…, anssi.kaariainen@…, hcarvalhoalves@…, dev@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Doing a simple aggregation using the Django ORM on MySQL, it is producing a GROUP BY clause that includes an unnecessary field. In the particular case where I discovered this, the unnecessary field is very large, and was slowing down the query by 100-fold.
Here is a simplified version of the model:
class Document(models.Model): data = models.TextField() # Very large field class Attachment(models.Model): document = models.ForeignKey(Document)
And the query I am running:
Document.objects.annotate(num_attachments=Count('attachment'))
And the SQL output:
SELECT `document_document`.`id`, `document_document`.`data`, COUNT(`document_attachment`.`id`) AS `num_attachments` FROM `document_document` LEFT OUTER JOIN `document_attachment` ON (`document_document`.`id` = `document_attachment`.`document_id`) GROUP BY `document_document`.`id`, `document_document`.`id`, `document_document`.`data` ORDER BY NULL
Doing GROUP BY on the data field is unnecessary, and it did not do this in Django 1.2.
From the Django mailing list:
Karen Tracey On Tue, Nov 1, 2011 at 6:19 PM
The SQL generated by the ORM for this query changed between Django version
1.2 and 1.3. The 1.2 SQL did a group by only on the id field. With 1.3
we're getting id twice and then all other fields in the model. Bisection
shows the change was made with r14715:
https://code.djangoproject.com/changeset/14715
It certainly looks to me like the old SQL was correct and preferable for
this particular case. In a brief search I did not find a ticket reporting
this issue -- could you open one?
Attachments (2)
Change History (27)
comment:1 by , 13 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 13 years ago
Made a regression test reproducing the issue, and attached a patch.
The problem seems to happen because self.query.select
is being added to the columns in compiler.get_grouping()
. I don't want to remove it though, because it might break the postgres fixes from r14715.
Does anyone have any insight into how best to fix this?
by , 13 years ago
Attachment: | group_by_bug_regressiontest.patch added |
---|
Regression test for the issue
comment:3 by , 13 years ago
This is a larger issue that merely the double inclusion of the pk. For any query using annotation, ALL fields from query.select are listed in the groupby clause, even when it would only be appropriate for the grouping to only occur on the primary key. For mysql this basically guarantees that all annotated queries will use the worst possible "Using temporary; Using filesort" method to evaluate the sql.
comment:4 by , 13 years ago
Cc: | added |
---|
comment:5 by , 13 years ago
Cc: | added |
---|
comment:6 by , 13 years ago
Cc: | added |
---|
follow-up: 8 comment:7 by , 13 years ago
@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.
comment:8 by , 13 years ago
Replying to anonymous:
@sbussetti is absolutely right. I can't do half the things I need to do with the ORM once I need annotation. The issue of group by needs to be revisited, and perhaps redesigned. This is a major caveat in my day to day work.
Agreed - this is also preventing me from doing most of the queries I need to do - a huge road block in my work.
comment:9 by , 13 years ago
Owner: | changed from | to
---|
I've randomly stumbled upon this issue this week.
comment:10 by , 13 years ago
Cc: | added |
---|
I confirm this regression, .annotate() does add unnecessary fields to GROUP BY clause, when .values() is not used. On large datasets this bug results in a bad query performance.
comment:11 by , 13 years ago
this bug is really annoying for many projects. can this be patched or does it need some deep refactoring ?
what can i do to help ?
i can try to play with https://code.djangoproject.com/changeset/14715 but i dont have (yet) any idea of underlying behaviour
comment:12 by , 13 years ago
Cc: | added |
---|
I will try to take a look. BTW I think PostgreSQL 9.1 onwards has support for grouping by just PK fields, so this is of interest for users of that DB, too.
comment:13 by , 13 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
I think the attached patch fixes this issue. The aggregation + aggregation_regress tests pass on sqlite, pgsql and mysql. Full test suite pass on sqlite.
I will leave my computer running the full test suite on mysql and postgresql, and see if there is any surprising breakages.
The patch does a general cleanup of compiler.py get_grouping. In addition it has a small cleanup to sql/query.py, removing an unused variable (I think the quote_cache was moved to compiler.py, but it wasn't removed from query.py).
Adding group_by PK support to PostgreSQL is harder: for PostgreSQL every table which is used in the select, order by or where clauses needs to have its PK added to the GROUP BY. For MySQL it is enough to add the main table's PK. This needs a somewhat larger restructuring of query.py.
comment:14 by , 13 years ago
Hi akaariai, thanks for looking at this nasty bug;
Just applied the patch on a fresh 1.4, i'm testing with sqlite and still have a strange behaviour here.
results = mymodels.Book.objects.annotate(models.Count('category'))
All the model columns are still present in the group_by so i think it still fail; did i miss something ?
thanks!
comment:15 by , 13 years ago
Only MySQL supports grouping by primary key. The SQL specification actually requires all columns in the select list to appear in the group by clause. MySQL just skips this requirement, and there is something similar in PostgreSQL 9.1+.
So, to test the patch you need to be using MySQL. I hope to integrate support for PostgreSQL 9.1+ too, but that will be done in separate ticket.
by , 13 years ago
Attachment: | 17144.diff added |
---|
comment:16 by , 13 years ago
The patch now passes the full test suite on mysql, sqlite and pgsql. I don't have Oracle installation available, so can't test that one.
If MySQL users could test & report if this fixes the performance problems it would help a lot.
The patch was updated slightly, the change is still allowing directly adding strings to query.group_by (an internal variable). One test relied on that, and keeping that ability available costs practically nothing. So, the updated patch has that ability restored.
comment:17 by , 13 years ago
The aggregation_regress
tests pass under Oracle with this patch:
Running tests... ---------------------------------------------------------------------- ..s....s....................s.. ---------------------------------------------------------------------- Ran 31 tests in 12.253s OK
However, if the changes to django.db.models
are reverted, test_aggregate_duplicate_columns
doesn't fail:
---------------------------------------------------------------------- ..s....s..............E.....s.. ====================================================================== ERROR [0.040s]: test_more_more (regressiontests.aggregation_regress.tests.AggregationTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/var/lib/jenkins/jobs/Django trunk + Oracle/workspace/tests/regressiontests/aggregation_regress/tests.py", line 472, in test_more_more grouping, gb_params = qs.query.get_compiler(qs.db).get_grouping([]) TypeError: get_grouping() takes exactly 1 argument (2 given) ---------------------------------------------------------------------- Ran 31 tests in 11.787s FAILED (errors=1)
(It's another test that fails.)
So it appears that the regression test doesn't validate the fix, at least under Oracle.
comment:18 by , 13 years ago
The reason for non-failure is that the new test is ran only when the backend supports grouping by primary key, that is, only on MySQL. That another test fails because sql/compiler.py's get_grouping signature was changed.
So, I think Oracle is OK at least for aggregation regress which is good news. That is the most likely place to break.
comment:19 by , 13 years ago
I think I will be pushing this ticket to trunk next week. 1.4 will not get this back-patched, as I feel the risk of breaking something is too big.
Reviews welcome! If you want to review this, but don't have time right now just ask me to postpone the commit.
I will separate the sql/query.py quote_cache removal (unused variable) to different commit.
comment:20 by , 12 years ago
Cc: | added |
---|
comment:21 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:23 by , 12 years ago
Component: | ORM aggregation → Database layer (models, ORM) |
---|
comment:24 by , 11 years ago
Cc: | added |
---|
The actual source of this issue is that id is listed twice, which defeats the optimization which removes unused fields. Find why that happened and you'll have your fix.