#32152 closed Bug (fixed)
Aggregation over subquery annotation GROUP BY produces wrong results
Reported by: | Christian Klus | Owned by: | Christian Klus |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 3.0 |
Severity: | Release blocker | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Starting in django 3.0.7, specifically after patch #31566 I noticed some of my more complex queries returning incorrect results. I think I've narrowed it down to a simpler test case:
Example query:
Book.objects.all().annotate( pub_year=TruncYear('pubdate') ).order_by().values('pub_year').annotate( total_pages=Sum('pages'), top_rating=Subquery( Book.objects.filter( pubdate__year=OuterRef('pub_year') ).order_by('rating').values('rating')[:1] ) ).values('pub_year', 'total_pages', 'top_rating')
Generated SQL on 3.0.6:
SELECT django_date_trunc('year', "aggregation_regress_book"."pubdate") AS "pub_year", SUM("aggregation_regress_book"."pages") AS "total_pages", ( SELECT U0."rating" FROM "aggregation_regress_book" U0 WHERE django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate") ORDER BY U0."rating" ASC LIMIT 1 ) AS "top_rating" FROM "aggregation_regress_book" GROUP BY django_date_trunc('year', "aggregation_regress_book"."pubdate"), "top_rating"
Generated SQL on current master:
SELECT django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) AS "pub_year", SUM("aggregation_regress_book"."pages") AS "total_pages", ( SELECT U0."rating" FROM "aggregation_regress_book" U0 WHERE django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) ORDER BY U0."rating" ASC LIMIT 1 ) AS "top_rating" FROM "aggregation_regress_book" GROUP BY django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL), ( SELECT U0."rating" FROM "aggregation_regress_book" U0 WHERE django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) ORDER BY U0."rating" ASC LIMIT 1 ), "aggregation_regress_book"."pubdate"
I see two separate issues here:
"aggregation_regress_book"."pubdate"
is being added to the group by clause incorrectly (this issue probably predates the patch mentioned above)- Even though the subquery is in the select statement, the alias is not being used and instead the subquery is reevaluated. This nearly doubles the cost of one of my queries that is experiencing this problem.
I don't know much about the ORM internals, but here is my naive patch for the second issue:
diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py index ee98984826..6ea287d6cb 100644 --- a/django/db/models/sql/query.py +++ b/django/db/models/sql/query.py @@ -2220,7 +2220,7 @@ class Query(BaseExpression): # the selected fields anymore. group_by = [] for expr in self.group_by: - if isinstance(expr, Ref) and expr.refs not in field_names: + if isinstance(expr, Ref) and expr.refs not in field_names + annotation_names: expr = self.annotations[expr.refs] group_by.append(expr) self.group_by = tuple(group_by)
I'd appreciate anyone with deeper knowlege of the ORM to chime in and let me know if I'm on the right track. Tests are passing locally.
The resulting query on master:
SELECT django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) AS "pub_year", SUM("aggregation_regress_book"."pages") AS "total_pages", ( SELECT U0."rating" FROM "aggregation_regress_book" U0 WHERE django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) ORDER BY U0."rating" ASC LIMIT 1 ) AS "top_rating" FROM "aggregation_regress_book" GROUP BY django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL), "top_rating"
Change History (11)
comment:2 by , 4 years ago
Description: | modified (diff) |
---|
comment:3 by , 4 years ago
Triage Stage: | Unreviewed → Accepted |
---|
Thanks for the report and test Christian, I think you analysis is right; all members of the SELECT
clause should be considered.
Could you submit a PR on Github including your test and the following changes
-
django/db/models/sql/query.py
diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py index ee98984826..96819803c0 100644
a b def set_values(self, fields): 2205 2205 field_names.append(f) 2206 2206 self.set_extra_mask(extra_names) 2207 2207 self.set_annotation_mask(annotation_names) 2208 selected = frozenset(field_names + extra_names + annotation_names) 2208 2209 else: 2209 2210 field_names = [f.attname for f in self.model._meta.concrete_fields] 2211 selected = frozenset(field_names) 2210 2212 # Selected annotations must be known before setting the GROUP BY 2211 2213 # clause. 2212 2214 if self.group_by is True: … … def set_values(self, fields): 2220 2222 # the selected fields anymore. 2221 2223 group_by = [] 2222 2224 for expr in self.group_by: 2223 if isinstance(expr, Ref) and expr.refs not in field_names:2225 if isinstance(expr, Ref) and expr.refs not in selected: 2224 2226 expr = self.annotations[expr.refs] 2225 2227 group_by.append(expr) 2226 2228 self.group_by = tuple(group_by)
comment:5 by , 4 years ago
Has patch: | set |
---|
comment:6 by , 4 years ago
Summary: | Groupby after aggregation and subquery produces subtly wrong results → Aggregation over subquery annotation GROUP BY produces wrong results |
---|
Patch could likely be simplified a bit but it captures the essence of the issue. We could also a regression test for the extra
select reference but since this feature is discouraged I'm not sure it's worth doing.
Thanks a lot for the report and investigation Christian.
comment:7 by , 4 years ago
Owner: | changed from | to
---|---|
Severity: | Normal → Release blocker |
Status: | new → assigned |
Version: | 3.1 → 3.0 |
Regression in 42c08ee46539ef44f8658ebb1cbefb408e0d03fe (Django 3.0.7).
comment:8 by , 4 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Accepted → Ready for checkin |
Here's the test case I used (not sure if it's in the right location). It succeeds on 3.0.6 and fails on subsequent releases.
tests/aggregation_regress/tests.py
Q, StdDev,Sum, Value, Variance, When,