Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#32152 closed Bug (fixed)

Aggregation over subquery annotation GROUP BY produces wrong results

Reported by: Christian Klus Owned by: Christian Klus
Component: Database layer (models, ORM) Version: 3.0
Severity: Release blocker Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description (last modified by Christian Klus)

Starting in django 3.0.7, specifically after patch #31566 I noticed some of my more complex queries returning incorrect results. I think I've narrowed it down to a simpler test case:

Example query:

Book.objects.all().annotate(
    pub_year=TruncYear('pubdate')
).order_by().values('pub_year').annotate(
    total_pages=Sum('pages'),
    top_rating=Subquery(
        Book.objects.filter(
            pubdate__year=OuterRef('pub_year')
        ).order_by('rating').values('rating')[:1]
    )
).values('pub_year', 'total_pages', 'top_rating')

Generated SQL on 3.0.6:

SELECT
  django_date_trunc('year', "aggregation_regress_book"."pubdate") AS "pub_year",
  SUM("aggregation_regress_book"."pages") AS "total_pages",
  (
    SELECT U0."rating"
    FROM "aggregation_regress_book" U0
    WHERE
      django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate")
    ORDER BY U0."rating" ASC LIMIT 1
  ) AS "top_rating"
FROM "aggregation_regress_book"
GROUP BY
  django_date_trunc('year', "aggregation_regress_book"."pubdate"),
  "top_rating"

Generated SQL on current master:

SELECT
  django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) AS "pub_year",
  SUM("aggregation_regress_book"."pages") AS "total_pages",
  (
    SELECT U0."rating"
    FROM "aggregation_regress_book" U0
    WHERE
      django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL)
    ORDER BY U0."rating" ASC LIMIT 1
  ) AS "top_rating"
FROM "aggregation_regress_book"
GROUP BY
  django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL),
  (
    SELECT U0."rating"
    FROM "aggregation_regress_book" U0
    WHERE
      django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL)
    ORDER BY U0."rating" ASC LIMIT 1
  ),
  "aggregation_regress_book"."pubdate"

I see two separate issues here:

  • "aggregation_regress_book"."pubdate" is being added to the group by clause incorrectly (this issue probably predates the patch mentioned above)
  • Even though the subquery is in the select statement, the alias is not being used and instead the subquery is reevaluated. This nearly doubles the cost of one of my queries that is experiencing this problem.

I don't know much about the ORM internals, but here is my naive patch for the second issue:

diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
index ee98984826..6ea287d6cb 100644
--- a/django/db/models/sql/query.py
+++ b/django/db/models/sql/query.py
@@ -2220,7 +2220,7 @@ class Query(BaseExpression):
             # the selected fields anymore.
             group_by = []
             for expr in self.group_by:
-                if isinstance(expr, Ref) and expr.refs not in field_names:
+                if isinstance(expr, Ref) and expr.refs not in field_names + annotation_names:
                     expr = self.annotations[expr.refs]
                 group_by.append(expr)
             self.group_by = tuple(group_by)

I'd appreciate anyone with deeper knowlege of the ORM to chime in and let me know if I'm on the right track. Tests are passing locally.

The resulting query on master:

SELECT
  django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL) AS "pub_year",
  SUM("aggregation_regress_book"."pages") AS "total_pages",
  (
    SELECT U0."rating"
    FROM "aggregation_regress_book" U0
    WHERE
      django_date_extract('year', U0."pubdate") = django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL)
    ORDER BY U0."rating" ASC LIMIT 1
  ) AS "top_rating"
FROM "aggregation_regress_book"
GROUP BY
  django_date_trunc('year', "aggregation_regress_book"."pubdate", NULL, NULL),
  "top_rating"

Change History (11)

comment:1 by Christian Klus, 4 years ago

Here's the test case I used (not sure if it's in the right location). It succeeds on 3.0.6 and fails on subsequent releases.

  • tests/aggregation_regress/tests.py

    diff --git a/tests/aggregation_regress/tests.py b/tests/aggregation_regress/tests.py
    index 7604335257..dac995e1fc 100644
    a b from django.contrib.contenttypes.models import ContentType  
    88from django.core.exceptions import FieldError
    99from django.db import connection
    1010from django.db.models import (
    11     Aggregate, Avg, Case, Count, DecimalField, F, IntegerField, Max, Q, StdDev,
    12     Sum, Value, Variance, When,
     11    Aggregate, Avg, Case, Count, DecimalField, F, IntegerField, Max, OuterRef,
     12    Q, StdDev, Subquery, Sum, Value, Variance, When
    1313)
     14from django.db.models.functions import TruncYear
    1415from django.test import TestCase, skipUnlessAnyDBFeature, skipUnlessDBFeature
    1516from django.test.utils import Approximate
    1617
    class AggregationTests(TestCase):  
    102103        s2.books.add(cls.b1, cls.b3, cls.b5, cls.b6)
    103104        s3.books.add(cls.b3, cls.b4, cls.b6)
    104105
     106    def test_groupby_after_aggregation_and_subquery(self):
     107        self.assertEqual(
     108            Book.objects.all().annotate(
     109                pub_year=TruncYear('pubdate')
     110            ).order_by().values('pub_year').annotate(
     111                total_pages=Sum('pages'),
     112                top_rating=Subquery(
     113                    Book.objects.filter(
     114                        pubdate__year=OuterRef('pub_year')
     115                    ).order_by('rating').values('rating')[:1]
     116                )
     117            ).values('pub_year', 'total_pages', 'top_rating').count(),
     118            4
     119        )
     120
    105121    def assertObjectAttrs(self, obj, **kwargs):
    106122        for attr, value in kwargs.items():
    107123            self.assertEqual(getattr(obj, attr), value)
Last edited 4 years ago by Simon Charette (previous) (diff)

comment:2 by Christian Klus, 4 years ago

Description: modified (diff)

comment:3 by Simon Charette, 4 years ago

Triage Stage: UnreviewedAccepted

Thanks for the report and test Christian, I think you analysis is right; all members of the SELECT clause should be considered.

Could you submit a PR on Github including your test and the following changes

  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index ee98984826..96819803c0 100644
    a b def set_values(self, fields):  
    22052205                        field_names.append(f)
    22062206            self.set_extra_mask(extra_names)
    22072207            self.set_annotation_mask(annotation_names)
     2208            selected = frozenset(field_names + extra_names + annotation_names)
    22082209        else:
    22092210            field_names = [f.attname for f in self.model._meta.concrete_fields]
     2211            selected = frozenset(field_names)
    22102212        # Selected annotations must be known before setting the GROUP BY
    22112213        # clause.
    22122214        if self.group_by is True:
    def set_values(self, fields):  
    22202222            # the selected fields anymore.
    22212223            group_by = []
    22222224            for expr in self.group_by:
    2223                 if isinstance(expr, Ref) and expr.refs not in field_names:
     2225                if isinstance(expr, Ref) and expr.refs not in selected:
    22242226                    expr = self.annotations[expr.refs]
    22252227                group_by.append(expr)
    22262228            self.group_by = tuple(group_by)

comment:5 by Christian Klus, 4 years ago

Has patch: set

comment:6 by Simon Charette, 4 years ago

Summary: Groupby after aggregation and subquery produces subtly wrong resultsAggregation over subquery annotation GROUP BY produces wrong results

Patch could likely be simplified a bit but it captures the essence of the issue. We could also a regression test for the extra select reference but since this feature is discouraged I'm not sure it's worth doing.

Thanks a lot for the report and investigation Christian.

Last edited 4 years ago by Simon Charette (previous) (diff)

comment:7 by Mariusz Felisiak, 4 years ago

Owner: changed from nobody to Christian Klus
Severity: NormalRelease blocker
Status: newassigned
Version: 3.13.0

Regression in 42c08ee46539ef44f8658ebb1cbefb408e0d03fe (Django 3.0.7).

Last edited 4 years ago by Mariusz Felisiak (previous) (diff)

comment:8 by Mariusz Felisiak, 4 years ago

Patch needs improvement: set
Triage Stage: AcceptedReady for checkin

comment:9 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

Resolution: fixed
Status: assignedclosed

In 4ac2d4fa:

Fixed #32152 -- Fixed grouping by subquery aliases.

Regression in 42c08ee46539ef44f8658ebb1cbefb408e0d03fe.

Thanks Simon Charette for the review.

comment:10 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In ab951d24:

[3.1.x] Fixed #32152 -- Fixed grouping by subquery aliases.

Regression in 42c08ee46539ef44f8658ebb1cbefb408e0d03fe.

Thanks Simon Charette for the review.

Backport of 4ac2d4fa42e1659f328c35b6b8d4761b3419c11a from master

comment:11 by Mariusz Felisiak <felisiak.mariusz@…>, 4 years ago

In b0a6798d:

[3.0.x] Fixed #32152 -- Fixed grouping by subquery aliases.

Regression in 42c08ee46539ef44f8658ebb1cbefb408e0d03fe.

Thanks Simon Charette for the review.

Backport of 4ac2d4fa42e1659f328c35b6b8d4761b3419c11a from master

Note: See TracTickets for help on using tickets.
Back to Top