Context Navigation

← Previous Ticket
Next Ticket →

#35586 closed Bug (fixed)

Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

Reported by:	Devin Cox	Owned by:	Devin Cox
Component:	Database layer (models, ORM)	Version:	5.0
Severity:	Normal	Keywords:	postgres, set-returning, aggregation, annotation
Cc:	Simon Charette, David Sanders, Jacob Walls	Triage Stage:	Ready for checkin
Has patch:	yes	Needs documentation:	no
Needs tests:	no	Patch needs improvement:	no
Easy pickings:	no	UI/UX:	no

Description

There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.

I discovered this bug via PageNumberPagination from rest_framework.pagination, specifically in the function paginate_queryset. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().

When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count.

Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.

I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest approach does not work for our use case.

Here are steps to reproduce:

models.py:

class TestModel(models.Model):
    test_json = models.JSONField(default=dict, blank=True)
    id = models.IntegerField(primary_key=True)

tests.py

from .models import TestModel
from django.contrib.postgres.fields import JSONField
from django.db.models import Func, Value

    def test_bug(self):
        test_model = TestModel.objects.create(
            test_json={
                "test_key" : [
                    {
                        "id" : 1,
                        "name" : "test1"
                    },
                    {
                        "id" : 2,
                        "name" : "test2"
                    }
                ]
            },
            id=1
        )
        test_model.save()
        qs = TestModel.objects.annotate(
            table_element=Func(
                "test_json",
                Value("$.test_key[*]"),
                function="jsonb_path_query",
                output_field=JSONField(),
                subquery=True
            )
        ).filter(pk=1)

        # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated.

        self.assertEqual(qs.count(), len(qs))

Thank you for any guidance and/or support!

Change History (14)

comment:1 by Devin Cox, 4 months ago

Edit: the subquery=True parameter was from another recommendation I found, however whether or not it is included has had no impact on the net result.

Last edited 4 months ago by Devin Cox (previous) (diff)

comment:2 by Devin Cox, 4 months ago

Component:	Uncategorized → Database layer (models, ORM)

comment:3 by Simon Charette, 4 months ago

Cc:	Simon Charette added
Triage Stage:	Unreviewed → Accepted

This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOINs) like some Postgres set-returning functions (e.g. UNNEST). These do effectively throw a wrench in the optimization introduced in 4.2 (see #28477) if the set-returning annotation is not referenced by the aggregation.

What is really needed here is likely a Expression.set_returning: bool = False attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding.

Something like

django/db/models/expressions.py

diff --git a/django/db/models/expressions.py b/django/db/models/expressions.py
index 4ee22420d9..8656979630 100644

                class BaseExpression:
     allowed_default = False
     # Can the expression be used during a constraint validation?
     constraint_validation_compatible = True
+    # Does the expression possibly returns more than one row?
+    set_returning = False
     def __init__(self, output_field=None):
         if output_field is not None:

django/db/models/sql/query.py

diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
index 438bb5ddbd..4f1210c49e 100644

                def get_aggregation(self, using, aggregate_exprs):
+            )
             or having
+        )
+        set_returning_annotations = {
+            alias
+            for alias, annotation in self.annotation_select.items()
+            if getattr(annotation, "set_returning", False)
+        }
         # Decide if we need to use a subquery.
+        #
         # Existing aggregations would cause incorrect results as
-…
+               def get_aggregation(self, using, aggregate_exprs):
             or qualify
             or self.distinct
             or self.combinator
+            or set_returning_annotations
         ):
             from django.db.models.sql.subqueries import AggregateQuery
-…
+               def get_aggregation(self, using, aggregate_exprs):
                     for annotation_alias, annotation in self.annotation_select.items():
                         if annotation.get_group_by_cols():
                             annotation_mask.add(annotation_alias)
+                    # Annotations that possibly return multiple rows cannot
+                    # be masked as they might have an incidence on the query.
+                    annotation_mask |= set_returning_annotations
                     inner_query.set_annotation_mask(annotation_mask)
             # Add aggregates to the outer AggregateQuery. This requires making

That could then be used like

class JSONPathQuery(Func):
    function = "jsonb_path_query"
    output_field = JSONField()
    set_returning = True

JSONPathQuery("test_json", Value("$.test_key[*]"))

The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?

Last edited 4 months ago by Simon Charette (previous) (diff)

comment:4 by Simon Charette, 4 months ago

Keywords:	postgres set-returning aggregation annotation added
Summary:	Queryset count and length do not match when annotating using JSONB_PATH_QUERY → Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

comment:5 by Simon Charette, 4 months ago

Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.

The gist is that a the query

SELECT
   testmodel.*
   , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element
FROM testmodel

can also be expressed as

SELECT
   testmodel.*
   , table_element_tbl.value AS table_element
FROM
    testmodel
    , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)

And if we added support the automatic addition of set_returning (or table_valued functions?) to alias_map (what is used to generate the FROM clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series? That would allow the sql.Query.get_aggregation logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM clause would still span the rows

SELECT COUNT(*) FROM (
    SELECT id
    FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
)

Last edited 4 months ago by Simon Charette (previous) (diff)

comment:6 by David Sanders, 4 months ago

Cc:	David Sanders added

comment:7 by Jacob Walls, 4 months ago

Cc:	Jacob Walls added

comment:8 by Devin Cox, 4 months ago

Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!

Last edited 4 months ago by Devin Cox (previous) (diff)

comment:9 by Devin Cox, 4 months ago

Has patch:	set

comment:10 by Sarah Boyce, 4 months ago

Owner:	set to Devin Cox
Status:	new → assigned

comment:11 by Sarah Boyce, 3 months ago

Patch needs improvement:	set

comment:12 by Sarah Boyce, 3 months ago

Patch needs improvement:	unset

comment:13 by Sarah Boyce, 3 months ago

Triage Stage:	Accepted → Ready for checkin

comment:14 by Sarah Boyce <42296566+sarahboyce@…>, 3 months ago

Resolution:	→ fixed
Status:	assigned → closed

In e030839:

Fixed #35586 -- Added support for set-returning database functions.

Aggregation optimization didn't account for not referenced set-returning annotations on Postgres.

Co-authored-by: Simon Charette <charette.s@…>

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues