Opened 4 months ago

Closed 3 months ago

#35586 closed Bug (fixed)

Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres

Reported by: Devin Cox Owned by: Devin Cox
Component: Database layer (models, ORM) Version: 5.0
Severity: Normal Keywords: postgres, set-returning, aggregation, annotation
Cc: Simon Charette, David Sanders, Jacob Walls Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.

I discovered this bug via PageNumberPagination from rest_framework.pagination, specifically in the function paginate_queryset. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().

When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count.

Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.

I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest approach does not work for our use case.

Here are steps to reproduce:

models.py:

class TestModel(models.Model):
    test_json = models.JSONField(default=dict, blank=True)
    id = models.IntegerField(primary_key=True)

tests.py

from .models import TestModel
from django.contrib.postgres.fields import JSONField
from django.db.models import Func, Value

    def test_bug(self):
        test_model = TestModel.objects.create(
            test_json={
                "test_key" : [
                    {
                        "id" : 1,
                        "name" : "test1"
                    },
                    {
                        "id" : 2,
                        "name" : "test2"
                    }
                ]
            },
            id=1
        )
        test_model.save()
        qs = TestModel.objects.annotate(
            table_element=Func(
                "test_json",
                Value("$.test_key[*]"),
                function="jsonb_path_query",
                output_field=JSONField(),
                subquery=True
            )
        ).filter(pk=1)

        # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated.

        self.assertEqual(qs.count(), len(qs))

Thank you for any guidance and/or support!

Change History (14)

comment:1 by Devin Cox, 4 months ago

Edit: the subquery=True parameter was from another recommendation I found, however whether or not it is included has had no impact on the net result.

Last edited 4 months ago by Devin Cox (previous) (diff)

comment:2 by Devin Cox, 4 months ago

Component: UncategorizedDatabase layer (models, ORM)

comment:3 by Simon Charette, 4 months ago

Cc: Simon Charette added
Triage Stage: UnreviewedAccepted

This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOINs) like some Postgres set-returning functions (e.g. UNNEST). These do effectively throw a wrench in the optimization introduced in 4.2 (see #28477) if the set-returning annotation is not referenced by the aggregation.

What is really needed here is likely a Expression.set_returning: bool = False attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding.

Something like

  • django/db/models/expressions.py

    diff --git a/django/db/models/expressions.py b/django/db/models/expressions.py
    index 4ee22420d9..8656979630 100644
    a b class BaseExpression:  
    182182    allowed_default = False
    183183    # Can the expression be used during a constraint validation?
    184184    constraint_validation_compatible = True
     185    # Does the expression possibly returns more than one row?
     186    set_returning = False
    185187
    186188    def __init__(self, output_field=None):
    187189        if output_field is not None:
  • django/db/models/sql/query.py

    diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py
    index 438bb5ddbd..4f1210c49e 100644
    a b def get_aggregation(self, using, aggregate_exprs):  
    491491            )
    492492            or having
    493493        )
     494        set_returning_annotations = {
     495            alias
     496            for alias, annotation in self.annotation_select.items()
     497            if getattr(annotation, "set_returning", False)
     498        }
    494499        # Decide if we need to use a subquery.
    495500        #
    496501        # Existing aggregations would cause incorrect results as
    def get_aggregation(self, using, aggregate_exprs):  
    510515            or qualify
    511516            or self.distinct
    512517            or self.combinator
     518            or set_returning_annotations
    513519        ):
    514520            from django.db.models.sql.subqueries import AggregateQuery
    515521
    def get_aggregation(self, using, aggregate_exprs):  
    550556                    for annotation_alias, annotation in self.annotation_select.items():
    551557                        if annotation.get_group_by_cols():
    552558                            annotation_mask.add(annotation_alias)
     559                    # Annotations that possibly return multiple rows cannot
     560                    # be masked as they might have an incidence on the query.
     561                    annotation_mask |= set_returning_annotations
    553562                    inner_query.set_annotation_mask(annotation_mask)
    554563
    555564            # Add aggregates to the outer AggregateQuery. This requires making

That could then be used like

class JSONPathQuery(Func):
    function = "jsonb_path_query"
    output_field = JSONField()
    set_returning = True

JSONPathQuery("test_json", Value("$.test_key[*]"))

The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?

Last edited 4 months ago by Simon Charette (previous) (diff)

comment:4 by Simon Charette, 4 months ago

Keywords: postgres set-returning aggregation annotation added
Summary: Queryset count and length do not match when annotating using JSONB_PATH_QUERYAggregation optimization doesn't account for not referenced set-returning annotations on Postgres

comment:5 by Simon Charette, 4 months ago

Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.

The gist is that a the query

SELECT
   testmodel.*
   , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element
FROM testmodel

can also be expressed as

SELECT
   testmodel.*
   , table_element_tbl.value AS table_element
FROM
    testmodel
    , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)

And if we added support the automatic addition of set_returning (or table_valued functions?) to alias_map (what is used to generate the FROM clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series? That would allow the sql.Query.get_aggregation logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM clause would still span the rows

SELECT COUNT(*) FROM (
    SELECT id
    FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
)
Last edited 4 months ago by Simon Charette (previous) (diff)

comment:6 by David Sanders, 4 months ago

Cc: David Sanders added

comment:7 by Jacob Walls, 4 months ago

Cc: Jacob Walls added

comment:8 by Devin Cox, 4 months ago

Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!

Last edited 4 months ago by Devin Cox (previous) (diff)

comment:9 by Devin Cox, 4 months ago

Has patch: set

comment:10 by Sarah Boyce, 4 months ago

Owner: set to Devin Cox
Status: newassigned

comment:11 by Sarah Boyce, 3 months ago

Patch needs improvement: set

comment:12 by Sarah Boyce, 3 months ago

Patch needs improvement: unset

comment:13 by Sarah Boyce, 3 months ago

Triage Stage: AcceptedReady for checkin

comment:14 by Sarah Boyce <42296566+sarahboyce@…>, 3 months ago

Resolution: fixed
Status: assignedclosed

In e030839:

Fixed #35586 -- Added support for set-returning database functions.

Aggregation optimization didn't account for not referenced set-returning annotations on Postgres.

Co-authored-by: Simon Charette <charette.s@…>

Note: See TracTickets for help on using tickets.
Back to Top