Opened 5 months ago
Closed 4 months ago
#35586 closed Bug (fixed)
Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres
Reported by: | Devin Cox | Owned by: | Devin Cox |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 5.0 |
Severity: | Normal | Keywords: | postgres, set-returning, aggregation, annotation |
Cc: | Simon Charette, David Sanders, Jacob Walls | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
There is a bug where the queryset count is inaccurate when doing an annotation with JSONB_PATH_QUERY. The count is fixed after the queryset is evaluated.
I discovered this bug via PageNumberPagination
from rest_framework.pagination
, specifically in the function paginate_queryset
. This Paginator takes in an object_list (in this case a Queryset) and determines the length by first checking if a .count() method is available, otherwise by doing len(object_list). In this case, it is determining length through Queryset.count().
When the Queryset is annotated, and the length of the results increases, the count remains stale--only returning the original count.
Currently I have a workaround by adding a DISTINCT, INTERSECTION, UNION, or DIFFERENCE to the queryset. Unfortunately, I now also need to have ordering which makes this tricky to continue working around.
I believe this was a regression back in 4.2. This was previously asked about here: https://code.djangoproject.com/ticket/28477#comment:22, and this thread was referred: https://forum.djangoproject.com/t/django-4-2-behavior-change-when-using-arrayagg-on-unnested-arrayfield-postgresql-specific/21547. However, the Unnest
approach does not work for our use case.
Here are steps to reproduce:
models.py:
class TestModel(models.Model): test_json = models.JSONField(default=dict, blank=True) id = models.IntegerField(primary_key=True)
tests.py
from .models import TestModel from django.contrib.postgres.fields import JSONField from django.db.models import Func, Value def test_bug(self): test_model = TestModel.objects.create( test_json={ "test_key" : [ { "id" : 1, "name" : "test1" }, { "id" : 2, "name" : "test2" } ] }, id=1 ) test_model.save() qs = TestModel.objects.annotate( table_element=Func( "test_json", Value("$.test_key[*]"), function="jsonb_path_query", output_field=JSONField(), subquery=True ) ).filter(pk=1) # qs.count() and len(qs) should be equal, but currently they are not. Running qs.count() after len(qs) is currently equal because the queryset was evaluated. self.assertEqual(qs.count(), len(qs))
Thank you for any guidance and/or support!
Change History (14)
comment:2 by , 5 months ago
Component: | Uncategorized → Database layer (models, ORM) |
---|
comment:3 by , 5 months ago
Cc: | added |
---|---|
Triage Stage: | Unreviewed → Accepted |
This one is tricky, the ORM and its aggregation logic was not built in a way that accounts for annotations spawning rows themselves (instead of through JOIN
s) like some Postgres set-returning functions (e.g. UNNEST
). These do effectively throw a wrench in the optimization introduced in 4.2 (see #28477) if the set-returning annotation is not referenced by the aggregation.
What is really needed here is likely a Expression.set_returning: bool = False
attribute that the ORM can consider to enforce the subquery pushdown and preventing of annotation eliding.
Something like
-
django/db/models/expressions.py
diff --git a/django/db/models/expressions.py b/django/db/models/expressions.py index 4ee22420d9..589fd30b0f 100644
a b class BaseExpression: 182 182 allowed_default = False 183 183 # Can the expression be used during a constraint validation? 184 184 constraint_validation_compatible = True 185 # Does the expression possibly return more than one row? 186 set_returning = False 185 187 186 188 def __init__(self, output_field=None): 187 189 if output_field is not None: -
django/db/models/sql/query.py
diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py index 438bb5ddbd..0eb1a2476c 100644
a b def get_aggregation(self, using, aggregate_exprs): 491 491 ) 492 492 or having 493 493 ) 494 has_set_returning_annotation = any( 495 getattr(annotation, "set_returning", False) 496 for annotation in self.annotations.values() 497 ) 494 498 # Decide if we need to use a subquery. 495 499 # 496 500 # Existing aggregations would cause incorrect results as … … def get_aggregation(self, using, aggregate_exprs): 510 514 or qualify 511 515 or self.distinct 512 516 or self.combinator 517 or has_set_returning_annotation 513 518 ): 514 519 from django.db.models.sql.subqueries import AggregateQuery
That could then be used like
class JSONPathQuery(Func): function = "jsonb_path_query" output_field = JSONField() set_returning = True JSONPathQuery("test_json", Value("$.test_key[*]"))
The challenge here is that Django doesn't have any core set-returning function (which also explains why this slipped under the radar for so long) so maybe we should also consider adding support for contrib.postgres.Unnest
which is a common one that would allow us to ensure proper test coverage and document it as an example of when this flag should be set?
comment:4 by , 5 months ago
Keywords: | postgres set-returning aggregation annotation added |
---|---|
Summary: | Queryset count and length do not match when annotating using JSONB_PATH_QUERY → Aggregation optimization doesn't account for not referenced set-returning annotations on Postgres |
comment:5 by , 5 months ago
Adding a bit more details about set-returning functions close equivalents on SQLite, MySQL, and Oracle.
The gist is that a the query
SELECT testmodel.* , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element FROM testmodel
can also be expressed as
SELECT testmodel.* , table_element_tbl.value AS table_element FROM testmodel , jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value)
And if we added support the automatic addition of set_returning
(or table_valued
functions?) to alias_map
(what is used to generate the FROM
clause) it could also possibly allow to solve the long standing problem of adding support for features such as generate_series
? That would allow the sql.Query.get_aggregation
logic to remain unchanged and keep pruning unreferenced aliased (which is really specific to Postgres) as the reference in the FROM
clause would still span the rows
SELECT COUNT(*) FROM ( SELECT id FROM testmodel, jsonb_path_query(testmodel.data, '$.test_key[*]') AS table_element_tbl(value) )
comment:6 by , 5 months ago
Cc: | added |
---|
comment:7 by , 5 months ago
Cc: | added |
---|
comment:8 by , 5 months ago
Went ahead and implemented/tested your suggestions and they work great for our individual use case. Also pushed a PR for this. I know there may be additional items now within the scope, such as support for Unnest
, but I wanted to go ahead and get the ball rolling. Let me know if there are other ways I can help. Thanks!
comment:9 by , 5 months ago
Has patch: | set |
---|
comment:10 by , 5 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:11 by , 5 months ago
Patch needs improvement: | set |
---|
comment:12 by , 4 months ago
Patch needs improvement: | unset |
---|
comment:13 by , 4 months ago
Triage Stage: | Accepted → Ready for checkin |
---|
Edit: the
subquery=True
parameter was from another recommendation I found, however whether or not it is included has had no impact on the net result.