#34255 closed Bug (fixed)
Annotation/group by with an expression on psycopg3
Reported by: | Guillaume Andreu Sabater | Owned by: | Mariusz Felisiak |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Release blocker | Keywords: | orm postgres psycopg3 annotation groupby |
Cc: | Florian Apolloner, Simon Charette | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Given the following code:
import zoneinfo from django.db import models from django.db.models.functions import ExtractYear from django.contrib.postgres.fields import ArrayField TZ = zoneinfo.ZoneInfo("Europe/Paris") class JsonBuildObject(models.Func): function = "jsonb_build_object" output_field = models.JSONField() class SubqueryArray(models.Subquery): template = "ARRAY(%(subquery)s)" output_field = ArrayField(base_field=models.JSONField()) class CurveQuerySet(models.QuerySet["Curve"]): """Curve QuerySet.""" def annotate_loads(self) -> "CurveQuerySet": """Annotate baseload by year.""" baseload_qs = ( Point.objects.filter(curve=models.OuterRef("pk")) .annotate(year=ExtractYear("start_at", tzinfo=TZ)) .values("year") .alias(baseload=models.Avg("value")) .annotate( json=JsonBuildObject( models.Value("year"), models.F("year"), models.Value("baseload"), models.F("baseload"), ) ) .values("json") ) return self.annotate(_baseloads=SubqueryArray(baseload_qs)) CurveManager = models.Manager.from_queryset(CurveQuerySet) class Curve(models.Model): """Curve.""" objects = CurveManager() class Point(models.Model): """Curve point.""" curve = models.ForeignKey( Curve, on_delete=models.CASCADE, related_name="points", related_query_name="point", ) start_at = models.DateTimeField() value = models.FloatField()
I use the annotate_loads to compute yearly averages (with .values("year") acting as a GROUP BY) and dump the results in a json field.
With psycopg3, from what I've seen, the query params/values are not interpolated in the query anymore, but sent alongside the query to the server.
In my case, it looks like this:
SELECT "fail_curve"."id", ARRAY( SELECT jsonb_build_object( $1, EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $2 ), $3, AVG(U0."value") ) AS "json" FROM "fail_point" U0 WHERE U0."curve_id" = ("fail_curve"."id") GROUP BY EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $4 ) ) AS "_baseloads" FROM "fail_curve" WHERE "fail_curve"."id" = $5
But postgres doesn't like: django.db.utils.ProgrammingError: column "u0.start_at" must appear in the GROUP BY clause or be used in an aggregate function
because
EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $2 )
is different from
EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $4 )
I tested an updated query using the same placeholder ($4 -> $2) and it worked as expected:
PREPARE working (text, text, text, int) AS SELECT "fail_curve"."id", ARRAY( SELECT jsonb_build_object( $1, EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $2 ), $3, AVG(U0."value") ) AS "json" FROM "fail_point" U0 WHERE U0."curve_id" = ("fail_curve"."id") GROUP BY EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $2 ) ) AS "_baseloads" FROM "fail_curve" WHERE "fail_curve"."id" = $4 LIMIT 1; EXECUTE working('year', 'Europe/Paris', 'base', 1);
My understanding is as follow:
- group by is an expression
- this expression is also used in select
- they have a different placeholder in the query generated by django/psycopg3
- postgres rejects it
Let me know if this needs extra details.
Change History (22)
comment:1 by , 2 years ago
Cc: | added |
---|---|
Severity: | Normal → Release blocker |
Triage Stage: | Unreviewed → Accepted |
follow-up: 8 comment:2 by , 2 years ago
This will be hard to solve as group by aliasing cannot be used since year
is stripped from the query. All I can think of right now is doing something along the lines of
SELECT "fail_curve"."id", ARRAY( SELECT "json" FROM ( SELECT EXTRACT( YEAR FROM U0."start_at" AT TIME ZONE $1 ) AS "year", jsonb_build_object( $2, "year", $3, AVG(U0."value") ) AS "json" FROM "fail_point" U0 WHERE U0."curve_id" = ("fail_curve"."id") GROUP BY "year" -- or 1 ) ) AS "_baseloads" FROM "fail_curve" WHERE "fail_curve"."id" = $4
When a group alias is ultimately masked, similarly to how we've done it with window functions filtering and value masking.
---
Edit, another solution could be to have the psycopg3 query %s -> $n
logic de-duplicate equal values and avoid using multiple placeholders for them. That would result in the query being written as the user reported here at the expense of equality checks (or possibly identity/hash checks which are cheaper) on each query creation which I believe would require adjustments to the current caching strategy.
comment:3 by , 2 years ago
This will be hard to solve as group by aliasing cannot be used since year is stripped from the query.
What about grouping by the entire function? as we do when it's not pushed into subquery, e.g.
Point.objects.annotate( year=ExtractYear("start_at", tzinfo=tz), ).values("year").annotate( baseload=Avg("value"), ).values("year", "baseload")
SELECT EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE 'Europe/Paris') AS "year", AVG("ticket_34255_point"."value") AS "baseload" FROM "ticket_34255_point" GROUP BY 1
comment:4 by , 2 years ago
I believe the issue here is that the year
annotation is elided from the query for a good reason, only one column must be present in the ARRAY
/SubqueryArray
construct.
In other words, the problem is not about grouping by alias or column index but grouping by an expression that is elided from the select clause and thus cannot be grouped by reference.
We need to perform some column masking for the query to ensure the query preserves it semantic
SELECT baseload FROM ( SELECT EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE 'Europe/Paris') AS "year", AVG("ticket_34255_point"."value") AS "baseload" FROM "ticket_34255_point" GROUP BY 1 ) masked_subquery
comment:5 by , 2 years ago
Just want to add that it also crashes without ARRAY
, e.g. for a queryset from comment:1:
SELECT JSONB_BUILD_OBJECT( (%s)::text, EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE %s), (%s)::text, AVG("ticket_34255_point"."value") ) AS "json" FROM "ticket_34255_point" GROUP BY EXTRACT(YEAR FROM "ticket_34255_point"."start_at" AT TIME ZONE %s)
psycopg.errors.GroupingError: column "ticket_34255_point.start_at" must appear in the GROUP BY clause or be used in an aggregate function
comment:6 by , 2 years ago
Right, it doesn't need ARRAY
to be present to reproduce the crash but the reason why we can't simply augment the select clause with year
in order to GROUP BY 1
(or GROUP BY year
) is that it would yield tuples with more columns than the user is asking for when doing an explicit values("json")
.
If we want to augment the SELECT
clause when explicitly grouping by a parameterized expression that is not selected (ExtractYear("start_at", tzinfo=TZ)
is this case) then we must make sure to de-augment the resulting query to ensure the query (mask it) to preserves its semantic. The usage of ARRAY/SubqueryArray
just happens to be a good example of why this is important but the same could be said of an __in
lookup that also only expects a single column to be selected.
comment:7 by , 2 years ago
Edit, another solution could be to have the psycopg3 query %s -> $n logic de-duplicate equal values and avoid using multiple placeholders for them. That would result in the query being written as the user reported here at the expense of equality checks (or possibly identity/hash checks which are cheaper) on each query creation which I believe would require adjustments to the current caching strategy.
Yeah, the same as we do on Oracle. Unfortunately, it has side-effects :|
comment:8 by , 2 years ago
Replying to Simon Charette:
Edit, another solution could be to have the psycopg3 query
%s -> $n
logic de-duplicate equal values and avoid using multiple placeholders for them.
I would love to see this done for Django in general and not at the individual backend level. That said it is probably to late for 4.1
If we cannot fix the issue at hand nicely I think we could fall back to client side cursors and provide a setting (backend option) to switch back to server side cursors so people can experiment with them. Btw from the looks of it and issues we ran into developing psycopg3 support, this also means that the postgresql backend is basically the only backend using server side bindings or is it just stricter than every other backend?
comment:9 by , 2 years ago
Regression test for the Django test suite:
-
tests/aggregation/tests.py
diff --git a/tests/aggregation/tests.py b/tests/aggregation/tests.py index 54e1e6f13a..bd3ea55647 100644
a b from django.db.models.functions import ( 34 34 Cast, 35 35 Coalesce, 36 36 Greatest, 37 Least, 37 38 Lower, 38 39 Now, 39 40 Pi, … … class AggregateTestCase(TestCase): 1614 1615 ).annotate(total=Count("*")) 1615 1616 self.assertEqual(dict(has_long_books_breakdown), {True: 2, False: 3}) 1616 1617 1618 def test_group_by_nested_expression_with_params(self): 1619 books_qs = ( 1620 Book.objects.annotate( 1621 greatest_pages=Greatest("pages", Value(600)) 1622 ) 1623 .values( 1624 "greatest_pages", 1625 ) 1626 .annotate( 1627 min_pages=Min("pages"), 1628 least=Least("min_pages", "greatest_pages"), 1629 ) 1630 .values_list("least", flat=True) 1631 ) 1632 self.assertCountEqual(books_qs, [300, 946, 1132]) 1633 1617 1634 @skipUnlessDBFeature("supports_subqueries_in_group_by") 1618 1635 def test_aggregation_subquery_annotation_related_field(self): 1619 1636 publisher = Publisher.objects.create(name=self.a9.name, num_awards=2)
follow-up: 11 comment:10 by , 2 years ago
I would love to see this done for Django in general and not at the individual backend level. That said it is probably to late for 4.1
Agreed that it would be nice to get but there's no way we can squeeze it in for 4.2.
If we cannot fix the issue at hand nicely I think we could fall back to client side cursors and provide a setting (backend option) to switch back to server side cursors so people can experiment with them. Btw from the looks of it and issues we ran into developing psycopg3 support, this also means that the postgresql backend is basically the only backend using server side bindings or is it just stricter than every other backend?
I'm not sure what you mean here by server side cursors. Did you mean server side bindings?
I'm not sure what's causing it to be stricter but if pyscopg>=3 allows for psycopg2 style of parameter bindings we should default to it until we can figure out an adequate solution here. If that doesn't work I can commit time to solving this issue but I fear the solution won't be pretty.
Basically we'd have to adapt SQLCompiler.get_group_by
to do the following
- Change its return signature so it's allowed to return members that should be added to the
SELECT
clause - Adjust its compile logic so when a compiled expression is parametrized (
len(params) >= 1
) and is not part ofselected_expr_indices
then it's added to of the members that must be added to theSELECT
and are returned by the method. - Then adjust
.as_sql
to theSELECT *original_select (SELECT *original_select, *extra_group_by_select ... GROUP BY ... *extra_group_by_select_indices)
when there areextra_group_by_select
The only interesting bit is that we might be able to reuse some logic that was added to implement window function filtering.
That's a lot of work to support server side bindings and that's only for the group by clause, there might be other cases lurking around.
comment:11 by , 2 years ago
Replying to Simon Charette:
I'm not sure what you mean here by server side cursors. Did you mean server side bindings?
Lol yes
I'm not sure what's causing it to be stricter but if pyscopg>=3 allows for psycopg2 style of parameter bindings we should default to it until we can figure out an adequate solution here. If that doesn't work I can commit time to solving this issue but I fear the solution won't be pretty.
ACK. As Mariusz pointed out we already do some shenanigans for Oracle -- so maybe Oracle does server-side bindings as well? Still I wouldn't surprised if Postgresql is stricter and complains about more things :) But with psycopg3 using server-side bindings we might have more motivation to fix this than for Oracle -- though I think everyone would benefit from named params if we manage to do this. I am just not sure if we can do this easily, we might have to support a mix of %s
and %(name)
in queries for a while or so :/
As for psycopg2 style parameter bindings: Yes psycopg3 supports this since 3.1 https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#client-side-binding-cursors and those should bring it back to a more psycopg2 like behavior. This was to be my escape hatch all along ;) We still might need a fix or two in Django itself since I am not 100% sure that the geo adapters might work fully in "text"-mode which is required for client-side bindings.
comment:12 by , 2 years ago
If my tests didn't fail me, this allows us to switch to client-side binding cursors again while enabling the server-side binding via OPTIONS
:
-
django/db/backends/postgresql/base.py
diff --git a/django/db/backends/postgresql/base.py b/django/db/backends/postgresql/base.py index 17a3c7a377..ccf483cebf 100644
a b class DatabaseWrapper(BaseDatabaseWrapper): 222 222 conn_params = {**settings_dict["OPTIONS"]} 223 223 224 224 conn_params.pop("assume_role", None) 225 conn_params.pop("server_side_binding", None) 225 226 conn_params.pop("isolation_level", None) 226 227 if settings_dict["USER"]: 227 228 conn_params["user"] = settings_dict["USER"] … … class DatabaseWrapper(BaseDatabaseWrapper): 268 269 connection = self.Database.connect(**conn_params) 269 270 if set_isolation_level: 270 271 connection.isolation_level = self.isolation_level 271 if not is_psycopg3: 272 if is_psycopg3: 273 connection.cursor_factory = ( 274 ServerBindingCursor if options.get("server_side_binding") else Cursor 275 ) 276 else: 272 277 # Register dummy loads() to avoid a round trip from psycopg2's 273 278 # decode to json.dumps() to json.loads(), when using a custom 274 279 # decoder in JSONField. 275 280 psycopg2.extras.register_default_jsonb( 276 281 conn_or_curs=connection, loads=lambda x: x 277 282 ) 278 connection.cursor_factory = Cursor283 connection.cursor_factory = Cursor 279 284 return connection 280 285 281 286 def ensure_timezone(self): … … class DatabaseWrapper(BaseDatabaseWrapper): 436 441 437 442 if is_psycopg3: 438 443 439 class Cursor(Database.Cursor): 440 """ 441 A subclass of psycopg cursor implementing callproc. 442 """ 443 444 class CursorMixin: 444 445 def callproc(self, name, args=None): 445 446 if not isinstance(name, sql.Identifier): 446 447 name = sql.Identifier(name) … … if is_psycopg3: 457 458 self.execute(stmt) 458 459 return args 459 460 461 class ServerBindingCursor(CursorMixin, Database.Cursor): 462 pass 463 464 class Cursor(CursorMixin, Database.ClientCursor): 465 pass 466 460 467 class CursorDebugWrapper(BaseCursorDebugWrapper): 461 468 def copy(self, statement): 462 469 with self.debug_sql(statement):
follow-up: 20 comment:16 by , 2 years ago
I suggest that CI have a build that uses server_side_binding
so that regressions aren't introduced for that configuration. For example, in the current PostgreSQL build that defaults to client-side binding, ConcatPair.as_postgresql
can be removed without any test failures.
test_group_by_nested_expression_with_params
needs to be marked as expected failure when using server_side_binding
. Is it appropriate to create a follow-up ticket if it might be possible to fix that in the future?
When I read the current "Server-side parameters binding" section in Django's docs, it left me wondering how to choose between the two. Perhaps there's not much more to be said than what's in the psycopg documentation. Though we typically don't document bugs, I wonder if Django's documentation should note the limitation for queries like the one reported here.
comment:17 by , 2 years ago
I agree with Tim, having a bit more documentation on the known limitations/bugs of server-side binding (we should probably create an issue to track the group by one or maybe use #34262 with a dual purpose) and its gotchas which have ties to the usage of prepared statements (#20516).
FWIW I was surprised to learn that psycopg>=3
also prepares statement automatically on the fifth execution which might result in unexpected behaviour.
I'm not sure if this is the case when client-side binding is used though, I assume not as that requires the usage of server-side bindings? confirmed from the docs
Client-side cursors don’t support binary parameters and return values and don’t support prepared statements.
All that to say that we should probably have a section with a summary of what to expect when switching to psycopg>=3
and enabling server-side binding and maybe advertise the support as experimental for this first release?
comment:18 by , 2 years ago
Yes more docs will certainly not hurt :) @tim Can you submit PRs for the expected failures etc?
FWIW I was surprised to learn that psycopg>=3 also prepares statement automatically on the fifth execution which might result in unexpected behaviour.
Yes, but Django disables that, we probably should expose that via options as well ;)
comment:20 by , 2 years ago
Replying to Tim Graham:
I suggest that CI have a build that uses
server_side_binding
so that regressions aren't introduced for that configuration. For example, in the current PostgreSQL build that defaults to client-side binding,ConcatPair.as_postgresql
can be removed without any test failures.
test_group_by_nested_expression_with_params
needs to be marked as expected failure when usingserver_side_binding
.
Done, see PR.
Thanks for the report. I was to able to reproduce the issue without
SubqueryArray()
and with built-inJSONObject
:Bug in 09ffc5c1212d4ced58b708cbbf3dfbfb77b782ca.