Opened 22 months ago

Last modified 9 months ago

#34262 new Bug

Queryset grouped by annotation with aggregates on another annotated expression crashes on MySQL with sql_mode=only_full_group_by.

Reported by: Mariusz Felisiak Owned by:
Component: Database layer (models, ORM) Version: 4.1
Severity: Normal Keywords: mysql only_full_group_by
Cc: Simon Charette, David Wobrock Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Queryset grouped by annotation with aggregates on another annotated expression crashed on MySQL with sql_mode=only_full_group_by, e.g.

    def test_group_by_nested_expression_with_params(self):
        books_qs = (
            Book.objects.annotate(greatest_pages=Greatest("pages", Value(600)))
            .values(
                "greatest_pages",
            )
            .annotate(
                min_pages=Min("pages"),
                least=Least("min_pages", "greatest_pages"),
            )
            .values_list("least", flat=True)
        )
        self.assertCountEqual(books_qs, [300, 946, 1132])

crashes with:

django.db.utils.OperationalError: (1055, "Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_django_2.aggregation_book.pages' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by")

Change History (11)

comment:1 by Mariusz Felisiak, 22 months ago

Cc: Simon Charette added

comment:2 by Carlton Gibson, 22 months ago

Triage Stage: UnreviewedAccepted

comment:3 by David Wobrock, 22 months ago

Cc: David Wobrock added

Hey there,

Took a look at what is happening and why MySQL is failing with ONLY_FULL_GROUP_BY.

In short and simplified, this statement works:

mysql> SELECT GREATEST(pages, 600), MIN(pages) FROM aggregation_book GROUP BY GREATEST(pages, 600) ORDER BY NULL;
+----------------------+------------+
| GREATEST(pages, 600) | MIN(pages) |
+----------------------+------------+
|                  600 |        300 |
|                 1132 |       1132 |
|                  946 |        946 |
+----------------------+------------+
3 rows in set (0,01 sec)

And when you try to add a third expression, that uses the two first:

mysql> SELECT GREATEST(pages, 600), MIN(pages), LEAST(MIN(pages), GREATEST(pages, 600)) AS least FROM aggregation_book GROUP BY GREATEST(pages, 600) ORDER BY NULL;
ERROR 1055 (42000): Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test_django_tests.aggregation_book2.pages' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

MySQL is not happy, even though it seems rather straightforward that this query should work.

The resources I could find on the topic are here:

And the blog post explains in more depth why it's not working.


That leaves us with a choice to make for Django's behavior I reckon :)
Some options:

1. Add an ANY_VALUE around the problematic expression

That solves the issue here for instance:

mysql> SELECT GREATEST(pages, 600), MIN(pages), ANY_VALUE(LEAST(MIN(pages), GREATEST(pages, 600))) AS least FROM aggregation_book2 GROUP BY GREATEST(pages, 600) ORDER BY NULL;
+----------------------+------------+-------+
| GREATEST(pages, 600) | MIN(pages) | least |
+----------------------+------------+-------+
|                  600 |        300 |   300 |
|                 1132 |       1132 |  1132 |
|                  946 |        946 |   946 |
+----------------------+------------+-------+
3 rows in set (0,00 sec)

However, I fear that detecting when to wrap the expression with an ANY_VALUE is a rabbit hole we don't want to go down, as we might end up implementing what the MySQL team didn't want to implement.

2. Raise awareness

We could, firstly, document the potential issue, and secondly raise a warning when such an error occurs when executing a query a Django.
That way, users are at least aware that's not entirely their or Django's fault.

3. Generally change query generation

Another type of workaround suggested by the MySQL blog post is to use a subquery/derived table:

mysql> SELECT greatest_pages,
           MIN(pages),
           LEAST(MIN(pages), greatest_pages) AS least
    FROM (SELECT GREATEST(pages, 600) greatest_pages,
                 pages
          FROM aggregation_book2) AS t
    GROUP BY greatest_pages
    ORDER BY NULL;
+----------------+------------+-------+
| greatest_pages | MIN(pages) | least |
+----------------+------------+-------+
|            600 |        300 |   300 |
|           1132 |       1132 |  1132 |
|            946 |        946 |   946 |
+----------------+------------+-------+
3 rows in set (0,00 sec)

So that we always try to group on a column, and not an expression.
Even though, it might be worse in terms of performances, depending the DB implementation I guess.

This change would then affect all databases I reckon, which is a much larger change, and therefore riskier.

4. Any other option! :D

I hope all of this makes sense, happy to read any thoughts on this :)
See ya!

comment:4 by Simon Charette, 22 months ago

Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.

I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.

A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level SELECT expressions which it's good at inferring dependencies from

SELECT LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked

This should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).

in reply to:  4 ; comment:5 by Amir Karimi, 16 months ago

Replying to Simon Charette:

Thanks for the analysis David, option 1. and 2. were also the conclusion of my limited investigation on the subject so it's great to have cross peer validation on the subject.

I think there might be a way to implement 3. by reusing some of the logic used to implement masking of columns when filtering against window functions which requires two level of subquery wrapping.

A different of approaching 3. is to think that any form of masking of annotations/aliases used for grouping purposes would result in a subquery pushdown. So to reuse your example, instead of performing a subquery pushdown to compute expressions used with aggregate queries we'd do it the other way around to have MySQL group against top level SELECT expressions which it's good at inferring dependencies from

SELECT LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked

This should reduce the area of impact of this change to aggregating queries that group against a value that isn't explicitly selected and would also happen to solve the last remaining known issue with using server-side parameters binding for Postgres (#34255).

I'm curios to know what happened with this issue. Any updates?

in reply to:  5 comment:6 by Mariusz Felisiak, 16 months ago

Replying to Amir Karimi:

I'm curios to know what happened with this issue. Any updates?

Feel-free to work on this issue. Please don't leave comments like any updates? they don't help and cause unnecessary noise to the history.

comment:7 by Jonny Park, 16 months ago

Owner: changed from nobody to Jonny Park
Status: newassigned

comment:8 by Jonny Park, 15 months ago

I think @David Wobrock's query is easier to implement and covers more cases.

For example, if we have a followinfg queryset:

        books_qs = (
            Book.objects.annotate(greatest_pages=Greatest("pages", Value(600)))
            .values(
                "greatest_pages",
            )
            .annotate(
                min_pages=Min("pages"),
                least=Least("min_pages", "greatest_pages"),
            )
        )

Creating the following query that @David Wobrock presented seems like more sense to me and covers many other cases.

SELECT greatest_pages,
           MIN(pages),
           LEAST(MIN(pages), greatest_pages) AS least
    FROM (SELECT GREATEST(pages, 600) greatest_pages,
                 pages
          FROM aggregation_book2) AS t
    GROUP BY greatest_pages
    ORDER BY NULL;

If we were to take @Simon Charette's query, it could be like this:

SELECT geatest_pages, min_pages, LEAST(min_pages, greatest_pages) AS `least` FROM (
  SELECT
    GREATEST(`aggregation_book`.`pages`, 600) greatest_pages,
    MIN(`aggregation_book`.`pages`) min_pages
  FROM `aggregation_book`
  GROUP BY 1 ORDER BY NULL
) masked

I think the position of "MIN(aggregation_book.pages) min_pages" look awkward for me.
With .values_list("least", flat=True) clause present, there was a obvious reason for "MIN(aggregation_book.pages) min_pages" to be pushed down because it is a dependency for least, but without .values_list("least", flat=True) it loses it's reason to be pushed down.
I am a bit suspicious that choosing which additional item to be pushed down by looking at values_list worth it's effort considering frequency of this use case is thought to be small.

Version 2, edited 15 months ago by Jonny Park (previous) (next) (diff)

comment:9 by Jonny Park, 12 months ago

Owner: Jonny Park removed
Status: assignednew

comment:10 by Simon Charette, 12 months ago

FWIW this relates to #34992 where we had to disable allows_group_by_selected_pks on MariaDB entirely as it doesn't implement any form of functional dependence resolition.

comment:11 by Simon Charette, 9 months ago

A recent article on any_value and functional dependency if it can be of help to anyone working on this issue.

Note: See TracTickets for help on using tickets.
Back to Top