Opened 10 years ago
Last modified 10 months ago
#23557 new Bug
Prevent silent extension of explicit GROUP BY when using order_by
Reported by: | Brian May | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.7 |
Severity: | Normal | Keywords: | |
Cc: | Ryan Cheley | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hello,
With the following db model:
@python_2_unicode_compatible class CPUJob(models.Model): account = models.ForeignKey(Account, blank=True, null=True) username = models.CharField(max_length=50, blank=True, null=True) project = models.ForeignKey(Project, null=True, blank=True) machine = models.ForeignKey(Machine, blank=True, null=True) date = models.DateField(db_index=True, blank=True, null=True) queue = models.ForeignKey(Queue, blank=True, null=True) cpu_usage = models.BigIntegerField(blank=True, null=True) mem = models.BigIntegerField(blank=True, null=True) vmem = models.BigIntegerField(blank=True, null=True) ctime = models.DateTimeField(blank=True, null=True) qtime = models.DateTimeField(blank=True, null=True) etime = models.DateTimeField(blank=True, null=True) start = models.DateTimeField(blank=True, null=True) act_wall_time = models.BigIntegerField(blank=True, null=True) est_wall_time = models.BigIntegerField(blank=True, null=True) jobid = models.CharField(max_length=50, blank=True, null=True, unique=True) cores = models.BigIntegerField(blank=True, null=True) list_mem = models.BigIntegerField(blank=True, null=True) list_pmem = models.BigIntegerField(blank=True, null=True) list_vmem = models.BigIntegerField(blank=True, null=True) list_pvmem = models.BigIntegerField(blank=True, null=True) exit_status = models.BigIntegerField(blank=True, null=True) jobname = models.CharField(max_length=256, blank=True, null=True) software = models.ManyToManyField(SoftwareVersion, blank=True, null=True) class Meta: ordering = ['-date'] db_table = 'cpu_job'
The following on mysql produces a good result:
q = CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'), jobs=Count('id')) print q.query
of
SELECT `cpu_job`.`project_id`, SUM(`cpu_job`.`cpu_usage`) AS `usage`, COUNT(`cpu_job`.`id`) AS `jobs` FROM `cpu_job` GROUP BY `cpu_job`.`project_id` ORDER BY `cpu_job`.`date` DESC
However on Postgresql, with the same data, I get the following query:
SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS "usage", COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY "cpu_job"."project_id", "cpu_job"."date" ORDER BY "cpu_job"."date" DESC
Note additional term "cpu_job"."date" in the GROUP BY. I did not ask for it, it got put there.
I suspect the problem is the sort order on the table. The SQL seems to be fine on mysql (really?????), but appears to be invalid on Postgresql (this actually makes more sense to me). So Django appears to be silently adjusting the request to make it valid on Postgresql.
karaage=> SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS "usage", COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY "cpu_job"."project_id" ORDER BY "cpu_job"."date" DESC ; ERROR: column "cpu_job.date" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: ...cpu_job" GROUP BY "cpu_job"."project_id" ORDER BY "cpu_job"....
I would much rather Django gives an error under Postgresql (and maybe even MYSQL too) rather then silently changing the query, and giving different results to what I expected.
(for this particular query, I don't need sorting, and had not noticed it would cause problems - specifying order_by() seems to solve this problem).
Thanks
Change History (10)
comment:2 by , 10 years ago
Additionally, it may be worth talking about removing the non-standard behaviour from mysql in django by setting the session to "ONLY_FULL_GROUP_BY": http://dev.mysql.com/doc/refman/5.0/en/sql-mode.html#sqlmode_only_full_group_by but that may be backwards incompatible.
comment:3 by , 10 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:4 by , 10 years ago
It should be noted that "full" group-by causes significant performance degradations on mysql; this is related to #17144.
In other respects, @jarshwah's comments are mostly correct -- except for one point, and that is that the SQL spec is retarded, and an ORM can know when it's safe to group by a subset of the retrieved fields (which are unique for the group).
I suspect the only way to resolve this without breaking stuff is to close this ticket as "wontfix".
comment:5 by , 10 years ago
My personal opinion is that Django should never alter silently alter queries. If there is something wrong with the query, it should generate an error rather then attempt to fix the problem by altering the query. The fact the current Postgresql behaviour is documented doesn't make it any more correct IMHO. The programmer should be forced to explicitly state the preferred solution rather then Django guess it.
I would suggest that Django has some sort of legacy mode, enabled by default, for preserving the current behaviour, and a new mode where errors such as these trigger an error condition. That way any problems can be identified and fixed, rather then silently await some change (e.g. different database, or somebody deciding to add an ordering field where there was none), and nobody noticing that the results generated are now wrong.
comment:6 by , 10 years ago
I would be in favour of clearing out the ordering when using a .values call, if an entry in values does not explicitly reference the ordered column. I think this would neatly solve the issue. Unfortunately, I think that'd be backwards incompatible. A flag could be used, but I'm not sure it'd be very nice. It could also be retired after a deprecation period where the behaviour was swapped.
Meta: ordering = ['col'] values_respects_ordering = False # defaults to True
But then I'm sure someone else will say django is silently altering the query by not respecting the Meta.ordering once the behaviour is swapped.
I wish Meta.ordering didn't exist.
comment:7 by , 10 years ago
I am with brian here - I think Django's way to calculate the GROUP BY is complex enough without the GROUP BY clause being altered by .order_by() calls.
In my opinion Django's group by logic should be as follows:
- Group by the primary key of the table in the query by default. The group by will need to be appended by functionally dependent columns. This means that we add any column from the same table to the group by, or any column from any table pointed by direct foreign key or reverse one to one field to the group by (there are some database specific optimizations we could use here).
- The above rule will be overridden when using .values() queries. In that case, group by the user's defined .values() call. Possibly allow extending it by functionally dependent columns. That is, if some table's primary key is in the values() list, then allow extension from that primary key along foreign keys and o2ofields similarly to 1).
- If any other column is needed in the group by, then error out. This means that:
CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'), jobs=Count('id')).order_by('date')
is an error.
The functionally dependent columns appending means that we need to do a group by author.id, author.name on some databases even though technically group by author.id gives the same result. Similarly, we need to group by author.id, author.name, book.id, book.name when .select_related('favorite_book') is applied, where favorite_book is a foreign key from author to book.
The main point here is that we really shouldn't alter the results because .order_by() was added to the query. Just error out in that case. Resist the temptation to guess, explicit is better than implicit and so on...
I think we can deprecate the current behavior of silently altering the group by. Checking if a column is functionally dependent on some subset of columns in the group by isn't exactly easy, but should be doable at least for primary key case. Multi-column unique indexes are going to be a bit harder to support...
comment:8 by , 2 years ago
Summary: | annotate gives different results on postgresql and mysql → Prevent silent extension of explicit GROUP BY when using order_by |
---|
I'll note that the issue initially reported issue here is not reproducible anymore since #14357 which deprecated Meta.ordering
from being considered when doing annotation. Even if that was not the case the optimization of selected primary key grouping in #19259 would also have reduced the grouping to only "cpu_job"."project_id"
.
I still think that it would be worth to keep this ticket open though to go through a deprecation period and then raise an error when values
is used for grouping and is paired with an explicit order_by
that would result in extra GROUP BY
entries. This can likely be achieved in SQLCompiler.get_group_by by branching off self.query.group_by is not True
and comparing the results of calling self.collapse_group_by
with expressions originating from order_by
and ones without and if there is any difference warn/raise.
comment:9 by , 2 years ago
Cc: | added |
---|
comment:10 by , 19 months ago
Some interesting development in SQL:2023 about the possibility to ORDER BY
non-selected columns when using aggregation F868
.
If I understand correctly Postgres 16+ should implement this feature.
The behaviour you're seeing with Postgres is correct, and documented: https://docs.djangoproject.com/en/dev/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Columns in the ORDER BY clause must be added to the GROUP BY clause on Postgres, and any RDBMS that conforms to the spec. MySQL and sqlite allow columns in the select list that aren't also in the GROUP BY. Django should be consistent though - it's weird that different results are returned based on the underlying engine.
I would propose that the MySQL backend should add the order by columns into the group by list.
http://dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html describes the logic behind not requiring columns in the group by statement, but calls out that queries are non-deterministic unless the "free" column is unique for the group. This is a foot gun as far as I'm concerned, unless you *really* know what you're doing. I think it applies more to hand crafted queries rather than queries generated from an ORM.
"Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses." So ORDER BY has no effect on the results, unless the column is returned as part of the query.