Opened 10 years ago
Last modified 10 months ago
#24462 new New feature
Add a new QuerySet operation to use current results as a subquery
Reported by: | trosos | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | distinct order_by subquery |
Cc: | mihadra@…, trosos, John Speno, Jameel A., Paolo Melchiorre, Dave Johansen | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When I try to order a queryset that uses distinct(...)
, e.g.:
qs2 = qs.order_by('a', 'b').distinct('a') qs3 = qs2.order_by('c')
then Django seems to forget about the first order_by
and apply the second order_by
before applying distinct
. I would expect the second order_by
to be applied after applying distinct
.
This is particularly troublesome when one wants to create a custom manager on top of qs2
.
I don't know if this is the intended behavior, but it is very unexpected to me (see the section "Rationale").
Steps to reproduce
Use the Postgre SQL backend.
In models.py:
from django.db import models class YoungestAmongNamesakes(models.Manager): def get_queryset(self): return super(YoungestAmongNamesakes, self).get_queryset().\ order_by('name', 'age').distinct('name') class Human(models.Model): nick = models.CharField(max_length=20) name = models.CharField(max_length=20) age = models.IntegerField() objects = models.Manager() youngest_among_namesakes = YoungestAmongNamesakes()
In the interactive shell:
Human.objects.create(nick='foo', name='Helen', age=20) Human.objects.create(nick='bar', name='Helen', age=23) Human.objects.create(nick='baz', name='Jennifer', age=15) for human in Human.youngest_among_namesakes.all(): print human.nick for human in Human.youngest_among_namesakes.order_by('name', 'nick'): print human.nick for human in Human.youngest_among_namesakes.order_by('nick'): print human.nick
Actual results
The first iteration outputs:
foo baz
The second iteration outputs:
bar baz
The third iteration raises
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Human.youngest_among_namesakes.all()
generates:
SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC
Human.youngest_among_namesakes.order_by('name', 'nick')
generates:
SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC
Similarly, Human.youngest_among_namesakes.order_by('nick')
generates:
SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."nick" ASC
Expected results
I would expect the second iteration to output:
foo baz
I would expect the third iteration to output:
baz foo
I would expect Human.youngest_among_namesakes.order_by('name', 'nick')
to generate the following query:
SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC
Similarly, I would expect Human.youngest_among_namesakes.order_by('nick')
to generate the following query:
SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."nick" ASC
Rationale
I would expect any queryset qs
to contain the same objects as qs.order_by(...)
, just in a possibly different order.
Workaround
As a (suboptimal) workaround, one might use a subquery instead:
class YoungestAmongNamesakes(models.Manager): def get_queryset(self): qs = super(YoungestAmongNamesakes, self).get_queryset().\ order_by('name', 'age').distinct('name') return super(YoungestAmongNamesakes, self).get_queryset().filter(pk__in=qs)
This however generates unnecessarily complex SQL query for Human.youngest_among_namesakes.all()
.
Suggestion
I would suggest to rewrite django.db.models.query.QuerySet.order_by
not to unconditionally clear the previous ordering, but to first check whether distinct(...)
is used in the present queryset and use a subquery in such a case.
Something like this (the code is untested and written quickly just to document my thought):
def order_by(self, *field_names): """ Returns a new QuerySet instance with the ordering changed. """ assert self.query.can_filter(), \ "Cannot reorder a query once a slice has been taken." obj = self._clone() if self.query.distinct and self.query.distinct_fields: from django.db.models.sql.subqueries import AggregateQuery subquery = obj obj = AggregateQuery(obj.model) try: obj.add_subquery(subquery, using=self.db) except EmptyResultSet: obj = subquery else: obj.query.clear_ordering(force_empty=False) obj.query.add_ordering(*field_names) return obj
Attachments (1)
Change History (15)
comment:1 by , 10 years ago
Cc: | added |
---|---|
Type: | Uncategorized → Bug |
comment:2 by , 10 years ago
by , 10 years ago
Attachment: | 24462.diff added |
---|
comment:3 by , 10 years ago
I don't find changing the way .order_by().distinct().order_by() works a good idea. It *will* break a lot of existing code. For example, if you have automatic Meta ordering for model, what should Human.objects.distinct('name').order_by('name', 'age')
return?
Instead I think we need a new queryset operation to force Django use the current results as a subquery. There are various cases where you'd want this. Simplest one is SomeModel.objects.order_by('posted')[0:10].subquery().order_by('-posted')
.
This feature is likely pretty hard to implement correctly. Generating the SQL for the subquery is easy, but generating references to the subquery's available fields correctly is likely going to be hard.
comment:7 by , 10 years ago
Summary: | Using .distinct().order_by() on a queryset produces unexpected results → Add a new QuerySet operation to use current results as a subquery |
---|---|
Triage Stage: | Unreviewed → Accepted |
Type: | Bug → New feature |
Version: | 1.7 → master |
comment:8 by , 3 years ago
Cc: | added |
---|
comment:9 by , 3 years ago
I believe we have a similar need, but for different reasons.
We have a JSONB postgres column and want to use jsonb_array_elements
followed by filtering. I have introduced a custom Func
, but because jsonb_array_elements
expands into a set of columns, you cannot apply a where
clause filter on the function column. A simple solution is to use jsonb_array_elements
in a subquery and then filter the resulting subquery.
Another use case for sub-querying is being able to apply filters on union queries. Our actual use case involves both jsonb_array_elements
and UNION
s.
A sample query that I'm trying to write with querysets (but so far have been unable to) is:
select * from ( select id, jsonb_array_elements(json_data->'some_array') as elem from foo as foo1 union select id, jsonb_array_elements(json_data->'other_array') as elem from foo as foo2 ) as foo_w_elems where (elem->>'subfield')::int in ( select id from bar where expires_at >= CURRENT_TIMESTAMP )
comment:10 by , 3 years ago
Cc: | added |
---|
comment:11 by , 3 years ago
Cc: | added |
---|
comment:12 by , 3 years ago
Cc: | added |
---|
comment:13 by , 18 months ago
For posterity, another PR related to this topic: https://github.com/django/django/pull/16814
comment:14 by , 10 months ago
Another one relating to aggregation over a union https://code.djangoproject.com/ticket/35146
Not sure the proposal is feasible or a good idea, but attaching a doc patch about
order_by()
clearing previous ordering which we can commit in the meantime.