Opened 10 years ago

Last modified 10 months ago

#24462 new New feature

Add a new QuerySet operation to use current results as a subquery

Reported by: trosos Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: distinct order_by subquery
Cc: mihadra@…, trosos, John Speno, Jameel A., Paolo Melchiorre, Dave Johansen Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When I try to order a queryset that uses distinct(...), e.g.:

qs2 = qs.order_by('a', 'b').distinct('a')
qs3 = qs2.order_by('c')

then Django seems to forget about the first order_by and apply the second order_by before applying distinct. I would expect the second order_by to be applied after applying distinct.

This is particularly troublesome when one wants to create a custom manager on top of qs2.

I don't know if this is the intended behavior, but it is very unexpected to me (see the section "Rationale").

Steps to reproduce

Use the Postgre SQL backend.

In models.py:

from django.db import models

class YoungestAmongNamesakes(models.Manager):
    def get_queryset(self):
        return super(YoungestAmongNamesakes, self).get_queryset().\
          order_by('name', 'age').distinct('name')

class Human(models.Model):
    nick = models.CharField(max_length=20)
    name = models.CharField(max_length=20)
    age = models.IntegerField()

    objects = models.Manager()
    youngest_among_namesakes = YoungestAmongNamesakes()

In the interactive shell:

Human.objects.create(nick='foo', name='Helen', age=20)
Human.objects.create(nick='bar', name='Helen', age=23)
Human.objects.create(nick='baz', name='Jennifer', age=15)

for human in Human.youngest_among_namesakes.all():
    print human.nick

for human in Human.youngest_among_namesakes.order_by('name', 'nick'):
    print human.nick

for human in Human.youngest_among_namesakes.order_by('nick'):
    print human.nick

Actual results

The first iteration outputs:

foo
baz

The second iteration outputs:

bar
baz

The third iteration raises

ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Human.youngest_among_namesakes.all() generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC

Human.youngest_among_namesakes.order_by('name', 'nick') generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC

Similarly, Human.youngest_among_namesakes.order_by('nick') generates:

SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" ORDER BY "myapp_human"."nick" ASC

Expected results

I would expect the second iteration to output:

foo
baz

I would expect the third iteration to output:

baz
foo

I would expect Human.youngest_among_namesakes.order_by('name', 'nick') to generate the following query:

SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."name" ASC, "myapp_human"."nick" ASC

Similarly, I would expect Human.youngest_among_namesakes.order_by('nick') to generate the following query:

SELECT "myapp_human"."id", "myapp_human"."nick", "myapp_human"."name", "myapp_human"."age" FROM "myapp_human" WHERE "myapp_human"."id" IN (SELECT DISTINCT ON ("myapp_human"."name") "myapp_human"."id" FROM "myapp_human" ORDER BY "myapp_human"."name" ASC, "myapp_human"."age" ASC) ORDER BY "myapp_human"."nick" ASC

Rationale

I would expect any queryset qs to contain the same objects as qs.order_by(...), just in a possibly different order.

Workaround

As a (suboptimal) workaround, one might use a subquery instead:

class YoungestAmongNamesakes(models.Manager):
    def get_queryset(self):
        qs = super(YoungestAmongNamesakes, self).get_queryset().\
          order_by('name', 'age').distinct('name')
        return super(YoungestAmongNamesakes, self).get_queryset().filter(pk__in=qs)

This however generates unnecessarily complex SQL query for Human.youngest_among_namesakes.all().

Suggestion

I would suggest to rewrite django.db.models.query.QuerySet.order_by not to unconditionally clear the previous ordering, but to first check whether distinct(...) is used in the present queryset and use a subquery in such a case.

Something like this (the code is untested and written quickly just to document my thought):

    def order_by(self, *field_names):
        """
        Returns a new QuerySet instance with the ordering changed.
        """
        assert self.query.can_filter(), \
            "Cannot reorder a query once a slice has been taken."
        obj = self._clone()

        if self.query.distinct and self.query.distinct_fields:
            from django.db.models.sql.subqueries import AggregateQuery
            subquery = obj

            obj = AggregateQuery(obj.model)
            try:
                obj.add_subquery(subquery, using=self.db)
            except EmptyResultSet:
                obj = subquery
        else:
            obj.query.clear_ordering(force_empty=False)

        obj.query.add_ordering(*field_names)
        return obj

Attachments (1)

24462.diff (664 bytes ) - added by Tim Graham 10 years ago.

Download all attachments as: .zip

Change History (15)

comment:1 by trosos, 10 years ago

Cc: trosos added
Type: UncategorizedBug

comment:2 by Tim Graham, 10 years ago

Not sure the proposal is feasible or a good idea, but attaching a doc patch about order_by() clearing previous ordering which we can commit in the meantime.

by Tim Graham, 10 years ago

Attachment: 24462.diff added

comment:3 by Anssi Kääriäinen, 10 years ago

I don't find changing the way .order_by().distinct().order_by() works a good idea. It *will* break a lot of existing code. For example, if you have automatic Meta ordering for model, what should Human.objects.distinct('name').order_by('name', 'age') return?

Instead I think we need a new queryset operation to force Django use the current results as a subquery. There are various cases where you'd want this. Simplest one is SomeModel.objects.order_by('posted')[0:10].subquery().order_by('-posted').

This feature is likely pretty hard to implement correctly. Generating the SQL for the subquery is easy, but generating references to the subquery's available fields correctly is likely going to be hard.

comment:4 by Tim Graham <timograham@…>, 10 years ago

In b9d9ab23:

Refs #24462 -- Emphasized that order_by() clears previous ordering.

comment:5 by Tim Graham <timograham@…>, 10 years ago

In f8ed647:

[1.7.x] Refs #24462 -- Emphasized that order_by() clears previous ordering.

Backport of b9d9ab23bdcc404708aada664e718a9d56415ca3 from master

comment:6 by Tim Graham <timograham@…>, 10 years ago

In 5b91802:

[1.8.x] Refs #24462 -- Emphasized that order_by() clears previous ordering.

Backport of b9d9ab23bdcc404708aada664e718a9d56415ca3 from master

comment:7 by Tim Graham, 10 years ago

Summary: Using .distinct().order_by() on a queryset produces unexpected resultsAdd a new QuerySet operation to use current results as a subquery
Triage Stage: UnreviewedAccepted
Type: BugNew feature
Version: 1.7master

comment:8 by John Speno, 3 years ago

Cc: John Speno added

comment:9 by Jameel A., 3 years ago

I believe we have a similar need, but for different reasons.

We have a JSONB postgres column and want to use jsonb_array_elements followed by filtering. I have introduced a custom Func, but because jsonb_array_elements expands into a set of columns, you cannot apply a where clause filter on the function column. A simple solution is to use jsonb_array_elements in a subquery and then filter the resulting subquery.

Another use case for sub-querying is being able to apply filters on union queries. Our actual use case involves both jsonb_array_elements and UNIONs.

A sample query that I'm trying to write with querysets (but so far have been unable to) is:

select *
from (
	select id, jsonb_array_elements(json_data->'some_array') as elem
	from foo as foo1
	union
	select id, jsonb_array_elements(json_data->'other_array') as elem
	from foo as foo2
) as foo_w_elems
where (elem->>'subfield')::int in (
	select id
	from bar
	where expires_at >= CURRENT_TIMESTAMP
)
Last edited 3 years ago by Jameel A. (previous) (diff)

comment:10 by Jameel A., 3 years ago

Cc: Jameel A. added

comment:11 by Paolo Melchiorre, 3 years ago

Cc: Paolo Melchiorre added

comment:12 by Dave Johansen, 3 years ago

Cc: Dave Johansen added

comment:13 by David Sanders, 18 months ago

For posterity, another PR related to this topic: https://github.com/django/django/pull/16814

comment:14 by Simon Charette, 10 months ago

Another one relating to aggregation over a union https://code.djangoproject.com/ticket/35146

Note: See TracTickets for help on using tickets.
Back to Top