Opened 17 years ago

Closed 13 years ago

#6422 closed New feature (fixed)

Support for 'DISTINCT ON' queries with QuerySet.distinct()

Reported by: Manfred Wassmann <manolo@…> Owned by: Jeffrey Gelens
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: dceu2011
Cc: cornbread, ramusus@…, tim.babych@…, hr.bjarni+django@…, jeffrey@…, James Pic, taylor.mitchell@…, diegobz, jonas-django@…, anssi.kaariainen@…, German M. Bravo Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

The patch included modifies django/db/models/query.py so that the distinct method of the QuerySet object optionally takes variable number of field names and, if given, modifies the SQL statements generated so that 'DISTINCT ON (field,...)' is used. The incentive is to allow things like described in the following example:

class Example(models.Model):
  name = models.TextField()
  date = models. DateTimeField()
  other = models.XXX()
  ...

qs = Example.objects.all().distinct('name').order_by('name','-date')

Now qs will return the latest entry for each distinct name. This cannot otherwise be achieved unless resorting to plain SQL.

It should be noted that - at least in standard SQL and Postgres - if name allows NULL values, the query will return all entries for which name is NULL not only the latest one as one might expect.

There will be a couple of '# CHECKME:' comments scattered over the patched file. I tried to mark all places where the change might have a side effect but it wasn't apparent to me if and how to fix it. I.e. there should be no side effect from the code outside the commented regions.

Attachments (19)

query.py.patch (3.9 KB ) - added by Manfred Wassmann <manolo@…> 17 years ago.
Patch for django/db/models/query.py
query.py.2.patch (4.5 KB ) - added by Manfred Wassmann <manolo@…> 17 years ago.
New patch fixes count method and resolves quoting issues
ticket6422.diff (3.3 KB ) - added by mrts 15 years ago.
Update patch to 1.1.X
#6422-distinct.diff (1.9 KB ) - added by German M. Bravo 14 years ago.
Patch updated for 1.3
distinct_on.2.diff (11.9 KB ) - added by Taylor Mitchell 14 years ago.
Add (failing) test for M2M intermediate model
distinct_on.diff (12.8 KB ) - added by Jeffrey Gelens 13 years ago.
distinct_on.3.diff (12.1 KB ) - added by Taylor Mitchell 13 years ago.
distinct_on.4.diff (18.4 KB ) - added by Jeffrey Gelens 13 years ago.
distinct_on.5.diff (18.5 KB ) - added by Michal Petrucha 13 years ago.
Fixed patch with failing testcase (sorry, shouldn't have modified the patch by hand hastily...)
distinct_on.5.2.diff (18.5 KB ) - added by Chris Wesseling 13 years ago.
Added a failing test case for relationship traversals. (this one does apply)
distinct_on.6.diff (18.5 KB ) - added by Jeffrey Gelens 13 years ago.
Fixed field traversal
distinct_on.7.diff (18.5 KB ) - added by Jeffrey Gelens 13 years ago.
Updated distinct method in base class to match the new params
distinct_on.8.diff (12.0 KB ) - added by Ramiro Morales 13 years ago.
Patch with docs and tests updates
distinct_on.9.diff (19.5 KB ) - added by Jeffrey Gelens 13 years ago.
Added docstring
distinct_on.10.diff (21.4 KB ) - added by Jeffrey Gelens 13 years ago.
fixed when calling distinct() multiple times on the same query set
distinct_on.11.diff (23.5 KB ) - added by Anssi Kääriäinen 13 years ago.
Possible approach for using correct aliases
distinct_on.12.diff (30.0 KB ) - added by Anssi Kääriäinen 13 years ago.
distinct_on.13.diff (42.5 KB ) - added by Jeffrey Gelens 13 years ago.
distinct_on.14.diff (29.1 KB ) - added by Ramiro Morales 13 years ago.
Added Anssi to AUTHORS, release notes blurb, removed unrelated tests change, renamed DatabaseOperations.dictinct() to distinct_sql()

Download all attachments as: .zip

Change History (87)

by Manfred Wassmann <manolo@…>, 17 years ago

Attachment: query.py.patch added

Patch for django/db/models/query.py

comment:1 by Manfred Wassmann <manolo@…>, 17 years ago

It should further be noted that - at least with Postgres - fields specified with distinct() have to appear first with the order_by() statement.

comment:2 by Manfred Wassmann <manolo@…>, 17 years ago

Has patch: set

by Manfred Wassmann <manolo@…>, 17 years ago

Attachment: query.py.2.patch added

New patch fixes count method and resolves quoting issues

comment:3 by Manfred Wassmann <manolo@…>, 17 years ago

I have uploaded a new patch which fixes the handling of multiple fields in a DISTINCT ON query as well as the count method for DISTINCT ON queries and resolves the qouting issues.

For DISTINCT ON queries the distinct method has now to be invoked as distinct(on_fields=<FIELD_LIST>), e.g. distinct(on_fields=('field1','field2')). For convenience a single field name can be passed as a plain string like distinct(on_fields='fieldname').

comment:4 by Malcolm Tredinnick, 17 years ago

Needs documentation: set
Needs tests: set
Patch needs improvement: set
Triage Stage: UnreviewedAccepted

Patches against trunk's query.py aren't particularly useful at the moment, since that code has been more or less entirely rewritten in the queryset-refactor branch. So, if you feel enthusiastic, you could try to update this to work with that branch, otherwise it will need to be rewritten after the branch is merged into trunk.

As to the feature itself, the "on_fields" parameter doesn't really fit with the rest of the API. Call the parameter "fields" for consistency. In fact, given that distinct() takes no parameters currently, it probably makes the most sense just to pass it field names: so it will take *args. That's consistent with values(), valuelist() and select_related().

I notice the patch doesn't contain any tests, which will be necessary, since we need to be able to verify it's correct. Make sure you test any interaction with the .count() method on querysets, since that required some special handling with normal .distinct() calls. Documentation needed as well.

comment:5 by Erin Kelly, 16 years ago

Cc: Erin Kelly added

comment:6 by cornbread, 15 years ago

Cc: cornbread added; Erin Kelly removed

I just ran into this issue. I am trying to return unique Session keys from a table. I tried PageRequest.objects.order_by('session').distinct().count() and I get 217 records and not 1 that I should be getting. Would be nice to just be able to do PageRequest.objects.all().distinct('session').count()

Looking forward to your new patch Manfred Wassmann. Thanks for the great work!

comment:7 by ramusus, 15 years ago

Cc: ramusus@… added

comment:8 by tymofiy, 15 years ago

Cc: tim.babych@… added

comment:9 by hejsan, 15 years ago

Cc: hr.bjarni+django@… added

comment:10 by mnbayazit, 15 years ago

Any updates on this? Why isn't it being merged with trunk? It's been 2 years!

by mrts, 15 years ago

Attachment: ticket6422.diff added

Update patch to 1.1.X

in reply to:  10 comment:11 by mrts, 15 years ago

Replying to mnbayazit:

Any updates on this? Why isn't it being merged with trunk? It's been 2 years!

This can not be merged to trunk as it does not contain tests and works only on certain backends (e.g. it certainly does not work with SQLite), so it needs a design decision.

However, for people who need this functionality right now, a working patch against 1.1.X is included.

comment:12 by Erin Kelly, 15 years ago

The current patch also doesn't work with Oracle, which doesn't support the "DISTINCT ON" syntax and requires an analytic query for this purpose. I tried coding that up some time ago, but I found that the query architecture didn't really lend itself well to using analytic functions. That was before the sql compiler refactor, so I'll have to give it another look. It may be easier now.

comment:13 by Jeffrey Gelens, 14 years ago

Cc: jeffrey@… added

comment:14 by James Pic, 14 years ago

Cc: James Pic added

comment:15 by Julien Phalip, 14 years ago

Type: New feature

comment:16 by Julien Phalip, 14 years ago

Severity: Normal

by German M. Bravo, 14 years ago

Attachment: #6422-distinct.diff added

Patch updated for 1.3

comment:17 by Jeffrey Gelens, 14 years ago

Easy pickings: unset
UI/UX: unset

Working on this during the DjangoCon EU sprints. Improving the patch, writing tests and documentation.

comment:18 by Jeffrey Gelens, 14 years ago

Needs documentation: unset
Needs tests: unset
Patch needs improvement: unset

Uploaded patch including docs and tests.

comment:19 by Russell Keith-Magee, 14 years ago

Triage Stage: AcceptedReady for checkin

Technically, this introduces a backwards incompatibility on the true_or_false argument, but the argument is undocumented, untested, and based on a search of Google's code search, unused.

comment:20 by Erin Kelly, 14 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Glad to see this getting some attention, but I think we should hold off on RFC for now. The patch could still use further improvement in a few areas:

In particular, there is no mapping of the provided field names to actual fields and then to column names, which means that the values passed in must actually be the column names, not the field names. It also means there is no check that the specified fields are actually defined, prior to executing the query.

Also, at some point along the way we seem to have lost the special handling for queryset.distinct('somefield').count(). I don't think this is strictly necessary as long as we take care to raise an exception in that case, but right now I see no reason to exclude it.

The patch hard-wires in self.query.model._meta.db_table for the columns in the distinct list, which prevents fields on related tables from being specified (e.g. queryset.distinct('othertable__somefield')). Again, probably not necessary to commit, but this should be easily fixable once the field mapping is done.

comment:21 by Jeffrey Gelens, 14 years ago

True, fixing the things you mentioned

comment:22 by Chris Wesseling, 14 years ago

We're almost there.. fixed the count case.. Now for the join case...

What would be the usecase for the queryset.distinct('othertable__somefield')

Last edited 14 years ago by Chris Wesseling (previous) (diff)

comment:23 by Jeffrey Gelens, 14 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:24 by Jeffrey Gelens, 14 years ago

Owner: changed from nobody to Jeffrey Gelens
Status: newassigned

comment:25 by Jeffrey Gelens, 14 years ago

TODO: use column instead of given name. Will fix 2nd sprint day.

comment:26 by Jeffrey Gelens, 14 years ago

done, ready for checkin

comment:27 by Jeffrey Gelens, 14 years ago

Keywords: dceu2011 added

comment:28 by Jeffrey Gelens, 14 years ago

Needs tests: set

We still need to check if the indexes are being created. For this we need better database introspection for indexes. For this #16220 will need to be fixed first. Till that moment, this ticket is on hold.

Version 0, edited 14 years ago by Jeffrey Gelens (next)

comment:29 by Jeffrey Gelens, 14 years ago

Needs tests: unset

comment:30 by Taylor Mitchell, 14 years ago

Cc: taylor.mitchell@… added

by Taylor Mitchell, 14 years ago

Attachment: distinct_on.2.diff added

Add (failing) test for M2M intermediate model

comment:31 by Taylor Mitchell, 14 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Tried this against our app and found some issues using it with a M2M model. Updated tests. I also seem to have lost the AUTHORS hunk in the process and Trac doesn't want to replace the attachment, apologies.

I would also add a +1 to adding join syntax if it can be done, I will try to think of a use case that's not as domain-specific as ours if needed.

comment:32 by Jeffrey Gelens, 14 years ago

Fixed your new issue, will create a patch later today.

comment:33 by Jeffrey Gelens, 13 years ago

Fixed issue with M2M models. Could you check if this is working?

comment:34 by Jeffrey Gelens, 13 years ago

Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:35 by Jeffrey Gelens, 13 years ago

The patch was improved and this feature is working OK with M2M relations now. Added a test for this special case.

comment:36 by Taylor Mitchell, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

There's an interplay between DISTINCT ON and ORDER BY that isn't clear. I was trying to nail it down before I replied but I got stuck, sorry about that.

To demonstrate, if you use the models from tests.regressiontests.queries and try to do: Tag.objects.distinct('parent') you get an error because the default ordering for Tag is by 'name'. I am not sure how to rectify this. Doing Tag.objects.order_by('parent').distinct('parent') doesn't work either. If this isn't a bug, then the docs should be expanded.

comment:37 by Jeffrey Gelens, 13 years ago

Patch needs improvement: unset

The 'parent' is an foreign key to self. When you print the query you'll see it doesn't add the parent_id to the ORDER BY clause:

>>> print Tag.objects.all().order_by('parent').query
SELECT "queries_tag"."id", "queries_tag"."name", "queries_tag"."parent_id", "queries_tag"."category_id" FROM "queries_tag" LEFT OUTER JOIN "queries_tag" T2 ON ("queries_tag"."parent_id" = T2."id") ORDER BY T2."name" ASC


The DISTINCT ON expression requires the leftmost ORDER BY to match the fields given in the DISTINCT ON clause. To make sure the parent_id will be in ORDER BY, use parent_ _pk or parent_ _id:

>>> print Tag.objects.all().order_by('parent__pk').query
SELECT "queries_tag"."id", "queries_tag"."name", "queries_tag"."parent_id", "queries_tag"."category_id" FROM "queries_tag" LEFT OUTER JOIN "queries_tag" T2 ON ("queries_tag"."parent_id" = T2."id") ORDER BY "queries_tag"."parent_id" ASC

Now if you update your ORM call a lil, you'll get the expected results:

>>> print Tag.objects.distinct('parent').order_by('parent__pk')
[<Tag: t3>, <Tag: t5>, <Tag: t1>]

I added a note to the documentation about the 'order_by' requirement.

It's not a bug, so I hope the documentation is more clear now.

Last edited 13 years ago by Jeffrey Gelens (previous) (diff)

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.diff added

comment:38 by Taylor Mitchell, 13 years ago

Patch needs improvement: set

Looks like something went awry with the last patch posted, the tests and some other things disappeared. Here's the combined patch, with a test for the self-referential Fk described above and an edit to the docs for readability.

by Taylor Mitchell, 13 years ago

Attachment: distinct_on.3.diff added

comment:39 by Taylor Mitchell, 13 years ago

Patch needs improvement: unset

comment:40 by Jeffrey Gelens, 13 years ago

Triage Stage: AcceptedReady for checkin

I guess it's ready for checkin now.

comment:41 by diegobz, 13 years ago

Cc: diegobz added

comment:42 by Jonas H., 13 years ago

Cc: jonas-django@… added

comment:43 by Jacob, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

Sadly the patch doesn't apply any more; bumping back to accepted.

I'm really sorry we dallied so long that the patch grew stale. I'll try to get it updated myself but if someone wants to save me a bit of time and update the patch that'd make me very happy.

comment:44 by Jeffrey Gelens, 13 years ago

I'll update it to match the current trunk in a couple of hours. It will be great if you could apply the patch soon after that, as it seems this part of the code is changing rapidly.

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.4.diff added

comment:45 by Jeffrey Gelens, 13 years ago

Patch needs improvement: unset

Added a new version of the patch.

comment:46 by Jeffrey Gelens, 13 years ago

Triage Stage: AcceptedReady for checkin

comment:47 by Michal Petrucha, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

I've come up with another case where it fails; added it to the tests in the patch.

Basically, the error lies in the assumption that all fields in the DISTINCT ON set belong to the base model of the QuerySet.

in reply to:  47 comment:48 by anonymous, 13 years ago

Replying to koniiiik:

I've come up with another case where it fails; added it to the tests in the patch.

Basically, the error lies in the assumption that all fields in the DISTINCT ON set belong to the base model of the QuerySet.

That patch doesn't apply:

(django)ligfiets$ patch -p1 <distinct_on.5.diff 
patching file AUTHORS
patching file django/db/backends/__init__.py
patching file django/db/backends/postgresql_psycopg2/base.py
patching file django/db/backends/postgresql_psycopg2/operations.py
patching file django/db/models/query.py
patching file django/db/models/sql/compiler.py
patching file django/db/models/sql/query.py
patching file docs/ref/models/querysets.txt
patching file tests/regressiontests/queries/models.py
patching file tests/regressiontests/queries/tests.py
patch: **** malformed patch at line 497:      def setUp(self):

comment:49 by Jeffrey Gelens, 13 years ago

Figured out how to fix that kind of cases. Expect a new patch in a couple of hours.

by Michal Petrucha, 13 years ago

Attachment: distinct_on.5.diff added

Fixed patch with failing testcase (sorry, shouldn't have modified the patch by hand hastily...)

by Chris Wesseling, 13 years ago

Attachment: distinct_on.5.2.diff added

Added a failing test case for relationship traversals. (this one does apply)

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.6.diff added

Fixed field traversal

comment:50 by Jeffrey Gelens, 13 years ago

Patch needs improvement: unset

comment:51 by Michal Petrucha, 13 years ago

One more nit: in the backend prototype you forgot to update BaseDatabaseOperations.distinct to the new prototype without the db_table parameter. Other than that it looks quite promising.

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.7.diff added

Updated distinct method in base class to match the new params

comment:52 by Michal Petrucha, 13 years ago

Triage Stage: AcceptedReady for checkin

comment:53 by anonymous, 13 years ago

bombom:/storage/dists/Django-1.3.1# patch -p1 </storage/dists/distinct_on.7.diff
patching file AUTHORS
Hunk #1 succeeded at 524 (offset -20 lines).
patching file django/db/backends/__init__.py
Hunk #1 succeeded at 335 (offset -33 lines).
Hunk #2 succeeded at 481 (offset -43 lines).
patching file django/db/backends/postgresql_psycopg2/base.py
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file django/db/backends/postgresql_psycopg2/base.py.rej
can't find file to patch at input line 97
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/django/db/backends/postgresql_psycopg2/operations.py b/django/db/backends/postgresql_psycopg2/operations.py
|--- a/django/db/backends/postgresql_psycopg2/operations.py
|+++ b/django/db/backends/postgresql_psycopg2/operations.py
--------------------------
File to patch:
Skip this patch? [y] y
Skipping patch.
1 out of 1 hunk ignored
patching file django/db/models/query.py
Hunk #1 succeeded at 651 (offset -42 lines).
Hunk #2 succeeded at 1078 (offset -42 lines).
patching file django/db/models/sql/compiler.py
Hunk #1 succeeded at 67 (offset -4 lines).
patching file django/db/models/sql/query.py
Hunk #1 FAILED at 121.
Hunk #2 FAILED at 260.
Hunk #3 FAILED at 389.
Hunk #4 FAILED at 1592.
4 out of 4 hunks FAILED -- saving rejects to file django/db/models/sql/query.py.rej
patching file docs/ref/models/querysets.txt
Hunk #1 FAILED at 340.
Hunk #2 FAILED at 369.
2 out of 2 hunks FAILED -- saving rejects to file docs/ref/models/querysets.txt.rej
patching file tests/regressiontests/queries/models.py
Hunk #2 FAILED at 340.
1 out of 2 hunks FAILED -- saving rejects to file tests/regressiontests/queries/models.py.rej
patching file tests/regressiontests/queries/tests.py
Hunk #1 FAILED at 10.
Hunk #2 FAILED at 1732.
Hunk #3 FAILED at 1818.
3 out of 3 hunks FAILED -- saving rejects to file tests/regressiontests/queries/tests.py.rej

Why this happning? :)

comment:54 by Michal Petrucha, 13 years ago

Are you sure you're applying it on a recent SVN trunk? There it applies almost fine, except for a small whitespace error in the patch (probably caused by git) that makes one hunk in tests/regressiontests/queries/models.py fail.

Just out of pure curiosity I just tried it on the 1.3.X branch and there it looks suspiciously similar to your output.

by Ramiro Morales, 13 years ago

Attachment: distinct_on.8.diff added

Patch with docs and tests updates

comment:55 by Ramiro Morales, 13 years ago

Patch needs improvement: set
Triage Stage: Ready for checkinAccepted

I've update the patch with:

  • The new tests failed for me because on my environment the Tag.objects.order_by('parent__pk').distinct('parent') call choose ['<Tag: t2>', '<Tag: t4>', '<Tag: t1>'] instead of ['<Tag: t3>', '<Tag: t45', '<Tag: t1>']. Had to use order_by('parent__pk', 'pk') instead.
  • Fleshed out a bit documentation changes:
    • I'd like the note about the need to use an order_by() call to be more clear. -- I'm clearing RFC because of this.
    • Added some examples of usage based on my interpretation of the new functionality.
  • Moved the addition to the AUTHORS file to the right spot (lexicographical ordering by last name)

comment:56 by Jeffrey Gelens, 13 years ago

Patch needs improvement: unset

I updated the documentation explaining the use of order_by. Hope it's more clear like this.

comment:57 by Anssi Kääriäinen, 13 years ago

Some comments (I haven't actually tried the patch, just read it in diffview):

  • IMHO add_distinct_fields needs a docstring. Other similar methods seem to have one, also.
  • What happens if you specify .distinct(field_list) multiple times with different fields? If the first .distinct call generated joins, won't those joins be left loitering around without any users, but still having refcount > 0?
  • Haven't tested this, but I suspect you can break this by adding fields to DISTINCT ON which aren't in the normal select list and then doing a group by. If you want correct results in this case, you will probably need to subquerify the DISTINCT ON query, then do a group by. NotImplementedError seems to be the sane choice, making this work correctly might be a bit hard and I haven't ever used distinct on + group by combination when writing SQL myself.

The true_or_false argument was removed. Is there any way to get rid of the distinct after this. Does there need to be one?

comment:58 by Jeffrey Gelens, 13 years ago

@akaariai

  • Added a docstring to add_distinct_fields.
  • If you specify distinct(field_list) multiple times it will overwrite the previous one. This is the intended behavior.
  • The fields in DISTINCT ON are always in the select list. I guess your explanation can't happen. If it does could you explain further and give some examples?
  • I discussed the true_or_false argument removal with Russell and Alex during the DjangoCon in Amsterdam this year. As it's a undocumented feature and we couldn't find any usage of it on the web, we decided to remove the argument. It's a weird argument anyway ;) .

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.9.diff added

Added docstring

comment:59 by Anssi Kääriäinen, 13 years ago

Assume you have models:

class A1(models.Model):
    foo = models.TextField()

class B1(models.Model):
    a = models.ForeignKey(A1)

And you do:

qs1 = B1.objects.distinct('a__id', 'a__foo')
print qs1.query
qs2 = qs1.distinct('pk')
print qs2.query

Now you will have a join to A even though you don't have a reference to it. The queries generated:

SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creation_speed_a1"."foo") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id")

SELECT DISTINCT ON ("obj_creation_speed_b1"."id") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id")

Note that in the second query, there is a join into A even if no field of A is used in the query. The join should be removed. But I don't think this is a blocker issue. If you want to fix this, there are at least these two choices:

  • Apply the distinct on in the compiler when the query is generated, order by is handle this way IIRC.
  • Keep record of which aliases are referenced by the DISTINCT ON added fields, and subtract the refcount by one for those aliases if the distinct on is changed.

As said before, in my opinion this is not a must-fix in this ticket.

Now, if you continue from the above created qs1 and issue .annotate(Max('a__id')), you will get this query:

qs3 = qs1.annotate(Max('a__id'))
print qs3.query
list(qs3)

The query generated is this:

SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creation_speed_a1"."foo") "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id", MAX("obj_creation_speed_b1"."a_id") AS "a__id__max" FROM "obj_creation_speed_b1" INNER JOIN "obj_creation_speed_a1" ON ("obj_creation_speed_b1"."a_id" = "obj_creation_speed_a1"."id") GROUP BY "obj_creation_speed_b1"."id", "obj_creation_speed_b1"."a_id"

And the list(qs3) will generate this:

django.db.utils.DatabaseError: column "obj_creation_speed_a1.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT DISTINCT ON ("obj_creation_speed_a1"."id", "obj_creat...
                            ^

You could "fix" this by appending the DISTINCT ON columns into the group by clause. Unfortunately (without going into details) the results aren't what the user would expect. If you want the results the user expects, you would need to do a subquery. If the distinct on is in the inner query or group by is in the inner query depends on the order in which the .annotate and .distinct are applied. A good fix for this would be to raise NotImplementedError("Can't handle queries with aggregates and distinct on clause"). I have written some SQL by hand, and can't remember ever using distinct on and group by in the same query. Maybe there is some use case, but I just don't see the need to block on this issue, either.

Last edited 13 years ago by Anssi Kääriäinen (previous) (diff)

comment:60 by Jeffrey Gelens, 13 years ago

@ akaariai Thanks for your thorough comments :)
I fixed your first point about the join removal when calling distinct() multiple times.
It now applies the distinct in the compiler. This didn't fix the problem however, so I had to unref the fields manually, which is working great.

As you've said, the combination of a annotation and an distinct on is probably a use case which is never used. This is certainly not a blocker for this feature.

Anyways, this patch took a lot of time. I hope all the bugs were fixed :-D

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.10.diff added

fixed when calling distinct() multiple times on the same query set

comment:61 by Anssi Kääriäinen, 13 years ago

Cc: anssi.kaariainen@… added

Sorry, I still found one more problem, and some smaller issues:

  • The distinct() in postgresql_psycopg2/operations.py doesn't take aliases into account. It assumes that the distinct is always for the main table alias. If you first do a filter on m2m relation, then a distinct on for the m2m relation you get:
    # I added coworkers = models.ManyToManyField('self') to Staff model used in tests
    >>> print Staff.objects.distinct('coworkers__name').order_by('coworkers__name').query
    SELECT DISTINCT ON ("queries_staff"."name")
         "queries_staff"."id", "queries_staff"."name", "queries_staff"."organisation", T3."name"
    FROM "queries_staff"
    LEFT OUTER JOIN "queries_staff_coworkers"
         ON ("queries_staff"."id" = "queries_staff_coworkers"."from_staff_id")
    LEFT OUTER JOIN "queries_staff" T3
         ON ("queries_staff_coworkers"."to_staff_id" = T3."id")
    ORDER BY T3."name" ASC
    

If you execute that query, it will error, because distinct on is for "queries_staff"."name", but order by is for T3."name"

  • In the attached patch there is a different approach of join cleanup. It also fixes one other long-standing expected_failure test in queries. The failure is related to similar "lingering joins" done by ordering.
  • The obj.query.distinct_fields = [] in query.distinct() is not needed. The variable will be set anyways in sql.query.add_distinct_fields(). obj.query.distinct = True can also be set in add_distinct_fields()
  • Should the joins be generated as OUTER or INNER joins? Not sure about this. I guess either way will be fine. The attached patch generates OUTER JOINS (because it uses similar code than the order by join creation).
  • Prevent different distinct_fields lists when combining querysets.

I attached a patch which fixes these issues. It isn't the cleanest one (non-DRY, code copy from order_by). But it can be used as a basis for a cleaner patch if needed.

I did also check out how to support aggregates with distinct_fields. However I had a hard time figuring out a realistic test case, so I removed that part. So, jgelens is absolutely correct here: better to first find an use case :)

All in all, the only major issue is the correct alias usage in compiler.py. The rest is just nitpicking. In my opinion the alias usage needs to be fixed before commit, but I may be over-complicating things again... Other than that I think this feature is now very solid. I might be fine to just let the distinct() generated joins linger in there after all. That is what order_by has done for ages.

The attached patch passes all tests, with one long-standing expected failure now gone. As said the patch is just to show one possible approach to the alias issue. I hope jgelens will check it out and see what to do to the above mentioned issues (if anything).

by Anssi Kääriäinen, 13 years ago

Attachment: distinct_on.11.diff added

Possible approach for using correct aliases

comment:62 by Jeffrey Gelens, 13 years ago

I think your additions to the patch are great. Everything works OK. The code copy from order_by is minor and imo unnecessary to split out and generalize for both functions.
Let's mark it as RFC as 1.4alpha is nearing completion. Thanks!

comment:63 by Anssi Kääriäinen, 13 years ago

I hope to do a final cleanup tomorrow. I still think it is a good idea to throw NotImplementedError for aggregate & annotate (this is actually quite simple: in each method check if self.distinct_fields: raise NotImplementedError). The reason is that annotate + DISTINCT ON actually somewhat works, sometimes. But it doesn't work in any meaningful way. So to avoid backwards compatibility issues if annotate + distinct_on combination is going to ever be supported, it is good that there is no behaviour Django must support. The exceptions are simple enough to implement, so IMHO this is a good idea still. Sorry for coming back to this again and again. :)

In addition I will check if there is some easy way to avoid the code duplication between get_ordering and get_distinct join generation (in compiler.py). It is a good idea, as actually both of them _must_ generate same joins, otherwise you will get errors on ordering not matching prefix of DISTINCT ON.

comment:64 by German M. Bravo, 13 years ago

Cc: German M. Bravo added

comment:65 by Anssi Kääriäinen, 13 years ago

Final cleanup done & attached. All tests should pass on sqlite, and the changed tests do pass on PostgreSQL.

Changes from the previous version:

  • DRY in complier.py join setup for ordering and distinct
  • NotImplementedError for .distinct(fields) + annotate/aggregate.
  • Moved the new tests to modeltests/distinct_on_fields

There are a couple of test-fixes combined in the patch. Create 2500 objects -> bulk create 2500 objects, one expected failure is no more, and one test was testing how executing a query alters the query instance (it should not alter it). It now tests the created SQL instead.

As far as I understand this is now ready for checkin. The only minor thing missing is a check for order_by prefix matching distinct_fields prefix. The DB will complain if that is broken, so that really isn't a big deal.

Please review & mark ready for checkin if no problems found. Please check extra-carefully added comments. My English isn't that good...

by Anssi Kääriäinen, 13 years ago

Attachment: distinct_on.12.diff added

comment:66 by Jeffrey Gelens, 13 years ago

Thanks! Looks great, I ran the tests, you forgot to add a init.py to the modeltests/distinct_on_fields module. Also some Tag object creation was accidentally removed. I added this again.
I added a new patch with these changes.

Now it looks we've finally come to a version which is ready for checkin! :-D

Last edited 13 years ago by Jeffrey Gelens (previous) (diff)

by Jeffrey Gelens, 13 years ago

Attachment: distinct_on.13.diff added

comment:67 by Jeffrey Gelens, 13 years ago

Triage Stage: AcceptedReady for checkin

by Ramiro Morales, 13 years ago

Attachment: distinct_on.14.diff added

Added Anssi to AUTHORS, release notes blurb, removed unrelated tests change, renamed DatabaseOperations.dictinct() to distinct_sql()

comment:68 by Ramiro Morales, 13 years ago

Resolution: fixed
Status: assignedclosed

In [17244]:

Added support for modifying the effect of DISTINCT clauses so they
only consider some fields (PostgreSQL only).

For this, the distinct() QuerySet method now accepts an optional
list of model fields names and generates DISTINCT ON clauses on
these cases. Thanks Jeffrey Gelens and Anssi Kääriäinen for their work.

Fixes #6422.

Note: See TracTickets for help on using tickets.
Back to Top