Opened 3 years ago
Last modified 2 years ago
#33682 assigned Cleanup/optimization
Clarify using distinct() with related fields that have Meta.ordering defined.
Reported by: | Robert Leach | Owned by: | anegawa-j |
---|---|---|---|
Component: | Documentation | Version: | 3.2 |
Severity: | Normal | Keywords: | sql, distinct |
Cc: | Carlton Gibson | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I have a rather complex database and an advanced search interface that creates complex queries. It’s been working great for over a year now.
I recently added a feature to count distinct related table records in the joined results. When I added fields to these many-related tables to .distinct()
(and to .order_by()
), I couldn’t get the test to execute without hitting an InvalidColumnReference
error. And though I’m supplying the same expanded fields list to .order_by()
that I am to .distinct()
, the error claims that SELECT DISTINCT ON expressions must match initial ORDER BY expressions
…
When I print the SQL, the place where it notes a difference has a weird T8
reference where the model name should be in an order by
clause. The corresponding distinct
clause has the full table name instead of the reference, which is what I suspect is triggering the exception.
I was able to create a set of toy models and a test that minimally reproduces the exception...
toymodels.py:
from django.db.models import Model, CharField, AutoField, ForeignKey, ManyToManyField, CASCADE class TestPeak(Model): id = AutoField(primary_key=True) name = CharField(max_length=10) compounds = ManyToManyField( to="TestCompound", related_name="testpeaks", ) class Meta: verbose_name = "testpeak" verbose_name_plural = "testpeaks" ordering = ["name"] class TestCompound(Model): id = AutoField(primary_key=True) name = CharField(max_length=10) class Meta: verbose_name = "testcompound" verbose_name_plural = "testcompounds" ordering = ["name"] class TestSynonym(Model): name = CharField(max_length=10, primary_key=True) compound = ForeignKey( TestCompound, related_name="testsynonyms", on_delete=CASCADE ) class Meta: verbose_name = "testsynonym" verbose_name_plural = "testsynonyms" ordering = ["compound", "name"]
test_bug.py:
from DataRepo.tests.tracebase_test_case import TracebaseTestCase from DataRepo.models.toymodels import TestPeak, TestCompound, TestSynonym from django.db.models import Q class DjangoSQLBug(TracebaseTestCase): maxDiff = None @classmethod def setUpTestData(cls): TestCompound.objects.create(name="testcpd") cpd = TestCompound.objects.get(id__exact=1) TestSynonym.objects.create(name="testsyn",compound=cpd) TestPeak.objects.create(name="testpk") pk = TestPeak.objects.get(id__exact=1) pk.compounds.add(cpd) def test_mm_om_query(self): q_exp = Q(name__iexact="testpk") distinct_fields = ['name', 'pk', 'compounds__testsynonyms__compound', 'compounds__testsynonyms__name', 'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk'] qs = TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields) self.assertEqual(qs.count(), 1)
python manage.py test
output:
Creating test database for alias 'default'... Creating test database for alias 'validation'... System check identified no issues (0 silenced). E ====================================================================== ERROR: test_mm_om_query (DataRepo.tests.sqlbugtest.test_bug.DjangoSQLBug) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute return self.cursor.execute(sql, params) psycopg2.errors.InvalidColumnReference: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_... ^ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/sqlbugtest/test_bug.py", line 21, in test_mm_om_query self.assertEqual(qs.count(), 1) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 412, in count return self.query.get_count(using=self.db) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 519, in get_count number = obj.get_aggregation(using, ['__count'])['__count'] File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 504, in get_aggregation result = compiler.execute_sql(SINGLE) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql cursor.execute(sql, params) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute return self._execute_with_wrappers(sql, params, many=False, executor=self._execute) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers return executor(sql, params, many, context) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute return self.cursor.execute(sql, params) File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute return self.cursor.execute(sql, params) django.db.utils.ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_... ^ ---------------------------------------------------------------------- Ran 1 test in 0.018s FAILED (errors=1) Destroying test database for alias 'default'... Destroying test database for alias 'validation'... gen-rl-macbookair[2022-05-05 12:34:44]:~/PROJECT-local/TRACEBASE/tracebase$
I posted on Django Users about this, if you would like more information:
We’re on Django 3.2 using Postgres.
Change History (13)
comment:1 by , 3 years ago
comment:2 by , 3 years ago
I figured I might have been missing something. I actually have code to avoid the gotcha, but apparently, I only did it for the model from which the query came from (i.e. the "root model"):
# If there are any split_rows manytomany related tables, we will need to prepend the ordering (and pk) fields of # the root model if len(distinct_fields) > 0: distinct_fields.insert(0, "pk") tmp_distincts = self.getOrderByFields(model_name=self.rootmodel.__name__) tmp_distincts.reverse() for fld_nm in tmp_distincts: distinct_fields.insert(0, fld_nm) if order_by is not None and order_by not in distinct_fields: distinct_fields.insert(0, order_by)
def getOrderByFields(self, mdl_inst_nm=None, model_name=None): """ Retrieves a model's default order by fields, given a model instance name. """ ... brevity edit ... # Get a model object mdl = apps.get_model("DataRepo", mdl_nm) if "ordering" in mdl._meta.__dict__: return mdl._meta.__dict__["ordering"] return []
I know this ticket system is not the place for getting support, but if you'll indulge me... would prepending all the meta ordering fields avoid the gotcha if I inserted the meta ordering field(s) before any other fields? (In my use case, the order is unimportant - only the distinct is). I'd found that it did in the case of the "root model".
comment:3 by , 3 years ago
I believe that the note at the bottom of the distinct section of this doc:
https://docs.djangoproject.com/en/4.0/ref/models/querysets/#distinct
answers my above question, but to dissect the verbiage it tells you *what* to do without the *why* fully described. It tells you that adding the _id
field would be necessary to make the expressions match (which is "a" reason why to add the id
field), but it doesn't explicitly explain why that makes them match, or say whether the id
field is precisely required or if any field will do.
If I'd better understood the *why* in that doc, I might have coded the right solution to the gotcha and not overlooked the other cases.
My updated understanding is that it seems that the reason *a* related model field is necessary is because the related model "field" in the model definition that links to the related model isn't a "field". It's a reference that gets turned into a field that by default uses the meta.ordering
. (I didn't even notice that the distinct clause had compound_id
and the order by clause had name
in that position.) So I'm guessing that *any*(?) related model field in front of a (non-field) related model reference (whether it's at the beginning of the distinct list or "just before" the non-field related model reference) would solve the issue? Or perhaps even explicit inclusion of such a (non) field would cause the problem.
I think these are areas in which the doc could be improved just a bit more. Understanding the /why/ better, I think, could be helpful to avoid these pitfalls, and also help to understand an otherwise cryptic error message.
follow-up: 5 comment:4 by , 3 years ago
Robert, Can you propose a documentation improvement via GitHub's PR?
follow-up: 6 comment:5 by , 3 years ago
Replying to Mariusz Felisiak:
Robert, Can you propose a documentation improvement via GitHub's PR?
I can certainly give it a shot, though I'm not the best writer when it comes to brevity.
Also, I don't have a deep understanding of the related Django code, so my understanding could be empirically correct, but technically flawed (like Bohr's model of the atom). For example, when the same field reference is supplied to both .order_by()
and .distinct()
, such as in Simon's example:
TestSynonym.objects.distinct('compound').order_by('compound')
...why is the inserted field in each case not coordinated? Why does the conversion from the reference (compound
) differ? Simon says it resolves to:
list(TestSynonym.objects.distinct('compound').order_by('compound__name'))
but based on my debug output of another test using that above call, that's imprecise. It shows:
QUERY: SELECT DISTINCT ON ("DataRepo_testsynonym"."compound_id") "DataRepo_testsynonym"."name", "DataRepo_testsynonym"."compound_id" FROM "DataRepo_testsynonym" INNER JOIN "DataRepo_testcompound" ON ("DataRepo_testsynonym"."compound_id" = "DataRepo_testcompound"."id") ORDER BY "DataRepo_testcompound"."name" ASC
which means that the distinct field resolution and order by field resolutions are:
distinct
:compound_id
order_by
:name
So it's essentially resolved as:
list(TestSynonym.objects.distinct('compound_id').order_by('compound__name'))
When those methods are assessed individually, I understand why those fields are the preferred solution (e.g. the meta ordering may not be unique), but given that distinct
requires the same fields be present at the beginning of the order-by, I don't know what prevents the code to be written to have those fields be resolved in a way that is copacetic. Like, why not convert the reference into 2 additional fields that together, meet both requirements (name
AND compound_id
)? Order-by would be satisfied and distinct would be satisfied. Or... in my case, name
is unique, so distinct could resolve to the meta ordering without issue...
Is there a technical reason the code doesn't already do this?
follow-up: 7 comment:6 by , 3 years ago
Replying to Robert Leach:
When those methods are assessed individually, I understand why those fields are the preferred solution (e.g. the meta ordering may not be unique), but given that
distinct
requires the same fields be present at the beginning of the order-by, I don't know what prevents the code to be written to have those fields be resolved in a way that is copacetic. Like, why not convert the reference into 2 additional fields that together, meet both requirements (name
ANDcompound_id
)? Order-by would be satisfied and distinct would be satisfied. Or... in my case,name
is unique, so distinct could resolve to the meta ordering without issue...
Is there a technical reason the code doesn't already do this?
This would be another logic that's implicit and probably unexpected by users (at least in some cases). As far as I'm aware it's preferable to fail loudly even with a ProgrammingError
. I'm not sure how to improve this note, maybe it's enough to add a correct example:
-
docs/ref/models/querysets.txt
diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt index a9da1dcf7e..891b8255b0 100644
a b Examples (those after the first will only work on PostgreSQL):: 565 565 ...wouldn't work because the query would be ordered by ``blog__name`` thus 566 566 mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order 567 567 by the relation ``_id`` field (``blog_id`` in this case) or the referenced 568 one (``blog__pk``) to make sure both expressions match. 568 one (``blog__pk``) to make sure both expressions match:: 569 570 Entry.objects.order_by('blog_id').distinct('blog_id') 569 571 570 572 ``values()`` 571 573 ~~~~~~~~~~~~
comment:7 by , 3 years ago
Replying to Mariusz Felisiak:
This would be another logic that's implicit and probably unexpected by users (at least in some cases). As far as I'm aware it's preferable to fail loudly even with a
ProgrammingError
. I'm not sure how to improve this note, maybe it's enough to add a correct example:
docs/ref/models/querysets.txt
diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt index a9da1dcf7e..891b8255b0 100644
a b Examples (those after the first will only work on PostgreSQL):: 565 565 ...wouldn't work because the query would be ordered by ``blog__name`` thus 566 566 mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order 567 567 by the relation ``_id`` field (``blog_id`` in this case) or the referenced 568 one (``blog__pk``) to make sure both expressions match. 568 one (``blog__pk``) to make sure both expressions match:: 569 570 Entry.objects.order_by('blog_id').distinct('blog_id') 569 571 570 572 ``values()`` 571 573 ~~~~~~~~~~~~
I'm not sure that would have been enough for me to have anticipated or correctly interpreted the error I encountered. I just corrected our code base and I now feel I have a clearer picture of how I got here... Let me explain why. I'd thought my expressions did match:
distinct_fields = ['name', 'pk', 'compounds__testsynonyms__compound', 'compounds__testsynonyms__name', 'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk'] qs = TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)
I was like "but I'm supplying the exact same fields for both order_by and distinct".
What I think I'd missed was that "referenced field" in the phrase explicitly order by the relation _id or referenced field
doesn't just apply to related model fields in the queried("/root") model (e.g. Entry
's fields such as blog
in Entry.objects.order_by('blog').distinct('blog')
, but applies to all related models through all my joins - and (indirectly/potentially) their meta.ordering
values... because I wasn't explicitly adding the compounds__testsynonyms__compound
"field" above in my codebase where I ran into this error - I was implicitly adding them -> I had written code to straightforwardly grab and prepend the meta.ordering fields in order to satisfy distinct's requirements and still have things ordered nicely - and that's where I was running into this problem, because it didn't occur to me that the fields in meta.ordering could be changed differently by order_by
and distinct
. meta.ordering lets you add those "blog" fields and I wasn't the developer that had added them to our code base, but I didn't even think to check whether those fields I was adding would be subject to the distinct/order_by gotcha.
I mean - all the information is there in the doc for me to have avoided this. You just have to piece together doc info on meta.ordering
, order_by
, and distinct
and extend the examples to realize it applies to joined models' fields - not just the root model. It was just a bit too complex for me at my level of django experience to anticipate correctly the implications of what I was doing.
Maybe the best thing to do would be to anticipate the motivations that lead me down this path... I wanted to use distinct on a join query, which meant that I needed to add fields to order_by that I didn't really care about - I just needed to add them to meet the requirement that the fields matched. BUT - I didn't want to change the desired ordering - so I needed to explicitly add the default orderings so that the IDs wouldn't override them. So maybe the doc should just point out that if you're adding fields to order_by solely to be able to use distinct, but you don't want to change the default ordering specified in meta.ordering - you need to consider the fact that the meta.ordering fields explicitly added can make you run into the order_by/distinct gotcha.
comment:8 by , 3 years ago
Component: | Database layer (models, ORM) → Documentation |
---|---|
Keywords: | sql, distinct, → sql, distinct |
Summary: | SQL generation bug in `.distinct()` when supplied fields go through multiple many-related tables → Clarify using distinct() with related fields that have Meta.ordering defined. |
Triage Stage: | Unreviewed → Accepted |
Type: | Bug → Cleanup/optimization |
I tried to clarify this, maybe:
-
docs/ref/models/querysets.txt
diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt index a9da1dcf7e..7c2c299b58 100644
a b Examples (those after the first will only work on PostgreSQL):: 554 554 555 555 .. note:: 556 556 Keep in mind that :meth:`order_by` uses any default related model ordering 557 that has been defined. You might have to explicitly order by the relation 558 ``_id`` or referenced field to make sure the ``DISTINCT ON`` expressions 559 match those at the beginning of the ``ORDER BY`` clause. For example, if 560 the ``Blog`` model defined an :attr:`~django.db.models.Options.ordering` by 561 ``name``:: 557 that has been defined in 558 :attr:`Meta.ordering <django.db.models.Options.ordering>`. As a 559 consequence, ordering by a related field is resolved to the list of fields 560 defined in a related ``Meta.ordering`` and may not be the same as a 561 relation ID field used by ``distinct()``. To avoid using ``Meta.ordering`` 562 and make sure the ``DISTINCT ON`` expressions match those at the beginning 563 of the ``ORDER BY`` clause, you can explicitly order by the relation 564 ``_id`` or referenced field. For example, if the ``Blog`` model defined an 565 :attr:`~django.db.models.Options.ordering` by ``name``:: 562 566 563 567 Entry.objects.order_by('blog').distinct('blog') 564 568 565 569 ...wouldn't work because the query would be ordered by ``blog__name`` thus 566 570 mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order 567 571 by the relation ``_id`` field (``blog_id`` in this case) or the referenced 568 one (``blog__pk``) to make sure both expressions match. 572 one (``blog__pk``) to make sure both expressions match:: 573 574 Entry.objects.order_by('blog_id').distinct('blog_id') 569 575 570 576 ``values()`` 571 577 ~~~~~~~~~~~~
What do you think?
comment:9 by , 3 years ago
That's definitely better, as it directly references Meta.ordering, which is a great clue to what the implications are. What do you think about adding something like this to what you wrote above:
If all you want is to make a set of query results distinct without changing the ordering, note you must explicitly "re-"add the otherwise over-ridden fields defined in Meta.ordering. But be careful, if you simply add those fields, you can run afoul of the matching fields requirement between order_by and distinct. The field(s) defined in Meta.ordering can include a foreign key, which will resolve differently in order_by (to the related model's Meta.ordering field(s)) and distinct (the
_id
) and the fields will no longer match between the 2 expressions.
But this gotcha can be entirely avoided by only ever using real database fields in every
Meta.ordering
,order_by
, anddistinct
list instead of adding Django's related model objects (e.g.ForeignKey
,ManyToManyField
,OneToManyField
, etc.) so as not to rely on Django's differing related-model-object to database-field resolving mechanisms (e.g. instead of addingblog
toEntry.Meta.ordering
, addblog__name
).
...but like I said, all the info is there to either put it together and anticipate it (albeit requiring a fairly deep understanding and more experience to achieve those insights), or to interpret the exception. I just feel like piecing together this (what I would consider common) use-case, may make it clearer to those who are endeavoring into their first big Django project and applying complex understandings of and experiences with other programming languages.
Whatever edit gets included, I would like to say that I very much appreciate your attention to this topic.
follow-up: 11 comment:10 by , 3 years ago
OK, I just learned that you cannot add something like blog__name
to Entry's Meta.ordering
. Still learning... So...
If all you want is to make a set of query results distinct without changing the ordering, note you must explicitly "re-"add the otherwise over-ridden fields defined in Meta.ordering. But be careful, if you simply add those fields, you can run afoul of the matching fields requirement between order_by and distinct. The field(s) defined in Meta.ordering can include a foreign key (via
ForeignKey
,ManyToManyField
,OneToManyField
, etc. fields), which will resolve differently in order_by (to the related model's Meta.ordering field(s)) and distinct (the _id) and the fields will no longer match between the 2 expressions.
Fields of related models cannot be added to a Model's
Meta.ordering
(e.g. you cannot addblog__name
toEntry.Meta.ordering
), so to avoid the order_by versus distinct matching field gotcha and retain the default ordering in this instance, there are no shortcuts to only apply distinct to an existing query without explicitly re-applying the default ordering and resolving related model objects to database fields.
comment:12 by , 3 years ago
I think Mariusz' suggestion is more appropriate. I don't think the more discursive "gotcha"-type content fits in the reference documentation, rather than a more tutorial like piece, where it would be better. (Perhaps a blog post if there's no spot we yet discuss this in that kind of context.)
I think the solution here comes back to Simon's comment:1…
...error out loudly so at least users are not presented this cryptic error.
If we could catch this and present an error showing the field mismatch folks would be able to correct at source. (That's totally Blue Skies — no idea how easy building a better error message is here.)
(I think like Mariusz, I'd lean heavily towards the raise option, rather than trying to implicitly do/guess the right thing™ here.)
HTH
comment:13 by , 2 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
The
QuerySet.distinct
andorder_by
methods don't resolve references to foreign key the same way when referenced models define aMeta.ordering
.You can get a similar crash with way less code by doing
As the above actually translates to
Due to
TestCompound.Meta.ordering = ('name',)
I would argue that this is invalid but I'm curious to hear what others have to say as the resulting error is definitely cryptic there's possibly ways we could do things better.
Given
DISTINCT ON
usage requires a matchingORDER BY
I see three optionsMeta.ordering
before using it. We've got a precedent against that by making aggregations ignoreMeta.ordering
in the recent versions.distinct('related_with_ordering')
behave likeorder_by
with regards to ordering expansion to make both APIs coherent given they a dependent (resulting clause could match). This will require a deprecation period and spreads the arguable bad related ordering expansion pattern to another method.distinct('related_with_ordering')
error out loudly so at least users are not presented this cryptic error. This maintains the arguably confusing mismatch between both APIs but we stop the spread ofordering
expansion.