Opened 3 years ago

Last modified 2 years ago

#33682 assigned Cleanup/optimization

Clarify using distinct() with related fields that have Meta.ordering defined.

Reported by: Robert Leach Owned by: anegawa-j
Component: Documentation Version: 3.2
Severity: Normal Keywords: sql, distinct
Cc: Carlton Gibson Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

I have a rather complex database and an advanced search interface that creates complex queries. It’s been working great for over a year now.

I recently added a feature to count distinct related table records in the joined results. When I added fields to these many-related tables to .distinct() (and to .order_by()), I couldn’t get the test to execute without hitting an InvalidColumnReference error. And though I’m supplying the same expanded fields list to .order_by() that I am to .distinct(), the error claims that SELECT DISTINCT ON expressions must match initial ORDER BY expressions

When I print the SQL, the place where it notes a difference has a weird T8 reference where the model name should be in an order by clause. The corresponding distinct clause has the full table name instead of the reference, which is what I suspect is triggering the exception.

I was able to create a set of toy models and a test that minimally reproduces the exception...

toymodels.py:

from django.db.models import Model, CharField, AutoField, ForeignKey, ManyToManyField, CASCADE

class TestPeak(Model):
    id = AutoField(primary_key=True)
    name = CharField(max_length=10)
    compounds = ManyToManyField(
        to="TestCompound",
        related_name="testpeaks",
    )
    class Meta:
        verbose_name = "testpeak"
        verbose_name_plural = "testpeaks"
        ordering = ["name"]

class TestCompound(Model):
    id = AutoField(primary_key=True)
    name = CharField(max_length=10)
    class Meta:
        verbose_name = "testcompound"
        verbose_name_plural = "testcompounds"
        ordering = ["name"]

class TestSynonym(Model):
    name = CharField(max_length=10, primary_key=True)
    compound = ForeignKey(
        TestCompound, related_name="testsynonyms", on_delete=CASCADE
    )
    class Meta:
        verbose_name = "testsynonym"
        verbose_name_plural = "testsynonyms"
        ordering = ["compound", "name"]

test_bug.py:

from DataRepo.tests.tracebase_test_case import TracebaseTestCase
from DataRepo.models.toymodels import TestPeak, TestCompound, TestSynonym
from django.db.models import Q

class DjangoSQLBug(TracebaseTestCase):
    maxDiff = None

    @classmethod
    def setUpTestData(cls):
        TestCompound.objects.create(name="testcpd")
        cpd = TestCompound.objects.get(id__exact=1)
        TestSynonym.objects.create(name="testsyn",compound=cpd)
        TestPeak.objects.create(name="testpk")
        pk = TestPeak.objects.get(id__exact=1)
        pk.compounds.add(cpd)

    def test_mm_om_query(self):
        q_exp = Q(name__iexact="testpk")
        distinct_fields = ['name', 'pk', 'compounds__testsynonyms__compound', 'compounds__testsynonyms__name', 'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk']
        qs = TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)
        self.assertEqual(qs.count(), 1)

python manage.py test output:

Creating test database for alias 'default'...
Creating test database for alias 'validation'...
System check identified no issues (0 silenced).
E
======================================================================
ERROR: test_mm_om_query (DataRepo.tests.sqlbugtest.test_bug.DjangoSQLBug)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.InvalidColumnReference: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/sqlbugtest/test_bug.py", line 21, in test_mm_om_query
    self.assertEqual(qs.count(), 1)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 412, in count
    return self.query.get_count(using=self.db)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 519, in get_count
    number = obj.get_aggregation(using, ['__count'])['__count']
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/query.py", line 504, in get_aggregation
    result = compiler.execute_sql(SINGLE)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
                                                             ^


----------------------------------------------------------------------
Ran 1 test in 0.018s

FAILED (errors=1)
Destroying test database for alias 'default'...
Destroying test database for alias 'validation'...
gen-rl-macbookair[2022-05-05 12:34:44]:~/PROJECT-local/TRACEBASE/tracebase$ 

I posted on Django Users about this, if you would like more information:

https://forum.djangoproject.com/t/is-this-a-bug-in-djangos-sql-creation-through-multiple-many-to-many-tables/13508/9

We’re on Django 3.2 using Postgres.

Change History (13)

comment:1 by Simon Charette, 3 years ago

The QuerySet.distinct and order_by methods don't resolve references to foreign key the same way when referenced models define a Meta.ordering.

You can get a similar crash with way less code by doing

list(TestSynonym.objects.distinct('compound').order_by('compound'))

As the above actually translates to

list(TestSynonym.objects.distinct('compound').order_by('compound__name'))

Due to TestCompound.Meta.ordering = ('name',)


I would argue that this is invalid but I'm curious to hear what others have to say as the resulting error is definitely cryptic there's possibly ways we could do things better.

Given DISTINCT ON usage requires a matching ORDER BY I see three options

  1. Do nothing, users should learn the gotchas of Meta.ordering before using it. We've got a precedent against that by making aggregations ignore Meta.ordering in the recent versions.
  2. Make distinct('related_with_ordering') behave like order_by with regards to ordering expansion to make both APIs coherent given they a dependent (resulting clause could match). This will require a deprecation period and spreads the arguable bad related ordering expansion pattern to another method.
  3. Refuse the temptation to guess and make distinct('related_with_ordering') error out loudly so at least users are not presented this cryptic error. This maintains the arguably confusing mismatch between both APIs but we stop the spread of ordering expansion.

comment:2 by Robert Leach, 3 years ago

I figured I might have been missing something. I actually have code to avoid the gotcha, but apparently, I only did it for the model from which the query came from (i.e. the "root model"):

        # If there are any split_rows manytomany related tables, we will need to prepend the ordering (and pk) fields of
        # the root model
        if len(distinct_fields) > 0:
            distinct_fields.insert(0, "pk")
            tmp_distincts = self.getOrderByFields(model_name=self.rootmodel.__name__)
            tmp_distincts.reverse()
            for fld_nm in tmp_distincts:
                distinct_fields.insert(0, fld_nm)

            if order_by is not None and order_by not in distinct_fields:
                distinct_fields.insert(0, order_by)
    def getOrderByFields(self, mdl_inst_nm=None, model_name=None):
        """
        Retrieves a model's default order by fields, given a model instance name.
        """
        ... brevity edit ...

        # Get a model object
        mdl = apps.get_model("DataRepo", mdl_nm)

        if "ordering" in mdl._meta.__dict__:
            return mdl._meta.__dict__["ordering"]
        return []

I know this ticket system is not the place for getting support, but if you'll indulge me... would prepending all the meta ordering fields avoid the gotcha if I inserted the meta ordering field(s) before any other fields? (In my use case, the order is unimportant - only the distinct is). I'd found that it did in the case of the "root model".

comment:3 by Robert Leach, 3 years ago

I believe that the note at the bottom of the distinct section of this doc:

https://docs.djangoproject.com/en/4.0/ref/models/querysets/#distinct

answers my above question, but to dissect the verbiage it tells you *what* to do without the *why* fully described. It tells you that adding the _id field would be necessary to make the expressions match (which is "a" reason why to add the id field), but it doesn't explicitly explain why that makes them match, or say whether the id field is precisely required or if any field will do.

If I'd better understood the *why* in that doc, I might have coded the right solution to the gotcha and not overlooked the other cases.

My updated understanding is that it seems that the reason *a* related model field is necessary is because the related model "field" in the model definition that links to the related model isn't a "field". It's a reference that gets turned into a field that by default uses the meta.ordering. (I didn't even notice that the distinct clause had compound_id and the order by clause had name in that position.) So I'm guessing that *any*(?) related model field in front of a (non-field) related model reference (whether it's at the beginning of the distinct list or "just before" the non-field related model reference) would solve the issue? Or perhaps even explicit inclusion of such a (non) field would cause the problem.

I think these are areas in which the doc could be improved just a bit more. Understanding the /why/ better, I think, could be helpful to avoid these pitfals, and also help to understand an otherwise cryptic error message.

Version 1, edited 3 years ago by Robert Leach (previous) (next) (diff)

comment:4 by Mariusz Felisiak, 3 years ago

Robert, Can you propose a documentation improvement via GitHub's PR?

in reply to:  4 ; comment:5 by Robert Leach, 3 years ago

Replying to Mariusz Felisiak:

Robert, Can you propose a documentation improvement via GitHub's PR?

I can certainly give it a shot, though I'm not the best writer when it comes to brevity.

Also, I don't have a deep understanding of the related Django code, so my understanding could be empirically correct, but technically flawed (like Bohr's model of the atom). For example, when the same field reference is supplied to both .order_by() and .distinct(), such as in Simon's example:

TestSynonym.objects.distinct('compound').order_by('compound')

...why is the inserted field in each case not coordinated? Why does the conversion from the reference (compound) differ? Simon says it resolves to:

list(TestSynonym.objects.distinct('compound').order_by('compound__name'))

but based on my debug output of another test using that above call, that's imprecise. It shows:

QUERY: SELECT DISTINCT ON ("DataRepo_testsynonym"."compound_id") "DataRepo_testsynonym"."name", "DataRepo_testsynonym"."compound_id" FROM "DataRepo_testsynonym" INNER JOIN "DataRepo_testcompound" ON ("DataRepo_testsynonym"."compound_id" = "DataRepo_testcompound"."id") ORDER BY "DataRepo_testcompound"."name" ASC

which means that the distinct field resolution and order by field resolutions are:

  • distinct: compound_id
  • order_by: name

So it's essentially resolved as:

list(TestSynonym.objects.distinct('compound__id').order_by('compound__name'))

When those methods are assessed individually, I understand why those fields are the preferred solution (e.g. the meta ordering may not be unique), but given that distinct requires the same fields be present at the beginning of the order-by, I don't know what prevents the code to be written to have those fields be resolved in a way that is copacetic. Like, why not convert the reference into 2 additional fields that together, meet both requirements (name AND compound_id)? Order-by would be satisfied and distinct would be satisfied. Or... in my case, name is unique, so distinct could resolve to the meta ordering without issue...

Is there a technical reason the code doesn't already do this?

Last edited 3 years ago by Robert Leach (previous) (diff)

in reply to:  5 ; comment:6 by Mariusz Felisiak, 3 years ago

Replying to Robert Leach:

When those methods are assessed individually, I understand why those fields are the preferred solution (e.g. the meta ordering may not be unique), but given that distinct requires the same fields be present at the beginning of the order-by, I don't know what prevents the code to be written to have those fields be resolved in a way that is copacetic. Like, why not convert the reference into 2 additional fields that together, meet both requirements (name AND compound_id)? Order-by would be satisfied and distinct would be satisfied. Or... in my case, name is unique, so distinct could resolve to the meta ordering without issue...

Is there a technical reason the code doesn't already do this?

This would be another logic that's implicit and probably unexpected by users (at least in some cases). As far as I'm aware it's preferable to fail loudly even with a ProgrammingError. I'm not sure how to improve this note, maybe it's enough to add a correct example:

  • docs/ref/models/querysets.txt

    diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
    index a9da1dcf7e..891b8255b0 100644
    a b Examples (those after the first will only work on PostgreSQL)::  
    565565    ...wouldn't work because the query would be ordered by ``blog__name`` thus
    566566    mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order
    567567    by the relation ``_id`` field (``blog_id`` in this case) or the referenced
    568     one (``blog__pk``) to make sure both expressions match.
     568    one (``blog__pk``) to make sure both expressions match::
     569
     570        Entry.objects.order_by('blog_id').distinct('blog_id')
    569571
    570572``values()``
    571573~~~~~~~~~~~~

in reply to:  6 comment:7 by Robert Leach, 3 years ago

Replying to Mariusz Felisiak:

This would be another logic that's implicit and probably unexpected by users (at least in some cases). As far as I'm aware it's preferable to fail loudly even with a ProgrammingError. I'm not sure how to improve this note, maybe it's enough to add a correct example:

  • docs/ref/models/querysets.txt

    diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
    index a9da1dcf7e..891b8255b0 100644
    a b Examples (those after the first will only work on PostgreSQL)::  
    565565    ...wouldn't work because the query would be ordered by ``blog__name`` thus
    566566    mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order
    567567    by the relation ``_id`` field (``blog_id`` in this case) or the referenced
    568     one (``blog__pk``) to make sure both expressions match.
     568    one (``blog__pk``) to make sure both expressions match::
     569
     570        Entry.objects.order_by('blog_id').distinct('blog_id')
    569571
    570572``values()``
    571573~~~~~~~~~~~~

I'm not sure that would have been enough for me to have anticipated or correctly interpreted the error I encountered. I just corrected our code base and I now feel I have a clearer picture of how I got here... Let me explain why. I'd thought my expressions did match:

        distinct_fields = ['name', 'pk', 'compounds__testsynonyms__compound', 'compounds__testsynonyms__name', 'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk']
        qs = TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)

I was like "but I'm supplying the exact same fields for both order_by and distinct".

What I think I'd missed was that "referenced field" in the phrase explicitly order by the relation _id or referenced field doesn't just apply to related model fields in the queried("/root") model (e.g. Entry's fields such as blog in Entry.objects.order_by('blog').distinct('blog'), but applies to all related models through all my joins - and (indirectly/potentially) their meta.ordering values... because I wasn't explicitly adding the compounds__testsynonyms__compound "field" above in my codebase where I ran into this error - I was implicitly adding them -> I had written code to straightforwardly grab and prepend the meta.ordering fields in order to satisfy distinct's requirements and still have things ordered nicely - and that's where I was running into this problem, because it didn't occur to me that the fields in meta.ordering could be changed differently by order_by and distinct. meta.ordering lets you add those "blog" fields and I wasn't the developer that had added them to our code base, but I didn't even think to check whether those fields I was adding would be subject to the distinct/order_by gotcha.

I mean - all the information is there in the doc for me to have avoided this. You just have to piece together doc info on meta.ordering, order_by, and distinct and extend the examples to realize it applies to joined models' fields - not just the root model. It was just a bit too complex for me at my level of django experience to anticipate correctly the implications of what I was doing.

Maybe the best thing to do would be to anticipate the motivations that lead me down this path... I wanted to use distinct on a join query, which meant that I needed to add fields to order_by that I didn't really care about - I just needed to add them to meet the requirement that the fields matched. BUT - I didn't want to change the desired ordering - so I needed to explicitly add the default orderings so that the IDs wouldn't override them. So maybe the doc should just point out that if you're adding fields to order_by solely to be able to use distinct, but you don't want to change the default ordering specified in meta.ordering - you need to consider the fact that the meta.ordering fields explicitly added can make you run into the order_by/distinct gotcha.

comment:8 by Mariusz Felisiak, 3 years ago

Component: Database layer (models, ORM)Documentation
Keywords: sql, distinct, → sql, distinct
Summary: SQL generation bug in `.distinct()` when supplied fields go through multiple many-related tablesClarify using distinct() with related fields that have Meta.ordering defined.
Triage Stage: UnreviewedAccepted
Type: BugCleanup/optimization

I tried to clarify this, maybe:

  • docs/ref/models/querysets.txt

    diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
    index a9da1dcf7e..7c2c299b58 100644
    a b Examples (those after the first will only work on PostgreSQL)::  
    554554
    555555.. note::
    556556    Keep in mind that :meth:`order_by` uses any default related model ordering
    557     that has been defined. You might have to explicitly order by the relation
    558     ``_id`` or referenced field to make sure the ``DISTINCT ON`` expressions
    559     match those at the beginning of the ``ORDER BY`` clause. For example, if
    560     the ``Blog`` model defined an :attr:`~django.db.models.Options.ordering` by
    561     ``name``::
     557    that has been defined in
     558    :attr:`Meta.ordering <django.db.models.Options.ordering>`. As a
     559    consequence, ordering by a related field is resolved to the list of fields
     560    defined in a related ``Meta.ordering`` and may not be the same as a
     561    relation ID field used by ``distinct()``. To avoid using ``Meta.ordering``
     562    and make sure the ``DISTINCT ON`` expressions match those at the beginning
     563    of the ``ORDER BY`` clause, you can explicitly order by the relation
     564    ``_id`` or referenced field. For example, if the ``Blog`` model defined an
     565    :attr:`~django.db.models.Options.ordering` by ``name``::
    562566
    563567        Entry.objects.order_by('blog').distinct('blog')
    564568
    565569    ...wouldn't work because the query would be ordered by ``blog__name`` thus
    566570    mismatching the ``DISTINCT ON`` expression. You'd have to explicitly order
    567571    by the relation ``_id`` field (``blog_id`` in this case) or the referenced
    568     one (``blog__pk``) to make sure both expressions match.
     572    one (``blog__pk``) to make sure both expressions match::
     573
     574        Entry.objects.order_by('blog_id').distinct('blog_id')
    569575
    570576``values()``
    571577~~~~~~~~~~~~

What do you think?

comment:9 by Robert Leach, 3 years ago

That's definitely better, as it directly references Meta.ordering, which is a great clue to what the implications are. What do you think about adding something like this to what you wrote above:

If all you want is to make a set of query results distinct without changing the ordering, note you must explicitly "re-"add the otherwise over-ridden fields defined in Meta.ordering. But be careful, if you simply add those fields, you can run afoul of the matching fields requirement between order_by and distinct. The field(s) defined in Meta.ordering can include a foreign key, which will resolve differently in order_by (to the related model's Meta.ordering field(s)) and distinct (the _id) and the fields will no longer match between the 2 expressions.

But this gotcha can be entirely avoided by only ever using real database fields in every Meta.ordering, order_by, and distinct list instead of adding Django's related model objects (e.g. ForeignKey, ManyToManyField, OneToManyField, etc.) so as not to rely on Django's differing related-model-object to database-field resolving mechanisms (e.g. instead of adding blog to Entry.Meta.ordering, add blog__name).

...but like I said, all the info is there to either put it together and anticipate it (albeit requiring a fairly deep understanding and more experience to achieve those insights), or to interpret the exception. I just feel like piecing together this (what I would consider common) use-case, may make it clearer to those who are endeavoring into their first big Django project and applying complex understandings of and experiences with other programming languages.

Whatever edit gets included, I would like to say that I very much appreciate your attention to this topic.

Last edited 3 years ago by Robert Leach (previous) (diff)

comment:10 by Robert Leach, 3 years ago

OK, I just learned that you cannot add something like blog__name to Entry's Meta.ordering. Still learning... So...

If all you want is to make a set of query results distinct without changing the ordering, note you must explicitly "re-"add the otherwise over-ridden fields defined in Meta.ordering. But be careful, if you simply add those fields, you can run afoul of the matching fields requirement between order_by and distinct. The field(s) defined in Meta.ordering can include a foreign key (via ForeignKey, ManyToManyField, OneToManyField, etc. fields), which will resolve differently in order_by (to the related model's Meta.ordering field(s)) and distinct (the _id) and the fields will no longer match between the 2 expressions.

Fields of related models cannot be added to a Model's Meta.ordering (e.g. you cannot add blog__name to Entry.Meta.ordering), so to avoid the order_by versus distinct matching field gotcha and retain the default ordering in this instance, there are no shortcuts to only apply distinct to an existing query without explicitly re-applying the default ordering and resolving related model objects to database fields.

in reply to:  10 comment:11 by Mariusz Felisiak, 3 years ago

Cc: Carlton Gibson added

It seems we need a second option about the wording.

comment:12 by Carlton Gibson, 3 years ago

I think Mariusz' suggestion is more appropriate. I don't think the more discursive "gotcha"-type content fits in the reference documentation, rather than a more tutorial like piece, where it would be better. (Perhaps a blog post if there's no spot we yet discuss this in that kind of context.)

I think the solution here comes back to Simon's comment:1

...error out loudly so at least users are not presented this cryptic error.

If we could catch this and present an error showing the field mismatch folks would be able to correct at source. (That's totally Blue Skies — no idea how easy building a better error message is here.)

(I think like Mariusz, I'd lean heavily towards the raise option, rather than trying to implicitly do/guess the right thing™ here.)

HTH

comment:13 by anegawa-j, 2 years ago

Owner: changed from nobody to anegawa-j
Status: newassigned
Note: See TracTickets for help on using tickets.
Back to Top