Opened 13 years ago

Closed 9 years ago

Last modified 4 years ago

#17276 closed Cleanup/optimization (fixed)

Slow anti-join query against Postgres

Reported by: dmitry@… Owned by: nobody
Component: Database layer (models, ORM) Version: 1.3
Severity: Normal Keywords: orm, postgres, join, anti-join
Cc: Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Ramiro Morales)

For background on this ticket please see the following two discussions:

Basically, in a many-to-many mappings between models Student and Course, if I want to find all instances of Students that aren't registered for classes, I would issue the following Django query:

Student.objects.filter(course__isnull=True)

Django translates this into the following query:

SELECT "student"."id", "student"."name" 
   FROM "student" 
      LEFT OUTER JOIN "course_students" 
         ON ("student"."id" = "course_students"."student_id") 
      LEFT OUTER JOIN "course" 
         ON ("course_students"."course_id" = "course"."id") 
   WHERE "course"."id" IS NULL

The problem is that the way the WHERE clause is generated is very inefficient (at least when used with Postgres). Changing WHERE to "course_students"."student_id" IS NOT NULL yields orders of magnitude improved query plan. Here's the difference I'm seeing on real data:

  • Django way: (cost=1072479.36..6256437.79 rows=1 width=145)
  • Hand-crafted: (cost=1518.71..1533.35 rows=1 width=145)

I'm attaching a sample project with the model already set up. To see the generated SQL query, simply run "python manage.py anti-join."

Attachments (2)

AntiJoin.rar (12.8 KB ) - added by dmitry@… 13 years ago.
sample project
17276_tests.diff (683 bytes ) - added by Anssi Kääriäinen 13 years ago.

Download all attachments as: .zip

Change History (8)

by dmitry@…, 13 years ago

Attachment: AntiJoin.rar added

sample project

comment:1 by Luke Plant, 13 years ago

Triage Stage: UnreviewedAccepted

Accepted on the basis of Russell's comments in the thread linked. When it comes to implementation and the need to do different things for different databases, we may have to WONTFIX this, but we can see.

comment:2 by Ramiro Morales, 13 years ago

Description: modified (diff)

comment:3 by Anssi Kääriäinen, 13 years ago

The query is asking for items with no m2m assigned items. Thus a join against only the m2m intermediate table should suffice.

The last join in that chain is not necessary at all. Foreign keys ensure that if there is a row in course_students table, then there will be a matching row in course table. This means that if the first join yields any rows, so must the second join, too. Naturally, if there is no rows in the first join, then there will be no rows in the second join, either. The last join is not needed, as the null/not null status of the first join tells us already if the filter condition is correct or not.

There is a slight complication because of deferred foreign keys used in Django - it is possible to insert a row into course_students table without there being a matching row in course table. I don't know if that is something Django should guard against. If the user ends up in situation where he has a row in the m2m table, but not in the target table, he has done something wrong - m2m targets should always be saved into the DB first before adding them to m2m collections.

This complication of deferred foreign keys also make this a hard optimization target for databases. Django has an advantage in that even if we are using deferred foreign keys, we know that in normal usage the defer should not have any effect.

Sidenote: why exactly are we using deferred foreign keys, not foreign keys with DEFERRABLE INITIALLY IMMEDIATE and set them to deferred only when needed in fixture loading? Or am I missing some other valid use case for deferred foreign keys in Django? If so, that would naturally also make the above join-removal optimization invalid.

Trivial test case included.

by Anssi Kääriäinen, 13 years ago

Attachment: 17276_tests.diff added

comment:4 by dberansky, 13 years ago

I realized over the weekend that the same issue exists for one-to-many relationships. For example, consider these models:

class Parent(models.Model):
    name = models.CharField(max_length=100)

class Child(models.Model):
    name = models.CharField(max_length=200)
    parents = models.ForeignKey(to=Parent, db_column="parent_id")

and the following Django query:

Parent.objects.filter(child__isnull=True)

for which Django will generate this SQL:

SELECT "demo_parent"."id", "demo_parent"."name" 
   FROM "demo_parent" 
      LEFT OUTER JOIN "demo_child" 
         ON ("demo_parent"."id" = "demo_child"."parent_id") 
   WHERE "demo_child"."id" IS NULL

So the same issue as with many-to-many where this query is orders of magnitude slower than a slightly modified one:

SELECT "demo_parent"."id", "demo_parent"."name" 
   FROM "demo_parent" 
      LEFT OUTER JOIN "demo_child" 
         ON ("demo_parent"."id" = "demo_child"."parent_id") 
   WHERE "demo_child"."parent_id" IS NULL

('WHERE "demo_child"."parent_id" IS NULL' vs. 'WHERE "demo_child"."id" IS NULL')

comment:5 by Tim Graham, 9 years ago

Resolution: fixed
Status: newclosed

in reply to:  4 comment:6 by Thomas Riccardi, 4 years ago

Replying to dberansky:

I realized over the weekend that the same issue exists for one-to-many relationships. For example, consider these models:

class Parent(models.Model):
    name = models.CharField(max_length=100)

class Child(models.Model):
    name = models.CharField(max_length=200)
    parents = models.ForeignKey(to=Parent, db_column="parent_id")

and the following Django query:

Parent.objects.filter(child__isnull=True)

for which Django will generate this SQL:

SELECT "demo_parent"."id", "demo_parent"."name" 
   FROM "demo_parent" 
      LEFT OUTER JOIN "demo_child" 
         ON ("demo_parent"."id" = "demo_child"."parent_id") 
   WHERE "demo_child"."id" IS NULL

So the same issue as with many-to-many where this query is orders of magnitude slower than a slightly modified one:

SELECT "demo_parent"."id", "demo_parent"."name" 
   FROM "demo_parent" 
      LEFT OUTER JOIN "demo_child" 
         ON ("demo_parent"."id" = "demo_child"."parent_id") 
   WHERE "demo_child"."parent_id" IS NULL

('WHERE "demo_child"."parent_id" IS NULL' vs. 'WHERE "demo_child"."id" IS NULL')

This One2Many inefficient query is still there at least with django 2.2.
If anyone ends up here like me: the workaround to get the efficient query is:

Parent.objects.filter(child__parent__isnull=True)
Note: See TracTickets for help on using tickets.
Back to Top