Opened 9 years ago

Last modified 2 years ago

#25789 new Cleanup/optimization

Inefficient Queries Generated due to not using WHERE EXISTS

Reported by: Alex Rothberg Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords: QuerySet.extra
Cc: Anssi Kääriäinen, Marc Tamlyn, James Howe, David Wobrock, John Speno Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Reposting question from SO with some more details.

I believe that the Django ORM is generating seriously inefficient SQL due to it not using WHERE EXISTS but instead using a DISTINCT with a LEFT JOIN. by comparison, SQLAlchemy will use WHERE EXISTS.

I have two models, Exam and Series. Series objects have a foreign key to an Exam object. Both of the models contain a field description_user. I am trying to search for all Exams that have a search term in description_user or have a child Series with that term in its description_user. I want to do this for a number of search terms (requiring all of them). I also want to de-duplicate the results (ie not get the same Exam multiple times).

This is roughly what the filter looks like:

a = (Q(**{'series__description_user__icontains': 'bar'}) | Q(**{'description_user__icontains': 'bar'})) 
b = (Q(**{'series__description_user__icontains': 'foo'}) | Q(**{'description_user__icontains': 'foo'}))
c = (Q(**{'series__description_user__icontains': 'baz'}) | Q(**{'description_user__icontains': 'baz'}))
Exam.objects.filter(a & b & c).distinct()

with corresponding SQL:

SELECT DISTINCT 
                "exam_storage_exam"."id", 
                "exam_storage_exam"."description_user"
FROM            "exam_storage_exam" 
LEFT OUTER JOIN "exam_storage_series" 
ON              ( 
                                "exam_storage_exam"."id" = "exam_storage_series"."exam_id" 
                AND             ( 
                                                "exam_storage_series"."removed" IS NULL) ) 
WHERE           ( 
                                "exam_storage_exam"."removed" IS NULL 
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
                AND             ( 
                                                "exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
                                OR              "exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))

The issue is that as the number of search terms grows, the size of the intermediate data set before the DISTINCT operation grows as well.

Ideally the SQL would look like:

SELECT *                                                
 FROM exam                                             
   WHERE (EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%foo%'
          )) or exam.description_user LIKE '%foo%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%bar%'
          )) or exam.description_user LIKE '%bar%') AND 
(EXISTS (SELECT 1             
        FROM exam_storage_series              
          WHERE exam.id = series.exam_id AND (
            series.description_user LIKE '%baz%'
          )) or exam.description_user LIKE '%baz%')

Currently the performance of Django query is terrible. This style searching comes up for example in how DRF generates search queries.

Change History (13)

comment:1 by Tim Graham, 9 years ago

Cc: Anssi Kääriäinen added
Type: UncategorizedCleanup/optimization

Anssi, is this feasible? The closest existing ticket I could find is #16603 but not sure if it should be considered a duplicate.

comment:2 by Anssi Kääriäinen, 9 years ago

Yes, this is possible, and something Django's ORM should do.

Unfortunately this is hard to implement correctly. The big problem is aggregation, when doing .filter().annotate(), where both operations target the same relation, the aggregation must use results of the join generated by the filter. But if the filter doesn't generate a join, then we have a problem.

It would be Djangoic if the exists query would be generated automatically. For the above mentioned reason this is hard. Maybe it would be easier if we had something like .filter(models.Exists(somerel__col=val)), where the Exists() class would inform Django that an exists subquery should be generated.

Last edited 9 years ago by Tim Graham (previous) (diff)

comment:3 by Tim Graham, 9 years ago

Triage Stage: UnreviewedAccepted

comment:4 by Marc Tamlyn, 9 years ago

Cc: Marc Tamlyn added

comment:5 by Tim Graham, 9 years ago

Keywords: QuerySet.extra added

comment:6 by Matthew Schinckel, 9 years ago

It's not totally relevant (at least not yet), but I have a version of this that works with .annotate():

https://github.com/django/django/pull/6478

There's also a django-developers thread that hopefully may get some discussion.

Version 0, edited 9 years ago by Matthew Schinckel (next)

comment:7 by James Howe, 6 years ago

Similarly, I have a use-case where this query structure has the only acceptable performance (in PostgreSQL):

... FROM model_1
WHERE NOT EXISTS (
  SELECT TRUE FROM model_2
  WHERE model_1.pkey = model_2.fkey
    AND (model_2.property IS NULL OR model_2.property >= 42)
)

Currently using extra() to do it.

comment:8 by James Howe, 6 years ago

Cc: James Howe added

comment:9 by Simon Charette, 6 years ago

James, since this ticket was filled support for Exists expression was added so you should be able to avoid using extra

queryset.annotate(
    foo_exists=Exists(subqueryset.filter(foo_id=OuterRef('pk'))
).filter(
    foo_exists=False,
)

Once #25367 lands it'll be possible to pass ~Exists directly to filter.

comment:10 by Simon Charette, 2 years ago

Another nice to have that could be added here would be allow the usage of transforms on related fields so we could have lookups of the form __exists or __count

Author.objects.filter(books__exists=Q(title__icontains="ring"))
Author.objects.filter(books__count__gt=4)

instead of

Author.objects.filter(Exist(Book.objects.filter(author=OuterRef("pk"), title__icontains="ring")))
Author.objects.filter(
     GreaterThan(
         Book.objects.filter(
             author_id=OutRef("pk"),
         ).values(
             "author_id"
         ).values(Count("id")), 4
     )
)

I think it would go a long way in making it easier for users to work around this issue and #10060.

comment:11 by David Wobrock, 2 years ago

Cc: David Wobrock added

comment:12 by Simon Charette, 2 years ago

FWIW I had a very limited try at this in this branch. Mostly some test and debugging shims but I figured I'd provide them if someone is interesting in looking this issue.

I'd add that this work has a bit of overlap with #28296 as it basically needs to turn a series of transforms and lookups against a related field into a subquery without causing JOINs on the outer query which the default behaviour of using the __ syntax.

comment:13 by John Speno, 2 years ago

Cc: John Speno added
Note: See TracTickets for help on using tickets.
Back to Top