Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#27260 closed Cleanup/optimization (invalid)

Performance Issue because of LEFT OUTER JOIN instead the better INNER JOIN

Reported by: Sven R. Kunze Owned by: nobody
Component: Database layer (models, ORM) Version: 1.8
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Original Query:

BigTable.objects.filter(
    Q(tablea__item_id__in=my_items) | Q(tableb__item_id__in=my_items)
)

resulting in:

SELECT *
FROM "big_table"
  LEFT OUTER JOIN "tablea" ON ("big_table"."id" = "tablea"."big_table_id")
  LEFT OUTER JOIN "tableb" ON ("big_table"."id" = "tableb"."big_table_id")
WHERE ("tablea"."item_id" IN (<handful of items>) OR
       "tableb"."item_id" IN (<handful of items>))

The LEFT OUTER JOIN combined with a growing number of rows in big_table results in very slow execution of the query.
We temporarily rewrote the query by using two separate queries and merging the sets via Python:

BigTable.objects.filter(
    Q(tablea__item_id__in=my_items)
)
SELECT *
FROM "big_table"
  INNER JOIN "tablea" ON ("big_table"."id" = "tablea"."big_table_id")
WHERE "tablea"."item_id" IN (<handful of items>)

and

BigTable.objects.filter(
    Q(tableb__item_id__in=my_items)
)
SELECT *
FROM "big_table"
  INNER JOIN "tableb" ON ("big_table"."id" = "tableb"."big_table_id")
WHERE "tableb"."item_id" IN (<handful of items>)

Ideally, we would like to use the original query because it is clearer and less code to maintain. If you want to see the models, I can post them as well. Ideally again, we don't actually expect models to influence the type of the joins.

We want BigTable to be filtered, so, if there's no appropriate match on the TableA or TableB, the BigTable item simply does not appear in the result set. However, LEFT OUTER JOIN would result in such a row which then needs to be removed afterwards laboriously (at least PostgreSQL struggles to do so). If the original query had used INNER JOIN, the query would have been fast.

Change History (10)

comment:1 by Tim Graham, 8 years ago

Please also provide your models.

I'm not certain whether or not your proposed changes are correct. If you can provide a patch so we can check whether or not it breaks existing behavior, it'll be more easily to evaluate the idea.

Looking for possibly related tickets, I found #8439.

in reply to:  1 comment:2 by Sven R. Kunze, 8 years ago

Replying to Tim Graham:

Please also provide your models.

class BigTable:
    # a lot of unrelated fields
class TableA:
    big_table = models.OneToOneField(BigTable, related_name='...', primary_key=True)
    item_id = models.CharField(max_length=1024)
class TableB:
    big_table = models.ForeignKey(BigTable, related_name='...')
    item_id = models.CharField(max_length=1024)

I'm not certain whether or not your proposed changes are correct. If you can provide a patch so we can check whether or not it breaks existing behavior, it'll be more easily to evaluate the idea.

Neither are we. Maybe, a UNION could also work. The real issue we have is the slow execution of the original query.

comment:3 by Tim Graham, 8 years ago

FYI, if you're not sure it's a bug, you should use our support channels before opening a ticket.

comment:4 by Tim Graham, 8 years ago

Resolution: invalid
Status: newclosed

I think Ansii's write up on join promotion explains why LEFT OUTER JOIN is required here.

comment:5 by Sven R. Kunze, 8 years ago

Thanks, Tim. You are right, that OUTER LEFT JOIN is required here.

However, I might state again that our real problem is the performance issue. I simultaneously posted a question on the PostgreSQL mailing list:

https://www.postgresql.org/message-id/flat/97dbd6f9-a289-94d1-325f-997df87c672a%40mail.de#97dbd6f9-a289-94d1-325f-997df87c672a@mail.de

Version 0, edited 8 years ago by Sven R. Kunze (next)

comment:6 by Tim Graham, 8 years ago

I haven't worked with the ORM enough to know if changing the query operator is feasible.

comment:7 by Sven R. Kunze, 8 years ago

@Tim

On the one hand, I somehow doubt that it's Django responsibility to do it (cf. https://www.postgresql.org/message-id/8ccdd027-f4a0-2e0d-1e26-2a403fed8d17%40mail.de ).

On the other hand, there's already an asymmetry between a single "filter" and an "OR filter". The former uses "INNER JOIN", the latter uses "OUTER JOIN".

comment:8 by Thomas Güttler, 8 years ago

Just for the records. We have performance issues on postgres because some SQL statements have up to six OUTER joins.

Union and intersection would be easier to read for me (human being) and maybe better to optimize for the postgres query planer.

comment:9 by Tim Graham, 8 years ago

Support for QuerySet.union(), intersection(), difference() was added in #27718.

comment:10 by Thomas Güttler, 8 years ago

Just for the records, we use this pattern until we update to Django 1.11:

Use case: Select users which have permission "some_app.permission_name" and which are in group "special_group"

user_ids_with_group = User.objects.filter(groups__in=[special_group]).distinct().values_list('pk', flat=True)
user_ids_with_permission = User.objects.filter(authutils.users_with_perm_q('some_app.permission_name')).distinct().values_list('pk', flat=True)
qs = User.objects.filter(pk__in=set(user_ids_with_group).intersection(user_ids_with_permission)).distinct().order_by('last_name', 'first_name')


# authutils
def users_with_perm_q(perm, is_active=True, is_superuser=True):
    try:
        app_label, codename = perm.split('.')
    except ValueError:
        raise ValueError('Permission name should be in the form "app_label.perm_name".')

    permission = Permission.objects.get(codename=codename, content_type__app_label=app_label)
    user_q = Q(user_permissions=permission)
    group_q = Q(groups__permissions=permission)
    return Q(is_active=is_active) & (Q(is_superuser=is_superuser) | user_q | group_q)
Note: See TracTickets for help on using tickets.
Back to Top