Opened 6 years ago

Closed 9 months ago

#29725 closed Cleanup/optimization (fixed)

Inefficient SQL generated when counting a ManyToMany

Reported by: Gavin Wahl Owned by: ontowhee
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When calling count() on an unfiltered many to many relation, a useless join is included in the SQL that makes it much slower than it should be. On my dataset, the difference is 1000ms to 100ms, because an index-only scan can be used.

This is the SQL that is currently generated:

SELECT COUNT(*) AS "__count"
FROM "app_foo"
INNER JOIN "app_foo_bar" ON ("app_foo"."id" = "app_foo_bar"."foo_id")
WHERE "app_foo_bar"."foo_id" = ?;

This is the SQL that should be generated:

SELECT COUNT(*) AS "__count"
FROM "app_foo_bar"
WHERE "app_foo_bar"."foo_id" = ?;

This optimization can only be applied when there are no filters applied, because then the join is used to satisfy the filters. In the no-filters case, only the through table needs to be consulted.

Change History (29)

comment:1 by oliver, 6 years ago

Owner: changed from nobody to oliver
Status: newassigned

comment:2 by Simon Charette, 6 years ago

Triage Stage: UnreviewedAccepted
Type: UncategorizedCleanup/optimization
Version: 2.1master

Hello Olivier,

I think you should be able to achieve this by defining a count() method on the dynamic class created by create_forward_many_to_many_manager by filtering self.through._default_manager based on self.instance and return its count().

comment:3 by Simon Charette, 6 years ago

Cc: Simon Charette added

in reply to:  2 comment:4 by oliver, 6 years ago

Thanks for your advice.
I will try to correct it according to your advice

Replying to Simon Charette:

Hello Olivier,

I think you should be able to achieve this by defining a count() method on the dynamic class created by create_forward_many_to_many_manager by filtering self.through._default_manager based on self.instance and return its count().

Last edited 6 years ago by oliver (previous) (diff)

comment:6 by oliver, 6 years ago

Has patch: set
Needs tests: set

comment:7 by Tom Forbes, 6 years ago

Have we considered making this change for exists() as well? I'm sure this will generate the join in the same way that count() does.

Also somewhat related: #28477

Last edited 6 years ago by Tom Forbes (previous) (diff)

comment:8 by Simon Charette, 6 years ago

Doing it for exists() as well makes sense.

comment:9 by Simon Charette, 6 years ago

Needs tests: unset
Patch needs improvement: set

comment:10 by oliver, 6 years ago

Patch needs improvement: unset

comment:11 by Tim Graham <timograham@…>, 6 years ago

Resolution: fixed
Status: assignedclosed

In 1299421:

Fixed #29725 -- Removed unnecessary join in QuerySet.count() and exists() on a many-to-many relation.

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 6 years ago

In 5f7991c4:

Fixed #30325 -- Reverted "Fixed #29725 -- Removed unnecessary join in QuerySet.count() and exists() on a many-to-many relation."

This reverts commit 1299421cadc4fcf63585f2f88337078e43e660e0 due to
a regression with custom managers.

comment:13 by Mariusz Felisiak <felisiak.mariusz@…>, 6 years ago

In e8de1cc9:

[2.2.x] Fixed #30325 -- Reverted "Fixed #29725 -- Removed unnecessary join in QuerySet.count() and exists() on a many-to-many relation."

This reverts commit 1299421cadc4fcf63585f2f88337078e43e660e0 due to
a regression with custom managers.

Backport of 5f7991c42cff73b6278106d499d719b726f85ead from master

comment:14 by Mariusz Felisiak, 6 years ago

Resolution: fixed
Status: closednew

Solution has been reverted because it caused the inconsistent behavior of count() and exists() on a reverse many-to-many relationship with a custom manager (see new tests in 9ac8520fcde29840a1345be19d80dbda53aa6d03).

comment:15 by Mariusz Felisiak, 6 years ago

Has patch: unset

comment:16 by Simon Charette, 6 years ago

Has patch: set
Patch needs improvement: set

Switched back to patch needs improvement instead as it can probably be adapted to skip the optimization when a custom manager is defined.

comment:17 by Mariusz Felisiak, 6 years ago

Ticket #30325 revealed one more issue, i.e. optimization doesn't work when chaining all() after reverse M2M relations.

comment:18 by Mike Hansen, 6 years ago

Additionally, the optimization causes a query to be done if the related objects have already been prefetched.

comment:19 by Shiwei Chen, 18 months ago

Owner: changed from oliver to Shiwei Chen
Status: newassigned

Thanks everyone for your insights. Im going to try to implement this optimization, accounting for all the issues listed here so far.

PR

Last edited 18 months ago by Mariusz Felisiak (previous) (diff)

comment:20 by Simon Charette, 18 months ago

Needs tests: set

Patch needs a few tweaks to properly disable the optimization when dealing with custom manager. It also lacks coverage for the prefetched case as reported in comment:18.

Last edited 18 months ago by Simon Charette (previous) (diff)

comment:21 by Shiwei Chen, 17 months ago

Needs tests: unset
Patch needs improvement: unset

Updated the patch to properly deal with cases involving custom managers and prefetch queries, and implemented test cases for them.

https://github.com/django/django/pull/16863

comment:22 by Mariusz Felisiak, 17 months ago

Needs tests: set
Patch needs improvement: set

comment:23 by ontowhee, 10 months ago

Owner: changed from Shiwei Chen to ontowhee

comment:24 by ontowhee, 10 months ago

I'm trying to pick up on the work from PR #16863, and I am beginning to learn about M2M relationships, RelatedManagers, and QuerySets. From reading the discussions on this ticket and looking through the comments on the PR, this needs to handle chaining .all() for the reverse M2M relation. Is that the self.article.publications?

Questions:

  • What other chaining needs to be considered?
  • How many levels and combinations of chaining need to be considered? For example, I imagine it is desirable to ensure the optimization also occurs for .all().all().all().count() too. Another example would be .all().filter().count(), where the filter is technically an empty filter, but it would return a QuerySet.
  • Where would be the best level to implement the optimization? The PR handles it in ManyRelatedManager. Another possibility could be to implement this on compiler.get_from_clause(). I found this idea from reading the comment on another ticket https://code.djangoproject.com/ticket/20535#comment:3.
  • If it is handled in the RelatedManager level, is there a way for the RelatedManager to look ahead to determine that the next function in the chain will be a .count() or .exists()?

I'm not sure if I'm asking the right questions. If anyone has any pointers for what areas I should look into, please let me know.

Last edited 10 months ago by ontowhee (previous) (diff)

comment:25 by Simon Charette, 10 months ago

Thank you for working on this ticket ontowhee.

From my understanding the closed PR is 95% of the way there as it builds on top of the initial work that had to be reverted and adds extra coverage. I'd start by creating a new PR out of it and deal with the unaddressed left by Mariusz to see how the test suite reacts.

How many levels and combinations of chaining need to be considered? For example, I imagine it is desirable to ensure the optimization also occurs for .all().all().all().count() too. Another example would be .all().filter().count(), where the filter is technically an empty filter, but it would return a QuerySet.

I'd focus on the many-to-many manager case discussed here only, that's already hard enough to get right.

Where would be the best level to implement the optimization? The PR handles it in ManyRelatedManager. Another possibility could be to implement this on compiler.get_from_clause(). I found this idea from reading the comment on another ticket https://code.djangoproject.com/ticket/20535#comment:3.

Exactly where it's defined in the PR you referenced. Skipping of unnecessary intermediate joins as described in #20535 is a much larger problem again that might help here in the long run but I think there are benefits in solving this well scoped issue first.

If it is handled in the RelatedManager level, is there a way for the RelatedManager to look ahead to determine that the next function in the chain will be a .count() or .exists()?

No need to as at that point it will be a QuerySet. Lets keep this ticket focused on the common many-to-many case where .count and .exists are directly called.

comment:26 by ontowhee, 10 months ago

I'm not sure if this is a viable solution. It passing the ManyRelatedManager to the queryset as an instance variable _many_related_queryset. (Note: It is failing tests in foreign_object/, so I will need to look into it.)

Draft PR

No need to as at that point it will be a QuerySet

I briefly looked into QuerySet._query to see if a new optimized sql query could be generated when the aliasMap lists certain table types. I ended up passing the ManyRelatedManager to get something working. I'm open to exploring a different approach.

Last edited 10 months ago by ontowhee (previous) (diff)

comment:27 by ontowhee, 9 months ago

I have removed the changes involving chaining all(). PR

comment:28 by Mariusz Felisiak, 9 months ago

Needs tests: unset
Patch needs improvement: unset
Triage Stage: AcceptedReady for checkin

comment:29 by Mariusz Felisiak <felisiak.mariusz@…>, 9 months ago

Resolution: fixed
Status: assignedclosed

In 66e47ac6:

Fixed #29725 -- Removed unnecessary join in QuerySet.count() and exists() on a many to many relation.

Co-Authored-By: Shiwei Chen <april.chen.0615@…>

Note: See TracTickets for help on using tickets.
Back to Top