Opened 28 hours ago
Last modified 27 hours ago
#36157 closed Bug
Unusual behaviour when pre-fetching with only applied on the related fields — at Version 1
Reported by: | Tim McCurrach | Owned by: | |
---|---|---|---|
Component: | Uncategorized | Version: | 5.1 |
Severity: | Normal | Keywords: | |
Cc: | Tim McCurrach | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
When prefetching related models. If you apply .only()
to the related queryset, django performs additional lookups for related IDs that have been left out of the only
.
It is probably easiest to explain the issue with an example.
Example Situation
Suppose you have these models:
class Blog(models.Model): name = models.CharField(max_length=100) class Post(models.Model): name = models.CharField(max_length=100) blog = models.ForeignKey(Blog, on_delete=models.CASCADE, related_name="posts") ...lots of other big fields
And you create a few items for each:
blog = Blog.objects.create(name="Django Tricks") blog2 = Blog.objects.create(name="React Tricks") Post.objects.create(name="prefetching", blog=blog) Post.objects.create(name="models", blog=blog) Post.objects.create(name="templates", blog=blog) Post.objects.create(name="hooks", blog=blog2) Post.objects.create(name="components", blog=blog2)
If I wish to pre-fetch the posts for some blogs, but only want the names of each post, rather than the content of each post I can do the following:
Blog.objects.prefetch_related(Prefetch("posts", queryset=Post.objects.only("name")))
I would expect this to result in just 2 database queries. One to fetch the data for the Blog
instances, and another to fetch the the data for the related Post
s. Instead, there is an n+1 issue where there are 5 extra follow up requests for each of the related Post
instances. This is the SQL that is generated:
SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21 SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE "app_post"."blog_id" IN (1, 2) SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 1 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 2 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 3 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 4 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 5 LIMIT 21
I can understand it might be a good idea to have the related-id's for the blog on hand should you need them later. But I also think, that by using .only()
you are explicitly telling django - I don't need these. This is a real problem for larger data-sets, where you end up with thousands of extra round-trips to the database.
Context
This is an issue that came up in the wild. I'm using a third-party optimiser that improves the performance of graphQL queries by decorating querysets with only()
, select_related()
etc. It correctly identifies that I am only using certain fields and applies only()
to them, knowing I will never need to access certain related fields. This results in django producing many additional hits to the database.