#36157 closed Bug (duplicate)
Unusual behaviour when pre-fetching with only applied on the related fields
Reported by: | Tim McCurrach | Owned by: | |
---|---|---|---|
Component: | Uncategorized | Version: | 5.1 |
Severity: | Normal | Keywords: | |
Cc: | Tim McCurrach | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
When prefetching related models. If you apply .only()
to the related queryset, django performs additional lookups for related IDs that have been left out of the only
.
It is probably easiest to explain the issue with an example.
Example Situation
Suppose you have these models:
class Blog(models.Model): name = models.CharField(max_length=100) class Post(models.Model): name = models.CharField(max_length=100) blog = models.ForeignKey(Blog, on_delete=models.CASCADE, related_name="posts") ...lots of other big fields
And you create a few items for each:
blog = Blog.objects.create(name="Django Tricks") blog2 = Blog.objects.create(name="React Tricks") Post.objects.create(name="prefetching", blog=blog) Post.objects.create(name="models", blog=blog) Post.objects.create(name="templates", blog=blog) Post.objects.create(name="hooks", blog=blog2) Post.objects.create(name="components", blog=blog2)
If I wish to pre-fetch the posts for some blogs, but only want the names of each post, rather than the content of each post I can do the following:
Blog.objects.prefetch_related(Prefetch("posts", queryset=Post.objects.only("name")))
I would expect this to result in just 2 database queries. One to fetch the data for the Blog
instances, and another to fetch the the data for the related Post
s. Instead, there is an n+1 issue where there are 5 extra follow up requests for each of the related Post
instances. This is the SQL that is generated:
SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21 SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE "app_post"."blog_id" IN (1, 2) SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 1 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 2 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 3 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 4 LIMIT 21 SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 5 LIMIT 21
I can understand it might be a good idea to have the related-id's for the blog on hand should you need them later. But I also think, that by using .only()
you are explicitly telling django - I don't need these. This is a real problem for larger data-sets, where you end up with thousands of extra round-trips to the database.
Context
This is an issue that came up in the wild. I'm using a third-party optimiser that improves the performance of graphQL queries by decorating querysets with only()
, select_related()
etc. It correctly identifies that I am only using certain fields and applies only()
to them, knowing I will never need to access certain related fields. This unfortunately results in django producing many additional hits to the database. I don't think this is expected behaviour from django.
Change History (4)
comment:1 by , 25 hours ago
Description: | modified (diff) |
---|
comment:2 by , 25 hours ago
Description: | modified (diff) |
---|
comment:3 by , 24 hours ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
comment:4 by , 23 hours ago
Hello Simon,
Thanks for the speedy reply, and the explanation. That's very useful, and makes a lot of sense.
I'll raise the issue with the third-party, and check out your library.
Hello Tim,
This is a duplicate of #33835 which was won't fixed. See how the request out there was the opposite of what you requested here which is to implicitly add the required fields for prefetching.
The thing is prefetching must have
blog_id
otherwise it has no way to build the associative map betweenBlog
instances andPost
to populateblog.posts.all()
. In other words if I gave you a list of the formposts = [{"id": 1, "name": "Some blog Post"}, {"id": 2, "name": "Some other blog Post"}
how would you partition it byblog_id
?Adapting
prefetch_related
to error out if provided an inadequatePrefetch(queryset)
could potentially be done but that would not solve your actual problem.The third-party library you are using to automatically generate these queries is flawed and should includeblog_id
in the select mask (AKA theonly
) call if it relies on prefetching.If you'd like to catch these early I'd suggest looking at the potential solution to have warnings emitted on query leaks in ticket:33835#comment:2.