Opened 28 hours ago

Last modified 27 hours ago

#36157 closed Bug

Unusual behaviour when pre-fetching with only applied on the related fields — at Version 1

Reported by: Tim McCurrach Owned by:
Component: Uncategorized Version: 5.1
Severity: Normal Keywords:
Cc: Tim McCurrach Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Tim McCurrach)

When prefetching related models. If you apply .only() to the related queryset, django performs additional lookups for related IDs that have been left out of the only.

It is probably easiest to explain the issue with an example.

Example Situation

Suppose you have these models:

class Blog(models.Model):
    name = models.CharField(max_length=100)

class Post(models.Model):
    name = models.CharField(max_length=100)
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE, related_name="posts")
    ...lots of other big fields

And you create a few items for each:

blog = Blog.objects.create(name="Django Tricks")
blog2 = Blog.objects.create(name="React Tricks")
Post.objects.create(name="prefetching", blog=blog)
Post.objects.create(name="models", blog=blog)
Post.objects.create(name="templates", blog=blog)
Post.objects.create(name="hooks", blog=blog2)
Post.objects.create(name="components", blog=blog2)

If I wish to pre-fetch the posts for some blogs, but only want the names of each post, rather than the content of each post I can do the following:

Blog.objects.prefetch_related(Prefetch("posts", queryset=Post.objects.only("name")))

I would expect this to result in just 2 database queries. One to fetch the data for the Blog instances, and another to fetch the the data for the related Posts. Instead, there is an n+1 issue where there are 5 extra follow up requests for each of the related Post instances. This is the SQL that is generated:

SELECT "app_blog"."id", "app_blog"."name" FROM "app_blog" LIMIT 21
SELECT "app_post"."id", "app_post"."name" FROM "app_post" WHERE "app_post"."blog_id" IN (1, 2)
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 1 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 2 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 3 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 4 LIMIT 21
SELECT "app_post"."id", "app_post"."blog_id" FROM "app_post" WHERE "app_post"."id" = 5 LIMIT 21

I can understand it might be a good idea to have the related-id's for the blog on hand should you need them later. But I also think, that by using .only() you are explicitly telling django - I don't need these. This is a real problem for larger data-sets, where you end up with thousands of extra round-trips to the database.

Context

This is an issue that came up in the wild. I'm using a third-party optimiser that improves the performance of graphQL queries by decorating querysets with only(), select_related() etc. It correctly identifies that I am only using certain fields and applies only() to them, knowing I will never need to access certain related fields. This results in django producing many additional hits to the database.

Change History (1)

comment:1 by Tim McCurrach, 28 hours ago

Description: modified (diff)
Note: See TracTickets for help on using tickets.
Back to Top