#25544 closed Cleanup/optimization (fixed)
prefetch_related sends duplicate ids to database
Reported by: | Julien Hartmann | Owned by: | Ian Foote |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | prefetch duplicate |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
That may be intended, but it seems dubious, so…
Issue happens when prefetching a model relation from another one, linked by a M2M field. Sample case:
class Picture(Model): # some fields class Collection(Model): pictures = ManyToManyField(Picture, blank=True, related_name='collections') # some fields class CollectionPermission(Model): content_object = ForeignKey(Collection, related_name='permissions') # fk to users and fk to permissions
When prefetching collection permissions from a picture queryset, like this:
Picture.objects.prefetch_related('collections__permissions')
The third query (the second level of prefetching) looks like this:
SELECT "gallery_collectionpermission"."id", "gallery_collectionpermission"."user_id", "gallery_collectionpermission"."permission_id", "gallery_collectionpermission"."content_object_id" FROM "gallery_collectionpermission" WHERE "gallery_collectionpermission"."content_object_id" IN ( 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9 )
The issue arises in two steps:
- As prefetch has no identity mapping, the first prefetching step will build a different Collection instance for every picture that uses it.
- Then the second prefetching step will gather all instances, and dump their ids to the database without looking for duplicates.
Ideally, this could be fixed at the first step: when a queryset prefetches the same row several times, it would make sense to return a single instance. In this example, all Pictures' cache should point to the same Gallery instances.
The second step could also be fixed: the related manager could eliminate duplicates when building the query. For some reason the ReverseSingleRelatedObjectDescriptor does it, but none of the other descriptors or managers do.
The cost seems huge. Typically, I have collections of hundreds of pictures, but each picture will only appear in 4 or 5 collections. The query thus sends the db server tens of thousands of ids instead of a few dozens.
Is there something I missed? Or is this use case out of the normal scope of prefetch_related
?
Change History (16)
comment:1 by , 9 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Uncategorized → Cleanup/optimization |
comment:2 by , 9 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 9 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
Patch is ready, IMO. Waiting for actual merge until the CI catches up...
comment:4 by , 9 years ago
Has patch: | set |
---|---|
Patch needs improvement: | set |
Triage Stage: | Ready for checkin → Accepted |
postgres_tests.test_array.TestQuerying.test_in
fails.
comment:5 by , 9 years ago
@IanFoote Oops, my fault. I didn't consider the possibility that non-hashable values (e.g. lists for an ArrayField) could be passed to __in
lookups. I don't really see a way to preserve this fix in the face of that possibility (going through the value and trying to convert everything to something hashable doesn't seem like a good plan), so unless you have any ideas I'm not seeing, I think we may need to revert to your previous fix at the prefetch_related
layer.
Sorry for the bad direction!
comment:6 by , 9 years ago
@carljm, what do you think of simply special casing ArrayField
's __in
lookup which is something we already do for __exact
anyway.
Something like the following comes to mind.
@ArrayField.register_lookup class ArrayInLookup(InLookup): def get_prep_lookup(self): values = super(ArrayInLookup, self).get_prep_lookup() return [tuple(value) for value in values]
comment:7 by , 9 years ago
@charettes I think that's a good idea! As long as we're confident that a) there are no other valid cases outside ArrayField where an unhashable value might be passed in an __in
lookup rhs, and b) every value passed to an ArrayField lookup should always be iterable. But I can't come up with any counter-examples to either of those.
comment:8 by , 9 years ago
A possibility is to try-except the deduping of the values. If it doesn't succeed, just use the original values.
comment:9 by , 9 years ago
I do not know if it is related, but ArrayField.get_db_prep_value
always converts tuples to lists.
comment:11 by , 9 years ago
Patch needs improvement: | unset |
---|
comment:12 by , 9 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:13 by , 9 years ago
Patch needs improvement: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
Latest comments on the PR indicates a few more tests should be added.
comment:14 by , 9 years ago
Patch needs improvement: | unset |
---|---|
Version: | 1.8 → master |
The comments have been addressed.
comment:16 by , 7 years ago
The fix to this resulted in a regression where parameters passed to __in
were reordered due to the use of set()
. See #29503.
I'm not too familiar with this, but accepting on the basis for further investigation.