Opened 5 months ago
Last modified 4 days ago
#35677 assigned Bug
Unexpected behaviour of Prefetch with queryset filtering on a through model
Reported by: | David Glenck | Owned by: | Simon Charette |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 5.1 |
Severity: | Normal | Keywords: | Prefetch, prefetch_related, many-to-many |
Cc: | Simon Charette, Thiago Bellini Ribeiro | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Currently prefetching many-to-many objects with a queryset that filters based on fields in the through model, will return different results than applying the same filter without prefetching.
Assume the following many-to-many relationship with a through model:
class Subscriber(models.Model): name = models.CharField(max_length=255) subscriptions = models.ManyToManyField('Subscription', related_name='subscribers', through='Status') class Subscription(models.Model): provider_name = models.CharField(max_length=255) class Status(models.Model): subscriber = models.ForeignKey(Subscriber, related_name='status', on_delete=models.CASCADE) subscription = models.ForeignKey(Subscription, related_name='status', on_delete=models.CASCADE) is_active = models.BooleanField(default=True)
Populated with the following data:
subscriber1 = Subscriber.objects.create(name="Sofia") subscriber2 = Subscriber.objects.create(name="Peter") subscription1 = Subscription.objects.create(provider_name="ABC") subscription2 = Subscription.objects.create(provider_name="123") Status.objects.create(subscriber=subscriber1, subscription=subscription1, is_active=True) Status.objects.create(subscriber=subscriber1, subscription=subscription1, is_active=True) Status.objects.create(subscriber=subscriber1, subscription=subscription2, is_active=False) Status.objects.create(subscriber=subscriber2, subscription=subscription1, is_active=True) Status.objects.create(subscriber=subscriber2, subscription=subscription2, is_active=True)
To get the active subscriptions of Sofia I can do this:
subscriber1.subscriptions.filter(status__is_active=True)
which gives me as expected twice the first subscription
Now, I try to prefetch the active subscriptions like this:
prefetched = Subscriber.objects.filter(pk__in=[subscriber1.pk, subscriber2.pk]).prefetch_related( Prefetch( 'subscriptions', queryset=Subscription.objects.filter(status__is_active=True), to_attr='active_subscriptions' ) ) prefetched[0].active_subscriptions
But if I do this, I get queryset with 4 times instead of 2 times the first subscription.
Looking into the source I found, that in the first case, the internal filter of the related_descriptor is made "sticky", such that it will be combined with the next filter that is applied, as if only one filter was used. So it behaves equivalent to:
Subscription.objects.filter(subscribers=subscriber1.id, status__is_active=True)
The prefetch on the other hand doesn't do any magic to stick the filters together and it will behave like this:
Subscription.objects.filter(subscribers=subscriber1.id).filter(status__is_active=True)
which, as well documented, is behaving differently for many-to-many relationships: https://docs.djangoproject.com/en/5.0/topics/db/queries/#spanning-multi-valued-relationships
Ideally the first filter on the queryset would also stick to the internal filter. Or at least there should be an easy way to change the behaviour, when needed.
Change History (17)
comment:1 by , 5 months ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
comment:2 by , 5 months ago
Hi Natalia
Thank you for pointing me to some related issues.
I checked them and they refer to some behaviour I also referred to, but the issue I am pointing out is a different one.
What I am trying to do is to optimize a repeating query of this shape (but of course more complex in reality)
for subscriber in subscribers: subscriber.subscriptions.filter(status__is_active=True)
by prefetching the related objects like this:
prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related( Prefetch( 'subscriptions', queryset=Subscription.objects.filter(status__is_active=True), to_attr='active_subscriptions' ) ) for subscriber in prefetched: prefetched[0].active_subscriptions
From the code above I would expect, that the prefetch gives the same results as my un-prefetched query. But it doesn't.
This is because the first case, django does some magic internally, to combine my .filter(status__is_active=True)
, with the internal filter to select subscriptions from this specific subscriber. Such that the result is, as if it would be a single filter (which is relevant in many-to-many relations).
But in the case of the prefetch django doesn't do that magic internally and thus it returns a different result.
So my proposal is to make the prefetched result behave the same as the non-prefetched one by modifying the internal behaviour of the prefetch query. Or at least give an option to do so. Because currently I have no clean way to instruct the prefetch to do what I want.
I'm not sure if this is a bug or a feature request, because this specific case is not mentioned in the documentation, but I would argue that it is the expected behaviour.
For illustration, this dirty workaround makes the prefetch return the result, that I would expect (but has undesired side-effects):
class StickyQueryset(models.QuerySet): def _chain(self): self._sticky_filter = True return self class Subscription(models.Model): provider_name = models.CharField(max_length=255) objects = StickyQueryset.as_manager()
I tried looking deeper into the QuerySet code to see if I can propose some patch. But it's rather complex code in there.
comment:3 by , 5 months ago
I tried looking deeper into the QuerySet code to see if I can propose some patch. But it's rather complex code in there.
I would search for the _next_is_sticky()
token around get_prefetch_querysets
implementations for many-to-many fields. It seems like we are already setting this flag in ManyRelatedManager.get_prefetch_querysets
(can't link as Github is down) though so I'm surprised that the filter calls are not considered to be combined.
follow-up: 6 comment:4 by , 5 months ago
Cc: | added |
---|---|
Type: | Uncategorized → Bug |
I suspect that the reason why subscriber.subscriptions.filter(status__is_active=True)
works is that the descriptor for Subscriber.subscription
calls _next_is_sticky()
before any filters is applied while when it's not the case for the queryset passed to Prefetch
.
From the _next_is_sticky
docstring
Indicate that the next filter call and the one following that should be treated as a single filter.
In the case of subscriber.subscription.filter(status__is_active)
the resulting chain is
Subcription.objects.all()._next_is_sticky().filter(subscribers=subscriber).filter(status__is_active)
while the call chain of prefetch(Prefetch("subscriptions", Subscripton.objects.filter(status__is_active))
is
Subscripton.objects.filter(status__is_active))._next_is_sticky().filter(subscribers=subscriber)
Which doesn't have the intended effect since _next_is_sticky()
is not called prior to the first filter()
call.
Calling _next_is_sticky()
(which is effectively what you emulated with your StickyQueryset
) before calling filter(status__is_active)
has the same effect
prefetched = Subscriber.objects.filter(pk__in=[subscriber1.pk, subscriber2.pk]).prefetch_related( Prefetch( 'subscriptions', queryset=Subscription.objects.all()._next_is_sticky().filter(status__is_active=True)._next_is_sticky(), to_attr='active_subscriptions' ) )
The second _next_is_sticky
call is necessary because ManyRelatedManager.get_prefetch_querysets
calls to using
triggers _chain
and clears it.
All in all this whole sticky notion is kind of hacky and simply doesn't appear appropriate in the context of ManyRelatedManager.get_prefetch_querysets
(as there is no follow up filter
call). It seems that we need in there is not _next_is_sticky
but a way to let the ORM know that some filter calls against multi-valued relationships should reuse existing JOINs no matter what. I know we have a ticket for that but I can't find it.
comment:5 by , 5 months ago
Resolution: | needsinfo |
---|---|
Status: | closed → new |
Triage Stage: | Unreviewed → Accepted |
I think the following is what you are requesting. I'm re-opening Natalia as I believe this is a legitimate bug.
-
django/db/models/fields/related_descriptors.py
diff --git a/django/db/models/fields/related_descriptors.py b/django/db/models/fields/related_descriptors.py index bc288c47ec..5356e28d22 100644
a b def __set__(self, instance, value): 94 94 instance.__dict__[self.field.attname] = value 95 95 96 96 97 def _filter_prefetch_queryset(queryset, field_name, instances ):97 def _filter_prefetch_queryset(queryset, field_name, instances, db): 98 98 predicate = Q(**{f"{field_name}__in": instances}) 99 db = queryset._db or DEFAULT_DB_ALIAS99 db = db or DEFAULT_DB_ALIAS 100 100 if queryset.query.is_sliced: 101 101 if not connections[db].features.supports_over_clause: 102 102 raise NotSupportedError( … … def get_prefetch_querysets(self, instances, querysets=None): 785 785 ) 786 786 queryset = querysets[0] if querysets else super().get_queryset() 787 787 queryset._add_hints(instance=instances[0]) 788 queryset = queryset.using(queryset._db or self._db) 788 db = queryset._db or self._db 789 queryset = queryset.using(db) 789 790 790 791 rel_obj_attr = self.field.get_local_related_value 791 792 instance_attr = self.field.get_foreign_related_value 792 793 instances_dict = {instance_attr(inst): inst for inst in instances} 793 queryset = _filter_prefetch_queryset(queryset, self.field.name, instances) 794 queryset = _filter_prefetch_queryset( 795 queryset, self.field.name, instances, db 796 ) 794 797 795 798 # Since we just bypassed this class' get_queryset(), we must manage 796 799 # the reverse relation manually. … … def get_prefetch_querysets(self, instances, querysets=None): 1165 1168 ) 1166 1169 queryset = querysets[0] if querysets else super().get_queryset() 1167 1170 queryset._add_hints(instance=instances[0]) 1168 queryset = queryset.using(queryset._db or self._db) 1171 db = queryset._db or self._db 1172 # Make sure filters are applied in a "sticky" fashion to reuse 1173 # multi-valued relationships like direct filter() calls against 1174 # many-to-many managers do. 1175 queryset.query.filter_is_sticky = True 1169 1176 queryset = _filter_prefetch_queryset( 1170 queryset ._next_is_sticky(), self.query_field_name, instances1177 queryset, self.query_field_name, instances, db 1171 1178 ) 1179 queryset = queryset.using(db) 1172 1180 1173 1181 # M2M: need to annotate the query in order to get the primary model 1174 1182 # that the secondary model was actually related to. We know that
David, if you'd up to it it seems that all this bugfix is missing are tests that can be added to the prefetch_related
test app if you're interested in submitting a PR once Github is back online.
comment:6 by , 5 months ago
Replying to Simon Charette:
All in all this whole sticky notion is kind of hacky and simply doesn't appear appropriate in the context of
ManyRelatedManager.get_prefetch_querysets
(as there is no follow upfilter
call). It seems that we need in there is not_next_is_sticky
but a way to let the ORM know that some filter calls against multi-valued relationships should reuse existing JOINs no matter what. I know we have a ticket for that but I can't find it.
Simon, perhaps this is the ticket you are looking for? #27303, at first it seems like an admin specific report but reading on it feels it has some similarities with the "sticky" bits.
Also thank you for your further analysis and reopening.
comment:7 by , 5 months ago
Thanks Natalia, #27303 is effectively a manifestation of this problem in the admin. We don't have a good way to denote that joins against multi-valued relationships should be reused between different QuerySet
method calls. It's a problem we hacked around for filter
with this sticky notion but the problem exists for annotate
and any other ORM method that allows for multi-valued (many-to-many or reverse one-to-many) to be referenced.
I believe there are other tickets that discussed a generic way to alias such relationships so they are always reusable, it was brought up during the design of FilteredRelation
for sure.
IIRC the design was something along the lines of
Subscriber.objects.alias( # Force the reuse of all JOINs up to Susbcription (including Status) subscriptions=Relation("subscriptions", reuse=True), ).filter(subscriber=subscriber).filter(status__is_active=True)
comment:8 by , 5 months ago
Hello Simon
thank you for the proposed bugfix. I prepared a test and applied your patch and it passes with your fix.
However, if I do an annotation after the filter (which I do in my real world use case). The same happens, when I add another prefetch_related after the filter.
prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related( Prefetch( 'subscriptions', queryset=Subscription.objects.filter(status__is_active=True).annotate(active=F('status__is_active')), to_attr='active_subscriptions' ) )
This makes sense, because the last filter/annotation etc. is made sticky, not the first, like in the case of the many-to-many manager.
Knowing this, for some cases this can be fixed, by moving the additional annotation before the filter.
But if I do an annotation like this, is seems to fail regardless:
prefetched = Subscriber.objects.filter(pk__in=subscribers).prefetch_related( Prefetch( 'subscriptions', queryset=Subscription.objects.annotate(active=F('status__is_active')).filter(active=True), to_attr='active_subscriptions' ) )
The many-to-many manager seems to handle this case correctly.
subscriber1.subscriptions.annotate(active=F('status__is_active')).filter(active=True)
I have started a branch with the tests here: https://github.com/django/django/compare/main...pascalfree:django:35677
comment:9 by , 5 months ago
I think there might be a way to solve the annotate issue as well but it's not trivial.
When annotate
is used sql.Query.used_aliases
doesn't get populated
>>> Subscription.objects.annotate(active=F('status__is_active')).query.used_aliases set()
This means that even if the subsequent filter()
wasn't clearing it (it does since filter_is_sticky
is False
by then) by the time it makes its way to the prefetching logic there is nothing left in used_alias
that can be reused.
The only solution I can think of is to add the filter while allowing pre-existing many-to-many aliases to be reused
-
django/db/models/fields/related_descriptors.py
diff --git a/django/db/models/fields/related_descriptors.py b/django/db/models/fields/related_descriptors.py index 5356e28d22..7b003d02cd 100644
a b def _filter_prefetch_queryset(queryset, field_name, instances, db): 112 112 if high_mark is not None: 113 113 predicate &= LessThanOrEqual(window, high_mark) 114 114 queryset.query.clear_limits() 115 return queryset.filter(predicate) 115 queryset.query.add_q(predicate, reuse_all_aliases=True) 116 return queryset 116 117 117 118 118 119 class ForwardManyToOneDescriptor: … … def get_prefetch_querysets(self, instances, querysets=None): 1172 1173 # Make sure filters are applied in a "sticky" fashion to reuse 1173 1174 # multi-valued relationships like direct filter() calls against 1174 1175 # many-to-many managers do. 1175 queryset.query.filter_is_sticky = True1176 1176 queryset = _filter_prefetch_queryset( 1177 1177 queryset, self.query_field_name, instances, db 1178 1178 ) -
django/db/models/sql/query.py
diff --git a/django/db/models/sql/query.py b/django/db/models/sql/query.py index aef3f48f10..0945aa9198 100644
a b def build_filter( 1602 1602 def add_filter(self, filter_lhs, filter_rhs): 1603 1603 self.add_q(Q((filter_lhs, filter_rhs))) 1604 1604 1605 def add_q(self, q_object ):1605 def add_q(self, q_object, reuse_all_aliases=False): 1606 1606 """ 1607 1607 A preprocessor for the internal _add_q(). Responsible for doing final 1608 1608 join promotion. … … def add_q(self, q_object): 1616 1616 existing_inner = { 1617 1617 a for a in self.alias_map if self.alias_map[a].join_type == INNER 1618 1618 } 1619 clause, _ = self._add_q(q_object, self.used_aliases) 1619 if reuse_all_aliases: 1620 can_reuse = set(self.alias_map) 1621 else: 1622 can_reuse = self.used_aliases 1623 clause, _ = self._add_q(q_object, can_reuse) 1620 1624 if clause: 1621 1625 self.where.add(clause, AND) 1622 1626 self.demote_joins(existing_inner)
Ideally we wouldn't reuse all aliases though, only the ones that we know for sure we want to reuse for _filter_prefetch_queryset(field_name)
, which can be obtained through a combination of names_to_path
and comparison to alias_map
.
comment:11 by , 5 months ago
Ahmed, thanks for offering but without a track record of contributing to the ORM I think this one might be a bit too much particularly to get the discussed names_to_path
and alias_map
working. We should at least ensure that the proposed solution addresses David problem.
comment:12 by , 3 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:13 by , 8 weeks ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:15 by , 4 weeks ago
Cc: | added |
---|
comment:17 by , 4 days ago
Owner: | set to |
---|---|
Status: | new → assigned |
Hello David, thank you for the detailed report and the models and reproduction steps you provided. I’m having trouble understanding the specific behavior you’re reporting that seems inconsistent with the documentation you referenced.
If the issue you’re encountering is related to the documented behavior of spanning multi-valued relationships, it might be a duplicate of issues such as #29196, #29271, #16554, or others. If the issue is different, could you please provide a brief summary or rephrase what you believe the bug in Django is?