Opened 8 months ago
Closed 7 months ago
#35434 closed Cleanup/optimization (wontfix)
prefetch_related_objects fails to cache UUID FKs when the string representation of a UUID is used
Reported by: | Selcuk Ayguney | Owned by: | |
---|---|---|---|
Component: | Documentation | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Selcuk Ayguney, Simon Charette | Triage Stage: | Unreviewed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Consider the following models:
class Pet(models.Model): id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) name = models.CharField(max_length=20) class Toy(models.Model): pet = models.ForeignKey(Pet, models.CASCADE, related_name="toys") name = models.CharField(max_length=20)
The following works and caches the deferred pet
instance as expected:
pet = Pet.objects.create(name="Fifi") toy_bone = Toy(pet_id=str(pet.id), name="Bone") print(toy_bone.pet)
It fails if prefetch_related_objects
is used:
pet = Pet.objects.create(name="Fifi") toy_bone = Toy(pet_id=str(pet.id), name="Bone") prefetch_related_objects([toy_bone], "pet") print(toy_bone.pet)
raise self.RelatedObjectDoesNotExist( ^^^^^^^^^^^^^^^^^ prefetch_related.models.Toy.pet.RelatedObjectDoesNotExist: Toy has no pet.
The behaviour is inconsistent as str to UUID conversion is automatically performed in the first example.
One may argue that the use of prefetch_related_objects
is redundant in this case. However, it is useful in async code where deferred lookups are not automatic.
Unit test to reproduce the issue and the suggested fix:
https://github.com/django/django/pull/18132
Change History (15)
comment:1 by , 8 months ago
Description: | modified (diff) |
---|
comment:2 by , 8 months ago
Description: | modified (diff) |
---|
comment:3 by , 8 months ago
Has patch: | set |
---|
comment:4 by , 8 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Triage Stage: | Unreviewed → Accepted |
comment:5 by , 8 months ago
Cc: | added |
---|
Have we performed any form of benchmarks to assess the impact of calling to_python
for each sides of the prefetching bridge? Field.to_python
can perform non-trivial validation and instance checks that quickly add up when dealing with large collection of objects.
Given prefetch_related_object
is normally called from a prefetch_related
queryset where both sides are already guaranteed to be of the right type and serializers/forms are meant to be turn representations of objects to the expected field types at input boundaries it seems like this change could do more harm than good both from a performance and areas of concerns violation perspective.
My question to you would be why you are assigning the unexpected type to field in the first place (the pet_id=str(pet.id)
)? Plenty of things subtly break if you assign the string representation of a DateField
, DecimalField
, FloatField
, etc to a model instance field instead of the proper type so I'm not sure how this particular case is special. If you have to assign such properties I believe you should call .full_clean()
/ clean_fields()
on the model instance and be ready to deal with any associated ValidationError
.
I'm more of the opinion that we should wont-fix this issue than proceed here.
comment:6 by , 8 months ago
True, I already implemented a similar workaround in the codebase I was working on when I spotted this issue, and honestly I'm not sure if this is a wontfix or a legitimate bug. I haven't performed any benchmark either. My rationale was that the usual Django deferred lookup works when a str
is used, but it only fails when prefetch_related_objects
is used, so this looked like the more consistent behaviour. I don't think other types such as DecimalField
etc are commonly used as primary keys.
In my real world example the model instance is instantiated from a JSON object in a generic way, and foreign keys are set using the attname
s where it is not possible to know the actual data type.
comment:7 by , 8 months ago
My rationale was that the usual Django deferred lookup works when a str is used, but it only fails when prefetch_related_objects is used, so this looked like the more consistent behaviour.
I understand where that could be perceived as inconsistent but I would argue that it's undefined behaviour in both cases. It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.
I don't think other types such as DecimalField etc are commonly used as primary keys.
Fair point, it remains unnecessary while arguably faster validation for AutoField,
UUIDField`, and others though.
In my real world example the model instance is instantiated from a JSON object in a generic way, and foreign keys are set using the attnames where it is not possible to know the actual data type.
If you are turning a raw JSON object to a model instance I think that you should be expected to call clean_fields
at the very least. That's what serialization layers are usually for; turn data into their proper model equivalent.
comment:8 by , 8 months ago
It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.
Agreed. Granted, it is still because of the database backend, but strings can be used interchangeably with UUIDs in almost everywhere, for example Toy.objects.filter(pk="00000000-...").exists()
. This makes the behaviour in this ticket even more peculiar.
That's what serialization layers are usually for; turn data into their proper model equivalent.
Sure thing. As I mentioned I have already changed my code to validate the instance, but this behaviour took me by surprise and it looked like it violates the least astonishment principle.
Edit: Apparently the same issue affects Jacob Walls in a different way.
comment:9 by , 8 months ago
I understand where that could be perceived as inconsistent but I would argue that it's undefined behaviour in both cases. It only happens to work in the lazy loading case because it incurs a query that defers the lookup to the backend.
Interesting. If pet_id=str(pet.id)
is undefined, then I'm looking at a huge yak shave to audit my projects for UUIDs represented as strings, which I'm pretty sure we're doing in every view, test, model, migration...
follow-up: 15 comment:10 by , 8 months ago
There's a difference between lookup against the database (e.g. filter(pet_id=str(pet.id))
) which performs implicit conversions and direct model attribute assignments. I would argue that if you explicitly assign the string representation of objects meant to be stored in a field to a model instance you are effectively doing something wrong.
One concrete example of that is model equality Model.__eq__
which is implemented by comparing the bases and the primary key value of the compared objects. Since value = uuid.uuid4(); value != str(value)
then Object(pk=value) != Object(pk=str(value))
.
I don't think it's attainable to have the ORM to perform type conversion every time it needs to perform equality checks.
follow-up: 12 comment:11 by , 8 months ago
Component: | Database layer (models, ORM) → Documentation |
---|---|
Has patch: | unset |
Owner: | removed |
Status: | assigned → new |
Triage Stage: | Accepted → Unreviewed |
Type: | Bug → Cleanup/optimization |
Thanks, that helps me understand the severity of the anti-pattern.
Now I'm aware that despite the implicit type-conversion you get on the database-side, python-side operations like ==
or obj in queryset
become unsafe when a stringified UUID is assigned to a model attribute.
Or even queryset.contains()
, but only if called on an evaluated queryset.
In [1]: from models import GraphModel In [2]: original = GraphModel.objects.first() In [3]: unsafe = GraphModel(str(original.pk)) In [4]: all_graphs = GraphModel.objects.all() In [5]: all_graphs.contains(unsafe) Out[5]: True In [6]: len(all_graphs) Out[6]: 1 In [7]: all_graphs.contains(unsafe) Out[7]: False
I think a version of that REPL would be useful in the docs somewhere. Calling your query tested by testing as far as line 5 but not line 7 is something I can see happening.
Selcuk, is it okay if I reframe your ticket as a documentation request and ask for another opinion?
comment:12 by , 8 months ago
Replying to Jacob Walls:
Selcuk, is it okay if I reframe your ticket as a documentation request and ask for another opinion?
It's perfectly fine. Simon has a good point, and I agree that it is easy to handle this on the application side.
comment:13 by , 8 months ago
Resolution: | → needsinfo |
---|---|
Status: | new → closed |
Hi all, currently closing as "needsinfo". It's not clear to me where in the docs you're hoping to add something or what would have been useful in the docs for this case. Can you share an idea of what you have in mind?
comment:14 by , 7 months ago
Has patch: | set |
---|---|
Resolution: | needsinfo |
Status: | closed → new |
comment:15 by , 7 months ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
I have reviewed the PR and as mentioned there I'm not comfortable with the adding, I think it's misleading and could be interpreted it as a use case that Django supports when it does not. In my view, the example line doing DailyEntry(day="2020-01-01", headline="Happy New Year!")
goes in direct opposition of what Simon said here:
There's a difference between lookup against the database (e.g.
filter(pet_id=str(pet.id))
) which performs implicit conversions and direct model attribute assignments. I would argue that if you explicitly assign the string representation of objects meant to be stored in a field to a model instance you are effectively doing something wrong.
I ran into a symptom of this problem recently and didn't get far enough to diagnose the root cause.
I was prefetching at two levels, so my hack was to just iterate the objects and reselect the object at the intermediate level (reintroducing N+1 queries at that level), but at least allowing me to avoid N+1 queries at the second level.
Thanks for the report.