Opened 4 years ago

Closed 4 years ago

Last modified 2 years ago

#32244 closed Cleanup/optimization (wontfix)

ORM inefficiency: ModelFormSet executes a single-object SELECT query per formset instance when saving/validating

Reported by: Lushen Wu Owned by: nobody
Component: Database layer (models, ORM) Version: 3.1
Severity: Normal Keywords: formsets
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Lushen Wu)

Conceptual summary of the issue:
Let's say we have a Django app with Author and Book models, and use a BookFormSet to add / modify / delete books that belong to a given Author. The problem is when the BookFormSet is validated, ModelChoiceField.to_python() ends up calling self.queryset.get(id=123) which results in a single-object SELECT query for each book in the formset. That means if I want to update 15 books, Django performs 15 separate SELECT queries, which seems incredibly inefficient. (Our actual app is an editor that can update any number of objects in a single formset, e.g. 50+).

My failed attempts to solve this:

  1. First I tried passing a queryset to the BookFormSet, i.e. formset = BookFormSet(data=request.POST, queryset=Book.objects.filter(author=1)), but the ModelChoiceField still does its single-object SELECT queries.
  2. Then I tried to see where the ModelChoiceField defines its queryset, which seems to be in BaseModelFormSet.add_fields(). I tried initiating the ModelChoiceField with the same queryset that I passed to the formset, e.g. Book.objects.filter(author=1) instead of the original code which would be Book._default_manager.get_queryset(). But this doesn't help because I guess the new queryset I defined isn't actually linked to what was passed to the formset (and already evaluated). So the multiple SELECT queries still happen. (Note: I realize _default_manager.get_queryset() might be necessary in cases where the formset can be used to switch one Model instance to another instance which might not be in the original queryset passed to the BaseModelFormset, but this is not our use case)
  3. I noticed that BaseModelFormSet._existing_object() actually provides a way to check whether an object exists in the queryset that was giving to the formset constructor, which means that queryset is evaluated at most once and the results stored in BaseModelFormSet._object_dict. I thought there might be some way to have ModelChoiceField.to_python() do something similar before calling self.queryset.get(id=123), but I don't think ModelChoiceField is aware of BaseModelFormSet, and it would seem an anti-pattern to reach up the hierarchy like this.

The easiest solution seems to me to pass BaseModelFormSet._object_dict in some way to each ModelForm that's created, and then allow the ModelChoiceField to check this _object_dict before making another SELECT query.

Even if this could be solved with some form of DB caching, it still seems inefficient application-layer logic.

Change History (16)

comment:1 by Lushen Wu, 4 years ago

Description: modified (diff)

comment:2 by Lushen Wu, 4 years ago

Description: modified (diff)

comment:3 by Lushen Wu, 4 years ago

Description: modified (diff)

comment:4 by Lushen Wu, 4 years ago

Description: modified (diff)

comment:5 by Lushen Wu, 4 years ago

Is this a more fundamental aspect of Django ORM behavior? In that if I have qs = Book.objects.filter(author=1) (let's say this gives back 3 Books with id = 1,2,3), and then evaluate it -- i.e. len(list(qs)) -- and then try book2 = qs.get(id=2) this will still run a SELECT query even though book2 was already fetched when the original queryset was evaluated

comment:6 by Lushen Wu, 4 years ago

I figured out a very hacky workaround that preserves the formset functionality we need. Basically in BookForm you can override ModelForm._clean_fields()

def _clean_fields(self):
        # remove 'id' field so it's not cleaned 
        # (this is the ModelChoiceField that's generating an extra SELECT query per Book object in the formset during clean)
        id_field = self.fields.pop('id')

        # run normal cleaning, with the form now unaware of its own 'id' field
        super(BookForm, self)._clean_fields()

        # add 'id' field back and insert it into clean_data
        id_value = id_field.widget.value_from_datadict(self.data, self.files, self.add_prefix('id'))
        self.cleaned_data['id'] = id_value

        # add the 'id' field back into the form
        self.fields['id'] = id_field

comment:7 by Carlton Gibson, 4 years ago

Resolution: wontfix
Status: newclosed

Hi. Thanks for the report.

It looks like you'e dug into the various aspects.

This seems to make the most sense to me:

The easiest solution seems to me to pass BaseModelFormSet._object_dict in some way to each ModelForm that's created, and then allow the ModelChoiceField to check this _object_dict before making another SELECT query.

You'd pass a reference via the form kwargs, and then need a custom form and ModelChoiceField subclass to make use of it in to_python().

I have to say this is probably a wontfix. It's not likely that this is a performance bottleneck for most projects. Even several hundred lookups are not going to noticeably affect the form submission in normal cases. Projects needing to process data in a performant tight-loop will want to make adjustments as discusses here, but the complication there probably isn't worth it for the general case.

If you experiment in your own project and come up with a clean and simple solution, it may be worth reviewing then, either here or on the DevelopersMailingList.

I hope that makes sense.
Thanks.

comment:8 by Lushen Wu, 4 years ago

Thanks for taking a look! I understand that the issue is probably not a performance bottle neck in most projects, but if it truly is an extra SQL query translating to millions of unnecessary roundtrips across all Django deployments (and the fix would be relatively straightforward), I feel like it still deserves some consideration. I'd be willing to contribute to a patch if helpful.

Conceptually I'm not exactly sure why ModelChoiceField needs a Python representation of the value to validate. I understand why this would be needed for e.g. a CharField with a max_length requirement. But if the ModelChoiceField value passed from the form is a PK, can it be validated without loading the entire instance into memory? (If fetching the full Python object is required for custom validators, the to_python() call and SELECT query could be deferred until such a case is encountered.)

Currently the validation callstack goes BaseForm.full_clean() -> BaseForm._clean_fields(value) -> Field.clean(value) -> ModelChoiceField.to_python(value).

From what I can see, ModelChoiceField does not override Field.validate() or Field.run_validators(). So the only validation step performed by ModelChoiceField basically checks whether the Book instance generated by Field.to_python() belongs to ModelChoiceField._get_choices() <-- which by default is a ModelChoiceIterator for all Book objects.

Is there a "lighter" way of doing this that simply checks whether the ModelChoiceField value (as a PK) exists in the corresponding model table? If the SELECT query is simply a side effect of ModelChoiceField inheriting Field.clean(), then I think it would be cleaner design to have ModelChoiceField.clean() override Field.clean() and avoid calling to_python() in the default validation loop. And as mentioned fetch the object only if needed for custom validators.

Thanks again!


P.S. The kw_args workaround you suggested seems to be the cleanest in the meantime if this stays a wontfix.

Version 0, edited 4 years ago by Lushen Wu (next)

comment:9 by powderflask, 3 years ago

Thanks for this discussion - I have the same issue with an app where user may update a formset with >100 records in a single post. Like OP, it's not the performance issue per say that bugs me, just the wasted cycles and DB hits.

That ModelChoiceField is added to the form dynamically by BaseModelFormSet.add_fields(self, form, index) so that every form has a pk field:

         form.fields[self._pk_field.name] = ModelChoiceField(qs, initial=pk_value, required=False, widget=widget)

I think what we want that field to validate here is that the pk value in the post data matches the form's instance, (or that the pk value is None if form has no instance) i.e., we need to verify that no one tampered with the form's pk data on the round trip.

Might this approach work:

  • define a ModelPkField intended specifically to define a read-only pk field.
  • validation as above, no DB hit
  • ModelFormSet uses this field type to define these hidden pk fields, and returns the form's original instance if it validates

From what I can tell, this appears to similar to approach taken for FK field added to an InlineFormSet, which uses an InlineForeignKeyField to serve a similar purpose:

      form.fields[name] = InlineForeignKeyField(self.instance, **kwargs)

I don't mind having a kick at this can, but looking for some advice on whether this is even a reasonable idea?

Last edited 3 years ago by powderflask (previous) (diff)

comment:10 by powderflask, 3 years ago

Here's a gist with a first crack at the idea I proposed above: https://gist.github.com/powderflask/de0d3d6c6bef6c1b3890bdc1e83ff05c

It worries me that this code is so much simpler than the original.
Also, it may be that some client code relies on the private behaviour and is expecting the pk form field to be ModelChoiceField.
So I suspect a robust solution will need to handle more use-cases, but...

This code seems to work for the use cases I tested, saving formsets without any extra DB hits and flagging validation errors if the post data is tampered with.

If you want to try this, simply pass the EfficientModelFormSet into modelformset_factory:

       modelformset_factory( ... , formset=EfficientModelFormSet)

If I get some feedback here, I can work up some test cases and turn this into a pull request.

comment:11 by Carlton Gibson, 3 years ago

Hi @powderflask

It looks a bit like the Admin's raw_id_fieldswhich is widget based.

I gave a talk at DjangoCon Europe 2020 on optimising ModelChoiceField — the basic idea in the FormSet case being to set the choices directly in order to avoid regenerating them for each form. Take a look at that: does it address your issue?

in reply to:  11 ; comment:12 by powderflask, 3 years ago

Replying to Carlton Gibson:
@Carlton - thanks for the reply, for the interesting talk, and for all the work you do maintaining django. Awesome.
I regularly run into the issues you cover in your talk, and have applied all 3 solutions at various times - it was nice to see them all laid out, and I'll be glad for the menu of options in future.

But that's not at all the issue here.

This issue is about how django validates the hidden pk field automatically added to each form by BaseModelFormSet.
That hidden field is represented as a ModelChoiceField, so it hits the DB during validation, but for zero effect as far as I can tell - the form already has the instance. It's just a side-effect of the way ModelChoiceField (correctly) validates data that isn't needed here.

What I think I see here is that using ModelChoiceField has several potential drawbacks:

  1. when saving a large modelformset, one redundant query is done for each record;
  1. if the form's pk data does get messed up (say by a buggy piece of JS), the behaviour of modelforset is a bit erratic and it may do any of these 3 things, all of which are unexpected (in my mind):
  1. if the form's pk data is not a valid value for the model's pk field, form validation fails but the error is attached to the hidden pk field, so is unlikely to be displayed to the user. Worse, the hidden pk field now has no value , but the rest of the form data looks the same to the end user. If the user just presses save again, a new instance is created with the form data - creating a kind-of duplicate of the original data.
  1. if the form's pk data is a valid value, but not in the formset's queryset, the form validates BUT the pk field for the instance is set to None. Thus, the save logic proceeds, and again, a new instance is created.
  1. if the bad pk data value happens to duplicate a pk that is in the queryset, the form validates and the save logic may overwrite one record with the other record's data, resulting in potential data loss .

Now, i understand getting bum pk values back in a formset is a strange edge case and means something went horribly wrong in the round-trip. BUT the framework should provide consistent, predictable behaviour, and it should not be making spurious DB queries for no good reason.

I posted a gist above to illustrate a concept I'm thinking of - I'm sure I'm missing some important use-cases, just looking for advice on whether this is even worth pursuing.
thanks again for taking an interest. Hope I made some sense here.

in reply to:  12 comment:13 by powderflask, 3 years ago

Replying to powderflask:
Correcting myself - some of the unexpected behaviors described above result from logic in BaseModelFormSet._construct_form and has little to do with this issue. As I investigated, scope broadened and that logic IS logically part of validating the pk value, but is needed to associate the form with its instance. Still thinking...
The code I proposed at least raises a validation error in cases (a) and (b). Case (c) seems most critical, but can't be detected at form level - it's a formset validation issue.

Last edited 3 years ago by powderflask (previous) (diff)

comment:14 by Carlton Gibson, 3 years ago

Hi powderflask, thanks for the follow-up.

This issue is about how django validates the hidden pk field automatically
added to each form by BaseModelFormSet. That hidden field is represented as a
ModelChoiceField, so it hits the DB during validation, but for zero effect as
far as I can tell - the form already has the instance. It's just a side-effect
of the way ModelChoiceField (correctly) validates data that isn't needed here.

What I think I see here is that using ModelChoiceField has several potential
drawbacks:

  1. when saving a large modelformset, one redundant query is done for each record;

But the lookup is not redundant… — it's a ModelChoiceField: it has a QuerySet of possible valid values, and in validation we need to check that the submitted value is in the QuerySet.

The reason it's evaluated per-row is because we go to great lengths to avoid stale QuerySets. Hence the suggestion in the talk to generate choices upfront if you want to avoid that repeated work.

…but for zero effect as far as I can tell - the form already has the instance…

I haven't understood exactly your point here… To wit, how about you put together a very minimal sample project with a test case showing the exact behaviour you think is wrong? (Here's an example I did for another issue.) It's much clearer with code in hand, and a "At this breakpoint() the behaviour should be X".

You're always welcome to use a custom field in your own project, but I don't suspect there's a change on the cards here. A search of the Trac history shows ModelChoiceField has been through a lot. It's the way it is for good reasons. However, always happy to have a look.

I hope that makes sense. :)

comment:15 by powderflask, 2 years ago

Thanks Carlton for your serious consideration of this issue and deep explanations - it's very kind of you to spend time on this.
I hit this issue for a 3rd time, so had another chance to think it through, but still don't see where I've misunderstood anything fundamental...

But the lookup is not redundant… — it's a ModelChoiceField: it has a QuerySet of possible valid values, and in validation we need to check that the submitted value is in the QuerySet.

You are right in general - a ModelChoiceField needs a QuerySet to validate. What I think I'm arguing is that hidden PK field should not be a ModelChoiceField at all - the overhead is not needed.

Here's why I think the Queryset / ModelChoiceField is redundant and could be replaced by a simpler type of field:

  1. this is a hidden field precisely because the PK field can't be edited. In fact, it should not be valid for that hidden field to have any value other than None (for new forms) or the PK of the form instance. Any other value should raise a validation error.
  1. so we don't need a QuerySet to do the validation.
    • If the hidden PK field is empty, great, it's a new instance, nothing to validate;
    • otherwise, it's value MUST match the form's instance - if it doesn't, something has gone horribly wrong in any case.

Perfectly possible I've just missed some crucial use case or fact about how model formsets are validated. But I've been through this 3 times now for 3 different use cases, and come to the same conclusion each time - valid values for the PK are already known at validation time, so the DB access to fetch the Queryset is just pure overhead (for reference, I have forms where >500 entries are being submitted, so the DB lag is non-trivial).

thanks again - sorry to be a dog with a bone :-P

comment:16 by Carlton Gibson, 2 years ago

Hi powderflask — no problem. :)

Here's why I think the Queryset / ModelChoiceField is redundant and could be replaced by a simpler type of field...

I think the way forward here would be for you to provide a proof-of-concept. Then folks could assess properly whether it were viable.

… for reference, I have forms where >500 entries are being submitted, so the DB lag is non-trivial

Sure. It's in this case that I provide pre-calculated choices to avoid the overhead (as per above).

Note: See TracTickets for help on using tickets.
Back to Top