Opened 14 years ago
Closed 14 years ago
#14474 closed (wontfix)
Unnecessary deepcopying of QuerySet inside filter() method results in slower execution
Reported by: | Piotr Czachur | Owned by: | nobody
---|---|---|---
Component: | Database layer (models, ORM) | Version: | dev
Severity: | | Keywords: | deepcopy, filter
Cc: | | Triage Stage: | Unreviewed
Has patch: | no | Needs documentation: | no
Needs tests: | no | Patch needs improvement: | no
Easy pickings: | no | UI/UX: | no
Description
When building a query using multiple filter() calls like this:

qs = MyModel.objects.filter(...).filter(...).filter(...)

each filter() call results in a sql.Query.clone().
Query.clone() uses deepcopy() for the "where", "having", "aggregates" and "deferred_loading" properties:

obj.where = deepcopy(self.where, memo=memo)
obj.having = deepcopy(self.having, memo=memo)
obj.aggregates = deepcopy(self.aggregates, memo=memo)
obj.deferred_loading = deepcopy(self.deferred_loading, memo=memo)

... which can be a time-consuming operation.
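As a quick illustration of the cloning behaviour (QuerySet.query is a real attribute; MyModel and the field names are placeholders):

qs0 = MyModel.objects.all()
qs1 = qs0.filter(a=1)
qs2 = qs1.filter(b=2)
# Each filter() returned a brand-new QuerySet wrapping a cloned Query,
# so three Query objects were built for this one chain:
assert qs1 is not qs0 and qs2 is not qs1
assert qs1.query is not qs0.query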
My question is: do we really need a deepcopy when we call filter() like this?
We only need the QuerySet created by the last filter() call, so we could overwrite the previous QuerySet instead of creating a (deep)copy.
Maybe adding a flag to the filter() method would be possible: filter(overwrite=False).
P.S.
I attach profiler results for a non-trivial query run on my Ubuntu box: CPU: Intel(R) Celeron(R) M CPU 530 @ 1.73GHz, Python 2.6.5 / MySQL 5.1@InnoDB.
The results show that for a complicated query, the query-building time can be far larger than executing the SQL and filling model instances with the results.
I tested two versions of the query:
# multi-filter
result = Zero.objects.select_related('one__two__three') \
    .filter(field_x=None) \
    .filter(field_t='P') \
    .filter(one__id__gt=2) \
    .filter(one__field_j__isnull=True) \
    .filter(one__two__id__isnull=False) \
    .filter(one__two__field_y__year=2009) \
    .filter(one__two__three__field_j__isnull=True) \
    .filter(one__two__three__field_z__gt=2)[:1]
len(result)
and
# filter-once
result = Zero.objects.select_related('one__two__three') \
    .filter(field_x=None,
            field_t='P',
            one__id__gt=2,
            one__field_j__isnull=True,
            one__two__id__isnull=False,
            one__two__field_y__year=2009,
            one__two__three__field_j__isnull=True,
            one__two__three__field_z__gt=2)[:1]
len(result)
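A minimal sketch of how such a comparison can be profiled with the standard library's cProfile (the model and field names follow the report; the exact filters are abbreviated):

import cProfile

def multi_filter():
    # len() evaluates the queryset, so both query building and SQL
    # execution show up in the profile
    len(Zero.objects.filter(field_x=None).filter(field_t='P')[:1])

def filter_once():
    len(Zero.objects.filter(field_x=None, field_t='P')[:1])

cProfile.run('multi_filter()')
cProfile.run('filter_once()')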
Attachments (1)
Change History (4)
by , 14 years ago
Attachment: | QuerySet_execution_time.ods added
---|---
comment:1 by , 14 years ago
Resolution: | → invalid
---|---
Status: | new → closed
Yes, a deepcopy is required, because you can build on querysets in parallel paths, e.g.:

qs = Author.objects.all()
qs1 = qs.filter(name='foo')
qs2 = qs.filter(age=5)

If we don't deepcopy, qs1 and qs2 will partially share internal memory structures, which will lead to interesting behavior.
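To make that concrete, a minimal sketch (assuming an Author model with name and age fields) of what the clone buys us:

qs = Author.objects.all()
qs1 = qs.filter(name='foo')
qs2 = qs.filter(age=5)
# Because filter() clones the underlying Query, each branch is independent:
print(str(qs1.query))  # WHERE clause mentions only name
print(str(qs2.query))  # WHERE clause mentions only age
# With a shared (non-deepcopied) where tree, adding a condition to one
# branch could leak into the other.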
comment:2 by , 14 years ago
Resolution: | invalid
---|---
Status: | closed → reopened
Maybe I haven't made myself entirely clear.
I'm not asking whether deepcopying in general is necessary.
I'm suggesting that for specific - but not rare - use cases we could let users decide whether to skip the deepcopy and overwrite the existing queryset.

# In this case we are sure that the "Author.objects.all()" return value won't
# be used again to create another, more specific query, so we don't need
# deepcopying here.
qs = Author.objects.all().filter(overwrite=True, name='foo')

This would give us quite a performance boost while remaining backward compatible (default: overwrite=False).
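For illustration only, a hypothetical sketch of what skipping the clone might look like (filter_inplace is not a real Django API, and Query.add_q is an internal, version-dependent method):

from django.db.models import Q

def filter_inplace(qs, **kwargs):
    # Hypothetical helper: mutate the queryset's underlying Query in place
    # instead of cloning it first. Unsafe if any other queryset was derived
    # from qs, since both would share the mutated where tree.
    qs.query.add_q(Q(**kwargs))
    return qs

qs = filter_inplace(Author.objects.all(), name='foo')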
comment:3 by , 14 years ago
Resolution: | → wontfix
---|---
Status: | reopened → closed
While we're at it, we could add an 'explode_randomly=True' option, defaulting to false. :-)
In other words: No.
If you are in a situation where optimizing the usage of a queryset is an issue, you should be using a raw queryset.
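For reference, a minimal sketch of that alternative using Manager.raw(), which maps rows straight onto model instances (the table name myapp_author is an assumption about the schema):

# Bypass QuerySet construction entirely with hand-written SQL.
authors = Author.objects.raw(
    "SELECT * FROM myapp_author WHERE name = %s AND age > %s",
    ['foo', 5],
)
for a in authors:
    print("%s %s" % (a.name, a.age))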