Opened 14 years ago

Closed 14 years ago

#14474 closed (wontfix)

Unnecessary deepcopying of QuerySet inside filter() method results in slower execution

Reported by: Piotr Czachur Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Keywords: deepcopy, filter
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When building query using multiple filter() calls like that:

    qs = MyModel.objects.filter(...).filter(...).filter(...)

it results in sql.Query.clone() for every filter() call.
Query.clone() uses deepcopy() for "where", "having", "aggregates" and "deferred_loading" properties:

        obj.where = deepcopy(self.where, memo=memo)
        obj.having = deepcopy(self.having, memo=memo)
        obj.aggregates = deepcopy(self.aggregates, memo=memo)
        obj.deferred_loading = deepcopy(self.deferred_loading, memo=memo)

... which can be time-consumig operation.

My question is, do we really need deepcopy when we call filter() like this?
I mean we only need QuerySet created by last filter() call, so we can overwrite previous QuerySet instead of creating a (deep)copy.
Maybe adding flag for filter() method would be possible: filter(overwrite=False).

P.S.

I attach profiler results for non-trivial query run on my Ubuntu box: CPU: Intel(R) Celeron(R) M CPU 530 @ 1.73GHz, Python 2.6.5 / MySQL 5.1@InnoDB
Results show that for complicated query, query building time can be far larger than executing SQL and filling model instances with results.
I tested two version of the query:

    # multi-filter
    result = Zero.objects.select_related('one__two__three') \
        .filter(field_x=None) \
        .filter(field_t='P') \
        .filter(one__id__gt=2) \
        .filter(one__field_j__isnull=True) \
        .filter(one__two__id__isnull=False) \
        .filter(one__two__field_y__year=2009) \
        .filter(one__two__three__field_j__isnull=True) \
        .filter(one__two__three__field_z__gt=2) \
        [:1]
    len(result)

and

    # filter-once
    result = Zero.objects.select_related('one__two__three') \
        .filter(field_x=None, 
        field_t='P', 
        one__id__gt=2, 
        one__field_j__isnull=True, 
        one__two__id__isnull=False, 
        one__two__field_y__year=2009, 
        one__two__three__field_j__isnull=True, 
        one__two__three__field_z__gt=2 
        )[:1]
    len(result)

Attachments (1)

QuerySet_execution_time.ods (18.4 KB ) - added by Piotr Czachur 14 years ago.
QuerySet profiling on Ubuntu box CPU: Intel(R) Celeron(R) M CPU 530 @ 1.73GHz, Python 2.6.5, MySQL 5.1@Innodb

Download all attachments as: .zip

Change History (4)

by Piotr Czachur, 14 years ago

Attachment: QuerySet_execution_time.ods added

QuerySet profiling on Ubuntu box CPU: Intel(R) Celeron(R) M CPU 530 @ 1.73GHz, Python 2.6.5, MySQL 5.1@Innodb

comment:1 by Russell Keith-Magee, 14 years ago

Resolution: invalid
Status: newclosed

Yes, a deepcopy is required, because you can build on querysets in parallel paths. i.e.,

qs = Author.objects.all()
qs1 = qs.filter(name='foo')
qs2 = qs.filter(age=5)

If we don't deepcopy, qs1 and qs2 will be partially sharing internal memory structures, which will lead to interesting behavior.

comment:2 by Piotr Czachur, 14 years ago

Resolution: invalid
Status: closedreopened

Maybe I haven't made myself entirely clear.

I'm not asking if deepcopying in general is necessary.
I'm suggesting that for specific - but not rare - use cases we can let users decide whether to skip deepcopying and overwrite existing queryset.

# In such case we are sure that "Author.objects.all()" return value won't be used again, to create other, more specific query - we don't need deepcopying here.
qs = Author.objects.all().filter(overwrite=True, name='foo')

This will give use quite performance boost, while remaining backward compatible (default: overwrite=False).

comment:3 by Russell Keith-Magee, 14 years ago

Resolution: wontfix
Status: reopenedclosed

While we're at it, we could add a 'explode_randomly=True' option, defaulting to false. :-)

In other words: No.

If you are in a situation where optimizing the usage of a queryset if an issue, you should be using a raw queryset.

Note: See TracTickets for help on using tickets.
Back to Top