Opened 6 years ago

Closed 5 years ago

#30207 closed Cleanup/optimization (needsinfo)

Optimise paginator for tables with massive records

Reported by: M. Javidan Darugar
Owned by: M. Javidan Darugar
Component: Core (Other)
Version: dev
Severity: Normal
Keywords:
Cc: Can Sarıgöl
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I had a problem with the Paginator class when slicing a huge dataset. With millions of records it can take 30 seconds to reach the last page, so I came up with a solution to overcome this problem.

The problem is that, in order to slice the data, the current implementation has to evaluate the whole row set to render bottom:top. It generates something like SELECT ID, COL_1, ..., COL_N, ... WHERE ..., which puts a heavy burden on the database just to slice the data. To overcome this, we can instead select only the primary keys, do the slicing on that list, and then fetch the records whose pk is in the sliced list. A very simple but very efficient solution. Using this approach I improved the performance of our project significantly: from 30 seconds down to only 2-3 seconds for 8 million records.

In this ticket I propose the same approach to improve Django's Paginator class.
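
To illustrate the idea, here is a minimal sketch (not the actual patch; ExampleModel, the page size and the page number are placeholders):

# Illustrative sketch: slice on the primary keys first, then fetch the rows.
from myapp.models import ExampleModel  # placeholder model

per_page = 25
page_number = 300000                   # a deep page in a large table
bottom = (page_number - 1) * per_page
top = bottom + per_page

# Step 1: slice only the primary keys.
pks = list(ExampleModel.objects.values_list('pk', flat=True)[bottom:top])

# Step 2: fetch the full rows for just those primary keys.
page_objects = ExampleModel.objects.filter(pk__in=pks)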

Change History (9)

comment:1 by M. Javidan Darugar, 6 years ago

Owner: changed from nobody to M. Javidan Darugar
Status: new → assigned

comment:2 by Simon Charette, 6 years ago

Hello there,

Could you provide more details about which database backend is sped up by this technique? I'm a bit surprised that two queries perform faster than one in this case, because even if the first one allowed the use of an index-only scan, the tuples still have to be fetched in the end. The query planner should be smart enough to chain the operations in a single query. I suspect something else might be at play here, such as the query to count the number of results.

By the way, the issue title should be changed to reflect what the optimization technique is as right now it's pretty vague.

comment:3 by Nikolas, 6 years ago

  1. I too use a paginator hack for the admin. It works for Postgres only, and only when no filters are applied, but it could be a good built-in optimization:
from django.core.paginator import Paginator
from django.db import connection


class LargeTablePaginator(Paginator):
    """
    Warning: PostgreSQL-only hack.
    Overrides the paginator's count to return PostgreSQL's reltuples
    estimate instead of an exact count when the queryset is unfiltered.
    The estimate can be stale, so it is not fit for situations where the
    exact number of objects actually matters.
    """

    def _get_count(self):
        if getattr(self, '_count', None) is not None:
            return self._count

        query = self.object_list.query
        if not query.where:
            # Unfiltered queryset: read the planner's row estimate from pg_class.
            try:
                with connection.cursor() as cursor:
                    cursor.execute(
                        "SELECT reltuples FROM pg_class WHERE relname = %s",
                        [query.model._meta.db_table],
                    )
                    self._count = int(cursor.fetchone()[0])
            except Exception:
                self._count = super()._get_count()
        else:
            # Filtered queryset: fall back to the normal, exact count.
            self._count = super()._get_count()

        return self._count

    count = property(_get_count)

This gives maximum speed on a table of any size.

  2. If the query is ordered by pk only, it can be seriously optimized when you know the maximum pk of the previous page and never need to jump back by pages: you only need SELECT ... WHERE pk < last_max_id. In that case the index is used and you get maximum speed (a rough sketch follows).
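
A rough sketch of that keyset ("seek") approach, assuming a placeholder ExampleModel and descending pk ordering:

# Keyset pagination sketch: remember the last pk seen and filter past it
# instead of using OFFSET. ExampleModel is a placeholder model.
from myapp.models import ExampleModel

def next_page(last_max_id=None, per_page=25):
    qs = ExampleModel.objects.order_by('-pk')
    if last_max_id is not None:
        # "SELECT ... WHERE pk < last_max_id" from the comment above.
        qs = qs.filter(pk__lt=last_max_id)
    return list(qs[:per_page])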

comment:4 by Tim Graham, 6 years ago

Resolution: needsinfo
Status: assigned → closed

The technical justification seems lacking both here and on the PR.

comment:5 by M. Javidan Darugar, 5 years ago

Resolution: needsinfo
Status: closed → new

I am reopening this ticket because it is not solved yet. If I am slow to reply to messages, it is because I have a full-time job and rarely have time to work on a side project, but I still love contributing to the Django community.

Some of the comments on the GitHub branch were about naming of variables or function arguments, which suggests there is no standard for naming variables, since everyone can propose something different.

Another comment was largely irrelevant to what I proposed. Just because Postgres has a new feature does not mean it is good for Django, because Django's ORM has to work with other databases as well.

After four months of using the same approach in our project, the performance is still very good for paginating almost 8 million records on the production server.

Last edited 5 years ago by M. Javidan Darugar

in reply to:  3 comment:6 by M. Javidan Darugar, 5 years ago

What do you mean by index? A table index? For a table with this much data, the index itself can be very large; in our case the index used almost 18 GB of disk space. Indexing is a very useful technique, but not always: it depends on the problem and the situation.

in reply to:  2 comment:7 by M. Javidan Darugar, 5 years ago

Well, it is a bit surprising to me as well that many developers don't know about this, because it is a very basic fact about how databases work; maybe it is because most developers focus on the code base rather than on raw queries.

There are good references for reading more about these concepts and facts about databases. This one in particular gives developers great insight (I read the 2014 edition, but I believe a new edition will be published soon):

Joe Celko's SQL for Smarties: Advanced SQL Programming (The Morgan Kaufmann Series in Data Management Systems)

Last edited 5 years ago by M. Javidan Darugar

comment:8 by Can Sarıgöl, 5 years ago

Cc: Can Sarıgöl added

comment:9 by Carlton Gibson, 5 years ago

Resolution: needsinfo
Status: new → closed

Hi. At this point, the issue tracker here is not the appropriate place for this discussion.

Please see TicketClosingReasons/DontReopenTickets.

Rather, a message to the DevelopersMailingList explaining the issue and the proposed change would be needed to progress.

A couple of points to make it easier:

  • This kind of thing has come up previously: search here and the mailing list history to find the discussions. Summarizing that history may help.
  • Wrapping your suggestion in a custom paginator class and providing a sample project, perhaps with some benchmarking data so people can explore your solution, would be useful (a rough sketch of such a wrapper follows this list).
  • In general, the proficiency level of contributors to the ORM is very high. They know how databases work. Assume that.
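
For reference, a hypothetical sketch of such a wrapper (it illustrates the ticket's two-query idea, assumes object_list is an ordered QuerySet, and is not a proposed patch):

from django.core.paginator import Paginator

class PrimaryKeyPaginator(Paginator):
    """
    Hypothetical sketch: slice on the primary keys first, then fetch the
    full rows for the selected page.
    """

    def page(self, number):
        number = self.validate_number(number)
        bottom = (number - 1) * self.per_page
        top = bottom + self.per_page
        if top + self.orphans >= self.count:
            top = self.count
        # Step 1: slice only the primary keys.
        pks = list(self.object_list.values_list('pk', flat=True)[bottom:top])
        # Step 2: fetch the full rows for just those primary keys.
        return self._get_page(self.object_list.filter(pk__in=pks), number, self)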

Thanks.
