Opened 3 years ago

Closed 3 years ago

#32824 closed Cleanup/optimization (fixed)

Potential micro-optimisations for NodeList.render

Reported by: Keryn Knight
Owned by: Keryn Knight
Component: Template system
Version: dev
Severity: Normal
Keywords:
Cc:
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

The current implementation of NodeList.render looks like:

def render(self, context):
    bits = []
    for node in self:
        if isinstance(node, Node):
            bit = node.render_annotated(context)
        else:
            bit = node
        bits.append(str(bit))
    return mark_safe(''.join(bits))

which, given a single template render (template + code to be attached, but it's basically the same fakery as in #32814), is called 41 times. The cProfile output below is trimmed to the portions relevant to this ticket:

   41/1    0.000    0.000    0.007    0.007 base.py:956(render)
    591    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
    191    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
115/113    0.000    0.000    0.000    0.000 safestring.py:53(mark_safe)

Given those numbers are too minuscule to register, for the purposes of surfacing the most time-consuming parts and pushing things like the importlib machinery to the bottom of the profile output, rendering the same template 1000 times gives us:

4897753 function calls (4539046 primitive calls) in 2.832 seconds

 41000/1000    0.193    0.000    2.671    0.003 base.py:956(render)
70045/68045    0.042    0.000    0.095    0.000 safestring.py:50(mark_safe)
     128045    0.053    0.000    0.053    0.000 safestring.py:26(__init__)
     159432    0.023    0.000    0.023    0.000 {built-in method builtins.hasattr}
     114077    0.012    0.000    0.012    0.000 {built-in method builtins.callable}
     686582    0.076    0.000    0.076    0.000 {built-in method builtins.isinstance}
     318184    0.036    0.000    0.036    0.000 {method 'append' of 'list' objects}

(to ensure SafeString.__init__ calls are registered, I manually defined the method, so timings for it aren't correct, but the counts are).

Proposed change

The primary change I'm suggesting is deviating from using mark_safe to directly returning a new SafeString. As far as I know, the result of ''.join(...) is always of type str, regardless of whether any of the underlying bits are themselves SafeString etc., so for each call to NodeList.render we'd be saving one hasattr(...) test and one callable(...) test, which would change the 1000 runs to:

4774747 function calls (4416040 primitive calls) in 2.710 seconds

 41000/1000    0.194    0.000    2.669    0.003 base.py:956(render)
29045/27045    0.019    0.000    0.063    0.000 safestring.py:50(mark_safe)
     128045    0.048    0.000    0.048    0.000 safestring.py:26(__init__)
     118432    0.019    0.000    0.019    0.000 {built-in method builtins.hasattr}
      73077    0.008    0.000    0.008    0.000 {built-in method builtins.callable}
     686582    0.073    0.000    0.073    0.000 {built-in method builtins.isinstance}
     318184    0.038    0.000    0.038    0.000 {method 'append' of 'list' objects}

You can see that whilst the timings don't change substantially, it is a whole bunch fewer calls.
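
For context, the two tests being skipped live inside mark_safe itself, which reads roughly as follows (paraphrased from django.utils.safestring; the _safety_decorator helper is inlined here so the sketch is self-contained):

from functools import wraps

from django.utils.safestring import SafeString

def _safety_decorator(safety_marker, func):
    # Wrap a callable so its eventual return value gets marked safe.
    @wraps(func)
    def wrapped(*args, **kwargs):
        return safety_marker(func(*args, **kwargs))
    return wrapped

def mark_safe(s):
    if hasattr(s, '__html__'):  # the hasattr() test saved on every NodeList.render
        return s
    if callable(s):             # the callable() test saved on every NodeList.render
        return _safety_decorator(mark_safe, s)
    return SafeString(s)

Because ''.join(...) always returns a plain str (which has no __html__ and isn't callable), return SafeString(''.join(bits)) arrives at the same result while skipping both tests.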

Worth investigating?

But possibly more interesting than that is the rest of the method. As far as I can tell from running the test suite (which, to be fair, does output a boatload of skips), the line if isinstance(node, Node): is always true -- that is, there's no test anywhere for what happens if you somehow, somewhere include a non-Node instance in a NodeList. If that's the case (which, again, it may not be, given the skips), then there are 2 possibilities:

  1. add in a test demonstrating what should happen when a non-Node appears in a NodeList and canonically accept it as part of DTL (a sketch of such a test follows this list)
  2. take the stance that NodeLists are inherently composed of only nodes (or at least things with render_annotated(context)) and sweep away the isinstance check.
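
A sketch of what option 1's test might look like (hypothetical test name, not from the ticket; the current implementation does pass it):

from django.template import Context
from django.template.base import NodeList

def test_nodelist_renders_non_node_entries():
    # A bare string takes the `bit = node` branch and is coerced
    # via str() before joining.
    nodelist = NodeList(['not a Node instance'])
    assert nodelist.render(Context()) == 'not a Node instance'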

If option 2 were taken, the final method could be (setting aside any intermediate deprecation steps, if necessary):

def render(self, context):
    bits = []
    append = bits.append
    for node in self:
        append(str(node.render_annotated(context)))
    return SafeString(''.join(bits))

which, compounded with the change from mark_safe to SafeString proposed above would give (again, 1000 renders):

4575748 function calls (4217041 primitive calls) in 2.685 seconds

 41000/1000    0.152    0.000    2.567    0.003 base.py:956(render)
 29045/27045    0.018    0.000    0.062    0.000 safestring.py:53(mark_safe)
      128045    0.048    0.000    0.048    0.000 safestring.py:26(__init__)
     118432    0.019    0.000    0.019    0.000 {built-in method builtins.hasattr}
      73077    0.008    0.000    0.008    0.000 {built-in method builtins.callable}
     487582    0.058    0.000    0.058    0.000 {built-in method builtins.isinstance}
     318184    0.039    0.000    0.039    0.000 {method 'append' of 'list' objects}

for a total saving of 199,000 isinstance checks, 41,000 hasattr checks and 41,000 callable checks (from a starting point of 4897753 function calls and 4539046 primitive calls, down to 4575748 and 4217041 respectively).

Were that final method implementation the target/goal, it may be OK to move from stacking up an intermediate list (the bits variable) to a generator expression fed into the ''.join(...), but briefly using memory_profiler to @profile the method shows it's all 0 allocations anyway (presumably references & gc detection?).
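
For reference, the generator expression version would read as follows (a sketch; in CPython, str.join converts a non-list iterable to a sequence internally anyway, so the main effect is not holding a named reference to the intermediate list):

from django.utils.safestring import SafeString

def render(self, context):
    # No explicit bits list; join() consumes the generator directly.
    return SafeString(''.join(
        str(node.render_annotated(context)) for node in self
    ))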

Additional notes

Yappi is slower than cProfile but suggests the same small wins (amplified in both cases by the profiling machinery slowing everything down).

The difference (when measured via yappi, and reported via PyCharm's handy profile visualiser) between the original method and the ideal replacement would be a decrease from 7.6% of time being spent (directly) on this method to 5.7%.

The difference from just swapping the mark_safe call for the suggested SafeString is obviously less pronounced, but yappi suggests that the time spent in mark_safe goes from 1.9% of total time to 0.9%.

Those proportions are vaguely the same for cProfile, give or take.

Attachments (2)

textnode_test2.py (2.2 KB ) - added by Keryn Knight 3 years ago.
yappi/cprofile/timeit oriented template rendering (1 complex template, fake/minimal context)
nodelist_render_test.py (312 bytes ) - added by Keryn Knight 3 years ago.
synthetic test for isolating just this method, with memory profiler etc


Change History (10)

by Keryn Knight, 3 years ago

Attachment: textnode_test2.py added

yappi/cprofile/timeit oriented template rendering (1 complex template, fake/minimal context)

comment:1 by Jacob Walls, 3 years ago

Tim McCurrach proposed directly returning SafeString in several places in #32568.

Regarding the rest of the proposal, NodeList is documented to consist of Node objects, so option #2 sounds like the right track to me:

``parser.parse()`` takes a tuple of names of block tags *to parse until*. It
returns an instance of ``django.template.NodeList``, which is a list of
all ``Node`` objects that the parser encountered *before* it encountered
any of the tags named in the tuple.

comment:2 by Mariusz Felisiak, 3 years ago

Triage Stage: Unreviewed → Accepted

We don't even need bits, so maybe:

    def render(self, context):
        return SafeString(''.join([
            str(node.render_annotated(context))
            for node in self
        ]))

comment:3 by Keryn Knight, 3 years ago

Owner: changed from nobody to Keryn Knight
Status: new → assigned

I think I'm happier still with that change. I wasn't proposing it myself because I couldn't previously get memory_profiler to play ball, but (at least in the pathological case of a stupid number of child nodes) it does skip the intermediate list, which does have some small effect on runtime performance:

Using a small stub for benchmarking, in ipython:

from nodelist_render_test import render
from django.template.base import NodeList
%load_ext memory_profiler
%mprun -f NodeList.render render()
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   957     58.8 MiB     51.1 MiB          10       def render(self, context):
   958                                                 # return SafeString(''.join([
   959                                                 #     str(node.render_annotated(context))
   960                                                 #     for node in self
   961                                                 # ]))
   962     58.8 MiB      0.0 MiB          10           bits = []
   963     58.8 MiB      0.0 MiB          10           append = bits.append
   964     59.5 MiB      0.0 MiB     1000010           for node in self:
   965     59.5 MiB      7.4 MiB     1000000               append(str(node.render_annotated(context)))
   966     59.6 MiB      1.1 MiB          10           return SafeString(''.join(bits))

and:

%timeit render()
216 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

timeit is consistently between 215ms and 225ms when repeatedly running the above (rather than using -nX00)

switching to the list comprehension version, now I've got memory_profiler to report something:

%mprun -f NodeList.render render()
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   957     58.7 MiB     51.3 MiB          10       def render(self, context):
   958     58.9 MiB -11945.8 MiB     1000040           return SafeString(''.join([
   959     58.8 MiB -11939.6 MiB     1000000               str(node.render_annotated(context))
   960     58.8 MiB -25012.3 MiB     1000010               for node in self
   961                                                 ]))
   962                                                 # bits = []
   963                                                 # append = bits.append
   964                                                 # for node in self:
   965                                                 #     append(str(node.render_annotated(context)))
   966                                                 # return SafeString(''.join(bits))

dunno what to make of the deallocations, but I guess zeroes or negative numbers are better than the 7.4 MiB above ... maybe I really am generating 11GB of nodes in this synthetic benchmark :)

and timeit:

%timeit render()
202 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

timeit was consistently around 200 to 205ms repeatedly running it.
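
For anyone without the attachment, the stub is presumably something along these lines (a hypothetical reconstruction inferred from the occurrence counts in the %mprun output -- ten renders of a 100,000-node list -- reusing the FakeNode name from comment 6; the real nodelist_render_test.py may differ):

# Hypothetical reconstruction of nodelist_render_test.py -- names and
# sizes are inferred, not copied from the attachment.
from django.template import Context
from django.template.base import Node, NodeList

class FakeNode(Node):
    # Subclassing Node matters: see the correction in comment 6.
    def render(self, context):
        return 'x'

nodelist = NodeList(FakeNode() for _ in range(100_000))

def render():
    # Ten renders per call, matching the 10 occurrences per line in %mprun.
    context = Context()
    for _ in range(10):
        nodelist.render(context)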

by Keryn Knight, 3 years ago

Attachment: nodelist_render_test.py added

synthetic test for isolating just this method, with memory profiler etc

comment:4 by Keryn Knight, 3 years ago

Oh, and for completeness, here's the original method before any of these proposed changes, for historic reference:

%mprun -f NodeList.render render()
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   957     62.3 MiB    -43.6 MiB          10       def render(self, context):
   958     62.3 MiB      0.0 MiB          10           bits = []
   959     72.8 MiB -4336382.0 MiB     1000010           for node in self:
   960     72.8 MiB -4336466.3 MiB     1000000               if isinstance(node, Node):
   961                                                         bit = node.render_annotated(context)
   962                                                     else:
   963     72.8 MiB -4333316.3 MiB     1000000                   bit = node
   964     72.8 MiB -4333210.0 MiB     1000000               bits.append(str(bit))
   965     72.8 MiB     10.1 MiB          10           return mark_safe(''.join(bits))

and the timeit (remember this is the pathological case in terms of nodes to iterate over):

578 ms ± 4.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

consistently between 580 and 615ms across repeated runs.

comment:5 by Keryn Knight, 3 years ago

Has patch: set

comment:6 by Keryn Knight, 3 years ago

Sigh. Apologies for the noise, but I went for a coffee and realised I was misrepresenting the performance of the original method (the last set of benchmarks above) by not ensuring my FakeNode was a subclass of Node. As a result, render_annotated wasn't being called, and the string being returned was the class repr, inflating things massively. That's not useful to anyone, so here's a repeat which should more accurately demonstrate the change, for the historic record:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   956     60.2 MiB     52.9 MiB          10       def render(self, context):
   957     60.2 MiB      0.0 MiB          10           bits = []
   958     60.4 MiB -11945.9 MiB     1000010           for node in self:
   959     60.4 MiB -11946.1 MiB     1000000               if isinstance(node, Node):
   960     60.4 MiB -11946.1 MiB     1000000                   bit = node.render_annotated(context)
   961                                                     else:
   962                                                         bit = node
   963     60.4 MiB -11939.6 MiB     1000000               bits.append(str(bit))
   964     60.5 MiB      1.0 MiB          10           return mark_safe(''.join(bits))

and the timeit:

306 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(consistently between 305 and 315ms across repeated runs)

comment:7 by Tim McCurrach, 3 years ago

Triage Stage: Accepted → Ready for checkin

comment:8 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

Resolution: fixed
Status: assigned → closed

In 854e9b0:

Fixed #32824 -- Improved performance of NodeList.render().

This avoids the following:

  • checking that each item in the nodelist is a subclass of Node,
  • calling str() on the render_annotated() output, because it's documented that Node.render() must return a string,
  • calling mark_safe() on the output, when the value to be wrapped is definitively known to be a string because the result of .join() is always of that type,
  • using an intermediate list to store each individual string.
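
Taken together, those points presumably collapse the method to roughly the following (a sketch consistent with the bullets above; see the commit itself for the exact code):

from django.utils.safestring import SafeString

class NodeList(list):
    def render(self, context):
        # render_annotated() is documented to return a string, so no str()
        # coercion; ''.join() yields a plain str, so SafeString() replaces
        # mark_safe().
        return SafeString(''.join([
            node.render_annotated(context) for node in self
        ]))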