Opened 11 years ago
Closed 11 years ago
#21573 closed Cleanup/optimization (fixed)
Cache regular expression for normalizing newlines in django.utils.text.normalize_newlines
Reported by: | Vajrasky Kok | Owned by: | Vajrasky Kok |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | performance |
Cc: | sky.kok@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | yes | UI/UX: | no |
Description
This is the method:
def normalize_newlines(text): return force_text(re.sub(r'\r\n|\r|\n', '\n', text)) normalize_newlines = allow_lazy(normalize_newlines, six.text_type)
It should be:
_newlines_re = re.compile(r'\r\n|\r|\r') def normalize_newlines(text): return force_text(_newlines_re.sub('\n', text)) normalize_newlines = allow_lazy(normalize_newlines, six.text_type)
Things to be considered:
- Do we use this method often in Django? If not, caching may not be necessary.
- From http://docs.python.org/2/library/re.html#re.compile: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
- The performance gain maybe negligible.
Change History (6)
comment:1 by , 11 years ago
Cc: | added |
---|---|
Has patch: | set |
Owner: | changed from | to
Status: | new → assigned |
comment:2 by , 11 years ago
I ran a small benchmark [1] with four variants:
- The original function
- The same function but with a simpler regexp (
r'\r\n|\r'
instead ofr'\r\n|\r|\n'
) - The original function with a compiled regexp
- A function using
str.replace
instead ofre.sub
I tried each variant on 3 different sizes of text: small (~20 lines), medium (~200 lines), and big (~2000 lines).
Here are the results (less is better):
Small | Medium | Big | |
---|---|---|---|
original | 19.2 usec | 141 usec | 1.65 msec |
simple_re | 7.42 usec | 14.8 usec | 135 usec |
compiled_re | 6.15 usec | 13.8 usec | 146 usec |
str_replace | 5.29 usec | 16.4 usec | 130 usec |
The str.replace
looks like a clear winner to me so unless there are objections, I'll commit that one.
comment:3 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|
(Oops, forgot to change the status of the ticket)
comment:4 by , 11 years ago
Objection. I got different result in my machine (32 bit) with Fedora Core 18.
On Python 3.4 (Yeah!!!!):
Function | Small | Medium | Big |
original: | 95 usec | 625 usec | 7.52 msec |
simple_re: | 33.6 usec | 57 usec | 281 usec |
compiled_re: | 27.2 usec | 50.7 usec | 273 usec |
str_replace: | 27.1 usec | 51 usec | 303 usec |
But on Python 2.7, the compiled_re
is still the winner:
Function | Small | Medium | Big |
original: | 34.3 usec | 231 usec | 2.48 msec |
simple_re: | 17.6 usec | 49.6 usec | 343 usec |
compiled_re: | 14.6 usec | 46.5 usec | 342 usec |
str_replace: | 14 usec | 53.3 usec | 414 usec |
comment:5 by , 11 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
Alright, let's go with compiled_re
then (they're all in the same ballpark anyway).
comment:6 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Here is the PR: https://github.com/django/django/pull/2046
The unit test for normalize_newlines can be found in #21572.