Opened 12 years ago
Closed 11 years ago
#20568 closed Bug (fixed)
templatetag truncatewords_html split words containing HTML entities
Reported by: | | Owned by: | Jaap Roes |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | bmispelon@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I'm working on an English/French website, and when I use truncatewords_html with French text containing special characters like "é, è, à, etc." (which is very common), it splits words in half at those characters.
Example:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]
becomes:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, r ...
Change History (6)
comment:1 by , 12 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:2 by , 12 years ago
Resolution: | worksforme |
---|---|
Status: | closed → new |
I can reproduce it, but only if I convert the special characters to HTML entities first. I think that might be the actual cause:
>>> s = u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu&eacute;bec vous permet de vous d&eacute;barrasser d\u2019un vieil appareil, r&eacute;frig&eacute;rateur ou cong&eacute;lateur, facilement'
>>> truncatewords_html(s, 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu ...'
comment:3 by , 12 years ago
Cc: | added |
---|---|
Summary: | templatetag truncatewords_html split words on special caracters → templatetag truncatewords_html split words containing HTML entities |
Triage Stage: | Unreviewed → Accepted |
Version: | 1.3 → master |
Hi,
Thanks for reopening this, there does appear to be an issue.
I made some quick tests and it seems that this behavior has always been present.
The problem seems to be that the regexp used to split words [1] doesn't consider an & to be part of a word, hence the behavior.
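To illustrate the point about & not counting as part of a word, here is a minimal sketch in Python 3. The pattern `WORD_RE` is a simplified, hypothetical stand-in and not the regexp Django actually uses; it only shows how a word pattern built on `\w` breaks a word apart at an entity.

```python
import re

# Hypothetical, simplified word pattern (NOT Django's actual regexp):
# a word is a word character followed by word characters or hyphens.
WORD_RE = re.compile(r"\w[\w-]*")

# With a literal "é", the whole token is one word.
literal = WORD_RE.findall("Hydro-Québec")

# With the entity form, "&" and ";" are not word characters,
# so the same token is split into three pieces.
escaped = WORD_RE.findall("Hydro-Qu&eacute;bec")

print(literal)  # ['Hydro-Québec']
print(escaped)  # ['Hydro-Qu', 'eacute', 'bec']
```

Counting words with such a pattern ends the eighth "word" in the middle of Hydro-Qu&eacute;bec, which matches the truncated output seen above.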
comment:4 by , 11 years ago
What about converting HTML entities back to chars before the regex? Just whipped up a quick proof of concept that seems to work fine (and uses just stdlib code):
>>> import xml.sax.saxutils
>>> import htmlentitydefs
>>> entity2unicode = dict([('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items()])
>>> truncatewords_html(xml.sax.saxutils.unescape(s, entity2unicode), 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec ...'
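For readers on Python 3, the same idea can be sketched with `html.unescape` from the standard library. The helper `truncate_words` and the pattern `WORD_RE` below are hypothetical and deliberately simplified (no HTML tag handling); they are not the code from Django or from the eventual patch, just an illustration of decoding entities before counting words.

```python
import html
import re

# Hypothetical, simplified word pattern (NOT Django's actual regexp).
WORD_RE = re.compile(r"\w[\w-]*")


def truncate_words(text, num, suffix=" ..."):
    """Cut `text` right after its `num`-th word, as matched by WORD_RE."""
    if num <= 0:
        return ""
    matches = list(WORD_RE.finditer(text))
    if len(matches) <= num:
        return text  # Nothing to truncate.
    return text[: matches[num - 1].end()] + suffix


s = "programme RECYC-FRIGO Hydro-Qu&eacute;bec vous permet"

# Counting words on the raw text cuts inside the entity.
raw = truncate_words(s, 3)

# Decoding entities first keeps the word intact.
decoded = truncate_words(html.unescape(s), 3)

print(raw)      # programme RECYC-FRIGO Hydro-Qu ...
print(decoded)  # programme RECYC-FRIGO Hydro-Québec ...
```

One caveat with this approach: decoding also turns `&lt;` and `&amp;` into `<` and `&`, so the result may need to be escaped again before it is rendered as HTML.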
comment:5 by , 11 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Noticed that the django.utils.text module already had an unescape_entities function. So I created this pull request:
comment:6 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Hi,
I cannot reproduce the issue you're describing.
I tried the following code with both 1.3 and master but it seems to be working correctly for me:
I'm closing this ticket as worksforme. Could you please reopen it with an example of a piece of code that shows the issue you're having?
Thanks.