Opened 12 years ago

Closed 11 years ago

#20568 closed Bug (fixed)

templatetag truncatewords_html split words containing HTML entities

Reported by: yann0@…
Owned by: Jaap Roes
Component: Utilities
Version: dev
Severity: Normal
Keywords:
Cc: bmispelon@…
Triage Stage: Accepted
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

I'm working on an English/French website, and when I use truncatewords_html on French text containing special characters like "é, è, à, etc." (which is very common), it splits words in half at those characters.

Example:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]

becomes:
Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, r ...

Change History (6)

comment:1 by Baptiste Mispelon, 12 years ago

Resolution: → worksforme
Status: new → closed

Hi,

I cannot reproduce the issue you're describing.
I tried the following code with both 1.3 and master but it seems to be working correctly for me:

>>> from django.template.defaultfilters import truncatewords_html
>>> s = u"Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]"
>>> truncatewords_html(s, 18)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec vous permet de vous d\xe9barrasser d\u2019un vieil appareil, r\xe9frig\xe9rateur ...'

I'm closing this ticket as worksforme.
Could you please reopen it with an example of a piece of code that shows the issue you're having?

Thanks.

comment:2 by Jaap Roes, 12 years ago

Resolution: worksforme → (none)
Status: closed → new

I can reproduce it, but only if I convert the special characters to HTML entities first. I think that might be the actual cause:

>>> s = u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu&eacute;bec vous permet de vous d&eacute;barrasser d\u2019un vieil appareil, r&eacute;frig&eacute;rateur ou cong&eacute;lateur, facilement'
>>> truncatewords_html(s, 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu ...'

comment:3 by Baptiste Mispelon, 12 years ago

Cc: bmispelon@… added
Summary: templatetag truncatewords_html split words on special caracters → templatetag truncatewords_html split words containing HTML entities
Triage Stage: Unreviewed → Accepted
Version: 1.3 → master

Hi,

Thanks for reopening this, there does appear to be an issue.

I made some quick tests and it seems that this behavior has always been present.

The problem seems to be that the regexp used to split words [1] doesn't consider an & to be part of a word, so an entity such as &eacute; ends the word at the ampersand.
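
To illustrate the diagnosis, here is a minimal Python 3 sketch. The pattern WORD_RE and the helper split_words are hypothetical stand-ins, not the regexp from Django's source; the only assumption is that a word is a run of word characters and hyphens, so & and ; act as word boundaries.

```python
import re

# Hypothetical, simplified word pattern (an assumption, not Django's regexp):
# a word is a run of Unicode word characters or hyphens.
WORD_RE = re.compile(r"[\w-]+")


def split_words(text):
    """Return the 'words' that the simplified pattern finds in text."""
    return WORD_RE.findall(text)


# A literal accented character is a word character, so the word stays whole.
print(split_words("Hydro-Qu\u00e9bec"))      # ['Hydro-Québec']

# The same word written with an HTML entity breaks at '&' and ';'.
print(split_words("Hydro-Qu&eacute;bec"))    # ['Hydro-Qu', 'eacute', 'bec']
```

With a pattern like this, a word counter sees three short "words" where the reader sees one, which matches the truncated output reported above.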

comment:4 by Jaap Roes, 11 years ago

What about converting HTML entities back to characters before applying the regex? I just whipped up a quick proof of concept that seems to work fine (and uses only stdlib code):

>>> import xml.sax.saxutils
>>> import htmlentitydefs
>>> entity2unicode = dict([('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items()])
>>> truncatewords_html(xml.sax.saxutils.unescape(s, entity2unicode), 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec ...'
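
For readers on Python 3, where htmlentitydefs and unichr no longer exist, roughly the same idea can be sketched with the standard library's html.unescape. The truncate_words helper and its WORD_RE pattern below are hypothetical stand-ins for the filter (no HTML tag handling), included only so the snippet runs without Django; the sample string is a shortened version of the one in this ticket.

```python
import html
import re

# Hypothetical, simplified word pattern (an assumption, not Django's regexp):
# word characters, hyphens, and the typographic apostrophe.
WORD_RE = re.compile(r"[\w\u2019-]+")


def truncate_words(text, num, ellipsis=" ..."):
    """Hypothetical stand-in for the filter: cut text after num words."""
    if num < 1:
        return ""
    matches = list(WORD_RE.finditer(text))
    if len(matches) <= num:
        return text
    # Slice up to the end of the num-th word, then append the ellipsis.
    return text[: matches[num - 1].end()] + ellipsis


s = "le programme RECYC-FRIGO d\u2019Hydro-Qu&eacute;bec vous permet"

# Without decoding, the entity splits the fourth word.
print(truncate_words(s, 4))                 # le programme RECYC-FRIGO d’Hydro-Qu ...

# Decoding entities first keeps the word intact.
print(truncate_words(html.unescape(s), 4))  # le programme RECYC-FRIGO d’Hydro-Québec ...
```

One caveat with this approach: html.unescape also decodes entities such as &lt; and &amp;, so decoded text may need re-escaping before it is rendered as HTML.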

comment:5 by Jaap Roes, 11 years ago

Has patch: set
Owner: changed from nobody to Jaap Roes
Status: new → assigned

I noticed that the django.utils.text module already has an unescape_entities function, so I created this pull request:

https://github.com/django/django/pull/1332

comment:6 by Tim Graham <timograham@…>, 11 years ago

Resolution: → fixed
Status: assigned → closed

In 40b95a24ae159b6600457a23d6c2779a18037b7b:

Fixed #20568 -- truncatewords_html no longer splits words containing HTML entities.

Thanks yann0 at hotmail.com for the report.
