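I'm working on an English / French website, and when I use truncatewords_html on French text containing accented characters such as "é, è, à, etc." (which are very common), it splits words in half at those characters. Below, the first text is the original and the second is the truncated output, which is cut in the middle of "réfrigérateur".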

Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]

Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, r ...

comment:1 by Baptiste Mispelon, 12 years ago

I cannot reproduce the issue you're describing.
I tried the following code with both 1.3 and master but it seems to be working correctly for me:

>>> from django.template.defaultfilters import truncatewords_html
>>> s = u"Depuis mars 2008, le programme RECYC-FRIGO d’Hydro-Québec vous permet de vous débarrasser d’un vieil appareil, réfrigérateur ou congélateur, facilement [...]"
>>> truncatewords_html(s, 18)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec vous permet de vous d\xe9barrasser d\u2019un vieil appareil, r\xe9frig\xe9rateur ...'

I'm closing this ticket as worksforme.
Could you please reopen it with an example of a piece of code that shows the issue you're having?


comment:2 by Jaap Roes, 12 years ago

I can reproduce it, but only if I convert the special characters to HTML entities first. I think that might be the actual cause:

>>> s = u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu&eacute;bec vous permet de vous d&eacute;barrasser d\u2019un vieil appareil, r&eacute;frig&eacute;rateur ou cong&eacute;lateur, facilement'
>>> truncatewords_html(s, 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu ...'

comment:3 by Baptiste Mispelon, 12 years ago

Thanks for reopening this, there does appear to be an issue.

I made some quick tests and it seems that this behavior has always been present.

The problem seems to be that the regexp used to split words [1] doesn't consider an & to be part of a word, so a word containing an entity such as &eacute; ends at the ampersand.
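
To illustrate the point about the word-splitting regexp, here is a minimal sketch using only the standard library. Both patterns are hypothetical simplifications written for this example and are not Django's actual regexp; the second one shows one possible way of treating an entity as part of the surrounding word.

```python
import re

# Hypothetical, simplified patterns for illustration only (not Django's regexp).
# NAIVE_WORD: a word is a run of word characters and hyphens, so "&" ends it.
NAIVE_WORD = re.compile(r"\w[\w-]*")
# ENTITY_AWARE_WORD: an entity such as &eacute; or &#233; may appear inside a word.
ENTITY_AWARE_WORD = re.compile(r"(?:\w[\w-]*|&#?\w+;)+")

text = "Hydro-Qu&eacute;bec vous permet"

naive_words = NAIVE_WORD.findall(text)
entity_aware_words = ENTITY_AWARE_WORD.findall(text)

print(naive_words)         # ['Hydro-Qu', 'eacute', 'bec', 'vous', 'permet']
print(entity_aware_words)  # ['Hydro-Qu&eacute;bec', 'vous', 'permet']
```

With the naive pattern, a truncation that counts "Hydro-Qu" as a complete word would stop right before the entity, which matches the output reported above.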

comment:4 by Jaap Roes, 12 years ago

What about converting HTML entities back to characters before applying the regex? I just whipped up a quick proof of concept that seems to work fine (and uses only stdlib code):

>>> import xml.sax.saxutils
>>> import htmlentitydefs
>>> entity2unicode = dict([('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items()])
>>> truncatewords_html(xml.sax.saxutils.unescape(s, entity2unicode), 8)
u'Depuis mars 2008, le programme RECYC-FRIGO d\u2019Hydro-Qu\xe9bec ...'
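
The snippet above is Python 2 (htmlentitydefs and unichr no longer exist under those names in Python 3). As a rough Python 3 sketch of the same idea, html.unescape from the standard library can decode the entities first. The truncate_words helper and its WORD pattern below are hypothetical stand-ins for the real filter, used only so the example runs without Django.

```python
import html
import re

# Hypothetical simplified word pattern; "&" is not a word character here.
WORD = re.compile(r"\w[\w-]*")


def truncate_words(text, num, ellipsis=" ..."):
    """Hypothetical stand-in for the filter: keep the first `num` words."""
    matches = list(WORD.finditer(text))
    if len(matches) <= num:
        return text
    # Cut right after the end of the num-th word.
    return text[:matches[num - 1].end()] + ellipsis


escaped = "Hydro-Qu&eacute;bec vous permet de vous d&eacute;barrasser"
unescaped = html.unescape(escaped)  # entities become real characters

print(truncate_words(escaped, 1))    # Hydro-Qu ...
print(truncate_words(unescaped, 1))  # Hydro-Québec ...
```

One caveat with this approach: html.unescape also turns &lt; and &amp; into < and &, so text that was escaped on purpose may need to be escaped again before it is rendered as HTML.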

comment:5 by Jaap Roes, 12 years ago

Noticed that the django.utils.text module already had an unescape_entities function. So I created this pull request:

comment:6 by Tim Graham <timograham@…>, 11 years ago

In 40b95a24ae159b6600457a23d6c2779a18037b7b:

Fixed #20568 -- truncatewords_html no longer splits words containing HTML entities.

Thanks yann0 for the report.

