#8391 closed Bug (wontfix)
slugify template filter poorly encodes non-English strings
Reported by: | Bjorn Kristinsson | Owned by: | nobody |
---|---|---|---|
Component: | Template system | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | hr.bjarni+django@…, kmike84@…, mmitar@… | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Going through the admin interface with a slug field, 'bøøøø' becomes 'boooo' (as expected)
But running this code:
from django.template.defaultfilters import slugify
print slugify('bøøøø')
print slugify(u'bøøøø')
results in:
'b'
'ba-a-a-a'
Results vary depending on which characters are used; I found this while trying to inject a bunch of Cyrillic and Greek into a database, and most of the slug fields came out empty. Entering them manually through the admin interface worked fine.
Change History (39)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
For the first example, the expected results aren't well defined: Unicode characters can't be reliably represented in a bytestring. I think Django does smart_unicode on the input so it works for UTF-8 byte strings but that's just Django being flexible.
For the second example, I suspect where you've typed u'bøøøø', Python has interpreted your unicode string literal as the wrong character set, equivalent to
u'bøøøø'.encode('utf8').decode('iso-8859-1')
Putting the same thing in a script with a PEP-263 header gives 'b' for the second example.
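The mis-decoding described in this comment can be reproduced. Below is a minimal sketch, assuming Python 3 (where str is always Unicode text); legacy_slugify is a hypothetical stand-in that mirrors the normalize, strip and hyphenate steps quoted in this ticket, not Django's current implementation.

```python
import re
import unicodedata


def legacy_slugify(value):
    # Hypothetical re-creation of the filter steps quoted in this ticket.
    value = unicodedata.normalize('NFKD', value)
    value = value.encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


intended = 'bøøøø'
# Simulate a terminal or source file whose UTF-8 bytes were read as Latin-1.
misread = intended.encode('utf-8').decode('iso-8859-1')

print(repr(misread))            # 'bÃ¸Ã¸Ã¸Ã¸'
print(legacy_slugify(intended))  # b
print(legacy_slugify(misread))   # ba-a-a-a
```

Each 'ø' turns into 'Ã' followed by a cedilla sign; NFKD reduces the first to 'A' plus a combining mark and the second to a space plus a combining mark, which is where the 'a-' pattern comes from.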
follow-up: 4 comment:3 by , 16 years ago
The reason why it doesn't give the same result in the admin and via the code above is that different algorithms are used.
In the admin, it is done with some javascript, and odd characters are replaced by their latin 'equivalent', in particular:
var LATIN_MAP = { ... 'ø': 'o', ... }
In the template filter, odd characters that are not representable in ASCII are simply stripped out, see the 'ignore' below:
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
Maybe the filter should replicate the javascript's algorithm, or vice versa, to make things homogeneous?
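To make the difference between the two approaches concrete, here is a small sketch, assuming Python 3. The two-entry LATIN_MAP is an illustrative excerpt, and the helper names strip_via_nfkd and map_lookup are made up for this example.

```python
import unicodedata

# Illustrative excerpt only; the admin JavaScript map is much larger.
LATIN_MAP = {'ø': 'o', 'é': 'e'}


def strip_via_nfkd(text):
    # Template-filter style: decompose, then drop whatever is not ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def map_lookup(text):
    # Admin style: replace each character through an explicit table.
    return ''.join(LATIN_MAP.get(ch, ch) for ch in text)


for word in ('café', 'bøøøø'):
    print(word, strip_via_nfkd(word), map_lookup(word))
```

'é' survives both routes because it decomposes into 'e' plus a combining accent, whereas 'ø' has no Unicode decomposition and is only kept when a table supplies a replacement.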
comment:4 by , 16 years ago
Yep, I've been digging in and found the same. I absolutely think the two should work the same way, especially since it would make my life so much easier ;)
var LATIN_MAP = { ... 'ø': 'o', ... }
I'm trying to translate the javascript into Python, but Python's handling of this is giving me a headache. In the javascript function, there's a regular expression that matches either a single character from the list of special characters or a run of one or more characters not in that list.
As an example, 'bjørn' becomes 'bj', 'ø' and 'rn'. It then tries to find these in the above map: LATIN_MAP['bj'] returns nothing so it's left unchanged, LATIN_MAP['ø'] returns 'o', and LATIN_MAP['rn'] nothing again. The result is 'bjorn'.
Python, on the other hand, matches 'bj', '\xc3' and '\xb8rn'. There is no match for '\xc3' in LATIN_MAP, only in the regexp. Now I just need to find a way of splitting this correctly, any ideas?
comment:5 by , 16 years ago
... I really need to start using the 'Preview' function. Hope that's legible.
comment:7 by , 16 years ago
Summary: | Results of slugify in the admin interface differ from the one in shell. → Admin slugify function's results defer from those of slugify template filter |
---|
There's another difference between the two algorithms. In the admin, small words are stripped out by the javascript. For example, "This is a sentence with small words" returns "sentence-small-words", whereas the template filter gives "this-is-a-sentence-with-small-words".
Ideally, this word replacement should be configurable per-language. Maybe it already is, I've never tried.
Another remark. In the admin, the javascript function is called 'URLify', so that's maybe for a good reason...
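The small-word behaviour can be sketched as follows, assuming Python 3. SMALL_WORDS is a hypothetical subset chosen for this example, not the list shipped with the admin JavaScript, and both function names are invented here.

```python
import re

# Hypothetical stop-word subset, just enough for the example sentence.
SMALL_WORDS = {'a', 'an', 'is', 'of', 'the', 'this', 'with'}


def plain_slug(text):
    # Keep word characters, whitespace and hyphens; join words with hyphens.
    text = re.sub(r'[^\w\s-]', '', text).strip().lower()
    return re.sub(r'[-\s]+', '-', text)


def slug_without_small_words(text):
    # Drop the small words first, then slugify what is left.
    kept = [word for word in text.split() if word.lower() not in SMALL_WORDS]
    return plain_slug(' '.join(kept))


sentence = 'This is a sentence with small words'
print(plain_slug(sentence))                # this-is-a-sentence-with-small-words
print(slug_without_small_words(sentence))  # sentence-small-words
```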
comment:8 by , 16 years ago
milestone: | → 1.0 maybe |
---|
comment:9 by , 16 years ago
Summary: | Admin slugify function's results defer from those of slugify template filter → Admin slugify function's results differ from those of slugify template filter |
---|
Unfortunately, slugify isn't very well-defined outside of English. In German, for example, you would want to slugify 'Grüß' as 'gruess', but the same logic doesn't apply in other languages, where generally you can just omit accents at a pinch. We could build JSON transliteration character maps, but for i18n we would need several so that they can be selected based on locale. As an alternative, we could just do something with IRIs instead of trying to coerce Unicode to a "nice" ASCII string.
comment:10 by , 16 years ago
Perhaps some sort of overriding mechanism could be implemented, say a dictionary in settings.py that is appended to and overrides the defaults? So by default 'ö' becomes 'o', but this behaviour could be changed by something like:
CHARACTER_MAP = { 'ö': 'oe', 'ü': 'ue', ..... }
But julien raises a good point, the javascript function is called URLify, not slugify, so perhaps the issue is not that slugify is 'wrong', but more that we need a python equivalent of URLify?
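One way to read the override idea, as a sketch under assumptions (Python 3; DEFAULT_MAP, GERMAN_OVERRIDES and transliterate are hypothetical names, and no such setting exists in Django):

```python
# Hypothetical defaults; a real table would cover many more characters.
DEFAULT_MAP = {'ö': 'o', 'ü': 'u', 'ß': 'ss'}

# What a project might put in its settings to get German-style output.
GERMAN_OVERRIDES = {'ö': 'oe', 'ü': 'ue'}


def transliterate(text, overrides=None):
    # Start from the defaults and let project-specific entries win.
    mapping = dict(DEFAULT_MAP)
    mapping.update(overrides or {})
    return ''.join(mapping.get(ch, ch) for ch in text)


print(transliterate('Grüß'))                    # Gruss
print(transliterate('Grüß', GERMAN_OVERRIDES))  # Gruess
```

The defaults stay untouched, so a project that supplies no overrides keeps the accent-dropping behaviour.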
comment:11 by , 16 years ago
My efforts at translating the javascript so far fall flat at the way python matches things:
import re

LATIN_MAP = {
    'ö': 'o'
}

regex = re.compile('[ö]|[^ö]+')
pieces = regex.findall('björn')

downcoded = ""
for piece in pieces:
    mapped = ""
    try:
        mapped = LATIN_MAP[piece]
    except:
        mapped = piece
    downcoded += mapped

print pieces, downcoded

# Expected: ['bj', 'ö', 'rn'] bjorn
# Result: ['bj', '\xc3', '\xb6', 'rn'] björn
I.e. LATIN_MAP['ö'] isn't looked up, but LATIN_MAP['\xc3'] and LATIN_MAP['\xb6'] are, separately. LATIN_MAP['\xc3\xb6'] would work, but how to make sure these 'stay together' is something that leaves me stumped.
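The split into '\xc3' and '\xb6' is what a regular expression does when both the pattern and the input are UTF-8 bytestrings. A minimal sketch, assuming Python 3, where bytes stands in for a Python 2 bytestring and str for a Python 2 unicode string:

```python
import re

LATIN_MAP = {'ö': 'o'}

# Text pattern on text input: 'ö' is one character.
text_pattern = re.compile('[ö]|[^ö]+')
text_pieces = text_pattern.findall('björn')

# Byte pattern on byte input: 'ö' is two bytes, and the character class
# treats each of them as a separate alternative.
byte_pattern = re.compile('[ö]|[^ö]+'.encode('utf-8'))
byte_pieces = byte_pattern.findall('björn'.encode('utf-8'))

downcoded = ''.join(LATIN_MAP.get(piece, piece) for piece in text_pieces)

print(text_pieces)  # ['bj', 'ö', 'rn']
print(byte_pieces)  # [b'bj', b'\xc3', b'\xb6', b'rn']
print(downcoded)    # bjorn
```

Working on decoded text throughout keeps each multi-byte character together, so the map lookup sees 'ö' as a single key.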
comment:12 by , 16 years ago
I think the template filter should accept an argument, so slugify:"de" might add in the German-specific rules, and so on; a settings.py option would just choose whether that was done by default. But it would be difficult to ensure those tables are comprehensive enough. I note that libiconv has rudimentary support for transliteration, so perhaps we could use their data.
Slugs and URLs are the same thing, imho.
If we can adequately provide IRIs, on the other hand, the slugify operation can be well-defined, eg.
>>> slugify(u'Føø Bär Baß')
u'føø-bär-baß'
@bjornkri: You're still using UTF-8 encoded bytestrings. You must use unicode strings.
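A sketch of the IRI-style behaviour proposed in this comment, assuming Python 3, where \w already matches non-ASCII letters in str patterns. unicode_slugify is a hypothetical name; the percent-encoding step uses urllib.parse.quote from the standard library to show what the same slug looks like inside a plain ASCII URL.

```python
import re
from urllib.parse import quote


def unicode_slugify(value):
    # No ASCII coercion: keep every Unicode word character as it is.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


slug = unicode_slugify('Føø Bär Baß')
print(slug)         # føø-bär-baß
print(quote(slug))  # percent-encoded UTF-8 form of the same slug
```

Nothing is transliterated here, so the question of which ASCII letters should replace 'ø' or 'ß' never arises.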
comment:13 by , 16 years ago
@Daniel Pope
Using IRIs seems quite promising. I guess language specific rules could be stored in the local flavors.
comment:14 by , 16 years ago
Hmmmm... this ticket is looking less and less like a ticket, and more and more like an email discussion. Should it be brought to the dev-list? Volunteer?
comment:15 by , 16 years ago
#7980 aims to improve i18n by using CLDR data. What we need for slugify should be in there somewhere.
comment:16 by , 16 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Okay, this is all a bit of a non-issue. The Javascript and Python versions are not intended to give the same results. There is much more functionality available at the Python level, for a start, in the form of codecs and unicode mapping data. Secondly, we're not interested in shipping more and more data over in the Javascript file. The Javascript version works reasonably well for a bunch of cases. It doesn't work at all in other cases (e.g. Japanese text). In still other cases it gives some result and some people may prefer something else. The point is that it doesn't matter. It's just an aid. If you don't like what the aid gives you, you can happily edit the field in the admin, or always do it on the Python side, or create your own Javascript function to use.
The Javascript function is not meant to be something that works perfectly for everybody because transliteration is a very ambiguous area. If it doesn't work for your purposes, don't use it.
comment:17 by , 16 years ago
Component: | Uncategorized → Template system |
---|---|
milestone: | 1.0 maybe |
Resolution: | wontfix |
Status: | closed → reopened |
Summary: | Admin slugify function's results differ from those of slugify template filter → slugify template filter poorly encodes non-English strings |
Sorry Malcolm, I may have subverted this ticket a little by talking about generalised handling of slugs, and thrown you off the scent.
The original ticket was about a broken Python slugify filter, not a broken Javascript function. It was simply an observation on bjornkri's part that the admin javascript works better. The Python filter is not "just an aid". It should produce acceptably good results, which it has not done for the string u'bøøøø'.
Reopening.
comment:18 by , 16 years ago
I have created a function that downcodes a string in a way similar to what urlify does, but in Python.
This can be used in conjunction with slugify like this:
slug = slugify(downcode(u'Γειά σου κόσμε!'))
Or it can be called from within slugify if the developers agree to merge it in!
Have fun!
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# (c) 2008 Harry Kalogirou <harkal@gmail.com>
#
# * Language maps taken from django's javascript urlify
#

import re

LATIN_MAP = {
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A', u'Æ': 'AE', u'Ç':'C',
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I', u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O', u'Õ': 'O', u'Ö':'O',
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U', u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a', u'ã': 'a', u'ä':'a',
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e', u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n', u'ò': 'o', u'ó':'o',
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u', u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y'
}

LATIN_SYMBOLS_MAP = {
    u'©':'(c)'
}

GREEK_MAP = {
    u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z', u'η':'h', u'θ':'8',
    u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3', u'ο':'o', u'π':'p',
    u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x', u'ψ':'ps', u'ω':'w',
    u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h', u'ώ':'w', u'ς':'s',
    u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z', u'Η':'H', u'Θ':'8',
    u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3', u'Ο':'O', u'Π':'P',
    u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X', u'Ψ':'PS', u'Ω':'W',
    u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H', u'Ώ':'W', u'Ϊ':'I',
    u'Ϋ':'Y'
}

TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C', u'ü':'u', u'Ü':'U',
    u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}

RUSSIAN_MAP = {
    u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e', u'ё':'yo', u'ж':'zh',
    u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m', u'н':'n', u'о':'o',
    u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f', u'х':'h', u'ц':'c',
    u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'', u'э':'e', u'ю':'yu',
    u'я':'ya',
    u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E', u'Ё':'Yo', u'Ж':'Zh',
    u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M', u'Н':'N', u'О':'O',
    u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F', u'Х':'H', u'Ц':'C',
    u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'', u'Э':'E', u'Ю':'Yu',
    u'Я':'Ya'
}

UKRAINIAN_MAP = {
    u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i', u'ї':'yi', u'ґ':'g'
}

CZECH_MAP = {
    u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s', u'ť':'t', u'ů':'u',
    u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R', u'Š':'S', u'Ť':'T',
    u'Ů':'U', u'Ž':'Z'
}

POLISH_MAP = {
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z',
    u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N', u'Ó':'o', u'Ś':'S',
    u'Ź':'Z', u'Ż':'Z'
}

LATVIAN_MAP = {
    u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k', u'ļ':'l', u'ņ':'n',
    u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E', u'Ģ':'G', u'Ī':'i',
    u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z'
}

def _makeRegex():
    # Merge all language maps into one lookup table.
    ALL_DOWNCODE_MAPS = {}
    ALL_DOWNCODE_MAPS.update(LATIN_MAP)
    ALL_DOWNCODE_MAPS.update(LATIN_SYMBOLS_MAP)
    ALL_DOWNCODE_MAPS.update(GREEK_MAP)
    ALL_DOWNCODE_MAPS.update(TURKISH_MAP)
    ALL_DOWNCODE_MAPS.update(RUSSIAN_MAP)
    ALL_DOWNCODE_MAPS.update(UKRAINIAN_MAP)
    ALL_DOWNCODE_MAPS.update(CZECH_MAP)
    ALL_DOWNCODE_MAPS.update(POLISH_MAP)
    ALL_DOWNCODE_MAPS.update(LATVIAN_MAP)

    # Match either one mapped character or a run of unmapped characters.
    s = u"".join(ALL_DOWNCODE_MAPS.keys())
    regex = re.compile(u"[%s]|[^%s]+" % (s,s))

    return ALL_DOWNCODE_MAPS, regex

_MAPINGS = None
_regex = None

def downcode(s):
    """
    This function 'downcodes' the string passed in the parameter s. This
    is useful when we want the closest representation of a multilingual
    string in simple latin chars. The most probable use is before
    calling slugify.
    """
    global _MAPINGS, _regex

    # Build the table and the regex lazily, on first use.
    if not _regex:
        _MAPINGS, _regex = _makeRegex()

    downcoded = ""
    for piece in _regex.findall(s):
        if piece in _MAPINGS:
            downcoded += _MAPINGS[piece]
        else:
            downcoded += piece
    return downcoded

if __name__ == "__main__":
    string = u'Καλημέρα Joe!'
    print 'Original :', string
    print 'Downcoded :', downcode(string)
comment:19 by , 16 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Please don't reopen tickets closed by a committer. The correct way to revisit issues is to take it up on django-dev.
comment:20 by , 16 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
Jacob, I probably wontfixed in error, due to the confusion sown by Daniel. It's worth looking at this.
comment:21 by , 16 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:22 by , 16 years ago
Another datapoint: In my language both 'å' and 'ø' are valid words in and of themselves... the standard slugify reduces both of these to ''. Oopsy. 'æææææææ' is a popular way to describe a scream, it also becomes... ''.
I have my own slugify-function that turns the unicode-string into NFKD then slugifys that, then checks that the string isn't empty and if it is: adds a dummy string + the datetime + random string. This is independent of locale, which I consider a bonus.
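The fallback described in this comment might look roughly like the sketch below, assuming Python 3. The function name, the 'item' prefix, the timestamp format and the six-letter suffix are all choices made for this example; the commenter's actual code is not shown in the ticket.

```python
import datetime
import random
import re
import string
import unicodedata


def slugify_with_fallback(text, prefix='item'):
    # Normalize to NFKD and drop anything that is not ASCII.
    ascii_text = unicodedata.normalize('NFKD', text)
    ascii_text = ascii_text.encode('ascii', 'ignore').decode('ascii')
    slug = re.sub(r'[^\w\s-]', '', ascii_text).strip().lower()
    slug = re.sub(r'[-\s]+', '-', slug)
    if slug:
        return slug
    # Nothing survived: build a dummy slug from the time and a random tail.
    stamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    tail = ''.join(random.choice(string.ascii_lowercase) for _ in range(6))
    return '%s-%s-%s' % (prefix, stamp, tail)


print(slugify_with_fallback('Hello World'))  # hello-world
print(slugify_with_fallback('æææææææ'))      # an 'item-...' placeholder
```

The placeholder is not meaningful to a reader, but it keeps a slug field from ending up empty.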
comment:24 by , 15 years ago
Why not use proper Unicode transliteration package like http://pypi.python.org/pypi/Unidecode/0.04.1 ? Transliteration is currently the best way to go Unicode->ASCII
comment:25 by , 15 years ago
That package is too big to bundle and too trivial for Django to depend strongly upon. But it's a good starting place if you want to write your own slugify filter.
comment:26 by , 14 years ago
Cc: | added |
---|
Hi, this ticket is way too old for such a trivial feature (for programmers based in non-English-speaking countries).
When slugifying characters in my language, some of them are properly downgraded to ascii lookalikes but some of them get lost. That's pretty irritating for such a trivial feature.
There seems to be a consensus in other web frameworks on slugifying international characters, and that is to have a map.
Did you (core programmers) take a look at harkal's suggestion above? This is the way e.g. WordPress and others go about solving this problem.
I also found this ready made function: slughify
It is small, concise and it works.
If you decide you don't want/need to fix/upgrade the slugify function, or if you think it will take a very long time before you decide, then I'd like to suggest that it be made into a setting as soon as possible:
SLUGIFY_FUNCTION = myown_slugify
with django.template.defaultfilters.slugify as the default
But optimally the function provided by django should work for all languages in my opinion.
Thanks
comment:27 by , 14 years ago
Cc: | added |
---|
comment:28 by , 14 years ago
Cc: | added |
---|
comment:29 by , 14 years ago
Cc: | removed |
---|
comment:30 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
comment:31 by , 14 years ago
Cc: | added |
---|---|
Easy pickings: | unset |
UI/UX: | unset |
I have made a slugify2 function, which first downcodes and then converts to a slug. It behaves exactly the same as its JavaScript counterpart, so it is now possible to have the same behavior in both Python and JavaScript.
comment:32 by , 13 years ago
Triage Stage: | Design decision needed → Accepted |
---|
see #16853 for a Turkish case
Seems that there have been no objections to the downcode then slugify approach.
This seems ready for someone to take a shot at implementing that approach in a patch.
comment:34 by , 13 years ago
Above slugify2 function won't fix #16853.
# -*- coding: utf-8 -*-
import sys
import re
from django.utils import encoding

TURKISH_MAP = {
    u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C',
    u'ü':'u', u'Ü':'U', u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G'
}

ALL_DOWNCODE_MAPS = [
    TURKISH_MAP,
]

class Downcoder(object):
    map = {}
    regex = None

    def __init__(self):
        self.map = {}
        chars = u''
        for lookup in ALL_DOWNCODE_MAPS:
            for c, l in lookup.items():
                self.map[c] = l
                chars += encoding.force_unicode(c)
        self.regex = re.compile(ur'[' + chars + ']|[^' + chars + ']+', re.U)

downcoder = Downcoder()

def downcode(value):
    downcoded = u''
    pieces = downcoder.regex.findall(value)
    if pieces:
        for p in pieces:
            mapped = downcoder.map.get(p)
            if mapped:
                downcoded += mapped
            else:
                downcoded += p
    else:
        downcoded = value
    return value

def slugify2(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = downcode(value)
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    return re.sub('[-\s]+', '-', value)

print(slugify2(u"Işık ılık süt iç"))
This prints "isk-lk-sut-ic", but the expected value is "isik-ilik-sut-ic".
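Reading the snippet as posted, downcode() builds downcoded but then returns the untouched value, which would explain the output on its own. The sketch below is a Python 3 port written for this note, not the slugify2 code from the linked project; the return_mapped flag is a hypothetical switch added only to show both behaviours side by side.

```python
import re
import unicodedata

TURKISH_MAP = {
    'ş': 's', 'Ş': 'S', 'ı': 'i', 'İ': 'I', 'ç': 'c', 'Ç': 'C',
    'ü': 'u', 'Ü': 'U', 'ö': 'o', 'Ö': 'O', 'ğ': 'g', 'Ğ': 'G',
}
_CHARS = ''.join(TURKISH_MAP)
# One mapped character, or a run of characters that are not in the map.
_REGEX = re.compile('[' + _CHARS + ']|[^' + _CHARS + ']+')


def downcode(value, return_mapped=True):
    pieces = _REGEX.findall(value)
    downcoded = ''.join(TURKISH_MAP.get(piece, piece) for piece in pieces)
    # return_mapped=False mimics the posted snippet, which returns `value`.
    return downcoded if return_mapped else value


def slugify2(value, return_mapped=True):
    value = downcode(value, return_mapped)
    value = unicodedata.normalize('NFD', value)
    value = value.encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


print(slugify2('Işık ılık süt iç', return_mapped=False))  # isk-lk-sut-ic
print(slugify2('Işık ılık süt iç'))                       # isik-ilik-sut-ic
```

With the mapped string returned, the dotless 'ı' is replaced before normalization instead of being dropped by the ASCII encoding step. Whether this alone settles the case in #16853 is not something the sketch can confirm.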
comment:36 by , 12 years ago
Status: | reopened → new |
---|
comment:37 by , 10 years ago
Note that slugify2 is now here: https://github.com/mitar/django-missing/blob/master/missing/templatetags/url_tags.py
comment:38 by , 10 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
There's obviously more than one way to achieve slugification, depending on your tastes and constraints.
If we try to be smart, we'll get dozens and dozens of tickets from people who want to be smarter -- see the urlize filter for an example.
Django's implementation has the advantage of being simple and relying only on the stdlib. Pretty good solutions are available externally.
The drawbacks of implementing something more complicated outweigh the advantages at this stage.