#33218 closed Bug (invalid)
slugify() can't handle Turkish İ while allow_unicode = True
Reported by: | sowinski | Owned by: | nobody |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | slugify |
Cc: | | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Please see the following example.
The first character of test_str = "i̇zmit" is not a normal i. It is the İ from the Turkish alphabet.
Using allow_unicode=True should keep the Turkish İ instead of replacing it with a normal i.
    import unicodedata
    import re

    def slugify(value, allow_unicode=False):
        """
        Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
        dashes to single dashes. Remove characters that aren't alphanumerics,
        underscores, or hyphens. Convert to lowercase. Also strip leading and
        trailing whitespace, dashes, and underscores.
        """
        value = str(value)
        if allow_unicode:
            value = unicodedata.normalize('NFKC', value)
        else:
            value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
        value = re.sub(r'[^\w\s-]', '', value.lower())
        return re.sub(r'[-\s]+', '-', value).strip('-_')

    test_str = "i̇zmit"
    output = slugify(test_str, allow_unicode=True)
    print(test_str)
    print(output)
    print(test_str == output)
Change History (2)
follow-up: 2 comment:1 by , 3 years ago
Component: | CSRF → Utilities |
---|---|
Resolution: | → invalid |
Status: | new → closed |
comment:2 by , 3 years ago
Thank you for the fast response.
I do not agree. Because of this behavior, it would be impossible to create an article for İstanbul while allow_unicode=True.
https://tr.wikipedia.org/wiki/%C4%B0stanbul
Maybe someone else has an international website and will hit this problem.
I solved the problem by adding the i̇ to the character class of the regular expression:
value = re.sub(r'[^\w\si̇-]', '', value.lower())
I tested the implementation with all cities in the world, including all the different language variants of each city name, and it worked for me.
http://www.geonames.org/
It is interesting to see that this is the only edge case. I am not sure whether this will work in all situations, so I only run my modification if the strange i is in the string. Otherwise I fall back to the Django implementation.
See: https://github.com/wagtail/wagtail/issues/7637#issuecomment-949366560
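The workaround described in this comment could be sketched roughly as follows. This is only an illustration under assumptions, not the reporter's actual code: `base_slugify` is a local stand-in copied from the function quoted in the description (in a real project it would presumably be `django.utils.text.slugify`), `slugify_keeping_dot` is a hypothetical name, and the character class adds only the combining dot (U+0307) rather than the two-character `i̇` sequence used above, which has the same effect because `i` already matches `\w`.

```python
import re
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"


def base_slugify(value, allow_unicode=False):
    # Stand-in for the slugify() quoted in the ticket description.
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = (
            unicodedata.normalize("NFKD", value)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def slugify_keeping_dot(value, allow_unicode=False):
    # Hypothetical wrapper: only deviate from the base behavior when the
    # lowercased text actually contains the combining dot.
    value = str(value)
    lowered = unicodedata.normalize("NFKC", value).lower()
    if not (allow_unicode and COMBINING_DOT_ABOVE in lowered):
        return base_slugify(value, allow_unicode)
    # Same steps as the base function, but U+0307 is allowed to survive.
    lowered = re.sub(r"[^\w\s\u0307-]", "", lowered)
    return re.sub(r"[-\s]+", "-", lowered).strip("-_")


print(slugify_keeping_dot("İstanbul", allow_unicode=True) == "i\u0307stanbul")  # True
print(base_slugify("İstanbul", allow_unicode=True))  # istanbul
```

Strings without the combining dot go through the unmodified function, so the change in behavior stays limited to this one edge case.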
It's not about 'İ' but about '̇', which is the second character. IMO, slugify() properly removes '̇'.
See also related ticket #30892 about "İ".
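The point about the second character can be checked with the standard library alone. The sketch below assumes the string in the description is a plain `i` followed by U+0307 (COMBINING DOT ABOVE), which is also what Python produces when lowercasing `İ`. Variable names are illustrative only.

```python
import re
import unicodedata

# Assumed content of test_str from the description: 'i' + U+0307.
test_str = "i\u0307zmit"

# The visible "dotted i" is really two code points.
names = [unicodedata.name(ch) for ch in test_str[:2]]
print(names)  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# Lowercasing the Turkish capital İ (U+0130) yields the same two code points.
print("\u0130".lower() == "i\u0307")  # True

# U+0307 is a nonspacing mark (category Mn), so \w does not match it
# and the [^\w\s-] substitution strips it.
print(unicodedata.category("\u0307"))  # Mn
dot_is_word_char = re.fullmatch(r"\w", "\u0307") is not None
print(dot_is_word_char)  # False

cleaned = re.sub(r"[^\w\s-]", "", test_str)
print(cleaned)  # izmit
```

NFKC normalization does not help here, since there is no precomposed lowercase code point for `i` plus the combining dot to compose into.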