Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#33218 closed Bug (invalid)

slugify() can't handle Turkish İ while allow_unicode = True

Reported by: sowinski
Owned by: nobody
Component: Utilities
Version: dev
Severity: Normal
Keywords: slugify
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Please see the following example.
The string test_str = "i̇zmit" does not start with a normal i. It starts with the lowercase form of the İ from the Turkish alphabet.

Using allow_unicode=True should keep the Turkish İ instead of replacing it with a normal i.

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')


test_str = "i̇zmit"

output = slugify(test_str, allow_unicode = True)

print(test_str)
print(output)
print(test_str == output)

Change History (2)

comment:1 by Mariusz Felisiak, 3 years ago

Component: CSRF → Utilities
Resolution: invalid
Status: new → closed

It's not about 'İ' but about '̇' which is the second character. IMO, slugify() properly removes '̇', see:

>>> test_str = "i̇zmit"
>>> output = slugify(test_str, allow_unicode = True)
>>> for x, y in enumerate(test_str):
...     print(y, output[x], y == output[x])
i i True
̇ z False
z m False
m i False
i t False
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: string index out of range

See also related ticket #30892 about "İ".
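
To make the point about the second character concrete, here is a minimal sketch that lists the code points in the test string. The helper name describe is hypothetical and used only for illustration. It relies solely on the standard unicodedata and re modules.

```python
import re
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The string from the ticket, written with an escape so the combining mark is visible.
test_str = "i\u0307zmit"

for code_point, name in describe(test_str):
    print(code_point, name)

# Lowercasing the precomposed capital U+0130 yields the same two-character sequence.
lowered = "\u0130".lower()
print(lowered == test_str[:2])

# U+0307 is a combining mark (category Mn); \w does not match it on its own,
# which is why the substitution in slugify() drops it.
print(unicodedata.category("\u0307"))
print(re.match(r"\w", "\u0307"))
```

The string is six code points long, not five: a plain i, then the combining dot above, then "zmit". NFKC normalization leaves that pair alone because Unicode has no precomposed lowercase character for it.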

in reply to:  1 comment:2 by sowinski, 3 years ago

Thank you for the fast response.

I do not agree. Because of this behavior, it would be impossible to create an article for İstanbul, the largest city in Turkey, while allow_unicode=True.
https://tr.wikipedia.org/wiki/%C4%B0stanbul

Maybe someone else has an international website and will hit this problem.

I solved the problem by adding the i̇ to the regular expression.

value = re.sub(r'[^\w\si̇-]', '', value.lower())

I tested the implementation with all cities in the world, using all the different language variants of each city name, and it worked for me.
http://www.geonames.org/

It is interesting to see that this is the only edge case. I am not sure whether this will work in all situations, so I run my modification only if the strange i is in the string. Otherwise it falls back to the Django implementation.

See: https://github.com/wagtail/wagtail/issues/7637#issuecomment-949366560
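
The conditional workaround described in this comment could look roughly like the sketch below. It is an assumption about the approach, not the code actually used: the wrapper name slugify_keep_dotted_i and the constants are hypothetical, and slugify is copied from the ticket description so the example runs without Django.

```python
import re
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"
DOTTED_I = "i" + COMBINING_DOT_ABOVE  # what "İ".lower() produces


def slugify(value, allow_unicode=False):
    """Copy of the function quoted in the ticket description."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def slugify_keep_dotted_i(value):
    """Hypothetical wrapper: keep U+0307 only when the dotted i is present."""
    value = unicodedata.normalize("NFKC", str(value)).lower()
    if DOTTED_I not in value:
        # No Turkish dotted i: defer to the unmodified implementation.
        return slugify(value, allow_unicode=True)
    # Same substitution as slugify(), with U+0307 added to the allowed set.
    value = re.sub(r"[^\w\s\u0307-]", "", value)
    return re.sub(r"[-\s]+", "-", value).strip("-_")


print(slugify("\u0130stanbul", allow_unicode=True))
print(slugify_keep_dotted_i("\u0130stanbul"))
```

Note that this character class keeps every U+0307 in the string, not only the one that follows an i, so inputs that attach the dot to other letters would also retain it.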
