Opened 5 years ago
Closed 5 years ago
#30892 closed Bug (fixed)
slugify() doesn't return a valid slug for "İ".
| Reported by: | Luis Nell | Owned by: | Christoffer Sjöbergsson |
|---|---|---|---|
| Component: | Utilities | Version: | dev |
| Severity: | Normal | Keywords: | |
| Cc: | | Triage Stage: | Accepted |
| Has patch: | yes | Needs documentation: | no |
| Needs tests: | no | Patch needs improvement: | no |
| Easy pickings: | no | UI/UX: | no |
Description
While working on an international project, we discovered that the Turkish/Azerbaijani letter İ cannot be properly processed when SlugField and slugify are run with allow_unicode=True.
The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first talked about this in the Wagtail Support Channel and, while researching further, discovered that this is a Django/Python-related issue. This was tested on Python 3.6 and on Python 3.7.
(quick shoutout to Matt Wescott @gasmanic of Wagtail fame for being a sparring partner in this)
There is a rather detailed analysis (README) in a git repo I created, https://github.com/originell/django-wagtail-turkish-i - it was also the basis for my initial call for help in Wagtail's support channel. Meanwhile I have extended it with a Django-only project, to be 100% sure this has nothing to do with Wagtail.
I was not able to find anything similar in trac. While I encourage whoever is reading this to actually read the README in the git repo, I want to honor your time and will try to provide a more concise version of the bug here.
Explanation
models.py

```python
from django.db import models
from django.utils.text import slugify


class Page(models.Model):
    title = models.CharField(max_length=255)
    slug = models.SlugField(allow_unicode=True)

    def __str__(self):
        return self.title
```
Using this in a shell/test, like a (Model)Form might:

```python
from django.utils.text import slugify

page = Page(title="Hello İstanbul")
page.slug = slugify(page.title, allow_unicode=True)
page.full_clean()
```
full_clean() then raises:

```
django.core.exceptions.ValidationError: {'slug': Enter a valid 'slug' consisting of Unicode letters, numbers, underscores, or hyphens.}
```
Why is that?
slugify() does the following internally:

```python
re.sub(r'[^\w\s-]', '', value).strip().lower()
```
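For context, the whole function looked roughly like this at the time (paraphrased from the Django 2.2-era source, docstring omitted); note that lower() runs after the character filtering:

```python
import re
import unicodedata


def slugify(value, allow_unicode=False):
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Strip disallowed characters first, lowercase last -- this ordering is the issue.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)
```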
Thing is, Python's .lower() of the İ in İstanbul looks like this:

```python
>>> [unicodedata.name(character) for character in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```
So, while slugify() finishes, the result is then passed into SlugField, which uses slug_unicode_re (^[-\w]+\Z). However, the COMBINING DOT ABOVE is not a valid \w:

```python
>>> [(character, unicodedata.name(character), slug_unicode_re.match(character)) for character in 'İ'.lower()]
[
    ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),  # EUREKA!!
    ('̇', 'COMBINING DOT ABOVE', None)
]
```
So that's why the ValidationError is raised.
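To make the failure a bit more tangible, here is a minimal check against the validator regex itself (assuming a Django version from before any fix, so current versions may behave differently):

```python
from django.core.validators import slug_unicode_re
from django.utils.text import slugify

slug = slugify('Hello İstanbul', allow_unicode=True)
print(repr(slug))                   # still contains U+0307 COMBINING DOT ABOVE
print(slug_unicode_re.match(slug))  # None, hence the ValidationError from full_clean()
```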
Proposed Solution
The culprit seems to be the order in which lower() is called in slugify(). The assumption that the lowercase version of a re.sub-ed string still matches slug_unicode_re does not hold true.
Hence, instead of doing this in slugify()

```python
re.sub(r'[^\w\s-]', '', value).strip().lower()
```

it might be better to do it like this:

```python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
```
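As a quick sanity check, here is that reordering as a standalone helper (just a sketch; slugify_lowered_first is a made-up name, not Django's API), applied to the example title:

```python
import re
import unicodedata


def slugify_lowered_first(value):
    # allow_unicode=True path only, with lower() applied before the re.sub
    value = unicodedata.normalize('NFKC', str(value))
    value = re.sub(r'[^\w\s-]', '', value.lower()).strip()
    return re.sub(r'[-\s]+', '-', value)


print(slugify_lowered_first('Hello İstanbul'))  # 'hello-istanbul' -- the combining dot is stripped
```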
Is Python the actual culprit?
Yeah, it might be. Matt (@gasmanic) urged me to also take a look at whether Python might be doing this wrong.
- The İ is the LATIN CAPITAL LETTER I WITH DOT ABOVE. Its codepoint is U+0130. According to the chart for the Latin Extended-A set (https://www.unicode.org/charts/PDF/U0100.pdf), its lowercase version is U+0069.
- U+0069 lives in the C0 Controls and Basic Latin set (https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the LATIN SMALL LETTER I. So a Latin lowercase i.
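This is easy to verify from Python itself with the standard library's unicodedata module:

```python
import unicodedata

print(unicodedata.name('\u0130'))                       # LATIN CAPITAL LETTER I WITH DOT ABOVE
print([f'U+{ord(c):04X}' for c in '\u0130'.lower()])    # ['U+0069', 'U+0307'], not just U+0069
print([unicodedata.name(c) for c in '\u0130'.lower()])  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```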
Does this really mean that Python is doing something weird here by adding the Combining dot above? Honestly, I can't imagine that and I am probably missing an important thing here because my view is too naive.
---
I hope this shorter explanation makes sense. If it does not, please try to read through the detailed analysis in the repo (https://github.com/originell/django-wagtail-turkish-i/blob/master/README.md). If that also does not make a ton of sense, let me know.
In any case, thank you for taking the time to read this bug report. Looking forward to feedback and thoughts. I am happy to oblige in any capacity necessary.
Change History (6)
comment:1 by , 5 years ago
| Summary: | Slugify and SlugField with Diacritics → slugify() doesn't return a valid slug for "İ". |
|---|---|
| Triage Stage: | Unreviewed → Accepted |
| Version: | 2.2 → master |
comment:2 by Luis Nell, 5 years ago
True that.
I also thought about looping over the characters and simply removing everything that does not match \w. However, I threw that away because I thought this might well hit some edge case where it could modify the meaning of a word.
In any case, if we can find a (more solid) process, I'd be very happy to produce a patch and tests for that.
comment:3 by Christoffer Sjöbergsson, 5 years ago (followed up in comment:4)

| Owner: | changed to Christoffer Sjöbergsson |
|---|---|
| Status: | new → assigned |
I did some digging into this and found some interesting things that I thought I would share.
In short, my conclusion is that Python does as it should.
Firstly, because if you perform a lowercase operation on İ in JavaScript the result becomes the same; and secondly, because https://unicode.org/Public/UNIDATA/SpecialCasing.txt, which describes casing rules for some special occasions, shows that the lowercase mapping of İ is indeed ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE'].
Things are however a little bit more complicated than that, as it turns out that the casing operation is performed differently depending on which locale is used. Since locale settings should not break code I will not go into it much here, but for further reading take a look at JavaScript's toLocaleLowerCase or at this Stack Overflow answer: https://stackoverflow.com/a/19031612. If the locale setting 'TR' is used in these examples, then the lowercase version of İ is only LATIN SMALL LETTER I.
Now to the possible solution:

Replying to Luis Nell:

> ...I also thought about looping over the characters and simply removing everything that does not match the builtin slug validation regular expressions...

As far as I understand it, this is mostly what happens by changing the placement of lower() the way you suggest.
re.sub(r'[^\w\s-]', '', value) removes everything that is not a Unicode word character, whitespace, or a hyphen, which is almost the same regexp as the one the slug is then validated against. As previously discovered, the problem is when the lower() operation then adds new symbols that are not allowed by the regexp. I would therefore argue that moving lower() is a decent solution, because it will make the generated slug validate as long as the validation regexp stays the same as it is now. I would however make the case for moving the lower() operation to a different place, since the Unicode documentation (https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf) states that normalization is not kept during casing operations:
> Casing operations as defined in Section 3.13, Default Case Algorithms are not guaranteed to preserve Normalization Forms. That is, some strings in a particular Normalization Form (for example, NFC) will no longer be in that form after the casing operation is performed.
Therefore I would argue that it would be better to place the lower() operation before the normalization, as follows:

```python
value = str(value).lower()
if allow_unicode:
    value = unicodedata.normalize('NFKC', value)
else:
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
value = re.sub(r'[^\w\s-]', '', value).strip()
return re.sub(r'[-\s]+', '-', value)
```

This way the string is lowercased, then normalized to keep as much of the special characters as possible, and then the remaining disallowed symbols are removed with the regexp substitution.
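Wrapped in a standalone helper (again just a sketch with a made-up name, mirroring the snippet above), this ordering also produces a valid slug for the original example:

```python
import re
import unicodedata


def slugify_lower_before_normalize(value, allow_unicode=False):
    value = str(value).lower()
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip()
    return re.sub(r'[-\s]+', '-', value)


print(slugify_lower_before_normalize('Hello İstanbul', allow_unicode=True))  # hello-istanbul
```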
I guess this could in theory lead to unintended different meanings of words, but I don't know if it would be feasible to do this kind of string manipulation with guaranteed preserved meaning.
I have started to prepare a patch with my proposed change as well as tests so I have assigned this issue to me for now. My intention is to submit a patch later this week that could be tested by a few others maybe to check for further issues. Luis Nell, if you feel that you would rather write the patch yourself and that I overstepped by assigning myself, just claim the issue for yourself no hard feelings on my part :)
comment:4 by Luis Nell, 5 years ago
Replying to Christoffer Sjöbergsson:
Thanks for looking further into this. The part about the special casing is very enlightening and confirms my view being a tad too naive haha
I have started to prepare a patch with my proposed change as well as tests so I have assigned this issue to me for now. My intention is to submit a patch later this week that could be tested by a few others maybe to check for further issues. Luis Nell, if you feel that you would rather write the patch yourself and that I overstepped by assigning myself, just claim the issue for yourself no hard feelings on my part :)
By no means are you overstepping. Please, if you can find the time, go ahead. I was just offering my time because I know how stressful maintaining can be and how nice it is to not have to fix "yet another bug". In case you can't find the time, just let me know.
Again, thanks for looking into this!
comment:5 by Christoffer Sjöbergsson, 5 years ago

| Has patch: | set |
|---|---|
So far there are two different but similar fixes. Fix 1, proposed by Luis Nell, is to move the lower() operation to before the re.sub, like so:

```python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
```
The second idea, which I proposed, is to move the lower() operation before the whole if statement, and as such place it before the normalization operation.
I have now done some further testing and my conclusion is that the first alternative seems to be best. Both solutions do fix the bug in question, but the second solution seems to change the output of strings quite a bit when they contain certain characters. I would assume that it is better to fix the bug without altering the behavior too much from the original. As far as I can tell, the first solution does not alter the output for any characters other than the intended one.
I have created a pull request. It can be found here: https://github.com/django/django/pull/12237
It has a fairly small set of tests that should cover most of the issues I have found.
I have also created a separate branch in my fork of Django with an extended set of tests that exercise a large number of characters in a systematic way. They are however both slow and quite ugly, so I will not include them in the pull request. You can check them out here: https://github.com/Sjbrgsn/django/blob/ticket_30892_extended_tests/tests/model_fields/test_slugfield.py
So far I have not been able to find any unwanted behavior with the solution in the patch, but there might very well be some, so feel free to test and report any issues.
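For anyone who wants to try the patch locally, a minimal regression check along these lines should pass once it is applied (a sketch only; the tests in the pull request may be organized differently):

```python
from django.core.validators import validate_unicode_slug
from django.test import SimpleTestCase
from django.utils.text import slugify


class TurkishDottedISlugTests(SimpleTestCase):
    def test_dotted_capital_i_produces_valid_slug(self):
        slug = slugify('Hello İstanbul', allow_unicode=True)
        self.assertEqual(slug, 'hello-istanbul')
        validate_unicode_slug(slug)  # must not raise ValidationError
```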
Thanks for this ticket. I'm not sure if your solution is correct, because I'm afraid that we may hit another edge case; nevertheless, slugify() should return a valid slug.