Opened 5 years ago
Closed 5 years ago
#31225 closed Cleanup/optimization (wontfix)
Use NFD normalization in get_valid_filename().
Reported by: | Guillaume Thomas | Owned by: | nobody |
---|---|---|---|
Component: | Utilities | Version: | 3.0 |
Severity: | Normal | Keywords: | text |
Cc: | Triage Stage: | Unreviewed | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Django uses the function `get_valid_filename` to get a 'clean filename' from any input string. In theory, this function keeps only Unicode word characters (underscore included), dashes (`-`) and dots (`.`). It relies on the standard `re` package to match Unicode characters.
There are several forms of Unicode normalization (https://docs.python.org/3.8/library/unicodedata.html#unicodedata.normalize) and, after some tests, it appears that `re` only handles NFC-normalized input as expected.
For instance:
    import re
    import unicodedata
    re.match(r"^\w$", unicodedata.normalize("NFC", "é"), re.UNICODE)
    # <_sre.SRE_Match object; span=(0, 1), match='é'>
    re.match(r"^\w$", unicodedata.normalize("NFD", "é"), re.UNICODE)
    # None
This makes `get_valid_filename` behave differently depending on the Unicode normalization of the input string. Thus:
    import unicodedata
    from django.utils.text import get_valid_filename
    get_valid_filename(unicodedata.normalize("NFC", "é"))
    # é
    get_valid_filename(unicodedata.normalize("NFD", "é"))
    # e
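To make the cause of this difference explicit, here is a small sketch that inspects the code points of each form. It only uses the standard library; the variable names are mine and are not part of Django.

```python
import re
import unicodedata

# "\u00e9" is the precomposed character "é".
nfc = unicodedata.normalize("NFC", "\u00e9")
nfd = unicodedata.normalize("NFD", "\u00e9")

# NFC keeps one precomposed code point; NFD splits it into a base letter
# followed by a combining accent.
nfc_names = [unicodedata.name(c) for c in nfc]
nfd_names = [unicodedata.name(c) for c in nfd]

# \w matches the base letter but not the combining accent, so a pattern
# that drops non-word characters removes the accent from NFD input.
nfd_word_flags = [bool(re.match(r"\w", c)) for c in nfd]

print(nfc_names)
print(nfd_names)
print(nfd_word_flags)
```

So the NFD string is two characters long, and only the first one counts as a word character for `re`.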
It appears that this normalization depends on the operating system. macOS uses a form that is nearly NFD; on Unix, it's NFC. As a result, filenames coming from macOS systems are "slugified", which is not the case for other operating systems. My feeling at this stage is that this complexity could be abstracted away from the developer, giving this function "normalization-independent" handling of strings.
I also think we could go further and force filenames to only contain ASCII characters. This curiosity was found after we had an issue with our setup, which consists of a Django app behind nginx. To serve private media files, Django returns an empty HTTP response and provides the internal filename in the `X-Accel-Redirect` header. The problem was that nginx does not seem to like non-ASCII characters here.
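For the header issue specifically, one possible workaround is to percent-encode the internal path before putting it in the header, so that the value is pure ASCII. This is only a sketch: `accel_redirect_value` and the `/protected/` prefix are hypothetical, and whether nginx decodes the escaped value depends on its version and configuration, so that part is an assumption to verify.

```python
from urllib.parse import quote


def accel_redirect_value(internal_path):
    # Hypothetical helper: percent-encode non-ASCII characters as UTF-8
    # bytes while leaving "/" untouched (the default for quote()).
    return quote(internal_path)


# Hypothetical internal location; the result contains only ASCII characters.
value = accel_redirect_value("/protected/caf\u00e9.pdf")
print(value)
```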
In the end, I think a lot of bugs could be avoided by forcing NFD normalization in the `get_valid_filename` function. It'd be roughly the same behaviour as `slugify` with `allow_unicode=False`.
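A rough sketch of what I have in mind, written without Django so it can be run on its own. `keep_valid_chars` is a hypothetical stand-in for the filtering described above (it is not Django's code), and `ascii_filename` and `nfc_filename` are hypothetical names for two possible variants.

```python
import re
import unicodedata


def keep_valid_chars(name):
    # Hypothetical stand-in for the filtering described in this ticket:
    # spaces become underscores, then everything except word characters,
    # dashes and dots is dropped.
    return re.sub(r"[^-\w.]", "", name.strip().replace(" ", "_"))


def ascii_filename(name):
    # Variant 1: decompose to NFD, then drop every non-ASCII code point,
    # so NFC and NFD inputs give the same ASCII-only result.
    decomposed = unicodedata.normalize("NFD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return keep_valid_chars(ascii_only)


def nfc_filename(name):
    # Variant 2: compose to NFC first, so accented letters survive
    # regardless of the form the input arrived in.
    return keep_valid_chars(unicodedata.normalize("NFC", name))


print(ascii_filename("caf\u00e9 menu.txt"))   # NFC input
print(ascii_filename("cafe\u0301 menu.txt"))  # NFD input
print(nfc_filename("cafe\u0301 menu.txt"))
```

The first variant matches the ASCII-only proposal; the second keeps Unicode but still removes the dependency on the input's normalization form.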
What do you think?
Change History (2)
comment:1 by , 5 years ago
Summary: | Use slugify in get_valid_filename → Use NFD normalization in get_valid_filename |
---|
comment:2 by , 5 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Summary: | Use NFD normalization in get_valid_filename → Use NFD normalization in get_valid_filename(). |
Type: | Uncategorized → Cleanup/optimization |
Thanks for this ticket; however, I don't think that Django should normalize filenames. The current behavior is tested and documented (see #16315 with a discussion and arguments against a similar change in `FileSystemStorage`). You can start a discussion on DevelopersMailingList if you don't agree.