Opened 13 years ago

Closed 12 years ago

#16315 closed Bug (wontfix)

FileSystemStorage.listdir returns names with unicode normalization form that is different from names in database

Reported by: philomat | Owned by: nobody
Component: File uploads/storage | Version: 1.3
Severity: Normal | Keywords: storage unicode normalization
Cc: (none) | Triage Stage: Accepted
Has patch: no | Needs documentation: no
Needs tests: no | Patch needs improvement: no
Easy pickings: no | UI/UX: no

Description

Suppose you want to write a function that finds files on disk that are no longer referenced in the database, and you use FileSystemStorage.listdir to compare what it returns with what is in the database. You cannot compare the strings without normalizing them first, because the same unicode characters can be represented in different normalization forms.

This problem is best demonstrated with some example code:

# Assuming that my storage root contains one folder named u'ä'
>>> import os
>>> from django.core.files.storage import FileSystemStorage
>>> import unicodedata
>>>
# listdir returns u'a' followed by 'COMBINING DIAERESIS' (U+0308)
>>> FileSystemStorage().listdir('')[0][0]
u'a\u0308'
# in the database, this character is stored using a different normalization form: 
>>> os.path.basename(FileSystemStorage().path(u'ä'))
u'\xe4'
# the values should be normalized:
>>> unicodedata.normalize('NFC', FileSystemStorage().listdir('')[0][0])
u'\xe4'
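
For the cleanup use case described above, one way to sidestep the mismatch is to normalize both sides before comparing. The following is a minimal Python 3 sketch, assuming the names from listdir and from the database are already available as lists of strings. `find_orphans` and the sample data are hypothetical and not part of Django:

```python
import unicodedata


def find_orphans(disk_names, db_names, form="NFC"):
    """Return the names in disk_names that have no match in db_names.

    Both sides are normalized to the same Unicode form before comparing,
    but the original on-disk spelling is what gets returned.
    """
    known = {unicodedata.normalize(form, name) for name in db_names}
    return [
        name for name in disk_names
        if unicodedata.normalize(form, name) not in known
    ]


# Hypothetical data: the disk reports the decomposed spelling of 'ä',
# the database holds the precomposed one.
disk = ["a\u0308", "orphan.txt"]
db = ["\xe4"]
print(find_orphans(disk, db))  # ['orphan.txt']
```

Returning the on-disk spelling rather than the normalized one matters if the result is later used to open or delete the file.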

Change History (9)

comment:1 by Aymeric Augustin, 13 years ago

If I understand correctly, the bug is the fact that the file name is normalized in NFC form in the database and in NFD form on the filesystem.

Django doesn't do any unicode normalisation — well, it does in two places, but they're obviously unrelated to the situation you describe.

Maybe the normalization in NFC form appears when the string round-trips in the database. Or maybe the normalization in NFD form appears when the file is written on the file system. In both cases, that's outside the control of Django, but I'd like to understand what happens.

Can you test:

  • writing a file called u'\xe4', then listdir(), and see if it has turned into u'a\u0308'?
  • saving the string u'a\u0308' in the database (in any CharField), then retrieve it, and see if it has turned into u'\xe4'

Also, which database and which filesystem are you using?
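
The first of these checks can be scripted. Below is a rough Python 3 sketch, assuming the directory being tested lives on the filesystem in question; `roundtrip_filename` and its `parent` argument are hypothetical names introduced here, and by default the temporary directory may sit on a different filesystem than the storage root:

```python
import os
import tempfile


def roundtrip_filename(name, parent=None):
    """Create an empty file called name in a fresh temporary directory
    (inside parent, if given) and return the name os.listdir reports."""
    with tempfile.TemporaryDirectory(dir=parent) as directory:
        with open(os.path.join(directory, name), "w"):
            pass
        return os.listdir(directory)[0]


written = "\xe4"
listed = roundtrip_filename(written)
# ascii() shows the code points, so the two spellings can be told apart.
print(ascii(written), "->", ascii(listed))
print("unchanged" if listed == written else "renormalized by the filesystem")
```

Passing the storage root as `parent` should exercise the same filesystem that FileSystemStorage writes to.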

comment:2 by Aymeric Augustin, 13 years ago

Resolution: needsinfo
Status: new → closed

At this point, we don't have enough information to assess if this is a bug in Django. Please reopen the ticket if you can provide the answers to my questions above.

comment:3 by anonymous, 13 years ago

Resolution: needsinfo
Status: closed → reopened

  • writing a file called u'\xe4', then listdir(), and see if it has turned into u'a\u0308'

This indeed turns into u'a\u0308'. The file system is Mac OS Extended (Journaled).

  • saving the string u'a\u0308' in the database (in any CharField), then retrieve it, and see if it has turned into u'\xe4'

This fails, MySQL gives me an OperationalError: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")

comment:4 by Łukasz Rekucki, 13 years ago (in reply to: comment:3)

Resolution: needsinfo
Status: reopened → closed

Replying to anonymous:

This fails, MySQL gives me an OperationalError: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")

I'm not a MySQL expert, but this looks like a configuration error. As the original reporter didn't provide more info, pushing this back to "needsinfo".

comment:5 by Julien Phalip, 12 years ago

Resolution: needsinfo
Status: closed → reopened
Triage Stage: Unreviewed → Accepted

I'm getting some test failures for the Django staticfiles tests on my Ubuntu 10.04.4 LTS box due to unicode issues similar to those reported in this ticket.

The root of the problem is that my box's filesystem apparently uses combining diacritical marks for encoding certain characters when creating filenames. In the particular case of these failing tests, my box encodes the character u'ş' (u'\u015f') from the filename fişier.txt as u's\u0327', that is, an 's' followed by a combining cedilla.

Here's one way of illustrating the problem:

>>> u'ş'
u'\u015f'
>>> print u'\u015f'
ş
>>> print u's\u0327'  # Combining cedilla
ş

It seems like the right approach would be to make Django normalize filenames:

>>> import unicodedata
>>> unicodedata.normalize("NFC", u"s\u0327")
u'\u015f'

comment:6 by Julien Phalip, 12 years ago

On second thought, rather than systematically normalizing, Django should preserve the original encoding.

comment:7 by Aymeric Augustin, 12 years ago

Indeed, in my experience, preserving the encoding is easier.

The reason is that if you normalize the filename, you can no longer look up the file on disk by name; you have to try all normalization forms.
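
To illustrate what "trying all normalization forms" would involve, here is a small Python 3 sketch. `candidate_names` and `resolve_name` are hypothetical helpers, and `exists` is an assumed callable that reports whether a given spelling is present on disk:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def candidate_names(name):
    """Return the distinct spellings of name under each normalization
    form, starting with the name exactly as given."""
    seen = []
    for candidate in [name] + [unicodedata.normalize(f, name) for f in FORMS]:
        if candidate not in seen:
            seen.append(candidate)
    return seen


def resolve_name(name, exists):
    """Return the first spelling for which exists(spelling) is true,
    or None if no spelling matches."""
    for candidate in candidate_names(name):
        if exists(candidate):
            return candidate
    return None


# Hypothetical directory contents: only the decomposed spelling exists.
on_disk = {"a\u0308"}
print(ascii(resolve_name("\xe4", on_disk.__contains__)))  # 'a\u0308'
```

In practice `exists` could be something like `lambda n: os.path.exists(os.path.join(directory, n))`. Note that the NFKC and NFKD forms can change characters more aggressively than NFC and NFD, so including them is a judgment call.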

comment:8 by Julien Phalip, 12 years ago

For reference, this issue is discussed here: http://nedbatchelder.com/blog/201106/filenames_with_accents.html

comment:9 by Aymeric Augustin, 12 years ago

Resolution: wontfix
Status: reopened → closed

If I'm reading this ticket correctly, our decision is *not* to perform any normalization.

That's what Django does currently: it relies on the fact that normalization is preserved both in the database and in the filesystem (which wasn't true for the reporter; maybe the files were moved from one filesystem to another one with different normalization; maybe his database wasn't set up correctly; etc.)

This may not always hold true, but I'm convinced that it isn't a problem that can (or should) be fixed at the framework level.
