Opened 13 years ago

Closed 13 years ago

Last modified 9 years ago

#17816 closed Cleanup/optimization (invalid)

UnicodeEncodeError in Image- and FileFields

Reported by: andi@…
Owned by: nobody
Component: Forms
Version: 1.3
Severity: Normal
Keywords:
Cc: anssi.kaariainen@…
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Uploading files (and images) whose names contain non-ASCII characters (e.g. German umlauts) through a form containing an ImageField or FileField causes a UnicodeEncodeError on recent versions of Ubuntu servers. Curiously, this does not happen on the development server or on older Debian 5 servers.

In order to avoid this error and non-ASCII characters in URLs I'd like to suggest a built-in (optional) conversion of the filename in the Image- and/or FileField class (or corresponding base class).

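import os

from django.forms import ImageField
from django.template.defaultfilters import slugify


class MyImageField(ImageField):

    def __init__(self, *args, **kwargs):
        super(MyImageField, self).__init__(*args, **kwargs)

    def clean(self, *args, **kwargs):
        data = super(MyImageField, self).clean(*args, **kwargs)
        if not data:
            # No file was uploaded, so there is nothing to rename.
            return data
        # Slugify the base name and the extension separately.
        base, ext = os.path.splitext(data.name)
        data.name = unicode(slugify(base))
        if ext:
            data.name += u'.' + slugify(ext)
        return data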

where slugify is e.g. the slugify function from django.template.defaultfilters

Change History (18)

comment:1 by Jannis Leidel, 13 years ago

What version of Ubuntu did you use? Which version of Python? Which of Django? What kind of deployment did you use?

comment:2 by andi@…, 13 years ago

The answers to your questions:
Ubuntu 10.04
Python 2.6.5
Django 1.3.1
Deployment with mod_wsgi

comment:3 by andi@…, 13 years ago (in reply to comment:2)

Another elegant solution would be:

class MyImageField(ImageField):

    def __init__(self, *args, **kwargs):
        super(MyImageField, self).__init__(*args, **kwargs)

    def clean(self, *args, **kwargs):
        data = super(MyImageField, self).clean(*args, **kwargs)
        data.name = data.name.encode('ascii','ignore')
        return data

comment:4 by Aymeric Augustin, 13 years ago

If this is a bug in Django, it must be fixed, not worked-around with a data-destroying operation.

It may also be a problem in your setup (environment, locale, etc.)

comment:5 by anonymous, 13 years ago

I would not call it a "bug", as it was working absolutely fine on Debian Lenny. I've tried to locate the problem with the help of the Django users group, but I found no way of getting it to work on an Ubuntu server.

In my particular case I'm uploading images for a gallery app. And having nice URL-friendly filenames would be a "feature".
Having non-ASCII characters in a URL (even if allowed) can get quite messy, especially when server and browser speak different dialects, e.g. de_DE.utf8 and de_DE.latin1.

So, please look at this "bug report" as a feature request.

comment:6 by Anssi Kääriäinen, 13 years ago

Cc: anssi.kaariainen@… added

Isn't this the nice "you must set your locale correctly for your web-server user" error? Something like this: http://stackoverflow.com/questions/6171278/unicode-in-django-admin. If you Google for this, there are a lot more similar errors to be found.

I, too, was hit with this issue some time ago. I have two suggestions:

  • Issue a warning when Django is run in a non-UTF-8 environment. Granted, this will be hidden in the log files, but it still gives developers a chance to fix this before bug reports arrive from production. This is one of those bugs that are hard to spot in testing...
  • When the Unicode error happens during file saving, convert it to a more explanatory one. Link to documentation explaining this issue if possible.

You could also always issue a warning when a file save operation gets a unicode string and the server is not in a UTF-8 locale, even if there isn't any actual error.
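The startup warning suggested above could look roughly like the following. This is only a sketch: the helper names is_utf8 and warn_if_not_utf8 are hypothetical and not part of Django, and where such a check would be called from is left open.

```python
import codecs
import sys
import warnings


def is_utf8(encoding_name):
    """Return True if encoding_name is a known alias of UTF-8."""
    if not encoding_name:
        return False
    try:
        # codecs.lookup() normalizes aliases such as 'UTF8' or 'utf_8'.
        return codecs.lookup(encoding_name).name == 'utf-8'
    except LookupError:
        return False


def warn_if_not_utf8():
    """Emit a warning if the filesystem encoding is not UTF-8.

    Returns True if a warning was issued, False otherwise.
    """
    encoding = sys.getfilesystemencoding()
    if is_utf8(encoding):
        return False
    warnings.warn(
        "Filesystem encoding is %r, not UTF-8; saving files with non-ASCII "
        "names may raise UnicodeEncodeError. Check LANG/LC_ALL for the "
        "web server user." % (encoding,),
        RuntimeWarning,
    )
    return True
```

Using codecs.lookup() rather than a string comparison means that spellings such as UTF8, utf-8 and utf_8 are all treated alike.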

Of course, adding more warnings about this in the documentation is one way forward, too.

I am not too sure if the optional conversion is a good solution. The problem here is that you will still get the error in production. You will not remember to check that option _before_ you are hit with this, and having it default to on is not going to work. If there are enough users who want that option then why not, but it will not magically solve this problem.

comment:7 by Aymeric Augustin, 13 years ago

(I wrote this before reading Anssi's comment.)

Many developers would be surprised if Django automatically altered the names of uploaded files, and it would be backwards incompatible, so we won't do that.

Should we offer this feature as an option? "Normalizing" file names is easy -- the snippet you pasted above shows how to do it -- but it's also a matter of taste. For instance, one could prefer:

data.name = unicodedata.normalize('NFKD', data.name).encode('ascii', 'ignore')

Therefore, I'm reluctant to provide this feature in Django.
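To illustrate why this is a matter of taste, here is a small sketch comparing the two approaches on a sample name, "Müller Straße.jpg" (written with escapes below). The function names are made up for the example, and the trailing decode('ascii') is an addition so that the result is text rather than bytes.

```python
import unicodedata


def strip_non_ascii(name):
    """Drop every non-ASCII character outright (the approach from comment:3)."""
    return name.encode('ascii', 'ignore').decode('ascii')


def fold_to_ascii(name):
    """Decompose accented characters first, then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', name)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


name = u'M\xfcller Stra\xdfe.jpg'
print(strip_non_ascii(name))  # Mller Strae.jpg
print(fold_to_ascii(name))    # Muller Strae.jpg
```

NFKD splits the u-umlaut into a plain "u" plus a combining mark, so the base letter survives. The sharp s has no such decomposition and is lost either way, so both variants still destroy data.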


Still, I'm surprised that you got a UnicodeEncodeError. It may reveal a bug in Django.

I suppose you already set LANG and LC_ALL correctly? Does the server have the proper locales installed? (dpkg-reconfigure locales)

Can you provide a link to the django-users discussion?

Last edited 13 years ago by Aymeric Augustin

comment:8 by Anssi Kääriäinen, 13 years ago

One more idea: would it be possible to actually try to set the locale to a UTF-8-based one when the server is started and the locale isn't one already? That would be a new setting: something like LOCALE='en_US.UTF8'. The global_settings default would be None for "use whatever is configured", and the settings template would need to have an OS-dependent LOCALE. There is already a precedent: the TIME_ZONE setting alters os.environ...

The way to do this is to use locale.setlocale(locale.LC_ALL, wanted_locale) in django/conf/__init__.py.

I haven't tried this idea, so I don't know if this really works, but maybe worth a try in 1.5.

comment:9 by andi@…, 13 years ago

Checking the locale was the first thing I did. No matter what I tried: UnicodeEncodeError.
The system, Apache and Python (os.environ) all report the locale as de_DE.UTF-8.

As I said, it was working on Debian 5 (Python 2.5, Django 1.3.1). It just fails on Ubuntu (Python 2.6, Django 1.3.1; haven't tried other server OS).

This is the link to the group discussion: http://groups.google.com/group/django-users/browse_thread/thread/444e92fffbac31ae

comment:10 by Claude Paroz, 13 years ago

Could you test with Django 1.4? I'm quite sure at least one bug related to Unicode file names has been fixed, but I was unable to find the ticket.

comment:11 by Anssi Kääriäinen, 13 years ago

Seems like I was mistaken about this being the system locale mismatch.

I quickly checked the idea of altering the locale on process startup. In short: it seems like a bad idea. However, improved error messages and warnings could help other people solve locale mismatch issues as painlessly as possible. Should I open another ticket for these improvements?

comment:12 by Aymeric Augustin, 13 years ago

Resolution: needsinfo
Status: new → closed

In order to rule out locale configuration issues, could you insert the following in a view:

import locale
locales = "Current locale: %s %s -- Default locale: %s %s" % (locale.getlocale() + locale.getdefaultlocale())

and echo the contents of locales in a template? (Do that on a test page or in an HTML comment so it doesn't show up for regular users.)

I expect:

Current locale: None None -- Default locale: de_DE UTF8

This should be done on the same server that exhibits the problem, of course.

comment:13 by andi@…, 13 years ago

Locale settings are all OK.
As I suspected, it was a bug in the Apache package of Ubuntu. I have reverted to the original code and uploaded (after an Apache restart) a file containing a German umlaut. Everything went fine.

Sorry for bothering you with this issue...

comment:14 by clime7@…, 11 years ago

I have encountered a similar problem and I'd like to add some info about it. Two things can cause Unicode errors like this:

1) A non-UTF-8 encoding returned by sys.getdefaultencoding() causes Unicode errors in cases like str(unicode_string_with_accents), i.e. whenever a unicode string is converted to a byte string without explicitly specifying an encoding, as in str(unicode_string_with_accents.encode('utf-8')). However, ASCII is the default for Python 2 and it shouldn't be fiddled with, so this is an expected problem.

2) A non-UTF-8 encoding returned by sys.getfilesystemencoding(). This, on the other hand, really should be UTF-8, because otherwise you get Unicode errors in cases like os.stat(unicode_string_with_accents). The os module uses the filesystem encoding when interpreting unicode strings, and on Linux the filesystem encoding is inferred from the locale of the Python interpreter. Specifically, there should be LANG=something.utf8 in the environment.

I have resolved my problem by adding "env = LANG=en_US.utf8" to my uwsgi.ini. I believe other people with this problem might need to do something similar.
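The two causes above can be checked with a short script. This is a rough sketch rather than a diagnostic tool: the helper names are invented for the example, and the implicit ASCII conversion of Python 2's str() is imitated with an explicit encode('ascii') so that the snippet behaves the same on Python 3.

```python
import sys


def implicit_ascii_encode(text):
    """Imitate Python 2's str(unicode_string): an implicit ASCII encode."""
    return text.encode('ascii')


def explicit_utf8_encode(text):
    """Encode with an explicit codec, which works for any character."""
    return text.encode('utf-8')


name = u'B\xe4r.jpg'  # file name containing an a-umlaut

try:
    implicit_ascii_encode(name)
    failed = False
except UnicodeEncodeError:
    failed = True

print("implicit ASCII encode failed: %s" % failed)
print("explicit UTF-8 bytes: %r" % explicit_utf8_encode(name))

# The two interpreter settings discussed above.
print("default encoding:    %s" % sys.getdefaultencoding())
print("filesystem encoding: %s" % sys.getfilesystemencoding())
```

Running this under the same user and environment as the web server process (rather than in an interactive shell) is what matters, since the filesystem encoding is derived from that process's locale.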

comment:15 by Tim Graham <timograham@…>, 9 years ago

In 25b912ab:

Fixed #17686, refs #17816 -- Added "Files" section to Unicode topic.

Thanks Fako Berkers for help with the patch.

comment:16 by Tim Graham <timograham@…>, 9 years ago

In da20004a:

[1.8.x] Fixed #17686, refs #17816 -- Added "Files" section to Unicode topic.

Thanks Fako Berkers for help with the patch.

Backport of 25b912abbe31fa440e702b5273c18cf74e2d6e0b from master

comment:17 by Tim Graham <timograham@…>, 9 years ago

In 84006fd:

[1.9.x] Fixed #17686, refs #17816 -- Added "Files" section to Unicode topic.

Thanks Fako Berkers for help with the patch.

Backport of 25b912abbe31fa440e702b5273c18cf74e2d6e0b from master

comment:18 by Tim Graham, 9 years ago

Resolution: needsinfo → invalid