Opened 12 years ago

Closed 12 years ago

#19398 closed Bug (wontfix)

django.utils._os.safe_join should return a native string

Reported by: Aymeric Augustin Owned by: nobody
Component: Python 2 Version: 1.5-beta-1
Severity: Release blocker Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

By default, filesystem paths are represented with native strings (ie. str objects) in Python 2 and Python 3.

% python2
>>> import os
>>> type(os.listdir('.')[0])
<type 'str'>
% python3
>>> import os
>>> type(os.listdir('.')[0])
<class 'str'>

In other words, they were switch from bytestrings to unicode in Python 3.


A brief interlude for perfectionists and pedants :)

In Python 2, it's possible to use unicode for filesystem paths, when os.path.supports_unicode_filenames = True, but that's not the default mode of operation.

In Python 3, it's possible to use bytestrings for filesystem paths, because not all supported platforms sport unicode-aware filesystems: see http://docs.python.org/3/library/os.path:

The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings.


My initial statement still reflects the intent of Python's developers, from which Django shouldn't deviate.


The conversion to unicode was introduced 4 years ago in 8fb1459b5294fb9327b241fffec8576c5aa3fc7e. This commit was fixing an issue with the reporting of template loading errors.

In hindsight, it would have been better to keep safe_join similar to os.path.join, and preprocess the arguments or introduce a safe_joinu method.

Excluding tests, safe_join is used in four places in Django. Auditing these for proper use of bytestrings vs. unicode strings seems doable.

safe_join isn't documented and the name _os is a strong hint that it's a private API.

Therefore, I propose:

  • to remove the coercion to unicode — which is incorrect anyway, because it doesn't honor sys.getfilesystemencoding(), and thus fails on non-utf-8 filesystems;
  • to perform the coercion in callers that need it, or remove it altogether if possible.

This bug is preventing me from fixing #19357, which is a release blocker, because the static files finders are using safe_join. It is also the root cause for #17686.

Change History (4)

comment:1 by Aymeric Augustin, 12 years ago

Digging deeper, it appears that Django tries to manipulate filesystem paths as unicode in Python 2, which doesn't work so well.

>>> from django.utils import _os
>>> _os.abspathu('café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "django/utils/_os.py", line 28, in abspathu
    path = join(os.getcwdu(), path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 71, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

>>> import os
>>> os.path.abspath('café')
'/Users/aaugustin/Documents/dev/dj\xc3\xa1ng\xc3\xb2/caf\xc3\xa9'

This exception is all the more entertaining since the docstring of abspathu says "... thus avoiding a UnicodeDecodeError in join when the cwd has non-ASCII characters ..." ;)

Also:

>>> _os.safe_join(_os.abspathu('.'), 'café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "django/utils/_os.py", line 39, in safe_join
    final_path = abspathu(join(base, *paths))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 71, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

>>> os.path.join(os.path.abspath('.'), 'café')
'/Users/aaugustin/Documents/dev/dj\xc3\xa1ng\xc3\xb2/caf\xc3\xa9'

OK, I'm being mean, I understand that the intended use case is:

>>> _os.safe_join(_os.abspathu('.'), u'café')
u'/Users/aaugustin/Documents/dev/dj\xe1ng\xf2/caf\xe9'

Still, supporting unicode filesystems path on Python 2 is a worthy goal, but not at the expense of breaking the default.

In order to avoid UnicodeErrors in all cases, I think we should convert values to str explicitly, using sys.getfilesystemencoding() as charset, where necessary. Having this conversion happen as a side effect in safe_join isn't explicit or predictable; it'd better be done at the call site or with a dedicated utility function.

For the record, abspathu is used in safe_join and in core.files.storage; once again there aren't many places to fix.

Version 0, edited 12 years ago by Aymeric Augustin (next)

comment:2 by Aymeric Augustin, 12 years ago

I've created a pull request that:

1) Introduces path_as_str and path_as_text functions:

  • They do nothing under Python 3; they're only needed under Python 2; the following only applies to Python 2.
  • Their primary goal is to make conversions to and from filesystem paths explicit. For instance, if TEMPLATE_DIRS contains an unicode string, it must be converted to the filesystem's charset before being used for filesystem-related operations. (If we were strong typing fanatics, we'd have a FileSystemPath type, distinct from the regular string types.)
  • Currently, they're used at all safe_join and abspathu call sites, to guarantee backwards-compatibility. But the goal is to remove as many uses as possible in the near future (ie. #19357).

2) Drops _os.abspathu(...) in favor of the equivalent path_as_text(os.path.abspath(path_as_str(...)).

3) Modifies the implementation _os.safe_join(..., ...) to be consistent with os.path.join typewise, and replaces its current uses by the new equivalent path_as_text(_os.safe_join(path_as_str(...), path_as_str(...))

4) Tests that safe_join doesn't coerce to unicode any longer.

Link: https://github.com/django/django/pull/564

comment:3 by Aymeric Augustin, 12 years ago

Pull request updated to accept bytestrings as paths under Python 3; path_as_str isn't a no-op any longer.

Tests pass with Python 2.6 and 3.2.

comment:4 by Aymeric Augustin, 12 years ago

Resolution: wontfix
Status: newclosed

The design decision for #19357 makes this unnecessary.

Note: See TracTickets for help on using tickets.
Back to Top