Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#30481 closed Cleanup/optimization (wontfix)

Document that force_str() allows lone surrogates.

Reported by: Adam Hooper
Owned by: nobody
Component: Documentation
Version: 2.2
Severity: Normal
Keywords: force_text unicode
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> invalid_text = '\ud802\udf12'
>>> print(invalid_text)  # we'd expect this to fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> import django.utils.encoding
>>> django.VERSION
(2, 2, 0, 'alpha', 1)

>>> valid_text = django.utils.encoding.force_text(invalid_text)
>>> print(valid_text)  # we'd expect this to succeed?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

>>> valid_text
'\ud802\udf12'

Perhaps this is a flaw in my expectations? I'd expect force_text()'s output to always be valid text -- even though Python allows me to create _non-text_ str objects. (In this case, I'd expect maybe \ufffd\ufffd -- Unicode replacement characters.)

Unicode primer: \ud802 is a "lone surrogate" in this context. A lone surrogate is a valid Unicode _code point_ but it does not represent _text_. (Lone surrogates can crop up if someone decodes valid UCS-2 as UTF-16.) I don't think any caller of force_text() expects it to ever return a non-textual Unicode string.
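The distinction between a valid code point and valid text can be checked in plain Python. This is a minimal sketch using only the standard library; `contains_surrogate` and `is_utf8_encodable` are hypothetical helper names and are not part of Django.

```python
def contains_surrogate(text: str) -> bool:
    """Return True if any code point lies in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def is_utf8_encodable(text: str) -> bool:
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


invalid_text = "\ud802\udf12"  # the two code points from the report

print(contains_surrogate(invalid_text))  # True
print(is_utf8_encodable(invalid_text))   # False
print(is_utf8_encodable("plain text"))   # True
```

A check like this only reports the problem; deciding whether to reject or repair the string is left to the caller.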

Change History (11)

comment:1 by Claude Paroz, 5 years ago

I don't think that fixing invalid unicode input is in the contract of force_str/force_text.

comment:2 by Adam Hooper, 5 years ago

That's fair; then perhaps there should be some documentation to that effect?

Where I'm coming from: Postgres gave me an error when I tried to INSERT a string that was passed to my handler via JSON. It turns out Python's json.loads() can produce lone surrogates (because JSON can contain them -- https://bugs.python.org/issue17906); but Postgres TEXT (or JSON or JSONB) fields only store well-formed Unicode text. "I mustn't be the only person with this problem," I figured. I found force_text(). It looks like exactly the utility I need -- especially since it's littered all over the django.db package.
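The json.loads() behaviour described here can be reproduced with the standard library alone. The payload and the "name" key below are made up for illustration; this is a sketch of the failure mode, not of the reporter's actual handler.

```python
import json

# A JSON document whose string value is a single escaped high surrogate.
payload = '{"name": "\\ud802"}'
decoded = json.loads(payload)["name"]

# json.loads() accepts the escape and yields a one-code-point str.
print(len(decoded), hex(ord(decoded)))

# The value cannot be encoded as strict UTF-8, so it fails as soon as
# something downstream needs real bytes.
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    print("not encodable:", exc.reason)

# By contrast, an escaped high/low pair is combined into one code point.
paired = json.loads('"\\ud802\\udf12"')
print(len(paired), hex(ord(paired)))
```

Note that the pair case differs from the Python literal '\ud802\udf12' in the description, which is a str holding two separate surrogate code points.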

Then I needed to learn that it wasn't.

I ended up writing my own utility to replace surrogates. For anyone reading:

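import re
# Matches any code point in the surrogate range U+D800..U+DFFF.
SURROGATES = re.compile(r'[\ud800-\udfff]')
def force_valid_text(text):
    # Replace each surrogate with U+FFFD, the Unicode replacement character.
    return SURROGATES.sub('\ufffd', text)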

(I had to add \u0000 to the regex, too, because Postgres doesn't allow that, either. But I feel that's a Postgres-specific issue, whereas the utility of force_text() is more general.)
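A variant with \u0000 added to the character class might look like the sketch below. The names `_UNSTORABLE` and `scrub_for_text_column` are hypothetical, and the choice to replace NUL with U+FFFD rather than strip it is an assumption based on the description above.

```python
import re

# Assumption: NUL and the surrogate range are the only code points to scrub.
_UNSTORABLE = re.compile(r"[\x00\ud800-\udfff]")
REPLACEMENT = "\ufffd"  # Unicode replacement character


def scrub_for_text_column(text: str) -> str:
    """Replace NUL and surrogate code points with U+FFFD."""
    return _UNSTORABLE.sub(REPLACEMENT, text)


cleaned = scrub_for_text_column("a\x00b\ud802c")
# The cleaned string now encodes as strict UTF-8 without raising.
print(cleaned.encode("utf-8"))
```

Keeping the NUL handling in a separate, database-specific helper would also be reasonable, since the comment treats it as a Postgres-only concern.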

In the end, I wrote my own force_text() utility. It would have saved me some effort if the documentation had told me that force_text() wasn't what I want when preparing arbitrary input text for a database text field.

I'd be happy to compose a few sentences to clarify this in the docs. Where does this documentation belong? I was startled when I learned Django can allow invalid-text str as input in perfectly ordinary usage; but it turns out it must because JSON allows them.

in reply to:  2 comment:3 by Carlton Gibson, 5 years ago

Component: Utilities → Documentation
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Cleanup/optimization

Replying to Adam Hooper:

I'd be happy to compose a few sentences to clarify this in the docs. Where does this documentation belong?

Hey Adam. Since you're happy to compose the patch, let's Accept this to see what you come up with. (I'm a bit _meh_ to be honest: this looks like more trouble than it's worth to explain but...)

The place for it would be the `force_text()` docs.

Thanks!

comment:4 by Carlton Gibson, 5 years ago

Summary: force_text() allows lone surrogates → Document that force_text() allows lone surrogates.

comment:5 by Mariusz Felisiak, 5 years ago

Resolution: → wontfix
Status: new → closed

Django 2.2 has reached the end of mainstream support and force_text() is deprecated in Django 3.0, so this ticket is not valid anymore.

comment:6 by Simon Charette, 5 years ago

Resolution: wontfix →
Status: closed → new
Summary: Document that force_text() allows lone surrogates. → Document that force_str() allows lone surrogates.

I think this issue still stands: force_text was just an alias for force_str.

comment:7 by Baptiste Mispelon, 5 years ago

I think the original issue came up because of the confusing usage of the word "text" in force_text().
Django used "text" in opposition to "bytes", but the reporter understood "text" in the context of Unicode, which has a slightly different meaning.

The original report said:

[...] Python allows me to create _non-text_ str objects

So I think the renaming of force_text to force_str fixed this issue by removing the association with the concept of "text".
As things are now, force_str has the same limitations as Python's str when it comes to Unicode issues like lone surrogates, so I don't believe we need to document them.
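The str behaviour this comparison relies on can be seen without Django. In the sketch below, plain str() stands in for force_str(); force_str() itself is not called, so treat the parallel as the comment's claim rather than something the code proves.

```python
lone = "\ud802"

# str() returns the value unchanged; nothing is validated or repaired.
print(str(lone) == lone)

# Strict encoding is where the surrogate is finally rejected.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# errors="replace" substitutes "?" for each unencodable code point.
print(lone.encode("utf-8", errors="replace"))
```

In other words, the str type itself never rejects the value; the error only appears at an encoding boundary, unless the caller picks a lenient error handler.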

comment:8 by Claude Paroz, 5 years ago

Resolution: → wontfix
Status: new → closed

Thanks Baptiste, convincing conclusion :-)

comment:9 by Simon Charette, 5 years ago

Makes sense to me.

comment:10 by Adam Hooper, 5 years ago

As the original reporter, I agree: calling it force_str() makes clear what it does.

I still perceive Django to lack functionality. I originally filed this bug report because I assumed the Django framework supported JSON-encoded requests. This led me to assume force_text() was a solution.

But Django docs don't mention JSON-encoded requests. So I think it's consistent to close this bug and declare, "Django doesn't support JSON-encoded requests, unless you invest serious effort."

in reply to:  10 comment:11 by Baptiste Mispelon, 5 years ago

Replying to Adam Hooper:

As the original reporter, I agree: calling it force_str() makes clear what it does.

I still perceive Django to lack functionality. I originally filed this bug report because I assumed the Django framework supported JSON-encoded requests. This led me to assume force_text() was a solution.

But Django docs don't mention JSON-encoded requests. So I think it's consistent to close this bug and declare, "Django doesn't support JSON-encoded requests, unless you invest serious effort."

Personally, I'm not familiar with JSON-encoded requests or what would be required for Django to support them, but if that's a feature you're interested in, you could try starting a discussion on the DevelopersMailingList.
