Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#29323 closed Bug (wontfix)

HTTPRequest QueryDict wrongly decodes binary encoded values

Reported by: Thomas Riccardi Owned by: nobody
Component: HTTP handling Version: 1.11
Severity: Normal Keywords:
Cc: Claude Paroz Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Thomas Riccardi)

application/x-www-form-urlencoded bodies allow binary values: just percent-escape (most of) the bytes.

It doesn't work with python2 and django.

In python2, QueryDict returns an unicode string for values, which makes it impossible to represent binary data.
Worse: it returns an unicode string whose unicode code points are the raw expected bytes (caused by lax behavior of str.decode('iso-8859-1')) so it silently fails.

Example:

q = QueryDict('foo=%00%7f%80%ff')
q['foo']
# actual: 
# u'\x00\x7f\x80\xff'
# expected:
# '\x00\x7f\x80\xff'

Relevant code:
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401

This was introduced by https://github.com/django/django/commit/fa02120d360387bebbbe735e86686bb4c7c43db2 while trying to accept broken user agents that send non-ascii as urlencoded raw values:

  • the python 3 version seems fine: it tries to decode before calling the query string parser (from urllib)
  • the python 2 version made a mistake: it decodes after calling the query string parser (and only on the value, not the key strangely)

I'm not sure how to fix this, as there are many locations where the key and value are re-encoded (appendlist does it too..)

In python 3 it fails too, but probably only because of the forced post-query-string-parser decoding into unicode string: to support binary values we don't want decoding at this point.

A test for python3:

    def test_binary_input(self):
        """
        QueryDicts must be able to handle binary.
        """
        q = QueryDict(b'foo=%00%7f%80%ff')
        self.assertListEqual(list(map(ord, q['foo'])), [0x00, 0x7F, 0x80, 0xFF])
AssertionError: Lists differ: [0, 127, 65533, 65533] != [0, 127, 128, 255]

First differing element 2:
65533
128

- [0, 127, 65533, 65533]
+ [0, 127, 128, 255]

Attachments (1)

29323.diff (2.7 KB ) - added by Claude Paroz 7 years ago.
Let bytes be bytes in QueryDict

Download all attachments as: .zip

Change History (13)

comment:1 by Simon Charette, 7 years ago

At this point the only version of Django still supporting Python 2 is only receiving security fixes so unless you can also reproduce on Python 3 I'm afraid this will not get backported.

comment:2 by Tim Graham, 7 years ago

Resolution: wontfix
Status: newclosed
Type: UncategorizedBug

comment:3 by Thomas Riccardi, 7 years ago

Description: modified (diff)
Resolution: wontfix
Status: closednew
Summary: HTTPRequest QueryDict wrongly decodes binary encoded values in python2HTTPRequest QueryDict wrongly decodes binary encoded values

OK.

I tested with python3 and it breaks too, but only because the code explicitly tries to generates an unicode string (bytes_to_text in appendlist).

I updated the description to also add the python3 test.

To support binary we would need to add an explicit option to get the bytes (maybe an encoding that asks to keep bytes?); not sure how to expose it in HTTPRequest.

comment:4 by Tim Graham, 7 years ago

Can you explain the use case for encoding bytes this way?

comment:5 by Simon Charette, 7 years ago

Is there a reason you can't base64 encode your binary content instead? It should be easy to turn it into bytes when necessary after that.

comment:6 by Tim Graham, 7 years ago

Resolution: wontfix
Status: newclosed

in reply to:  5 comment:7 by Thomas Riccardi, 7 years ago

(sorry I did not receive a mail notification for comments, just for issue status update...)

Replying to Tim Graham:

Can you explain the use case for encoding bytes this way?

Replying to Simon Charette:

Is there a reason you can't base64 encode your binary content instead? It should be easy to turn it into bytes when necessary after that.

Indeed base64 seems better fitted (and is more efficient), but today our API accept both raw bytes and base64. It seems allowed by the various RFCs, and it works with django 1 and python 2 (with the addition of str.decode('iso-8859-1')).
It breaks with python 3: we currently cannot upgrade to python 3 or django 2 without breaking our API.

comment:8 by Tim Graham, 7 years ago

Could you cite your source? I found:

[application/x-www-form-urlencoded] is not suitable for the persistence of binary content. Therefore, it is recommended that forms capable of containing binary content use another serialization method.

and:

application/x-www-form-urlencoded: the values are encoded in key-value tuples separated by '&', with a '=' between the key and the value. Non-alphanumeric characters are percent encoded: this is the reason why this type is not suitable to use with binary data (use multipart/form-data instead)

Maybe a custom middleware could allow your use case.

If you have a fix to offer, we could evaluate it.

in reply to:  8 comment:9 by Thomas Riccardi, 7 years ago

Replying to Tim Graham:

Could you cite your source? I found:

[application/x-www-form-urlencoded] is not suitable for the persistence of binary content. Therefore, it is recommended that forms capable of containing binary content use another serialization method.

I believe this is a spec related to XHTML. Firefox [deprecated it in firefox 19](https://developer.mozilla.org/en-US/docs/Archive/Web/XForms), XForms 2 is targeted for XML languages (see abstract: https://www.w3.org/TR/xforms20/ ).

Anyway, it is just a recommendation: the format technically allows serializing binary: at worst each byte is percent-encoded.

The html5 spec explains the [application/x-www-form-urlencoded byte serializer](https://url.spec.whatwg.org/#concept-urlencoded-byte-serializer): it takes a bytes array as input, and returns a string. A Previous step documents how to generate the bytes array via encoding: https://url.spec.whatwg.org/#concept-urlencoded-serializer.

So this may be a stretch to serialize raw binary and not a string that we first encode, but the format would be the same. Only the user API would change: bytes instead of string.

and:

application/x-www-form-urlencoded: the values are encoded in key-value tuples separated by '&', with a '=' between the key and the value. Non-alphanumeric characters are percent encoded: this is the reason why this type is not suitable to use with binary data (use multipart/form-data instead)

Either "not suitable" means it won't work, which is wrong, or that it's not recommended, but still would work.

Maybe a custom middleware could allow your use case.

If you have a fix to offer, we could evaluate it.

I'm OK to write a fix (I initially started to write a simple fix and unit tests, but I encountered more issues, as previously explained), but I would require some guidance. The open questions I have are:

  • QueryDict output type: it should return bytes instead of string in this case. How do we expose that? is it OK to have variable types?
  • How to ask for bytes? Do we cheat with the encoding parameter? Not sure who is controlling it: the request sender or the developer?

Thanks,
Thomas

comment:10 by Tim Graham, 7 years ago

Cc: Claude Paroz added

Claude, do you have any expertise to lend here? If not, perhaps we can ask for advice on the DevelopersMailingList.

comment:11 by Claude Paroz, 7 years ago

It would be possible to let raw bytes in QueryDict when a for value cannot be properly decoded (I'll attach a patch doing that). This would however change the (tested in handlers.tests.HandlerTests.test_non_ascii_query_string) behavior when Django receive a badly-encoded query string like url?want=caf%E9 (latin-1 encoded instead of utf-8 encoded).
With the current code, request.POST['want'] contains 'caf�' (i.e. 'caf\ufffd'). If we let bytes be bytes, request.POST['want'] would result in the binary b'caf\xe9' instead.

My main concern would be that developers should then add much defensive coding every time they access request.POST values. Typically, the currently valid request.POST.get('want', '').startswith('caf') test in a view could easily crash by posting specially-crafted values.

by Claude Paroz, 7 years ago

Attachment: 29323.diff added

Let bytes be bytes in QueryDict

comment:12 by Claude Paroz, 7 years ago

An alternative would be to let undecodable quoted values as is in request.POST: request.POST['want'] -> 'caf%E9'.
Then if the app knows some posted content will contain such a value, it can take care of unquoting the value itself. I think it would add less backwards incompatibility.

Note: See TracTickets for help on using tickets.
Back to Top