#10571 closed Bug (fixed)
FakePayload Truncates Unicode Content
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Testing framework | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Anand Kumria, Peter van Kampen | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hi,
While testing an API for posting new objects using JSON, I think I've tripped over a problem with cStringIO and Unicode. My test had a few line like this, and the raw_post_data
in the receiving view was truncated.
json = u'{"name": "Rick"}' response = self.c.post('/api/person/', json, content_type="application/json")
Looking a little closer, it seems that cStringIO doesn't treat Unicode objects very well. In particular, because the codec is utf-16, the number of bytes required for reading is doubled.
Here's a demonstration (I hope it's sufficient):
>>> from django.test.client import FakePayload >>> json = u'{"name": "Rick"}' >>> payload = FakePayload(json) >>> len(json) 16 >>> payload._FakePayload__len 16 >>> payload.read() '\x00{\x00"\x00n\x00a\x00m\x00e\x00"\x00:' >>> '\x00{\x00"\x00n\x00a\x00m\x00e\x00"\x00:'.decode("utf-16") u'{"name":' >>> import cStringIO >>> import StringIO >>> cStringIO.StringIO(json).read() '\x00{\x00"\x00n\x00a\x00m\x00e\x00"\x00:\x00 \x00"\x00R\x00i\x00c\x00k\x00"\x00}' >>> len(cStringIO.StringIO(json).read()) 32 >>> StringIO.StringIO(json).read() u'{"name": "Rick"}' >>> len(StringIO.StringIO(json).read()) 16 >>> cStringIO.StringIO(json).read(16) '\x00{\x00"\x00n\x00a\x00m\x00e\x00"\x00:' >>> cStringIO.StringIO(json).read(16).decode("utf-16") u'{"name":' >>>
A direct fix would be to use StringIO instead of cStringIO, but I appreciate the desire to keep the speed-up, especially for unit testing.
Please correct me if I've misinterpreted the use POST in the test client.
Thanks,
Rick
PS: This may be related to #9480.
Attachments (3)
Change History (27)
comment:1 by , 16 years ago
milestone: | → 1.1 |
---|---|
Triage Stage: | Unreviewed → Accepted |
comment:2 by , 16 years ago
comment:3 by , 16 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
I can't reproduce this, and since Russ can't either I'm marking worksforme. If you can provide more information along the lines Russ asked please feel free to reopen.
comment:4 by , 16 years ago
I suspect Russ is right that it's something to do with my environment. I added a test to the client regression tests as suggested, and it's failing only on one of my installed version of Python. For reference, it's the Enthought Python Distribution, which is a big bundle of scientific packages. My build of Python 2.5 from source is fine. I take this as an indicator that it's probably not Django, but the Enthought build. I may be headed over to their bug tracker shortly.
Thanks for looking at this!
--Rick
P.S. Here's the unnecessary test case:
class FakePayloadTests(TestCase): def test_fake_payload_stringio(self): # Ensure that StringIO doesn't truncate Unicode strings json = u'{"name": "Rick"}' payload = FakePayload(json) new_json = payload.read() self.assertEqual(json, new_json)
comment:5 by , 16 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
After a bit more work, I think this is an issue that is possible for others to trip over. According to this feature request, cStringIO returns the raw binary data. For whatever reason, both the Enthought Python build, and production 2.6 release use UTF-16 as the default encoding for Unicode strings. This makes me think that cStringIO is behaving correctly.
Here's the setup I'm using:
- Machine: PowerBook G4
- OS: OS X 10.4.11
- Python: 2.6.1 python-2.6.1-macosx2008-12-06.dmg
As a test, I wrote a view that deserializes a JSON string, and then regurgitates it back as a file. This ensures that the JSON data isn't getting mangled on the way, and mirrors what I'm doing.
def return_json_file(request): "A view that parses and returns a JSON string as a file." obj_dict = simplejson.loads(request.raw_post_data) obj_json = simplejson.dumps(obj_dict, cls=DjangoJSONEncoder, ensure_ascii=False) response = HttpResponse(obj_json, status=200, mimetype='application/json') response['Content-Disposition'] = 'attachment; filename=testfile.json' return response
For the test, I tried two separate cases, one with a simple string as in my previous test case, the other with some Japanese characters pulled from the unicode.html
template in the regression tests. I tested these using both the post and response mechanism, and FakePayload directly.
class UnicodePostTests(TestCase): def test_simple_fakepayload(self): json = u'{"english": "mountain pass"}' payload = FakePayload(json) self.assertEqual(payload.read(), json) def test_fakepayload(self): json = u'{"japanese": "\u5ce0 (\u3068\u3046\u3052 t\u014dge)"}' payload = FakePayload(json) self.assertEqual(payload.read(), json) def test_simple_unicode_payload(self): json = u'{"english": "mountain pass"}' response = self.client.post("/test_client_regress/parse_unicode_json/", json, content_type="application/json") # N.B: This relys on the simplejson formatting self.assertEqual(response.content, json) def test_unicode_payload(self): json = u'{"japanese": "\u5ce0 (\u3068\u3046\u3052 t\u014dge)"}' response = self.client.post("/test_client_regress/parse_unicode_json/", json, content_type="application/json") # N.B: This relys on the simplejson formatting self.assertEqual(response.content, json)
Here are the results, using cStringIO (all tests fail).
cable:~/Projects/Django/tests rpwagner$ ./runtests.py --settings=testsite.settings test_client_regress ====================================================================== ERROR: test_simple_unicode_payload (regressiontests.test_client_regress.models.UnicodePostTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/models.py", line 634, in test_simple_unicode_payload content_type="application/json") File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/test/client.py", line 306, in post response = self.request(**r) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/handlers/base.py", line 92, in get_response response = callback(request, *callback_args, **callback_kwargs) File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/views.py", line 71, in return_json_file obj_dict = simplejson.loads(request.raw_post_data) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/__init__.py", line 307, in loads return _default_decoder.decode(s) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 319, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 338, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded ====================================================================== ERROR: test_unicode_payload (regressiontests.test_client_regress.models.UnicodePostTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/models.py", line 642, in test_unicode_payload content_type="application/json") File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/test/client.py", line 306, in post response = self.request(**r) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/handlers/base.py", line 92, in get_response response = callback(request, *callback_args, **callback_kwargs) File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/views.py", line 71, in return_json_file obj_dict = simplejson.loads(request.raw_post_data) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/__init__.py", line 307, in loads return _default_decoder.decode(s) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 319, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/decoder.py", line 338, in raw_decode raise ValueError("No JSON object could be decoded") ValueError: No JSON object could be decoded ====================================================================== FAIL: test_fakepayload (regressiontests.test_client_regress.models.UnicodePostTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/models.py", line 638, in test_fakepayload self.assertEqual(payload.read(), json) AssertionError: '\x00{\x00"\x00j\x00a\x00p\x00a\x00n\x00e\x00s\x00e\x00"\x00:\x00 \x00"' != u'{"japanese": "\u5ce0 (\u3068\u3046\u3052 t\u014dge)"}' ====================================================================== FAIL: test_simple_fakepayload (regressiontests.test_client_regress.models.UnicodePostTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/models.py", line 633, in test_simple_fakepayload self.assertEqual(payload.read(), json) AssertionError: '\x00{\x00"\x00e\x00n\x00g\x00l\x00i\x00s\x00h\x00"\x00:\x00 \x00"\x00m' != u'{"english": "mountain pass"}' ---------------------------------------------------------------------- Ran 53 tests in 1.432s FAILED (failures=2, errors=2)
Now, from experience, I know that the data showing up in the view is truncated. So I go in to django/test/client.py
, and hack out the import of cStringIO to force the use of the pure-Python StingIO, and the error moves to the wsgi handler. I'm not worried about that (yet), because this shows me that in my case it's cStringIO that's causing problems.
cable:~/Projects/Django/tests rpwagner$ ./runtests.py --settings=testsite.settings test_client_regress ====================================================================== ERROR: test_unicode_payload (regressiontests.test_client_regress.models.UnicodePostTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/models.py", line 642, in test_unicode_payload content_type="application/json") File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/test/client.py", line 306, in post response = self.request(**r) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/handlers/base.py", line 92, in get_response response = callback(request, *callback_args, **callback_kwargs) File "/Users/rpwagner/Projects/Django/tests/regressiontests/test_client_regress/views.py", line 71, in return_json_file obj_dict = simplejson.loads(request.raw_post_data) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/handlers/wsgi.py", line 205, in _get_raw_post_data size=content_length) File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/handlers/wsgi.py", line 72, in safe_copyfileobj fdst.write(buf) UnicodeEncodeError: 'ascii' codec can't encode character u'\u5ce0' in position 14: ordinal not in range(128) ---------------------------------------------------------------------- Ran 51 tests in 3.237s FAILED (errors=1)
I'm hoping that my test with the Japanese characters will cause the same failures on other machines. I suspect any multi-byte characters would reproduce this.
Thanks again for the help, and please let me if I need to be more specific, or if I'm not misunderstanding some functionality.
--Rick
follow-up: 7 comment:6 by , 16 years ago
Rick - thanks so much for taking the trouble to work out such a complete set of demonstration cases.
I'm going to qualify my comments with the following: I would not, on my best day, call Unicode handling one of my strengths. It's entirely possible I've got everything bass-ackwards here. I'll be the first to defer to anyone with superior knowledge in this case.
That said, there appears to be several problems at play here.
- Firstly, FakePayload itself. FakePayload isn't really intended to be an externally visible container - it's an internal facility for storing file content so it can be passed to the WSGI input handler. This handler expects a byte stream, so it isn't really appropriate to be putting unencoded unicode data into the FakePayload. Instead, what you should be putting into the FakePayload is an encoded byte string; what you get back will be a similarly encoded string. Internally, the file handler uses the FakePayload to compose a wsgi input stream; unicode content isn't appropriate in this context. As a result, your payload tests should look more like:
def test_fakepayload(self): json = u'{"japanese": "\u5ce0 (\u3068\u3046\u3052 t\u014dge)"}' payload = FakePayload(json.encode('utf-8')) self.assertEqual(payload.read().decode('utf-8'), json)
I've used utf-8 here, but the actual encoding shouldn't matter (as long as it is symmetrical between encode and decode) - the point is that you need to explicitly encode from unicode to a byte string. On my machine, which uses a default charset of ascii, I don't need to make this modification to the simple_fakepayload test - I suspect that your default charset of utf-16 will mean that you will need to make the same change on both tests.
- However, that said, Django's test client should be doing this encoding for you when you provided unicode data as file content. This means that the post() method of the test client (django/test/client.py, line 285), which currently starts:
if content_type is MULTIPART_CONTENT: post_data = encode_multipart(BOUNDARY, data) else: post_data = data
should, in fact, read:if content_type is MULTIPART_CONTENT: post_data = encode_multipart(BOUNDARY, data) else: post_data = data.encode('utf-8')
This encoding would match the Content-Type header that is transmitted with the POST; ideally, we should be checking for a user-provided Content-Type header so we can encode to a different charset if required.
- response.content holds the raw encoded byte stream from the response. If you want to compare against an original unicode input, you will need to encode that byte stream to the same encoding. This means your final assertion in the test_unicode_payload test should read
self.assertEqual(response.content.decode('utf-8'), json)
In this case, it needs to be utf-8, because that's the content type that is being returned in the response (again, this is the default Content-Type header on HTTP responses).
You are correct that using StringIO rather than cStringIO makes problem (1) go away, but as far as I can make out, that is just hiding the deeper problems (2) and (3).
So - to confirm whether or not I understand anything about unicode... if you do the following:
- Add the .encode() and .decode() calls to the FakePayload tests.
- Add the .encode('utf-8') to the post() method on the test client.
- Add the .decode('utf-8') to the response.content check in the payload tests.
I believe that all your tests should pass. Can you confirm that this is the case? Or do I need to go back to the books and read up on unicode some more?
comment:7 by , 16 years ago
Replying to russellm:
Rick - thanks so much for taking the trouble to work out such a complete set of demonstration cases.
Gladly. I respect the time you put in developing and maintaining Django.
I'm going to qualify my comments with the following: I would not, on my best day, call Unicode handling one of my strengths. It's entirely possible I've got everything bass-ackwards here. I'll be the first to defer to anyone with superior knowledge in this case.
I assure you, that would not be me. This has been one of those "learning experiences" we all appreciate.
So - to confirm whether or not I understand anything about unicode... if you do the following:
- Add the .encode() and .decode() calls to the FakePayload tests.
- Add the .encode('utf-8') to the post() method on the test client.
- Add the .decode('utf-8') to the response.content check in the payload tests.
I believe that all your tests should pass. Can you confirm that this is the case? Or do I need to go back to the books and read up on unicode some more?
You were correct in all cases. I went ahead and added the check for a charset on the content-type argument to set the encoding based on a user-supplied encoding. I also added a test using KOI8-R, just to step away from ASCII and UTF-8.
The check for the charset is not the cleanest, but it's working:
if content_type is MULTIPART_CONTENT: post_data = encode_multipart(BOUNDARY, data) else: # Encode the content so that the byte representation is correct. charset = settings.DEFAULT_CHARSET if 'charset=' in content_type: left, charset = content_type.split('charset=') post_data = data.encode(charset)
Likewise, I updated my test view to use the correct encoding when dumping the dict to JSON, and in the response:
def return_json_file(request): "A view that parses and returns a JSON string as a file." charset = settings.DEFAULT_CHARSET if 'charset=' in request.META['CONTENT_TYPE']: left, charset = request.META['CONTENT_TYPE'].split('charset=') # This just checks that the uploaded data is JSON obj_dict = simplejson.loads(request.raw_post_data) obj_json = simplejson.dumps(obj_dict, encoding=charset, cls=DjangoJSONEncoder, ensure_ascii=False) response = HttpResponse(obj_json, status=200, mimetype='application/json; charset=' + charset) response['Content-Disposition'] = 'attachment; filename=testfile.json' return response
Finally, I added your decode suggestions to the regression tests, and the new test using KOI-8.
class UnicodePostTests(TestCase): # FakePayload is not intended to be externally accessible. # These are also exercised using the _payload_ tests. def test_simple_fakepayload(self): "Make sure that FakePayload returns what it gets" #Regression test for #10571 json = u'{"english": "mountain pass"}' payload = FakePayload(json.encode('utf-8')) self.assertEqual(payload.read().decode('utf-8'), json) def test_fakepayload(self): "Make sure that FakePayload returns what it gets, outside of ASCII" #Regression test for #10571 json = u'{"japanese": "\u5ce0 (\u3068\u3046\u3052 t\u014dge)"}' payload = FakePayload(json.encode('utf-8')) self.assertEqual(payload.read().decode('utf-8'), json) def test_simple_unicode_payload(self): "Test POSTing a simple JSON document" #Regression test for #10571 json = u'{"english": "mountain pass"}' response = self.client.post("/test_client_regress/parse_unicode_json/", json, content_type="application/json") # N.B: This relys on the simplejson formatting self.assertEqual(response.content, json) def test_unicode_payload_utf8(self): "Test POSTing data outside of ASCII" #Regression test for #10571 json = u'{"dog": "собака"}' response = self.client.post("/test_client_regress/parse_unicode_json/", json, content_type="application/json; charset=utf-8") # N.B: This relys on the simplejson formatting self.assertEqual(response.content.decode('utf-8'), json) def test_unicode_payload_kio8(self): "Test POSTing data outside of ASCII and UTF-8" #Regression test for #10571 json = u'{"dog": "\u044f\u2502\u043f\u256c\u043f\u2560\u043f\u255f\u043f\u2568\u043f\u255f"}' response = self.client.post("/test_client_regress/parse_unicode_json/", json, content_type="application/json; charset=koi8-r") # N.B: This relys on the simplejson formatting self.assertEqual(response.content.decode('koi8-r'), json)
As you point out, FakePayload isn't really meant to be exposed, so the first two tests should probably be removed. After all, the test client's functionality is being tested by the use of the post method. Which probably means that the correct title for this ticket should have been "Test Client Does Not Properly Encode POST Data".
I hope this is progress, since all of the tests pass (really all, I checked to ensure that I didn't break anything). All of the above code is in a patch I am about to attach to this ticket. If there is some particular cleaning or improvement you see, let me know, and I'll try to oblige.
Thanks,
Rick
comment:8 by , 16 years ago
Has patch: | set |
---|
Adding patch. I hope I don't end up doing this a bunch of times and spamming everyone and the RSS feed.
by , 16 years ago
Attachment: | unicode_payload.patch added |
---|
Patch to encode test client post content and tests.
comment:9 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
comment:10 by , 16 years ago
comment:11 by , 16 years ago
Rick - thanks for your most excellent assistance tracking this problem. I owe you one (1) ISO standard, correctly UTF-16 encoded lollipop. :-)
comment:12 by , 16 years ago
For the record, the test added here using KOI-8R isn't particularly correct. The content-type of an HTTP request has no expected influence on the encoding of the HTTP response and in lieu of an Accept-Charset header, the server is permitted to return any encoding at all, so testing for a particular encoding is only checking the view implementation works, not the Django's HTTP implementation or test client is any way correct. So this test is not incorrect, but it doesn't actually show a regression if the assertion fails.
Only noting this here because the test is going to change as part of the http-wsgi-improvements branch work (Chris Cahoon's summer of code work) and if anybody traces the change back to this ticket, there's now an explanation (to be clear: the test itself isn't enforcing incorrect behaviour at the moment. It's a valid test, but it's not adding any value currently, since it's not testing anything general about Django.)
comment:13 by , 14 years ago
Resolution: | fixed |
---|---|
Severity: | → Normal |
Status: | closed → reopened |
Type: | → Uncategorized |
This problem seems to still occur in Django 1.3 when using client.put() to PUT fake data to the application. POST works correctly.
It's reproducible with a simple app that echoes back request.raw_post_data from a view:
def index(request): return HttpResponse(request.raw_post_data)
class SimpleTest(TestCase): def test_basic_addition(self): response = self.client.post('/', data=u'{"str":"Hello World"}', content_type='text/javascript') print response.content response = self.client.put('/', data=u'{"str":"Hello World"}', content_type='text/javascript') print response.content
Test result:
Creating test database for alias 'default'... {"str":"Hello World"} {"str" . ---------------------------------------------------------------------- Ran 1 test in 0.084s
comment:14 by , 14 years ago
Easy pickings: | unset |
---|
I just encountered this same problem (Django 1.2.3), but it looks like it's still not fully solved in Django 1.3 from the above comment. It seems like the real problem here is that FakePayload only works when the content is a str and not when it's unicode, and yet it doesn't advertise this correctly. If it's passed a unicode, it happily accepts it, failing later when .read() is called. This seems like a classic duck-typing bug, but since it's very common to expect that functions will accept either str or unicode, it seems like the correct behavior here would be to fail hard when FakePayload.init is passed a unicode object.
In my particular case, I was using a StringIO as a the file-like object. StringIO.read() returns a str if it's been given strings and unicode if it's been given unicode. I seeded it with what to me looked like a string but turned out to be a unicode, and no one along the way checked to see that the output of f.read() (where f is the StringIO object) was returning unicode. A simple check in FakePayload would have saved me a couple of hours of debugging what seemed like very mysterious behavior.
comment:15 by , 14 years ago
Patch needs improvement: | set |
---|
unicode_payload.patch fails to apply cleanly on to trunk
comment:16 by , 14 years ago
Type: | Uncategorized → Bug |
---|
by , 14 years ago
Attachment: | unicode_payload_put.patch added |
---|
comment:17 by , 14 years ago
UI/UX: | unset |
---|
@pterk
comments:
in post; why not just call it 'post_data' to make it clear rather than replacing 'data'
in put; why not call is 'put_data' also you have a commented out debug line
Also I think you want to explain the removal of query_string? Whilst I can see it is not being used, that removal needs someone who understands that section better to confirm that it is correct.
tests/regressiontests/test_client_regress/models.py
- trailing whitespace
- unnecessary line with whitespace
- I think you should also exclude 'delete' since a GET-like request is one without side-effects
by , 14 years ago
Attachment: | unicode_payload_put.2.patch added |
---|
Updated patch addressing wildfire's comments. The deletion of the query_string handling in the put method still needs to be looked at, as for the deletion of 'put' , 'delete' in test_get_like_requests.
comment:18 by , 14 years ago
Patch looks a lot better. Time to hassle someone with more knowledge about query_string handling and if PUT and DELETE are GET-like requests.
comment:19 by , 14 years ago
Patch needs improvement: | unset |
---|---|
Triage Stage: | Accepted → Ready for checkin |
I've taken a look at the change to query_string in PUT and entirely agree with it - the body data shouldn't be put there - and the removal of put and delete from get_like_requests, since I'm taking it to mean "requests with side effects".
Thus, I'm marking this RFC.
comment:20 by , 14 years ago
Cc: | added |
---|
comment:21 by , 14 years ago
Cc: | added |
---|
Thanks for going to the trouble of trying to work out a test case, but there is something else going on here. I don't get the same behavior that you are reporting for your demonstration case:
Now, this could be caused by a discrepancy in the runtime environment, but you haven't provided any details on how your environment is configured. I've fiddled around for a bit with the test case that I provided for #9480, but I couldn't get that to break either.
I don't doubt that you're seeing a truncation problem, and I can certainly see how UTF-16 could be the missing piece. Your hunch that this is related to #9480 is probably correct.
The best way to help close out this bug is to provide a failing test case against Django itself. If you can drop a failing test case into the end of
tests/regressiontests/test_client_regress/models.py
(see the docs for details on how to run the Django test suite), then it will be much easier to narrow down and fix the problem.