Opened 11 years ago
Closed 11 years ago
#22721 closed Bug (wontfix)
Fallback encoding support on request.GET required for MSIE
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | HTTP handling | Version: | 1.7-beta-2 |
Severity: | Normal | Keywords: | |
Cc: | linovia, kevin@… | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Originally described at https://github.com/tomchristie/django-rest-framework/pull/1590 before I realised this was a Django issue.
While testing a new system in pre-production we found a bug when MSIE was being used; basically, the way MSIE handles query strings is interesting, to say the least. The system locale can affect how the query string is sent in the request, as can the previous website's encoding and, heck, even where you launch the browser session from.
In our testing, MSIE did not urlquote the querystring, but instead sent it in what looked like raw latin1. The problem is that if you access this query string through request.GET in a Django view, you get an encoding error if you happen to be using Python 3.x. From my findings, this was introduced in Django 1.6, and worked correctly under Django 1.5 (at least according to the PR I did for Django Rest Framework showcasing this bug).
It seems to me that Django should have some fallback encoding support for this - even if we ignore the fact that MSIE really should be urlquoting the querystring.
Could it be that Django 1.6 somehow introduced a regression over Django 1.5?
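To make the idea of "fallback encoding support" concrete, here is a minimal sketch of what such a fallback might look like. The helper name decode_query_bytes and its parameters are hypothetical and not part of Django; the sketch also assumes that latin-1 is the right second guess, which is exactly the part that cannot be known in general.

```python
def decode_query_bytes(raw, primary="utf-8", fallback="latin-1"):
    """Hypothetical helper (not a Django API): decode raw query-string bytes.

    Try the primary charset strictly first; if the bytes are not valid
    in that charset, decode them with the fallback charset instead.
    """
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        return raw.decode(fallback)


# The same text, as it might arrive from two differently behaving browsers.
utf8_bytes = "q=æøå".encode("utf-8")
latin1_bytes = "q=æøå".encode("latin-1")

print(decode_query_bytes(utf8_bytes))
print(decode_query_bytes(latin1_bytes))
```

A heuristic like this can still guess wrong: some latin-1 byte sequences are also valid UTF-8 and would be decoded as UTF-8 without any error being raised.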
Attachments (1)
Change History (9)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Cc: | added |
---|
by , 11 years ago
Attachment: | fallback_encoding_msie.diff added |
---|
Patch for test case show-casing the MSIE behaviour under certain conditions
comment:3 by , 11 years ago
After some work I managed to write a test case show-casing the behaviour generated by MSIE in certain conditions, when using Django 1.6+ and Python3.
I've bisected the actual commit where the change occurred: https://github.com/django/django/commit/7fcd6aa6695b39370154d6993cdbb3ba4363de91
comment:4 by , 11 years ago
If I understand your report correctly, MSIE can send non-ASCII query strings in a variety of encodings, depending on several factors.
Currently Django assumes UTF-8 for any non-ASCII data found in the query string. Using settings.DEFAULT_CHARSET instead could be an improvement.
But that doesn't address your problem. How do you propose to determine the appropriate encoding?
comment:5 by , 11 years ago
Cc: | added |
---|
comment:6 by , 11 years ago
Right, my previous report might actually have been a bit wrong - I don't think it's a case of MSIE sending the wrong encoding (in my case at least). I'm not very well-versed in encodings, but from what I can see happening, and from what my attached test demonstrates:
- MSIE will in some cases not properly urlencode a query string, meaning that if I for example have a URL of /?q=æøå, then raw "æøå" will actually be passed on in the request (in what I think is actually, at least in my case, a unicode string)
- If this happens, Django throws up (as per my attached tests) - this happened after the mentioned commit during Django 1.6 betas, and only if using Python 3. Django will try to encode that unicode string as iso-8859-1, then decode it as UTF-8. This works if the browser sent a properly urlencoded query string, but not if the browser sent the raw unicode string.
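The two cases above can be reproduced without Django at all. The following is a simplified model, not Django's actual code: the function names wsgi_query_string and django_style_decode are made up for illustration, and the assumption is that the browser put latin-1 bytes on the wire in the failing case.

```python
def wsgi_query_string(wire_bytes):
    # Under WSGI on Python 3 the server hands QUERY_STRING over as a str
    # obtained by decoding the wire bytes as latin-1.
    return wire_bytes.decode("latin-1")


def django_style_decode(query_string):
    # Simplified model of the behaviour described above: re-encode as
    # iso-8859-1 to get the wire bytes back, then decode them as UTF-8.
    return query_string.encode("iso-8859-1").decode("utf-8")


# Case 1: properly urlencoded query string (pure ASCII on the wire).
quoted = django_style_decode(wsgi_query_string(b"q=%C3%A6%C3%B8%C3%A5"))
print(quoted)  # still percent-encoded; unquoting is a separate, later step

# Case 2: raw, non-urlencoded latin-1 bytes on the wire.
raw_wire = "q=æøå".encode("latin-1")
try:
    django_style_decode(wsgi_query_string(raw_wire))
    raised = False
except UnicodeDecodeError:
    raised = True
print("UnicodeDecodeError raised:", raised)
```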
comment:7 by , 11 years ago
Indeed, the browser should urlencode the data so that the query string only contains ASCII data. Then any ASCII-compatible encoding, like utf-8, can be used to encode or decode it. Here MSIE doesn't urlencode. So we need a way to determine which encoding it used in order to decode the data properly.
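As a small illustration of that point, using only the standard library's urllib.parse (the sample value "æøå" is taken from the report; percent-encoding it as UTF-8 is an assumption about what a well-behaved browser would do):

```python
from urllib.parse import quote, unquote

# Percent-encode the non-ASCII text; quote() uses UTF-8 by default.
quoted = quote("æøå")
print(quoted)

# The quoted form is pure ASCII, so every ASCII-compatible encoding
# produces exactly the same bytes for it.
as_ascii = quoted.encode("ascii")
same_bytes = as_ascii == quoted.encode("utf-8") == quoted.encode("latin-1")
print(same_bytes)

# Unquoting (UTF-8 by default) recovers the original text.
restored = unquote(quoted)
print(restored)
```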
There's no such thing as a "raw Unicode string". Unicode is an abstract representation. You can't send Unicode over the network. The browser sends a bunch of bytes on the wire, encoded in a given encoding.
Then the WSGI server arbitrarily decodes these bytes with the latin-1 encoding (everyone knows that this part of WSGI on Python 3 is ridiculous.) Django attempts to be less stupid by reencoding with the latin-1 encoding, recovering the original bytes sent by the browser, and decoding with an appropriate charset, currently hard coded to utf-8.
Ignore the latin-1 (= ISO-8859-1) entirely, it's required to work around WSGI and I swear it's correct. What we need is a way to determine the appropriate charset.
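For anyone doubting the latin-1 step, here is a quick sketch of why it is harmless: latin-1 maps every byte value to exactly one code point, so decoding and re-encoding is lossless. The variable names are illustrative only, and the example assumes the browser sent UTF-8 bytes.

```python
# Every possible byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
print(round_trip_ok)

# What the browser put on the wire (assumed here to be UTF-8).
wire = "q=æøå".encode("utf-8")

# What the WSGI server exposes on Python 3: the bytes decoded as latin-1.
environ_value = wire.decode("latin-1")

# Re-encoding as latin-1 recovers the original wire bytes exactly ...
recovered = environ_value.encode("latin-1")

# ... which can then be decoded with the charset the browser really used.
text = recovered.decode("utf-8")
print(text)
```

The only open question is the charset passed to the final decode call, which is the part hard coded to utf-8.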
comment:8 by , 11 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
I'm going to close this ticket because we don't know what we could do to work around this bug.
Please reopen if you can suggest a better algorithm for selecting an appropriate charset.
Any chance you can write a test case for our test suite and bisect when the behavior changed? That would be really helpful.