Opened 14 years ago

Closed 12 years ago

Last modified 12 years ago

#15152 closed Bug (fixed)

Common middleware raises UnicodeDecodeError if receives non-ASCII QUERY_STRING from buggy web server

Reported by: Loststylus
Owned by: Aymeric Augustin
Component: Core (Other)
Version: 1.2
Severity: Normal
Keywords: common middleware
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

FlaxCrawler seems to like my site and visits it very often, always getting a 500 error.

Here's the common traceback:

Traceback (most recent call last):

 File "/usr/local/lib/python2.6/dist-packages/django/core/handlers/base.py", line 80, in get_response
   response = middleware_method(request)

 File "/usr/local/lib/python2.6/dist-packages/django/middleware/common.py", line 79, in process_request
   newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 3: ordinal not in range(128)


<WSGIRequest
GET:<QueryDict: {u'q': [u'\u0427\u0430\u0439\u043a\u0430']}>,
POST:<QueryDict: {}>,
COOKIES:{},
META:{'CONTENT_LENGTH': '',
 'CONTENT_TYPE': '',
 'HTTP_ACCEPT': 'text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2',
 'HTTP_ACCEPT_ENCODING': 'gzip,defalte',
 'HTTP_ACCEPT_LANGUAGE': 'ru,en-us;q=0.7,en;q=0.3',
 'HTTP_CACHE_CONTROL': 'no-cache',
 'HTTP_CONNECTION': 'close',
 'HTTP_HOST': '{sorry, i've got that censored out}',
 'HTTP_PRAGMA': 'no-cache',
 'HTTP_USER_AGENT': 'FlaxCrawler/1.0',
 'PATH_INFO': u'/articles/ajaxsearch',
 'QUERY_STRING': 'q=\xd0\xa7\xd0\xb0\xd0\xb9\xd0\xba\xd0\xb0',
 'REMOTE_ADDR': '92.241.173.132',
 'REQUEST_METHOD': 'GET',
 'SCRIPT_NAME': u'',
 'SERVER_NAME': '{sorry, i've got that censored out}',
 'SERVER_PORT': '80',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'wsgi.errors': <flup.server.fcgi_base.TeeOutputStream object at 0x2634850>,
 'wsgi.input': <flup.server.fcgi_base.InputStream object at 0x2634610>,
 'wsgi.multiprocess': True,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)}>

The major problem I see here is that the developer cannot do anything to catch the error :(

Change History (18)

comment:2 by Luke Plant, 14 years ago

There is no "specially formatted unicode-like" query string here - it is a straightforward UTF-8 encoded string.

The strange thing here is that non-ASCII characters are ending up in META['QUERY_STRING']. With browsers, non-ASCII characters get percent-encoded. So the request is simply wrong - this is definitely a bug in the crawler (but that is irrelevant).
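For illustration, here is a minimal sketch of the percent-encoding a well-behaved client would apply to the search term from the report. It uses Python 3's standard `urllib.parse` rather than anything Django-specific, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# The search term from the reported request (GET['q'] in the dump above).
term = '\u0427\u0430\u0439\u043a\u0430'

# A conforming client sends the UTF-8 bytes percent-encoded, so the
# query string that reaches the server is pure ASCII.
encoded_query = 'q=' + quote(term)
print(encoded_query)  # q=%D0%A7%D0%B0%D0%B9%D0%BA%D0%B0

# Decoding the percent-escapes yields the original term again.
decoded = unquote(encoded_query[2:])
```

The escaped bytes are the same ones that show up raw in the reported QUERY_STRING, which is why the value is plain UTF-8 that simply was never escaped.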

The next question is whether this is a bug in the web server, which appears to be flup. Looking at the spec for QUERY_STRING in CGI (http://ken.coar.org/cgi/draft-coar-cgi-v11-03.txt) which is the basis of the WSGI spec (http://www.python.org/dev/peps/pep-0333/#environ-variables), the value of QUERY_STRING should not contain these values.

So AFAICS, this is a bug in flup, because it should never be passing on values like these. That doesn't mean we shouldn't fix it in Django to stop 500 errors being produced. The best behaviour would be to return a '400 Malformed request' error if QUERY_STRING has any non-ascii chars, but we probably don't want to do that in the bit of code that is raising this exception, but somewhere like WSGIRequest.__init__ or BaseHandler.get_response. But this will add overhead to every request, so I'm not sure what to do.

There is a way to catch this at the developer level - install an exception middleware. You could also install a request middleware that checked that no invalid chars were in QUERY_STRING.

comment:3 by Loststylus, 14 years ago

Oh, thank you for your response, I'll try to catch it via middleware.

I think the WSGIRequest constructor seems like the proper place to check for that.

comment:4 by Loststylus, 14 years ago

Temporary workaround (this middleware should be added before the common middleware):

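from django.http import HttpResponse


class RequestCheckMiddleware(object):
    """Answer 400 when QUERY_STRING contains non-ASCII bytes (Python 2)."""

    def process_request(self, request):
        try:
            # Interpolating a bytestring into a unicode string forces an
            # implicit ASCII decode, which fails on non-ASCII bytes.
            u'%s' % request.META.get('QUERY_STRING', '')
        except UnicodeDecodeError:
            response = HttpResponse()
            response.status_code = 400  # Bad Request
            return response

        # Returning None lets the remaining middleware run as usual.
        return None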

comment:5 by Russell Keith-Magee, 14 years ago

Triage Stage: Unreviewed → Accepted

Accepted on the basis that we could do something here, but I agree with Luke - we don't want to pay a big price because a handful of servers can't implement the spec correctly.

comment:6 by Ramiro Morales, 14 years ago

Summary: Common middleware raises UnicodeDecodeError if receives specially formatted unicode-like query string → Common middleware raises UnicodeDecodeError if receives non-ASCII QUERY_STRING from buggy web server

comment:7 by Łukasz Rekucki, 14 years ago

Severity: Normal
Type: Bug

comment:8 by anonymous, 14 years ago

BTW, the problem often shows up when just using the IE10 beta.

comment:9 by Jacob, 13 years ago

milestone: 1.3

Milestone 1.3 deleted

comment:11 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:12 by Aymeric Augustin, 13 years ago

Easy pickings: unset

Change Easy pickings from NULL to False.

comment:13 by Flávio Juvenal, 13 years ago

I got the same error.
My server is Apache with mod_wsgi. The client seems to be Internet Explorer 9.

Traceback (most recent call last):

 File "/usr/local/lib/python2.7/dist-packages/Django-1.3-py2.7.egg/django/core/handlers/base.py", line 89, in get_response
   response = middleware_method(request)

 File "/usr/local/lib/python2.7/dist-packages/Django-1.3-py2.7.egg/django/middleware/common.py", line 89, in process_request
   newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 58: ordinal not in range(128)


<WSGIRequest
GET:<QueryDict: {u'datetime': [u'--------------'], u'time_group_id': [u'---'], u'speciality': [u'Psicologia (Dist\xfarbios emocionais e de personalidade)'], u'office': [u'--------------------------------------------'], u'health_insurance': [u'----------']}>,
POST:<QueryDict: {}>,
META:{
 'GATEWAY_INTERFACE': 'CGI/1.1',
 'HTTP_ACCEPT': 'text/html, application/xhtml+xml, */*',
 'HTTP_ACCEPT_LANGUAGE': 'pt-BR',
 'HTTP_CONNECTION': 'Keep-Alive',
 'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
 'HTTP_VIA': '1.1 SVGWF02',
 'QUERY_STRING': 'health_insurance=----------&speciality=Psicologia%20(Dist\xc3\xbarbios%20emocionais%20e%20de%20personalidade)&office=--------------------------------------------&time_group_id=---&datetime=----------------',
 'REQUEST_METHOD': 'GET',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'SERVER_SOFTWARE': 'Apache',
 'mod_wsgi.callable_object': 'application',
 'mod_wsgi.handler_script': '',
 'mod_wsgi.input_chunked': '0',
 'mod_wsgi.listener_host': '',
 'mod_wsgi.process_group': '',
 'mod_wsgi.request_handler': 'wsgi-script',
 'mod_wsgi.script_reloading': '1',
 'mod_wsgi.version': (3, 3),
 'wsgi.errors': <mod_wsgi.Log object at 0xba62c0c0>,
 'wsgi.file_wrapper': <built-in method file_wrapper of mod_wsgi.Adapter object at 0xba4b2608>,
 'wsgi.input': <mod_wsgi.Input object at 0xba41cd90>,
 'wsgi.multiprocess': True,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 1)}>

comment:14 by anonymous, 13 years ago

I have the same error.

Traceback (most recent call last):

  File "/usr/lib/python2.6/site-packages/django/core/handlers/base.py", line 89, in get_response
    response = middleware_method(request)

  File "/usr/lib/python2.6/site-packages/django/middleware/common.py", line 89, in process_request
    newurl += '?' + request.META['QUERY_STRING']

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 4: ordinal not in range(128)

Some additional data

 'HTTP_USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)',
 'HTTP_VIA': '1.0 niiri.kharkov.com (squid/3.0.STABLE7), 1.0 wwwniiri (squid/3.2.0.12)',
 'HTTP_X_FORWARDED_FOR': 'unknown, 172.16.0.4, 82.117.230.71',
 'PATH_INFO': u'/ru/products/tag/N',
 'QUERY_STRING': 'N??\xb0???\xbb??????/fancybox/fancy_loading.png',
 'REMOTE_PORT': '27370',
 'REQUEST_METHOD': 'GET',
 'REQUEST_URI': '/ru/products/tag/N?N??\xb0???\xbb??????/fancybox/fancy_loading.png',

comment:15 by anonymous, 13 years ago

Same recurrent error here, just after updating to 1.4:

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/contrib/staticfiles/handlers.py", line 67, in __call__
    return self.application(environ, start_response)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/handlers/wsgi.py", line 241, in __call__
    response = self.get_response(request)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/handlers/base.py", line 146, in get_response
    response = debug.technical_404_response(request, e)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/views/debug.py", line 432, in technical_404_response
    'reason': smart_str(exception, errors='replace'),
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/utils/encoding.py", line 116, in smart_str
    return str(s)
  File "/usr/lib/python2.7/site-packages/Django-1.4-py2.7.egg/django/core/urlresolvers.py", line 185, in __repr__
    return smart_str(u'<%s %s %s>' % (self.__class__.__name__, self.name, self.regex.pattern))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I tried adding a utf8 spec at the top of urlresolvers.py and a utf8 DEFAULT in settings.py, with no effect.

comment:16 by Aymeric Augustin, 12 years ago

Owner: changed from nobody to Aymeric Augustin

comment:17 by KyleMac, 12 years ago

With Apache + mod_wsgi, Bingbot is causing this error on one of my servers.

comment:18 by Aymeric Augustin, 12 years ago

This happens because the URL produced by reversing is a unicode string and the query string is a bytestring.

(Interestingly, this bug doesn't exist under Python 3.)
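A rough illustration of the mismatch, written for Python 3, where the implicit conversion no longer exists: Python 2 silently decodes a bytestring as ASCII when it is concatenated with a unicode string, and the explicit `decode('ascii')` below stands in for that hidden step. The query string bytes come from the first report; the redirect target and variable names are assumptions made for the example.

```python
# Query string bytes as delivered by the server in the first report.
raw_query = b'q=\xd0\xa7\xd0\xb0\xd0\xb9\xd0\xba\xd0\xb0'
newurl = '/articles/ajaxsearch/'  # hypothetical redirect target (text)

failed = False
error_position = None
try:
    # Python 2 performs this ASCII decode implicitly for `unicode + str`.
    newurl += (b'?' + raw_query).decode('ascii')
except UnicodeDecodeError as exc:
    failed = True
    error_position = exc.start  # index of the first undecodable byte

print(failed, error_position)
```

The first non-ASCII byte sits right after `?q=`, which matches the "position 3" in the original traceback.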

comment:19 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

Resolution: fixed
Status: new → closed

In be6522561f01aa2a0b503fb35f35c9fd34c5110f:

[1.5.x] Fixed #15152 -- Avoided crash of CommonMiddleware on broken querystring

Backport of 973f539 from master.

comment:20 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In 973f539ab83bb46645f2f711190735c66a246797:

Fixed #15152 -- Avoided crash of CommonMiddleware on broken querystring
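One way to avoid the crash, sketched here under the assumption that discarding a malformed query string is acceptable when building a redirect URL. This only illustrates the idea in the commit title and is not a copy of the committed code; `append_query_string` is a hypothetical helper, not a Django function.

```python
def append_query_string(url, raw_query):
    """Return url with raw_query appended, or url alone if raw_query is not ASCII.

    Hypothetical helper for illustration; not part of Django.
    """
    if not raw_query:
        return url
    try:
        return url + '?' + raw_query.decode('ascii')
    except UnicodeDecodeError:
        # Only broken clients or servers produce such a query string,
        # so dropping it is preferable to answering with a 500.
        return url


print(append_query_string('/articles/ajaxsearch/', b'q=%D0%A7'))
print(append_query_string('/articles/ajaxsearch/', b'q=\xd0\xa7'))
```

The trade-off is that the redirected request loses its parameters, but a request that already violates the spec gets a normal response rather than a server error.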
