#3664 closed (fixed)
UnicodeDecodeError in contrib/syndication/feeds.py
Reported by: | Owned by: | Jacob | |
---|---|---|---|
Component: | Documentation | Version: | dev |
Severity: | Keywords: | unicode | |
Cc: | Triage Stage: | Accepted | |
Has patch: | no | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
I'm using contrib.syndication to build feeds for Flickr photos and Ma.gnolia links, both of which have tags containing funky characters (tags like 'pärnu' and 'työ'). Django dies with UnicodeDecodeError when trying to generate a feed whose URL contains such characters.
The error message is:
UnicodeDecodeError at /syndicate/tag/pärnu/
'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)
...
Exception Location: /usr/lib/python2.4/site-packages/Django-0.95-py2.4.egg/django/contrib/syndication/feeds.py in add_domain, line 9
The add_domain function is very simple, and the problem seems to be with this line:
url = u'http://%s%s' % (domain, url)
I tested this and found that the error goes away when the url is decoded as latin1 (iso-8859-1), like:
url = u'http://%s%s' % (domain, url.decode('latin1'))
but I'm not very confident that this is a good fix.
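As a rough illustration of why the latin1 decode makes the error disappear without necessarily being correct, here is a minimal Python 3 sketch. It assumes the tag reaches the feed code as UTF-8 encoded bytes (the 0xc3 byte in the error message is consistent with that, but it is an assumption), and all variable names are made up for the example. In Python 2 the ASCII decode happens implicitly when a byte string is interpolated into a unicode string; here it is written out explicitly.

```python
# Assumption: the URL path arrives as UTF-8 encoded bytes.
raw = "/syndicate/tag/pärnu/".encode("utf-8")

# Decoding as ASCII fails on the first byte of the encoded 'ä'.
failing_byte = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    failing_byte = raw[exc.start]

# latin1 maps every byte to some character, so it never raises,
# but it misreads the two UTF-8 bytes of 'ä' as two separate characters.
as_latin1 = raw.decode("latin1")

# Decoding with the encoding the bytes were actually written in
# recovers the original text.
as_utf8 = raw.decode("utf-8")

print(hex(failing_byte))
print(as_latin1)
print(as_utf8)
```

Under that assumption the latin1 version silences the exception but produces a garbled tag, which is one reason to be cautious about it as a fix.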
Attachments (1)
Change History (13)
comment:1 by , 18 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 18 years ago
I wrote a workaround for myself for this. Details are at http://www.unessa.net/en/hoyci/2007/03/unicode-and-django-rss-framework/
It would have been better to write a good patch to resolve the problem and not its causes, but I'm still not really sure how this should be fixed "right".
comment:3 by , 18 years ago
Component: | RSS framework → Documentation |
---|---|
Owner: | changed from | to
This is a documentation bug, rather than a code bug.
Anything you pass up as a link, including things returned from item_link() in syndication classes and get_absolute_url() on models, must already be in the character set specified in RFC 1738 (the URL spec). So you must already have done the necessary conversion from non-ASCII characters to ASCII and called urllib.quote() if necessary. In the above example, you are passing non-ASCII characters to something expecting content for a URL, so it is failing.
We cannot perform the conversion to UTF-8 and/or URL quoting, because, for example, the standard IRI -> URI conversion process is that you convert first and then quote(), so we don't want to accidentally do it twice (and there are lots of other places where get_absolute_url() needs to already be returning the correctly quoted string).
I will update the documentation.
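The conversion order described above (encode to UTF-8 first, then quote) can be sketched as follows. This is only an illustration, not Django's code: `iri_path_to_uri` is a hypothetical helper name, and the sketch uses Python 3's `urllib.parse.quote`, the successor of the `urllib.quote()` mentioned in the comment.

```python
from urllib.parse import quote


def iri_path_to_uri(path):
    """Hypothetical helper: encode the path to UTF-8, then percent-quote it."""
    # Step 1: convert the non-ASCII text to UTF-8 bytes.
    utf8_bytes = path.encode("utf-8")
    # Step 2: percent-quote the bytes, leaving '/' untouched.
    return quote(utf8_bytes, safe="/")


once = iri_path_to_uri("/syndicate/tag/pärnu/")
# Quoting an already quoted string escapes the '%' signs again,
# which is the double conversion the comment warns about.
twice = quote(once, safe="/")

print(once)   # /syndicate/tag/p%C3%A4rnu/
print(twice)  # /syndicate/tag/p%25C3%25A4rnu/
```

The second result shows why the framework cannot safely quote on the caller's behalf: it has no way to tell whether the string it received has been quoted already.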
comment:4 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:5 by , 18 years ago
Has patch: | set |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
Triage Stage: | Accepted → Ready for checkin |
comment:6 by , 18 years ago
Has patch: | unset |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
Triage Stage: | Ready for checkin → Accepted |
oops, wrong ticket number mentioned in [5250]
comment:7 by , 18 years ago
I can't see how this is fixed now. I still get errors: I have quoted everything correctly, but feeds.py still seems to run into trouble because the request URL contains urlencoded unicode.
Why is it even
url = u'http://%s%s' % (domain, url)
and not
url = 'http://%s%s' % (domain, url)
if the urls shouldn't be unicode??
comment:8 by , 18 years ago
It sounds like you haven't fully URL and IRI encoded your "url" fragment. Please ask support questions on the mailing list (django-users), though, rather than in Trac.
comment:9 by , 18 years ago
Needs tests: | set |
---|---|
Patch needs improvement: | set |
I still have this error, I think the ticket should be reopened.
From what I can tell the error has nothing to do with fully encoding your url fragments and so on. The problem seems to be that the feed object somehow gets a feed_url that is not URL-quoted, where it says
def __init__(self, slug, feed_url):
When I do a print feed_url, it does not show me a URL which is "ASCII and URL-quoted". So the part after
# 'url' must already be ASCII and URL-quoted, so no need for encoding
throws an error. Maybe no one ever discovered the bug because you don't have to deal with foreign-language sites!?
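One way to check whether a given value really is "ASCII and URL-quoted" before it reaches the feed object is sketched below. The function name `is_ascii_and_quoted` and the chosen set of safe characters are assumptions made for this example, not part of Django, and the sketch is written for Python 3.

```python
from urllib.parse import quote

# Assumption: characters that may legitimately appear unescaped in a URL
# for the purposes of this check; '%' is included so that existing
# percent-escapes are not escaped a second time.
SAFE_CHARS = "/%:?=&#+;,@"


def is_ascii_and_quoted(url):
    """Hypothetical check: True if url is pure ASCII and needs no further quoting."""
    try:
        url.encode("ascii")
    except UnicodeEncodeError:
        # Contains non-ASCII characters, so it has not been converted.
        return False
    # If quoting would change the string, something was left unquoted.
    return quote(url, safe=SAFE_CHARS) == url


print(is_ascii_and_quoted("/syndicate/tag/p%C3%A4rnu/"))
print(is_ascii_and_quoted("/syndicate/tag/pärnu/"))
print(is_ascii_and_quoted("/syndicate/tag/my tag/"))
```

Running a check like this on feed_url, and including the failing value in a report, would give a concrete example of the kind asked for later in this ticket.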
comment:10 by , 18 years ago
Please read the Unicode URI/IRI documentation carefully; if you have Unicode inside URLs, you are responsible for ensuring that you call the proper function to escape it before handing it off to anything else. If you have further questions, please follow Malcolm's suggestion and ask them on the django-users mailing list.
comment:11 by , 18 years ago
That would mean I can't use the feeds as described in the docs!?
The request URL has encoded and quoted Unicode, so what can I do when it is passed incorrectly to the feed object, which then throws an error?
All my other URLs are completely correct.
comment:12 by , 18 years ago
We have asked a number of times in the comments to please ask questions on the django-users list. You can post an example of how your code is generating the URL and what the problem is. The lack of examples you have provided makes it impossible to debug anything and Trac is not a good place to have support and debugging conversations. Certainly the earlier examples in this ticket were cases of bad user code, rather than a bug in Django, and yours may well be similar.
Post to django-users. Give an example of what the URL string is and how you are generating it. Then you will get help with fixing it.
This looks to be another unicode issue that we're going to look into after 0.96 is released.