Context Navigation

← Previous Ticket
Next Ticket →

#9655 closed Bug (fixed)

urlize should not quote already quoted urls.

Reported by:	flytwokites	Owned by:	Aymeric Augustin
Component:	Template system	Version:	1.0
Severity:	Normal	Keywords:	urlize
Cc:	towjzhou@…, cmlenz@…, lamotte85@…	Triage Stage:	Accepted
Has patch:	yes	Needs documentation:	yes
Needs tests:	yes	Patch needs improvement:	no
Easy pickings:	yes	UI/UX:	no

Description

This template:

{{ user.homepage:urlize }}

When user.homepage is "http://hi.baidu.com/%D6%D8%D0%C2%BF%AA%CA%BC",
the output will be "http://hi.baidu.com/%25D6%25D8%25D0%25C2%25BF%25AA%25CA%25BC",
It quoted '%' char in source url and outputs an invalid url.

Attachments (2)

9655.patch (2.1 KB ) - added by Aymeric Augustin 13 years ago.
9655-2.diff (3.2 KB ) - added by Claude Paroz 13 years ago.: Same patch with test

Download all attachments as: .zip

Change History (22)

comment:1 by Karen Tracey, 16 years ago

Resolution:	→ invalid
Status:	new → closed

urlize works with plain text:

http://docs.djangoproject.com/en/dev/ref/templates/builtins/#urlize

You are not passing plain text here, you are passing something that has already been percent-encoded. There's no way for urlize to know that, so it percent-encodes the link value, which it must do in order to generate correct urls for data that has not already been percent-encoded. If you want to use urlize, you need to pass it a not-percent-quoted, plain text value to work with.

comment:2 by Malcolm Tredinnick, 16 years ago

Resolution:	invalid
Status:	closed → reopened

With all due respect to Karen's logic, I'm going to reopen this. I can't really reconcile the current behaviour as logical. We already treat a few particular characters as "safe" in URLs -- those marked specifically with "http://" or "https://" prefixes. It's not unreasonable to say that those particular strings (with those prefixes) are already correctly written and all we are doing is converting them to links. It looks like just an oversight that we don't have "%" in that special set.

comment:3 by Karen Tracey, 16 years ago

Hmm...so you are saying if the input value already has a http[s]: prefix that the value should be treated as if it were already percent-encoded? Thus urlize for something like Big savings available when you visit www.mystore.com/30%OffCoupons! would percent-encode the percent, but Big savings available when you visit http://www.mystore.com/30%OffCoupons! would not?

comment:4 by Malcolm Tredinnick, 16 years ago

Triage Stage:	Unreviewed → Design decision needed

I'm not sure what to do in your first case. Maybe it should be treated the same way, too. Needs thought. This is certainly a bit of an edge-case, but I can see the use-case where a URL is cut-and-pasted into an email or other text and that text is run through the urlize filter before being displayed. It's more likely than not that it will already be encoded at that point.

A URL of "www.mystore.com/30%OffCoupons" where the "%" is meant to be treated as unencoded is actually invalid (the URL spec includes the encoding requirement). So we need to have a guess at where the balance of use-cases goes: are people going to be using "real" URLs or things they hashed together themselves? We'll need some coins and a bit of time to toss them (although obviously I'm leaning slightly towards favouring the cut-and-paste scenario at the moment).

I just don't see this as an immediate "definitely don't change" case, which is my logic for reopening. We might ultimately decide it's not worth changing, but we at least should work out how people are meant to operate with such URLs. I will move it to "design decision needed", though, which I should have done yesterday, so we remember to actually make a decision if it slips under the radar.

comment:5 by Ryszard Szopa <ryszard.szopa@…>, 16 years ago

A case for changing the current behavior and treating specially aready quoted urls: currently if urlize is applied to a text, it is impossible to write correct links to some Google Chart API charts. For example, http://chart.apis.google.com/chart?cht=p&chtt=Percentage%20of%20Google%20Chart%20Which%20Resembles%20Pac-man&chs=550x250&chd=t:10,80,10&chco=FAFAFA,FFFF00,FAFAFA&chl=Does%20not%20resemble%20Pac-man|Resembles%20Pac-man will be always linked as http://chart.apis.google.com/chart?cht=p&chtt=Percentage%2520of%2520Google%2520Chart%2520Which%2520Resembles%2520Pac-man&chs=550x250&chd=t:10%2C80%2C10&chco=FAFAFA%2CFFFF00%2CFAFAFA&chl=Does%2520not%2520resemble%2520Pac-man%7CResembles%2520Pac-man

comment:6 by Christopher Lenz <cmlenz@…>, 16 years ago

Cc:	cmlenz@… added

I just came across this bug in a Django app of ours. Users copy and paste URLs from the location bar or from elsewhere into a text field, and when those URLs are already encoded (which is very likely as far as I can tell), urlize is often going to break the link (bad). I've had to copy and modify the whole urlize function and filter to get rid of the quoting.

Why not just leave the URL alone, and HTML-escape it (both in the href attribute and the link text) to fend off potential XSS problems. If the original link did indeed require encoding, the user agent will take care of that, and the link will at least work. The author can still go in and paste a properly encoded link if she's a nitpicking-validator-using power user.

At the very least, it should be possible to turn off the quoting (another keyword arg on django.utils.html.urlize). But I still think the current behavior does not make sense and should be changed.

comment:7 by ptarjan, 15 years ago

What I really want is to have this in my template array :

http://www.wowarmory.com/character-achievements.xml?r=Mal%27Ganis&n=Vosk

And then have this rendered by urlize :

<a href="http://www.wowarmory.com/character-sheet.xml?r=Mal%27Ganis&cn=Vosk" rel="nofollow">http://www.wowarmory.com/character-sheet.xml?r=Mal'Ganis&cn=Vosk</a>

Are there currently any options to urlize to make this use case possible? Or at least a custom filter that I can use?

comment:8 by ptarjan, 15 years ago

I think something like this custom filter does what I want, but I would prefer if urlize did this by default (because then I get mailto: linking and url detection for free) :

from django import template
from django.utils.safestring import mark_safe
import urllib
register = template.Library()

@register.filter(name="myurlize")
def myurlize(url) :

r = '<a href="%s">%s</a>' % (url, urllib.unquote(url))
return mark_safe(r)

comment:9 by ptarjan, 15 years ago

bad formatting, sorry

from django import template
from django.utils.safestring import mark_safe
import urllib
register = template.Library()

@register.filter(name="myurlize")
def myurlize(url) :
    r = '<a href="%s">%s</a>' % (url, urllib.unquote(url))
    return mark_safe(r)

comment:10 by bugmenot, 15 years ago

I agree with the people who want this changed. Beeing able to copy-paste an url containing % from the address bar is more important then beeing able to manually write an url containing % which is an uncommon use case.

The fix seems to be replacing safe='/&=:;#?+*' with safe='/%&=:;#?+*' in django.utils.html.py.

comment:11 by myneur, 14 years ago

Please, think in the way of most common usecase instead of "theoretical correctness" or what's your point(?).

Most common usecase is that user copy-paste address from his browser and these links needs to be translated.

in urlize, this case doesn't work at all

there is absolutely no way how the current urlize can work with a space in link (common in linked google documents) (plain text space breaks detection, and encoded space %20 is replaced into %2520).

there is no way why to use this function if it doesn't work:(

What is usecase you have built urlize for?
Does anybody do anything else than copypasting links?

Thank you for rethinking this.

comment:12 by Luke Plant, 14 years ago

Severity:	→ Normal
Type:	→ Bug

comment:13 by Julien Phalip, 13 years ago

Easy pickings:	unset

#16020 is a dupe and has a more recent discussion and patch.

comment:14 by Dan LaMotte <lamotte85@…>, 13 years ago

Cc:	lamotte85@… added
UI/UX:	unset

Seems like this has already been re-considered, but I'd like to add my $0.02. I also came across this problem where someone copy pasted a URL and %20 was encoded %2520 which makes the URL unusable. My argument against re-encoding is simple:

By changing the behavior, the use case: http://www.mystore.com/30%OffCoupons! can be worked around by encoding the URL by hand (stick it in your address bar, then copy/paste it)
The current behavior cannot be worked around, it renders URLs entirely useless

I feel that while the other comments lead someone to believe that copy/paste is the majority case as an argument (as it may well be), that more importantly, the current behavior cannot be worked around without code change. Given that, will the current behavior be changed?

comment:15 by Aymeric Augustin, 13 years ago

Owner:	changed from nobody to Aymeric Augustin
Status:	reopened → new
Triage Stage:	Design decision needed → Accepted

We have the following options here:

keep the current behavior: escape everything
keep URLs as is: escape nothing
be smart: try to detect if the URL is already URL-encoded, and if it is, don't re-encode it.

Currently, URL escaping is performed with urlquote(url, safe='/&=:;#?+*') (where urlquote from django.utils.http is basically a lazy + smart version of urllib.quote).

Instead, we could test if the URL contains an URL escape (%xx with xx in [0-9A-F]); if it does, assume it's already escaped; otherwise, escape with urllib.quote.

Obviously, the problem is ill-defined. We have no way of knowing if the input is or isn't URL-encoded. A quick test shows that browsers do something akin to the above. If I type google.com/search?q=foo bar in the URL bar, it gets encoded to google.com/search?q=foo%20bar. If I type google.com/search?q=foo%20bar, it doesn't get encoded.

So, even though being smart often isn't a good idea, in this case I believe it's the best solution.

PS: I've researched the topic of URL escaping quite thoroughly a few years ago and here's what I had concluded:

Based on RFC 1738:
   [ <>"#%{}|\^~[]`] are unsafe and must be encoded
   [;/?:@=&]         are reserved and must be encoded
                     unless they are used for their special meaning
                     we assume that they are encoded in input if necessary
   [$-_.+!*'(),]     are safe and need not be encoded

In comparison, Django considers /&=:;#?+* as safe characters — in addition to _.- which Python always considers safe. I think the list of safe characters should be extended to ;/?:@=&$-_.+!*'(), while we're working on this.

comment:16 by Aymeric Augustin, 13 years ago

For the record, the escaping was introduced at r7079 to fix #6279 and #6514. However, those tickets were about HTML-escaping (<>&'"), not URL-encoding.

comment:17 by Aymeric Augustin, 13 years ago

The latest reference is http://tools.ietf.org/html/rfc3986#section-2.1

comment:18 by Aymeric Augustin, 13 years ago

See also http://bugs.python.org/issue2637

by Aymeric Augustin, 13 years ago

Attachment:	9655.patch added

comment:19 by Aymeric Augustin, 13 years ago

Easy pickings:	set
Has patch:	set
Needs documentation:	set
Needs tests:	set

by Claude Paroz, 13 years ago

Attachment:	9655-2.diff added

Same patch with test

comment:20 by Aymeric Augustin, 13 years ago

Resolution:	→ fixed
Status:	new → closed

In [17347]:

Fixed #9655 -- Prevented the urlize template filter from double-quoting URLs. Thanks Claude Paroz for writing the tests.

Note: See TracTickets for help on using tickets.

Download in other formats:

Issues