Opened 5 weeks ago
Last modified 5 weeks ago
#36013 assigned Bug
Inconsistent handling of IDNs in urlize and AdminURLFieldWidget
Reported by: | Mike Edmunds | Owned by: | Mike Edmunds |
---|---|---|---|
Component: | Utilities | Version: | dev |
Severity: | Normal | Keywords: | idna |
Cc: | | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
django.utils.html.smart_urlquote() and Urlizer use obsolete IDNA 2003 encoding on some—but not all—international domain names (IDNs), leading to inconsistent URLs, failure to urlize email addresses for some IDNs, and other problems.
That code is used only by the urlize/urlizetrunc template filters and the AdminURLFieldWidget. Django does not provide IDNA encoding for any other non-ASCII URLs. This ticket proposes dropping those IDNA 2003 special cases, so that the browser can handle IDNs consistently for all URLs coming from Django.
[IDNA 2003 was superseded by IDNA 2008 starting in ~2010. Browsers follow the WHATWG URL Standard and implement UTS #46, which builds on IDNA 2008.]
Examples
Urlizer and smart_urlquote() apply IDNA encoding to some URLs, but use percent-encoded UTF-8 for others:
    from django.template.defaultfilters import urlize
    urlize("https://עִתוֹן.example.il")
    # '<a href="https://xn--cdbk7fubl3c.example.il" rel="nofollow">https://עִתוֹן.example.il</a>'
    urlize("https://މިހާރު.example.mv")
    # '<a href="https://%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv" rel="nofollow">https://މިހާރު.example.mv</a>'

    from django.utils.html import smart_urlquote
    smart_urlquote("https://עִתוֹן.example.il")
    # 'https://xn--cdbk7fubl3c.example.il'
    smart_urlquote("https://މިހާރު.example.mv")
    # 'https://%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv'
Urlizer linkifies email addresses in some IDNs, but rejects email addresses in others:
    from django.template.defaultfilters import urlize
    urlize("editor@עִתוֹן.example.il")
    # '<a href="mailto:editor@xn--cdbk7fubl3c.example.il">editor@עִתוֹן.example.il</a>'
    urlize("editor@މިހާރު.example.mv")
    # 'editor@މިހާރު.example.mv'
Examples were run with Django 5.1.4. މިހާރު is the local name of a Maldivian newspaper. It can be encoded in IDNA 2008, but was invalid under IDNA 2003. עִתוֹן is Hebrew, and can be encoded with all IDNA versions. Both use RTL scripts, but RTL support was improved in IDNA 2008.
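The split between the two example domains can be reproduced without Django. This is a minimal sketch, assuming that Python's built-in "idna" codec (which implements IDNA 2003) is representative of what punycode() relies on. The helper name idna2003_encodable is hypothetical and exists only for illustration.

```python
def idna2003_encodable(domain: str) -> bool:
    """Return True if Python's built-in IDNA 2003 codec can encode the domain."""
    try:
        domain.encode("idna")
    except UnicodeError:
        # Raised, for example, when the label violates the IDNA 2003 bidi rules.
        return False
    return True


# The two domains from the examples above, spelled as code points.
hebrew = "\u05e2\u05b4\u05ea\u05d5\u05b9\u05df.example.il"
thaana = "\u0789\u07a8\u0780\u07a7\u0783\u07aa.example.mv"

print(idna2003_encodable(hebrew))
print(idna2003_encodable(thaana))
```

Consistent with the results reported in this ticket, the Hebrew domain is encodable and the Thaana domain is rejected, which is where the two code paths diverge.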
Using obsolete IDNA 2003 encoding can cause other problems. For example, it strips Unicode characters necessary for accurate rendering of some scripts: ශ්‍රී (the Sri part of Sri Lanka in Sinhalese, written with a zero-width joiner, U+200D) is corrupted to ශ්රී after passing through IDNA 2003, which removes the joiner. (It's maybe still readable, but is sort of like writing "mañana" as "man~ana".) IDNA 2008 addresses this.
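The character loss can be observed with the standard library alone. This is a rough sketch, assuming the affected character is the zero-width joiner (U+200D) and that Python's built-in "idna" codec behaves like Django's IDNA 2003 path. The helper to_idna2003 is a hypothetical stand-in, not a Django API.

```python
def to_idna2003(domain: str) -> str:
    """Encode a domain with Python's built-in IDNA 2003 codec."""
    return domain.encode("idna").decode("ascii")


ZWJ = "\u200d"  # ZERO WIDTH JOINER

# "Sri" in Sinhala, with and without the joiner.
with_zwj = "\u0dc1\u0dca" + ZWJ + "\u0dbb\u0dd3"
without_zwj = with_zwj.replace(ZWJ, "")

# IDNA 2003 maps the joiner to nothing, so both spellings
# collapse to the same ASCII label and the joiner is lost.
print(to_idna2003(with_zwj) == to_idna2003(without_zwj))
```

Because both spellings encode to the same label, the original rendering cannot be recovered from the encoded form.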
Proposed change
The easiest fix seems to be simply removing the IDNA 2003 encoding (calls to punycode()) from URL generation in django.utils.html. Instead, run unquote_quote() on the netloc to render IDNs as percent-encoded UTF-8, like other non-ASCII characters in the URL. That leaves IDNA encoding details to the browser, ensuring consistent handling of all international URLs.
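The proposed netloc handling can be approximated with the standard library. This is only a sketch of the idea, not Django's implementation: the function name percent_encode_host and the SAFE constant are assumptions made for illustration, with the safe set taken from the RFC 3986 delimiter characters.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# RFC 3986 gen-delims and sub-delims, plus "~"; these are left unescaped.
# (Assumed safe set for this sketch.)
SAFE = ":/?#[]@" + "!$&'()*+,;=" + "~"


def percent_encode_host(url: str) -> str:
    """Percent-encode non-ASCII characters in the host as UTF-8, with no IDNA step."""
    parts = urlsplit(url)
    # Unquote first so that already-encoded input is not encoded twice.
    netloc = quote(unquote(parts.netloc), safe=SAFE)
    return urlunsplit(parts._replace(netloc=netloc))


print(percent_encode_host("https://\u0789\u07a8\u0780\u07a7\u0783\u07aa.example.mv"))
print(percent_encode_host("https://\u05e2\u05b4\u05ea\u05d5\u05b9\u05df.example.il"))
```

With this approach both example domains take the same path and come out as percent-encoded UTF-8, leaving any IDNA conversion to the browser.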
There doesn't seem to be any need for Django to IDNA encode domains in URLs. In fact, apart from the urlize template filters and the AdminURLFieldWidget, nothing else in Django applies IDNA encoding to URL hosts. (The iriencode and urlencode template filters generate %-encoded UTF-8, not IDNA. And it seems like many projects just render IDN URLs as raw UTF-8. Modern browsers support both.)
This approach complies with relevant standards:
- WHATWG's HTML Standard (working your way down from section 2.4, URLs) and URL Standard (arriving at section 3.5, Host parsing, step 4) allow both raw and %-encoded UTF-8 in URL hosts. This is what all modern browsers support.
- RFC 3986, which specifies URIs, permits %-encoded UTF-8 in URL hosts (last paragraph of section 3.2.2; a "registered name" is a host). RFC 6068, for mailto URIs, includes similar language (section 2 item 4). (Although the RFCs suggest that apps should apply "IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers" [emphasis added], this is a "should" recommendation, not a "must" requirement.)
Past discussions have proposed updating django.utils.encoding.punycode() from IDNA 2003 to IDNA 2008. But that would not be a workable solution. IDNA 2008 alone does not perform case folding and other transformations needed to match user expectations around IDN resolution. And IDNA 2008 disallows some domains that WHATWG's URL Standard specifically permits (such as emoji domains).
To match browser IDN handling, the correct spec to follow (per WHATWG) would be Unicode UTS #46 "non-transitional." Currently, there doesn't seem to be any complete Python implementation of that standard. (The idna package's uts46 option implements only UTS #46 preprocessing, section 4.4.)
Considering all that, deferring IDN encoding to the browser seems like the cleanest and most reliable approach. (And, indeed, is what already happens for URLs that aren't generated by urlize or displayed in an AdminURLFieldWidget.)
Compatibility
Django's admin app fully supports only "recent versions of modern, web standards compliant browsers" (admin FAQ; that language has been there since Django 3.1). Modern browsers all follow the WHATWG standards cited above, so there should be no compatibility concerns with the AdminURLFieldWidget.
In theory, the Urlizer changes could impact an existing Django app which both (1) uses the urlize or urlizetrunc template filter on text containing IDN URLs, and (2) needs the resulting links to work with a "legacy URI resolver" user agent that either doesn't understand %-encoded UTF-8 or doesn't perform IDNA encoding. If both of those are true, any urlized IDN links would probably be broken (not navigable) in that user agent after this change. (Of course, even before this change, any IDN links the app renders directly in its HTML—not by urlizing plaintext—are already broken for that legacy user agent.)
More info
IDNA encoding was added to smart_urlquote() in #13704 (2010), because "urlquote … incorrectly handles domain names with unicode characters in them." Unfortunately, that ticket doesn't include examples of the incorrect results, and I haven't tried to get Django 1.2 working to test it myself. My best guess is that %-encoded UTF-8 wasn't considered "correct" back then. (It's just fine now, per the standards cited above.)
For reference, here's how Django's punycode() (IDNA 2003) and the third-party idna package (IDNA 2008) handle the IDNs from the earlier examples (Django 5.1.4; Python 3.12.4):
    # IDNA 2003:
    from django.utils.encoding import punycode
    punycode("עִתוֹן.example.il")
    # 'xn--cdbk7fubl3c.example.il'
    punycode("މިހާރު.example.mv")
    # UnicodeError: Violation of BIDI requirement 3
    # encoding with 'idna' codec failed

    # IDNA 2008:
    import idna
    idna.encode("עִתוֹן.example.il")
    # b'xn--cdbk7fubl3c.example.il'
    idna.encode("މިހާރު.example.mv")
    # b'xn--hqbgq5jdp.example.mv'

    # Unicode code points:
    assert "עִתוֹן" == '\u05e2\u05b4\u05ea\u05d5\u05b9\u05df'
    assert "މިހާރު" == '\u0789\u07a8\u0780\u07a7\u0783\u07aa'
Change History (1)
comment:1 by , 5 weeks ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Version: | 5.1 → dev |
Hello Mike, thank you so much for the carefully written ticket and for the exhaustive list of details added to this report, which considerably ease the triaging.
Accepting following the presented rationale and also because this aligns with conversations had between Mike and the Security Team.