Opened 4 weeks ago

Closed 4 weeks ago

Last modified 2 weeks ago

#36131 closed New feature (wontfix)

URLValidator not correctly validating URLs — at Version 9

Reported by: Ludwig Kraatz Owned by: Ludwig Kraatz
Component: Core (Other) Version: dev
Severity: Normal Keywords: URL Validator
Cc: Ludwig Kraatz Triage Stage: Unreviewed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description (last modified by Ludwig Kraatz)

Abstract

An URL is a way of describing a Resource.

https://resource -> is a valid URL.

Why do i raise this as issue

An URL resource-descriptor is constructed like that [RFC 3986#section-3]:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment

so: scheme, authority, rest...

The issue in djangos URLValidation I want to address, is a over-specification and 'selective circumvention of wrongful parsing' when it comes to the -host- compnent of the authority part.

What djangos URLValidator currently does:
host_re = "( FQDN-REGEX | localhost )"

Basically, django parses IP-OR-FQDN-OR-LOCALHOST-URLs.

This is basically the 'selective circumvention of wrongful parsing' i mentioned earlier. By ”| localhost" the URL field "feels" more okay, because all the obvious URLs on localhost that exist, now pass. But there is so much more than "localhost" besides FQDN as used for "(global) DNS URLs".

The RFC also acknowledges this. It is recommending using a syntax for hosts that conforms to the DNS syntax.

https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2

A host identified by a registered name is a sequence of characters
   usually intended for lookup within a locally defined host or service
   name registry, though the URI's scheme-specific semantics may require
   that a specific registry (or fixed name table) be used instead.  The
   most common name registry mechanism is the Domain Name System (DNS).
   A registered name intended for lookup in the DNS uses the syntax

   defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
   Such a name consists of a sequence of domain labels separated by ".",
   each domain label starting and ending with an alphanumeric character
   and possibly also containing "-" characters.  The rightmost domain
   label of a fully qualified domain name in DNS may be followed by a
   single "." and should be if it is necessary to distinguish between
   the complete domain name and some local domain.

      reg-name    = *( unreserved / pct-encoded / sub-delims )

   If the URI scheme defines a default for host, then that default
   applies when the host subcomponent is undefined or when the
   registered name is empty (zero length).  For example, the "file" URI
   scheme is defined so that no authority, an empty host, and
   "localhost" all mean the end-user's machine, whereas the "http"
   scheme considers a missing authority or empty host invalid.

   This specification does not mandate a particular registered name
   lookup technology and therefore does not restrict the syntax of reg-
   name beyond what is necessary for interoperability.  Instead, it
   delegates the issue of registered name syntax conformance to the
   operating system of each application performing URI resolution, and
   that operating system decides what it will allow for the purpose of
   host identification.  A URI resolution implementation might use DNS,
   host tables, yellow pages, NetInfo, WINS, or any other system for
   lookup of registered names.  However, a globally scoped naming
   system, such as DNS fully qualified domain names, is necessary for
   URIs intended to have global scope.  URI producers should use names
   that conform to the DNS syntax, even when use of DNS is not
   immediately apparent, and should limit these names to no more than
   255 characters in length.

What is said in many ways:

  • local host resolution is completely okay.
  • no "." is required as, a sequence (which is not further specified to length restrictions) can consist of 1, which would lack a "." seperator
  • host names that are -compatible-, are valid.

[RFC 6762 Multicast DNS # Section 3]

It is unimportant whether a name ending with ".local." occurred
   because the user explicitly typed in a fully qualified domain name
   ending in ".local.", or because the user entered an unqualified
   domain name and the host software appended the suffix ".local."
   because that suffix appears in the user's search list.

It is stated clearly, that a user can describe a resource with the implication, that if its not a fully qualified domain name, the TLD .local is to be assumed. As such - the URL, which is what the user would be referencing, was to be able to deal with more non-FQDN than just "localhost". This is in the context of Multicast DNS, which seems more than close enough to be considered relevant, when talking about URLs - as the URL RFC was so closely described around DNS.

[RFC 3986 URI/URL # 1.1]

 URIs that
   identify in relation to the end-user's local context should only be
   used when the context itself is a defining aspect of the resource,
   such as when an on-line help manual refers to a file on the end-
   user's file system (e.g., "file:///etc/hosts").
  • clearly states, that URI's are valid, even if they clearly only 'make sense' in a end-users local context.

As such - restricting django URLs to only Fully Qualified Domain Names/IPs, (except localhost.. for whatever reason except inconsitency :-* ) - is a restriction that contradicts that notion.

What i am proposing:

fully allowing for URLs as per rfc3986#section-3.2.2 - with a regex solution for localhost (and whatever else is possible) instead of a hardcoded < "magicnumber"-80%-"solution" >

To be Commited to django repository and pull requested. My earlier pull request is more - a starting point for discussion.

Why this is necessary & usefull:

Single-label URLs might be used

  • in intranet situations
  • for URLs that represent services / schemes that do not comply to FQDNaming conventions
  • for local testing (local DNS resolution that is not based on FQDN)
  • mDNS [RFC 6762] solutions, operating under .local TLD (which as of that RFC can be ommitted in a local context)
  • the django validator is named URLValidator, not FQDN_IP_LOCALHOST_URLValidator

Further notes:

i already submitted a pull request - which probably isn't mature enough.. given i did not even check which tests would break..

EDIT: ignore the following. it was an oversight on my end.

but - there was one test, that should not have broken:

FAIL: test_urlfield_clean_invalid (forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid) [<object object at 0x000001C1038C1760>] (value='foo')

URL <= "foo" should not be valid, even with my little changes, replacing 'localhost' with hostname_re

It feels like there are some (- -)  missing - but i did not check.. i focused on providing a more solid ticket first..
So - if i am not mistaken, there is another issue besides what i propose. It seems, limiting hosts via FQDN was the thing, preventing missing URI-scheme's to be rejected by the validator, not a correct validation of uri-schemes themselves.

PS

its kindof late - i might polish this ticket tomorrow. if you feel like i'm drunk or disorganized - its just my brain thats screaming for relief. sry.

Change History (9)

comment:1 by Ludwig Kraatz, 4 weeks ago

Has patch: set
Patch needs improvement: set

comment:2 by Ludwig Kraatz, 4 weeks ago

EDIT: IGNORE THIS COMMENT. oversight on my end. URLField != URLValidator. #lazy scheme.

https://github.com/django/django/pull/19096

Test cofirmed: URL Validation fails for "localhost" => which will be accepted, even though its not a valid URL. (it lacks scheme://)

Bottom line: whole regEx / Validation seems off on multiple layers.

Last edited 4 weeks ago by Ludwig Kraatz (previous) (diff)

comment:3 by Antoliny, 4 weeks ago

Owner: set to Ludwig Kraatz
Status: newassigned

comment:4 by Sarah Boyce, 4 weeks ago

Resolution: invalid
Status: assignedclosed

This is rather hard to follow, http://localhost is a valid url and passes. localhost is not a valid url and fails. I do not follow the reasoning that there is an issue

comment:5 by Ludwig Kraatz, 4 weeks ago

hello=??

https://github.com/django/django/pull/19096

look at your own test cases. localhost is being allowed, even though we agree it should not. so that issue stands as it is. Your current code contradicts your own statement. It stands on its own - and has nothing really to do with this ticket on its own. so why you close it with that rational after it has been assigned to me - sry, i dont get it.

My ticket, initially, was about http://somethingOtherThanLocalhost -> is valid as well. But is falsefully rejected by the URLvalidator.

Both issues combined say: The whole RegEx URLValidator logic is off.
as mentioned - on multiples layers

  1. the issue of not conforming to reality, by not allowing https://resource or https://printer or whatever 'local' but non-localhost resource is to be referenced.
  2. the issue of being a wrong regex, as it allows things that are clearly not to be allowed.

please take some more time to reconsider closing this ticket in a hurry - this is not thought through.

in reply to:  4 comment:6 by Ludwig Kraatz, 4 weeks ago

Replying to Sarah Boyce:

This is rather hard to follow, http://localhost is a valid url and passes. localhost is not a valid url and fails. I do not follow the reasoning that there is an issue

i have to correct you here:

localhost is not valid but passes.

look closely at:
https://github.com/django/django/pull/19096

 FAIL: test_urlfield_clean_invalid (forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid) [<object object at 0x000002C811AD0FE0>] (value='localhost')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.13.1\x64\Lib\unittest\case.py", line 58, in testPartExecutor
    yield
  File "C:\hostedtoolcache\windows\Python\3.13.1\x64\Lib\unittest\case.py", line 556, in subTest
    yield
  File "D:\a\django\django\tests\forms_tests\field_tests\test_urlfield.py", line 114, in test_urlfield_clean_invalid
    with self.assertRaisesMessage(ValidationError, msg):
    ^^^^^^^
  File "C:\hostedtoolcache\windows\Python\3.13.1\x64\Lib\contextlib.py", line 148, in __exit__
    next(self.gen)
    ^^^^^^^
  File "D:\a\django\django\django\test\testcases.py", line 807, in _assert_raises_or_warns_cm
    with func(expected_exception) as cm:
    ^^^^^^^^^^^^^^^
  File "C:\hostedtoolcache\windows\Python\3.13.1\x64\Lib\unittest\case.py", line 263, in __exit__
    self._raiseFailure("{} not raised".format(exc_name))
    ^^^^^^^
  File "C:\hostedtoolcache\windows\Python\3.13.1\x64\Lib\unittest\case.py", line 200, in _raiseFailure
    raise self.test_case.failureException(msg)
    ^^^^^^^^^^^^^^^
AssertionError: ValidationError not raised

TLDR:

  1. forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid)(value='localhost')
  2. AssertionError: ValidationError not raised
Last edited 4 weeks ago by Ludwig Kraatz (previous) (diff)

comment:7 by Ludwig Kraatz, 4 weeks ago

Host components, and whether they are being validated correctly or not.

This - is the core issue of my ticket.

+---------------------+
|  Local Environment  |
+---------------------+
       |
       |--[localhost]---> VALID
       |
       |--[printer]------> FALSELY INVALIDATED
       |                    (valid via local DNS, mDNS, or hosts)
       |                    (although it was explicitly claimed a valid utilization as of RFC6762)
       |
       |--[printer.local]-> VALID

+---------------------+
|  Public Environment  |
+---------------------+
       |
       |--[example.com]---> VALID

comment:8 by Sarah Boyce, 4 weeks ago

If you "look closely" at the test, you will notice it is testing URLField, not URLValidator.

URLField has the feature assume_scheme which means the normalized value is http://localhost and, therefore, URLValidator finds it valid.

If you can run the following code against URLValidator:

>>> from django.core.validators import URLValidator
>>> URLValidator()("http://localhost")
>>> URLValidator()("localhost")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path_to_django/django/core/validators.py", line 170, in __call__
    raise ValidationError(self.message, code=self.code, params={"value": value})
django.core.exceptions.ValidationError: <exception str() failed>

So, URLValidator raises a ValidationError when called against localhost.


Last edited 4 weeks ago by Sarah Boyce (previous) (diff)

comment:9 by Ludwig Kraatz, 4 weeks ago

Description: modified (diff)
Note: See TracTickets for help on using tickets.
Back to Top