Opened 13 years ago

Closed 13 years ago

#17386 closed Uncategorized (wontfix)

Validation & Unicode Character 'ZERO WIDTH SPACE' (U+200B)

Reported by: Raymond Penners
Owned by: nobody
Component: Forms
Version: 1.3
Severity: Normal
Keywords:
Cc:
Triage Stage: Unreviewed
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Once in a while users somehow manage to inject e-mail addresses into the system that contain Unicode zero width space characters. I am not sure how they do it -- it probably happens when copy/pasting from a document of some sort. Nevertheless, form validation does not reject such e-mail addresses:

>>> from django.core.validators import validate_email
>>> email=u'test@hotmail.co\u200bm'
>>> validate_email(email)
>>> # No ValidationError ?

These e-mail addresses get accepted and cause trouble later on (try sending mail to them, or hashing them for gravatar uses). Either:
a) Raise a ValidationError for such e-mail addresses, or
b) Automatically strip this character

The downside of a) is that the user is most likely unaware of this invisible character. They would not know which character to remove, or where, even if instructed by an error message.
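The two options can be sketched in plain Python 3, outside of Django. The helper names `reject_zero_width_space` and `strip_zero_width_space` are hypothetical and not part of Django; the sketch only illustrates what each option would do to the address from the report, and how option a) could at least tell the user where the invisible character sits.

```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def reject_zero_width_space(value):
    """Option a): refuse the value, but say where the invisible character is."""
    index = value.find(ZERO_WIDTH_SPACE)
    if index != -1:
        # unicodedata.name() gives the official name, 'ZERO WIDTH SPACE'.
        name = unicodedata.name(ZERO_WIDTH_SPACE)
        raise ValueError(
            "Invisible character %s (U+200B) at position %d" % (name, index)
        )
    return value


def strip_zero_width_space(value):
    """Option b): silently drop every U+200B from the value."""
    return value.replace(ZERO_WIDTH_SPACE, "")


email = "test@hotmail.co\u200bm"
print(strip_zero_width_space(email))  # test@hotmail.com
try:
    reject_zero_width_space(email)
except ValueError as exc:
    print(exc)
```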

Change History (5)

comment:1 by Raymond Penners, 13 years ago

For what it is worth, I've only encountered hotmail e-mail addresses suffering from this problem:

confidential@hot\u200bmail.com
confidential@hotmail.c\u200bom

comment:2 by Aymeric Augustin, 13 years ago

I suppose this character is inserted as an anti-spam mechanism, precisely to defeat copy-paste.

Django won't alter user input silently — it's a bad practice that can backfire in interesting ways. And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.

Are non-ASCII characters acceptable in email addresses? If not, Django should raise a ValidationError when an email address contains one, which would resolve this problem.

comment:3 by Aymeric Augustin, 13 years ago

Per RFC 3696, email addresses can use non-ASCII characters:

Any characters, or combination of bits (as octets), are permitted in DNS names.

Names will be encoded with IDNA when an ASCII representation is required.

The EmailValidator takes this into account:

class EmailValidator(RegexValidator):

    def __call__(self, value):
        try:
            super(EmailValidator, self).__call__(value)
        except ValidationError, e:
            # Trivial case failed. Try for possible IDN domain-part
            if value and u'@' in value:
                parts = value.split(u'@')
                try:
                    parts[-1] = parts[-1].encode('idna')
                except UnicodeError:
                    raise e
                super(EmailValidator, self).__call__(u'@'.join(parts))
            else:
                raise

However, \u200b encodes to nothing with IDNA:

>>> u'-\u200b-'.encode('idna') == '--'
True
>>> len(u'-\u200b-'.encode('idna'))
2

I spent some time fighting with various online encoders and couldn't confirm or refute whether this is a valid result.

Anyway, that's the reason why the email address is valid, after IDNA encoding of the domain part.
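For reference, the behaviour can be reproduced with the standard library alone. Python's built-in "idna" codec runs the Nameprep profile before encoding, and Nameprep's table B.1 ("commonly mapped to nothing") includes U+200B, which is why the character disappears. The snippet below is a Python 3 sketch (the sessions above are Python 2, so the encoded result here is a bytes object); it only shows what this codec does and is not a statement about other IDNA implementations.

```python
import stringprep

ZERO_WIDTH_SPACE = "\u200b"

# Table B.1 of stringprep lists the characters "commonly mapped to nothing".
print(stringprep.in_table_b1(ZERO_WIDTH_SPACE))  # True

# The codec therefore drops U+200B from the label before encoding it.
domain = "hotmail.co\u200bm"
encoded = domain.encode("idna")
print(encoded)  # b'hotmail.com'
print(len(domain), len(encoded))  # 12 11
```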

comment:4 by Raymond Penners, 13 years ago

> Django won't alter user input silently — it's a bad practice that can backfire in interesting ways.
> And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.

In this case it is debatable whether that character is in fact user input, as the user entering the e-mail address is totally unaware of that character being sent to the server.

That character is apparently meant to trick robots in such a way that they won't recognize the e-mail address. However, a user cannot tell the difference between two e-mail addresses, one with and one without the character.

Therefore:

  • It would indeed be wrong to raise a ValidationError, as the user wouldn't know what to do -- he literally does not see the problem.
  • It would be wrong to accept the value as is, as two "equal" e-mail addresses do not pass the equality test (==, iexact), causing all sorts of trouble in any Django app comparing e-mail addresses.

As for altering input silently: multiple representations of the same date are all mapped to a single representation under the hood, so why don't we do the same for multiple representations of the same e-mail address?
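The comparison problem is easy to reproduce in Python 3. In the sketch below, `gravatar_style_hash` and `comparison_key` are hypothetical helpers: the first assumes the usual Gravatar recipe (MD5 of the trimmed, lowercased address), and the second is one possible single representation, not something Django provides.

```python
import hashlib

ZERO_WIDTH_SPACE = "\u200b"

typed = "confidential@hotmail.com"
pasted = "confidential@hot\u200bmail.com"  # looks identical when displayed


def gravatar_style_hash(email):
    # Assumed recipe: MD5 hex digest of the trimmed, lowercased address.
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def comparison_key(email):
    # One possible canonical form: drop U+200B, trim, lowercase.
    return email.replace(ZERO_WIDTH_SPACE, "").strip().lower()


print(typed == pasted)  # False
print(gravatar_style_hash(typed) == gravatar_style_hash(pasted))  # False
print(comparison_key(typed) == comparison_key(pasted))  # True
```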

comment:5 by Aymeric Augustin, 13 years ago

Resolution: wontfix
Status: new → closed

Upon further thought, I don't believe this qualifies as a bug in Django. I don't see enough reasons to justify special casing \u200b, and I don't think Django can do something that will fit everyone.

In order to resolve this problem in your project, you can:

  • add a clean_email method to your form that returns self.cleaned_data['email'].replace(u'\u200b', u'')
  • run a batch cleanup of your data: for obj in MyModel.objects.all(): obj.email = obj.email.replace(u'\u200b', u''); obj.save()
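
Spelled out in Python 3, the two workarounds might look like the sketch below. It deliberately avoids importing Django so that it runs on its own: `StripZeroWidthSpaceMixin` and `strip_stored_emails` are hypothetical names, the mixin assumes it is combined with a form class that has an 'email' field and a populated cleaned_data, and the batch helper assumes objects with an email attribute and a save() method, such as model instances.

```python
ZERO_WIDTH_SPACE = "\u200b"


class StripZeroWidthSpaceMixin:
    """Hypothetical mixin for a form class that has an 'email' field."""

    def clean_email(self):
        # Django calls clean_<fieldname>() after the field's own validation
        # and stores the return value back into cleaned_data.
        return self.cleaned_data["email"].replace(ZERO_WIDTH_SPACE, "")


def strip_stored_emails(objects):
    """Batch cleanup; returns how many objects were changed and saved."""
    changed = 0
    for obj in objects:
        cleaned = obj.email.replace(ZERO_WIDTH_SPACE, "")
        if cleaned != obj.email:
            # Only write back rows that actually contained U+200B.
            obj.email = cleaned
            obj.save()
            changed += 1
    return changed
```

In a project, the mixin would be listed before the form base class, and the batch helper would be called with something like MyModel.objects.all().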