Opened 13 years ago
Closed 13 years ago
#17386 closed Uncategorized (wontfix)
Validation & Unicode Character 'ZERO WIDTH SPACE' (U+200B)
Reported by: | Raymond Penners | Owned by: | nobody |
---|---|---|---|
Component: | Forms | Version: | 1.3 |
Severity: | Normal | Keywords: | |
Cc: | | Triage Stage: | Unreviewed |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Once in a while users somehow manage to inject e-mail addresses into the system containing Unicode zero width space characters. I am not sure how they do it -- it probably happens when copy/pasting from a document of some sort. Nevertheless, form validation does not reject such e-mail addresses:
    >>> from django.core.validators import validate_email
    >>> email = u'test@hotmail.co\u200bm'
    >>> validate_email(email)
    >>> # No ValidationError ?
These e-mail addresses get accepted and cause trouble later on (try sending mail to them, or hashing them for gravatar uses). Either:
a) Raise a ValidationError for such e-mail addresses, or
b) Automatically strip this character
Downside of a) is that the user is most likely unaware of this invisible character. He wouldn't know what character to remove where, even if instructed by an error message.
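To illustrate the concern with option a), here is a minimal sketch (Python 3, standard library only) of how an error message could at least name the invisible character and its position. The helper name `find_invisible_characters` is hypothetical and not part of Django; it flags every character in Unicode category Cf (format characters), which includes U+200B.

```python
import unicodedata


def find_invisible_characters(value):
    """Return (index, code point, Unicode name) for each format character (category Cf)."""
    return [
        (index, "U+%04X" % ord(char), unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(value)
        if unicodedata.category(char) == "Cf"
    ]


email = "test@hotmail.co\u200bm"
for index, code_point, name in find_invisible_characters(email):
    # Report where the invisible character sits so a message could point at it.
    print("position %d: %s %s" % (index, code_point, name))
```

Even with such a report, the user still cannot see the character in the form field, which is the objection raised above.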
Change History (5)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
I suppose this character is inserted as an anti-spam mechanism, precisely to defeat copy-paste.
Django won't alter user input silently — it's a bad practice that can backfire in interesting ways. And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.
Are non-ASCII characters acceptable in email addresses? If not, Django should raise a ValidationError when an email address contains one, which would resolve this problem.
comment:3 by , 13 years ago
Per RFC 3696, email addresses can use non-ASCII characters:
Any characters, or combination of bits (as octets), are permitted in DNS names.
Names will be encoded with IDNA when an ASCII representation is required.
The EmailValidator takes this into account:
    class EmailValidator(RegexValidator):

        def __call__(self, value):
            try:
                super(EmailValidator, self).__call__(value)
            except ValidationError, e:
                # Trivial case failed. Try for possible IDN domain-part
                if value and u'@' in value:
                    parts = value.split(u'@')
                    try:
                        parts[-1] = parts[-1].encode('idna')
                    except UnicodeError:
                        raise e
                    super(EmailValidator, self).__call__(u'@'.join(parts))
                else:
                    raise
However, \u200b encodes to nothing with IDNA:
    >>> u'-\u200b-'.encode('idna') == '--'
    True
    >>> len(u'-\u200b-'.encode('idna'))
    2
I spent some time fighting with various online encoders and couldn't confirm or refute whether this is a valid result.
Anyway, that's the reason why the email address is valid, after IDNA encoding of the domain part.
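One way to check this behaviour locally is sketched below. It assumes Python 3 and its built-in `idna` codec, which implements IDNA 2003 including the nameprep step. U+200B appears in RFC 3454 table B.1 ("commonly mapped to nothing"), which the standard `stringprep` module exposes as `in_table_b1`, so the character is dropped before the domain is encoded.

```python
import stringprep

# U+200B is in RFC 3454 table B.1, so nameprep maps it to nothing.
print(stringprep.in_table_b1("\u200b"))

# The domain part of the reported address, with the zero width space inside "com".
domain = "hotmail.co\u200bm"
encoded = domain.encode("idna")

# After nameprep the label is plain ASCII, so no Punycode prefix is added.
print(encoded)
```

This suggests the empty mapping is what the nameprep profile prescribes rather than a quirk of the codec, although it says nothing about whether accepting such input is desirable.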
comment:4 by , 13 years ago
Django won't alter user input silently — it's a bad practice that can backfire in interesting ways.
And I'm not in favor of defeating a purposeful (although debatable) anti-spam mechanism.
In this case it is debatable whether that character is in fact user input, as the user inputting the e-mail address is totally unaware of that character being sent to the server.
That character is apparently meant to trick robots in such a way that they won't recognize the e-mail address. However, a user cannot tell the difference between two e-mail addresses, one with, and one without the character.
Therefore:
- It would indeed be wrong to raise a ValidationError, as the user wouldn't know what to do -- he literally does not see the problem.
- It would be wrong to accept the value as is, as two "equal" e-mail addresses do not pass the equality test (==, iexact), causing all sorts of trouble in any Django app comparing e-mail addresses.
As for altering input silently: multiple representations of the same date are all mapped to a single representation under the hood, so why don't we do the same for multiple representations of the same e-mail address?
comment:5 by , 13 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Upon further thought, I don't believe this qualifies as a bug in Django. I don't see enough reasons to justify special casing \u200b, and I don't think Django can do something that will fit everyone.
In order to resolve this problem in your project, you can:
- add a clean_email method in your form that does cleaned_data['email'] = cleaned_data['email'].replace(u'\u200b', '')
- run a batch cleanup of your data:
    for obj in MyModel.objects.all():
        obj.email = obj.email.replace(u'\u200b', '')
        obj.save()
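The replacement used in both workarounds can be factored into a small helper. The following is only a sketch in Python 3; the names `ZERO_WIDTH_SPACE` and `strip_zero_width_space` are hypothetical and not part of Django. It also shows the equality problem described earlier in the ticket: the pasted and typed addresses look identical but only compare equal after stripping.

```python
ZERO_WIDTH_SPACE = "\u200b"


def strip_zero_width_space(value):
    """Return value with every U+200B removed; all other characters are untouched."""
    return value.replace(ZERO_WIDTH_SPACE, "")


pasted = "test@hotmail.co\u200bm"  # contains the invisible character
typed = "test@hotmail.com"

# The two strings render the same but are not equal until cleaned.
print(pasted == typed)
print(strip_zero_width_space(pasted) == typed)
```

A project could call such a helper from its own clean_email method or from the batch loop, so that both code paths apply the same rule.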
For what it is worth, I've only encountered hotmail e-mail addresses suffering from this problem: