#22561 closed Bug (fixed)
EmailMessage should respect RFC2822 on max line length
Reported by: | notsqrt | Owned by: | Henrik Levkowetz |
---|---|---|---|
Component: | Core (Mail) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | petr.hroudny@…, bugs@…, michal@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Follow-up of thread Email encoding (DKIM, long lines, etc..) on django-users
RFC
The RFC2822 states that:
"Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."
This statement was left unchanged in the 2008 update, RFC 5322.
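The two limits quoted above can be checked mechanically. The following is a minimal sketch, assuming the limits are counted in bytes of the encoded body and that lines end with CRLF or a bare LF. The function name `check_line_lengths` and the constants are hypothetical, not part of Django or the stdlib.

```python
MUST_LIMIT = 998    # hard limit from RFC 2822 / RFC 5322, excluding CRLF
SHOULD_LIMIT = 78   # recommended limit, excluding CRLF


def check_line_lengths(body, encoding="utf-8"):
    """Return two lists of 1-based line numbers.

    The first list holds lines over the MUST limit, the second holds lines
    that are within the MUST limit but over the SHOULD limit.
    """
    must, should = [], []
    for number, line in enumerate(body.encode(encoding).splitlines(), start=1):
        if len(line) > MUST_LIMIT:
            must.append(number)
        elif len(line) > SHOULD_LIMIT:
            should.append(number)
    return must, should


body = "short line\r\n" + "x" * 100 + "\r\n" + "y" * 1200 + "\r\n"
print(check_line_lengths(body))  # ([3], [2])
```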
History
For utf-8 encoded emails, Python uses:
- shortest of "quoted-printable" and "base64" for the email subject
- "base64" for the body
# stdlib, identical in python2.7 and python3.3 : email/charset.py
CHARSETS = {
    'utf-8': (SHORTEST, BASE64, 'utf-8'),
}
The historical reason seems to be that support for 8bit characters in emails was not largely adopted, hence the need to encode them into ASCII.
Back in 2007, in ticket 3472 (changeset 5143), it was decided to always use "quoted-printable", because using base64 seems to negatively affect spam scores.
# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
Charset.add_charset('utf-8', Charset.SHORTEST, Charset.QP, 'utf-8')
In 2011, in ticket 11212 (changeset 16178, Django 1.4), it was decided to remove "quoted-printable" and let Python switch automatically between 7-bit and 8-bit encodings, on the grounds that 8-bit emails were widely supported and that MTAs were in charge of downgrading to 7-bit if necessary.
Charset.add_charset('utf-8', Charset.SHORTEST, None, 'utf-8')
The (unintended?) side-effect of using base64 or "quoted-printable" was in fact a guarantee of short lines in emails (for instance, RFC 2045, which defines quoted-printable, limits encoded lines to 76 characters).
Summary of the reasons invoked for these choices
- base64 is too big (bandwidth)
- base64 is not supported by all clients
- base64 has a negative effect on spam scores (cf SpamAssassin's rule on unnecessarily using base64 encoding to disguise text, but this rule also states that "This does not apply to text in the UTF-8 or big5 character sets.")
- quoted-printable is no longer necessary, since MTAs and email clients have adopted 8bit support
Current state
Django
There was an additional ticket, 12422, but it is not relevant to this ticket.
The current code in django/core/mail/message.py looks like:
# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
utf8_charset = Charset.Charset('utf-8')
utf8_charset.body_encoding = None  # Python defaults to BASE64
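To make the three historical settings concrete, here is a small sketch that uses only the stdlib `email.charset` module. The variable names are illustrative. It shows what `get_body_encoding()` reports for Python's default, for the 2007 quoted-printable choice, and for `body_encoding = None`.

```python
from email import charset as Charset

# Python's defaults for UTF-8: SHORTEST for headers, BASE64 for the body.
utf8_default = Charset.Charset("utf-8")
print(utf8_default.header_encoding == Charset.SHORTEST)  # True
print(utf8_default.body_encoding == Charset.BASE64)      # True
print(utf8_default.get_body_encoding())                  # base64

# The 2007 choice: quoted-printable bodies.
utf8_qp = Charset.Charset("utf-8")
utf8_qp.body_encoding = Charset.QP
print(utf8_qp.get_body_encoding())                       # quoted-printable

# body_encoding = None: get_body_encoding() no longer names an encoding.
# It returns a function that only labels the message 7bit or 8bit,
# so nothing re-wraps long lines.
utf8_raw = Charset.Charset("utf-8")
utf8_raw.body_encoding = None
raw_encoding = utf8_raw.get_body_encoding()
print(callable(raw_encoding))                            # True
```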
Clients
Email clients like Gmail seem to wrap lines at 80 characters for text/plain, and switch to "Content-Transfer-Encoding: quoted-printable" for text/html and text/plain if there are non-ascii characters.
Importance
Mail Transfer Agents like Postfix often split lines that do not respect the RFC by inserting "\r\n " at the 998th position of the line.
DKIM signatures of emails are based on the unmodified body, but the signature validation by receivers is based on the modified body, resulting in a check failure.
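The mismatch can be illustrated with a simplified sketch. It hashes a body before and after a line split like the one described above. Real DKIM canonicalizes the body before hashing, which is omitted here, and `fold_like_mta` is a hypothetical stand-in for whatever a given MTA actually does.

```python
import base64
import hashlib


def body_hash(body: bytes) -> str:
    """Base64-encoded SHA-256 of the body, similar in spirit to DKIM's bh= tag."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


def fold_like_mta(body: bytes, limit: int = 998) -> bytes:
    """Hypothetical MTA behaviour: insert CRLF + space into over-long lines."""
    folded = []
    for line in body.split(b"\r\n"):
        chunks = [line[i:i + limit] for i in range(0, len(line), limit)] or [b""]
        folded.append(b"\r\n ".join(chunks))
    return b"\r\n".join(folded)


signed_body = b"<p>" + b"a" * 2000 + b"</p>\r\n"   # what the sender signed
received_body = fold_like_mta(signed_body)          # what the receiver sees

print(body_hash(signed_body) == body_hash(received_body))  # False
```

A body whose lines are already short passes through `fold_like_mta` unchanged, so its hash still matches.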
Apart from my own django projects, I have seen long lines in html emails sent by Sentry, for instance.
Choices
For reference, Perl library MIME-Lite recommends:
Use encoding: | If your message contains:
---|---
7bit | Only 7-bit text, all lines <1000 characters
8bit | 8-bit text, all lines <1000 characters
quoted-printable | 8-bit text or long lines (more reliable than "8bit")
base64 | Largely non-textual data: a GIF, a tar file, etc.
One way or another, we have to guarantee that email lines are <1000 characters.
base64 and quoted-printable do that for us.
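A quick way to see this with the stdlib alone is to encode a single very long line with both schemes and measure the longest output line. The helper `longest_line` is illustrative only.

```python
import base64
import quopri

# One 3000-byte line with no newline at all.
long_line = ("é" * 1500).encode("utf-8")

b64 = base64.encodebytes(long_line)
qp = quopri.encodestring(long_line)


def longest_line(data: bytes) -> int:
    """Length in bytes of the longest line, excluding line terminators."""
    return max(len(line) for line in data.splitlines())


print(longest_line(long_line))     # 3000
print(longest_line(b64))           # 76
print(longest_line(qp) <= 76)      # True
```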
Not using them means that we have to find a reliable way to split long lines into shorter ones, at the risk of breaking HTML code in text/html emails.
I am not aware of other encodings that can be used for this, nor of reliable ways to split long lines.
On django-users, Russ Magee warned about possible downstream consequences.
Other references
http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
relevant discussion on trac's trac
SpamAssassin's rule on quoted-printable messages not respecting the 76-max line length rule.
Change History (17)
comment:1 by , 11 years ago
Cc: | added |
---|
comment:2 by , 11 years ago
comment:3 by , 10 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Uncategorized → Bug |
Russ seemed to accept the problem on the mailing list.
comment:4 by , 10 years ago
Apart from breaking DKIM, this behaviour also affects appearance of emails containing long lines (spaces appear to be added, at least in Gmail) and breaks inline HTML or CSS.
Note that long lines are commonplace when CSS rules are automatically being inlined.
follow-up: 7 comment:5 by , 10 years ago
Not using them means that we have to find a reliable way to split long lines into shorter ones, but the risk is to break html code in the case of text/html emails.
I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?
comment:6 by , 9 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:7 by , 9 years ago
Replying to claudep:
I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?
Putting a newline in the middle of a tag (href with a loooooong url?) would do it.
This issue is causing me some pain too.
comment:8 by , 9 years ago
Cc: | added |
---|
comment:9 by , 9 years ago
Please note that SpamAssassin adds points only when base64 is used for encodings that do not require it: http://wiki.apache.org/spamassassin/Rules/MIME_BASE64_TEXT
A message with long lines does require some kind of encoding (so as not to violate the RFC), so if SA adds points for such messages, it is a bug in SA (and a contradiction of its documentation).
To be sure, you can check whether the message body has long lines and enable base64 encoding only in that case. This will definitely save bandwidth for well-formatted message bodies.
comment:10 by , 9 years ago
Cc: | added |
---|
follow-up: 12 comment:11 by , 9 years ago
Has patch: | set |
---|---|
Version: | 1.6 → master |
I added this PR to fall back to QP encoding when the body has lines longer than 998 characters. Would this be a good solution?
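The PR itself is not reproduced here. As a rough sketch of the idea, using only the stdlib, one could pick the body encoding per message. The helper name `build_text_message` and the constant are hypothetical, and counting the limit in UTF-8 bytes is an assumption.

```python
from email import charset as Charset
from email.mime.text import MIMEText

LINE_LENGTH_LIMIT = 998  # RFC 5322 hard limit, excluding CRLF


def build_text_message(body, subtype="plain"):
    """Hypothetical helper: 7bit/8bit normally, QP when a line is too long."""
    has_long_lines = any(
        len(line.encode("utf-8")) > LINE_LENGTH_LIMIT
        for line in body.splitlines()
    )
    utf8 = Charset.Charset("utf-8")
    # Quoted-printable has the side effect of wrapping long lines.
    utf8.body_encoding = Charset.QP if has_long_lines else None
    return MIMEText(body, subtype, utf8)


print(build_text_message("hello")["Content-Transfer-Encoding"])      # 7bit
print(build_text_message("café")["Content-Transfer-Encoding"])       # 8bit
print(build_text_message("x" * 2000)["Content-Transfer-Encoding"])   # quoted-printable
```

Well-formatted bodies keep their compact 7bit or 8bit form, and only messages that would otherwise violate the limit pay the encoding overhead.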
comment:12 by , 9 years ago
comment:13 by , 9 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:15 by , 8 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
It appears the fix does not work properly on Cyrillic strings.
Tried it out with following snippet:
https://gist.github.com/ppokrovsky/d06d0d9e3c8f55bf15984ecd22954683
With body set to test_body_lat, the has_long_lines flag in django.core.mail.message.SafeMIMEText() is set to True and therefore correctly applies 'quoted-printable' encoding, while when body is set to test_body_ru, it leaves has_long_lines as False, therefore leaving the default charset, which results in a broken email body.
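The comment does not state the cause. One possible explanation, offered here only as an assumption, is that line length was measured in characters rather than in encoded bytes: Cyrillic letters take two bytes each in UTF-8, so a line can stay under 998 characters while exceeding 998 bytes. The variable names below mirror those in the comment, but their contents are made up for illustration and are not taken from the gist.

```python
LIMIT = 998

test_body_lat = "a" * 1200   # Latin: 1 byte per character in UTF-8
test_body_ru = "я" * 600     # Cyrillic: 2 bytes per character in UTF-8

results = {}
for name, body in [("lat", test_body_lat), ("ru", test_body_ru)]:
    too_long_in_chars = len(body) > LIMIT
    too_long_in_bytes = len(body.encode("utf-8")) > LIMIT
    results[name] = (too_long_in_chars, too_long_in_bytes)
    print(name, too_long_in_chars, too_long_in_bytes)
# lat True True
# ru False True
```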
follow-up: 17 comment:16 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Instead of reopening the fixed ticket, could you please create a new one (where you can mention this ticket)?
I already have a pull request ready to fix your issue.
comment:17 by , 8 years ago
Replying to Claude Paroz:
Instead of reopening the fixed ticket, could you please create a new one (where you can mention this ticket)?
I already have a pull request ready to fix your issue.
#27696
Appreciated
Quoted-printable should only be used to downconvert emails to 7bit-only, not to work around their non-compliance with the RFC regarding line lengths.
Please note that QP works decently only for languages based on ASCII with a few accented characters, and performs miserably for all others.
Thus reintroducing any form of 7bit downconversion is not the proper solution to this problem.
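The size overhead behind this objection can be measured with the stdlib. This is a small sketch with made-up sample sentences. `encoded_sizes` is a hypothetical helper that returns the raw, quoted-printable, and base64 sizes in bytes.

```python
import base64
import quopri


def encoded_sizes(text):
    """Return (raw, quoted-printable, base64) sizes in bytes for UTF-8 text."""
    raw = text.encode("utf-8")
    return len(raw), len(quopri.encodestring(raw)), len(base64.encodebytes(raw))


samples = {
    "mostly ASCII": "The quick brown fox visits a café. " * 20,
    "Cyrillic": "Съешь ещё этих мягких булок. " * 20,
}

for name, text in samples.items():
    raw, qp, b64 = encoded_sizes(text)
    print(f"{name}: raw={raw} qp={qp} base64={b64}")
```

Each non-ASCII byte becomes three characters in quoted-printable, so text that is mostly non-ASCII roughly triples in size, whereas base64 grows everything by about a third regardless of content.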