Opened 11 years ago

Closed 8 years ago

Last modified 8 years ago

#22561 closed Bug (fixed)

EmailMessage should respect RFC2822 on max line length

Reported by: notsqrt
Owned by: Henrik Levkowetz
Component: Core (Mail)
Version: dev
Severity: Normal
Keywords:
Cc: petr.hroudny@…, bugs@…, michal@…
Triage Stage: Ready for checkin
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

Follow-up of thread Email encoding (DKIM, long lines, etc..) on django-users

RFC

The RFC2822 states that:
"Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."

This statement was left unchanged in the 2008 update, RFC5322.

History

For utf-8 encoded emails, Python uses:

  • shortest of "quoted-printable" and "base64" for the email subject
  • "base64" for the body

# stdlib, identical in python2.7 and python3.3 : email/charset.py
CHARSETS = {
   'utf-8':       (SHORTEST,  BASE64, 'utf-8'),
}
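For anyone who wants to confirm these defaults locally, here is a minimal sketch using only the standard library. It assumes nothing in the process has already overridden the registry through add_charset(); the `names` mapping is just a helper for readable output.

```python
from email.charset import BASE64, QP, SHORTEST, Charset

# Look up the stdlib's registered defaults for utf-8.
utf8 = Charset("utf-8")

# Map the numeric constants back to readable names for printing.
names = {QP: "QP", BASE64: "BASE64", SHORTEST: "SHORTEST", None: "None"}
header_encoding = names[utf8.header_encoding]
body_encoding = names[utf8.body_encoding]
print(header_encoding, body_encoding)
```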

The historical reason seems to be that support for 8bit characters in emails was not widely adopted, hence the need to encode them into ASCII.

Back in 2007, in ticket 3472 (changeset 5143), it was decided to always use "quoted-printable", because using base64 seems to negatively affect spam scores.

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
Charset.add_charset('utf-8', Charset.SHORTEST, Charset.QP, 'utf-8')

In 2011, in ticket 11212 (changeset 16178, Django 1.4), it was decided to remove "quoted-printable" and let Python switch automatically between 7-bit and 8-bit encodings, on the grounds that 8-bit emails were widely supported and that MTAs were in charge of downgrading to 7-bit if necessary.

Charset.add_charset('utf-8', Charset.SHORTEST, None, 'utf-8')

The (unintended?) side effect of using base64 or "quoted-printable" was in fact a guarantee of short lines in emails (for instance, RFC2045, which defines quoted-printable, limits encoded lines to 76 characters).
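A small sketch of that side effect, using the stdlib quopri and base64 modules directly rather than Django's mail code. The 3000-byte input line is an arbitrary example chosen for illustration.

```python
import base64
import quopri

# A single 3000-byte line of UTF-8 text, far beyond the 998-character limit.
long_line = ("é" * 1500).encode("utf-8")

# Both encoders re-wrap their output, whatever the input line length was.
qp_lines = quopri.encodestring(long_line).splitlines()
b64_lines = base64.encodebytes(long_line).splitlines()

longest_qp = max(len(line) for line in qp_lines)
longest_b64 = max(len(line) for line in b64_lines)
print(longest_qp, longest_b64)
```

Both maxima should stay within the 76-character limit, so neither encoding can produce a line that an MTA would need to split.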

Summary of the reasons invoked for these choices

  • base64 is too big (bandwidth)
  • base64 is not supported by all clients
  • base64 has a negative effect on spam scores (cf SpamAssassin's rule on unnecessarily using base64 encoding to disguise text, but this rule also states that "This does not apply to text in the UTF-8 or big5 character sets.")
  • quoted-printable is no longer necessary, since MTAs and email clients have adopted 8bit support

Current state

Django

There was an additional ticket, 12422, but it is not relevant to this one.

The current code in django/core/mail/message.py looks like:

# Don't BASE64-encode UTF-8 messages so that we avoid unwanted attention from
# some spam filters.
utf8_charset = Charset.Charset('utf-8')
utf8_charset.body_encoding = None  # Python defaults to BASE64

Clients

Email clients like Gmail seem to wrap lines at 80 characters for text/plain, and switch to "Content-Transfer-Encoding: quoted-printable" for text/html and text/plain if there are non-ascii characters.

Importance

Mail Transfer Agents like Postfix often split lines that do not respect the RFC by inserting "\r\n " at the 998th position of the line.

DKIM signatures of emails are based on the unmodified body, but the signature validation by receivers is based on the modified body, resulting in a check failure.
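To make the failure mode concrete, here is a simplified sketch. fold_long_lines() is a hypothetical stand-in for what an MTA might do, and body_hash() is a plain SHA-256 over the raw bytes; real DKIM canonicalizes the body before hashing, which this sketch deliberately skips.

```python
import hashlib

LINE_LIMIT = 998  # RFC 5322 hard limit, excluding CRLF


def fold_long_lines(body, limit=LINE_LIMIT):
    """Hypothetical MTA behaviour: break over-long lines with CRLF + space."""
    folded = []
    for line in body.split("\r\n"):
        while len(line) > limit:
            folded.append(line[:limit])
            line = " " + line[limit:]
        folded.append(line)
    return "\r\n".join(folded)


def body_hash(body):
    """Simplified body hash: SHA-256 of the raw bytes, no canonicalization."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


original = "<p>" + "x" * 2000 + "</p>\r\n"
delivered = fold_long_lines(original)

signed_hash = body_hash(original)     # what the sender hashed and signed
verified_hash = body_hash(delivered)  # what the receiver recomputes
print(signed_hash == verified_hash)
```

Any byte inserted in transit changes the recomputed hash, so the receiver's check can no longer match what the sender signed.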

Apart from my own django projects, I have seen long lines in html emails sent by Sentry, for instance.

Choices

For reference, Perl library MIME-Lite recommends:

   Use encoding:     | If your message contains:
   ------------------------------------------------------------
   7bit              | Only 7-bit text, all lines <1000 characters
   8bit              | 8-bit text, all lines <1000 characters
   quoted-printable  | 8-bit text or long lines (more reliable than "8bit")
   base64            | Largely non-textual data: a GIF, a tar file, etc.

One way or another, we have to guarantee that email lines are <1000 characters.
base64 and quoted-printable do that for us.
Not using them means that we have to find a reliable way to split long lines into shorter ones, at the risk of breaking HTML code in text/html emails.

I am not aware of other encodings that can be used for this, nor of reliable ways to split long lines.

On django-users, Russ Magee warned about possible downstream consequences.

Other references

http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
relevant discussion on trac's trac
SpamAssassin's rule on quoted-printable messages not respecting the 76-max line length rule.

Change History (17)

comment:1 by petr.hroudny@…, 11 years ago

Cc: petr.hroudny@… added

comment:2 by phr, 11 years ago

Quoted-printable should only be used to downconvert emails to 7bit-only, not to work around their RFC non-compliance regarding line lengths.

Please note that QP works decently only for ASCII-based languages with a few accented characters, and performs miserably for all others.

Thus reintroducing any form of 7bit downconversion is not the proper solution to this problem.

comment:3 by Tim Graham, 10 years ago

Triage Stage: Unreviewed → Accepted
Type: Uncategorized → Bug

Russ seemed to accept the problem on the mailing list.

comment:4 by Ralph Broenink, 10 years ago

Apart from breaking DKIM, this behaviour also affects appearance of emails containing long lines (spaces appear to be added, at least in Gmail) and breaks inline HTML or CSS.

Note that long lines are commonplace when CSS rules are automatically being inlined.

comment:5 by Claude Paroz, 10 years ago

No using them means that we have to find a reliable way to split long lines into shorter ones, but the risk is to break html code in the case of text/html emails.

I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?

comment:6 by Henrik Levkowetz, 10 years ago

Owner: changed from nobody to Henrik Levkowetz
Status: new → assigned

in reply to:  5 comment:7 by ris, 9 years ago

Replying to claudep:

I don't get your point here, how would we break html code by splitting lines? Content inside <pre>?

Putting a newline in the middle of a tag (href with a loooooong url?) would do it.

This issue is causing me some pain too.

comment:8 by ris, 9 years ago

Cc: bugs@… added

comment:9 by Mikhail Veltishchev, 9 years ago

Please note that SpamAssassin adds points only when base64 is used for encodings that do not require it: http://wiki.apache.org/spamassassin/Rules/MIME_BASE64_TEXT
A message with long lines does require some kind of encoding (so as not to violate the RFC), so if SA adds scores for such messages, it is a bug in SA (and a contradiction of its documentation).

To be safe, you can check whether the message body has long lines and enable the base64 encoding flag only in that case. This will save bandwidth for well-formatted message bodies.

comment:10 by Michal Čihař, 9 years ago

Cc: michal@… added

comment:11 by Claude Paroz, 9 years ago

Has patch: set
Version: 1.6 → master

I added this PR to fall back to QP encoding when the body has lines longer than 998 characters. Would this be a good solution?
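For readers following along, the general idea of a conditional fallback can be sketched with the standard library alone. This is not the code from the PR: build_text_part() and LINE_LIMIT are hypothetical names, and measuring line length in UTF-8 bytes is an assumption made for this illustration.

```python
from email import charset as Charset
from email.mime.text import MIMEText

LINE_LIMIT = 998  # RFC 5322 hard limit, excluding CRLF


def build_text_part(body, subtype="plain"):
    """Hypothetical helper: use quoted-printable only when a line is too long."""
    has_long_lines = any(
        len(line.encode("utf-8")) > LINE_LIMIT for line in body.splitlines()
    )
    utf8 = Charset.Charset("utf-8")
    # None lets the stdlib pick 7bit or 8bit; QP forces short encoded lines.
    utf8.body_encoding = Charset.QP if has_long_lines else None
    return MIMEText(body, subtype, utf8)


short_part = build_text_part("Hello\nworld\n")
long_part = build_text_part("<p>" + "x" * 2000 + "</p>\n", subtype="html")
print(short_part["Content-Transfer-Encoding"])
print(long_part["Content-Transfer-Encoding"])
```

Well-formed bodies keep the unencoded 7bit/8bit form, and only bodies that would otherwise violate the limit pay the size cost of quoted-printable.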

in reply to:  11 comment:12 by notsqrt, 9 years ago

Seems like a good solution to me!

Replying to claudep:

I added this PR to fallback to QP encoding when the body has lines longer than 998. Would this be a good solution?

comment:13 by Tim Graham, 9 years ago

Triage Stage: Accepted → Ready for checkin

comment:14 by Claude Paroz <claude@…>, 9 years ago

Resolution: fixed
Status: assigned → closed

In 836d475a:

Fixed #22561 -- Prevented too long lines in email messages

Thanks NotSqrt for the excellent report and Tim Graham for the review.

comment:15 by Pavel Pokrovskiy, 8 years ago

Resolution: fixed
Status: closed → new

It appears the fix does not work properly on Cyrillic strings.

I tried it out with the following snippet:

https://gist.github.com/ppokrovsky/d06d0d9e3c8f55bf15984ecd22954683

With body set to test_body_lat, the has_long_lines flag in django.core.mail.message.SafeMIMEText() is set to True, so 'quoted-printable' encoding is correctly applied. With body set to test_body_ru, has_long_lines stays False, so the default charset is left in place, which results in a broken email body.
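One possible explanation, which this thread does not confirm and which has not been checked against the linked gist, is that a length check counting characters undercounts Cyrillic text, since each Cyrillic letter takes two bytes in UTF-8. A minimal sketch of that difference, with LINE_LIMIT and the sample line chosen purely for illustration:

```python
LINE_LIMIT = 998  # RFC 5322 hard limit, excluding CRLF

# 600 Cyrillic characters on one line, each encoded as 2 bytes in UTF-8.
line_ru = "ж" * 600

chars = len(line_ru)
octets = len(line_ru.encode("utf-8"))

long_by_chars = chars > LINE_LIMIT   # check based on character count
long_by_bytes = octets > LINE_LIMIT  # check based on encoded size
print(chars, octets, long_by_chars, long_by_bytes)
```

Under that assumption, the same line would pass a character-based check while still exceeding the limit on the wire.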

comment:16 by Claude Paroz, 8 years ago

Resolution: fixed
Status: new → closed

Instead of reopening the fixed ticket, could you please create a new one (where you can mention this ticket)?
I already have a pull request ready to fix your issue.

in reply to:  16 comment:17 by Pavel Pokrovskiy, 8 years ago

Replying to Claude Paroz:

Instead of reopening the fixed ticket, could you please create a new one (where you can mention this ticket)?
I already have a pull request ready to fix your issue.

#27696
Appreciated
