10 Jul 2011
In our last installment we discussed MIME, Unicode and UTF-8, and IDNA, three things that have brought the Internet and e-mail out of the ASCII-only, English-only era and closer to fully handling all languages. Today we'll look at the surprisingly difficult problems involved in fixing the last bit, internationalized e-mail addresses.
All of the extensions described up to this point have been backward compatible with the existing mail delivery infrastructure. MIME encodes all of the extended characters and files as ASCII text, so although you need a MIME-aware mail program on your PC (or web mail or whatever), the mail systems through which the mail passes don't have to know anything about MIME. Similarly, non-ASCII IDNA domains are encoded as funny looking ASCII A-labels starting with xn-- so, again, although applications have to know how to turn UTF-8 into A-labels, the underlying DNS software just sees the ASCII.
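Both encodings are easy to see from Python's standard library. This is only a sketch: the built-in `idna` codec implements the older IDNA 2003 rules, and the helper names and the sample domain and text are made up for illustration.

```python
from email.header import Header, decode_header


def to_a_label(domain: str) -> str:
    """Encode a Unicode domain name into its ASCII (A-label) form."""
    # The standard library's "idna" codec follows IDNA 2003; it is used
    # here only because it needs no third-party package.
    return domain.encode("idna").decode("ascii")


def to_encoded_word(text: str) -> str:
    """Encode header text as a MIME encoded-word, which is pure ASCII."""
    return Header(text, charset="utf-8").encode()


def from_encoded_word(encoded: str) -> str:
    """Reverse to_encoded_word() for a single encoded-word."""
    raw, charset = decode_header(encoded)[0]
    return raw.decode(charset)


print(to_a_label("bücher.example"))   # xn--bcher-kva.example
word = to_encoded_word("Grüße")
print(word)                           # ASCII only, safe for legacy relays
print(from_encoded_word(word))        # back to the original text
```

In both cases the software in the middle sees nothing but ASCII; only the programs at the two ends need to know about the encoding.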
Although SMTP changes very slowly, it does have a process to add new optional features, with a way for clients and servers to tell each other which options they support and are using. In 1993, a new feature called 8BITMIME added a simple but surprisingly useful option to send mail with 8-bit characters. It didn't change any of the other rules, but it did have the effect of allowing, in mail message bodies, any character set that uses the ASCII codes for carriage return, line feed, and period (as nearly all do, EBCDIC being the main exception). On computers using eight-bit bytes internally (all of them, these days) the code to support 8BITMIME is simple enough that all popular mail systems support it. So if we want to send eight-bit character codes in messages, SMTP can handle it.
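The negotiation works through the server's reply to the client's EHLO command, which lists one extension keyword per line. Here is a minimal sketch of how a client might check for 8BITMIME; the sample reply and host names are invented, and a real client would read the reply from a live connection (Python's `smtplib` does this for you).

```python
def parse_ehlo(reply: str) -> set[str]:
    """Return the extension keywords advertised in an EHLO reply."""
    keywords = set()
    lines = reply.splitlines()
    # The first line is the server's greeting; each later line names one
    # extension, as "250-KEYWORD params" or "250 KEYWORD" for the last.
    for line in lines[1:]:
        if line[:3] != "250":
            continue
        parts = line[4:].split()
        if parts:
            keywords.add(parts[0].upper())
    return keywords


# Invented sample reply, for illustration only.
SAMPLE_REPLY = (
    "250-mail.example.net greets client.example.org\r\n"
    "250-8BITMIME\r\n"
    "250-SIZE 10240000\r\n"
    "250 HELP\r\n"
)

extensions = parse_ehlo(SAMPLE_REPLY)
# A client only asks for eight-bit transport if the server offered it.
body_arg = " BODY=8BITMIME" if "8BITMIME" in extensions else ""
print(f"MAIL FROM:<sender@example.org>{body_arg}")
```

A server that doesn't list the keyword simply never sees the option, which is what lets new features roll out without breaking old software.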
The main remaining bit of international e-mail that SMTP can't yet handle is the actual e-mail addresses, both the ones in the message headers such as To: and From: and the ones in the SMTP session. If a mail system is going to be fully international, it also needs to allow non-ASCII text in prompts, error messages, and anywhere else that strings are passed around for presentation to users. This turns out to affect pretty much every piece of software in the e-mail ecosystem: the MUAs (user programs), submission agents that inject mail from user programs, MTAs that move mail from one site to another, and POP and IMAP servers that let MUAs pick up incoming mail. The changes to make POP, IMAP, and user programs language independent are relatively straightforward, so I'm going to concentrate on the tricky parts: the format of mail messages, and the SMTP delivery process.
For quite a while people attempted to invent ways to do non-ASCII addresses that were more or less backward compatible with legacy mail. Early work started in China and Japan, where ASCII-only mail was and is particularly unsatisfactory, and initially used simple approaches like allowing various eight-bit extended character sets in mail headers and addresses. Those character sets, such as ISO-2022-x, use sequences of control characters to switch among various "pages" of the character set, so that the meaning of any particular character depends on what shifts preceded it. These didn't work well for a variety of reasons. A string's meaning depends on the shift state it's in, something that was often implicit or ambiguous. The same set of characters can be represented in many different ways by adding or deleting shifts. And the various experiments tended not to pay sufficient attention to what happened if mail was sent to a mail system that didn't have their extensions and didn't understand their extended character set.
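The shift problem is easy to reproduce with Python's `iso2022_jp` codec. This is a small sketch with a sample string of my own choosing, not anything from the early experiments.

```python
encoded = "日本".encode("iso2022_jp")

# The codec wraps the two-byte characters in escape sequences: one to
# shift into the Japanese page, one to shift back to ASCII at the end.
shift_in, shift_out = encoded[:3], encoded[-3:]
payload = encoded[3:-3]
print(encoded)

# Without the shifts, the very same payload bytes are printable ASCII.
print(payload.decode("ascii"))

# With the shifts restored, they are two kanji again.
print((shift_in + payload + shift_out).decode("iso2022_jp"))

# Extra shifts change the bytes but not the text, so two different byte
# strings can stand for the same address.
redundant = "日".encode("iso2022_jp") + "本".encode("iso2022_jp")
print(redundant != encoded, redundant.decode("iso2022_jp"))
```

Software that compares or routes on the raw bytes gets different answers for strings that a reader would consider identical.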
UTF-8 doesn't have the shift problem, but it still uses characters outside the ASCII set that won't work with legacy mail systems. The next approach was to allow UTF-8 addresses, but encode them as ASCII. For domains, that's already done with A-labels and punycode; every non-ASCII domain has an ASCII version that starts with xn--. Unfortunately, that doesn't work with mailboxes (the part of the address before the @). For one thing, any special characters one might use to mark encoded UTF-8 names have already been used somewhere to mean something else. Hyphens and plus signs are used to mark sub-addresses, exclamation points for uucp addresses, percent signs for source routing, slashes for X.400 compatible addresses, equal signs for VERP bounce addresses, and so on. Even if you could find a sequence of marker characters so arcane that nobody had ever used it, mailboxes are limited to 64 ASCII characters. By the time you encoded UTF-8 as something along the lines of punycode, the length of the UTF-8 address would be limited to about 18 characters, which is rather short, particularly since people with non-ASCII addresses will want to use all of the same tricks that we ASCII users use with sub-addresses and other address extensions. So that doesn't work, either.
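To get a feel for the length problem, here is a rough sketch that encodes a mailbox with Python's `punycode` codec and checks it against the 64-character limit. The `xn--` marker is borrowed from IDNA purely for illustration (no such convention exists for mailboxes), and the overhead you see depends heavily on the script and the encoding chosen, so don't expect it to reproduce the figure of about 18 exactly.

```python
MAX_LOCAL_PART = 64   # limit on the mailbox part, in ASCII characters
MARKER = "xn--"       # hypothetical marker, assumed for illustration only


def ascii_form(mailbox: str) -> str:
    """Encode a Unicode mailbox as marker + punycode (illustration only)."""
    return MARKER + mailbox.encode("punycode").decode("ascii")


def fits(mailbox: str) -> bool:
    """True if the encoded mailbox stays within the 64-character limit."""
    return len(ascii_form(mailbox)) <= MAX_LOCAL_PART


short_name = "用户"
# 200 distinct CJK characters: far too long once encoded.
long_name = "".join(chr(0x4E00 + i) for i in range(200))

for name in (short_name, long_name):
    print(len(name), "characters ->", len(ascii_form(name)),
          "ASCII characters, fits:", fits(name))
```

Every non-ASCII character costs at least one ASCII character of encoding, and usually several, so the usable length shrinks quickly.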
The most completely worked out experiment, documented in RFCs 5335 and 5336, added a new SMTP option called UTF8SMTP, in which e-mail addresses can be almost arbitrary UTF-8, subject to the 64-byte limit on the mailbox, and the domain has to be one that can be turned into A-labels. Messages are sent using 8BITMIME, and UTF-8 can occur nearly anywhere in the message that ASCII can. Everyone with a UTF-8 address can also have an ASCII address, and one can address mail to both of them, with syntax like this:
To: <utf8mailbox@utf8domain <asciimailbox@asciidomain>>
When sending mail to a UTF8SMTP server, a client sends mail to the UTF-8 address, but provides the ASCII address as well. If the mail has to go to a server that does not handle UTF8SMTP, a complex set of downgrade rules turns any headers with UTF-8 into Downgraded-whatever: headers with MIME encoded versions of the UTF-8, replaces any UTF-8 addresses in the To: and From: lines with comments, and sends it along to the ASCII address.
While it was a magnificent piece of kludgery, after a while it became clear that this dual address system didn't work either. One reason was that it was very hard to keep the UTF-8 and ASCII addresses straight, so UTF-8 mail tended to leak into ordinary SMTP traffic, and downgraded mail into UTF-8 traffic. Furthermore, even though the dual addresses were intended as a temporary transition feature, on the Internet, no transition feature ever goes away, and Japanese and Chinese users quite reasonably did not want a system that would require that they have an ASCII address forever.
So the EAI working group threw away most of the kludges and is nearly done defining the permanent internationalized mail system for the Internet. In our next installment, we'll see how your future mail will work.
© 2005-2020 John R. Levine.