08 Jul 2011
Back when the Internet was young and servers came with shovels (for the coal), everyone on the net spoke English, and all the e-mail was in English. To represent text in a computer, each character needs to have a numeric code. The most common code set was (and is) ASCII, which is basically the codes used by the cheap, reliable Teletype printing terminals everyone used as their computer consoles. ASCII is a seven bit character code, code values 0 through 127, and it includes upper and lower case letters and a reasonable selection of punctuation adequate for written English. It also includes some obscure characters, such as @ which was chosen for the middle of e-mail addresses in part because it was on the ASCII keyboard and otherwise not much used.
But nearly every other written language requires characters outside the ASCII set. On the modern Internet, mail users live in every country in the world and write in a vast array of languages, and e-mail has been slowly evolving to handle everyone else's language. In today's note I'll describe the changes already made to Internet mail to handle other languages, and in the next message I'll describe the work in progress to handle the last missing parts.
In 1992, the first major extension to mail was MIME, Multipurpose Internet Mail Extensions. Whereas before, the body of a mail message had been an unstructured block of ASCII text, MIME provided a way to treat the mail body as a group of structured blocks of data, some of which might be text, and others of which might be pictures, documents, or any other sort of data that might be stored in a file. MIME provides standard ways to encode data that isn't seven-bit ASCII, so when you attach a picture or other file to a mail message, you're using MIME.
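As a rough illustration of that block structure, here is a minimal sketch using Python's standard email package. The addresses, subject, and file name are made up for the example, and the "image" is just a few stand-in bytes rather than a real JPEG.

```python
from email.message import EmailMessage

# Build a message with one text block and one binary block.
msg = EmailMessage()
msg["From"] = "alice@example.com"   # hypothetical addresses
msg["To"] = "bob@example.com"
msg["Subject"] = "Holiday photo"
msg.set_content("Here is the picture I mentioned.")

# Stand-in bytes, not a real image; any binary data is treated the same way.
fake_image = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + bytes(range(16))
msg.add_attachment(fake_image, maintype="image", subtype="jpeg",
                   filename="photo.jpg")

# Walk the structure: the container first, then each block inside it.
part_types = [part.get_content_type() for part in msg.walk()]
for part in msg.walk():
    print(part.get_content_type(), part.get("Content-Transfer-Encoding"))
```

Adding the attachment turns the message into a multipart container, and the binary block is given a transfer encoding so that it survives transports that only expect seven-bit text.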
MIME also provides ways to include parameters on the MIME headers describing each block of data, such as the type of data (e.g., a JPEG image or a PDF document), and for text blocks, the character set in which it is encoded, like this:
Content-type: text/plain; charset=us-ascii
Content-type: text/plain; charset=UTF-8
The default character set remained ASCII (now known as US-ASCII, since it's a US standard), but you could use any of several others.
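To see how software might use that parameter, here is a small sketch with Python's standard email package. The helper names declared_charset and decoded_body are invented for this example, and the sample messages are hypothetical.

```python
from email import message_from_bytes, policy

def declared_charset(raw_message):
    """Return the charset named in Content-Type, or us-ascii if none is given."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return msg.get_content_charset(failobj="us-ascii")

def decoded_body(raw_message):
    """Decode the text body using whatever charset the message declares."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return msg.get_content()

# A hypothetical message whose body is the UTF-8 bytes for "café".
utf8_message = (
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)

# A hypothetical message with no Content-Type header at all.
plain_message = b"Subject: hi\r\n\r\nhello\r\n"

print(declared_charset(utf8_message))   # charset taken from the header
print(declared_charset(plain_message))  # falls back to the default
print(decoded_body(utf8_message).strip())
```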
MIME also provides a way to specify character sets in many message headers such as the Subject: line. It supports the same character sets and encodings as text bodies. For example, a subject line encoded in UTF-8 (see below) can contain a character whose code is the bytes C2 B0 as hex numbers, which happens to be the degree sign (°), a small raised circle.
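Here is a sketch of header encoding using Python's standard email.header module. The subject text is a made-up example, and the exact encoded form the library picks (base64 or quoted-printable) is its own choice, so the code only checks that the result is plain ASCII and decodes back to the original.

```python
from email.header import Header, decode_header, make_header

# Hypothetical subject containing a degree sign (UTF-8 bytes C2 B0).
subject = "It is 20\u00b0 outside"

# Encode for use in a header: the result is pure ASCII.
encoded = Header(subject, charset="utf-8").encode()
print(encoded)

# Decode it again to recover the original text.
round_trip = str(make_header(decode_header(encoded)))

# A hand-written encoded word, with the two bytes spelled out as =C2=B0.
hand_written = "=?UTF-8?Q?20=C2=B0?="
decoded_hand_written = str(make_header(decode_header(hand_written)))
print(decoded_hand_written)
```

The encoded word names its own character set, so a mail program can decode each header without guessing.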
MIME encoding can be used in most mail headers, but it can't be used in e-mail addresses, which still have to be (mostly) ASCII.
Character sets, Unicode, and UTF-8
Since ASCII is a seven bit code, and most computers have eight bit memory, the obvious way to add more characters to ASCII was to extend it to eight bits, and assign non-English characters to positions 128 through 255. People did this with great enthusiasm, inventing large numbers of eight-bit extensions to ASCII. The ISO standardized over a dozen of them, known as ISO-8859-1 through ISO-8859-16, and there are many others defined by Microsoft, IBM, and others. Although eight bit extended ASCII codes have the virtue of being compact, it rapidly became painful to deal with them, both because of the profusion of incompatible codes, and because in many cases it wasn't clear which extended ASCII a file was using. And of course, no eight bit code was adequate for Chinese and other languages whose writing systems are not alphabetic.
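The ambiguity is easy to demonstrate. In this small Python sketch, the same single byte (hex E9, picked arbitrarily for the example) is decoded under three of the ISO-8859 codes and comes out as three different letters.

```python
# One byte above 127, interpreted under three different eight-bit codes.
raw = bytes([0xE9])

interpretations = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

for name, char in interpretations.items():
    # Western European, Cyrillic, and Greek readings of the same byte.
    print(name, char, f"U+{ord(char):04X}")
```

Without a label saying which code was used, nothing in the byte itself tells a program which reading is the right one.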
Starting in the late 1980s, a project called Unicode attempted to create one giant character set including every character that anyone ever used. Somewhat surprisingly, the project gained broad industry support and has become a rousing success. The original plan was to use 16 bit characters, but as more languages and characters were added, the total number of characters grew to over 100,000, so native Unicode is usually stored in 32 bit words. While few systems support all 100,000 characters, Unicode has the desirable property that either a system supports a character or it doesn't, and there's no ambiguity about codes that might represent different characters in different contexts.
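A short Python sketch shows why 16 bits turned out not to be enough. The four sample characters are arbitrary choices for illustration: a Latin letter, an accented letter, a Chinese character, and a musical symbol whose code is above 65,535.

```python
# Arbitrary sample characters: A, é, 中, and the musical G clef.
samples = ["A", "\u00e9", "\u4e2d", "\U0001d11e"]

# Each character has exactly one numeric code in Unicode.
code_points = [ord(ch) for ch in samples]

# Does the code fit in the originally planned 16 bits?
fits_in_16_bits = [cp <= 0xFFFF for cp in code_points]

# Stored as 32 bit words, every character takes the same four bytes.
utf32_sizes = [len(ch.encode("utf-32-be")) for ch in samples]

for ch, cp, fits in zip(samples, code_points, fits_in_16_bits):
    print(f"U+{cp:04X}", "fits in 16 bits" if fits else "needs more than 16 bits")
```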
Since computers still use 8 bit bytes, Ken Thompson of Bell Labs and Unix fame came up with a clever encoding known as UTF-8 that represents each Unicode character as a sequence of one or more bytes. The first 128 Unicode characters are the 128 ASCII characters, and UTF-8 encodes those 128 characters as themselves, so any string of ASCII characters is also a string of UTF-8. At this point, there is little reason to use any extended ASCII character set but Unicode and UTF-8, other than for backwards compatibility with old software. Most non-ASCII MIME text in current e-mail uses UTF-8.
Internationalized Domain Names
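The variable length is easy to see in a few lines of Python. The helper name utf8_hex is invented for this sketch, and the sample characters are arbitrary.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ").upper()

# One, two, three, and four byte encodings.
print(utf8_hex("A"))            # plain ASCII letter
print(utf8_hex("\u00b0"))       # degree sign
print(utf8_hex("\u20ac"))       # euro sign
print(utf8_hex("\U0001d11e"))   # musical G clef

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_is_unchanged = "hello".encode("utf-8") == "hello".encode("ascii")
```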
For the same reasons that Internet users want to use non-ASCII characters in their mail, they also want to use the same non-ASCII characters in the domain names they use to name web servers, mail hosts, and everything else on the Internet. Although the domain name system (DNS) internally uses eight bit bytes and could in principle have just used UTF-8 names, there was and is a vast amount of DNS support software that only handles ASCII. Unicode also in some cases allows several different ways to write the same character (e.g., for an accented é, a plain e and a separate accent, or a combined accented letter), which would make DNS lookups, which only do exact matches, less reliable.
The DNS community came up with a kludge, an encoding of much but not all of Unicode in ASCII known as punycode. If a name in the DNS is of the form xn--stuff, the stuff is punycode representing a Unicode string. The rules about what can be encoded are complex, intended to pick unique representations for the characters that have several Unicode versions. The name starting with xn-- is known as an A-label, the corresponding Unicode name is a U-label, and the whole system is known as IDNA, Internationalized Domain Names in Applications. IDNA is widely supported in web browsers, which will turn Unicode names in their address bars into A-labels before looking them up, and there are quite a lot of IDNA domain names in China, India, and Arabic speaking countries.
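Python's built-in "idna" codec gives a feel for how this works. Note the assumptions: that codec implements the older 2003 version of IDNA, which differs in some details from the current rules, and the domain names here are made-up examples under the reserved .example suffix.

```python
import unicodedata

# A U-label and its A-label form.
u_name = "b\u00fccher.example"                 # bücher.example
a_name = u_name.encode("idna").decode("ascii")
print(a_name)

# Going back from the A-label to the Unicode name.
back_again = a_name.encode("ascii").decode("idna")

# Two ways to write é: one combined letter, or e plus a separate accent.
combined = "caf\u00e9.example"
separate = "cafe\u0301.example"
same_as_typed = combined == separate

# Normalizing picks one representation for both spellings.
same_after_nfc = unicodedata.normalize("NFC", separate) == combined

# The codec normalizes before encoding, so both give the same A-label.
same_a_label = combined.encode("idna") == separate.encode("idna")
```

The two spellings of the accented name are different strings as typed, but they end up as the same ASCII name, which is what an exact-match lookup needs.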
The part of an e-mail address after the @ is a domain name, so you can use A-labels there, too. But the part of the e-mail address before the @, the mailbox, is not a domain name, and for a variety of reasons not worth describing in detail, is not amenable to encoding as punycode or an A-label, although it's reasonable to assume that the mailbox is written in UTF-8. So to have proper internationalized mail, we need an extension to the mail system to handle UTF-8 in the mailbox part of the address, as well as in the other nooks and crannies that MIME doesn't handle.
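The split between the two halves of an address can be sketched in Python. The function name domain_to_a_label and the addresses are invented for this example, and it again leans on Python's built-in "idna" codec, which follows the 2003 version of IDNA.

```python
def domain_to_a_label(address):
    """Convert only the domain of an address to A-labels; leave the mailbox alone."""
    mailbox, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("not an e-mail address: " + address)
    return mailbox + "@" + domain.encode("idna").decode("ascii")

# ASCII mailbox, Unicode domain: the result is entirely ASCII.
converted = domain_to_a_label("user@b\u00fccher.example")
print(converted, converted.isascii())

# Unicode mailbox too: the domain is converted, the mailbox is still not ASCII.
still_unicode = domain_to_a_label("j\u00fcrgen@b\u00fccher.example")
print(still_unicode, still_unicode.isascii())
```

The second address shows the gap: the domain converts cleanly, but the mailbox stays outside ASCII no matter what is done to the domain.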
In our next installment, we'll see how the IETF is planning to do just that.
© 2005-2015 John R. Levine.