03 May 2018
Recently I've been working on EAI (Email Address Internationalization) mail, looking at what software is available (Gmail and Outlook/Hotmail both handle it now) and what work remains to be done. A surprisingly tricky part is assigning EAI addresses to users.
In traditional ASCII mail, the local part of the address, what goes before the @ sign, can be any printable ASCII characters. Although an address like %i()/;~email@example.com is valid, and mail systems will handle it, users don't want addresses like that. A good address is one that is easy to remember, easy to tell someone over the phone, and easy to type.
Mail systems all give senders some help when interpreting addresses. If an address is Bob@example, they'll accept bob@ or BOB@. If the address is joe.smith@, they'll accept Joe.Smith@ and often variations in punctuation like joesmith@ without the dots.
The flip side of this is that you don't assign different addresses that are too similar. While it is technically possible that BOB@ and bob@ could deliver to different mailboxes, nobody does that. Similarly, nobody makes joesmith@ and joe.smith@ different. (They may not both work, but if they do, they're the same mailbox.)
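As a rough sketch of the two rules above (accept liberally, assign conservatively), a mail system could reduce every local part to a comparison key and refuse to create a mailbox whose key is already taken. The function names here are hypothetical, and real systems differ in which punctuation they ignore; this one only folds ASCII case and drops dots.

```python
def canonical_local_part(local: str) -> str:
    """Reduce an ASCII local part to a comparison key.

    Assumption: the system ignores case and dots; other
    punctuation is left alone.
    """
    return local.lower().replace(".", "")


def can_assign(new_local: str, existing: set[str]) -> bool:
    """Return True if new_local does not collide with any existing mailbox."""
    key = canonical_local_part(new_local)
    return all(canonical_local_part(name) != key for name in existing)


mailboxes = {"joe.smith", "bob"}
print(can_assign("JoeSmith", mailboxes))  # collides with joe.smith
print(can_assign("alice", mailboxes))     # no collision
```

The same key function serves both directions: incoming addresses are looked up by key, and new addresses are checked by key before they are issued.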
The domain (the part of the address after the @ sign) has to follow the DNS rules, which don't allow any fuzzy matching other than ASCII upper and lower case.
How does all this extend into EAI mail?
EAI extends ASCII addresses in a straightforward way -- in addition to any printable ASCII characters the local part can contain any printable UTF-8 characters, and the domain can be UTF-8 U-labels. As before, users will have an easier time if mail systems assign addresses conservatively and interpret addresses on incoming mail liberally.
The PRECIS working group at the IETF defined string classes for different applications. The Identifier class works well for mailbox names: it allows codepoints that are (roughly) letters and digits in various languages.
It also provided rules to prepare UTF-8 strings for use. Unicode often provides multiple ways to represent exactly the same character, e.g., a single codepoint for an accented character é or separate e and accent codepoints. It often also has variant characters that look different but mean approximately or exactly the same thing, such as full-width and half-width versions of characters, Latin digits 12345 and Arabic digits ١٢٣٤٥, or traditional and simplified Chinese characters. To prepare a string, software maps variant codepoints into preferred ones, usually precomposed characters such as é. Mail systems should assign mailbox names in prepared form, but they can and should accept addresses in incoming mail in any form since they can prepare them as they receive them. (This is different from the DNS where DNS servers only do exact matches so the client has to do any preparation.)
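The preparation step can be approximated with Python's standard `unicodedata` module. This is only a sketch, not a PRECIS implementation: it assumes that NFKC normalization followed by lower-casing is close enough to show the idea, and NFKC folds somewhat more than PRECIS width mapping does. The function name `prepare` is made up for the example.

```python
import unicodedata


def prepare(local: str) -> str:
    """Approximate preparation of a mailbox name.

    NFKC composes e + combining accent into a single codepoint and
    maps full-width and half-width variants to their ordinary forms.
    It does not unify Latin and Arabic digits, or traditional and
    simplified Chinese characters.
    """
    return unicodedata.normalize("NFKC", local).lower()


decomposed = "jose\u0301"   # e followed by a combining acute accent
precomposed = "jos\u00e9"   # single precomposed codepoint
print(prepare(decomposed) == precomposed)

fullwidth = "".join(chr(ord(c) + 0xFEE0) for c in "joe")  # full-width letters
print(prepare(fullwidth) == "joe")
```

A mail system would store `prepare(name)` when it assigns a mailbox and apply the same function to the local part of each incoming address before looking it up.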
There's no reason that a mail system's fuzzy matching has to stop where PRECIS and ASCII addresses did. The Latin and Arabic digits aren't the same for PRECIS, but it's easy enough for a mail system to map them together and to ensure that it doesn't issue two mailboxes with digits that collide. In languages written in Latin script with accented or multiple forms of characters (such as the Turkish dotless ı), a conservative mail system would avoid assigning addresses that differ only in the form of a letter, and would accept all versions of the letter, even ones that aren't valid or equivalent in the user's language. For example, even though Turkish speakers wouldn't write i for ı, correspondents who don't speak Turkish might, and it's easier all around if the slightly misspelled address works. Similarly, in Scandinavian languages the letters O Ø Ö are different, but it'd be a good idea to accept the wrong versions in incoming addresses.
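One way to sketch this looser matching is a second, more aggressive key that maps all decimal digits to ASCII, drops accents, and folds a few letters by hand. Everything here is an assumption for illustration: the name `fuzzy_key`, the tiny `EXTRA_FOLDS` table, and the choice to strip every combining mark, which suits Latin script but would be too aggressive for scripts where combining marks carry more meaning.

```python
import unicodedata

# Hypothetical hand-made table for letters that have no Unicode
# decomposition, so stripping accents does not reach them.
EXTRA_FOLDS = str.maketrans({"ı": "i", "ø": "o"})


def fuzzy_key(local: str) -> str:
    """Build a loose comparison key for a mailbox name."""
    out = []
    for ch in unicodedata.normalize("NFKD", local):
        if unicodedata.combining(ch):
            continue  # drop accents and other combining marks
        digit = unicodedata.decimal(ch, -1)
        if digit >= 0:
            out.append(str(digit))  # Arabic, full-width, etc. digits to ASCII
        else:
            out.append(ch)
    return "".join(out).lower().translate(EXTRA_FOLDS)


def collides(new_local: str, existing: set[str]) -> bool:
    """True if new_local is too similar to a mailbox already assigned."""
    key = fuzzy_key(new_local)
    return any(fuzzy_key(name) == key for name in existing)


print(fuzzy_key("ışık"))                # Turkish name typed without special letters
print(collides("user123", {"user١٢٣"}))  # Latin vs. Arabic digits
```

The key is only used for comparison. The mailbox keeps the spelling its owner chose, while lookups on incoming mail and collision checks at assignment time both go through `fuzzy_key`.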
Mail systems have only recently started to assign EAI addresses, and I'm not yet aware of any of them doing fuzzy matching on incoming addresses. But for the same reason we have found it a good idea to allow jimsmith@ for jim.smith@ in ASCII mail, EAI mail systems will have to figure out how to adapt to however their correspondents type the EAI addresses.
© 2005-2020 John R. Levine.