In traditional ASCII mail, the local part of the address, the part that
goes before the @ sign, can contain any printable ASCII characters.
Although an address like %i()/;~f@examp1e.com is valid, and
mail systems will handle it, users don't want addresses like
that.
A good address is one that is easy to remember, easy to tell someone over the phone,
and easy to type.
Mail systems all give senders some help
when interpreting addresses. If an address is Bob@,
they'll accept bob@ or BOB@. If the address is joe.smith@,
they'll accept Joe.Smith@ and often variations in punctuation
like joesmith@ without the dots.
The flip side of this is that you don't assign different addresses
that are too similar. While it is technically possible that BOB@
and bob@ could deliver to different mailboxes, nobody does that.
Similarly, nobody makes joesmith@ and joe.smith@ different.
(They may not both work, but if they do, they're the same mailbox.)
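As a rough sketch of the idea, a mail system could reduce each local part to a matching key and use that same key both when assigning mailboxes and when looking up incoming addresses. The Python below is only an illustration: ascii_match_key and MailboxRegistry are hypothetical names, and real systems differ in which punctuation they ignore.

```python
from typing import Dict, Optional


def ascii_match_key(local_part: str) -> str:
    """Collapse case and dots so equivalent spellings share one key (assumed rule)."""
    return local_part.lower().replace(".", "")


class MailboxRegistry:
    """Hypothetical registry that refuses mailbox names that are too similar."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def assign(self, local_part: str) -> bool:
        """Register a new mailbox; return False if it collides with an existing one."""
        key = ascii_match_key(local_part)
        if key in self._by_key:
            return False
        self._by_key[key] = local_part
        return True

    def lookup(self, local_part: str) -> Optional[str]:
        """Find the assigned mailbox for an address as typed by a sender."""
        return self._by_key.get(ascii_match_key(local_part))
```

With this, assigning joe.smith and then trying to assign JoeSmith fails, while incoming mail to Joe.Smith@ or joesmith@ finds the same mailbox.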
The domain (the part of the address after the @ sign) has to follow the
DNS rules, which don't allow any fuzzy matching other than ASCII upper
and lower case.
How does all this extend into EAI mail?
EAI extends ASCII addresses in a straightforward way -- in addition to
any printable ASCII characters the local part can contain any printable
UTF-8 characters, and the domain can be UTF-8 U-labels.
As before, users will have an easier time if mail systems assign addresses
conservatively and interpret addresses on incoming mail liberally.
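On the domain side, each U-label has an equivalent ASCII A-label (the xn-- form), which is what actually goes into the DNS. Here is a minimal sketch of converting between the two with Python's built-in idna codec. Note the assumptions: that codec implements the older IDNA 2003 rules, which differ from IDNA 2008 for some characters, and the helper names to_a_label and to_u_label are made up for this example.

```python
def to_a_label(domain: str) -> str:
    """Convert a domain with U-labels to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_u_label(domain: str) -> str:
    """Convert an ASCII domain with A-labels back to Unicode."""
    return domain.encode("ascii").decode("idna")


# b\u00fccher is "bücher", written with an escape to keep the source ASCII.
print(to_a_label("b\u00fccher.example"))
print(to_u_label("xn--bcher-kva.example"))
```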
The PRECIS working group at the IETF defined string classes for different
applications.
The Identifier class works well for mailbox names; it consists of codepoints that are
(roughly) letters and digits in various languages.
The working group also provided rules to prepare UTF-8 strings for use.
Unicode often provides multiple ways to represent exactly the same
character, e.g., a single codepoint for an accented character é or separate e and accent
codepoints.
It often also has variant characters that look different but mean approximately or exactly the
same thing, such as full-width and half-width versions of characters,
Latin digits 12345 and Arabic digits ١٢٣٤٥,
or traditional and simplified Chinese characters.
To prepare a string, software maps variant codepoints into preferred ones,
usually precomposed characters such as é.
Mail systems should assign mailbox names in prepared form, but they can and should accept
addresses in incoming mail in any form since they can prepare them as they receive them.
(This is different from the DNS where DNS servers only do exact matches so the client
has to do any preparation.)
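To make the preparation step concrete, here is a small approximation using Python's standard unicodedata module. It is not a PRECIS implementation: it uses NFKC normalization, which folds more compatibility variants than the PRECIS width mapping does, and the function name prepare_local_part is hypothetical. It does not touch Arabic digits or traditional versus simplified Chinese characters.

```python
import unicodedata


def prepare_local_part(local_part: str) -> str:
    """Approximate preparation: fold width variants, lowercase, then compose."""
    # NFKC maps full-width and half-width forms to their ordinary
    # counterparts and composes base letter + accent sequences.
    folded = unicodedata.normalize("NFKC", local_part)
    # Lowercasing can in rare cases leave decomposed sequences behind,
    # so normalize to composed form (NFC) again afterwards.
    return unicodedata.normalize("NFC", folded.lower())


# "e" followed by a combining acute accent becomes the single codepoint U+00E9.
assert prepare_local_part("e\u0301") == "\u00e9"
# Full-width "JOE" (U+FF2A U+FF2F U+FF25) becomes plain ASCII "joe".
assert prepare_local_part("\uff2a\uff2f\uff25") == "joe"
```

A mail system that stores mailbox names in this prepared form can run the same function over the local part of each incoming address before looking it up.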
There's no reason that a mail system's fuzzy matching has to stop where PRECIS and
ASCII addresses did.
The Latin and Arabic digits aren't the same for PRECIS, but it's easy enough for a
mail system to map them together and to ensure that it doesn't issue two mailboxes
with digits that collide.
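Mapping the digits together takes only a few lines. This sketch relies on unicodedata.decimal from the Python standard library, which returns the numeric value of any Unicode decimal digit; the function name fold_digits is just a placeholder.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # decimal() returns the default (None here) for non-digit characters.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


# U+0661 U+0662 U+0663 are the Arabic digits one, two, three.
assert fold_digits("bob\u0661\u0662\u0663") == "bob123"
```

Using the folded form as the key when assigning mailboxes ensures that two names differing only in digit style can't both be issued.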
In languages written in Latin script with accented or multiple forms of characters (such as the Turkish
dotless ı), a conservative mail system would avoid assigning addresses that
differ only in the form of a letter, and would accept all versions of the letter,
even ones that aren't valid or equivalent in the user's language.
For example, even though Turkish speakers wouldn't write i for ı, correspondents
who don't speak Turkish might, and it's easier all around if the slightly
misspelled address works.
Similarly, in Scandinavian languages the letters O Ø Ö are different,
but it'd be a good idea to accept the wrong versions in incoming addresses.
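One way such loose matching might look is sketched below. It decomposes each character, drops the combining accents, and then applies a small table for letters such as ı and ø that Unicode does not decompose. The names loose_key and EXTRA_FOLDS are hypothetical, and the table is illustrative rather than complete.

```python
import unicodedata

# Letters with no Unicode decomposition need explicit entries.
# Illustrative only: dotless i (U+0131) and o with stroke (U+00F8).
EXTRA_FOLDS = str.maketrans({"\u0131": "i", "\u00f8": "o"})


def loose_key(local_part: str) -> str:
    """Reduce a local part to unaccented lowercase letters for matching."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", local_part)
    # combining() is nonzero for combining marks, which are dropped here.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().translate(EXTRA_FOLDS)


# J\u00f8rgen, J\u00f6rgen and Jorgen all map to the same key.
assert loose_key("J\u00f8rgen") == loose_key("J\u00f6rgen") == loose_key("Jorgen")
```

The key is only for matching. The mailbox keeps its properly spelled name, and the system would refuse to assign a second mailbox whose key collides with an existing one.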
Mail systems have only recently started to assign EAI addresses, and I'm not yet aware of
any of them doing fuzzy matching on incoming addresses.
But for the same reason we have found it a good idea to
allow jimsmith@ for jim.smith@ in ASCII mail, EAI mail systems
will have to figure out how to adapt to however their correspondents type the EAI addresses.