28 Jan 2018
Unicode's goal, which it meets quite well, is that whatever text you want to represent in whatever language, dead or alive, Unicode can represent the characters or symbols it uses. Any computer with a set of Unicode typefaces and suitable layout software can display that text. In effect, Unicode is primarily a typesetting language.
Over in the domain name system, we also use Unicode to represent non-ASCII identifiers. That turns out to be a problem, because an identifier needs a unique form, something that doesn't matter for typesetting.
For a name in the DNS, and for most other kinds of identifiers, if a user sees an identifier in use somewhere, she needs to be able to type or otherwise enter that identifier so that what she typed produces the same bits as the stored identifier. In some cases (see mailboxes, below) the rule is slightly relaxed so that given two strings, the computer can decide whether they identify the same thing.
Unicode is full of homoglyphs, characters or groups of characters that look the same, but have different internal forms. We (mostly meaning the Unicode consortium, IETF, and ICANN) have come up with three ways to minimize the homoglyph problem and try to limit Unicode internationalized domain names (IDNs) so that two IDNs that look the same actually are the same.
Homoglyphs are nothing new. Those of us old enough to remember manual typewriters remember that they often had only the digits 2 through 9, and we used lower case letter l and upper case O for 1 and 0. It didn't matter, because the meaning was obvious from context. But when used as identifiers in the DNS, there's no context, and a name like "operator" is not the same as "0perator" or "0perat0r".
In some cases Unicode offers multiple ways to write the exact same character, such as á which can be written as two glyphs, "a" followed by "combining acute accent", or as a single precomposed glyph "a with acute accent". Unicode defines several normalization forms, one of which consists of characters that are as composed as possible, known as Normalization Form C (NFC). The IETF's Internationalized Domain Names for Applications (IDNA) requires that all IDNs be in NFC, and that input Unicode be converted to NFC before being used as a domain name. This handles only the cases where the two forms appear exactly the same, not characters that merely look similar.
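As a small illustration of the normalization step, here is a sketch using Python's standard unicodedata module. The helper name to_nfc is made up for this example, and a real IDNA implementation does considerably more than normalize.

```python
import unicodedata

# Two ways to write the same character.
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE


def to_nfc(label):
    """Return the label in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", label)


# Before normalization the two strings are different sequences of code points.
print(decomposed == precomposed)                    # False
# After normalization they are identical.
print(to_nfc(decomposed) == to_nfc(precomposed))    # True

# NFC does nothing for look-alikes from different scripts:
# Latin "o" and Cyrillic U+043E stay distinct.
print(to_nfc("o") == to_nfc("\u043e"))              # False
```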
A related but different issue is different scripts. Unicode defines a script as a set of characters used to write one or more languages. Familiar scripts include Latin (used to write most European languages), Cyrillic, Greek, and Arabic. Different scripts often have characters that look the same, e.g., the Latin letter "o", Cyrillic "o", and Greek omicron.
Most domain registries have a list of scripts in which they will accept registrations, and each registered name usually has to be in a single script. In some cases names are restricted to a single language in a single script (e.g., French or Portuguese which use different accents), or a mixture of compatible scripts, notably Japanese names which allow Katakana, Hiragana, Han (Kanji ideographs), and Latin. This largely deters homograph attacks at the registration level other than some arcane examples where people have constructed what look like English names entirely from homographs in Cyrillic or Greek.
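To make the single-script rule concrete, here is a rough sketch of such a check. Python's standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's Unicode name. That is a heuristic chosen for illustration, not how a registry actually does it, and the function names are invented.

```python
import unicodedata


def rough_script(ch):
    """Guess a script from the character name, e.g. 'LATIN' or 'CYRILLIC'.

    Heuristic only: real implementations use the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(label):
    """Set of guessed scripts for the letters in a label (digits, hyphens skipped)."""
    return {rough_script(ch) for ch in label if ch.isalpha()}


def is_single_script(label):
    return len(scripts_in(label)) <= 1


mixed = "p\u0430ypal"                   # Latin letters with a Cyrillic U+0430 in place of "a"
all_cyrillic = "\u0441\u043e\u0440\u0435"   # four Cyrillic letters that resemble Latin "cope"

print(scripts_in(mixed))                # contains both 'LATIN' and 'CYRILLIC'
print(is_single_script(mixed))          # False
print(is_single_script(all_cyrillic))   # True
```

The last line shows the arcane case mentioned above: a name built entirely from Cyrillic look-alikes passes a single-script check.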
All ICANN contracted registries are supposed to file their tables of permitted characters for each language in an IANA repository and many have.
Registry script rules are generally only enforced for the name directly registered, and not for anything below it, so you can see names like
Language generation rules and bundling
The last level of confusion is among characters that don't necessarily look the same but in some sense mean the same thing. Examples include traditional and simplified Chinese characters, and in some European languages, vowels with and without accents. In script tables, one character can be listed as a variant of another, and registries have rules about them. Some forbid registration of names that differ only in characters that are variants, while others "bundle" names so that a registrant can get some or all variants of a name.
Variants have their limits; they can't express character sequences of different length, such as the German ö and ß, which are usually equivalent to "oe" and "ss", but they avoid a lot of problems, particularly in Chinese and Japanese.
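Here is a minimal sketch of how a registry might use a variant table to block or bundle names. The table entries, the function names, and the idea of reducing each label to a "bundle key" are assumptions made for illustration. Real tables are per-script and far larger.

```python
# Hypothetical one-to-one variant table: each character maps to the
# character it is considered a variant of.
VARIANTS = {
    "\u570b": "\u56fd",   # traditional U+570B -> simplified U+56FD ("country")
    "\u00e9": "e",        # e with acute -> plain e
}

# Variants are character-for-character, so every entry is one code point long.
assert all(len(k) == 1 and len(v) == 1 for k, v in VARIANTS.items())


def bundle_key(label):
    """Replace each character by its listed variant, if it has one."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


def conflicts(new_label, registered):
    """True if new_label differs from a registered name only in variants."""
    key = bundle_key(new_label)
    return any(bundle_key(name) == key for name in registered)


print(conflicts("\u4e2d\u570b", ["\u4e2d\u56fd"]))   # True: traditional vs. simplified
print(conflicts("cafe", ["caf\u00e9"]))              # True: differs only by an accent
print(conflicts("koeln", ["k\u00f6ln"]))             # False: "oe" vs. one character
```

A registry that forbids variants would reject a name when conflicts() is true. One that bundles would allocate the conflicting names to the same registrant.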
So who cares?
The reason I went through all this is twofold. One is related to the DNS: there are good reasons that the characters in Unicode DNS labels are limited, and you can't use, to revisit a recent argument, emoji. If you want to use emoji in text messages or other contexts that are like typesetting, that's fine. But they make dreadful identifiers since there are lots and lots of emoji that look almost the same, frequently deliberately so. For most emoji that look like people, you can add modifier glyphs for any of five skin tones, and male or female gender. You can make several emoji display as a super-emoji group, say man and heart and woman as 👩❤️👨 which looks cute but is a challenge to type since it's a sequence of six glyphs that have to be entered in the right order: woman, combine, heart, alternate-version, combine, man.
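To see what has to be entered, the sequence can be built from explicit escapes and listed code point by code point. This sketch uses only the standard unicodedata module. The "combine" step above is U+200D ZERO WIDTH JOINER and "alternate-version" is U+FE0F VARIATION SELECTOR-16. The helper name code_points is made up.

```python
import unicodedata

# woman, joiner, heart, variation selector, joiner, man
couple = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F468"


def code_points(text):
    """List the code points in a string as 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


for ch in couple:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")

print(len(couple))   # 6
```

A display that supports the sequence draws one picture, but an identifier comparison sees six code points, and leaving out the invisible U+FE0F yields a different string that may render identically.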
If the emoji for slightly frowning face 🙁 and slightly frowning face with open mouth 😦 look nearly identical, it makes no difference in a text message, but it makes them terrible identifiers. Imagine you registered one, built a web site around it, and then a competitor registered the other. How can you explain to your customers which is the real one?
To avoid this problem, in principle people could create an emoji script table that groups together similar-looking emoji as variants, and otherwise limits the allowable emoji to ones that look different enough that people could reliably recognize them if they saw them in an ad on the side of a bus. But nobody will. It's not worth anyone's time since emoji DNS names are at most a gimmick.
The other reason is that DNS labels are not the only place on the Internet where we have text identifiers. Two other familiar ones are the path in URLs, the part after the domain name, and the mailbox in an e-mail address. Mailboxes in particular are a challenge, since only the system hosting the mailbox knows the meaning of an address, and although every mail system does some kind of fuzzy match, the fuzz varies a lot. For ASCII mailboxes, everyone does upper/lower case folding, some ignore dots, some trim off suffixes after hyphens or plus signs, some do other things, but it entirely depends on the mail system. Systems with Unicode addresses will do similar things, but it's a lot harder since the details of case folding are highly language specific (even among languages written in Latin characters), and there are a lot of things that might be considered to be like dots or hyphens.
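The kind of fuzzy matching described above might look like the sketch below. Everything in it is hypothetical: the function names, the options, and the particular rules are one plausible combination, not any real mail system's behavior. It uses Python's built-in str.casefold(), which applies Unicode's default, language-independent case folding.

```python
def canonical_mailbox(local_part, *, ignore_dots=False, tag_separators="+"):
    """Reduce a mailbox name to a comparison key under one set of local rules."""
    key = local_part.casefold()          # default Unicode case folding
    for sep in tag_separators:           # trim "+suffix" (or "-suffix") tags
        key = key.split(sep, 1)[0]
    if ignore_dots:
        key = key.replace(".", "")
    return key


def same_mailbox(a, b, **rules):
    """Decide whether two strings name the same mailbox under the given rules."""
    return canonical_mailbox(a, **rules) == canonical_mailbox(b, **rules)


print(same_mailbox("John.Smith+lists", "johnsmith", ignore_dots=True))   # True
print(same_mailbox("John.Smith+lists", "johnsmith"))                     # False

# Default folding is not language aware: it turns German sharp s into "ss",
# and it folds "I" to "i", which is not what Turkish expects (dotless U+0131).
print(canonical_mailbox("Stra\u00dfe"))       # strasse
print(same_mailbox("I", "\u0131"))            # False
```

The same pair of strings matches on one system and not on another, which is why only the hosting system can say what an address means.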
While the DNS character rules can be a useful guide to designing rules for other applications, it's unlikely they can be applied directly (e.g., DNS names never ignore dots, mailboxes sometimes do.) We still have a lot to learn about what's a usable identifier in what contexts.
© 2005-2020 John R. Levine.