Saturday, April 25, 2009

Internationalized Domain Names

Internationalized Domain Names (IDNs) are displayed as Punycode in most browsers by default, although it depends somewhat on the TLD. (For example, Firefox will show the Punycode domain name for all TLDs except those specified here.) In the quest for the domain-name-to-rule-them-all, I have been searching for domains that are readable in both Unicode and Punycode. For example, http://ωaψward.gr/ converts to http://xn--award-beef.gr. Pretty much a waste of time, but there are a few interesting side effects that I noticed.
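For anyone who wants to poke at the conversion themselves, here is a minimal sketch using the JDK's java.net.IDN class. The class and method names (IdnRoundTrip, toPunycode, toUnicode) are made up for illustration; only IDN.toASCII and IDN.toUnicode are real JDK calls. The Greek letters are written as Unicode escapes so the source file encoding doesn't matter.

```java
import java.net.IDN;

// Hypothetical helper class for illustration; only java.net.IDN is from the JDK.
class IdnRoundTrip {
    // Unicode domain -> ASCII-compatible (Punycode) form.
    static String toPunycode(String unicodeDomain) {
        return IDN.toASCII(unicodeDomain);
    }

    // ASCII-compatible form -> Unicode. IDN.toUnicode hands back the input
    // unchanged when a label cannot be decoded.
    static String toUnicode(String asciiDomain) {
        return IDN.toUnicode(asciiDomain);
    }

    public static void main(String[] args) {
        // \u03c9 is GREEK SMALL LETTER OMEGA, \u03c8 is GREEK SMALL LETTER PSI
        String unicode = "\u03c9a\u03c8ward.gr";
        System.out.println(toPunycode(unicode));
        // What the console shows for this line depends on its encoding.
        System.out.println(toUnicode("xn--award-beef.gr"));
    }
}
```

Each dot-separated label is converted on its own, which is why the plain ASCII .gr comes through untouched.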

To start with, I played around manually with a simple conversion app I wrote in Java using the IDN class. After a few minutes it was fairly clear that it was never going to happen with this approach. Working backwards was faster: build a Punycode label out of two valid English words (i.e. xn--apple-banana) and decode it to see what Unicode comes out. It still involved far too much effort, but introducing a few filters helped to cull out URLs that are obviously not words. Some of the more interesting ones were:
  • http://www.ωaψward.gr --> http://xn--award-beef.gr
  • кill.gr --> xn--ill-bed.gr
  • ႦႲႶႻႸႽ.org --> xn--endymion.org (endymion is SO a word)
  • 汤.cn --> xn--ftw (汤 means soup in Chinese, and soup is clearly ftw)
Then there were a few weird ones which deserve special mention:
  • f̹ace̸̸bo̸ok.com --> xn--facebook-deface.com (facebook defaced, get it?)
  • ̱yahoo.cn --> xn--yahoo-end.cn (Yahoo ended by homograph attack!)
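The two-word trick behind these lists can be sketched roughly as follows. This is not the original app, just a guess at its shape: PunycodeWordPairs and decodablePairs are hypothetical names, the word lists are whatever you feed in, and the only filter shown is "did it decode at all". It leans on the fact that IDN.toUnicode returns its input unchanged when a label does not decode to something valid.

```java
import java.net.IDN;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the word-pair search; names are illustrative only.
class PunycodeWordPairs {
    // Builds "xn--<first>-<second>" for every pair of words and keeps the
    // labels that java.net.IDN is willing to decode into Unicode.
    static Map<String, String> decodablePairs(List<String> firstWords, List<String> secondWords) {
        Map<String, String> hits = new LinkedHashMap<>();
        for (String first : firstWords) {
            for (String second : secondWords) {
                String label = "xn--" + first + "-" + second;
                String decoded = IDN.toUnicode(label);
                // Unchanged output means the label was rejected or not decodable.
                if (!decoded.equals(label)) {
                    hits.put(label, decoded);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("award", "beef", "ill", "bed");
        for (Map.Entry<String, String> hit : decodablePairs(words, words).entrySet()) {
            System.out.println(hit.getKey() + " --> " + hit.getValue());
        }
    }
}
```

Most pairs decode to junk, so in practice you would still want extra filters on the Unicode side (for example, requiring all non-ASCII characters to come from one script).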
As it turns out, however, the man blocks you from making up cool domain names with bourgeois neo-capitalist rules. The main restrictions are IDNA (RFC 3490) & NAMEPREP (RFC 3491), which prohibit a lot of the Unicode goodness. All a bit TLDR but basically it means "Bad luck Genghis Botherder, no MONGOLIAN VOWEL SEPARATOR for you".
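A quick way to see those rules bite is to let java.net.IDN do the checking, since it applies NAMEPREP as part of IDN.toASCII and throws IllegalArgumentException when a label does not conform. The sketch below is only an illustration: NameprepCheck and passesIdna are hypothetical names, and it assumes (as I read the stringprep tables) that U+180E MONGOLIAN VOWEL SEPARATOR is on the prohibited list. A registry can of course add its own rules on top, which this does not model.

```java
import java.net.IDN;

// Hypothetical helper; the only real API used here is java.net.IDN.toASCII.
class NameprepCheck {
    // True if java.net.IDN accepts the label under IDNA 2003 / NAMEPREP,
    // false if it rejects it with an IllegalArgumentException.
    static boolean passesIdna(String label) {
        try {
            IDN.toASCII(label);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Greek omega and psi mixed with ASCII letters
        System.out.println(passesIdna("\u03c9a\u03c8ward"));
        // Same idea with a MONGOLIAN VOWEL SEPARATOR (U+180E) smuggled in
        System.out.println(passesIdna("bot\u180eherder"));
    }
}
```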

I tried registering http://www.ωaψward.gr with Greek registrars, since the only non-ASCII characters in it are Greek, but I was thwarted, mainly by my complete lack of Greek & Verified by Visa completely failing. (wtf is that shit anyway...srsly) If anyone more clueful than me can shed some light on Unicode domain rules that would be cool. (Yes Chris Weber, I am talking to you.)