Internationalized domain names, are you ready?

Since May 11 TLDs (top-level domain names) have been added. For this to work successfully, a lot of applications will have to be fixed.

Many email-validation scripts might use an approach like this:

    $ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i', $email);

This one is pretty simple: it matches the most common address formats, as long as the TLD (.com, .nl, .uk, etc.) is between 2 and 6 letters long. For a bit more sophistication, you might want to ensure that the TLD is one that actually exists:

    $ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i', $email);

Note: both of these regexes were taken from regular-expressions.info, the top Google hit, which has decent examples.

The new TLDs use non-ASCII characters, and they may become aliases for existing top-level domains or entirely new TLDs. Here are the currently working examples:

At first sight these look like regular UTF-8 characters, but if you look at the source code of this page, you'll notice that they are actually encoded differently.

The Korean URL http://실례.테스트 is actually encoded as http://xn--9n2bp8q.xn--9t4b11yi5a/. This encoding is called Punycode.

If you want to support these new URLs (and thus these domain names in email addresses), you need Punycode support. You will likely receive UTF-8 encoded domain names in email addresses (example@실례.테스트), but internally you should make sure that you only deal with the Punycode representation.
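To see why this matters, the two patterns from earlier can be run against an ordinary address, the UTF-8 form, and the Punycode form. This is only a small sketch; the variable names $simple, $tldList and $addresses are made up for the illustration.

```php
<?php
// The two patterns shown earlier in this post, copied verbatim.
$simple  = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i';
$tldList = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i';

// One plain ASCII address, plus the UTF-8 and Punycode forms of the Korean example.
$addresses = [
    'joe@example.com',
    'example@실례.테스트',
    'example@xn--9n2bp8q.xn--9t4b11yi5a',
];

// preg_match() returns 1 on a match and 0 on no match.
foreach ($addresses as $address) {
    printf(
        "%s simple=%d tldList=%d\n",
        $address,
        preg_match($simple, $address),
        preg_match($tldList, $address)
    );
}
```

Only the first address passes. The UTF-8 form fails because the character classes are ASCII-only, and the Punycode form fails because its TLD contains digits and hyphens and is longer than six characters.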

This translation is also what modern browsers do. If you paste "http://xn--9n2bp8q.xn--9t4b11yi5a/" directly into the Firefox address bar, it will show you the UTF-8 characters instead. Firefox still re-encodes to Punycode, though, and uses that format for HTTP requests.

The best way to check for valid email addresses is really to use a very liberal regex, and then verify with a simple MX record lookup that a mail server exists for the given domain. This example expands on the first regex.

    $email = 'example@xn--9n2bp8q.xn--9t4b11yi5a';

    // Liberal syntax check; the capture group holds the hostname.
    if (preg_match('/^[A-Z0-9._%+-]+@([A-Z0-9.-]+\.[A-Z0-9-]{2,})$/i', $email, $matches)) {
        $hostname = $matches[1];
        // getmxrr() returns true when at least one MX record was found.
        if (getmxrr($hostname, $hosts)) {
            echo "Host has an MX record\n";
        } else {
            echo "Host does not exist or does not have an MX record\n";
        }
    } else {
        echo "Email address did not match regular expression\n";
    }
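One caveat with the check above: the PHP manual note quoted in the comments below points out that when a domain lists no MX records, the hostname itself should be used as the mail exchanger. A more forgiving check could fall back to address records, roughly like the sketch below. The names domainAcceptsMail() and $hasRecord are hypothetical and only exist so the logic can be tried without network access; by default the function calls PHP's checkdnsrr().

```php
<?php
/**
 * Sketch: treat a domain as mail-capable if it has an MX record or,
 * failing that, an address record (the implicit-MX fallback).
 * $hasRecord is a hypothetical hook; it defaults to checkdnsrr().
 */
function domainAcceptsMail(string $domain, ?callable $hasRecord = null): bool
{
    $hasRecord = $hasRecord ?? 'checkdnsrr';

    // Try MX first, then fall back to IPv4 and IPv6 address records.
    foreach (['MX', 'A', 'AAAA'] as $type) {
        if ($hasRecord($domain, $type)) {
            return true;
        }
    }

    return false;
}

// Offline usage with a fake resolver that only knows one A record.
$fake = function (string $host, string $type): bool {
    return $host === 'example.test' && $type === 'A';
};

var_dump(domainAcceptsMail('example.test', $fake)); // bool(true)
var_dump(domainAcceptsMail('missing.test', $fake)); // bool(false)
```

Called without the second argument, the function performs real DNS lookups, so the result depends on the resolver of the machine it runs on.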

The preceding code does not convert UTF-8 to Punycode, though. There's not yet an easy native way in PHP to do this, but PEAR's Net_IDNA2 provides a way. The implementation seems very complex, though, and leaves me wondering if there's an easier way to go about it.
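The intl extension mentioned in the comments below provides idn_to_ascii(), which may be the easier route. The following is a minimal sketch, assuming the intl extension is installed; toAsciiEmail() is a hypothetical helper name, not part of PHP or Net_IDNA2.

```php
<?php
/**
 * Sketch: convert the domain part of an email address to Punycode.
 * Assumes the intl extension is loaded. toAsciiEmail() is a hypothetical
 * helper; it returns null when the address cannot be converted.
 */
function toAsciiEmail(string $email): ?string
{
    $at = strrpos($email, '@');

    // Require a non-empty local part and a non-empty domain.
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null;
    }

    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Pass the UTS #46 variant explicitly; the older 2003 variant is deprecated.
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? null : $local . '@' . $ascii;
}

echo toAsciiEmail('example@실례.테스트'), "\n";
// example@xn--9n2bp8q.xn--9t4b11yi5a (the Punycode form quoted earlier in this post)
```

The converted address can then be passed to the regex and MX lookup shown earlier. Plain ASCII domains come back unchanged, so the helper can be applied to every incoming address.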


9 Responses to Internationalized domain names, are you ready?

  1. 6833 Thiago Belem 2010-10-23 2:02 pm

    Validating non-ASCII characters has always been a pain! There's no "right" solution to it. :/

    About the code... Where did you define $hosts? (Used in the getmxrr function.)

    Thanks for the article & advice! :)

  2. 6875 Evert 2010-10-23 3:14 pm

    Thiago,

    That argument is passed by reference, so in that case PHP creates the variable on the spot.

  3. 6876 Mathb 2010-10-23 3:53 pm

    I think there is a small issue in the last code example: the php.net documentation for getmxrr has a note that recommends not using this function for that purpose:

    Note:
    This function should not be used for the purposes of address verification. Only the mailexchangers found in DNS are returned, however, according to RFC 2821 when no mail exchangers are listed, hostname itself should be used as the only mail exchanger with a priority of 0.

    About the conversion, the package is in beta and has no future plans. Too bad.

  4. 6878 kae verens 2010-10-23 9:02 pm

    maybe I'm missing something, but why use a regexp at all when there's filter_var() ?

    of course, combine with the MX lookup as you suggested.

  5. 6879 Michael Gauthier 2010-10-23 10:10 pm

    PHP's intl extension also provides functions to handle IDNA strings. See http://www.php.net/manual/en/ref.intl.idn.php

  6. 6880 David 2010-10-24 9:18 am

    I have yet to come across a regex for email address validation that actually checks if the address is valid according to RFC2822.

  7. 6882 Evert 2010-10-24 4:27 pm

    @Mathb,

    That's a very good point.

    @Michael,

    That's excellent, I wasn't aware of the IDNA functions.

    @David,

    I'm sure you're aware, being fully RFC2822-compliant is silly.

  8. 6913 Evert Pot’s Blog: Internationalized domain names, are you ready? | Development Blog With Code Updates : Developercast.com 2010-10-25 3:37 pm

    ... In a new post to his blog Evert Pot looks at internationalized domain names and where they could cause issues some of the current validation in PHP applications. Since may 11 ...

  9. 6985 Hari K T 2010-10-28 1:07 pm

    I wonder whether PHP's built-in function wouldn't work?

    if (filter_var( 'joe@example.com' , FILTER_VALIDATE_EMAIL)) {

    You can see the details http://php.net/manual/en/filter.examples.validation.php

    I'd love to hear whether it has any drawbacks, or why you opted for this approach :( .

    Thanks for your article. Let me know your comments.


About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.
