Since May 11 TLDs (top-level domain names) have been added. For this to work smoothly, a lot of applications will have to be fixed.
Many email-validation scripts might use an approach like this:
- $ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i', $email);
This one is pretty simple: it matches the most common address formats, as long as the TLD (.com, .nl, .uk, etc.) is between 2 and 6 letters long. For a bit more sophistication you might want to be stricter about which TLDs are accepted:
- $ok = preg_match('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$/i',$email);
Note: both of these regexes were taken from regular-expressions.info, the top Google hit, and they are decent examples.
The new TLDs use non-ASCII characters, and they might become aliases for existing top-level domains or entirely new TLDs. Here are the currently working examples:
- http://مثال.إختبار - Arabic
- http://例子.测试 - Chinese (simplified)
- http://例子.測試 - Chinese (traditional)
- http://παράδειγμα.δοκιμή - Greek
- http://उदाहरण.परीक्षा - Hindi
- http://例え.テスト - Japanese
- http://실례.테스트 - Korean
- http://مثال.آزمایشی - Persian
- http://пример.испытание - Russian
At first sight these look like regular UTF-8 characters, but if you look at the source code of this page, you'll notice that the URLs are actually encoded differently.
The Korean URL http://실례.테스트 is actually encoded as http://xn--9n2bp8q.xn--9t4b11yi5a/. This encoding is called Punycode.
If you want to support these new URLs (and thus domain names in email addresses), you need to support Punycode. You will likely receive UTF-8 encoded domain names in email addresses (example@실례.테스트), but internally you must make sure that you only deal with the Punycode representation.
This translation is also what modern browsers do. If you paste "http://xn--9n2bp8q.xn--9t4b11yi5a/" directly into the Firefox address bar, it will show you the UTF-8 characters instead. Firefox still re-encodes to Punycode, though, and uses that format for HTTP requests.
The best way to check for valid email addresses is really to use a very liberal regex, and then verify with a simple MX record lookup that a mail server exists for the given domain. This example is an expansion of the first regex.
- $email = 'example@xn--9n2bp8q.xn--9t4b11yi5a';
-
- if (preg_match('/^[A-Z0-9._%+-]+@([A-Z0-9.-]+\.[A-Z0-9-]{2,})$/i', $email, $matches)) {
-     $hostname = $matches[1];
-     // getmxrr() returns true when at least one MX record was found
-     if (getmxrr($hostname, $hosts)) {
-         echo "Host has an MX record\n";
-     } else {
-         echo "Host does not exist or does not have an MX record\n";
-     }
- } else {
-     echo "Email address did not match regular expression\n";
- }
The preceding code does not convert UTF-8 to Punycode, though. There's not yet an easy native way to do this in PHP, but PEAR's Net_IDNA2 provides one. The implementation seems very complex, though, and leaves me wondering if there's an easier way to go about it.
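To make the difference between the patterns concrete, here is a small sketch that runs the first regex and the more liberal one against both forms of the Korean example address. The variable names (`$strict`, `$liberal`, `$addresses`) are mine and only there for illustration.

```php
<?php
// Patterns copied from the examples above; variable names are illustrative.
$strict  = '/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i';
$liberal = '/^[A-Z0-9._%+-]+@([A-Z0-9.-]+\.[A-Z0-9-]{2,})$/i';

$addresses = [
    'example@xn--9n2bp8q.xn--9t4b11yi5a', // Punycode form
    'example@실례.테스트',                  // raw UTF-8 form
];

foreach ($addresses as $address) {
    // preg_match() returns 1 on a match and 0 when the pattern does not match.
    printf(
        "%s strict=%d liberal=%d\n",
        $address,
        preg_match($strict, $address),
        preg_match($liberal, $address)
    );
}
```

The strict pattern rejects the Punycode address because its TLD contains digits and hyphens and is longer than 6 characters, while the liberal one accepts it. Neither pattern matches the raw UTF-8 form, which is why the conversion step still matters.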

Validating non-ASCII characters has always been a pain! There's no "right" solution to it. :/
About the code... Where did you define $hosts? (It's used in the getmxrr function.)
Thanks for the article & advice! :)
Thaigo,
That argument is passed by reference, so in that case PHP creates the variable on the spot.
In the last code sample I think there is a small issue: the php.net documentation for getmxrr has a note that recommends not using this function for that purpose:
Note:
This function should not be used for the purposes of address verification. Only the mailexchangers found in DNS are returned, however, according to RFC 2821 when no mail exchangers are listed, hostname itself should be used as the only mail exchanger with a priority of 0.
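A minimal sketch of the fallback that note describes: try the MX lookup first and, if it finds nothing, check whether the hostname itself resolves. The function name `domainMayAcceptMail` and the injectable `$hasRecord` parameter are my own assumptions for illustration (the parameter only exists so the logic can be tried without DNS); `checkdnsrr()` is the standard PHP function.

```php
<?php
/**
 * Hypothetical helper: returns true if $hostname has an MX record or,
 * failing that, an A or AAAA record (the implicit-MX fallback).
 * $hasRecord defaults to PHP's checkdnsrr(); it is injectable only
 * so the logic can be exercised without real DNS queries.
 */
function domainMayAcceptMail(string $hostname, ?callable $hasRecord = null): bool
{
    $hasRecord = $hasRecord ?? 'checkdnsrr';

    if ($hasRecord($hostname, 'MX')) {
        return true;
    }

    // No MX listed: the host itself acts as the only mail exchanger.
    return $hasRecord($hostname, 'A') || $hasRecord($hostname, 'AAAA');
}
```

Called with only a hostname, this performs live DNS queries. Even then it only suggests that the domain could receive mail, not that the mailbox exists.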
About the conversion, the package is in beta and has no future plans. Too bad.
Maybe I'm missing something, but why use a regexp at all when there's filter_var()?
Of course, combine it with the MX lookup as you suggested.
PHP's intl extension also provides functions to handle IDNA strings. See http://www.php.net/manual/en/ref.intl.idn.php
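As a rough sketch of how that could look, assuming the intl extension is installed: split the address at the last @ and run only the domain part through `idn_to_ascii()`. The helper name `emailToAscii` is hypothetical, and the local part is deliberately left untouched.

```php
<?php
/**
 * Hypothetical helper: converts the domain part of an email address to
 * its Punycode (ASCII) form. Requires PHP's intl extension.
 * Returns null when there is no usable domain or the conversion fails.
 */
function emailToAscii(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null;
    }

    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);
    if ($domain === '') {
        return null;
    }

    // idn_to_ascii() returns false on failure.
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($ascii === false) {
        return null;
    }

    return $local . '@' . $ascii;
}

echo emailToAscii('example@실례.테스트'), "\n"; // example@xn--9n2bp8q.xn--9t4b11yi5a
```

The result could then be fed into the regex and MX lookup from the article.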
I have yet to come across a regex for email address validation that actually checks if the address is valid according to RFC2822.
@Mathb,
That's a very good point.
@Michael,
That's excellent, I wasn't aware of the IDNA functions.
@David,
I'm sure you're aware, being fully RFC2822-compliant is silly.
I wonder whether PHP's built-in function wouldn't work?
if (filter_var('joe@example.com', FILTER_VALIDATE_EMAIL)) { /* address is valid */ }
You can see the details at http://php.net/manual/en/filter.examples.validation.php
I'd love to hear whether it has any drawbacks, or why you opted for the regex instead. :(
Thanks for your article. Let me know your comments.
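For what it's worth, here is a quick sketch of how filter_var() behaves with the addresses from this article. As far as I can tell, FILTER_VALIDATE_EMAIL rejects the raw UTF-8 domain, so the Punycode conversion would still have to happen first. The wrapper name `isValidEmail` is hypothetical.

```php
<?php
// Hypothetical wrapper: filter_var() returns the address on success
// and false on failure, so compare strictly against false.
function isValidEmail(string $email): bool
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(isValidEmail('joe@example.com'));     // bool(true)
var_dump(isValidEmail('example@실례.테스트'));  // bool(false)
```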