Robust email address validator - with address suggestions!

I'm sure you've seen the simple email address format validation functions; they're usually a single regular expression that just checks the address portion (the user@example.org bit). That's really only part of the validation that should be done. The RFC 822 spec allows a much richer address format, for example "Andrew Collington & Co." <a.collington@example.org>, and the simple regex would fail on that. But even a check on the address format often isn't enough. The user could enter a correctly formatted email address but simply have misspelled it: they may accidentally type user@yahooo.com, or user@hitmail.co.uk rather than hotmail.co.uk. In that case you may want to check the MX and/or A record to see if it's a valid domain. And whilst you're doing that, why not check whether it's a mistyped version of a commonly used email host?

So here is a class that will allow you to do all that in one easy method call:

PHP:
<?php

define('EMAILCHECK_INVALID_ADDRESS', 0);
define('EMAILCHECK_INVALID_DNS_MX',  1);
define('EMAILCHECK_INVALID_DNS_A',   2);
define('EMAILCHECK_GUESS_THRESHOLD', 80);

class MailCheck
{
    /**
     * Validate an email address format against RFC822 specs.
     *
     * This method will check an email address against the RFC822 specs, as
     * well as validating the MX record if required.  Optionally you can have
     * the email address route portion returned (rather than the boolean true
     * value).  This can be handy if you have an email address such as:
     *
     *    "Andrew Collington & Co." <a.collington@example.org>
     *
     * and you just want to use the actual address portion.  By default the
     * method returns true/false; with $returnroute set it returns the
     * route, or false if the address is invalid.
     *
     * Adapted from PHP code (Changes (c) 2005 Padraic Brady) which is a
     * translation of Perl code (Copyright 1997 O'Reilly & Associates, Inc.).
     *
     * Based on optimised email regex in Perl Copyright 1997 O'Reilly &
     * Associates, Inc. The "Mastering Regular Expressions" Email Regex
     * (from book on page 295 et seq).
     *
     * @param string $email The email address to validate
     * @param boolean $checkmx Check the MX Record if $email is valid
     * @param boolean $returnroute Return the address portion instead of true
     * @return bool|string
     * @access public
     * @see <http://www.faqs.org/rfcs/rfc822.html>
     * @see <http://examples.oreilly.com/regex/email-opt.pl>
     */
    public static function format($email, $checkmx = false, $returnroute = false)
    {
        // Some things for avoiding backslashitis later on.
        $esc        = '\\\\';               $Period      = '\.';
        $space      = '\040';               $tab         = '\t';
        $OpenBR     = '\[';                 $CloseBR     = '\]';
        $OpenParen  = '\(';                 $CloseParen  = '\)';
        $NonASCII   = '\x80-\xff';          $ctrl        = '\000-\037';
        $CRlist     = '\n\015';             // note: this should really be only \015.

        // Items 19, 20, 21
        $qtext = "[^{$esc}{$NonASCII}{$CRlist}\"]";
        $dtext = "[^{$esc}{$NonASCII}{$CRlist}{$OpenBR}{$CloseBR}]";
        $quoted_pair = " {$esc} [^{$NonASCII}] ";

        // Items 22 and 23, comment.
        // Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
        $ctext = " [^{$esc}{$NonASCII}{$CRlist}()] ";

        // $Cnested matches one non-nested comment.
        // It is unrolled, with normal of $ctext, special of $quoted_pair.
        $Cnested = "{$OpenParen}{$ctext}*(?: {$quoted_pair} {$ctext}* )*{$CloseParen}";

        // $comment allows one level of nested parentheses
        // It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
        $comment = "{$OpenParen}{$ctext}*(?:(?: {$quoted_pair} | {$Cnested} ){$ctext}*)*{$CloseParen}";

        // $X is optional whitespace/comments.
        // Nab whitespace.  If comment found, allow more spaces.
        $X = "[{$space}{$tab}]*(?: {$comment} [{$space}{$tab}]* )*";

        // Item 10: atom
        $atom_char   = "[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]";
        // some number of atom characters not followed by something that could be part of an atom
        $atom = "{$atom_char}+(?!{$atom_char})";

        // Item 11: doublequoted string, unrolled.
        $quoted_str = "\"{$qtext} *(?: {$quoted_pair} {$qtext} * )*\"";

        // Item 7: word is an atom or quoted string
        $word = "(?:{$atom}|{$quoted_str})";

        // Item 12: domain-ref is just an atom
        $domain_ref = $atom;

        // Item 13: domain-literal is like a quoted string, but [...] instead of  "..."
        $domain_lit  = "{$OpenBR}(?: {$dtext} | {$quoted_pair} )*{$CloseBR}";

        // Item 9: sub-domain is a domain-ref or domain-literal
        $sub_domain  = "(?:{$domain_ref}|{$domain_lit}){$X}";

        // Item 6: domain is a list of subdomains separated by dots.
        $domain = "{$sub_domain}(?:{$Period} {$X} {$sub_domain})*";

        // Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
        $route = "\@ {$X} {$domain}(?: , {$X} \@ {$X} {$domain} )*:{$X}";

        // Item 6: local-part is a bunch of $word separated by periods
        $local_part = "{$word} {$X}(?:{$Period} {$X} {$word} {$X})*";

        // Item 2: addr-spec is local@domain
        $addr_spec  = "{$local_part} \@ {$X} {$domain}";

        // Item 4: route-addr is <route? addr-spec>
        // parentheses around the route_addr to capture this in the final regex
        $route_addr = "<({$X}(?: {$route} )?{$addr_spec})>";

        // Item 3: phrase... like ctrl, but without tab
        $phrase_ctrl = '\000-\010\012-\037';

        // Like atom-char, but without listing space, and uses phrase_ctrl.
        // Since the class is negated, this matches the same as atom-char plus space and tab
        $phrase_char = "[^()<>\@,;:\".{$esc}{$OpenBR}{$CloseBR}{$NonASCII}{$phrase_ctrl}]";

        // We've worked it so that $word, $comment, and $quoted_str do not consume trailing $X
        // because we take care of it manually.
        $phrase = "{$word}{$phrase_char} *(?:(?: {$comment} | {$quoted_str} ){$phrase_char} *)*";

        // Item #1: mailbox is an addr_spec or a phrase/route_addr
        $mailbox = "{$X}(?:{$addr_spec}|{$phrase}  ({$route_addr}))";

        // perform actual regex check on our received email address
        $matches = array();
        $isValid = (bool) preg_match('/^'.$mailbox.'$/xS', $email, $matches);
        // the last match is the address portion; there is none if the regex failed
        $route = $isValid ? $matches[count($matches) - 1] : '';

        // check the MX Record if need be
        if ($isValid && $checkmx) {
            list(, $host) = explode('@', $route);
            return (@checkdnsrr($host, 'MX')) ? (($returnroute) ? $route : true) : false;
        }

        // finally return status
        return ($isValid && $returnroute) ? $route : $isValid;
    }

    /**
     * Check email addresses for validity
     *
     * You can pass either a string or an array of strings.  The string
     * contains the email address you want to check.  The method will check
     * that the format is valid and then check the A and MX record.  If
     * either the A or MX record fail then it will attempt to make some
     * suggestions of popular email hosts.
     *
     * @param string|array $email Email address or array of email addresses
     * @return array
     * @access public
     */
    public static function address($email)
    {
        // default metaphone conversions
        $mphones = array(
            'ALKK'     => 'aol.co.uk',
            'ALKM'     => 'aol.com',
            'EKSMPLKM' => 'example.com',
            'HTMLKK'   => 'hotmail.co.uk',
            'HTMLKM'   => 'hotmail.com',
            'KKLMLKM'  => 'googlemail.com',
            'KMLKM'    => 'gmail.com',
            'MSNKK'    => 'msn.co.uk',
            'MSNKM'    => 'msn.com',
            'YHKK'     => 'yahoo.co.uk',
            'YHKM'     => 'yahoo.com'
        );
        $emails = $suggestions = array();

        // validate email address format; a single string is treated as a
        // one-element array so both cases record errors the same way
        foreach (array_unique((array) $email) as $e) {
            $route = self::format($e, false, true);
            if (!$route) {
                $emails[$e]['error'][EMAILCHECK_INVALID_ADDRESS] = true;
            } else {
                list($emails[$e]['user'], $emails[$e]['domain']) = explode('@', $route);
            }
        }

        // check domains
        foreach ($emails as $orig => $e) {
            // entries that failed the format check have no domain to look up
            if (!empty($e['domain'])) {
                // $e is a copy taken before any DNS errors are recorded,
                // so both the MX and the A record are always checked
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'MX');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_MX] = true;
                    }
                }
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'A');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_A] = true;
                    }
                }
                // failing anything, get suggestions
                if (!empty($emails[$orig]['error']) && !isset($e['error'][EMAILCHECK_INVALID_ADDRESS])) {
                    $emeta = metaphone($e['domain']);
                    $suggestions = array();
                    foreach ($mphones as $code => $addr) {
                        $percent = 0;
                        $lev = levenshtein($emeta, $code);
                        $sim = similar_text($emeta, $code, $percent);
                        $score = round($percent + max(-$lev + $sim, 0));
                        if ($score >= EMAILCHECK_GUESS_THRESHOLD) {
                            $suggestions[$addr] = $score;
                        }
                    }
                    if (!empty($suggestions)) {
                        arsort($suggestions);
                        $emails[$orig]['suggestion'] = array_keys($suggestions);
                    }
                }
            }
        }

        // send back results
        return $emails;
    }
}

?>

If I call it like this:

PHP:
$validate = MailCheck::address('example@yahooo.co');
print_r($validate);

I will get the following returned:

Array
(
    [example@yahooo.co] => Array
        (
            [domain] => yahooo.co
            [user] => example
            [error] => Array
                (
                    [1] => 1
                    [2] => 1
                )
            [suggestion] => Array
                (
                    [0] => yahoo.com
                    [1] => yahoo.co.uk
                )
        )
)

Which is basically saying that while the address example@yahooo.co is a valid format, the domain yahooo.co doesn't have a valid MX or A record, and it thinks that perhaps the user meant to use yahoo.com or yahoo.co.uk.
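
To see where those suggestions come from, here is a rough standalone sketch of the scoring step, outside the class. The function name suggest_hosts, the shortened host list, and the choice to compute each host's metaphone code on the fly (rather than using the hard-coded table in the class) are assumptions made purely for illustration; the scoring formula and the threshold of 80 are taken from the class.

```php
<?php

// Hypothetical helper for illustration only; not part of MailCheck.
function suggest_hosts($domain, $threshold = 80)
{
    // Shortened host list (assumption); the class uses a longer table.
    $hosts = array('aol.com', 'gmail.com', 'hotmail.com', 'yahoo.co.uk', 'yahoo.com');

    $typed  = metaphone($domain);   // phonetic code of what the user typed
    $scores = array();

    foreach ($hosts as $host) {
        $code    = metaphone($host);
        $percent = 0;
        $lev = levenshtein($typed, $code);             // edit distance between the codes
        $sim = similar_text($typed, $code, $percent);  // matching chars, plus a percentage

        // Same formula as the class: similarity percentage, with a small
        // bonus when the codes share more characters than they differ by.
        $score = round($percent + max($sim - $lev, 0));
        if ($score >= $threshold) {
            $scores[$host] = $score;
        }
    }

    arsort($scores);                // best score first
    return array_keys($scores);
}

print_r(suggest_hosts('yahooo.co'));
```

Comparing phonetic codes rather than the raw domains is what lets a doubled or dropped letter still land close to the intended host, so for yahooo.co this should pick out the two Yahoo hosts and nothing else from the list.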

The address method can take an array of email addresses as well as a single string.

A further example of use is:

PHP:
<?php

require('MailCheck.php');
$valid = MailCheck::address(array('user1@yahoo.com', 'user2@somethingmadeup.com', 'user3 is wrong'));
foreach ($valid as $address => $status) {
    echo $address, ' is ', (empty($status['error'])) ? 'valid' : 'not valid', "<br />\n";
}

?>

Which echoes:

user1@yahoo.com is valid
user2@somethingmadeup.com is not valid
user3 is wrong is not valid
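
If you'd rather show the user something friendlier than valid/not valid, one possible way to turn a result entry into a message is sketched below. The describe_result function is hypothetical (it isn't part of the class), and the sample entry is simply copied from the print_r output shown earlier so the snippet runs without any DNS lookups. It relies on the assumption, based on that output, that only addresses which pass the format check get a 'domain' key.

```php
<?php

// Hypothetical helper for illustration only; not part of MailCheck.
function describe_result($address, array $status)
{
    // Addresses that fail the format check never get a 'domain' key.
    if (!isset($status['domain'])) {
        return "$address is not a valid email address.";
    }
    if (empty($status['error'])) {
        return "$address looks fine.";
    }

    // Format was fine, so the MX and/or A lookup must have failed.
    $message = "The domain of $address could not be found.";
    if (!empty($status['suggestion'])) {
        $alternatives = array();
        foreach ($status['suggestion'] as $host) {
            $alternatives[] = $status['user'] . '@' . $host;
        }
        $message .= ' Did you mean ' . implode(' or ', $alternatives) . '?';
    }
    return $message;
}

// Sample entry copied from the print_r output shown earlier.
$sample = array(
    'domain'     => 'yahooo.co',
    'user'       => 'example',
    'error'      => array(1 => true, 2 => true),
    'suggestion' => array('yahoo.com', 'yahoo.co.uk'),
);

echo describe_result('example@yahooo.co', $sample), "\n";
```

In real use you would loop over the array returned by MailCheck::address() and pass each key and value to the helper.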

3 Responses to “Robust email address validator - with address suggestions!”


  1. glytch

    thanks for posting this! much better than a regular expression or piddly bit of javascript

    what php markup plugin do you use? it’s excellent!
    my current markup plugin stretches the code all across the screen :=/

  2. robswan

    Just the class I needed this week :) Perfec’, cheers Andy!

  3. Andy

    Glad you like the class and that it’s been useful!

    Glytch; I’m using the ‘iG:Syntax Hiliter’ plug-in.
