Robust email address validator – with address suggestions!

I’m sure you’ve seen the simple email address format validation functions; they’re usually a single regular expression that checks just the address portion (the user@example.org bit). That’s really only part of the validation that should be done. RFC 822 allows a much richer format, for example “Andrew Collington & Co.” <a.collington@example.org>, and, of course, the simple regex would fail on that.

But even a check on the address format often isn’t enough. The user could enter a correctly formatted email address but simply have misspelled it: they may accidentally type user@yahooo.com, or user@hitmail.co.uk rather than user@hotmail.co.uk. In that case you may want to check the MX and/or A record to see if it’s a valid domain. And whilst you’re doing that, why not check whether it’s a commonly used email host that they’ve typed in wrong?
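To make the first point concrete, here is a minimal sketch of how a typical one-line check behaves. The pattern and the function name simpleFormatCheck are my own stand-ins for the "simple regex" mentioned above; they are assumptions for illustration and not part of the class that follows.

```php
<?php

// Hypothetical stand-in for the usual "simple" validator: it only
// understands the bare user@example.org form.
function simpleFormatCheck(string $email): bool
{
    return preg_match('/^[^@\s]+@[^@\s]+\.[^@\s]+$/', $email) === 1;
}

$bare = 'a.collington@example.org';
$full = '"Andrew Collington & Co." <a.collington@example.org>';

var_dump(simpleFormatCheck($bare)); // bool(true)
var_dump(simpleFormatCheck($full)); // bool(false): the display name and angle brackets are rejected
```

Both strings are acceptable mailbox formats under RFC 822, but only the first gets through the simple check.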

So here is a class that will allow you to do all that in one easy method call:


[php]<?php

class MailCheck
{
    /**
     * ... and you just want to use the actual address portion. By default the
     * method returns true/false; if $returnroute is set it returns the
     * route (the address portion) or false.
     *
     * Adapted from PHP code (Changes (c) 2005 Padraic Brady) which is a
     * translation of Perl code (Copyright 1997 O'Reilly & Associates, Inc.).
     *
     * Based on optimised email regex in Perl Copyright 1997 O'Reilly &
     * Associates, Inc. The "Mastering Regular Expressions" Email Regex
     * (from book on page 295 et seq).
     *
     * @param string $email The email address to validate
     * @param boolean $checkmx Check the MX record if $email is valid
     * @param boolean $returnroute Return the matched address portion instead of true
     * @return bool|string
     * @access public
     * @see
     * @see
     */
    public static function format($email, $checkmx = false, $returnroute = false)
    {
        // Some things for avoiding backslashitis later on.
        $esc = '\\\\'; $Period = '\.';
        $space = '\040'; $tab = '\t';
        $OpenBR = '\['; $CloseBR = '\]';
        $OpenParen = '\('; $CloseParen = '\)';
        $NonASCII = '\x80-\xff'; $ctrl = '\000-\037';
        $CRlist = '\n\015'; // note: this should really be only \015.

        // Items 19, 20, 21
        $qtext = "[^{$esc}{$NonASCII}{$CRlist}\"]";
        $dtext = "[^{$esc}{$NonASCII}{$CRlist}{$OpenBR}{$CloseBR}]";
        $quoted_pair = " {$esc} [^{$NonASCII}] ";

        // Items 22 and 23, comment.
        // Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
        $ctext = " [^{$esc}{$NonASCII}{$CRlist}()] ";

        // $Cnested matches one non-nested comment.
        // It is unrolled, with normal of $ctext, special of $quoted_pair.
        $Cnested = "{$OpenParen}{$ctext}*(?: {$quoted_pair} {$ctext}* )*{$CloseParen}";

        // $comment allows one level of nested parentheses
        // It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
        $comment = "{$OpenParen}{$ctext}*(?:(?: {$quoted_pair} | {$Cnested} ){$ctext}*)*{$CloseParen}";

        // $X is optional whitespace/comments.
        // Nab whitespace. If comment found, allow more spaces.
        $X = "[{$space}{$tab}]*(?: {$comment} [{$space}{$tab}]* )*";

        // Item 10: atom
        $atom_char = "[^({$space})<>\@,;:\".{$esc}{$OpenBR}{$CloseBR}{$ctrl}{$NonASCII}]";
        // some number of atom characters not followed by something that could be part of an atom
        $atom = "{$atom_char}+(?!{$atom_char})";

        // Item 11: doublequoted string, unrolled.
        $quoted_str = "\"{$qtext} *(?: {$quoted_pair} {$qtext} * )*\"";

        // Item 7: word is an atom or quoted string
        $word = "(?:{$atom}|{$quoted_str})";

        // Item 12: domain-ref is just an atom
        $domain_ref = $atom;

        // Item 13: domain-literal is like a quoted string, but [...] instead of "..."
        $domain_lit = "{$OpenBR}(?: {$dtext} | {$quoted_pair} )*{$CloseBR}";

        // Item 9: sub-domain is a domain-ref or domain-literal
        $sub_domain = "(?:{$domain_ref}|{$domain_lit}){$X}";

        // Item 5: domain is a list of subdomains separated by dots.
        $domain = "{$sub_domain}(?:{$Period} {$X} {$sub_domain})*";

        // Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
        $route = "\@ {$X} {$domain}(?: , {$X} \@ {$X} {$domain} )*:{$X}";

        // Item 6: local-part is a bunch of $word separated by periods
        $local_part = "{$word} {$X}(?:{$Period} {$X} {$word} {$X})*";

        // Item 2: addr-spec is local@domain
        $addr_spec = "{$local_part} \@ {$X} {$domain}";

        // Item 4: route-addr is <route? addr-spec>
        // the parentheses inside the angle brackets capture the address portion in the final regex
        $route_addr = "< ({$X}(?: {$route} )?{$addr_spec})>";

        // Item 3: phrase... like ctrl, but without tab
        $phrase_ctrl = '\000-\010\012-\037';

        // Like atom-char, but without listing space, and uses phrase_ctrl.
        // Since the class is negated, this matches the same as atom-char plus space and tab
        $phrase_char = "[^()<>\@,;:\".{$esc}{$OpenBR}{$CloseBR}{$NonASCII}{$phrase_ctrl}]";

        // We've worked it so that $word, $comment, and $quoted_str do not consume trailing $X
        // because we take care of it manually.
        $phrase = "{$word}{$phrase_char} *(?:(?: {$comment} | {$quoted_str} ){$phrase_char} *)*";

        // Item #1: mailbox is an addr_spec or a phrase/route_addr
        $mailbox = "{$X}(?:{$addr_spec}|{$phrase} ({$route_addr}))";

        // perform the actual regex check against the received email address
        $matches = array();
        $isValid = preg_match('/^'.$mailbox.'$/xS', $email, $matches) === 1;
        // the last capture holds the address portion; there is nothing to read if the match failed
        $route = $isValid ? $matches[count($matches) - 1] : false;

        // check the MX record if needs be
        if ($isValid && $checkmx) {
            list(, $host) = explode('@', $route);
            return (@checkdnsrr($host, 'MX')) ? (($returnroute) ? $route : true) : false;
        }

        // finally return status
        return ($isValid && $returnroute) ? $route : $isValid;
    }

    /**
     * Check email addresses for validity
     *
     * You can pass either a string or an array of strings. The string
     * contains the email address you want to check. The method will check
     * that the format is valid and then check the A and MX record. If
     * either the A or MX record fails then it will attempt to make some
     * suggestions of popular email hosts.
     *
     * @param string|array $email Email address or array of email addresses
     * @return array
     * @access public
     */
    public static function address($email)
    {
        // default metaphone conversions
        $mphones = array(
            'ALKK' => 'aol.co.uk',
            'ALKM' => 'aol.com',
            'EKSMPLKM' => 'example.com',
            'HTMLKK' => 'hotmail.co.uk',
            'HTMLKM' => 'hotmail.com',
            'KKLMLKM' => 'googlemail.com',
            'KMLKM' => 'gmail.com',
            'MSNKK' => 'msn.co.uk',
            'MSNKM' => 'msn.com',
            'YHKK' => 'yahoo.co.uk',
            'YHKM' => 'yahoo.com'
        );
        $emails = $suggestions = array();

        // validate email address format
        if (is_array($email)) {
            $email = array_unique($email);
            foreach ($email as $e) {
                $route = self::format($e, false, true);
                if (!$route) {
                    $emails[$e]['error'][EMAILCHECK_INVALID_ADDRESS] = true;
                } else {
                    list($emails[$e]['user'], $emails[$e]['domain']) = explode('@', $route);
                }
            }
        } else {
            $route = self::format($email, false, true);
            if (!$route) {
                // same shape as the array branch, so callers can test error codes with isset()
                $emails[$email]['error'][EMAILCHECK_INVALID_ADDRESS] = true;
            } else {
                list($emails[$email]['user'], $emails[$email]['domain']) = explode('@', $route);
            }
        }

        // check domains
        foreach ($emails as $orig => $e) {
            // entries that failed the format check have no domain to look up
            if (!empty($e['domain'])) {
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'MX');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_MX] = true;
                    }
                }
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'A');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_A] = true;
                    }
                }
                // failing anything, get suggestions
                if (!empty($emails[$orig]['error']) && !isset($e['error'][EMAILCHECK_INVALID_ADDRESS])) {
                    $emeta = metaphone($e['domain']);
                    $suggestions = array();
                    foreach ($mphones as $code => $addr) {
                        $percent = 0;
                        $lev = levenshtein($emeta, $code);
                        $sim = similar_text($emeta, $code, $percent);
                        $score = round($percent + max(-$lev + $sim, 0));
                        if ($score >= EMAILCHECK_GUESS_THRESHOLD) {
                            $suggestions[$addr] = $score;
                        }
                    }
                    if (!empty($suggestions)) {
                        arsort($suggestions);
                        $emails[$orig]['suggestion'] = array_keys($suggestions);
                    }
                }
            }
        }

        // send back results
        return $emails;
    }
}

?>[/php]
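The DNS half of the address method can be tried on its own. This is only a rough sketch: missingDnsRecords and its $lookup parameter are names I’ve made up so the lookup can be swapped for a fake one when you have no network; by default it falls back to PHP’s checkdnsrr(), which is what the class uses.

```php
<?php

/**
 * Return the record types ('MX', 'A') that could NOT be found for a domain.
 * $lookup is a hypothetical hook for testing; it defaults to checkdnsrr().
 */
function missingDnsRecords(string $domain, ?callable $lookup = null): array
{
    $lookup = $lookup ?? 'checkdnsrr';
    $missing = array();
    foreach (array('MX', 'A') as $type) {
        if (!$lookup($domain, $type)) {
            $missing[] = $type;
        }
    }
    return $missing;
}

// Real lookup (needs network access, so the result depends on your resolver):
// print_r(missingDnsRecords('yahooo.co'));

// Fake resolver that only knows example.org, to show the shape of the result.
$fake = function (string $host, string $type): bool {
    return $host === 'example.org';
};
print_r(missingDnsRecords('example.org', $fake));     // empty array
print_r(missingDnsRecords('hitmail.invalid', $fake)); // MX and A both missing
```

Unlike the class, the sketch reports record types as strings rather than error constants, simply because the constant values aren’t shown in this post.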

If I call the address method like this:

[php]$validate = MailCheck::address('example@yahooo.co');
print_r($validate);[/php]

I will get the following returned:

Array
(
    [example@yahooo.co] => Array
        (
            [domain] => yahooo.co
            [user] => example
            [error] => Array
                (
                    [1] => 1
                    [2] => 1
                )
            [suggestion] => Array
                (
                    [0] => yahoo.com
                    [1] => yahoo.co.uk
                )
        )
)

Which is basically saying that while the address example@yahooo.co is a valid format, the domain yahooo.co doesn’t have a valid MX or A record, and it thinks that perhaps the user meant to use yahoo.com or yahoo.co.uk.
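How it "thinks" that is the scoring loop in the address method: it compares the metaphone code of the mistyped domain against the code of each known host, using levenshtein() and similar_text(). Here is a standalone sketch of that idea. The function name suggestHosts, the short host list and the threshold of 80 are all assumptions of mine (the class reads its threshold from EMAILCHECK_GUESS_THRESHOLD, whose value isn’t shown here), and the sketch computes the metaphone codes at run time instead of using a hard-coded table.

```php
<?php

/**
 * Rank known hosts by how closely they "sound like" the given domain.
 * Sketch only: $threshold is an assumed value, not the one the class uses.
 */
function suggestHosts(string $domain, array $knownHosts, int $threshold = 80): array
{
    $domainCode = metaphone($domain);
    $scores = array();

    foreach ($knownHosts as $host) {
        $hostCode = metaphone($host);
        $percent = 0.0;
        $lev = levenshtein($domainCode, $hostCode);            // edits needed, lower is closer
        $sim = similar_text($domainCode, $hostCode, $percent); // matching chars, plus a percentage
        $score = (int) round($percent + max($sim - $lev, 0));
        if ($score >= $threshold) {
            $scores[$host] = $score;
        }
    }

    arsort($scores); // best score first
    return array_keys($scores);
}

$hosts = array('yahoo.com', 'yahoo.co.uk', 'hotmail.com', 'gmail.com');
print_r(suggestHosts('yahooo.co', $hosts));
```

Lowering the threshold gives more (and worse) guesses; raising it makes the suggestions more conservative.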

The address method can take an array of email addresses as well as a single string.

A further example of use is:

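[php]<?php

$addresses = array('user1@yahoo.com', 'user2@somethingmadeup.com', 'user3 is wrong');

foreach (MailCheck::address($addresses) as $address => $status) {
    echo $address, ' is ', (empty($status['error'])) ? 'valid' : 'not valid', "<br />\n";
}

?>[/php]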

Which echoes:

user1@yahoo.com is valid
user2@somethingmadeup.com is not valid
user3 is wrong is not valid
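If you want to show the suggestions to the user rather than just "valid" or "not valid", something along these lines might do. It is a sketch under assumptions: describeStatus is a made-up helper, and it relies only on the array keys visible in the print_r() output earlier in the post (user, error, suggestion), not on the values of the error constants.

```php
<?php

// Hypothetical helper: summarise one entry of the array returned by
// MailCheck::address(), using only the keys shown in the sample output.
function describeStatus(string $address, array $status): string
{
    if (empty($status['error'])) {
        return $address . ' looks fine';
    }
    if (empty($status['suggestion']) || !isset($status['user'])) {
        return $address . ' is not valid';
    }
    $candidates = array();
    foreach ($status['suggestion'] as $host) {
        $candidates[] = $status['user'] . '@' . $host;
    }
    return $address . ' is not valid; did you mean ' . implode(' or ', $candidates) . '?';
}

// Sample entry copied from the print_r() output earlier in the post.
$status = array(
    'domain' => 'yahooo.co',
    'user' => 'example',
    'error' => array(1 => true, 2 => true),
    'suggestion' => array('yahoo.com', 'yahoo.co.uk'),
);

echo describeStatus('example@yahooo.co', $status), "\n";
// example@yahooo.co is not valid; did you mean example@yahoo.com or example@yahoo.co.uk?
```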
