Robust email address validator – with address suggestions!

I’m sure you’ve seen the simple email address format validation function; it’s usually a simple regular expression that just checks the address portion (the user@example.org bit). That’s really only part of the validation that should be done. The RFC822 specs allow a much richer address format; for example, it could be something like “Andrew Collington & Co.” <a.collington@example.org>, and, of course, the simple regex would fail on that.

But even a check on the address format often isn’t enough. The user could enter a correctly formatted email address but simply have misspelled it: they may accidentally type user@yahooo.com, or user@hitmail.co.uk rather than user@hotmail.co.uk. In that case you may want to check the MX and/or A record to see if it’s a valid domain. And whilst you’re doing that, why not check whether it’s a commonly used email host that they’ve typed in wrong?
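
As a rough sketch of the DNS part on its own, PHP’s built-in checkdnsrr() can be asked for an MX record first and an A record as a fallback. The function name domainLooksDeliverable() is made up for this illustration and is not part of the class below. It needs network access, and a passing check only says the domain resolves, not that the mailbox exists.

```php
<?php

/**
 * Hypothetical helper (not part of MailCheck): does the domain have
 * an MX record or, failing that, an A record?
 */
function domainLooksDeliverable($domain)
{
    $domain = trim($domain);

    // Nothing to look up, so reject before touching DNS.
    if ($domain === '') {
        return false;
    }

    // Prefer an MX record; mail can also be delivered to a bare A record.
    if (@checkdnsrr($domain, 'MX')) {
        return true;
    }

    return @checkdnsrr($domain, 'A');
}

// Usage (performs a live DNS lookup, so the result depends on your network):
// $ok = domainLooksDeliverable('example.org');
```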

So here is a class that will allow you to do all that in one easy method call:

<?php

define('EMAILCHECK_INVALID_ADDRESS', 0);
define('EMAILCHECK_INVALID_DNS_MX',  1);
define('EMAILCHECK_INVALID_DNS_A',   2);
define('EMAILCHECK_GUESS_THRESHOLD', 80);

class MailCheck
{
    /**
     * Validate an email address format against RFC822 specs.
     *
     * This method will check an email address against the RFC822 specs, as
     * well as validating the MX record if required.  Optionally you can have
     * the email address route portion return (rather than the boolean true
     * value).  This can be handy if you have an email address such as:
     *
     *    "Andrew Collington & Co." <a.collington@example.org>
     *
     * and you just want to use the actual address portion.  By default the
     * method will return true/false; with $returnroute set it returns the
     * route (or false).
     *
     * Adapted from PHP code (Changes (c) 2005 Padraic Brady) which is a
     * translation of Perl code (Copyright 1997 O'Reilly & Associates, Inc.).
     *
     * Based on optimised email regex in Perl Copyright 1997 O'Reilly &
     * Associates, Inc. The "Mastering Regular Expressions" Email Regex
     * (from book on page 295 et seq).
     *
     * @param string $email The email address to validate
     * @param boolean $checkmx Check the MX Record if $email is valid
     * @param boolean $returnroute Return the address (route) portion instead of true
     * @return bool|string
     * @access public
     * @see <http://www.faqs.org/rfcs/rfc822.html>
     * @see <http://examples.oreilly.com/regex/email-opt.pl>
     */
    public static function format($email, $checkmx = false, $returnroute = false)
    {
        // Some things for avoiding backslashitis later on.
        $esc        = '\\\\';               $Period      = '\.';
        $space      = '\040';               $tab         = '\t';
        $OpenBR     = '\[';                 $CloseBR     = '\]';
        $OpenParen  = '\(';                 $CloseParen  = '\)';
        $NonASCII   = '\x80-\xff';          $ctrl        = '\000-\037';
        $CRlist     = '\n\015';  // note: this should really be only \015.

        // Items 19, 20, 21
        $qtext = "[^{$esc}{$NonASCII}{$CRlist}\"]";
        $dtext = "[^{$esc}{$NonASCII}{$CRlist}{$OpenBR}{$CloseBR}]";
        $quoted_pair = " {$esc} [^{$NonASCII}] ";

        // Items 22 and 23, comment.
        // Impossible to do properly with a regex; I make do by allowing at most one level of nesting.
        $ctext = " [^{$esc}{$NonASCII}{$CRlist}()] ";

        // $Cnested matches one non-nested comment.
        // It is unrolled, with normal of $ctext, special of $quoted_pair.
        $Cnested = "{$OpenParen}{$ctext}*(?: {$quoted_pair} {$ctext}* )*{$CloseParen}";

        // $comment allows one level of nested parentheses
        // It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
        $comment = "{$OpenParen}{$ctext}*(?:(?: {$quoted_pair} | {$Cnested} ){$ctext}*)*{$CloseParen}";

        // $X is optional whitespace/comments.
        // Nab whitespace.  If comment found, allow more spaces.
        $X = "[{$space}{$tab}]*(?: {$comment} [{$space}{$tab}]* )*";

        // Item 10: atom
        $atom_char   = "[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]";
        // some number of atom characters not followed by something that could be part of an atom
        $atom = "{$atom_char}+(?!{$atom_char})";

        // Item 11: doublequoted string, unrolled.
        $quoted_str = "\"{$qtext} *(?: {$quoted_pair} {$qtext} * )*\"";

        // Item 7: word is an atom or quoted string
        $word = "(?:{$atom}|{$quoted_str})";

        // Item 12: domain-ref is just an atom
        $domain_ref = $atom;

        // Item 13: domain-literal is like a quoted string, but [...] instead of  "..."
        $domain_lit  = "{$OpenBR}(?: {$dtext} | {$quoted_pair} )*{$CloseBR}";

        // Item 9: sub-domain is a domain-ref or domain-literal
        $sub_domain  = "(?:{$domain_ref}|{$domain_lit}){$X}";

        // Item 6: domain is a list of subdomains separated by dots.
        $domain = "{$sub_domain}(?:{$Period} {$X} {$sub_domain})*";

        // Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
        $route = "\@ {$X} {$domain}(?: , {$X} \@ {$X} {$domain} )*:{$X}";

        // Item 6: local-part is a bunch of $word separated by periods
        $local_part = "{$word} {$X}(?:{$Period} {$X} {$word} {$X})*";

        // Item 2: addr-spec is local@domain
        $addr_spec  = "{$local_part} \@ {$X} {$domain}";

        // Item 4: route-addr is <route? addr-spec>
        // parentheses around the route_addr capture it in the final regex
        $route_addr = "< ({$X}(?: {$route} )?{$addr_spec})>";

        // Item 3: phrase... like ctrl, but without tab
        $phrase_ctrl = '\000-\010\012-\037';

        // Like atom-char, but without listing space, and uses phrase_ctrl.
        // Since the class is negated, this matches the same as atom-char plus space and tab
        $phrase_char = "[^()<>\@,;:\".{$esc}{$OpenBR}{$CloseBR}{$NonASCII}{$phrase_ctrl}]";

        // We've worked it so that $word, $comment, and $quoted_str do not consume trailing $X
        // because we take care of it manually.
        $phrase = "{$word}{$phrase_char} *(?:(?: {$comment} | {$quoted_str} ){$phrase_char} *)*";

        // Item #1: mailbox is an addr_spec or a phrase/route_addr
        $mailbox = "{$X}(?:{$addr_spec}|{$phrase}  ({$route_addr}))";

        // perform actual regex check on our received email address
        $matches = array();
        $isValid = (bool) preg_match('/^'.$mailbox.'$/xS', $email, $matches);
        // the last capture group holds the bare address; leave it empty when nothing matched
        $route = $isValid ? $matches[count($matches) - 1] : '';

        // check the MX Record if need be
        if ($isValid && $checkmx) {
            list(, $host) = explode('@', $route);
            return (@checkdnsrr($host, 'MX')) ? (($returnroute) ? $route : true) : false;
        }

        // finally return status
        return ($isValid && $returnroute) ? $route : $isValid;
    }

    /**
     * Check email addresses for validity
     *
     * You can pass either a string or an array of strings.  The string
     * contains the email address you want to check.  The method will check
     * that the format is valid and then check the A and MX record.  If
     * either the A or MX record fail then it will attempt to make some
     * suggestions of popular email hosts.
     *
     * @param string|array $email Email address or array of email addresses
     * @return array
     * @access public
     */
    public static function address($email)
    {
        // default metaphone conversions
        $mphones = array(
            'ALKK'     => 'aol.co.uk',
            'ALKM'     => 'aol.com',
            'EKSMPLKM' => 'example.com',
            'HTMLKK'   => 'hotmail.co.uk',
            'HTMLKM'   => 'hotmail.com',
            'KKLMLKM'  => 'googlemail.com',
            'KMLKM'    => 'gmail.com',
            'MSNKK'    => 'msn.co.uk',
            'MSNKM'    => 'msn.com',
            'YHKK'     => 'yahoo.co.uk',
            'YHKM'     => 'yahoo.com'
        );
        $emails = $suggestions = array();

        // validate email address format
        if (is_array($email)) {
            $email = array_unique($email);
            foreach ($email as $e) {
                $route = self::format($e, false, true);
                if (!$route) {
                    $emails[$e]['error'][EMAILCHECK_INVALID_ADDRESS] = true;
                } else {
                    list($emails[$e]['user'], $emails[$e]['domain']) = explode('@', $route);
                }
            }
        } else {
            $route = self::format($email, false, true);
            if (!$route) {
                // same shape as the array branch; a bare 0 here would look "empty" (valid) to callers
                $emails[$email]['error'][EMAILCHECK_INVALID_ADDRESS] = true;
            } else {
                list($emails[$email]['user'], $emails[$email]['domain']) = explode('@', $route);
            }
        }

        // check domains
        foreach ($emails as $orig => $e) {
            if (!empty($e['domain'])) {
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'MX');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_MX] = true;
                    }
                }
                if (empty($e['error'])) {
                    $valid = checkdnsrr($e['domain'], 'A');
                    if (!$valid) {
                        $emails[$orig]['error'][EMAILCHECK_INVALID_DNS_A] = true;
                    }
                }
                // failing anything, get suggestions
                if (!empty($emails[$orig]['error']) && !isset($e['error'][EMAILCHECK_INVALID_ADDRESS])) {
                    $emeta = metaphone($e['domain']);
                    $suggestions = array();
                    foreach ($mphones as $code => $addr) {
                        $percent = 0;
                        $lev = levenshtein($emeta, $code);
                        $sim = similar_text($emeta, $code, $percent);
                        $score = round($percent + max(-$lev + $sim, 0));
                        if ($score >= EMAILCHECK_GUESS_THRESHOLD) {
                            $suggestions[$addr] = $score;
                        }
                    }
                    if (!empty($suggestions)) {
                        arsort($suggestions);
                        $emails[$orig]['suggestion'] = array_keys($suggestions);
                    }
                }
            }
        }

        // send back results
        return $emails;
    }
}

?>

If I call it like this:

$validate = MailCheck::address('example@yahooo.co');
print_r($validate);

I will get the following returned:

Array
(
    [example@yahooo.co] => Array
        (
            [domain] => yahooo.co
            [user] => example
            [error] => Array
                (
                    [1] => 1
                    [2] => 1
                )
            [suggestion] => Array
                (
                    [0] => yahoo.com
                    [1] => yahoo.co.uk
                )
        )
)

Which is basically saying that while the address example@yahooo.co is a valid format, the domain yahooo.co doesn’t have a valid MX or A record, and it thinks that perhaps the user meant to use yahoo.com or yahoo.co.uk.
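
The suggestion scoring is easier to follow when pulled out of the class. The following is a simplified sketch rather than the class’s exact code: suggestDomains() is a name invented here, and it computes the metaphone code of each known domain on the fly instead of using the hard-coded table. The score is the similar_text() percentage plus a small bonus (matching characters minus Levenshtein distance, never below zero), and anything scoring at least the threshold is kept.

```php
<?php

/**
 * Hypothetical stand-alone version of the suggestion scoring.
 * Returns the known domains that sound close to $domain, best first.
 */
function suggestDomains($domain, array $knownDomains, $threshold = 80)
{
    $typed = metaphone($domain);
    $suggestions = array();

    foreach ($knownDomains as $known) {
        $code = metaphone($known);
        $percent = 0;
        $lev = levenshtein($typed, $code);            // edit distance between the two codes
        $sim = similar_text($typed, $code, $percent); // matching characters; fills $percent

        // Percentage similarity plus a bonus that is never negative.
        $score = round($percent + max($sim - $lev, 0));
        if ($score >= $threshold) {
            $suggestions[$known] = $score;
        }
    }

    arsort($suggestions); // highest score first, keys preserved
    return array_keys($suggestions);
}

$hosts = array('yahoo.com', 'yahoo.co.uk', 'hotmail.com', 'gmail.com');
print_r(suggestDomains('yahooo.co', $hosts));
```

Comparing metaphone codes rather than the raw strings means that extra or swapped vowels in the domain tend not to matter much, which suits the kind of typo this is meant to catch.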

The address method can take an array of email addresses as well as a single string.

A further example of use is:

<?php

require('MailCheck.php');
$valid = MailCheck::address(array('user1@yahoo.com', 'user2@somethingmadeup.com', 'user3 is wrong'));
foreach ($valid as $address => $status) {
    echo $address, ' is ', (empty($status['error'])) ? 'valid' : 'not valid', "<br />\n";
}

?>

Which echoes:

user1@yahoo.com is valid
user2@somethingmadeup.com is not valid
user3 is wrong is not valid
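
To put the suggestions in front of the user, the result array can be turned into messages. This is only a sketch: describeResults() is a made-up helper, and it assumes the array shape shown in the earlier print_r output (an error array keyed by the EMAILCHECK_* constants, plus an optional suggestion list).

```php
<?php

/**
 * Hypothetical helper: turn a MailCheck::address() style result array
 * into one human-readable message per address.
 */
function describeResults(array $results)
{
    $messages = array();

    foreach ($results as $address => $status) {
        if (empty($status['error'])) {
            $messages[] = $address . ' looks fine';
            continue;
        }

        $message = $address . ' could not be verified';

        // Offer the same user name at each suggested host.
        if (!empty($status['suggestion'])) {
            $guesses = array();
            foreach ($status['suggestion'] as $host) {
                $guesses[] = $status['user'] . '@' . $host;
            }
            $message .= '; did you mean ' . implode(' or ', $guesses) . '?';
        }

        $messages[] = $message;
    }

    return $messages;
}

// Sample data copied from the earlier print_r output, so no DNS lookup is needed.
$sample = array(
    'example@yahooo.co' => array(
        'domain'     => 'yahooo.co',
        'user'       => 'example',
        'error'      => array(1 => true, 2 => true),
        'suggestion' => array('yahoo.com', 'yahoo.co.uk'),
    ),
);

echo implode("\n", describeResults($sample)), "\n";
// example@yahooo.co could not be verified; did you mean example@yahoo.com or example@yahoo.co.uk?
```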
