Subject: Re: Anybody willing to share their cl-ppcre regex for validating email address
From: rpw3@rpw3.org (Rob Warnock)
Date: Sat, 05 Jan 2008 07:17:03 -0600
Newsgroups: comp.lang.lisp
Message-ID: <M4WdnZ6ypp_SHuLanZ2dnUVZ_tCrnZ2d@speakeasy.net>
Tony Garnock-Jones  <you.can.find.me.through@google.easily> wrote:
+---------------
| Geoffrey Summerhayes wrote:
| > RFC 822:
| 
| Does 2822 make things any simpler?
+---------------

Possibly a little bit, but not in any significant way in the area
I think you're asking about. RFC 2822 introduced the "dot-atom"
production which simplified the description of when periods were
allowed in unquoted local-parts:

    Some of the structured header field bodies also allow the period
    character (".", ASCII value 46) within runs of atext. An additional
    "dot-atom" token is defined for those purposes.
    ...
    atom            =       [CFWS] 1*atext [CFWS]
    dot-atom        =       [CFWS] dot-atom-text [CFWS]
    dot-atom-text   =       1*atext *("." 1*atext)

    Both atom and dot-atom are interpreted as a single unit, comprised of
    the string of characters that make it up.  Semantically, the optional
    comments and FWS surrounding the rest of the characters are not part
    of the atom; the atom is only the run of atext characters in an atom,
    or the atext and "." characters in a dot-atom.

and then "addr-spec" was tweaked to use "dot-atom":

    3.4.1. Addr-spec specification
    An addr-spec is a specific Internet identifier that contains a
    locally interpreted string followed by the at-sign character ("@",
    ASCII value 64) followed by an Internet domain.  The locally
    interpreted string is either a quoted-string or a dot-atom.  If the
    string can be represented as a dot-atom (that is, it contains no
    characters other than atext characters or "." surrounded by atext
    characters), then the dot-atom form SHOULD be used and the
    quoted-string form SHOULD NOT be used. Comments and folding white
    space SHOULD NOT be used around the "@" in the addr-spec.


    addr-spec       =       local-part "@" domain
    local-part      =       dot-atom / quoted-string / obs-local-part
    domain          =       dot-atom / domain-literal / obs-domain
    domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
    dcontent        =       dtext / quoted-pair
    dtext           =       NO-WS-CTL /     ; Non white space controls
                            %d33-90 /       ; The rest of the US-ASCII
                            %d94-126        ;  characters not including "[",
                                            ;  "]", or "\"
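
To make the "dot-atom" form concrete, here is a minimal sketch of the
generation-side grammar as a regex. It is in Python purely for
illustration; the pattern uses only ordinary Perl-style constructs, so
it should carry over to cl-ppcre with little change. The names ATEXT,
DOT_ATOM_TEXT, ADDR_SPEC and is_simple_addr_spec are mine, not the
RFC's, and the sketch assumes no CFWS, no quoted-string local-part,
no domain-literal, and none of the "obs-" forms:

```python
import re

# atext per RFC 2822 section 3.2.4: letters, digits, and these specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# Assumption: only the dot-atom alternative on both sides of the "@".
# Quoted strings, domain literals, CFWS and obs-* forms are left out.
ADDR_SPEC = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")


def is_simple_addr_spec(s):
    """Return True if s is dot-atom-text "@" dot-atom-text and nothing else."""
    return ADDR_SPEC.fullmatch(s) is not None
```

With this, "a.b@example.com" is accepted, while ".a@example.com",
"a..b@example.com" and "a.@example.com" are rejected, since every "."
has to be surrounded by atext characters.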

Also, the "route" syntax in an "addr-spec" was deprecated; see
"4.4 Obsolete Addressing" and the "obs-angle-addr" production.

Finally, "CFWS" [comment and/or folding-white-space] was removed
from being allowed around the dots within a "word" and around the
"@" between a "local-part" and a "domain". [I think. If I'm reading
"4.4" correctly.]

Unfortunately, while these simplifications apply to what you may *send*,
they do *NOT* apply to what you must still be prepared to *receive*:

    3.1. Introduction
    ...
    In some of the definitions, there will be nonterminals whose names
    start with "obs-".  These "obs-" elements refer to tokens defined in
    the obsolete syntax in section 4.  In all cases, these productions
    are to be ignored for the purposes of generating legal Internet
    messages and MUST NOT be used as part of such a message.  However,
    when interpreting messages, these tokens MUST be honored as part of
    the legal syntax.  In this sense, section 3 defines a grammar for
    generation of messages, with "obs-" elements that are to be ignored,
    while section 4 adds grammar for interpretation of messages.

This means that you must still be prepared to *parse* all the
old, ugly syntax, so practically speaking it's really no
simplification at all. (*sigh*)
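
As a rough illustration of the gap, the sketch below compares a
local-part matcher for the generation grammar with one that also
tolerates white space around the dots, as "obs-local-part" (word
*("." word), where an atom may carry surrounding CFWS) permits. It is
Python for illustration only, the names are mine, and it assumes a
heavily reduced CFWS: plain spaces and tabs, with no comments, no
line folding, and no quoted-string words:

```python
import re

# atext per RFC 2822 section 3.2.4.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# Assumption: stand-in for CFWS that allows only spaces and tabs.
WSP = r"[ \t]*"

# Generation side: dot-atom-text = 1*atext *("." 1*atext)
STRICT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

# Parsing side (reduced): atom *("." atom), atom = WSP 1*atext WSP
LOOSE = re.compile(rf"{WSP}{ATEXT}+{WSP}(?:\.{WSP}{ATEXT}+{WSP})*")


def strict_local_part(s):
    """True if s is a bare dot-atom-text."""
    return STRICT.fullmatch(s) is not None


def loose_local_part(s):
    """True if s is dot-separated atoms with optional blanks around them."""
    return LOOSE.fullmatch(s) is not None
```

Here "john . doe" fails strict_local_part but passes loose_local_part,
and a real parser would additionally have to cope with comments,
folded lines, quoted-string words and the old route syntax.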


-Rob

-----
Rob Warnock			<rpw3@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607