Code Monkey home page Code Monkey logo

Comments (12)

chconnor avatar chconnor commented on June 14, 2024 1

Oops, just noticing that your demo emails don't have comments in the local-part! Sorry.
I think the issue is that we can't handle nested comments because of regex recursing issues... lemme double check that...

from email-rfc2822-validator.

RohanNagar avatar RohanNagar commented on June 14, 2024 1

Sorry, I actually just took a further look on "John Smith" and it looks like a problem with how my website displays results.

The actual address being tested is Joe Smith <[email protected]>, which should actually be valid (quoted identifiers). I'll need to fix that in my library.

Sorry for the false alarm on that one!

from email-rfc2822-validator.

RohanNagar avatar RohanNagar commented on June 14, 2024 1

Thank you both for the kind words!

In terms of performance, JMail is on average almost 3x faster than email-rfc2822. You can scroll down to the bottom of https://www.rohannagar.com/jmail/ to see the average times.

Implementation Average Time
JMail 9201.171867088608 ns
Apache Commons 19448.029620253164 ns
Javax Mail 19917.172215189872 ns
email-rfc2822 29966.67430379747 ns

To calculate those, I run validate on each email address 100 times, averaging all of the times together.

// test 100 times to get an avg
int repetitions = 100;
List<Long> times = new ArrayList<>(repetitions);

for (int i = 0; i < repetitions; i++) {
  long start = System.nanoTime();
  predicate.test(email); // this runs the validate method
  long end = System.nanoTime();

  times.add(end - start);
}

double avg = times.stream().mapToDouble(n -> n).average().orElse(0.0);

from email-rfc2822-validator.

RohanNagar avatar RohanNagar commented on June 14, 2024 1

Quick update that I published version 1.2.0 of JMail which fixes issues that came up during this discussion - emails with quoted identifiers are now correctly handled, as well as empty quoted strings in the local-part.

from email-rfc2822-validator.

chconnor avatar chconnor commented on June 14, 2024

Thanks Rohan -- first, I've been out of the email world for a long time, so I'm probably stuck in 2822 and haven't paid attention to the new RFCs, so apologies in advance if any of this is wrong:

Nice to see someone taking email validation seriously! Your results do seem to show a number of problems with our library. I'm not involved in maintaining it these days, so maybe @bbottema will have time to take a look at some of it.

Some things I noticed or wondered about in the list of demo addresses:

  • I don't know why "Joe Smith" is validated by our library. That surprised me.
  • Your library (and others) failed 1234567890123456789012345678901234567890123456789012345678901234+x@example.com -- what am I missing? I would have thought that valid.
  • it does look like our library doesn't parse domains very well (too permissive). It should be noted that the library wasn't originally developed to be a strict 2822 enforcer, but rather a way to clean up and extract usable addresses... it can serve both roles, but I think we're seeing the weaknesses here.
  • I'm unclear why your library failed email@[3.256.255.23] -- AFAICT the "dtext" of 2822 allows almost anything inside the braces? Same question for many such addresses, such as first.last@[IPv6::], but maybe there are new RFCs that enforce the contents of the domain-literal? Our library will validate x@[anything in here], which explains a lot of the discrepancies
  • x@x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456 looks valid to me, but your library failed it?
  • test@123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
    678901234567890123456789012345678901234567890123456789012.com also looks valid to me? Is there a hostname size limit? I notice a couple others that you fail (that we don't) that have long TLDs or SLDs -- is that RFC?
  • ""@test.org -- looking at quoted-string it does seem like qcontent is allowed to be empty?
  • We don't implement the obs- tokens, which I think explains why very.unusual."@"[email protected] and much."more\ unusual"@example.com are considered valid by your library but not ours, as well as many others in the list... I believe that addr-spec allows either dot-atom or quoted-string, which don't support those addresses, but apparently obs-local-part does via the "word" token. Kudos for going to distance on the obs- tokens. :-)
  • there are several addresses that have spaces in the right-hand side... IIRC there were some decisions made to be non-2822 compliant on a couple of points regarding whitespace in the domain; some info here. One of those "2822 allows it but nobody wants it" situations, so I think your library is more correct in that way.
  • your library validates cal(woo(yay)hoopla)@iamcal.com (and similar addresses) and ours fails; I think it is in fact invalid? () not allowed unless quoted, no?
  • There seem to be a lot of cases where non-ascii is present in the left-hand side... at least in my days I believe that was not allowed unless they were encoded (RFC-2047 last time I was involved; I'm sure it's different now). Perhaps that explains the differences between libraries in those cases.
  • the last seven examples on the list all look like a bug in our library (apparently something wrong with quoted-pair not getting parsed correctly?). I glanced through our Dragons.java and didn't see anything obviously wrong, but again, it's been a long time and I don't unfortunately have time to really debug it.

Did you implement with a lexer?

Thanks for reaching out.

from email-rfc2822-validator.

RohanNagar avatar RohanNagar commented on June 14, 2024

@chconnor Thanks for taking a look! I'll try to answer each of these:

I don't know why "Joe Smith" is validated by our library. That surprised me.

That surprised me as well, seems like it should have been caught.

Your library (and others) failed 1234567890123456789012345678901234567890123456789012345678901234+x@example.com -- what am I missing? I would have thought that valid.

Not sure if this was defined in RFC 2822, but in RFC 5321, the local part cannot be more than 64 octets (this email is 66).

it does look like our library doesn't parse domains very well (too permissive). It should be noted that the library wasn't originally developed to be a strict 2822 enforcer, but rather a way to clean up and extract usable addresses... it can serve both roles, but I think we're seeing the weaknesses here.

Definitely makes sense, and strictly enforcing any of these email address RFCs is hard!

I'm unclear why your library failed email@[3.256.255.23] -- AFAICT the "dtext" of 2822 allows almost anything inside the braces? Same question for many such addresses, such as first.last@[IPv6::], but maybe there are new RFCs that enforce the contents of the domain-literal? Our library will validate x@[anything in here], which explains a lot of the discrepancies

This may be true. My library failed it because the IPv4 address was invalid, specified in RFC 5321 here. However, I wasn't able to find anything explicitly denying or allowing anything within brackets in the domain. I'll have to dig into that further.

x@x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456789.x23456 looks valid to me, but your library failed it?

I believe this is also because of RFC 5321, restricting the size of the domain to 255 octets.

test@123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
678901234567890123456789012345678901234567890123456789012.com also looks valid to me? Is there a hostname size limit? I notice a couple others that you fail (that we don't) that have long TLDs or SLDs -- is that RFC?

Yes, again with the octet limit in RFC 5321 each domain label can only be 63 octets.

""@test.org -- looking at quoted-string it does seem like qcontent is allowed to be empty?

Thanks, after looking into it I think you're actually right and I should fix this.

We don't implement the obs- tokens, which I think explains why very.unusual."@"[email protected] and much."more\ unusual"@example.com are considered valid by your library but not ours, as well as many others in the list... I believe that addr-spec allows either dot-atom or quoted-string, which don't support those addresses, but apparently obs-local-part does via the "word" token. Kudos for going to distance on the obs- tokens. :-)

That makes sense as to the difference there. Thanks!

there are several addresses that have spaces in the right-hand side... IIRC there were some decisions made to be non-2822 compliant on a couple of points regarding whitespace in the domain; some info here. One of those "2822 allows it but nobody wants it" situations, so I think your library is more correct in that way.

Got it! I agree that nobody should want that :)

your library validates cal(woo(yay)hoopla)@iamcal.com (and similar addresses) and ours fails; I think it is in fact invalid? () not allowed unless quoted, no?

Comments should be allowed outside of quoted strings. See RFC 2822: https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3

Excerpt:

White space characters, including white space used in folding
   (described in section 2.2.3), may appear between many elements in
   header field bodies.  Also, strings of characters that are treated as
   comments may be included in structured field bodies as characters
   enclosed in parentheses.  The following defines the folding white
   space (FWS) and comment constructs.

   Strings of characters enclosed in parentheses are considered comments
   so long as they do not appear within a "quoted-string", as defined in
   section 3.2.5.  Comments may nest.

In RFC 5322 there are clear examples of comments allowed outside of quotes.

There seem to be a lot of cases where non-ascii is present in the left-hand side... at least in my days I believe that was not allowed unless they were encoded (RFC-2047 last time I was involved; I'm sure it's different now). Perhaps that explains the differences between libraries in those cases.

Yes that definitely explains the difference in those cases. With internationalized email addresses, non-ascii characters can be allowed. See RFC 6530

he last seven examples on the list all look like a bug in our library (apparently something wrong with quoted-pair not getting parsed correctly?). I glanced through our Dragons.java and didn't see anything obviously wrong, but again, it's been a long time and I don't unfortunately have time to really debug it.

๐Ÿ‘๐Ÿฝ

Did you implement with a lexer?

Yes, I did! Regex for email address validation seemed really ugly and difficult to understand for me, so I wanted to give it a shot.

I essentially iterate through the email address once, invalidating if I encounter something invalid, and building a meaningful Email object that allows users of the library to easily access things like localPart(), domain(), topLevelDomain(), etc.

Also, going through this made me realize I want to add descriptions for each of these demo addresses on the website so it's easier to figure out why something should be valid/invalid ๐Ÿ™‚

from email-rfc2822-validator.

chconnor avatar chconnor commented on June 14, 2024

@chconnor Thanks for taking a look! I'll try to answer each of these:
Not sure if this was defined in RFC 2822, but in RFC 5321, the local part cannot be more than 64 octets (this email is 66).

Ah yeah, that looks to be a new (and welcome) restriction with the later RFCs. AFAIK 2822 only restricts line length (e.g. for email headers) so you could argue that, in practice, email addresses had to be under <~990 characters, but we didn't implement that, AFAIK.

your library validates cal(woo(yay)hoopla)@iamcal.com (and similar addresses) and ours fails; I think it is in fact invalid? () not allowed unless quoted, no?

Comments should be allowed outside of quoted strings. See RFC 2822: https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3

(EDIT: see following comment first.)

I don't think they are allowed in local-part, though, even accounting for the obs-local-part? I.e. I think you have to do "blah(comment)blah"@blah.com. You can do (comment)[email protected] or (comment)blah(comment)@blah.com, but I think if local-part itself has parens in it it has to be quoted or escaped?

In RFC 5322 there are clear examples of comments allowed outside of quotes.

(...just noting that all those examples have comments outside the local-part.)

Did you implement with a lexer?

Yes, I did! Regex for email address validation seemed really ugly and difficult to understand for me, so I wanted to give it a shot.

Yeah the regex is definitely a grind. +1 for a lexer. Glad someone finally did it. :-)

Also, going through this made me realize I want to add descriptions for each of these demo addresses on the website so it's easier to figure out why something should be valid/invalid slightly_smiling_face

Yeah, given that 99.999999% of people have no earthly idea the crazy stuff that is allowed by email specs I think documentation is always a good idea. :-)

from email-rfc2822-validator.

chconnor avatar chconnor commented on June 14, 2024

Yep, we can't do nesting of comments.

from email-rfc2822-validator.

chconnor avatar chconnor commented on June 14, 2024

@bbottema Could this line be why "John Smith" matched? My java regex chops are long gone, but does that line allow for matching mailboxName with an empty uniqueAddrSpec? I can't otherwise see how "John Smith" would validate as an address, since afaict we always require (correctly) at least the @ to be present.

from email-rfc2822-validator.

bbottema avatar bbottema commented on June 14, 2024

I'd have to do some debugging to verify that. Maybe I'll get some time to spend it this week.

from email-rfc2822-validator.

chconnor avatar chconnor commented on June 14, 2024

Oh, good. I was wondering why javax was validating it also. :-)

Looks like you have a great library developing. There have been a few alternative libraries developed over the years but this is the first one that really seems to go the distance on the RFCs and which I would consider using instead of our own. Nice work.

@bbottema we might need to finally remove the "The world's only more-or-less-2822-compliant Java-based email address extractor / verifier" tagline. :-) And maybe we could add a "see also" link or something.

from email-rfc2822-validator.

bbottema avatar bbottema commented on June 14, 2024

I would even considering archiving this project, as a lexer is a far better way of doing things (including documentation and debugging). By a landslide. Also this particular library not only seems to be correcter compared to our own validations, but more up to date with newer RFC's as well. I'd say jmail supersedes this library on all accounts.

I would like to do a performance comparison though, but I suspect a lexer to be a lot faster. I think the artifact footprint is about the same, both having no 3rd party dependencies.

from email-rfc2822-validator.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.