Firstly, I really appreciate this library. It's a lightweight HTML parser and it does the job pretty well, for the most part. The only downside is that there is no correction for broken nodes, so HTML documents can end up being a broken mess if the site's markup is terrible (which a lot are, apparently). I believe it could be easy to implement a simple correction for broken nodes (by ending the node right before the parent node ends), but that's not really in scope right now.
Attribute operators
I added two custom attribute operators to a (currently private) fork; they make it a little easier to match attributes conditionally.
AttributeSelectorMode.Contains already allows you to check whether a node's attribute value has the given value anywhere inside of it using standard HTML identifiers; my fork optionally accepts a jQuery-like form as well, such as div[class!="optional"]. It sounds redundant, but it may be beneficial since I have observed errors using the HTML-like approach (where a class with an underscore, such as test_color in a div, fails selector parsing), so a secondary entry point could help in those situations. The operator ! may be counter-intuitive, but I don't know of a better one, since the other candidate is used in the second suggestion.
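To make the two operator spellings concrete, here is a minimal sketch of how a selector such as div[class!="test_color"] might be split into its parts. It is not the library's parser: SketchSelectorMode, SketchSelectorParser, and TryParse are hypothetical names, and the regular expression only covers the simple tag[attr OP "value"] shape.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the library's AttributeSelectorMode (assumption).
public enum SketchSelectorMode { Contains, ContainsAny }

public static class SketchSelectorParser
{
    // Matches tag[attr OP "value"], where OP is != or ?=.
    // \w includes the underscore, so values such as test_color are accepted.
    private static readonly Regex Pattern = new Regex(
        @"^(?<tag>\w+)\[(?<attr>[\w-]+)(?<op>!=|\?=)""(?<value>[^""]*)""\]$");

    public static bool TryParse(string selector, out string tag, out string attribute,
                                out SketchSelectorMode mode, out string value)
    {
        tag = "";
        attribute = "";
        value = "";
        mode = SketchSelectorMode.Contains;

        Match m = Pattern.Match(selector);
        if (!m.Success)
        {
            return false;
        }

        tag = m.Groups["tag"].Value;
        attribute = m.Groups["attr"].Value;
        value = m.Groups["value"].Value;

        // "!=" maps to Contains and "?=" maps to ContainsAny in this sketch.
        mode = m.Groups["op"].Value == "?="
            ? SketchSelectorMode.ContainsAny
            : SketchSelectorMode.Contains;
        return true;
    }
}
```

The sketch rejects anything it does not recognise instead of guessing, so a caller could fall back to the existing selector syntax when TryParse returns false.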
AttributeSelectorMode.ContainsAny, a new addition, allows you to input an HTML-style (space-separated) list of values to search for in the attribute value. div[class?="optional conditional"] will return any div node whose attribute value contains either optional or conditional. One case where I see this helping is when the HTML returned by a PHP script varies, so the needed div objects may contain conditional instead of optional. Currently, I have it set up to use a string array, AttributeSelector.Values, rather than AttributeSelector.Value.
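The matching rule for ?= can be sketched roughly as follows. This is only an illustration of the idea, not the code from my fork: ContainsAnySketch, SplitValues, and ContainsAny are hypothetical names, and the comparison is assumed to be a case-sensitive substring test.

```csharp
using System;
using System.Linq;

public static class ContainsAnySketch
{
    // Splits a space-separated selector value into individual search terms,
    // mirroring the string[] Values idea. A null separator splits on whitespace.
    public static string[] SplitValues(string selectorValue)
    {
        return selectorValue.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // True when the attribute value contains at least one of the terms.
    // This is a substring test, so "optional" also matches "nonoptional".
    public static bool ContainsAny(string attributeValue, string[] values)
    {
        if (string.IsNullOrEmpty(attributeValue))
        {
            return false;
        }

        return values.Any(v => attributeValue.IndexOf(v, StringComparison.Ordinal) >= 0);
    }
}
```

With this shape, the selector value "optional conditional" becomes a two-element array, and a div whose class is "box conditional" matches while one whose class is "box plain" does not.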
Filtering empty text nodes (as an option)
I have been experimenting with filtering the HTML document during parsing, such as removing nodes that are empty and only returning the head, body, or div nodes. It may seem a bit unnecessary, but it does cut down on search time by only adding the required nodes. I suggest adding filtering for empty text nodes (any text node that is zero-length or contains only whitespace characters). It's a minor thing that could be helpful: a simple string.IsNullOrWhiteSpace(text) check right before the parent node receives the new node could clean up a lot of fluff. It does introduce a problem for people who expect the whitespace to be kept, but making it an option would solve that.
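As a rough sketch of the opt-in check, assuming a simplified node type: SketchNode, SketchParserOptions, SkipEmptyTextNodes, and TryAppendText are all hypothetical names for illustration and do not exist in the library.

```csharp
using System.Collections.Generic;

// Simplified, hypothetical node type used only for this sketch.
public sealed class SketchNode
{
    public string Text { get; private set; }
    public List<SketchNode> Children { get; private set; }

    public SketchNode(string text)
    {
        Text = text;
        Children = new List<SketchNode>();
    }
}

public sealed class SketchParserOptions
{
    // Off by default, so callers that expect whitespace keep the current behaviour.
    public bool SkipEmptyTextNodes { get; set; }
}

public static class TextNodeFilterSketch
{
    // Returns true if the text node was attached to the parent.
    public static bool TryAppendText(SketchNode parent, string text, SketchParserOptions options)
    {
        // The check runs right before the parent receives the new node.
        if (options.SkipEmptyTextNodes && string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        parent.Children.Add(new SketchNode(text));
        return true;
    }
}
```

Keeping the flag off by default means existing callers see no change unless they ask for the filtering.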
Asynchronous
Finally, on the subject of asynchronous support: some asynchronous methods are already available, but they are limited to .NET Standard. I only added asynchronous support to HtmlDocument, and simply used Task.Run for the operations, since that's a quick (and maybe dirty) way of adding asynchronous support.
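The Task.Run approach amounts to something like the sketch below. AsyncWrapperSketch and RunAsync are hypothetical names, and the delegate stands in for whatever synchronous HtmlDocument operation is being wrapped.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncWrapperSketch
{
    // Offloads a synchronous operation to the thread pool so the caller is not blocked.
    // The work still occupies a pool thread for its whole duration.
    public static Task<TResult> RunAsync<TResult>(
        Func<TResult> work,
        CancellationToken cancellationToken = default(CancellationToken))
    {
        if (work == null)
        {
            throw new ArgumentNullException(nameof(work));
        }

        return Task.Run(work, cancellationToken);
    }
}
```

A caller would then write something like `var doc = await AsyncWrapperSketch.RunAsync(() => ParseHtml(html));`, where ParseHtml is a placeholder for the synchronous call. This keeps a UI thread responsive, but it is offloading rather than true asynchronous I/O, which is why I call it quick and maybe dirty.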
Those are just a few of the main suggestions I have, based on my usage and individual needs, that could help improve the library.