The htmlmonkey from softcircuits

htmlmonkey's Introduction

SoftCircuits

SoftCircuits develops software for the Windows platform. We have expertise building website applications, desktop applications and .NET libraries.

As can be seen from our repositories on GitHub, we have a special set of skills in processing and parsing files. Our repositories include an HTML parser (HtmlMonkey), a CSV parser (CsvParser), a fixed-width parser (SoftCircuits.FixedWidthParser), an INI file parser (IniFileParser) and an interpreted language written from the ground up in C# (Silk).

Jonathan Wood

SoftCircuits is headed up by Jonathan Wood, who is a software developer/hiker/dog and animal lover living out of Salt Lake City, Utah. He is currently available for additional work/consulting!

htmlmonkey's Issues

[FEATURE REQUEST] More attribute operators for Find methods, filtering empty text nodes, and asynchronous HTML loading

Firstly, I really appreciate this library. It's a lightweight HTML parser and, for the most part, it does the job well. The only downside is that there is no correction for broken nodes, so a parsed document can end up a broken mess if the site's HTML is malformed (which, apparently, is true of a lot of sites). I believe a simple correction would be easy to implement (by ending an unclosed node right before its parent node ends), but that's outside the scope of this request.
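The correction mentioned above (ending a node right before its parent ends) could be sketched with a stack of open tags. This is only an illustration under assumptions: CloseTag and the string-based stack are hypothetical stand-ins and are not part of HtmlMonkey, and tag names are assumed to be lower-cased already.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: when a closing tag arrives, every tag still open above
// its match is closed implicitly, i.e. right before the parent ends.
List<string> CloseTag(Stack<string> openTags, string closeTag)
{
    var implicitlyClosed = new List<string>();

    // A closing tag with no matching open tag is treated as stray and ignored.
    if (!openTags.Contains(closeTag))
        return implicitlyClosed;

    // Pop unclosed children until the matching open tag is on top.
    while (openTags.Peek() != closeTag)
        implicitlyClosed.Add(openTags.Pop());

    openTags.Pop(); // close the matching tag itself
    return implicitlyClosed;
}

// <div><p><span> ... </div>  : p and span were never closed.
var open = new Stack<string>();
open.Push("div");
open.Push("p");
open.Push("span");

List<string> closed = CloseTag(open, "div");
Console.WriteLine(string.Join(",", closed)); // span,p
```

A real parser would attach the implicitly closed nodes to the tree rather than return their names, but the stack discipline would be the same.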

Attribute operators

I added two custom attribute operators to a (currently private) fork to allow more conditional attribute matching.

AttributeSelectorMode.Contains already lets you check whether a node's attribute value contains a given value anywhere inside it, using standard HTML-style identifiers. My fork optionally accepts a jQuery-style form as well, such as div[class!="optional"]. This may sound redundant, but it could be beneficial: I have observed errors with the HTML-style approach (for example, a class containing an underscore, such as test_color on a div, fails selector parsing), so a second entry point could help in those situations. The ! operator may be counter-intuitive, but I don't know of a better one, since the other candidate is used in the second suggestion.

AttributeSelectorMode.ContainsAny is a new addition that lets you supply an HTML-style list of values to search for in the attribute value. div[class?="optional conditional"] returns any div node whose attribute value contains either optional or conditional. One use I see is when HTML generated by a PHP script varies, so that the div elements you need contain conditional instead of optional. Currently, I have it set up to use a string array, AttributeSelector.Values, rather than AttributeSelector.Value.
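As a rough illustration of the ContainsAny matching described above, here is a minimal standalone sketch. It is not the fork's code: the ContainsAny function, its parameters, and the choice of case-insensitive substring matching are all assumptions made for the example.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of HtmlMonkey: returns true when the attribute
// value contains at least one of the whitespace-separated candidate values.
bool ContainsAny(string attributeValue, string candidateList,
    StringComparison comparison = StringComparison.OrdinalIgnoreCase)
{
    if (string.IsNullOrEmpty(attributeValue) || string.IsNullOrWhiteSpace(candidateList))
        return false;

    // Split the selector value into individual candidates, dropping empty entries.
    string[] values = candidateList.Split(
        new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

    // Assumption: "contains" means a substring match anywhere in the value.
    return values.Any(v => attributeValue.IndexOf(v, comparison) >= 0);
}

Console.WriteLine(ContainsAny("box conditional", "optional conditional")); // True
Console.WriteLine(ContainsAny("box plain", "optional conditional"));       // False
```

Splitting the list once when the selector is parsed, which is presumably what a Values array would hold, avoids repeating the split for every node tested.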

Filtering empty text nodes (as an option)

I have been experimenting with filtering the HTML document during parsing, such as removing empty nodes and returning only the head, body, or div nodes. It may seem unnecessary, but it cuts down on search time because only the required nodes are added. I suggest adding a filter for empty text nodes (any text node that has zero length or contains only whitespace characters). It's a minor thing, but a simple string.IsNullOrWhiteSpace(text) check right before the new node is added to its parent could clean up a lot of fluff. It does introduce a problem for people who expect the whitespace to be preserved; making the filter optional would solve that.
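The optional filter suggested above might look something like the following sketch. Plain strings stand in for text nodes here, and TryAddText and its ignoreEmptyText flag are hypothetical names used only for illustration, not HtmlMonkey members.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: adds text to the parent's children unless the option is
// enabled and the text is empty or whitespace-only.
bool TryAddText(List<string> children, string text, bool ignoreEmptyText)
{
    if (ignoreEmptyText && string.IsNullOrWhiteSpace(text))
        return false; // skipped: nothing but whitespace

    children.Add(text);
    return true;
}

var kept = new List<string>();
foreach (string text in new[] { "Hello", "\r\n   ", "", "world" })
    TryAddText(kept, text, ignoreEmptyText: true);

Console.WriteLine(kept.Count); // 2
```

With ignoreEmptyText set to false, all four entries would be kept, which preserves the current behaviour for callers who rely on the whitespace.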

Asynchronous

Finally, on the subject of asynchronous support: some asynchronous methods are already available, but they are limited to .NET Standard. I only added asynchronous support to HtmlDocument, and simply used Task.Run for the operations, since that's a quick (and maybe dirty) way of adding it.
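For readers unfamiliar with the Task.Run approach mentioned above, here is a minimal sketch of the pattern. CountTags is a made-up stand-in for a synchronous parse call, and neither function is part of HtmlMonkey.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for synchronous, CPU-bound parsing work (hypothetical).
int CountTags(string html)
{
    int count = 0;
    foreach (char c in html)
        if (c == '<') count++;
    return count;
}

// Wraps the synchronous call in Task.Run so it executes on a thread-pool thread.
Task<int> CountTagsAsync(string html) => Task.Run(() => CountTags(html));

int tags = await CountTagsAsync("<div><p>Hi</p></div>");
Console.WriteLine(tags); // 4
```

Note that this only moves the work off the calling thread. It does not make file or network reads asynchronous, which is part of why the approach is "quick and maybe dirty".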

Those are just a few main suggestions I have that could help improve the library based on my usage and individual needs.

HtmlNode.Find and HtmlNodes.Find?

public static IEnumerable Find(this IEnumerable nodes, string expression);
public static IEnumerable Find(this HtmlNode node, string expression);

Having issues using Find to search for a div with specific class string

Here's the div I'm trying to search for:

This is how I'm attempting to find it:

var parent = document.Find("div.l-article__story l-main__story");

But Find is returning no results. Am I doing something wrong?

Ignore newline characters while parsing

I'm using HttpClient, which sometimes returns a document that includes line breaks (\r\n). This throws off the parser in multiple ways. It treats most of the newline characters as separate elements, and sometimes even corrupts the parsed tag. Even the included example document is parsed incorrectly.

I had to build the test program using .NET 5.0 instead of 6.0, and my own program uses .NET Framework 4.7.2.
I'm also using Windows 10.
Is this an issue specific to Windows (perhaps the \r character), and can it be solved, or do I just have to remove the newline characters from the input string manually?
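If manual pre-processing turns out to be necessary, it could be as small as the sketch below. NormalizeLineEndings is a hypothetical helper, and whether normalizing line endings actually changes the parser's behaviour is an untested assumption.

```csharp
using System;

// Hypothetical pre-processing step: converts Windows (\r\n) and lone \r line
// endings to \n before the text is handed to a parser.
string NormalizeLineEndings(string html) =>
    html.Replace("\r\n", "\n").Replace('\r', '\n');

string raw = "<div>\r\n<p>Hi</p>\r\n</div>";
string normalized = NormalizeLineEndings(raw);

Console.WriteLine(normalized.Contains("\r")); // False
```

Replacing "\r\n" first matters: doing the single-character replacement first would turn each Windows line break into two newlines.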
