
htmlmonkey's Introduction

SoftCircuits

SoftCircuits develops software for the Windows platform. We have expertise building website applications, desktop applications and .NET libraries.

As can be seen from our repositories on GitHub, we have a special set of skills in processing and parsing files. Our repositories include an HTML parser (HtmlMonkey), a CSV parser (CsvParser), a fixed-width parser (SoftCircuits.FixedWidthParser), an INI file parser (IniFileParser) and an interpreted language written from the ground up in C# (Silk).

Jonathan Wood

SoftCircuits is headed up by Jonathan Wood, who is a software developer/hiker/dog and animal lover living out of Salt Lake City, Utah. He is currently available for additional work/consulting!

htmlmonkey's People

Contributors

murrty, softcircuits

htmlmonkey's Issues

[FEATURE REQUEST] More attribute operators for Find methods, filtering empty text nodes, and asynchronous HTML loading

First, I really appreciate this library. It's a lightweight HTML parser, and it does the job well for the most part. The only downside is that it performs no correction of broken nodes, so a parsed document can end up as a broken mess if the site's markup is poor (which, apparently, is common). I believe a simple correction would be easy to implement (end the unclosed node right before its parent node ends), but that's outside the scope of this request.

Attribute operators

I added two custom attribute operators to a (currently private) fork to allow more conditional matching of attributes.

AttributeSelectorMode.Contains already lets you check whether a node's attribute value contains a given value anywhere inside it using standard HTML identifiers. My fork optionally accepts a jQuery-style operator for the same search, such as div[class!="optional"]. It sounds redundant, but it may be beneficial: I have observed errors with the HTML-like approach (a class containing an underscore, such as test_color in a div, fails selector parsing), so a secondary entry point could help in those situations. The ! operator may be counter-intuitive, but I don't know of a better one, since the other candidate is used in the second suggestion.

AttributeSelectorMode.ContainsAny is a new addition that lets you supply an HTML-style list of values to search for in the attribute value. div[class?="optional conditional"] returns any div node whose attribute value contains either optional or conditional. One helpful use I see is when the HTML returned by a PHP script varies, so that the needed div objects may contain conditional instead of optional. Currently, I have it set up to use a string array, AttributeSelector.Values, rather than AttributeSelector.Value.
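The ContainsAny matching described above could be sketched roughly as follows. This is a standalone illustration, not code from HtmlMonkey or from the fork: the AttributeMatching class and its ContainsAny method are hypothetical names, and it assumes the list of values is separated by whitespace and compared case-insensitively.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of HtmlMonkey's API.
public static class AttributeMatching
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Returns true if attributeValue contains at least one of the
    // whitespace-separated candidates listed in selectorValues.
    public static bool ContainsAny(string attributeValue, string selectorValues)
    {
        if (string.IsNullOrEmpty(attributeValue) || string.IsNullOrWhiteSpace(selectorValues))
            return false;

        // Split the selector value into individual candidates (the "Values" array).
        string[] values = selectorValues.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        // Substring match, mirroring the "contains" semantics described in the text.
        return values.Any(v => attributeValue.IndexOf(v, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

With this sketch, AttributeMatching.ContainsAny("box conditional", "optional conditional") is true, while a value such as "box plain" does not match.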

Filtering empty text nodes (as an option)

I have been experimenting with filtering the HTML document during parsing, such as removing empty nodes and returning only the head, body, or div nodes. It may seem unnecessary, but it cuts down on search time because only the required nodes are added. I suggest adding a filter for empty text nodes (any text node that has a length of 0 or contains only whitespace characters). It's a minor thing that could be helpful, and a simple string.IsNullOrWhiteSpace(text) check right before the parent node receives the new node could clean up a lot of fluff. It does introduce a problem for people who expect the whitespace to be preserved, but making it an option would solve that.
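A minimal sketch of that check is shown below. It uses a hypothetical, simplified node model (SketchNode and SketchParser are invented for illustration and do not reflect HtmlMonkey's actual classes), and the ignoreEmptyText flag stands in for the proposed option.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node model; HtmlMonkey's real node types differ.
public class SketchNode
{
    // Non-null for text nodes, null for element nodes (an assumption of this sketch).
    public string Text { get; set; }
    public List<SketchNode> Children { get; } = new List<SketchNode>();
}

public static class SketchParser
{
    // Adds child to parent unless filtering is enabled and the child is a
    // text node that is empty or contains only whitespace.
    // Returns true if the child was added.
    public static bool AddChild(SketchNode parent, SketchNode child, bool ignoreEmptyText)
    {
        bool isTextNode = child.Text != null;
        if (ignoreEmptyText && isTextNode && string.IsNullOrWhiteSpace(child.Text))
            return false;

        parent.Children.Add(child);
        return true;
    }
}
```

Keeping the flag off by default would preserve the current behavior for callers who rely on whitespace-only text nodes.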

Asynchronous

Finally, on the subject of asynchronous support: some asynchronous methods are already available, but they are limited to .NET Standard. I only added asynchronous support to HtmlDocument, and simply used Task.Run for the operations, since that's a quick (and maybe dirty) way of adding asynchronous support.
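The Task.Run approach mentioned here might look something like the sketch below. AsyncParsing and ParseAsync are hypothetical names, and the parse delegate stands in for whatever synchronous parsing call is being wrapped; none of this is taken from the library or the fork.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper for illustration; not part of HtmlMonkey's API.
public static class AsyncParsing
{
    // Runs a synchronous parse function on the thread pool and returns a Task
    // for its result. The parse delegate is a placeholder for the real parser.
    public static Task<TDocument> ParseAsync<TDocument>(string html, Func<string, TDocument> parse)
    {
        if (parse == null)
            throw new ArgumentNullException(nameof(parse));

        return Task.Run(() => parse(html));
    }
}
```

Note that Task.Run only moves the CPU-bound parsing onto a thread-pool thread so the caller is not blocked; it does not make the parsing itself asynchronous, which is why the approach is quick but arguably "dirty".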

Those are just a few main suggestions I have that could help improve the library based on my usage and individual needs.

HtmlNode.Find and HtmlNodes.Find?

public static IEnumerable Find(this IEnumerable nodes, string expression);
public static IEnumerable Find(this HtmlNode node, string expression);

Ignore newline characters while parsing

I'm using HttpClient, which sometimes returns a document that includes line breaks (\r\n). This throws off the parser in multiple ways. It treats most of the newline characters as their own elements, and sometimes even manages to mess up the parsed tag. Even the included example document is parsed incorrectly.
I had to build the test program using .NET 5.0 instead of 6.0, and my own program uses .NET Framework 4.7.2.
I'm also using Windows 10.
Is this an issue specific to Windows (perhaps the \r character), and can it be solved, or do I just have to remove the newline characters from the input string manually?
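For reference, the manual workaround asked about could be sketched as below. NewlineWorkaround and its methods are hypothetical names, and this is not a confirmed fix for the parser behavior described; it only shows two ways of preprocessing the input string.

```csharp
using System;

// Hypothetical preprocessing helpers; not part of HtmlMonkey's API.
public static class NewlineWorkaround
{
    // Converts Windows (\r\n) and lone \r line breaks to \n.
    public static string NormalizeNewlines(string html)
    {
        return html.Replace("\r\n", "\n").Replace('\r', '\n');
    }

    // Removes all line-break characters. This also changes whitespace-sensitive
    // content such as <pre> blocks, so it is the more drastic option.
    public static string RemoveNewlines(string html)
    {
        return html.Replace("\r", "").Replace("\n", "");
    }
}
```

Normalizing would test whether the \r character alone is the cause, while removal matches the workaround described in the question.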
