fb55 / htmlparser2

The fast & forgiving HTML and XML parser

Home Page: https://feedic.com/htmlparser2

License: MIT License

HTML 1.12% TypeScript 98.88%

html-parser javascript dom htmlparser2 html xml parser

htmlparser2

The fast & forgiving HTML/XML parser.

htmlparser2 is the fastest HTML parser, and takes some shortcuts to get there. If you need strict HTML spec compliance, have a look at parse5.

Installation

npm install htmlparser2

A live demo of htmlparser2 is available on AST Explorer.

Ecosystem

Name	Description
htmlparser2	Fast & forgiving HTML/XML parser
domhandler	Handler for htmlparser2 that turns documents into a DOM
domutils	Utilities for working with domhandler's DOM
css-select	CSS selector engine, compatible with domhandler's DOM
cheerio	The jQuery API for domhandler's DOM
dom-serializer	Serializer for domhandler's DOM

Usage

htmlparser2 itself provides a callback interface that allows consumption of documents with minimal allocations. For a more ergonomic experience, read Getting a DOM below.

import * as htmlparser2 from "htmlparser2";

const parser = new htmlparser2.Parser({
    onopentag(name, attributes) {
        /*
         * This fires when a new tag is opened.
         *
         * If you don't need an aggregated `attributes` object,
         * have a look at the `onopentagname` and `onattribute` events.
         */
        if (name === "script" && attributes.type === "text/javascript") {
            console.log("JS! Hooray!");
        }
    },
    ontext(text) {
        /*
         * Fires whenever a section of text was processed.
         *
         * Note that this can fire at any point within text and you might
         * have to stitch together multiple pieces.
         */
        console.log("-->", text);
    },
    onclosetag(tagname) {
        /*
         * Fires when a tag is closed.
         *
         * You can rely on this event only firing when you have received an
         * equivalent opening tag before. Closing tags without corresponding
         * opening tags will be ignored.
         */
        if (tagname === "script") {
            console.log("That's it?!");
        }
    },
});
parser.write(
    "Xyz <script type='text/javascript'>const foo = '<<bar>>';</script>",
);
parser.end();

Output (with multiple text events combined):

--> Xyz
JS! Hooray!
--> const foo = '<<bar>>';
That's it?!

This example only shows three of the possible events. Read more about the parser, its events and options in the wiki.
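Because `ontext` can fire several times for one run of text, a handler often buffers the pieces and flushes them at tag boundaries. The following is a minimal sketch of that idea, assuming a hypothetical `createTextStitcher` helper that is not part of htmlparser2. The returned object uses the same callback names as the example above, and it is driven by hand here so the snippet runs without the library.

```javascript
// Hypothetical helper: buffers text chunks and emits one combined string
// whenever a tag boundary (or the end of input) is reached.
function createTextStitcher(onCombinedText) {
    let buffer = "";
    const flush = () => {
        if (buffer !== "") {
            onCombinedText(buffer);
            buffer = "";
        }
    };
    return {
        ontext(text) {
            buffer += text;
        },
        onopentag() {
            flush();
        },
        onclosetag() {
            flush();
        },
        onend() {
            flush();
        },
    };
}

// Simulate events a parser might emit for the script example above,
// with the script text arriving in two pieces.
const combined = [];
const handler = createTextStitcher((text) => combined.push(text));
handler.ontext("Xyz ");
handler.onopentag("script", { type: "text/javascript" });
handler.ontext("const foo = '");
handler.ontext("<<bar>>';");
handler.onclosetag("script");
handler.onend();

console.log(combined);
```

An object of this shape could presumably be passed to `new htmlparser2.Parser(handler)` in place of the inline callbacks; that wiring is an assumption and is left out to keep the sketch self-contained.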

Usage with streams

While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:

import fs from "node:fs";
import { WritableStream } from "htmlparser2/lib/WritableStream";

const parserStream = new WritableStream({
    ontext(text) {
        console.log("Streaming:", text);
    },
});

const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));

Getting a DOM

The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.

import * as htmlparser2 from "htmlparser2";

const dom = htmlparser2.parseDocument(htmlString);

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.

Parsing Feeds

htmlparser2 makes it easy to parse RSS, RDF and Atom feeds, by providing a parseFeed method:

const feed = htmlparser2.parseFeed(content, options);

Performance

After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on GitHub Actions (sourced from here):

htmlparser2        : 2.17215 ms/file ± 3.81587
node-html-parser   : 2.35983 ms/file ± 1.54487
html5parser        : 2.43468 ms/file ± 2.81501
neutron-html5parser: 2.61356 ms/file ± 1.70324
htmlparser2-dom    : 3.09034 ms/file ± 4.77033
html-dom-parser    : 3.56804 ms/file ± 5.15621
libxmljs           : 4.07490 ms/file ± 2.99869
htmljs-parser      : 6.15812 ms/file ± 7.52497
parse5             : 9.70406 ms/file ± 6.74872
htmlparser         : 15.0596 ms/file ± 89.0826
html-parser        : 28.6282 ms/file ± 22.6652
saxes              : 45.7921 ms/file ± 128.691
html5              : 120.844 ms/file ± 153.944

How does this module differ from node-htmlparser?

In 2011, this module started as a fork of the htmlparser module. htmlparser2 was rewritten multiple times and, while it maintains an API that's mostly compatible with htmlparser, the projects don't share any code anymore.

The parser now provides a callback interface inspired by sax.js (originally targeted at readabilitySAX). As a result, old handlers won't work anymore.

The DefaultHandler was renamed to clarify its purpose (to DomHandler). The old name is still available when requiring htmlparser2 and your code should work as expected.

The RssHandler was replaced with a getFeed function that takes a DomHandler DOM and returns a feed object. There is a parseFeed helper function that can be used to parse a feed from a string.

Security contact information

To report a security vulnerability, please use the Tidelift security contact. Tidelift will coordinate the fix and disclosure.

`htmlparser2` for enterprise

Available as part of the Tidelift Subscription.

The maintainers of htmlparser2 and thousands of other packages are working with Tidelift to deliver commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use. Learn more.

htmlparser2's Issues

missing parser event on duplicate closing tag

Hi, while I was meddling with the code, I noticed there is no parser handling for <> the second ">".
I find it useful to have a parser which can modify the elements and output the result. I wouldn't request something without contributing back, so I created an element tree parser modifier. How do I post a copy of the script for your review? I tried drag and drop here, but it gives me a message that it doesn't support the "txt" or "js" file types yet.

When 'lowerCaseTags', consider passing back lower case tags to callbacks

As the caller, I assumed that the tag names passed to the callbacks would be lowercased, as the work was already done within the parser. Otherwise, the caller is forced to lowercase the tags again.

npm install htmlparser2 seems to hang

Hey,

I can run npm install htmlparser2@<version>, but running just npm install htmlparser2 seems to hang. Can you verify that this is working on your box?

Thanks,
Matt

Feed parser freaks out when CDATA is in description

Hello,

I've got a snippet of code that processes a feed and looks more or less like this:

var htmlparser = require('htmlparser2');
var request = require('request');

var handler = new htmlparser.FeedHandler( function( error, feed ){
  console.log( feed.items.length );
});

var parser = new htmlparser.Parser( handler, { xmlMode: true } );

request('https://news.ycombinator.com/rss', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    parser.parseComplete( body.toString() );
  }
})

The only problem is feed.items.length, which should be 30 but is just 2. I investigated a bit more and realised that if I edit the description tag of the original feed to hold non-CDATA content, it works and I get all 30 items (I'm loading the file locally). This is the content of the item tag:

<item>
  <title>I gave away my xbox 360 today</title>
  <link>https://plus.google.com/u/0/105363132599081141035/posts/W3ys5fKnz5t</link>
  <comments>https://news.ycombinator.com/item?id=5506571</comments>
  <description><![CDATA[<a href="https://news.ycombinator.com/item?id=5506571">Comments</a>]]></description>
</item>

Not sure about what's going on, but looks like this is related to opened/closed CDATA sections.

Thanks
Daniele

< in textarea is not parsed correctly

The parser looks for tags even inside a textarea, where there may be source code.

htmlparser2.esproj is in npm registry

Seems like there's some extra stuff that ended up in the registry. Not a big problem at all, but it'd be nice to either ignore this directory or remove it from the registry in the next version.

bug after "drop the carriage return"

When parsing this file https://github.com/AndreasMadsen/article/blob/master/test/reallife/source/09198e90b6a14acfef0d4044606b8fd5801648f98763bf967f181aabaf59804d.html#L920-923 I don't get the highlighted <img ... > after 263775f.

However as you will see here the <img> (big Obama picture) does render in Chrome at least.

So I would ask you to support the \r tag anyway.

Support self-closing tags and other HTML constraints on tags

I am currently using "htmlparser.js" created by John Resig some time ago. I was hoping to switch to something actively maintained. However it seems there are a few features that are missing.

For example the parser I mentioned understands self-closing tags, and seems to scan for other elements that haven't been closed.

So for example calling HTMLtoXML('<p>Hello<p>World') will yield "<p>Hello</p><p>World</p>" while htmlparser2 (with the right handler) would give me "<p>Hello<p>World</p></p>".
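The behaviour requested here (a new `<p>` closing a `<p>` that is still open) can be modelled with a small table of "opening X implicitly closes Y" rules. The sketch below is illustrative only: `impliedClose` and `normalize` are hypothetical names, the table covers just `p`, and the input is a pre-tokenized list so that the snippet does not depend on any parser.

```javascript
// Hypothetical rule table: opening the key tag implicitly closes any
// of the listed tags if one is currently at the top of the stack.
const impliedClose = new Map([["p", new Set(["p"])]]);

// Takes a flat list of tokens and returns well-nested markup.
function normalize(tokens) {
    const stack = [];
    let out = "";
    for (const token of tokens) {
        if (token.type === "open") {
            const closes = impliedClose.get(token.name);
            while (closes && stack.length > 0 && closes.has(stack[stack.length - 1])) {
                out += `</${stack.pop()}>`;
            }
            stack.push(token.name);
            out += `<${token.name}>`;
        } else if (token.type === "text") {
            out += token.data;
        }
    }
    // Close whatever is still open at the end of input.
    while (stack.length > 0) {
        out += `</${stack.pop()}>`;
    }
    return out;
}

const result = normalize([
    { type: "open", name: "p" },
    { type: "text", data: "Hello" },
    { type: "open", name: "p" },
    { type: "text", data: "World" },
]);
console.log(result);
```

Without the rule table, the same loop would nest the second paragraph inside the first, which is the output the report describes.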

Different output for example in readme

With the example in the readme and htmlparser2 v3.0.5 from npm, I get this output instead:

--> Xyz
JS! Hooray!
--> var foo = '
--> <<bar>>';
--> < /  script>
That's it?!

Non-break space

Please replace the non-breaking space with a normal space:
https://github.com/fb55/htmlparser2/blob/master/lib/CollectingHandler.js#L4

It breaks htmlparser2 when it is browserified and runs in Chrome.

Running it in a browser

There are some benefits in being able to run the parser in a browser even though browsers do html parsing themselves.
Examples of this might be parsing user input, or loading and parsing HTML via XHR without incurring the cost of loading images and executing JavaScript. I have been able to tweak the source so it runs in the browser again, but it would be nice if this was supported from the get-go.

Don't explode on tag mismatches

tl;dr don't do this

Okay first off, let me just say that htmlparser2 is the bomb. I throw some pretty disgusting HTML at it and it chews through it like a boss.

Except for one problem. Sometimes I throw REALLY ugly HTML at it. Specifically, imagine someone took a reasonably valid html page and threw an opening comment ("<!--") somewhere around 90% down the source, and then never closed it. Overwriting otherwise valid HTML content. Sure, the page wouldn't render properly (mainly the footer content just gets lost), but browsers still render most of the page.

htmlparser2 just breaks. Specifically, it breaks on line 39: https://github.com/FB55/node-htmlparser/blob/master/lib/DomHandler.js#L39

The parser should just call it a day and wrap up the DOM tree up to that point instead of exploding. Otherwise my whole server crashes with an exception and I don't get ANY parsed data back. (Yes, I know I could work around the crashing part.)

So, please, could we not throw an exception there and instead die gracefully, albeit prematurely? (Y'know, like browsers do?)

Thanks!

Replace feeds in tests/Documents

The current documents were taken from third parties and should be replaced with files with clear licensing terms.

Bug in script handling

I found this when crawling http://www.chicagotribune.com/news/local/suburbs/orland_park_homer_glen/community/chi-ugc-article-palos-medical-groups-dr-kanesha-bryant-prov-2-2013-05-15,0,3152151.story

They have a line containing

var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"

that seems to break things.

Here is a simple testcase:

var Parser = require('htmlparser2').Parser;

var stream = new Parser({
  onopentag: function (tagname, attr) {
    console.log('open: ' + tagname);
  },

  ontext: function (text) {
    console.log('text: ' + text);
  },

  onclosetag: function (tagname) {
    console.log('close: ' + tagname);
  }
});

stream.write('<body>');
stream.write('<script type="text/javascript" language="JavaScript">');
stream.write('var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"');
stream.write('document.write(str);');
stream.write('</script>');
stream.write('</head>');
stream.end();

The output is:

open: body
open: script
text: var str = "
text: <script src='about:blank' type='text/javascript'>
close: script
text: "
text: document.write(str);
close: body

but it should be:

open: body
open: script
text: var str = "<script src='about:blank' type='text/javascript'></"+"script>"
text: document.write(str);
close: script
close: body
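The expected output follows from treating script content as raw text: only a `</script` sequence ends the element, so a `<script ...>` inside a string literal stays part of the text. A minimal sketch of that rule, assuming a hypothetical `readScriptText` helper and a simplified end-tag pattern (this is not how the tokenizer is written):

```javascript
// Hypothetical helper: given the text that follows an opening <script> tag,
// return the raw script content and whatever comes after the closing tag.
function readScriptText(afterOpenTag) {
    // Only "</script" (any letter case) followed by optional whitespace
    // and ">" ends the element in this simplified model.
    const match = /<\/script\s*>/i.exec(afterOpenTag);
    if (match === null) {
        return { content: afterOpenTag, rest: "" };
    }
    return {
        content: afterOpenTag.slice(0, match.index),
        rest: afterOpenTag.slice(match.index + match[0].length),
    };
}

const input =
    'var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"' +
    "document.write(str);" +
    "</SCRIPT></body>";

const script = readScriptText(input);
console.log(script.content);
console.log(script.rest);
```

The split `</"+"script>` in the page source never forms a literal `</script`, so the search skips it and stops at the real closing tag. The case-insensitive flag also covers an upper-case `</SCRIPT>`.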

Feature: Render the parsed DOM object back to HTML

I'm not sure what your plans are for the project, but I think going the other way (ie. DOM Object --> HTML) is really useful.

This is the renderer I've been using, I could help integrate it into the project if you'd like:

https://github.com/MatthewMueller/cheerio/blob/master/src/renderer.coffee

xhtml self-closing tags cause problems for subsequent tags

Since 3.2.4, xhtml self-closing tags are doing weird things (unless they happen to be html "void" elements). Sometimes they will engulf the next tag that follows them, and sometimes they will remove the tag that follows them altogether.

<script> tagname handling

The <script> tag name is not handled case-insensitively. Inside a <SCRIPT> tag, the parser sees anything starting with a < as a new tag.

Example: https://gist.github.com/3899198

html parser doesn't handle cdata

<![CDATA[
This should be CDATA...
]]>

results in

[ { data: '[CDATA[\nThis should be CDATA...\n]]',
    type: 'comment' } ]

when it should result in

[ { data: '\nThis should be CDATA...\n',
    type: 'cdata' } ]

using version 3.1.5

Strict mode?

For XML mainly.

DomUtils.getElements() \w tag_contains does not perform as expected

First, the _contains suffix is misleading, as it does not check whether the tag contains the value; rather, it checks for an exact match. So if you try something like this:

var domUtils = require("htmlparser2").DomUtils;
domUtils.getElements({ tag_contains: "cookie" }, dom, true);

you won't get anything if the tag you wanted contains cookies.

Second, when you do enter the exact text, you get the data node rather than the node that the data node belongs to. For example:

domUtils.getElements({ tag_contains: "cookies" }, dom, true);

Would return

[ { data: 'cookies', type: 'text' } ]

instead of:

[ { type: 'tag', name: 'p', children: [ { data: 'cookies', type: 'text' } ] } ]
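Both expectations (substring matching, and getting the enclosing element rather than the text node) can be sketched over plain objects shaped like the nodes shown above. `findElementsContainingText` is a hypothetical helper written for illustration; it is not a DomUtils function.

```javascript
// Hypothetical helper: returns every tag node that has a direct text child
// whose data contains `needle` (substring match, not exact match).
function findElementsContainingText(nodes, needle, found = []) {
    for (const node of nodes) {
        if (node.type !== "tag") continue;
        const children = node.children || [];
        const hasMatch = children.some(
            (child) => child.type === "text" && child.data.includes(needle),
        );
        if (hasMatch) found.push(node);
        // Recurse so that nested elements are checked as well.
        findElementsContainingText(children, needle, found);
    }
    return found;
}

const dom = [
    {
        type: "tag",
        name: "div",
        children: [
            { type: "tag", name: "p", children: [{ data: "cookies", type: "text" }] },
            { type: "tag", name: "p", children: [{ data: "milk", type: "text" }] },
        ],
    },
];

const matches = findElementsContainingText(dom, "cookie");
console.log(matches.map((node) => node.name));
```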

svg parsing

referencing #75 as some basic svg shapes were already added as inline, but the list is rather incomplete. Inline svg is probably going to be a lot more common with increasing browser support. At least polyline and polygon are missing, but there might be more. As @fb55 mentioned there are several other issues with changing the inline list, perhaps we can discuss this issue here and come to a more complete fix.

ParentNode needed ...

Hi, I recently began evaluating your htmlparser alternative for my project, and I need to access the "parent" element from the current element.

In your DomUtils file you use "elem.parent" within your "removeElement" function, but I cannot find anything else for a parent element. Is it a legacy leftover, or is it planned to be integrated?

Greets and thanks in advance for your feedback,

Chris

Feature request: Ability to handle self-closing tags and CDATA in non-XML mode

For my use case I needed support for a hybrid parser that would allow for XML constructs to be recognized in HTML code. After looking at the source code I felt that adding more options could provide more fine-grained control over parsing without adding too much complexity to the code. Please review my Pull Request which added two new options:

recognizeSelfClosing: If set to true then self-closing tags will result in the tag being closed even if xmlMode is not set to true
recognizeCDATA: If set to true then CDATA text will result in the ontext event being fired even if xmlMode is not set to true

In theory, xmlMode could be used as a way to control the more fine-grained options, but I wanted to minimize code change.

Make this project findable via Google

It seems you are solving a problem I have. So far I went with John Resig's abandoned HTML parser, which is the only close result on Google when searching for "Javascipt htmlparser". By chance I went for "node htmlparser" instead, found the predecessor of this project, and by chance again looked at the network graph to see that you are the only line with constant recent commits.

Do make this project more widely known. As a suggestion, maybe add "Javascript" to your description (maybe "JS" isn't really doing it).

Handle special case with JavaScript strings

var htmlparser = require('htmlparser2');
var parser = new htmlparser.Parser({
  onopentag: function(name, attribs) {
    console.log('open tag: ' + name);
  },
  ontext: function(text) {
    console.log('text: ' + text);
  },
  onclosetag: function(name) {
    console.log('close tag: ' + name);
  }
});
parser.write("Xyz <script type='text/javascript'>var foo = '</script><<bar>>';< /  script>");
parser.end();

results in:

text: Xyz
open tag: script
text: var foo = '
close tag: script
open tag: <bar
text: >';
close tag: <bar

xmlMode Incompleteness/Inconsistencies

Certain tags, such as link and meta are treated like HTML tags when the parser is set to xmlMode and they shouldn't be. For example:

var htmlparser2 = require("./lib/index"),
    DomUtils = htmlparser2.DomUtils;

var handler = new htmlparser2.DomHandler({xmlMode: true}),
    parser = new htmlparser2.Parser(handler);

parser.parseComplete('<link>foo</link>');

console.log(handler.dom);
console.log(DomUtils.getOuterHTML(handler.dom[0]));
console.log(DomUtils.getInnerHTML(handler.dom[0]));

returns:

 [ { type: 'tag', name: 'link' }, { data: 'foo', type: 'text' } ]
<link></link>

rather than what I expected:

[ { type: 'tag', name: 'link', children: [ { data: 'foo', type: 'text' } ] } ]
<link>foo</link>
foo

Incorrect parsing of inline tags when capitalized

<p>foo</p>
<hr>
<p>bar</p>

correctly produces:

[ { type: 'tag',
    name: 'p',
    attribs: {},
    children: [ { data: 'foo', type: 'text' } ] },
  { data: '\n', type: 'text' },
  { type: 'tag',
    name: 'hr',
    attribs: {},
    children: [] },
  { data: '\n', type: 'text' },
  { type: 'tag',
    name: 'p',
    attribs: {},
    children: [ { data: 'bar', type: 'text' } ] } ]

while

<p>foo</p>
<HR>
<p>bar</p>

incorrectly results in:

[ { type: 'tag',
    name: 'p',
    attribs: {},
    children: [ { data: 'foo', type: 'text' } ] },
  { data: '\n', type: 'text' },
  { type: 'tag',
    name: 'HR',
    attribs: {},
    children: 
     [ { data: '\n', type: 'text' },
       { type: 'tag',
         name: 'p',
         attribs: {},
         children: [ { data: 'bar', type: 'text' } ] } ] } ]

Since HTML tag names are entirely case-insensitive, I think it would be better to lowercase them so that <hr> and <HR> both result in identical parsed output.
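The suggestion amounts to normalising the tag name once, before any lookup that decides whether the element can have children. A minimal sketch, assuming a hypothetical `classifyTag` helper and a deliberately short list of void elements:

```javascript
// A few HTML void elements; the full list is longer.
const voidElements = new Set(["br", "hr", "img", "input", "link", "meta"]);

// Hypothetical helper: normalise the name once, then use it everywhere.
function classifyTag(rawName) {
    const name = rawName.toLowerCase();
    return { name, isVoid: voidElements.has(name) };
}

console.log(classifyTag("hr"));
console.log(classifyTag("HR"));
```

With the lookup done on the lowercased name, `<HR>` is recognised as childless in the same way as `<hr>`, so the following paragraph would not be nested inside it.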

Use tags for version releases

Bower relies on tags and points at this repository. Can you please add tags for 3.x releases?

Browserify bundle is over 300 KB

I realize this may not be a goal of the project, but the file size is probably too big for the browser.

Maybe there's a large require somewhere that could be swapped out easily.

Add ability to treat any tag as a "special" tag

Currently you can make <script> and <code> tags completely ignore their contents. The ability to do this with any tag would be really nice. Looking through the tokenizer, this doesn't look possible without doing a lot of modification to the tokenizer every time you wanted to add a new tag to ignore.
I managed to create a hack-ish way of doing this with a parser using parser._token._index and _sectionStart but I just know I'll run into issues with auto closing tags at some point. I may try to implement this myself but it would probably be better if someone (willing to) with an in depth knowledge of the tokenizer could attempt something like this.

Only finds first attribute when there is no whitespace between attributes

When the attributes are written without whitespace between them, only the first attribute is found.

Example:

<div class="first attribute"title="second attribute"></div>

Returns:

{
   class: "first attribute"
}
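A quoted attribute value ends at its closing quote, so the next attribute name can start immediately afterwards. The sketch below shows one way to scan for that, assuming a hypothetical `parseAttributes` helper built on a regular expression; it is an illustration, not the tokenizer's actual state machine.

```javascript
// Hypothetical helper: collects attributes from the inside of a start tag.
// A quoted value ends at its closing quote, so the next attribute name may
// follow immediately without whitespace. Names cannot contain "/", so a
// trailing slash as in "<br />" does not become an attribute.
function parseAttributes(source) {
    const pattern = /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>]+)))?/g;
    const attributes = {};
    for (const match of source.matchAll(pattern)) {
        const name = match[1].toLowerCase();
        const value = match[2] ?? match[3] ?? match[4] ?? "";
        // Keep the first occurrence of a duplicated attribute.
        if (!Object.prototype.hasOwnProperty.call(attributes, name)) {
            attributes[name] = value;
        }
    }
    return attributes;
}

const attrs = parseAttributes('class="first attribute"title="second attribute"');
console.log(attrs);
```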

How to get the start index of the current tag ?

First of all, you did remarkable work on this project. We use it in production on our servers at my company, Fasterize.

Actually, we use a fork of your parser (v2.3.1) (https://github.com/fasterize/node-htmlparser/tree/position). In this fork, we added a patch to expose the start index and end index of the current tag. It's very useful in our software for replacing a tag.

I would like to upgrade to the latest version of htmlparser2, but for that I need to rebase this patch. Can you let me know the best place to get those indexes? Is it in the tokenizer?

Are you interested in putting this patch into master? I would like to avoid maintaining a fork only for that.

Don't skip > at the beginning of an input

cheeriojs/cheerio#103

new Parser({ontext:console.log}).parseComplete(">a>") // -> "a>"

Cannot distinguish tags closed with /> from tags closed with </tag>

I write a lot of filters for HTML, so htmlparser2 is great for parsing it. I'd like to do simple passthrough without being opinionated about anything I'm not specifically interested in changing. However this is difficult because I can't tell whether the original markup was using the self-closing style or not based on the arguments to onclosetag. This unnecessarily complicates the write side of my filter and requires my code to be aware of which tags are safe and traditional to self-close and which are not.

One solution would be a second argument to onclosetag indicating whether the self-closing style was used.

Thanks!

Alternate handling of out of order tag closing

Ran into a problematic website with mismatched open and close tags.

It appears the following code in Parser.prototype.onclosetag tries to make better sense out of mismatched close tags by popping the open tags off the stack until reaching the tag being closed:

    var pos = this._stack.lastIndexOf(name);
    if(pos !== -1){
        if(this._cbs.onclosetag){
            pos = this._stack.length - pos;
            while(pos--) this._cbs.onclosetag(this._stack.pop());
        }
        else this._stack.length = pos;
    ...

Sending the re-ordered tags to the browser then caused the rendering to look awful.

The question would be: is it reasonable to just attempt to reconstruct the original flawed order by pulling out only the tag being closed?

  var pos = this._stack.lastIndexOf(name);
  if(pos !== -1){
      // remove only the tag being closed; tags opened after it stay open
      this._stack.splice(pos,1);
      if (this._cbs.onclosetag) {
          this._cbs.onclosetag(name);
      }
  }

With this modification, Firefox, Chrome and IE all rendered this "flawed" HTML properly. Being unaware of the potential impact on other code and scenarios that might expect the current tag-closing behavior, I put this up for discussion.
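The difference between the two approaches is easiest to see on a bare stack of open tag names. The sketch below restates both snippets as standalone functions (`closeByPopping` and `closeBySplicing` are names chosen here for illustration) and returns the tags for which a close event would fire.

```javascript
// Strategy from the first snippet: close every tag opened after `name`,
// innermost first, then `name` itself.
function closeByPopping(stack, name) {
    const closed = [];
    const pos = stack.lastIndexOf(name);
    if (pos !== -1) {
        while (stack.length > pos) closed.push(stack.pop());
    }
    return closed;
}

// Strategy from the proposed change: remove only the named tag and leave
// the tags opened after it on the stack.
function closeBySplicing(stack, name) {
    const pos = stack.lastIndexOf(name);
    if (pos === -1) return [];
    stack.splice(pos, 1);
    return [name];
}

// Open tags for "<div><b><i>", followed by a stray "</b>".
const stackA = ["div", "b", "i"];
const stackB = ["div", "b", "i"];

const closedA = closeByPopping(stackA, "b");
const closedB = closeBySplicing(stackB, "b");
console.log(closedA, stackA);
console.log(closedB, stackB);
```

Popping reports `i` and then `b` as closed, which yields a well-nested tree but reorders the closing tags relative to the source. Splicing reports only `b` and leaves `i` open, which mirrors the source order at the cost of a stack that no longer describes strict nesting.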

"onattribvalue" not implemented

The method "onattribvalue" is referenced in "Tokenizer.js" but not implemented on "Parser.js" as expected.

This causes the following error:

TypeError: Object #<Parser> has no method 'onattribvalue'
    at Tokenizer._handleTrailingData (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:827:13)
    at Tokenizer.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:807:8)
    at Parser.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Parser.js:297:18)

Scan more than the head of the tag stack for tags closing other tags

As described here, <p><a>a<p>b is currently handled as <p><a>a<p>b</p></a></p>.

"/" considered an attribute for self-enclosing tags (ex. <br />)

Parsing <br /> results in:

{
  type: 'tag',
  name: 'br',
  attribs: { '/': '/' },
  children: []
}

Conditional comments aren't parsed correctly

Hi there,

This is an issue that came up in: cheeriojs/cheerio#13

Here's the problematic syntax:

<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->

Basically the parser messes up on the <!--[if lt IE7]>...

So we get:

{
  raw: '[if lt IE 7]>',
  data: '[if lt IE 7]>',
  type: 'comment',
  ...
}

It doesn't actually break, which is good, but it's an issue in every page that includes HTML5 Boilerplate.

Let me know if you need some additional info with this issue.

Thanks!
Matt

Invalid DOM when having <param> tags

When this is parsed, doc and param are stored as siblings, and doc is not a child of param:

<param><doc>doc</doc></param>

But when param is renamed to parameter, doc is a child of parameter:

<parameter><doc>doc</doc></parameter>

XML parsing - CDATA use results in broken tree

Most information regarding the problem is available here: cheeriojs/cheerio#131 (comment)

XML to reproduce issue is available here: https://gist.github.com/4248909

As per this comment, you can see the parsed tree is butchered pretty badly: cheeriojs/cheerio#131 (comment)

Removing the CDATA with '<root>' + page.substr(55).replace(/<!\[CDATA\[([^\]]+)]\]>/ig, "$1") + '</root>' results in a working parsing tree as shown here: cheeriojs/cheerio#131 (comment)
Not an ideal solution though.
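One limitation of the pattern in that workaround is that the character class `[^\]]+` cannot match CDATA content that itself contains a `]`. A slightly more tolerant variant is sketched below, assuming a hypothetical `unwrapCdata` helper; it remains a workaround rather than a fix for the parser.

```javascript
// Hypothetical helper: replaces each CDATA section with its raw content.
// The lazy [\s\S]*? stops at the first "]]>", so content may contain "]"
// and line breaks.
function unwrapCdata(xml) {
    return xml.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, "$1");
}

const item = "<description><![CDATA[a[0] < b]]></description>";
const unwrapped = unwrapCdata(item);
console.log(unwrapped);
```

The unwrapped content is not escaped, so any `<` or `&` it contains will be read as markup afterwards, which is one reason this approach is less than ideal.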

Closer to html specs ...

Hi, it's me again. I just wonder if you plan to get closer to the HTML specs, like using "attributes" instead of "attribs", "nodeValue" instead of "value" or "parentNode" instead of "parent" etc. as object properties. If you are interested, I have already made several working changes to my local code and would like to contribute them to your project.

Greets,

Chris

Translate dom into string method

It would be cool if DomHandler provided a method to translate a DOM structure into a string (the reverse operation of parsing).

var htmlparser = require('htmlparser2');
var handler = new htmlparser.DefaultHandler();
var parser = new htmlparser.Parser(handler);

var html = 'some html here';
parser.parseComplete(html);

// should be true
console.log(handler.getHtml() == html);

For now, it can be done with the following code:

html = htmlparser.DomUtils.getInnerHTML({ children: handler.dom })
// or 
html = handler.dom.map(htmlparser.DomUtils.getOuterHTML, htmlparser.DomUtils).join('')

But to find this solution you need to dig deep into the source code.
So a translate method would be useful to turn the DOM back into HTML.
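What such a method would do can be sketched over plain objects shaped like the nodes in the parser's output (`type`, `name`, `attribs`, `children`, `data`). The `render` function below is a hypothetical, simplified illustration and not the library's implementation.

```javascript
// Hypothetical, simplified serializer; it does no escaping and ignores
// comments, directives and void elements.
function render(nodes) {
    return nodes
        .map((node) => {
            if (node.type === "text") return node.data;
            if (node.type !== "tag") return "";
            const attribs = Object.entries(node.attribs || {})
                .map(([key, value]) => ` ${key}="${value}"`)
                .join("");
            return `<${node.name}${attribs}>${render(node.children || [])}</${node.name}>`;
        })
        .join("");
}

const tree = [
    {
        type: "tag",
        name: "p",
        attribs: { class: "greeting" },
        children: [{ type: "text", data: "Hello" }],
    },
];
const html = render(tree);
console.log(html);
```

For real documents, the dom-serializer package listed in the Ecosystem table is the dedicated tool for this job.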

[disabled optimization for Tokenizer.write, reason: optimized too many times]

In an attempt to optimize one of my modules, I discovered Tokenizer.write isn't optimized by v8.

By using --trace_deopt, I can see that v8 optimizes it, but due to bailouts #488, #13, #166, #396, #388, #287, #348, #154, #14, #280 it gets deoptimized until v8 gives up.

I will spend some time on this, but I just wanted to know if you have seen this before?

To debug it yourself, check out AndreasMadsen/article@fafe3b4 and run node --expose-gc --trace_deopt tools/benchmark.js

Migrating from 1.x to 2.x

Hi there,

I've been looking to pull in your changes but it seems like quite a bit has changed from 1.x to 2.x. Could you give a basic summary? Offhand, I've noticed:

Different methods to parse the html
raw attribute is gone
tags no longer have data or raw attributes

What else has changed?

Thanks!
Matt

Describe how to use it in the browser

I am trying to package this library into one single distributable file that exposes the htmlparser name globally under the window object.

It seems that browserify doesn't cater to that problem but requires the main program to use the require() construct, too. I don't believe that this is an option for me right now.

The best solution I've found so far is

$ browserify -r ./lib/index > htmlparser.js

and then in the browser do

> htmlparser = require('./lib/index');

The next best thing I can think of is wrapping each file in a function (module style) and "appending" its export to an overall module variable, which again sits inside a closure that is then returned and assigned to the global variable `htmlparser`.

Am I missing the point of browserify and the like?

Brackets break attributes

Just a note for anyone with a lot of time:

<div id=">"> foo </div>

Expected behavior: Return a div with an id of >, followed by the string foo.
Result: A div with an id of ", followed by the string "> foo.

The problem is that there isn't a possibility for a look-ahead. Besides, the markup is clearly broken. The current result should be acceptable in most cases.

Edit: Apparently, that bug is well-known.

string within script tag which look like comment breaks text node into 3

the following code

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    ontext: function(text){
        console.log(text);
    },
});
parser.write('<script>var x = "<!--123-->";</script>');
parser.done();

outputs

var x = "
";

jslint warns about the variables with `_`

When htmlparser2 is used in applications and if the variables with _ in them are used, jslint throws errors like this Unexpected dangling '_' in '_attribname'. Can the variable names be changed?
