inikulin / parse5

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.

License: MIT License

JavaScript 1.34% Shell 0.01% TypeScript 98.65%

html-parsing html html5 serialization serializer parser whatwg

parse5's Introduction

parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular, Lit, Cheerio, rehype and many more.

List of parse5 toolset packages

Online playground

Changelog

parse5's Issues

Add CHANGELOG

stop parse5.SimpleApiParser

Is there a way to have a stopParse API?

In my use case I am parsing <meta name="robots" content="noindex, nofollow"> and would like to stop the SimpleApiParser from continuing.

parsing fails for scripts w/ html strings

<!DOCTYPE html>
<html>
  <head></head><body><script>
      var a = {
        text: '<script src="asdf"></script>'
      };
  </script></body>
</html>

var parse5 = require('parse5');
var fs = require('fs');

var html = fs.readFileSync('index.html', 'utf8');

var parser = new parse5.Parser();

var document = parser.parse(html);
var html = document.childNodes[1];
var body = html.childNodes[html.childNodes.length - 1];
var script = body.childNodes[0];
var text = script.childNodes[0];
console.log(text.value);
// => '\n      var a = {\n        text: \'<script src="asdf">'

it's parsing the </script> inside the <script> as HTML, messing up the document =/
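
For context, the HTML tokenizer ends a script element at the first `</script` it sees, regardless of JavaScript string quoting, so browsers treat this input the same way. A common workaround is to break up the closing tag inside the script source before embedding it. The sketch below assumes the sequence only occurs inside string literals; `escapeInlineScript` is a hypothetical helper, not part of parse5.

```javascript
// Hypothetical helper (not a parse5 API): rewrites "</script" as "<\/script".
// Inside a JS string literal "\/" is just "/", so the runtime value is
// unchanged, but the HTML tokenizer no longer sees a closing tag.
function escapeInlineScript(source) {
  return source.replace(/<\/(script)/gi, '<\\/$1');
}

var source = "var a = { text: '<script src=\"asdf\"></script>' };";
var safe = escapeInlineScript(source);

console.log(safe);
```

The escaped source can then be placed between `<script>` and `</script>` without terminating the element early.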

Provide location info for the attributes

A tricky thing to do. However, any PR is welcome.

Support: Integration in WYMeditor

Hi,

Thank you for the excellent software.

I'm considering replacing WYMeditor's own parser with parse5.

Firstly, I would appreciate your opinion on this idea, in general.

The first question that I'm asking is "how well does parse5 work in WYMeditor's supported browsers?" Our supported browsers are, kind of, IE7 through 11, modern WebKit, Gecko and Blink browsers.

In practice, tests are performed on latest Chromium, latest Firefox and IE7 through 11.

I was hoping that there would be an in-browser testing framework but I only see the command line testing that uses nodeunit.

parse5 is supposed to be run in the browser, right? So how come there is no in-browser unit testing? Or is there and I'm missing it?

Thanks

Release 1.0 and commit to semver

As part of making parse5 the default parser for jsdom, we would prefer having a stable semantically-versioned dependency. Right now parse5 is pre-1.0, which means

Major version zero (0.y.z) is for initial development. Anything may change at any time. The public API should not be considered stable.

To be considerate to our users, this means we would have to pin our parse5 version at an exact version, and manually upgrade every time. This is OK, but it would be preferable to release a 1.0 and then commit to semantic versioning, and then we could depend on 1.x and get any new features or bugfixes automatically, without fear that the API will break.

Rename TreeSerializer to Serializer

htmlparser2 adapter uses no html-encoding in text nodes

Compare this test script:

var parse5 = require('parse5');

var rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';

var parser = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
var dom = parser.parseFragment(rawHtml);

console.log(dom.children[0].data);

var htmlparser2 = require('htmlparser2');
var handler = new htmlparser2.DefaultHandler();
var parserInstance = new htmlparser2.Parser(handler, {
  xmlMode: false,
  lowerCaseTags: true,
  lowerCaseAttributeNames: true
});

parserInstance.includeLocation = false;
parserInstance.parseComplete(rawHtml);

console.log(handler.dom[0].data);

which produces this output:

&lt;b&gt;World&lt;/b&gt;
&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;

Seems like text nodes already contain decoded data in parse5 and it's used as-is in the tree adapter?
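
The difference between the two outputs is a single pass of entity decoding. The sketch below reproduces that pass with a deliberately tiny decoder that only knows `&amp;`, `&lt;` and `&gt;`; `decodeOnce` and its entity table are illustrative assumptions, not the decoder either library uses.

```javascript
// Illustrative only: a three-entity decoder, far smaller than a real one.
var ENTITIES = { amp: '&', lt: '<', gt: '>' };

// Decodes each entity reference exactly once; replaced text is not rescanned.
function decodeOnce(text) {
  return text.replace(/&(amp|lt|gt);/g, function (match, name) {
    return ENTITIES[name];
  });
}

var rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';
var decoded = decodeOnce(rawHtml);

console.log(decoded);             // &lt;b&gt;World&lt;/b&gt;
console.log(decodeOnce(decoded)); // <b>World</b>
```

A tree whose text nodes hold already-decoded data needs them re-encoded when it is handed to code that expects raw, still-encoded text.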

Provide options to disable HTML entities processing in Parser/Serializer

it would be cool if parse5's serializer had an option to preserve the provided html without encoding html entities

Excessive RAM usage on startup.

I noticed parse5 uses 30 MB of RAM just by being required. I managed to pin down the usage, and it seems to be caused by the file lib/tokenization/named_entity_trie.js.

Switch to the latest version of the html5lib test suite

Apply latest spec changes by the way.

Usage in browser

Hi!
Really good parser!

Can I use it in browser?

Switch to minimal reporter for tests

With UAEmbeddableParser, the test output is too large; even Travis can't handle it. Switch to the minimal reporter when caolan/nodeunit#278 is merged.

parse()/parseFragment() inconsistency

Hi, if I use the htmlparser2 tree adapter and try to parse the following string "<head><title>The Title</title></head><body>Hello world</body>", it automatically adds extra <html> tags, which gives the following: "<html><head><title>The Title</title></head><body>Hello world</body></html>".

And if I use parseFragment(), it strips out some tags to give "<title>The Title</title>Hello world".

Is there a way or an option to parse an "XML-based" template without any automatic additions or removals?

Tree adapter for DOM

It would be nice if we could have an implementation-agnostic tree adaptor for DOM.
That way, one could use parse5 to load the document, then use all the DOM manipulating tools out there to tweak the data, and finally serialize things again using parse5. Or just do one of parsing and serialization using parse5, the other with something different.

<isindex> tags exhibit weird behaviour

In the process of integrating parse5 into jsdom I noticed this weird test case:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=utf-8">
<TITLE>NIST DOM HTML Test - ISINDEX</TITLE>
</HEAD>
<BODY onload="parent.loadComplete()">
<FORM ID="form1" ACTION="./files/getData.pl" METHOD="post">
<ISINDEX PROMPT="New Employee: ">
</FORM>
<ISINDEX PROMPT="Old Employee: ">
</BODY>
</HTML>

results in a (serialized) dom of:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; CHARSET=utf-8">
    <title>NIST DOM HTML Test - ISINDEX</title>
  </head>
  <body onload="parent.loadComplete()">
    <form id="form1" action="./files/getData.pl" method="post"></form>
    <form>
      <hr>
      <label>Old Employee: <input name="isindex"></label>
      <hr>
    </form>
  </body>
</html>

For comparison, Chrome produces:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<meta http-equiv="Content-Type" content="text/html; CHARSET=utf-8">
<title>NIST DOM HTML Test - ISINDEX</title>
</head>
<body onload="parent.loadComplete()">
<form id="form1" action="./files/getData.pl" method="post">
<isindex prompt="New Employee: ">

<isindex prompt="Old Employee: ">



</isindex></isindex></form></body></html>

I have to admit, the source there is really deprecated. Is replacing the tag while parsing in the spec or should that be handled by the browser layer? It also just drops one of the <isindex> tags...

Fragment parsing messes up the main document in jsdom if no intermediate representation is used

I figured out how to fix this on my side with minimal effort.

Closing tags

I've received a DOM for some HTML code!

How can I tell from it whether a tag had its own closing tag or not?

What is the difference between default and htmlparser2 tree adapters?

We are hoping to make parse5 the default in jsdom (jsdom/jsdom#818). I am curious whether we should just use the htmlparser2 adapter, and change none of our code, or if we should adopt the default parse5 format? Are there benefits?

Compatible query selector library

Hey,

Giving Parse5 a try. I haven't yet been able to find a compatible CSS/querySelector-style query library to actually find objects in the returned DOM tree. Which have you found?

Thanks!
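
Short of a selector library, a plain recursive walk is often enough for simple lookups. The sketch below assumes nodes shaped like the default tree output quoted elsewhere on this page (`nodeName`, `tagName`, `attrs`, `childNodes`); `findAll` is a hypothetical helper and the tree is built by hand rather than parsed.

```javascript
// Hypothetical helper: collects every node for which predicate(node) is true.
function findAll(node, predicate, found) {
  found = found || [];
  if (predicate(node)) {
    found.push(node);
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    findAll(children[i], predicate, found);
  }
  return found;
}

// Hand-built tree standing in for a parsed fragment.
var fragment = {
  nodeName: '#document-fragment',
  childNodes: [
    {
      nodeName: 'div', tagName: 'div', attrs: [],
      childNodes: [
        {
          nodeName: 'span', tagName: 'span',
          attrs: [{ name: 'class', value: 'copyright link' }],
          childNodes: [{ nodeName: '#text', value: 'Copyright content' }]
        }
      ]
    },
    { nodeName: 'span', tagName: 'span', attrs: [], childNodes: [] }
  ]
};

var spans = findAll(fragment, function (node) {
  return node.tagName === 'span';
});

console.log(spans.length); // 2
```

Matches come back in document order, so the first entry is the nested span.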

well-known self-closing tags not parsed as such

var parse5 = new (require("parse5")).SimpleApiParser({
    startTag: function(tagName,attrs,selfClosing){ console.log(tagName,selfClosing) },
    endTag: function(tagName){ console.log(tagName) }
});

parse5.parse('<img src="asdf"/>');  //-> img true
parse5.parse('<img src="asdf">');   //-> img false

<img> is self closing, with or without the XML closer.
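
The output above suggests that the selfClosing argument reports whether the source contained the trailing slash, not whether the element can have content. HTML separately defines a fixed set of void elements that never take an end tag, so a caller can check for those itself. A minimal sketch, where `expectsEndTag` is a hypothetical helper and not a parse5 API:

```javascript
// Void elements as listed in the WHATWG HTML Living Standard.
var VOID_ELEMENTS = [
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
];

// Hypothetical helper: true when an end tag should follow the start tag.
function expectsEndTag(tagName) {
  return VOID_ELEMENTS.indexOf(tagName.toLowerCase()) === -1;
}

console.log(expectsEndTag('img'));  // false
console.log(expectsEndTag('META')); // false
console.log(expectsEndTag('div'));  // true
```

Inside a startTag callback, such a check can be combined with the selfClosing argument to decide whether to wait for a matching endTag call.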

More forgiving tag names

<{{tag}}>asdf</{{tag}}>

is currently parsed as text.

I realize that this is not standard HTML5, but it'd be nice to benefit from many of this lib's features when parsing HTML variants such as Handlebars templates.

Prepending string parser

For example, I have some HTML like this.

<script>
document.write('This is <h2>')
</script>
heading</h2>

I want to write an HTML parser with the ability to execute scripts, something like zombiejs (for some reasons, I am not satisfied with it). Anyway, I will take care of the DOM and sandbox parts. But I need an interface to append a string to the parser due to the use of document.write().

Is it possible to do this with parse5?

Fix location info handling for the implicitly generated <html> and <body>

And add note to the README about location info for the implicitly generated tags.

Seems bug in parsing

These two snippets are parsed in the same way!

<span class="copyright link">Copyright content</span>

<span class="copyright link">Copyright content</spane>

And after serialization of DOMs of these two examples I've received the same result:

<span class="copyright link">Copyright content</span>

Add regression benchmarking

Seems like we've slowed down after the adoption agency refactoring. We need regression benchmarks, and we need to determine the cause of the slowdown (is it due to context switches or non-instance calls?).

html validator

It goes beyond the scope of parse5 a little, but because it's a compliant parser, if it could also provide the errors in the markup, it could be used as a validator.

mini dom implementation

are there any mini dom implementations that use dom5? in particular, i'd like to be able to replace a node with another parse5-parsed fragment (or something like that). i'm sure i can just go replacing nodes, but i'd rather use a well-tested lib. don't really care if its spec compliant or anything.
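
Replacing a node by hand is only a few lines when the tree uses plain `childNodes` arrays and `parentNode` back-references, as the default tree output quoted on this page does. The sketch below works on hand-built objects; `replaceNode` is a hypothetical helper, not a parse5 or dom5 API.

```javascript
// Hypothetical helper: swaps oldNode for newNode in its parent's childNodes.
function replaceNode(oldNode, newNode) {
  var parent = oldNode.parentNode;
  if (!parent) {
    throw new Error('node has no parent');
  }
  var index = parent.childNodes.indexOf(oldNode);
  if (index === -1) {
    throw new Error('node is not listed in its parent');
  }
  parent.childNodes[index] = newNode;
  newNode.parentNode = parent;
  oldNode.parentNode = null;
  return oldNode;
}

// Hand-built tree standing in for a parsed fragment.
var root = { nodeName: '#document-fragment', childNodes: [] };
var oldChild = { nodeName: 'p', tagName: 'p', attrs: [], childNodes: [], parentNode: root };
root.childNodes.push(oldChild);

var replacement = { nodeName: 'div', tagName: 'div', attrs: [], childNodes: [], parentNode: null };
replaceNode(oldChild, replacement);

console.log(root.childNodes[0].tagName); // div
```

To splice in a whole parsed fragment instead of a single node, the same index lookup can be followed by inserting each of the fragment's children and updating their `parentNode`.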

Using document.elementFromPoint()

I am trying to use the document.elementFromPoint() function which returns the top visible element by x and y coordinates - see http://dev.w3.org/csswg/cssom-view/#dom-document-elementfrompointx-y
However,

 var Parser = require('parse5').Parser;
 var parser = new Parser();
 var doc = parser.parse(ff_dom);
 doc.elementFromPoint(1,1);

gives me a "TypeError: undefined is not a function" so it seems that the document object that is returned after parsing does not support this function although (according to this https://developer.mozilla.org/en-US/docs/Web/API/Document/elementFromPoint) Chrome supports it since version 4.0 and Firefox since version 3.

Is it planned to extend the functionality of this library by elementFromPoint? Or am I even doing something wrong?

Form closing tag is moved to the end of an enclosing template

var Parser = require('parse5').Parser;
var Serializer = require('parse5').Serializer;

var parser = new Parser();
var serializer = new Serializer();

var input = '<template><form><input name="q"></form><div>second</div></template>';
var fragment = parser.parseFragment(input);

console.log(serializer.serialize(fragment));

Expected:

<template><form><input name="q"></form><div>second</div></template>

Actual:

<template><form><input name="q"><div>second</div></form></template>

stringifier is encoding inline <script> tags

var parse5 = require('parse5')
var serializer = new parse5.Serializer
serializer.serialize({
  "nodeName": "#document-fragment",
  "quirksMode": false,
  "childNodes": [
    {
      "nodeName": "script",
      "tagName": "script",
      "attrs": [],
      "namespaceURI": "http://www.w3.org/1999/xhtml",
      "childNodes": [
        {
          "nodeName": "#text",
          "value": "function test(t,n){return t&&n}"
        }
      ]
    }
  ]
})

returns:

<script>function test(t,n){return t&amp;&amp;n}</script>

how do i avoid encoding the contents of the script tag?
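
For reference, the HTML serialization rules emit text literally when its parent is a raw-text element such as script or style, and escape it everywhere else. The sketch below shows that distinction on its own; `serializeTextNode` is a hypothetical helper, and the escaping is simplified (the full rules also handle non-breaking spaces).

```javascript
// Parents whose text children are serialized without escaping.
var RAW_TEXT_PARENTS = [
  'style', 'script', 'xmp', 'iframe', 'noembed', 'noframes', 'plaintext'
];

// Simplified text escaping: only &, < and > are handled here.
function escapeText(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: serializes one text node given its parent's tag name.
function serializeTextNode(textNode, parentTagName) {
  if (RAW_TEXT_PARENTS.indexOf(parentTagName) !== -1) {
    return textNode.value;
  }
  return escapeText(textNode.value);
}

var text = { nodeName: '#text', value: 'function test(t,n){return t&&n}' };

console.log(serializeTextNode(text, 'script')); // function test(t,n){return t&&n}
console.log(serializeTextNode(text, 'p'));      // function test(t,n){return t&amp;&amp;n}
```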

get line/col number of element?

Would be useful

parse5 and streaming

Is there a way to pipe a file into parse5 (this feature is available in htmlparser2).

similarly, is there a way to pause / resume the sax parser ? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.

I have been using the html-tokenize streaming parser lately and its suite (html-select, trumpet) but reaching html5 conformity on html-tokenize is still a long way to go so I am trying to see how parse5 could fit in and be used with html-select

Problem running in Nashorn (JVM)

I'm trying to run Parse5 in the JVM using Nashorn and AvatarJS. Don't ask....

Due to a bug in Nashorn, Parse5 fails because there are too many objects in named_entity_trie.js

Is there a way I can customize this file, maybe reduce its size to support a limited set of characters?

Conditional comments

It would be great if your SAX parser could give the information about the conditional comments.

HTML5 Legacy Doctype Misparsed with htmlparser2 tree adapter

Hey,

I wanted to report an issue with the HTML5 legacy doctype parser.
Specifically:

<!DOCTYPE html SYSTEM "about:legacy-compat">

Responds with:

'Document declares a non-HTML5 DOCTYPE'

X-Ref twbs/bootlint#251

Incorrect AST

Here is a piece of code whose AST is not correct

<b>
  <i>
</b>

(there is a newline after the last tag)
The resultant AST is
root -> 1 child (html) -> 2 children (head,body)
body -> 2 children (b, i)
b -> 1 child (#text, i)
i -> 1 child (#text)

The key problem is that i is included twice (inside b and after b)

Provide DOM-element source code location information

It would be great if parse5 supported the startIndex property of htmlparser2's DOM.
See https://github.com/fb55/domhandler#option-withstartindices for a description.
It's very useful for e.g. lint tools that want to point out the source code locations of DOM elements.

Provide the way to handle document writes

@domenic @Sebmaster
Hi guys, I just found this conversation on twitter. And since I've deleted my twitter account and I didn't find a related issue in the jsdom issues, I decided to open the conversation here. I don't think it's a good idea to get into Preprocessor internals, and in general it will not work by modifying just this.html. I think the better way will be implementing this API on my side: provide some script invocation mechanism during parsing (as it's done in real user agents), so the script can accept an unfinished tree. E.g.:

//parser.parse(html, scriptHandler)
var result = parser.parse(html, function(scriptContent, document, writeHtmlFunc, async)  {
       //writeHtmlFunc can be used to inject html, passed to document.write, into parser
       //async flag from async attr can be used to postpone script execution after parsing 
});

Please, let me know your opinion.

Bug in documentation

Here is a bug in the example!

Instead of

parser.parse('<body>Yo!</body>');

this

parser5.parse('<body>Yo!</body>');

is needed.

Parsing is incorrect when having <meta> tagName

I have used the parse5.SimpleApiParser, but when there is a <meta> that is self-closing, the parser:

Does not pass the selfClosing parameter to the startTag callback
Does not call the endTag callback

<meta charset="utf-8">

Support frameset

It's part of HTML5, but apparently not supported?

NPM version is out of date

Hey, could you update the npm version and publish it? The tree adapters and serializer are really needed.

Closing-self-closing tags

Can this situation be handled in SAX-parser?

</bla/>

It is a self-closing and closing tag.)

Or more possible variant:

<bla//>

For example, I want to check the validity...

Support for Web Browser?

Could you tell me whether parse5 can also be executed inside a web browser?

tern https://github.com/marijnh/tern provides this cool feature. You can execute the JS inference engine

inside node
or web browser.

To fix the problem with require, it uses this code:

(function(root, mod) {
  if (typeof exports == "object" && typeof module == "object") // CommonJS
    return mod(exports, require("./infer"), require("./signal"),
               require("acorn/acorn"), require("acorn/util/walk"));
  if (typeof define == "function" && define.amd) // AMD
    return define(["exports", "./infer", "./signal", "acorn/acorn", "acorn/util/walk"], mod);
  mod(root.tern || (root.tern = {}), tern, tern.signal, acorn, acorn.walk); // Plain browser env
})(this, function(exports, infer, signal, acorn, walk) {
...

See https://github.com/marijnh/tern/blob/master/lib/tern.js

Duplicate tags' attributes

Hi!

Here's the HTML code:

<title param="1" param="2"></title>

It will be parsed as:

{ nodeName: 'title',
  tagName: 'title',
  attrs: [ { name: 'param', value: '1' } ],
  namespaceURI: 'http://www.w3.org/1999/xhtml',
  childNodes: [],
  parentNode: 
   { nodeName: '#document-fragment',
     quirksMode: false,
     childNodes: [ [Circular] ] } }

As you can see, the output DOM has no info about the second, duplicate attribute param="2"!

Yes, I know that this is not a real and global problem! But in my task this handling is really vital!
Can you handle this situation, I would like to receive this in output DOM:

{ nodeName: 'title',
  tagName: 'title',
  attrs: [ { name: 'param', value: '1' },
           { name: 'param', value: '2' } ],
  namespaceURI: 'http://www.w3.org/1999/xhtml',
  childNodes: [],
  parentNode: 
   { nodeName: '#document-fragment',
     quirksMode: false,
     childNodes: [ [Circular] ] } }
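
For context, the WHATWG tokenizer rules treat a repeated attribute name on one tag as a parse error and discard the later occurrence, which matches the first output above. The first-wins rule can be sketched as follows; `dropDuplicateAttrs` is a hypothetical helper for illustration, not how parse5 implements it.

```javascript
// Hypothetical helper: keeps the first attribute for each name, in order.
function dropDuplicateAttrs(attrs) {
  var seen = {};
  return attrs.filter(function (attr) {
    if (Object.prototype.hasOwnProperty.call(seen, attr.name)) {
      return false; // a later duplicate is discarded
    }
    seen[attr.name] = true;
    return true;
  });
}

var attrs = [
  { name: 'param', value: '1' },
  { name: 'param', value: '2' }
];

console.log(dropDuplicateAttrs(attrs)); // [ { name: 'param', value: '1' } ]
```

A tool that needs to see the discarded attribute would have to capture it before this step, for example at the tokenizer level.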

Avoid adding empty elements

Hey I was wondering if there is a way to NOT add elements automatically to the AST which are not there in HTML. I'll explain using an example
HTML:
<html>
<head>
</head>
</html>

Now the html doesn't have any body tag, so I don't want that in the parse tree, whereas right now html has 2 childNodes - head and body.

getTagName in adapters misnamed?

In jsdom 3.0.0, we removed tagName from Node and moved it to Element.

This seems to have caused jsdom/jsdom#1004. I traced it to the adapter code:

exports.getTagName = function (element) {
  return element.tagName.toLowerCase();
};

It seems this gets called with a text node, which no longer has a tagName.

If I switch it to nodeName, it seems to work.

Should the adapter hook be renamed to getNodeName?

Handle boolean attributes?

I am building an html2xhtml converter on top of parse5 (https://github.com/cburgmer/http2xhtml.js/blob/master/lib/converter.js).

It seems I need to handle boolean attributes (e.g. <input type="checkbox" checked>) myself.

parse5 will return such an attribute as {name: "checked", value: ""}. This does not allow me to distinguish from actual attributes with empty string values.

I could not find much more information on those kinds of attributes, so I am posing this as a question: should the parser know about those attributes and return something accordingly?
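
In HTML, a boolean attribute written bare (`checked`) and one written with an empty value (`checked=""`) mean the same thing, so a converter can handle the case with its own list of boolean attribute names. A minimal sketch for the XHTML side, assuming a partial, illustrative list; `toXhtmlAttr` is a hypothetical helper, not a parse5 API.

```javascript
// Partial list for illustration; a real converter would need the full set.
var BOOLEAN_ATTRS = [
  'checked', 'disabled', 'selected', 'readonly', 'multiple', 'required'
];

// Hypothetical helper: serializes one { name, value } pair for XHTML,
// expanding empty boolean attributes to the name="name" form.
function toXhtmlAttr(attr) {
  var value = attr.value;
  if (value === '' && BOOLEAN_ATTRS.indexOf(attr.name) !== -1) {
    value = attr.name;
  }
  var escaped = value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;');
  return attr.name + '="' + escaped + '"';
}

console.log(toXhtmlAttr({ name: 'checked', value: '' })); // checked="checked"
console.log(toXhtmlAttr({ name: 'alt', value: '' }));     // alt=""
```

Attributes outside the list keep their empty value, so `alt=""` survives the conversion unchanged.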

Add <template> support.

At the time when parse5 was started, templates were in draft. Since they are now supported by all browsers except IE (surprise!), it's time to land them in parse5.