parse5's People

Contributors

43081j, bathos, chadkillingsworth, cheeaun, dependabot[bot], dkoleary88, dmartens, fb55, gfx, inikulin, jmsjtu, meirionhughes, milahu, mvasilkov, nolanlawson, notslang, pmdartus, rreverser, sakagg, samouri, samuelli, sebmaster, squidfunk, stevenvachon, ursm, webdesus, wi1dcard, wooorm, yyx990803, zirro

parse5's Issues

stop parse5.SimpleApiParser

Is there a way to have a stopParse API?

In my use case I am parsing <meta name="robots" content="noindex, nofollow"> and would like to stop the SimpleApiParser from continuing.
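
One possible workaround, sketched below, does not stop the parser itself; it only silences the callbacks after a flag is set. `createStoppableHandlers` and `stop` are hypothetical names, not part of parse5, and the handler shape (startTag receiving tagName, attrs, selfClosing) is assumed from the SimpleApiParser examples in these issues.

```javascript
// Hypothetical helper (not a parse5 API): wraps handler callbacks so they
// become no-ops once stop() has been called. The parser would still scan
// the remaining input; only the callbacks are silenced.
function createStoppableHandlers(handlers) {
  let stopped = false;
  const wrapped = {};
  for (const name of Object.keys(handlers)) {
    wrapped[name] = function (...args) {
      if (!stopped) handlers[name].apply(this, args);
    };
  }
  return { handlers: wrapped, stop() { stopped = true; } };
}

// Usage: stop listening as soon as a robots <meta> tag is seen.
const seen = [];
const stoppable = createStoppableHandlers({
  startTag(tagName, attrs) {
    seen.push(tagName);
    const isRobots = tagName === 'meta' &&
      attrs.some((a) => a.name === 'name' && a.value === 'robots');
    if (isRobots) stoppable.stop();
  }
});
```

The wrapped object in `stoppable.handlers` is what would be passed to the SimpleApiParser constructor. This saves the cost of the callbacks but not the cost of tokenizing the rest of the document.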

parsing fails for scripts w/ html strings

<!DOCTYPE html>
<html>
  <head></head><body><script>
      var a = {
        text: '<script src="asdf"></script>'
      };
  </script></body>
</html>

var parse5 = require('parse5');
var fs = require('fs');

var html = fs.readFileSync('index.html', 'utf8');

var parser = new parse5.Parser();

var document = parser.parse(html);
var html = document.childNodes[1];
var body = html.childNodes[html.childNodes.length - 1];
var script = body.childNodes[0];
var text = script.childNodes[0];
console.log(text.value);
// => '\n      var a = {\n        text: \'<script src="asdf">'

It's parsing the </script> inside the <script> string as HTML, which messes up the document =/
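
For what it's worth, browsers also end a script element at the first `</script` sequence, so the usual workaround is on the authoring side: escape the slash inside the JavaScript string. A minimal sketch, where `escapeInlineScript` is a hypothetical helper name and not part of parse5:

```javascript
// Hypothetical helper: makes JavaScript source safe to embed in an inline
// <script> element by breaking up any "</script" sequence. Inside a JS
// string literal, "<\/script>" still evaluates to "</script>".
function escapeInlineScript(source) {
  return source.replace(/<\/(script)/gi, '<\\/$1');
}

const source = "var a = { text: '<script src=\"asdf\"></script>' };";
const safeSource = escapeInlineScript(source);
```

The escaped source contains `<\/script>` in place of `</script>`, so an HTML tokenizer no longer sees an end tag in the middle of the string.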

Support: Integration in WYMeditor

Hi,

Thank you for the excellent software.

I'm considering replacing WYMeditor's own parser with parse5.

Firstly, I would appreciate your opinion on this idea in general.

The first question that I'm asking is "how well does parse5 work in WYMeditor's supported browsers?" Our supported browsers are, roughly, IE7 through 11 and modern WebKit, Gecko and Blink browsers.

In practice, tests are performed on latest Chromium, latest Firefox and IE7 through 11.

I was hoping that there would be an in-browser testing framework but I only see the command line testing that uses nodeunit.

parse5 is supposed to run in the browser, right? So how come there is no in-browser unit testing? Or is there, and I'm missing it?

Thanks

Release 1.0 and commit to semver

As part of making parse5 the default parser for jsdom, we would prefer having a stable semantically-versioned dependency. Right now parse5 is pre-1.0, which means:

Major version zero (0.y.z) is for initial development. Anything may change at any time. The public API should not be considered stable.

To be considerate to our users, this means we would have to pin our parse5 version at an exact version, and manually upgrade every time. This is OK, but it would be preferable to release a 1.0 and then commit to semantic versioning, and then we could depend on 1.x and get any new features or bugfixes automatically, without fear that the API will break.

htmlparser2 adapter uses no html-encoding in text nodes

Compare this test script:

var parse5 = require('parse5');

var rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';

var parser = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
var dom = parser.parseFragment(rawHtml);

console.log(dom.children[0].data);

var htmlparser2 = require('htmlparser2');
var handler = new htmlparser2.DefaultHandler();
var parserInstance = new htmlparser2.Parser(handler, {
  xmlMode: false,
  lowerCaseTags: true,
  lowerCaseAttributeNames: true
});

parserInstance.includeLocation = false;
parserInstance.parseComplete(rawHtml);

console.log(handler.dom[0].data);

which produces this output:

&lt;b&gt;World&lt;/b&gt;
&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;

It seems that text nodes in parse5 already contain decoded data, and the tree adapter uses it as-is?
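
The two outputs differ by exactly one pass of character reference decoding. The sketch below illustrates that with a hypothetical `decodeOnce` helper that only knows five named entities (real HTML defines many more); it is not how either library is implemented.

```javascript
// Minimal single-pass decoder for a handful of named character references.
// Illustrative only: HTML defines over two thousand named references.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeOnce(text) {
  return text.replace(/&(amp|lt|gt|quot|apos);/g, (match, name) => NAMED_ENTITIES[name]);
}

const rawHtml = '&amp;lt;b&amp;gt;World&amp;lt;/b&amp;gt;';
const decoded = decodeOnce(rawHtml); // '&lt;b&gt;World&lt;/b&gt;'
```

A consumer that expects htmlparser2-style encoded text would therefore need to re-encode the value coming from the adapter, or skip its own decoding step.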

Excessive RAM usage on startup.

I noticed parse5 uses 30 MB of RAM just by being required. I managed to pin down the usage, and it seems to come from the file lib/tokenization/named_entity_trie.js.

parse()/parseFragment() inconsistency

Hi, if I use the htmlparser2 tree adapter and try to parse the string "<head><title>The Title</title></head><body>Hello world</body>", it automatically adds extra <html> tags, which gives "<html><head><title>The Title</title></head><body>Hello world</body></html>".

And if I use parseFragment(), it strips out some tags to give "<title>The Title</title>Hello world".

Is there a way or an option to parse an "xml based" template without any automatic additions or removals?

Tree adapter for DOM

It would be nice if we could have an implementation-agnostic tree adapter for DOM.
That way, one could use parse5 to load the document, then use all the DOM manipulating tools out there to tweak the data, and finally serialize things again using parse5. Or just do one of parsing and serialization using parse5, the other with something different.

<isindex> tags exhibit weird behaviour

In the process of integrating parse5 into jsdom I noticed this weird test case:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=utf-8">
<TITLE>NIST DOM HTML Test - ISINDEX</TITLE>
</HEAD>
<BODY onload="parent.loadComplete()">
<FORM ID="form1" ACTION="./files/getData.pl" METHOD="post">
<ISINDEX PROMPT="New Employee: ">
</FORM>
<ISINDEX PROMPT="Old Employee: ">
</BODY>
</HTML>

results in a (serialized) dom of:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; CHARSET=utf-8">
    <title>NIST DOM HTML Test - ISINDEX</title>
  </head>
  <body onload="parent.loadComplete()">
    <form id="form1" action="./files/getData.pl" method="post"></form>
    <form>
      <hr>
      <label>Old Employee: <input name="isindex"></label>
      <hr>
    </form>
  </body>
</html>

For comparison, Chrome produces:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<meta http-equiv="Content-Type" content="text/html; CHARSET=utf-8">
<title>NIST DOM HTML Test - ISINDEX</title>
</head>
<body onload="parent.loadComplete()">
<form id="form1" action="./files/getData.pl" method="post">
<isindex prompt="New Employee: ">

<isindex prompt="Old Employee: ">



</isindex></isindex></form></body></html>

I have to admit, the source there is really deprecated. Is replacing the tag while parsing in the spec or should that be handled by the browser layer? It also just drops one of the <isindex> tags...

Closing tags

I've received a DOM from some HTML code.

How can I tell from it whether a tag had its own closing tag or not?

Compatible query selector library

Hey,

Giving Parse5 a try. I haven't yet been able to find a compatible CSS/querySelector-style query library to actually find objects in the returned DOM tree. Which have you found?

Thanks!
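
In the meantime, a plain tree walk covers simple lookups. The sketch below assumes the default tree shape seen elsewhere in these issues (nodes with `childNodes`, `tagName` and an `attrs` array of `{name, value}` pairs); `findAll` and `hasClass` are hypothetical helpers, not a CSS selector engine.

```javascript
// Depth-first search over a parse5-style tree: collects every node for
// which the predicate returns true, in document order.
function findAll(node, predicate, results = []) {
  if (predicate(node)) results.push(node);
  for (const child of node.childNodes || []) {
    findAll(child, predicate, results);
  }
  return results;
}

// True when the node has a class attribute containing the given class name.
function hasClass(node, className) {
  const attr = (node.attrs || []).find((a) => a.name === 'class');
  return Boolean(attr) && attr.value.split(/\s+/).includes(className);
}

// Hand-built tree standing in for parser output.
const tree = {
  nodeName: '#document-fragment',
  childNodes: [
    { nodeName: 'div', tagName: 'div', attrs: [], childNodes: [
      { nodeName: 'span', tagName: 'span',
        attrs: [{ name: 'class', value: 'copyright link' }], childNodes: [] }
    ] }
  ]
};

const links = findAll(tree, (n) => n.tagName === 'span' && hasClass(n, 'link'));
```

Predicates compose with ordinary boolean logic, which is often enough for scraping-style tasks before reaching for a full selector library.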

well-known self-closing tags not parsed as such

var parse5 = new (require("parse5")).SimpleApiParser({
    startTag: function(tagName,attrs,selfClosing){ console.log(tagName,selfClosing) },
    endTag: function(tagName){ console.log(tagName) }
});

parse5.parse('<img src="asdf"/>');  //-> img true
parse5.parse('<img src="asdf">');   //-> img false

<img> is self closing, with or without the XML closer.
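
The selfClosing flag appears to report only whether the start tag was written with `/>`. Whether an element can have an end tag at all is a property of the element name, so a handler can check that separately. A minimal sketch, with the void element list taken from the current HTML specification and `isVoidElement` as a hypothetical helper (not a parse5 API):

```javascript
// Void elements as listed in the current HTML specification; older
// specifications also listed a few names that are now obsolete.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// True when no end tag should be expected for this HTML element,
// regardless of whether the start tag was written as <img> or <img/>.
function isVoidElement(tagName) {
  return VOID_ELEMENTS.has(tagName.toLowerCase());
}

// Example handlers in the shape used above: record "void" separately from
// the purely syntactic selfClosing flag.
const log = [];
const handlers = {
  startTag(tagName, attrs, selfClosing) {
    log.push({ tagName, selfClosing, isVoid: isVoidElement(tagName) });
  }
};
```

With this split, `<img src="asdf">` is recorded with `selfClosing: false` and `isVoid: true`, so the caller knows not to wait for an endTag callback.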

More forgiving tag names

<{{tag}}>asdf</{{tag}}>

is currently parsed as text.

I realize that this is not standard HTML5, but it'd be nice to benefit from many of this lib's features when parsing HTML variants such as Handlebars templates.

Prepending string parser

For example, I have HTML like this:

<script>
document.write('This is <h2>')
</script>
heading</h2>

I want to write an HTML parser with the ability to execute scripts, something like zombiejs (for some reasons, I am not satisfied with it). Anyway, I will take care of the DOM and sandbox part. But I need an interface to append a string to the parser's input because of document.write().

Is it possible to do this with parse5?

Seems bug in parsing

These two snippets are parsed in the same way!

<span class="copyright link">Copyright content</span>
<span class="copyright link">Copyright content</spane>

And after serializing the DOMs of these two examples I received the same result:

<span class="copyright link">Copyright content</span>

Add regression benchmarking

It seems we've slowed down after the adoption agency refactoring. We need regression benchmarks, and we need to determine the cause of the slowdown (is it due to context switching or non-instance calls?).
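
A regression check can start very small. The sketch below is only an illustration of the idea: `benchmark` and `isRegression` are hypothetical helpers, the 10% tolerance is an arbitrary assumption, and the function under test would be something like a call that parses a fixed document.

```javascript
// Runs fn repeatedly and returns the mean time per call in milliseconds.
// A short warm-up gives the JIT a chance to optimize before timing starts.
function benchmark(fn, iterations) {
  for (let i = 0; i < Math.min(iterations, 100); i++) fn();
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / iterations / 1e6;
}

// Flags a regression when the current timing exceeds the stored baseline
// by more than the given relative tolerance (10% by default, an assumption).
function isRegression(baselineMs, currentMs, tolerance = 0.1) {
  return currentMs > baselineMs * (1 + tolerance);
}
```

Storing the baseline per commit and running the same input before and after a refactoring would make it possible to bisect the slowdown rather than guess at its cause.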

html validator

It goes beyond the scope of parse5 a little, but because it's a compliant parser, if it could also provide the errors in the markup, it could be used as a validator.

mini dom implementation

are there any mini dom implementations that use dom5? in particular, i'd like to be able to replace a node with another parse5-parsed fragment (or something like that). i'm sure i can just go replacing nodes, but i'd rather use a well-tested lib. don't really care if its spec compliant or anything.
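
Replacing a node by hand is short if the tree uses the default shape seen in other issues here (a `childNodes` array and a `parentNode` back-reference). A minimal sketch under that assumption; `replaceNode` is a hypothetical helper, not part of parse5 or any existing library:

```javascript
// Replaces oldNode in its parent's childNodes with the given list of nodes
// and fixes up the parentNode back-references.
function replaceNode(oldNode, newNodes) {
  const parent = oldNode.parentNode;
  if (!parent) throw new Error('node has no parent');
  const index = parent.childNodes.indexOf(oldNode);
  if (index === -1) throw new Error('node is not a child of its parent');
  const replacements = Array.from(newNodes); // copy, in case a live childNodes array is passed
  parent.childNodes.splice(index, 1, ...replacements);
  for (const node of replacements) node.parentNode = parent;
  oldNode.parentNode = null;
  return parent;
}

// Hand-built nodes standing in for parser output.
const root = { nodeName: '#document-fragment', childNodes: [] };
const oldChild = { nodeName: 'b', tagName: 'b', childNodes: [], parentNode: root };
root.childNodes.push(oldChild);
const replacement = { nodeName: 'i', tagName: 'i', childNodes: [], parentNode: null };

replaceNode(oldChild, [replacement]);
```

When the replacements come from another parsed fragment, that fragment's own `childNodes` array still lists them afterwards, so it should be discarded or emptied.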

Using document.elementFromPoint()

I am trying to use the document.elementFromPoint() function which returns the top visible element by x and y coordinates - see http://dev.w3.org/csswg/cssom-view/#dom-document-elementfrompointx-y
However,

 var Parser = require('parse5').Parser;
 var parser = new Parser();
 var doc = parser.parse(ff_dom);
 doc.elementFromPoint(1,1);

gives me a "TypeError: undefined is not a function" so it seems that the document object that is returned after parsing does not support this function although (according to this https://developer.mozilla.org/en-US/docs/Web/API/Document/elementFromPoint) Chrome supports it since version 4.0 and Firefox since version 3.

Is it planned to extend the functionality of this library by elementFromPoint? Or am I even doing something wrong?

Form closing tag is moved to the end of an enclosing template

var Parser = require('parse5').Parser;
var Serializer = require('parse5').Serializer;

var parser = new Parser();
var serializer = new Serializer();

var input = '<template><form><input name="q"></form><div>second</div></template>';
var fragment = parser.parseFragment(input);

console.log(serializer.serialize(fragment));

Expected:

<template><form><input name="q"></form><div>second</div></template>

Actual:

<template><form><input name="q"><div>second</div></form></template>

stringifier is encoding inline <script> tags

var parse5 = require('parse5')
var serializer = new parse5.Serializer
serializer.serialize({
  "nodeName": "#document-fragment",
  "quirksMode": false,
  "childNodes": [
    {
      "nodeName": "script",
      "tagName": "script",
      "attrs": [],
      "namespaceURI": "http://www.w3.org/1999/xhtml",
      "childNodes": [
        {
          "nodeName": "#text",
          "value": "function test(t,n){return t&&n}"
        }
      ]
    }
  ]
})

returns:

<script>function test(t,n){return t&amp;&amp;n}</script>

how do i avoid encoding the contents of the script tag?
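
For reference, HTML serialization normally leaves the text children of raw text elements such as script and style unescaped. The sketch below shows that rule on the tree shape from this issue. It is a hypothetical, heavily simplified serializer (no void elements, comments or doctypes), not parse5's implementation.

```javascript
// Elements whose text children are emitted verbatim. Only two are listed
// here; the HTML specification names a few more.
const RAW_TEXT_PARENTS = new Set(['script', 'style']);

function escapeText(value) {
  return value.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttr(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Serializes a parse5-style node. parentTag is the tag name of the
// enclosing element, used to decide whether text must be escaped.
function serializeNode(node, parentTag) {
  if (node.nodeName === '#text') {
    return RAW_TEXT_PARENTS.has(parentTag) ? node.value : escapeText(node.value);
  }
  const children = (node.childNodes || [])
    .map((child) => serializeNode(child, node.tagName))
    .join('');
  if (!node.tagName) return children; // document or fragment: children only
  const attrs = (node.attrs || [])
    .map((a) => ' ' + a.name + '="' + escapeAttr(a.value) + '"')
    .join('');
  return '<' + node.tagName + attrs + '>' + children + '</' + node.tagName + '>';
}

const fragment = {
  nodeName: '#document-fragment',
  childNodes: [{
    nodeName: 'script', tagName: 'script', attrs: [],
    childNodes: [{ nodeName: '#text', value: 'function test(t,n){return t&&n}' }]
  }]
};

const html = serializeNode(fragment);
// html === '<script>function test(t,n){return t&&n}</script>'
```

Text outside script and style is still escaped, so `a&b` inside a `<p>` comes out as `a&amp;b`.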

parse5 and streaming

Is there a way to pipe a file into parse5 (this feature is available in htmlparser2).

similarly, is there a way to pause / resume the sax parser ? I believe that the getNextToken method could be used to decide when the parsing should pause/resume.

I have been using the html-tokenize streaming parser lately and its suite (html-select, trumpet) but reaching html5 conformity on html-tokenize is still a long way to go so I am trying to see how parse5 could fit in and be used with html-select

Conditional comments

It would be great if your SAX parser could give the information about the conditional comments.

Incorrect AST

Here is a piece of code whose AST is not correct

<b>
  <i>
</b>

(there is a newline after the last tag)
The resultant AST is
root -> 1 child (html) -> 2 children (head,body)
body -> 2 children (b, i)
b -> 1 child (#text, i)
i -> 1 child (#text)

The key problem is that i is included twice (inside b and after b)

Provide the way to handle document writes

@domenic @Sebmaster
Hi guys, I just found this conversation on Twitter. Since I've deleted my Twitter account and couldn't find a related issue in the jsdom issues, I decided to open the conversation here. I don't think it's a good idea to get into the Preprocessor internals, and in general modifying just this.html will not work. I think a better way is to implement this API on my side: provide a script invocation mechanism during parsing (as is done in real user agents), so that the script can receive the unfinished tree. E.g.:

//parser.parse(html, scriptHandler)
var result = parser.parse(html, function(scriptContent, document, writeHtmlFunc, async)  {
       //writeHtmlFunc can be used to inject html, passed to document.write, into parser
       //async flag from async attr can be used to postpone script execution after parsing 
});

Please, let me know your opinion.

Bug in documentation

Here is a bug in the example!

Instead of

parser.parse('<body>Yo!</body>');

this

parser5.parse('<body>Yo!</body>');

is needed.

Parsing is incorrect when having <meta> tagName

I have used parse5.SimpleApiParser, but when there is a <meta> tag that is self-closing, the parser:

  1. Does not pass the selfClosing parameter to the startTag callback
  2. Does not call the endTag callback

<meta charset="utf-8">

NPM version is out of date

Hey, could you update the npm version and publish it? The tree adapters and serializer are really needed.

Closing-self-closing tags

Can this situation be handled in SAX-parser?

</bla/>

It is a self-closing and closing tag.)

Or a more likely variant:

<bla//>

For example, I want to check the validity...

Support for Web Browser?

Could you tell me whether parse5 can also be executed inside a web browser?

tern https://github.com/marijnh/tern provides this cool feature. You can execute the JS inference engine

  • inside node
  • or web browser.

To fix the problem with require, it uses this code:

(function(root, mod) {
  if (typeof exports == "object" && typeof module == "object") // CommonJS
    return mod(exports, require("./infer"), require("./signal"),
               require("acorn/acorn"), require("acorn/util/walk"));
  if (typeof define == "function" && define.amd) // AMD
    return define(["exports", "./infer", "./signal", "acorn/acorn", "acorn/util/walk"], mod);
  mod(root.tern || (root.tern = {}), tern, tern.signal, acorn, acorn.walk); // Plain browser env
})(this, function(exports, infer, signal, acorn, walk) {
...

See https://github.com/marijnh/tern/blob/master/lib/tern.js

Duplicate tags' attributes

Hi!

Here's the HTML code:

<title param="1" param="2"></title>

It will be parsed as:

{ nodeName: 'title',
  tagName: 'title',
  attrs: [ { name: 'param', value: '1' } ],
  namespaceURI: 'http://www.w3.org/1999/xhtml',
  childNodes: [],
  parentNode: 
   { nodeName: '#document-fragment',
     quirksMode: false,
     childNodes: [ [Circular] ] } }

As you can see, the output DOM contains no information about the second, duplicate attribute param="2"!

Yes, I know that this is not a real, widespread problem, but for my task handling it is really vital.
Can you handle this situation? I would like to receive this in the output DOM:

{ nodeName: 'title',
  tagName: 'title',
  attrs: [ { name: 'param', value: '1' },
           { name: 'param', value: '2' } ],
  namespaceURI: 'http://www.w3.org/1999/xhtml',
  childNodes: [],
  parentNode: 
   { nodeName: '#document-fragment',
     quirksMode: false,
     childNodes: [ [Circular] ] } }
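
For context, the HTML specification tells tokenizers to treat a repeated attribute name as a parse error and keep only the first occurrence, which matches the current output. A validator that wants both can keep the dropped ones in a separate list. The sketch below assumes access to the raw attribute sequence before de-duplication; `splitDuplicateAttrs` is a hypothetical helper, not a parse5 API.

```javascript
// Splits a raw attribute list into the attributes an HTML parser keeps
// (first occurrence of each name) and the duplicates it would drop.
function splitDuplicateAttrs(rawAttrs) {
  const seenNames = new Set();
  const attrs = [];
  const duplicates = [];
  for (const attr of rawAttrs) {
    if (seenNames.has(attr.name)) {
      duplicates.push(attr);
    } else {
      seenNames.add(attr.name);
      attrs.push(attr);
    }
  }
  return { attrs, duplicates };
}

const result = splitDuplicateAttrs([
  { name: 'param', value: '1' },
  { name: 'param', value: '2' }
]);
// result.attrs      -> [ { name: 'param', value: '1' } ]
// result.duplicates -> [ { name: 'param', value: '2' } ]
```

Keeping the duplicates in a side list leaves `attrs` spec-conformant for ordinary consumers while still giving validation tools what they need.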

Avoid adding empty elements

Hey I was wondering if there is a way to NOT add elements automatically to the AST which are not there in HTML. I'll explain using an example
HTML:
<html>
    <head>
    </head>
</html>

Now the html doesn't have any body tag, so I don't want that in the parse tree, whereas right now html has 2 childNodes - head and body.

getTagName in adapters misnamed?

In jsdom 3.0.0, we removed tagName from Node and moved it to Element.

This seems to have caused jsdom/jsdom#1004. I traced it to the adapter code:

exports.getTagName = function (element) {
  return element.tagName.toLowerCase();
};

It seems this gets called with a text node, which no longer has a tagName.

If I switch it to nodeName, it seems to work.

Should the adapter hook be renamed to getNodeName?

Handle boolean attributes?

I am building an html2xhtml converter on top of parse5 (https://github.com/cburgmer/http2xhtml.js/blob/master/lib/converter.js).

It seems I need to handle boolean attributes (e.g. <input type="checkbox" checked>) myself.

parse5 will return such an attribute as {name: "checked", value: ""}. This does not allow me to distinguish from actual attributes with empty string values.

I could not find much more information on this kind of attribute, so I am posing this as a question: should the parser know about those attributes and return something accordingly?
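
Until that is settled, one option on the converter side is a lookup table of known boolean attribute names. The sketch below assumes the `{name, value}` attribute shape described above; `BOOLEAN_ATTRS` is a deliberately partial, illustrative list and `toXhtmlAttr` is a hypothetical helper.

```javascript
// Partial list of HTML boolean attributes (illustrative, not exhaustive).
const BOOLEAN_ATTRS = new Set([
  'checked', 'disabled', 'selected', 'readonly',
  'multiple', 'required', 'hidden', 'autofocus'
]);

function escapeAttrValue(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// Serializes one attribute for XHTML. An empty value on a known boolean
// attribute is expanded to name="name"; everything else is kept as-is.
function toXhtmlAttr(attr) {
  const isBoolean = attr.value === '' && BOOLEAN_ATTRS.has(attr.name.toLowerCase());
  const value = isBoolean ? attr.name : attr.value;
  return attr.name + '="' + escapeAttrValue(value) + '"';
}
```

With this table, `{name: "checked", value: ""}` becomes `checked="checked"`, while a genuinely empty attribute such as `alt=""` is left empty.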

Add <template> support.

At the time parse5 was started, templates were in draft. Now that they are supported by all browsers except IE (surprise!), it's time to land them in parse5.
