mozilla / slowparse

A slow JS-based HTML parser with good error feedback and debugging metadata.

Home Page: http://mozilla.github.com/slowparse/

License: Mozilla Public License 2.0

slowparse's Introduction

Slowparse, a friendlier HTML5 parser

Slowparse is an experimental JavaScript-based HTML5 parser born out of Mozilla Webmaking initiatives. A live demo of Slowparse can be found over at http://mozilla.github.io/slowparse

Installing Slowparse

The Slowparse library can be used both in the browser and in environments that support CommonJS modules, such as Node.js, either by including it as a script resource:

<script src="slowparse.js"></script>

or as a module import, by installing it using npm:

$> npm install slowparse

After installing, Slowparse can then be required into your code like any other module:

var Slowparse = require("slowparse");

Using Slowparse

To use Slowparse, call its .HTML function:

var result = Slowparse.HTML(document, '... html source here ...', options);

This function takes a DOM context as first argument, and HTML5 source code as second argument. The options object is optional, and if used can contain:

options.errorDetectors

This is an array of "additional parsers" that will be called as detector(html, domBuilder.fragment) when no errors are found by Slowparse. These can be useful when you have additional constraints on what HTML source is permitted in your own software that cannot or should not be dealt with by Slowparse itself.

This is mostly a convenience construction, and using it is equivalent to doing an if (!result.error) test and running the input through your own, additional parsers if no errors were found.
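
As an illustration, a detector could look like the minimal sketch below. It assumes that a detector signals a problem by returning an error object and returns nothing useful otherwise; that return convention, the FORBIDDEN_ELEMENT type name, and the makeForbiddenTagDetector helper are all assumptions made for this example, not part of Slowparse's documented API.

```javascript
// Hypothetical helper: builds a detector that rejects certain element names.
function makeForbiddenTagDetector(tagNames) {
  return function detector(html, fragment) {
    for (var i = 0; i < tagNames.length; i++) {
      // Match an opening tag such as "<iframe" followed by whitespace, "/" or ">".
      var pattern = new RegExp("<" + tagNames[i] + "(?=[\\s/>])", "i");
      var match = pattern.exec(html);
      if (match) {
        return {
          type: "FORBIDDEN_ELEMENT", // hypothetical error type
          start: match.index,
          end: match.index + match[0].length
        };
      }
    }
    return null; // assumed to mean "no additional errors"
  };
}

var options = { errorDetectors: [makeForbiddenTagDetector(["iframe"])] };
// var result = Slowparse.HTML(document, source, options);
```

This detector only inspects the source string and ignores the fragment argument; a detector that needs structural information could walk the fragment instead.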

options.disallowActiveAttributes

This option can be either true or false, and when true will blank out attributes when it sees any that start with on such as onclick, onload, etc.

This means the DOM formed during the Slowparse run is a tiny bit more secure, although you will still be responsible for checking for potentially harmful active content (Slowparse is not a security tool, and should not be used as such).
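
To make the effect concrete, the sketch below mimics the blanking on a plain list of attribute records. It is not Slowparse's implementation; the blankActiveAttributes helper and the { name, value } record shape are assumptions made for illustration only.

```javascript
// Hypothetical helper: empties the value of every attribute whose name starts with "on".
function blankActiveAttributes(attributes) {
  return attributes.map(function (attr) {
    var isActive = /^on/i.test(attr.name);
    return { name: attr.name, value: isActive ? "" : attr.value };
  });
}

var cleaned = blankActiveAttributes([
  { name: "href", value: "#top" },
  { name: "onclick", value: "alert('hi')" }
]);
// cleaned[0] is unchanged; cleaned[1].value is now the empty string.

// With Slowparse itself, the option is passed in the options object:
// var result = Slowparse.HTML(document, source, { disallowActiveAttributes: true });
```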

Validating HTML

Slowparse accepts both full HTML5 documents (starting at <!doctype html> and ending in </html>) and well-formed HTML5 fragments. Any input that does not pass HTML5 validation leads to a result output with an error property:

var result = Slowparse.HTML(document, '<a href+></a>');
console.log(result.error);
/*
  {
    type: 'INVALID_ATTR_NAME',
    start: 3,
    end: 8,
    attribute: { name: { value: "+" }},
    cursor: 3
  };
*/

There are a large number of errors that Slowparse can generate in order to indicate not just that a validation error occurred, but also what kind of error it was. The full list of reportable errors can currently be found in the ParseErrorBuilders.js file.
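
The start and end values can be used to pull the offending text out of the source, for instance to highlight it in an editor. The sketch below assumes they are zero-based character offsets with an exclusive end, which is consistent with the example above; errorSnippet is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the part of the source an error object points at.
function errorSnippet(source, error) {
  return source.slice(error.start, error.end);
}

var source = '<a href+></a>';
var error = { type: 'INVALID_ATTR_NAME', start: 3, end: 8 };
var snippet = errorSnippet(source, error); // "href+"
```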

Using validated HTML

If Slowparse yields a result without an .error property, the input HTML is considered valid HTML5 code, and can be injected into whatever context you need it injected into.

var input = "...";

var result = Slowparse.HTML(document, input);

if (!result.error) {
  activeContext.inject(input);
} else {
  notifyUserOfError(result.error);
}

Note that Slowparse generates an internal DOM for validation that can be tapped into, as result.document. If disallowActiveAttributes is not set during parsing, this DOM should be identical to the one built by simply injecting your source code. If disallowActiveAttributes:true is used, this DOM will be the same as the one built by the browser, with the exception of on... attributes, which will have been forced empty to prevent certain immediate script actions from kicking in.

Getting friendlier error messages

By default, Slowparse generates error objects. However, if you prefer human-readable error messages, the ./locale/ directory contains a file en_US.json that consists of English (US) localized error snippets. These are bits of HTML5 with templating variables that can be instantiated with the corresponding error object.

For example, if you are getting a MISSING_CSS_BLOCK_CLOSER error, the locale file specifies the following human-friendly error:

<p>Missing block closer or next property:value; pair following
<em data-highlight='[[cssValue.start]],[[cssValue.end]]'>[[cssValue.value]]</em>.</p>

We can replace [[cssValue.start]] with Slowparse's result.error.cssValue.start and [[cssValue.end]] with result.error.cssValue.end, and the same for cssValue.value, to generate a functional error. For instance, if there is an error in a CSS block after a property background:white, with "white" on the 24th character in the stream, the error might resolve as:

<p>Missing block closer or next property:value; pair following
<em data-highlight='24,29'>white</em>.</p>

Note that Slowparse has no built-in mechanism for generating these errors: it only provides you with the error objects that result from parsing, and the locale file for resolving error objects to uninstantiated human-readable HTML snippets.
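
Since the instantiation step is left to you, here is one minimal way it might be done. The instantiateTemplate helper is hypothetical; it only assumes that placeholders have the [[dotted.path]] form shown above and that each path can be looked up on the error object.

```javascript
// Hypothetical helper: fills [[path.to.value]] placeholders from an error object.
function instantiateTemplate(template, error) {
  return template.replace(/\[\[([\w.]+)\]\]/g, function (placeholder, path) {
    var value = path.split(".").reduce(function (obj, key) {
      return obj == null ? undefined : obj[key];
    }, error);
    // Leave unknown placeholders untouched rather than printing "undefined".
    return value === undefined ? placeholder : String(value);
  });
}

var template =
  "<p>Missing block closer or next property:value; pair following\n" +
  "<em data-highlight='[[cssValue.start]],[[cssValue.end]]'>[[cssValue.value]]</em>.</p>";
var error = {
  type: "MISSING_CSS_BLOCK_CLOSER",
  cssValue: { start: 24, end: 29, value: "white" }
};
var message = instantiateTemplate(template, error);
```

Values such as cssValue.value come straight from user input, so code that injects the resulting message into a page would likely want to HTML-escape them first.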

Working on Slowparse

The slowparse code is split up into modules, located in the ./src directory, which are aggregated by ./src/index.js for constructing the slowparse library. This construction is handled by browserify, and runs every time the npm test command is run, yielding a rebuilt slowparse.js.

If you wish to help out on Slowparse: we try to keep Slowparse test-driven, so if you have bad code that is being parsed incorrectly, create a new test case in the ./test/test-slowparse.js file. To see how tests work, simply open that file and have a look at the various tests already in place. Generally all you need to do is copy-paste a test case that's similar to what you're testing, and change the description, input HTML, and test summary for pass/fail results.

Passing all tests is the basic prerequisite to a patch for Slowparse landing, so make sure your code comes with tests and all of them pass =)

slowparse's Issues

Error when using slashes in CSS class names

Reported from Khan Academy user:
https://www.khanacademy.org/computer-programming/css-character-escaping/6037177738330112

Apparently slashes are valid CSS class names (it takes a lot of digging to verify this, but it seems to be the case). Slowparse does not like them, however, and reports 'Missing block opener after ..'.

Replicate in Thimble with this code:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>CSS Character Escaping</title>
        <style>
            .\. {
                color: red;
            }
            .\! {
                color: green;
            }
            .\, {
                color: blue;
            }
            .\; {
                color: purple;
            }
            .\{ {
                color: purple;
            }
        </style>
    </head>
    <body>
        <p class=".">My class name is "."</p>
        <p class="!">My class name is "!"</p>
        <p class=",">My class name is ","</p>
        <p class=";">My class name is ";"</p>
        <p class="{">My class name is "{"</p>
    </body>
</html>

vendor-prefixed CSS isn't recognized

It might be cool if folks could use vendor-prefixed CSS, or at least some nice subset of it like -o-transform and -moz-transition. Alternatively, a somewhat sneaky thing to do might be to take any transform declarations and magically turn them into their vendor-specific equivalents after parsing...

Either way, not that high a priority I guess.

signal a special error when <?xml is detected

It'll probably be a good idea to signal, using a special error, that slowparse is an HTML5 parser, not an XHTML parser, when <?xml is detected at the start of the stream.
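
Until such an error exists, a client could run a small pre-check before handing the source to Slowparse. The sketch below is one possible shape for it; the detectXmlDeclaration name and the XML_DECLARATION_FOUND type are hypothetical and not part of Slowparse.

```javascript
// Hypothetical pre-check: reports an XML declaration at the start of the stream.
function detectXmlDeclaration(html) {
  var match = /^\s*<\?xml/i.exec(html);
  if (!match) return null;
  var end = match[0].length; // offset just past "<?xml"
  return {
    type: "XML_DECLARATION_FOUND", // hypothetical error type
    start: end - "<?xml".length,
    end: end
  };
}

var found = detectXmlDeclaration('<?xml version="1.0" encoding="UTF-8"?>\n<html></html>');
// found.start is 0 and found.end is 5
```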

pasting code in editor does not stretch with scrollbar, but keeps it weirdly sized

pasting code is important, and should trigger a reflow instead of whatever it's doing now.

content after </html> should flag an error

We currently do not check whether there is anything after </html> in the code. If there is, ideally, we should throw an error that can be used to inform the user they have content in the wrong place.
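
A rough version of this check can be sketched on the source string alone. The function name and the CONTENT_AFTER_HTML_CLOSE type below are hypothetical, and the regular expression would be fooled by a literal </html> inside a comment or script, so treat this as an illustration rather than a fix.

```javascript
// Hypothetical check: reports non-whitespace content after the last </html>.
function detectContentAfterHtml(html) {
  var closer = /<\/html\s*>/gi;
  var match;
  var last = null;
  while ((match = closer.exec(html)) !== null) {
    last = match; // remember the final closing tag in the source
  }
  if (!last) return null;

  var tailStart = last.index + last[0].length;
  var tail = html.slice(tailStart);
  var firstNonSpace = tail.search(/\S/);
  if (firstNonSpace === -1) return null; // only whitespace follows </html>

  return {
    type: "CONTENT_AFTER_HTML_CLOSE", // hypothetical error type
    start: tailStart + firstNonSpace,
    end: tailStart + tail.replace(/\s+$/, "").length
  };
}
```

If detectors are allowed to return error objects (an assumption), a function like this could be passed in options.errorDetectors so that it only runs on otherwise valid input.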

Non-error when missing a semi-colon in CSS

This is currently not an error:

font-weight: 200
font-size: hmm;

technically css doesn't care about newlines (as far as I know) so without permitted-values-for-properties parsing we either keep this in, or we add a soft-rule that newlines break up values. Which would have the most educational value?

add in @instruction parsing for newer CSS constructions (keyframes, etc)

There's a number of @instruction selector replacements and prefixes that slowparse does not understand right now.

update the README to explain that slowparse's DOM is for validation only

To prevent people from inserting the Slowparse-generated DOM into their documents, which is going to be different from the one that the browser will generate itself due to shortcuts and ignores. Slowparse should be used as validator, and a "no errors" report should be taken to mean the source passed into it can be safely injected into a document.

gh-pages front end

With a view to making more of a community around slowparse, it might be nice to have a very simple jshint style web page that uses it, and people can try without having to do any setup. One logical way to do this would be to create a gh-pages branch that gets updated via post commit hook. Then we can always have a testable version of the tool online.

problem parsing -- in comments

from https://github.com/mozilla/webpagemaker/issues/33

solving this with https://github.com/toolness/slowparse/pull/31

Not capturing text nodes in optional-and-omitted tags

If I parse code with a <li> that omits its closing tag it is closed prematurely:

<ul>
    <li>Hello Everyone
</ul>

becomes

<ul>
    <li></li>Hello Everyone
</ul>

A message that could be improved with automatically closed tags

The following HTML:

<!DOCTYPE HTML>
<html>
    <head>
        <title>Ahora es Cuba</title>
        <meta charset="utf-8">
    </head>
    <body>
       <p><h1><a href="#bla">AHORA ES CUBA</a></h1>
        </p>
    </body>
</html>

results in "The closing </p> tag here doesn't pair with the opening <body> tag here. This is likely due to a missing </body> tag."

It took me a while to realize the problem is that they've put an <h1> inside a <p>, and that must auto-close the <p>. Do you think it's possible to do a more helpful message there?
(It took me a while to figure it out myself)

p { a { color: red; } signals "nonexistent property" for "a { color"

this is true, but it should really say "p has not been closed" instead.

'Attr.nodeValue' is deprecated. Please use 'value' instead.

warning (but deprecated, so error at any time) in DOMbuilder.

"parsing of elements with boolean attributes" test fails on IE9

Specifically, 8 of the sub-tests fail. All other tests in the whole suite succeed at the moment.

audio/video elements with autoplay attribute are inadvertently played

Suppose the HTML parsed by slowparse contains something like the following:

<audio src="..." autoplay>

Slowparse's default behavior is to create an actual DOM <audio> element and set its attributes accordingly. However, this actually causes the corresponding audio file to start playing, which is a side effect that the client probably doesn't intend.

Note that even if autoplay isn't enabled, the user's browser might still start buffering the media over the network in preparation for playback, which also probably isn't desired. This already happens with <img> elements too, though it's not quite as catastrophic as random audio playing from a user's browser.

add support for single quotes for attributes in addition to double quotes

not a problem for learning templates, big problem for arbitrary sites

SVG is not supported.

This will not stand. Working on this over at https://github.com/pomax/slowparse/tree/svg-enabled

style attribute content isn't evaluated as CSS

It'd be nice if we parsed the content of the style attribute as CSS, so that we could help users with any malformed CSS inside it.

slowparse needs to understand void elements

Currently, void elements like <img> will raise slowparse errors because they'll be interpreted as unclosed tags. Need to fix that.

<textarea> content is parsed, rather than passed over

Since a textarea can contain arbitrary data (its content is treated as CDATA rather than PCDATA), slowparse should pass it over rather than trying to parse it as if it's part of the HTML token stream. Right now this breaks slowparse:

<textarea>
  this is inconsequential html: <p> <i> <script> <embed> and other elements should not be parsed.
</textarea>

currently it flags </textarea> as not closing off the <embed> tag.

Possibly in scope: message about malformed color property

I notice many KA students are typing rgb and then a space before the (, which is invalid CSS. Example:

        #summer {color: rgb (176, 42, 176);}
        #winter {color: rgb (43, 204, 196);}

In Chrome, the little in-dev-tools-validator notices the issue and crosses it out, with "Unknown property value."

Would that level of warning be in scope for slowparse?
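
For what it's worth, the specific mistake is easy to spot with a regular expression. The sketch below is a hypothetical standalone check, not something Slowparse provides; the function name, the SPACE_BEFORE_FUNCTION_PAREN type, and the choice of color functions are assumptions.

```javascript
// Hypothetical check: finds a color function name separated from its "(" by whitespace.
function detectSpacedColorFunction(css) {
  var match = /\b(rgba?|hsla?)\s+\(/i.exec(css);
  if (!match) return null;
  return {
    type: "SPACE_BEFORE_FUNCTION_PAREN", // hypothetical error type
    functionName: match[1],
    start: match.index,
    end: match.index + match[0].length
  };
}

var css = "#summer {color: rgb (176, 42, 176);}";
var problem = detectSpacedColorFunction(css);
// css.slice(problem.start, problem.end) is "rgb ("
```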

allow whitespace to appear before the DOCTYPE

A message that could be improved with misordered closed tags

This HTML:

<strong><em>Nobody’s born smart. We all start at 0. Can’t talk, can’t walk, certainly can’t do algebra.<br>
        You can learn anything.</strong></em>

results in error:
"The closing </strong> tag here doesn't pair with the opening <em> tag here. This is likely due to a missing </em> tag."

It'd be better if it realized it's a mis-ordering issue, like "This is likely due to misordering the </strong> and </em> tags".
This was another one where it took me a bit myself to realize the problem, especially as browsers are pretty forgiving of that sort of thing.

Not sure if it's possible to detect the mis-ordering however.
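
One heuristic that might work, assuming the parser can look ahead at the next closing tag: if the tag being closed is the parent of the innermost open element, and the very next closing tag is that innermost element, the two were probably swapped. The sketch below only illustrates that idea; looksMisordered and its two array arguments are hypothetical and do not correspond to anything in Slowparse's parser.

```javascript
// openTags: names of the currently open elements, outermost first.
// closingTags: the upcoming closing tag names, in source order.
function looksMisordered(openTags, closingTags) {
  if (openTags.length < 2 || closingTags.length < 2) return false;
  var innermost = openTags[openTags.length - 1];
  var parent = openTags[openTags.length - 2];
  // Swapped pair: the parent is closed first, then the innermost element.
  return closingTags[0] === parent && closingTags[1] === innermost;
}

var swapped = looksMisordered(["strong", "em"], ["strong", "em"]); // true
var missing = looksMisordered(["strong", "em"], ["strong"]);       // false
```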

text nodes may or may not be incorrectly parsed

the DOM that is built for text nodes such as

<ul>
    <li>Hello Everyone
</ul>

ends up treating the text node as a sibling to the <li> rather than as a child. Depending on whether slowparsing is doing the right thing this may or may not require fixing

scrollbar for codemirror does not show up on pastes

when pasting a swathe of text into the editor, the scrollbar that pops into existence when manually editing does not appear to pop up correctly.

CSS comments don't forward the token position for the next selector

/* Something like this */
h1 {
  font-family: lolcatFont;
}

will cause the "h1" highlight to start at the comment's opening /* rather than forwarding token start to the start of the following css selector

Self-closing tags on void elements aren't accepted

The HTML5 Validator is okay with self-closing tags on void elements (e.g. <br/>), but not on normal elements (e.g. <p/>). In fact, it's almost definitely going to result in wonky code when used on non-void elements, because many browsers interpret it as an opening tag instead of an opening tag immediately followed by a closing tag--I once spent hours debugging a <script src="blah.js"/> only to find out that the script was never being loaded, because I self-closed the tag instead of providing a </script>.

So, I think we should do what the HTML5 validator does and accept self-closing tags on void elements, but not on non-void elements. We could potentially provide a helpful error when we run into self-closed non-void elements, though.

CSS user-select goes CSS_INVALID_PROPERTY_NAME_ERROR

Moved from https://github.com/mozilla/thimble.webmaker.org/issues/481:

@pamelafox filed:

Attempting to use this non-standard property results in the invalid property name error:
https://developer.mozilla.org/en-US/docs/Web/CSS/user-select

I can see where to add it in the code (cssProperties), but not sure if that'd conflict with philosophy about which properties "exist". Thoughts?

and @Pomax wrote:

this would be an issue with https://github.com/mozilla/slowparse, adding "user-select" to the list of known CSS properties should be all that's required (although adding a test case so that it's provably accepted would be good too of course!)

Use actual code text in error messages

Right now when something gets highlighted, for example, a broken CSS property "colo" vs. "color", we write the following error message:

"This CSS property does not exist."

Hovering over "This" with the mouse correctly highlights "colo" so it knows what's there. It would be nice if it used the incorrect text.

@font-face does not work

caused by "src" property not being in the clear list, should be fixed with the next pull merger

Slowparse should be able to report security violations

Webpagemaker mandates that users can't include JS in their HTML. It's likely that other sites might prohibit the use of JS too.

We should make an optional "plugin" for Slowparse that allows the use of JS to be reported as errors, so that users get instant feedback informing them that what they're writing won't work. We should encourage UIs built on top of Slowparse to also point the user to sandbox sites that do allow the use of JS when users try writing it.

Note that such a plugin should not actually be advertised as a sanitizer; merely a way to provide instant feedback warning that future sanitization by another agent will prevent the user's code from executing when they publish or share it.

/*- breaks CSS parsing

a CSS style element a la

<style>
/*- this is a comment */
....
</style>

seems to break CSS parsing

Parsing issue with semi-colons in :before and :after content

Reported by a user of the KA HTML/CSS environment, which uses slowparse.

Before/After Pseudoelements can't have semicolons in their content because the error checker mistakes the semicolon for the end of the CSS rule and concludes that the closing quote is a property name with no value set.

This code works:

h1::before{content:'&lt'} /* Inserts "&lt" before every header */

but this code, with a semicolon inside the quotes, does not work:

h1::before{content:'&lt;'} /* Inserts an opening angle bracket ("&lt;") before every header */

I replicated on Thimble here: https://thimble.webmaker.org/project/110284/remix
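
The underlying fix is to treat a semicolon as a declaration separator only when it is outside a quoted string. The sketch below shows that idea in isolation; splitDeclarations is a hypothetical helper written for this report, not the code Slowparse uses for CSS parsing.

```javascript
// Hypothetical helper: splits the inside of a CSS block into declarations,
// ignoring semicolons that appear inside single- or double-quoted strings.
function splitDeclarations(block) {
  var declarations = [];
  var current = "";
  var quote = null; // the quote character we are currently inside of, if any

  for (var i = 0; i < block.length; i++) {
    var ch = block[i];
    if (quote) {
      if (ch === "\\" && i + 1 < block.length) {
        // Keep escaped characters, including escaped quotes, verbatim.
        current += ch + block[++i];
        continue;
      }
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === ";") {
      if (current.trim()) declarations.push(current.trim());
      current = "";
      continue;
    }
    current += ch;
  }
  if (current.trim()) declarations.push(current.trim());
  return declarations;
}

var parts = splitDeclarations("content:'&lt;'; color: red");
// parts is ["content:'&lt;'", "color: red"]
```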

using <!-- in CSS should throw an error

it's clearly wrong. But understandable. We should throw an error anyway.

CSS block comments can lead to a parse error on @keyframes

When an @keyframes block is preceded by a CSS block comment, parsing seems to break down:

   @keyframes somelabel {
        0% { transform: scale(1); }
        50% { transform: scale(1.5); }
        100% { transform: scale(1); }
      }

works,

   /*
      Testing keyframe functionality:
   */
   @keyframes somelabel {
        0% { transform: scale(1); }
        50% { transform: scale(1.5); }
        100% { transform: scale(1); }
      }

throws a Missing block closer or next property:value; pair following @keyframes somelabel error.

Missing semi colons in CSS should cause an error

<style>
  body {
    color: red
    font-size: 16px
  }
</style>

The code above is incorrect (missing semicolons), but Slowparse doesn't report an error. For reference, the Firefox Dev Tools give the following error:

Expected end of value but found 'font-size'.  Error in parsing value for 'color'.  Declaration dropped. @ http://mozilla.github.io/slowparse/demo/:4

parse unquoted attributes too?

Not sure how prevalent this is in the wild, but we may need to support this for page remixing. It might actually not be too much work.

Problem with <video> and <source> tags

Slowparse gets this error with the following html:
"The closing tag here doesn't pair with the opening

My Popcorn Fun

add in comment ignoring for CSS

block comments are not ignored at the moment, and they should very much be.

"Missing block closer or next property:value; pair following @-moz-keyframes x" w/ CSS animations

From this tweet.

We need to bust up a test case and fix it.

incorrect 'boolean attribute' report

using the following text:

<h2><span start=</h2>

reports an unsupported boolean attribute type, even though the = should have made the parser switch to string-content attribute detection and complain there's quotes missing.

this only happens if </h2> is present.

DOMException with an unclosed attribute

With HTML like this:

<!DOCTYPE HTML>
<html>
    <head>
        <title>Challenge: A picture-perfect trip</title>
        <meta charset="utf-8">
    </head>
    <body>
        <h1>The perfect trip</h1>

        <p>I would see scenes like...</p>
        <img src="https://www.kasandbox.org/programming-images/landscapes/beach-waves-at-sunset.png"height="206" alt= "a beautiful sunset at the beach>

        <p>And animals like...</p>
        <img src="https://www.kasandbox.org/programming-images/animals/cheetah.png" height="206">

        <p>And eat food like...</p>
        <img src="https://www.kasandbox.org/programming-images/food/cake.png" height="206">

    </body>
</html>

The following error occurs:
"Failed to execute 'createAttribute' on 'Document': The qualified name provided ('https:') has an empty local name."

That is because it thinks that the quote sign ends the attribute, and thus the thing after it, https, is an attribute name, and then I think it confuses it with a namespaced tag due to the colon.

It would be nice if it could fail better if possible.
On Thimble, this results in a silent error, by the way. On KA, we are catching this and outputting "Something's wrong with the HTML, but we're not sure what."

HTML comments are unsupported

HTML comments, e.g. <!-- ... -->, are not currently recognized by slowparse.

slowparse doesn't support IE8

I tried hacking on a throwaway branch to see if it could work, and aside from obvious things like fixing Array.forEach() and String.trim(), the big problem was that text and attribute nodes don't seem to support expando properties in IE8. If we want to support IE8, we'll have to push all the parseInfo data for text and attributes up into their parent elements' parseInfo structure.

allow /> for html4/xhtml documents (with possible html5 warning)

we should not fatally break on older html closers, rather we should allow it and warn or notify

A message that could be improved with missing equal signs for attributes

Given code:

<!DOCTYPE HTML>
<html>
    <head>
        <title>Challenge: A picture-perfect trip</title>
        <meta charset="utf-8">
    </head>
    <body>
        <h1>The perfect trip</h1>

        <p>I would see scenes like oceans</p> <img src="https://www.kasandbox.org/programming-images/landscapes/beach-in-hawaii.png" alt="beach in hawaii" width="207">

        <p>And animals like sharks</p>
        <img src="https://www.kasandbox.org/programming-images/animals/shark.png" alt="shark swimming in the oceans" width="207">

        <p>And eat food like fresh grilled snapper</p>
        <img src="https://www.kasandbox.org/programming-images/food/fish_grilled-snapper.png" alt="fresh grilled snapper"
  width "207">


    </body>
</html>

It says:
"The opening <img> tag here doesn't end with a >."

However, if it said "an attribute name should be followed by an equal sign", that would be more accurate.
Not sure how possible that is, but the current one does confuse.

Slowparse should be able to recover from non-fatal errors

Ultimately, Slowparse's error reporting is really about making writing HTML/CSS easy and fun for the user rather than confusing and error-prone. Really fatal errors like unclosed quotes and comments are ideal for this, because browsers will silently fail when given such input and present documents to the user that are extremely likely to be quite different from their expectations.

However, clients should be able to build UIs on top of Slowparse that aren't obsessively pedantic. There are many "minor" errors like CLOSE_TAG_FOR_VOID_ELEMENT (see #20) which are treated as fatal but nonetheless aren't that bad in practice--browsers still work fine when such errors are present and most people write HTML for years without knowing that such things are technically incorrect.

So, we should add an option to Slowparse.HTML() that allows such errors to be reported as warnings instead of fatal errors. We might even want to simply cease reporting them as errors and only report them as warnings.

Note that we also currently have a class of fatal errors that aren't actually pedantic, but which represent "technically valid code that likely doesn't do what you expect it to". See #21 and UNQUOTED_ATTR_VALUE for examples of these. We may also want to report these as warnings rather than fatal errors.

Another option at the API level is to simply define a set of errors that can be recovered from, and allow clients to pass in a list of which ones they want to enable recovery for. This would allow individual clients to tailor their UIs to their particular audiences, rather than being forced to comply with our classifications, which are likely to be tailored to people who have never seen HTML/CSS before.
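
As a rough sketch of that last option, a client-side wrapper could already reclassify errors today, given a list of types the client considers recoverable. The classifyError helper and the { error, warning } result shape are hypothetical; only the error type names come from this issue.

```javascript
// Hypothetical helper: downgrades an error to a warning when the client
// has listed its type as recoverable.
function classifyError(error, recoverableTypes) {
  if (!error) return { error: null, warning: null };
  var recoverable = recoverableTypes.indexOf(error.type) !== -1;
  return recoverable
    ? { error: null, warning: error }
    : { error: error, warning: null };
}

var recoverableTypes = ["CLOSE_TAG_FOR_VOID_ELEMENT", "UNQUOTED_ATTR_VALUE"];
var outcome = classifyError({ type: "CLOSE_TAG_FOR_VOID_ELEMENT" }, recoverableTypes);
// outcome.warning holds the error object; outcome.error is null
```

This only relabels the first error that was reported. Real recovery would also need the parser to keep going after a recoverable error, which is the part that requires changes inside Slowparse.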

<p> text <p> text does not signal a new paragraph, but a nested element

annoying HTML thing, we may or may not want to support this.

Move locale resources from Transifex to Pontoon

We are currently using a .json file, which is being translated on Transifex. We should convert it to .properties and move it to Pontoon.