
atlas-html-stream's Introduction

atlas-html-stream

A super fast HTML parser stream that outputs tag, text, and closing nodes.

install

npm install --save atlas-html-stream

why

I didn't like htmlparser2's streaming API, and I wanted an HTML parser that collapses text-node whitespace by default. I also wanted to see if I could write a faster parser.

performance

htmlparser-benchmark

The following benchmark was done on my local machine using htmlparser-benchmark. Lower is better:

atlas-html-stream: 4.91208 ms/file ± 6.08846
htmlparser2: 7.06549 ms/file ± 5.09643

I only tested against htmlparser2 because it was the fastest stream-based parser. This benchmark uses the streaming interface of both parsers.

memory usage

A simple parser requires the entire body to be in memory, which can be a deal-breaker if your HTML is very large. This parser is a stream and benefits from chunked data. The basic idea is to keep track of the current chunk and anything from the previous chunk which spills over into the current chunk. For example, consider the following two chunks:

  1. "first data chunk <di"
  2. "v>second chunk</div>"

When the parser is done with the first chunk, it recognizes that it needs to hold onto "<di" and only flushes "first data chunk " from memory. When the parser is done with the second chunk, it flushes everything from memory, since there's no longer a pending node.
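To see the buffering in action, you can write those chunks to the parser by hand (a minimal sketch; the chunk boundaries are contrived):

const HtmlParser = require("atlas-html-stream");
const parser = new HtmlParser();

parser.on("data", node => console.log(node));
// emits a text node for "first data chunk", holds "<di" in memory
parser.write("first data chunk <di");
// emits the <div> opening, text and closing nodes, then empties memory
parser.write("v>second chunk</div>");
parser.end();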

examples

Using this parser is easy: it extends the stream interface, so all you need to do is pipe or write HTML to it and listen for "data" events:

with piping

The stream interface is recommended, because it allows you to consume a constant amount of memory (regardless of HTML size):

const { createReadStream } = require("fs");
const HtmlParser = require("atlas-html-stream");
const myHtmlFile = createReadStream("./index.html");

myHtmlFile.pipe(new HtmlParser()).on("data", ({name, data, text}) => {
  if (text){
    // text node
    console.log(text);
  } else if (name && data){
    // opening tag with (potentially empty) attributes
    console.log(name, data);
  } else {
    // closing tag
    console.log(name);
  }
}).on("end", () => {
  console.log("parsed all html");
});

without piping

This is not recommended unless you can afford to have your HTML exist entirely in memory. First, we'll write a small helper to abstract away the streaming interface:

// helpers.js
const HtmlParser = require("atlas-html-stream");
const parser = new HtmlParser();

const parseHtml = html => {
  const nodes = [];
  const onData = data => nodes.push(data);
  parser.on("data", onData);
  parser.write(html);
  // reset the parser back to its vanilla state
  parser.reset();
  // detach the listener so repeated calls don't stack handlers
  parser.removeListener("data", onData);
  return nodes;
}

module.exports = { parseHtml };

Next, we'll use our helper to parse our in-memory HTML file:

const { readFileSync } = require("fs");
const { parseHtml } = require("./helpers");
const file = readFileSync("./index.html");
const nodes = parseHtml(file);
// do what you want with the parsed nodes.

comment, script and style nodes

The examples above show how minimal the API is, as it should be. Comment, script and style nodes are treated as regular nodes, in that they emit opening and closing tags. The main difference is that the content inside these nodes is treated as a single raw text node with no whitespace collapsing. For simplicity, the following examples use our parseHtml helper from above:

script/style tags

...
const scriptNodes = parseHtml(`
  <script src="./script.js">
    const myVar = 5;
  </script>
`);
scriptNodes.forEach(n => console.log(n))
// { name: 'script', data: { src: './script.js' } }
// { text: '\n    const myVar = 5;\n  ' }
// { name: 'script' }

Style tags are treated the same way, except their name property has a value of "style".

comment tags

Again, these are treated the same way.

const commentNodes = parseHtml(`
  <!-- 
    this is 
    a comment 
  -->
`)
commentNodes.forEach(n => console.log(n));
// { name: '!--', data: {} }
// { text: ' \n    this is \n    a comment \n  ' }
// { name: '!--' }

keeping whitespace

By default, the parser collapses all whitespace outside of script, style and comment nodes down to a single space, which is useful if you're scraping and don't care about whitespace. In some cases, you need the whitespace. You can pass a preserveWS option to the constructor, which forces the parser to keep all whitespace it finds in text nodes. The example below assumes our parseHtml helper was built with new HtmlParser({preserveWS: true}):

const nodes = parseHtml(`
  This is some text

      <b> 
        Hola
          </b>  
`)
nodes.forEach(n => console.log(n));
// { text: "\n  This is some text\n\n      " }
// { name: "b", data: {} }
// { text: " \n        Hola\n          " }
// { name: "b" }
// { text: "  \n" }

You can then post-process the nodes if you need more fine-grained control. Note that the example above doesn't use a streaming interface, for simplicity; in practice, the streaming interface is highly recommended:

myFile.pipe(new HtmlParser({preserveWS: true})).pipe(myTransform).on("data", node => {...})

myTransform can conditionally check text nodes and post-process them one-by-one.
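For instance, myTransform might be an object-mode Transform along these lines (a sketch only; the trimming policy is just an illustration):

const { Transform } = require("stream");

const myTransform = new Transform({
  objectMode: true, // the parser emits node objects, not buffers
  transform(node, encoding, done) {
    if (node.text) {
      const text = node.text.trim();
      if (text) this.push({ text }); // drop whitespace-only text nodes
    } else {
      this.push(node); // pass tag nodes through untouched
    }
    done();
  }
});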

todo

parseHtml helper

Should parseHtml be exported from this package in addition to HtmlParser? Using parseHtml is not recommended over the streaming interface, but it seems like a valid helper for cases where the html string needs to be in memory.

even faster

I'd like to make this thing even faster. The parsing itself takes about 3.5 ms/file (using htmlparser-benchmark) on my machine. Pushing nodes as data events to our stream adds around 40% more processing time, which is why the benchmark above shows around 4.9 ms/file -- this can't be avoided, because we want the streaming interface.

The SeqMatcher slows down this parser (checking comment, script and style nodes); there might be a faster way to handle these special nodes.

Switching on the state of the parser first is probably faster than switching on the current character first, but I haven't tested the character-first approach. The main idea is to minimize the number of instructions required for each [state, char] pair, weighted by how likely each pair is. If [TEXT, !whitespace] has the highest probability in "typical" html, then we'd want this pair to require the fewest instructions. In other words, sum_i(P(pair_i) * numSteps_i) should be minimized.
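As a toy illustration of the expected-cost argument (not parser code), here's the cost of an ordered condition chain where the i-th branch fires with probability probs[i] and costs i + 1 checks:

// Expected number of checks per character for a given branch ordering.
const expectedCost = probs =>
  probs.reduce((sum, p, i) => sum + p * (i + 1), 0);

expectedCost([0.7, 0.2, 0.1]); // 1.4 -- likeliest branch first
expectedCost([0.1, 0.2, 0.7]); // 2.6 -- likeliest branch last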

I think collapsing the states SCRIPT, STYLE and COMMENT into a single state SPECIAL might improve the performance. Instead of polluting the space with SeqMatchers for each one initially, we can create them on-demand for whichever SPECIAL node we are parsing, and store them in a persistent hash in case we see the same SPECIAL node again.
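A rough sketch of the on-demand cache (hypothetical: SeqMatcher is internal to this package, and its constructor is assumed here to take the character sequence to match):

// Lazily build one matcher per special tag name and reuse it if the
// same SPECIAL node appears again.
const matchers = {};
const getEndMatcher = name =>
  matchers[name] || (matchers[name] = new SeqMatcher(`</${name}>`));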

dynamic parser idea

What defines a "typical" html document? A set of parsing instructions for a "typical" html document may be slower for "atypical" html documents which have a vastly different probabilityOfPair_i distribution. What if the parser could use the distribution of pairs of the current document to dynamically change its condition-tree on-the-fly? For example, if we're getting overwhelmed by raw text, it would be faster to check if the state is TEXT first. Alternatively, if our document has almost no raw text, it would be smarter to check if the state is TEXT last. While the dynamic parser sounds interesting, it may not be worth implementing if it adds a ton of overhead.

caveats

large text nodes

Since the parser keeps pending nodes' data in memory, a large text node will cause memory issues. In the vast majority of cases, you will not run into text nodes bigger than the chunks themselves. If this becomes a problem, the parser could be modified to accept a maxTextSize parameter which it could use to chunk the individual text nodes themselves. For now, this is an extreme edge case and isn't handled.

doctype

The doctype node is treated like any other node. The input "<!DOCTYPE html>" results in the following data event: {name: "!DOCTYPE", data: {html: ""}}.

attributes with no value

Tag attributes with no value are not treated in a special way; a key with no value will have an empty string value in the node's data field. I may implement true as the default value, but this seems like something that can easily be done by the caller.
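For example, a caller could normalize valueless attributes with a small helper like this (hypothetical code, not part of the package):

// Replace empty-string attribute values with boolean true.
const normalizeAttrs = data => {
  const out = {};
  for (const key of Object.keys(data)) {
    out[key] = data[key] === "" ? true : data[key];
  }
  return out;
};

// normalizeAttrs({ href: "/home", disabled: "" })
// -> { href: "/home", disabled: true }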


atlas-html-stream's Issues

Preserve blanks when in text node

First of all: thank you for an excellent and blazing fast library. It's pretty clever to expose it as a stream; it enables us to pipe the output to another transform.

Description

We use your library (v1.0.1) for parsing and filtering markup before sending it to the client. Hence, the markup is reassembled after parsing. Environment: node v8.12.

Consider this original markup:

Title: <b>Jan Bananberg</b>

Since trailing whitespace is removed during parsing, the result after reassembly would be:

Title:<b>Jan Bananberg</b>

which will be displayed as: Title:Jan Bananberg instead of Title: Jan Bananberg

Suggestion

I fully understand the complexity of whitespace in HTML, so IMHO, may I suggest an option to preserve blank space, e.g. HtmlParser({ preserveBlankspace: true })? Leave it up to the developer, so to speak.

Forward slash breaks unquoted attribute values

Hello!

Mad props!

Thank you so much for this fantastic HTML-parser! Great stuff!

We use this, amongst other things, to parse and process our sites output markup on request, so we really appreciate the speed of this thing.

TL;DR

Unquoted attribute values with /. Yay or nay? Could it be fixed? Would it be worth it?

An imperfection

Until recently it has worked flawlessly, but we've hit a snag: we've found an issue with attribute values without quotes. The parser doesn't seem to like / characters in those values.

Quoted

Given the following markup passed to the HTML-parser...

<a href="https://www.mozilla.com/"> Absolute URL </a>
<a href="//example.com/"> Scheme-relative URL </a>
<a href="/en-US/docs/Web/HTML/"> Origin-relative URL </a>
<a href="./p/"> Directory-relative URL </a>

...we get these lovely data chunks. All is well!

{ name: 'a', data: { href: 'https://www.mozilla.com/' } }
{ text: 'Absolute URL' }
{ name: 'a' }
{ name: 'a', data: { href: '//example.com/' } }
{ text: 'Scheme-relative URL' }
{ name: 'a' }
{ name: 'a', data: { href: '/en-US/docs/Web/HTML/' } }
{ text: 'Origin-relative URL' }
{ name: 'a' }
{ name: 'a', data: { href: './p/' } }
{ text: 'Directory-relative URL' }
{ name: 'a' }

Unquoted

An optimized, equally valid version of the same markup...

<a href=https://www.mozilla.com/> Absolute URL </a>
<a href=//example.com/> Scheme-relative URL </a>
<a href=/en-US/docs/Web/HTML/> Origin-relative URL </a>
<a href=./p/> Directory-relative URL </a>

...unfortunately does not get the same result. Instead we get this mess :'r

{ name: 'a', data: { href: 'https:', 'www.mozilla.com': '' } }
{ name: 'a' }
{ text: 'Absolute URL' }
{ name: 'a' }
{ name: 'a', data: { href: 'example.com' } }
{ name: 'a' }
{ text: 'Scheme-relative URL' }
{ name: 'a' }
{ name: 'a', data: { href: 'en-US', docs: '', Web: '', HTML: '' } }
{ name: 'a' }
{ text: 'Origin-relative URL' }
{ name: 'a' }
{ name: 'a', data: { href: '.', p: '' } }
{ name: 'a' }
{ text: 'Directory-relative URL' }
{ name: 'a' }

Question

I've tinkered around with the code. By simply removing this statement https://github.com/atlassubbed/atlas-html-stream/blob/master/src/HtmlParser.js#L102 I managed to get the absolute-URL example working.

A value starting with a / character seems to be an entirely different headache :C

A value ending with /, maybe right before >, maybe even at the very beginning of the current text chunk... XO

I would be happy to help with this, even though my expertise in the matter is quite limited. I thought it'd be best to ask whether you think this can be fixed without completely destroying the parser's performance.

Disclaimer

I understand this is just a parser and optimized output, without unnecessary quotes, could be achieved post parse.

We are piping the output stream from our templating engine into this parser and then to a series of processors. The templating engine in question produces optimized markup, hence this very long question.
