andrejewski / himalaya

JavaScript HTML to JSON Parser

Home Page: http://andrejewski.github.io/himalaya

License: ISC License

JavaScript 100.00%

himalaya html parser json javascript

himalaya's Introduction

Himalaya

Parse HTML into JSON

Try online 🚀 | Read the specification 📖

Usage

Node

npm install himalaya

import fs from 'fs'
import { parse } from 'himalaya'
const html = fs.readFileSync('/webpage.html', { encoding: 'utf8' })
const json = parse(html)
console.log('👉', json)

Browser

Download himalaya.js and put it in a <script> tag. Himalaya will be accessible from window.himalaya.

const html = '<div>Hello world</div>'
const json = window.himalaya.parse(html)
console.log('👉', json)

Himalaya bundles well with Browserify and Webpack.

Example Input/Output

<div class="post post-featured">
  <p>Himalaya parsed me...</p>
  <!-- ...and I liked it. -->
</div>

[
  {
    type: 'element',
    tagName: 'div',
    attributes: [
      {
        key: 'class',
        value: 'post post-featured',
      },
    ],
    children: [
      {
        type: 'element',
        tagName: 'p',
        attributes: [],
        children: [
          {
            type: 'text',
            content: 'Himalaya parsed me...',
          },
        ],
      },
      {
        type: 'comment',
        content: ' ...and I liked it. ',
      },
    ],
  },
]

Note: In this example, text nodes consisting of whitespace are not shown for readability.

Features

Synchronous

Himalaya transforms HTML into JSON, that's it. Himalaya is synchronous and does not require any complicated callbacks.

Handles Weirdness

Himalaya handles a lot of HTML's fringe cases, like:

Closes unclosed tags <p><b>...</p>
Ignores extra closing tags <span>...</b></span>
Properly handles void tags like <meta> and <img>
Properly handles self-closing tags like <input/>
Handles <!doctype> and <!-- comments -->
Does not parse the contents of <script>, <style>, and HTML5 <template> tags

Preserves Whitespace

Himalaya does not cut corners and returns an accurate representation of the HTML supplied. To remove whitespace, post-process the JSON; check out an example script.
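
As a rough illustration of such post-processing, the sketch below drops whitespace-only text nodes from a parsed tree. It assumes the node shape shown in the example output above (`type`, `content`, `children`); `removeWhitespaceNodes` is a hypothetical helper name, not part of Himalaya, and the sample tree is hand-written rather than real parser output.

```javascript
// Hypothetical helper (not part of Himalaya): recursively drop text nodes
// whose content is only whitespace. Returns a new tree; the input is not mutated.
function removeWhitespaceNodes(nodes) {
  return nodes
    .filter((node) => !(node.type === 'text' && node.content.trim() === ''))
    .map((node) =>
      node.type === 'element'
        ? { ...node, children: removeWhitespaceNodes(node.children) }
        : node
    )
}

// Hand-written sample in the shape of the example output above.
const tree = [
  {
    type: 'element',
    tagName: 'div',
    attributes: [],
    children: [
      { type: 'text', content: '\n  ' },
      {
        type: 'element',
        tagName: 'p',
        attributes: [],
        children: [{ type: 'text', content: 'Himalaya parsed me...' }],
      },
      { type: 'text', content: '\n' },
    ],
  },
]

const cleaned = removeWhitespaceNodes(tree)
console.log(JSON.stringify(cleaned, null, 2))
```

Note that this also removes whitespace inside elements such as `<pre>`, where it is significant, so a real script would probably want an exclusion list.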

Line, column, and index positions

Himalaya can include the start and end positions of nodes in the parse output. To enable this, pass parse a copy of parseDefaults extended with includePositions: true:

import { parse, parseDefaults } from 'himalaya'
parse('<img>', { ...parseDefaults, includePositions: true })
/* =>
[
  {
    "type": "element",
    "tagName": "img",
    "attributes": [],
    "children": [],
    "position": {
      "start": {
        "index": 0,
        "line": 0,
        "column": 0
      },
      "end": {
        "index": 5,
        "line": 0,
        "column": 5
      }
    }
  }
]
*/
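
One possible use of these positions is mapping a node back to the text it came from. The sketch below works on a hand-written node copied from the output above; `sourceOf` and `describeStart` are hypothetical helper names, and the sketch assumes the end index is exclusive, which the `<img>` example suggests but does not prove for every node type.

```javascript
// Hypothetical helper (not part of Himalaya): recover the source text of a node.
// Assumes position.start.index is inclusive and position.end.index is exclusive,
// as the <img> output above suggests (0 and 5 for a 5-character tag).
function sourceOf(html, node) {
  return html.slice(node.position.start.index, node.position.end.index)
}

// Hypothetical helper: the positions above are zero-based, so add 1 to get
// editor-style "line:column" labels.
function describeStart(node) {
  const { line, column } = node.position.start
  return `${line + 1}:${column + 1}`
}

// Hand-written node copied from the output shown above.
const imgNode = {
  type: 'element',
  tagName: 'img',
  attributes: [],
  children: [],
  position: {
    start: { index: 0, line: 0, column: 0 },
    end: { index: 5, line: 0, column: 5 },
  },
}

console.log(sourceOf('<img>', imgNode)) // prints "<img>"
console.log(describeStart(imgNode)) // prints "1:1"
```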

Going back to HTML

Himalaya provides a stringify method. The following example parses the HTML to JSON, then stringifies the JSON back into HTML.

import fs from 'fs'
import { parse, stringify } from 'himalaya'

const html = fs.readFileSync('/webpage.html', { encoding: 'utf8' })
const json = parse(html)
fs.writeFileSync('/webpage.html', stringify(json))

Why "Himalaya"?

First, my friends weren't helpful. Except Josh, Josh had my back.

While I was testing the parser, I threw a download of my Twitter homepage in and got a giant JSON blob out. My code editor Sublime Text has a mini-map and looking at it sideways the data looked like a never-ending mountain range. Also, "himalaya" has H, M, L in it.

himalaya's Issues

add code coverage

I think we have good code coverage, but I want to know for certain so we can either:

show off
feel bad and work to get something we can show off

We can use coveralls.io for the honor/guilt badge.

Getting tags that are not defined

I'm currently experimenting with DOM -> AST -> DOM, and I was wondering if there is any way of getting information about tags that are not defined as standard HTML tags.

For example, I have the following string:

<div>
    <h1>Hi there</h1>
    <result></result>
</div>

But result is not a defined tag and, moreover, it is empty, so I can't tell from the AST that it exists in the string.

Any suggestions? :)

Nested Unordered Lists Error

I was testing nested unordered lists and noticed that the "li" tag within the nested "ul" became a child of the top-level "ul" when parsing. Example below:

<ul><li>TEXT<ul><li>SUBTEXT</li></ul></li></ul>

Results in (note: triple asterisks added to highlight the JS object in question):

[ { "type": "Element", "tagName": "ul", "attributes": {}, "children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "TEXT" }, { "type": "Element", "tagName": "ul", "attributes": {}, "children": [] } ] }, ***{ "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "SUBTEXT" } ] }*** ] } ]

Expected (note: triple asterisks added to highlight the JS object in question):

[ { "type": "Element", "tagName": "ul", "attributes": {}, "children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "TEXT" }, { "type": "Element", "tagName": "ul", "attributes": {}, ***"children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "SUBTEXT" } ] } ]*** } ] } ] } ]

What's interesting is it appears to only be limited to "ul" and "li" (at least in my limited testing). Similar markup with "div", "span" and a nested "div" seems to work fine.

<div><span>SPAN TEXT<div>NESTED DIV TEXT</div></span></div>

Confirmed in my own code and at https://jew.ski/himalaya/

properties issue

In some cases HTML properties are set as attributes. But a property should have a boolean value: true / false.

With himalaya it's possible to set an HTML property as, for example, checked="checked", if we say that "checked" is a property. It should have been checked="true".

ES2015 template string support

It would be a great idea, I guess, to add support for ES2015 template strings.

isindex and _charset_ attribute issues

Just study the specs:

https://html.spec.whatwg.org/#attr-fe-name-isindex

A lot of edge cases not handled correctly

There are a lot of edge cases not handled correctly:

tagName with spaces will be compiled totally wrong
attributes with spaces will be compiled totally wrong
standalone tagName freaks out - </div>
crazy tagName freaks out - <</div>>
unfinished tagName freaks out - <div>
self-closing tag with spaces
self-closing tag with spaces, trailing text
normalize whitespace - e.g. Line one\n<br>\t \r\n\f <br>\nline two<font><br> x </font>
brackets in attribute - e.g. <div xxx="</div>">
unfinished comment - e.g. <!-- comment text or <!-- comment text -- or <!-- comment text -
unfinished attribute - e.g. <div foo="
spaces in closing - e.g. < / div > (gives a weird output)
if an attribute has no value, you are setting it to name = name, which is not valid HTML or XML syntax
namespaces - XML, XLink etc. E.g. <ns:tag>text</ns:tag>

These are only a few of the edge cases.

issues with attributes missing close quote

<div a="1><span id="foo">xxx

Unable to npm install himalaya

I'm not very familiar with "npm install", but I want this amazing tool on my Ubuntu 16 LTS system.
To do so I installed "npm" and "nodejs-legacy" with sudo apt-get install, after which I ran "npm install himalaya", but I see this:

$ npm install himalaya
npm WARN saveError ENOENT: no such file or directory, open '/home/username/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/home/username/package.json'
npm WARN username No description
npm WARN username No repository field.
npm WARN username No README data
npm WARN username No license field.

added 2 packages in 0.686s

$ npm version
{ npm: '5.0.0',
ares: '1.10.1-DEV',
cldr: '31.0.1',
http_parser: '2.7.0',
icu: '59.1',
modules: '57',
node: '8.0.0',
openssl: '1.0.2k',
tz: '2017b',
unicode: '9.0',
uv: '1.11.0',
v8: '5.8.283.41',
zlib: '1.2.11' }

Please help. Thanks.
I already tried : npm cache clean -f

Option to skip whitespaces

Preserves Whitespace

Himalaya does not cut corners and returns an accurate representation of the HTML supplied.

Is there an option to 'cut corners'? 😂

code never executed

This part of the code never got executed?

https://github.com/andrejewski/himalaya/blob/master/index.js#L35-L38

And the closing tag section:

https://github.com/andrejewski/himalaya/blob/master/index.js#L91-L93

Maybe there are missing specs, but I tried this in the browser and couldn't get these lines to execute.

attributes without quotes not supported

attributes without quotes not supported
<foo bar=baz>

can not parse cdata or close element if script tag is unclosed

Support older node versions

Issue #30 shows there are incompatibility issues with older Node versions. As Himalaya does not rely on any radical new features, we should include babel-polyfill and configure Travis CI to test against some of the more recent LTS versions.

form element isn't allowed inside phrasing content

As stated in the specs:

"For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:

<p>Welcome. <form><label>Name:</label> <input></form>

It is parsed exactly like the following:

<p>Welcome. </p><form><label>Name:</label> <input></form>"

However, if you try to parse this:

<p>Welcome. <form><label>Name:</label> <input></form>

your parser totally screws this up.

script and xmp issue when compile closing tag is not case sensitive

There is an issue when compiling a closing script tag, which should not be case sensitive.

<script></SCRIPT> outputs:

  "content": "</SCRIPT></root>"
  "tagName": "script"
  "type": "Element"

What is this root?

The same happens for xmp tags: <xmp></xmp \n >

Problem on attributes with quotes in the value.

Hi, great work with this library, but I have some problems when I use the toHTML function in translate.js. When an attribute value contains single quotes, the quotes wrapping the attribute value are single too, when I believe they should be double; the same happens with double quotes. I looked into the implementation, and the solution is a simple ! in the conditional on line 13. Here is an example of the output when I use the parser and then, after doing something with the JSON, use the toHTML method to get back to HTML:
Original HTML

<button @click="$store.dispatch('INCREMENT')" class="increment"> Increment</button>

and the Output is this

<button @click='$store.dispatch('INCREMENT')' class='increment'> Increment</button>

This causes the browser to interpret @click='$store.dispatch(' and increment')' and this is an error.
I am using this library and I hope that the issue can be solved soon. Thanks for the great work.
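
The quoting rule described in this report can be sketched as follows. This is not Himalaya's actual serializer: `serializeAttribute` is a hypothetical function that prefers double quotes, switches to single quotes when the value contains double quotes only, and escapes double quotes as `&quot;` when both kinds appear.

```javascript
// Hypothetical attribute serializer; a sketch of one safe quoting rule,
// not the code from translate.js.
function serializeAttribute(key, value) {
  const text = String(value)
  const hasDouble = text.includes('"')
  const hasSingle = text.includes("'")

  // No double quotes inside: wrapping in double quotes is always safe.
  if (!hasDouble) return `${key}="${text}"`
  // Double quotes but no single quotes inside: wrap in single quotes.
  if (!hasSingle) return `${key}='${text}'`
  // Both kinds present: wrap in double quotes and escape the inner ones.
  return `${key}="${text.replace(/"/g, '&quot;')}"`
}

console.log(serializeAttribute('@click', "$store.dispatch('INCREMENT')"))
// prints: @click="$store.dispatch('INCREMENT')"
```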

Problem with attribute name

Issue with line breaks, is it a problem?

I really don't know if this should work or not because, in the end, there's no space between the attributes and that's not right, I mean, not as valid HTML. But we're so used to browsers fixing things up for us (code like the one below does work in browsers) that I ask myself: should this work in himalaya as well?

Thanks :)

var html = require("himalaya")

var markup = `<button custom-attr-one="Hello world"
custom-attr-two="Hello title">Button</button>`;

html.parse(markup)

output:

[
    {
        "type": "Element",
        "tagName": "button",
        "attributes": {
            "customAttrOne": "Hello world\"\ncustom-attr-two=\"Hello title"
        },
        "children": [
            {
                "type": "Text",
                "content": "Button"
            }
        ]
    }
]

no support for cdata

There is no support for cdata

Problem with attribute name

Hello. I am having a problem converting JSON to HTML, but only with inputs and their name attribute.
For example, I translate this HTML to JSON
<input type="text" name="name">
and I get this JSON

[
  {
    "type": "Element",
    "tagName": "input",
    "attributes": {
      "type": "text",
      "name": "name"
    },
    "children": []
  }
]

But when I try to translate this JSON to HTML I get something like this

<input type='text' name>

As you can see, the name attribute lost its value. Could you help me fix this issue?
Thank you.

formatStyles error

if the style is

background-image:url("https://wd.geilicdn.com/bj-vshop-216085684-1496287337836-345156715_900_900.jpg.webp?w=400&")

the formatStyles function will return

{
     "backgroundImage":"url(&quot",
     "http":"//wd.geilicdn.com/bj-vshop-216085684-1496287337836-345156715_900_900.jpg.webp?w=400&amp"
}

I debugged the code and found this:

function formatStyles(str) {
     return str.trim().split(';').map(function (rule) {
          return rule.trim().split(':');
     }).reduce(function (styles, keyValue) {
……

Create TypeScript type definitions

Create TypeScript type definitions for this project so that it can be used in projects written in TypeScript.

no sanitizing or validation

There is no validation in this code. Is this going to be added, with good performance?

E.g. tagName should be validated. I have never heard of tag names like "#DIV=()" or "sPAn" etc., or ones with Chinese or Arabic letters.

As suggested in #5, validation would also be needed. Components are uppercase letters.

(Suggestion) Consider changing the attributes property from an Object to an Array

If you did change the attributes property from an Object to an Array, it would make it much easier to loop through and check the content length. This is especially useful when dynamically building elements using document.createElement(). 👌🏻
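
With the array form shown in the README's example output (`{ key, value }` entries), both styles of access stay cheap. The sketch below assumes that shape; `attributesToObject` is a hypothetical helper, not a Himalaya export.

```javascript
// Hypothetical helper: build a plain object from an array of { key, value }
// attribute entries, for callers that prefer lookup by name.
function attributesToObject(attributes) {
  const result = {}
  for (const { key, value } of attributes) {
    result[key] = value
  }
  return result
}

// Hand-written attributes in the shape of the README example.
const attributes = [
  { key: 'class', value: 'post post-featured' },
  { key: 'id', value: 'intro' },
]

// Array form: easy to count and loop over.
console.log(attributes.length) // prints 2
for (const { key, value } of attributes) {
  console.log(`${key} -> ${value}`)
}

// Object form: easy to look up by name.
console.log(attributesToObject(attributes).id) // prints "intro"
```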

docType and other tags not case sensitive

<!DoCtyPE html> should become !doctype

performance issues

Use of lastIndexOf is terribly slow. The same goes for split() and slice().

A better solution would be to skip these and just iterate through the string. That should be faster.

Use of reduce(), map() and, now in your latest change, filter() are all performance killers.

Test it out on jsperf. There are better options!

Possible to port for the browser/client?

It would be appealing to use this in a browser context. Please and thank you :)

Should support components?

Components are a must these days, and if Himalaya is to be used with virtual DOM libraries, I suggest you add support for components. Right now you have Text, Element, and Comment nodes.

What about adding a 4th node type, Component? I think that would draw users to this library.

inlineStyle and serializeAttr in translate.js has some errors

I input the AST tree that has been transformed

{
    "type": "Element",
    "tagName": "body",
    "children": [
        {
            "type": "Element",
            "tagName": "view",
            "attributes": {
                "className": [
                    "div"
                ]
            },
            "children": [
                {
                    "type": "Text",
                    "content": "\n    &lt;     &gt; a ' \" &amp;  \n    "
                }
            ]
        }
    ]
}

then I get the error

TypeError: Cannot convert undefined or null to object
    at inlineStyle (/projec_root/node_modules/himalaya/lib/translate.js:38:17)

which happens in the inlineStyle function

function inlineStyle(style) {
  return Object.keys(style).reduce(function (css, key) {
    return css + '; ' + dasherize(key) + ': ' + style[key];
  }, '').slice(2);
}

and serializeAttr also has a problem if value is null or undefined

function serializeAttr(attr, value, isXml) {
  if (!isXml && attr === value) return attr;

  try{
    console.log(value)
    //-------------- here 
    var text = value.toString();
  
    var quoteEscape = text.indexOf('\'') !== -1;
    var quote = quoteEscape ? '"' : '\'';
    return attr + '=' + quote + text + quote;
  }catch(e){
    console.log(value)
    console.log(e)
  }
  return ''
}

form tags

As you can see in htmlparser2, formtags are a special case. https://github.com/fb55/htmlparser2/blob/master/lib/Parser.js#L26-L34

TypeError: str.charAt is not a function

getting the following error when running the basic setup

v7.5.0/lib/node_modules/himalaya/lib/lexer.js:30
var isText = str.charAt(state.cursor) !== '<';

TypeError: str.charAt is not a function]

MicroOp

Just flicking through ya code and noticed this line...
https://github.com/andrejewski/himalaya/blob/master/index.js#L42
if(!str.indexOf(commentStart)) {
I take it that line is checking whether str starts with commentStart. I just wanted to point out that that's the slowest way of doing it (by a lot). Check this...
https://jsperf.com/string-startswith/25
...notice the last test? It might be ugly, but the speed increases are huge. I used it once myself and it made a huge difference. It used to be that lastIndexOf won (besides being hardcoded); I was surprised to see slice and substring winning.

While I'm here...
Those !str.indexOf, !~, etc. are real clever and all, but it's hard to know what's going on. I had to run tests in the console to figure it out, as what the hell do ya google?
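
For readers puzzled by the same idiom, the sketch below shows what `!str.indexOf(x)` evaluates to and a more explicit spelling using the standard `String.prototype.startsWith`. It makes no claim about relative speed, and the sample string and variable names are made up for illustration.

```javascript
const commentStart = '<!--'
const str = '<!-- note --><p>text</p>'

// indexOf returns 0 when the match is at the very start, and !0 is true.
// Any other result (a later index, or -1 for no match) negates to false.
const viaIndexOf = !str.indexOf(commentStart)

// The same intent, stated directly.
const viaStartsWith = str.startsWith(commentStart)

// startsWith also takes a start position, which suits a cursor-based lexer:
// index 13 is where '<p>' begins in the sample string.
const cursor = 13
const paragraphAtCursor = str.startsWith('<p>', cursor)

console.log(viaIndexOf, viaStartsWith, paragraphAtCursor) // prints: true true true
```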

how to use in html page with node.js server

Not working

Tried to convert the HTML below:

Spectacular Mountain

Got the response below:
[ { type: 'Element', tagName: '!doctype', attributes: { html: 'html' }, children: [] }, { type: 'Text', content: '\n' }, { type: 'Element', tagName: 'html', attributes: {}, children: [ [Object], [Object], [Object] ] }, { type: 'Text', content: '\n' } ]

Parser Ability To Handle Embedded Tables

Hi There,

I have a minimal set of steps to reproduce an issue that I believe is with the parser:

Read the sample file, i.e.
HtmlTemplate = fs.readFileSync(TemplateHtmlPath, {encoding: 'utf8'});
Convert it to JSON, i.e.
JsonTemplate = himalaya.parse(HtmlTemplate );
Convert the JSON back to HTML and write it to a file, i.e.
let Html = toHTML(JsonTemplate );
fs.writeFileSync(FinalHtmlPath, Html);

The HTML returned from toHTML(JsonTemplate) is very different from the HTML that was originally loaded into the program. I don't think the module currently handles embedded tables. Thoughts? I attach sample files (Test.html, script.js & style.css). As you can see from the sample HTML, it contains embedded tables.

Test.zip

strings starts with ! should be a text node, not cdata or comment etc

Strings starting with ! should be a text node, not CDATA or a comment etc.
<! foo

In this case, the tagName would be ! and that is not a valid tagName.

only one instance of babel-polyfill is allowed

Throwing this error

/home/yogesh/yogesum/api/node_modules/babel-polyfill/lib/index.js:10
  throw new Error("only one instance of babel-polyfill is allowed");
  ^

Error: only one instance of babel-polyfill is allowed
    at Object.<anonymous> (/home/yogesh/yogesum/api/node_modules/babel-polyfill/lib/index.js:10:9)
    at Module._compile (module.js:570:32)
    at Module._extensions..js (module.js:579:10)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:214:7)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/yogesh/yogesum/api/node_modules/himalaya/lib/index.js:5:1)
    at Module._compile (module.js:570:32)
    at Module._extensions..js (module.js:579:10)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:214:7)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/yogesh/yogesum/api/server/components/pdfgen/htmlParse.js:2:18)
    at Module._compile (module.js:570:32)
    at normalLoader (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:199:5)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:216:7)

content of himalaya/lib/index.js:

// line number 5
require('babel-polyfill');

var _lexer = require('./lexer');

himalaya version: 0.2.5 (project using babel for development)

volume attribute not parsed correctly

Have a look here:
https://html.spec.whatwg.org/#effective-media-volume

See section 5.

'range 0.0 to 1.0'. In your parser, you can set all kinds of numbers, including negative ones.

Also have a look at the start attribute: no negative numbers allowed.

Error: Cannot find module './lexer'

Hi!
meteor npm install himalaya

import himalaya  from 'himalaya';
const parsed = himalaya.parse(html)
return parsed;

Error: Cannot find module './lexer'

can not parse cdata if script tag is empty

empty attributes values are converted to 0.

from a html string to json

input:

<div custom-attr=''></div>

output:

[{"tagName":"div","attributes":{"customAttr":0},"children":[],"type":"Element"}]

So when converting back to html:

<div custom-attr='0'></div>

attribute with value and funky whitespace not supported

Attributes with a value and funky whitespace are not supported.

<foo bar = "baz" >

xml and xml prolog issues

This compiles totally wrong!!

<?xml version="1.0" ?>

and XML doesn't seem to be supported either. And this is not case sensitive.

README link to "full specification" is broken

https://github.com/andrejewski/himalaya/tree/master/text/ast-spec.md

Himalaya has a specification for its output. Essentially, everything is a node and can either be an `Element`, `Comment`, or `Text` node. The [full specification](https://github.com/andrejewski/himalaya/tree/master/text/ast-spec.md) provides the complete details.

Should be https://github.com/andrejewski/himalaya/blob/master/text/ast-spec-v0.md

How to parse html strings

For example, I have data fetched from a website in HTML format, like {"data" :"<p>How are you?</p>"}

How do I convert this data to JSON using himalaya?

require('himalaya') doesn't work for webpack builds in browser

Trying to use himalaya in a browser, but require('himalaya') is returning an empty object. This is due to the code in index.js:

if (typeof window === 'undefined') {
  module.exports = {default: lib, ...lib}
} else {
  window.himalaya = lib
}

I'm guessing this is to deal with the case where the file is being loaded from a script tag e.g. from a CDN. My advice would be to always do the module.exports code path, and then handle the script tag case with different build settings. If you wanted, you could build a dist/himalaya.js and dist/himalaya.min.js that people could use for this purpose.

I can make a PR if you like?

While loop break condition (possibly) wrong when trying to find closing tag in 'parse' method

Hey Chris,
Love the library. I am experiencing an issue when trying to crawl certain websites (e.g. )

In the parse method I see this snippet:

while (--_len > -1) {
 if (tagName === stack[_len].tagName) {
     stack = stack.slice(0, _len);
     nodes = stack[_len - 1].children;
     break;
  }
}

I admit I don't fully understand what's going on, but wouldn't that break for _len = 0 (after the decrement)?
It will still enter the while loop and, provided that it enters the if condition, stack is always going to be an empty array. On top of that, we will try a lookup for stack[-1], which clearly doesn't have a children prop.

Am I missing something?
Thanks again
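
To make the concern concrete, here is a stand-alone sketch of the same unwinding step with the `_len = 0` case guarded. It is not Himalaya's parser code: `closeTag` and the `rootChildren` fallback are assumptions introduced for illustration, and the real parser may avoid the problem differently (for example by keeping a root entry at the bottom of the stack).

```javascript
// Hypothetical stand-alone version of the loop above. `rootChildren` is an
// assumed top-level node list used when the stack becomes empty.
function closeTag(stack, rootChildren, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].tagName === tagName) {
      const remaining = stack.slice(0, i)
      const parent = remaining[remaining.length - 1] // undefined when i === 0
      return {
        stack: remaining,
        nodes: parent ? parent.children : rootChildren,
      }
    }
  }
  // No matching open tag: leave the stack as it is (extra closing tag ignored).
  const top = stack[stack.length - 1]
  return { stack, nodes: top ? top.children : rootChildren }
}

const rootChildren = []
const div = { tagName: 'div', children: [] }
const p = { tagName: 'p', children: [] }

// Closing the outermost element empties the stack without reading stack[-1].
const result = closeTag([div, p], rootChildren, 'div')
console.log(result.stack.length, result.nodes === rootChildren) // prints: 0 true
```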

Couldn't find preset "es2015" using react native

Hello, I'm trying to use this package with react native v48.4, but I ran into this issue. I am open to help out if possible.

Don't force lowercase on anchors

Hello,

First, I would like to thank you for this awesome library. Works great.

I would like to add support for camel-cased anchors, which is useful for special HTML templating systems.
Right now, if we put camel case in HTML, himalaya converts it to lower case:

var himalaya = require('himalaya');
var toHTML = require('himalaya/translate').toHTML;
toHTML( himalaya.parse('<div><specialAnchor><specialValue1>1</specialValue1><specialValue2>2</specialValue2></div>') )
// returns "<div><specialanchor><specialvalue1>1</specialvalue1><specialvalue2>2</specialvalue2></specialanchor></div>"

Since JSON supports (and uses) camelCase, I think himalaya should provide an option to support it.
From looking into the source, it seems that this line is the problem:

const tagName = tagToken.content.toLowerCase() // parser.js:44

Since HTML supports uppercase for anchors, I think himalaya should return exactly what was parsed through toHTML(himalaya.parse(someString))

toHTML( himalaya.parse('<P>some text</P>') )
// returns '<p>some text</p>' but should return '<P>some text</P>'

What's your opinion about this? Is there another reason justifying the forced lower case on anchors?