
himalaya's Introduction

Himalaya

Parse HTML into JSON


Try online 🚀 | Read the specification 📖

Usage

Node

npm install himalaya
import fs from 'fs'
import { parse } from 'himalaya'
const html = fs.readFileSync('/webpage.html', { encoding: 'utf8' })
const json = parse(html)
console.log('👉', json)

Browser

Download himalaya.js and put it in a <script> tag. Himalaya will be accessible from window.himalaya.

const html = '<div>Hello world</div>'
const json = window.himalaya.parse(html)
console.log('👉', json)

Himalaya bundles well with Browserify and Webpack.

Example Input/Output

<div class="post post-featured">
  <p>Himalaya parsed me...</p>
  <!-- ...and I liked it. -->
</div>
;[
  {
    type: 'element',
    tagName: 'div',
    attributes: [
      {
        key: 'class',
        value: 'post post-featured',
      },
    ],
    children: [
      {
        type: 'element',
        tagName: 'p',
        attributes: [],
        children: [
          {
            type: 'text',
            content: 'Himalaya parsed me...',
          },
        ],
      },
      {
        type: 'comment',
        content: ' ...and I liked it. ',
      },
    ],
  },
]

Note: In this example, text nodes consisting of whitespace are not shown for readability.

Features

Synchronous

Himalaya transforms HTML into JSON, that's it. Himalaya is synchronous and does not require any complicated callbacks.

Handles Weirdness

Himalaya handles a lot of HTML's fringe cases, like:

  • Closes unclosed tags <p><b>...</p>
  • Ignores extra closing tags <span>...</b></span>
  • Properly handles void tags like <meta> and <img>
  • Properly handles self-closing tags like <input/>
  • Handles <!doctype> and <!-- comments -->
  • Does not parse the contents of <script>, <style>, and HTML5 <template> tags

Preserves Whitespace

Himalaya does not cut corners and returns an accurate representation of the HTML supplied. To remove whitespace, post-process the JSON; check out an example script.
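
As a rough illustration of that post-processing step, here is a minimal sketch that drops whitespace-only text nodes. It assumes the node shape shown in the Example Input/Output section; removeWhitespaceNodes is a hypothetical helper name, not part of Himalaya, and it is not the example script referred to above.

```javascript
// Hypothetical helper (not part of Himalaya): returns a copy of the node
// array without text nodes whose content is only whitespace.
function removeWhitespaceNodes(nodes) {
  return nodes
    .filter((node) => !(node.type === 'text' && node.content.trim() === ''))
    .map((node) =>
      node.type === 'element'
        ? { ...node, children: removeWhitespaceNodes(node.children) }
        : node
    )
}

// Hand-written input in the shape shown in the example output above.
const parsedSample = [
  {
    type: 'element',
    tagName: 'div',
    attributes: [],
    children: [
      { type: 'text', content: '\n  ' },
      {
        type: 'element',
        tagName: 'p',
        attributes: [],
        children: [{ type: 'text', content: 'Hi' }],
      },
      { type: 'text', content: '\n' },
    ],
  },
]

console.log(JSON.stringify(removeWhitespaceNodes(parsedSample), null, 2))
```

Note that this sketch also strips whitespace-only text inside elements such as <pre>, where whitespace is significant, so a real script may want to skip those.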

Line, column, and index positions

Himalaya can include the start and end positions of nodes in the parse output. To enable this, pass parse an options object that extends parseDefaults with includePositions: true:

import { parse, parseDefaults } from 'himalaya'
parse('<img>', { ...parseDefaults, includePositions: true })
/* =>
[
  {
    "type": "element",
    "tagName": "img",
    "attributes": [],
    "children": [],
    "position": {
      "start": {
        "index": 0,
        "line": 0,
        "column": 0
      },
      "end": {
        "index": 5,
        "line": 0,
        "column": 5
      }
    }
  }
]
*/
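
One possible use of these positions is recovering the source text a node came from. The sketch below assumes that position.end.index is exclusive, which is what the '<img>' output above suggests (a 5-character string ending at index 5); sourceOf is a hypothetical helper, not part of Himalaya.

```javascript
// Hypothetical helper: slice the original HTML using a node's position.
// Assumes `end.index` is exclusive, as the '<img>' example suggests.
function sourceOf(html, node) {
  const { start, end } = node.position
  return html.slice(start.index, end.index)
}

// Node copied from the example output above.
const imgNode = {
  type: 'element',
  tagName: 'img',
  attributes: [],
  children: [],
  position: {
    start: { index: 0, line: 0, column: 0 },
    end: { index: 5, line: 0, column: 5 },
  },
}

console.log(sourceOf('<img>', imgNode)) // prints "<img>"
```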

Going back to HTML

Himalaya provides a stringify method. The following example parses the HTML to JSON, then stringifies the JSON back into HTML.

import fs from 'fs'
import { parse, stringify } from 'himalaya'

const html = fs.readFileSync('/webpage.html', { encoding: 'utf8' })
const json = parse(html)
fs.writeFileSync('/webpage.html', stringify(json))

Why "Himalaya"?

First, my friends weren't helpful. Except Josh, Josh had my back.

While I was testing the parser, I threw a download of my Twitter homepage in and got a giant JSON blob out. My code editor Sublime Text has a mini-map and looking at it sideways the data looked like a never-ending mountain range. Also, "himalaya" has H, M, L in it.

himalaya's People

Contributors

andrejewski, greenkeeper[bot], xdumaine

himalaya's Issues

add code coverage

I think we have good code coverage, but I want to know for certain so we can either:

  • show off
  • feel bad and work to get something we can show off

We can use coveralls.io for the honor/guilt badge.

Getting tags that are not defined

I'm currently playing with DOM -> AST -> DOM, and I was wondering whether there is any way of getting information about tags that are not defined as standard HTML tags.

For example, I have the following string:

<div>
    <h1>Hi there</h1>
    <result></result>
</div>

But result is not a defined tag and, moreover, it is empty, so I can't tell from the AST that it exists in the string.

Any suggestions? :)
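
For readers with the same question, one way to check whether a tag made it into the AST is to walk the tree and collect nodes by tagName. This is only a sketch: findByTagName is a hypothetical helper, and the sample tree is hand-written on the assumption that the parser emits a node for <result>, which the issue itself does not confirm.

```javascript
// Hypothetical helper: collect every node with the given tagName,
// searching children recursively.
function findByTagName(nodes, tagName) {
  const matches = []
  for (const node of nodes) {
    if (node.tagName === tagName) matches.push(node)
    if (Array.isArray(node.children)) {
      matches.push(...findByTagName(node.children, tagName))
    }
  }
  return matches
}

// Hand-written tree for the markup in the issue (whitespace nodes omitted).
const customTagTree = [
  {
    type: 'Element',
    tagName: 'div',
    attributes: {},
    children: [
      {
        type: 'Element',
        tagName: 'h1',
        attributes: {},
        children: [{ type: 'Text', content: 'Hi there' }],
      },
      { type: 'Element', tagName: 'result', attributes: {}, children: [] },
    ],
  },
]

console.log(findByTagName(customTagTree, 'result').length) // prints 1
```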

Nested Unordered Lists Error

I was testing nested unordered lists and noticed that the "li" tag within the nested "ul" became a child of the top-level "ul" when parsing. Example below:

<ul><li>TEXT<ul><li>SUBTEXT</li></ul></li></ul>

Results in (note: triple asterisks added to highlight the JS object in question):

[ { "type": "Element", "tagName": "ul", "attributes": {}, "children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "TEXT" }, { "type": "Element", "tagName": "ul", "attributes": {}, "children": [] } ] }, ***{ "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "SUBTEXT" } ] }*** ] } ]

Expected (note: triple asterisks added to highlight the JS object in question):

[ { "type": "Element", "tagName": "ul", "attributes": {}, "children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "TEXT" }, { "type": "Element", "tagName": "ul", "attributes": {}, ***"children": [ { "type": "Element", "tagName": "li", "attributes": {}, "children": [ { "type": "Text", "content": "SUBTEXT" } ] } ]*** } ] } ] } ]

What's interesting is it appears to only be limited to "ul" and "li" (at least in my limited testing). Similar markup with "div", "span" and a nested "div" seems to work fine.

<div><span>SPAN TEXT<div>NESTED DIV TEXT</div></span></div>

Confirmed in my own code and at https://jew.ski/himalaya/

properties issue

In some cases HTML properties are set as an attribute. But a property should have a boolean value: true or false.

With himalaya it's possible to set an HTML property as, for example, checked="checked", if we treat "checked" as a property. It should have been checked="true".

A lot of edge cases not handled correctly

There are a lot of edge cases not handled correctly

  • tagName with spaces will be compiled totally wrong
  • attributes with spaces will be compiled totally wrong
  • stand alone tagName freaks out - </div>
  • crazy tagName freaks out - <</div>>
  • unfinished tagName freaks out - <div>
  • self closing tag with spaces
  • self closing tag with spaces, trailing text
  • normalize whitespace - e.g. Line one\n<br>\t \r\n\f <br>\nline two<font><br> x </font>
  • brackets in attribute - e.g. <div xxx="</div>">
  • unfinished comment. e.g. <!-- comment text or <!-- comment text -- or <!-- comment text -
  • unfinished attribute - e.g. <div foo="
  • spaces in closing - e.g. < / div > (gives a weird output)
  • if no value on an attribute, you are setting it to name = name, which is not valid HTML or XML syntax
  • namespaces - XML, XLink etc. E.g. <ns:tag>text</ns:tag>

These are only a few edge cases.

Unable to npm install himalaya

I'm not very familiar with "npm install", but I want this amazing tool on my Ubuntu 16 LTS system.
To do so I installed "npm" and "nodejs-legacy" with sudo apt-get install. After which I ran "npm install himalaya", but I see this:

$ npm install himalaya
npm WARN saveError ENOENT: no such file or directory, open '/home/username/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/home/username/package.json'
npm WARN username No description
npm WARN username No repository field.
npm WARN username No README data
npm WARN username No license field.

added 2 packages in 0.686s

$ npm version

npm version
{ npm: '5.0.0',
ares: '1.10.1-DEV',
cldr: '31.0.1',
http_parser: '2.7.0',
icu: '59.1',
modules: '57',
node: '8.0.0',
openssl: '1.0.2k',
tz: '2017b',
unicode: '9.0',
uv: '1.11.0',
v8: '5.8.283.41',
zlib: '1.2.11' }

Please help. Thanks.
I already tried: npm cache clean -f

Option to skip whitespaces

Preserves Whitespace

Himalaya does not cut corners and returns an accurate representation of the HTML supplied.

Is there an option to 'cut corners'? 😂

Support older node versions

Issue #30 shows there are incompatibility issues with older Node versions. As Himalaya does not rely on any radical new features, we should include babel-polyfill and configure Travis CI to test against some of the more recent LTS versions.

form element isn't allowed inside phrasing content

As stated in the specs:

*"For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:

<p>Welcome. <form><label>Name:</label> <input></form>
It is parsed exactly like the following:

<p>Welcome. </p><form><label>Name:</label> <input></form>"*

However, if you try to parse this:

<p>Welcome. <form><label>Name:</label> <input></form>

your parser totally screws this up.

Problem on attributes with quotes in the value.

Hi, great work with this library, but I have some problems when I use the toHTML function in translate.js. When an attribute value contains single quotes, the quotes around the attribute value are single too when, I believe, they should be double; the same happens with double quotes. I looked at the implementation, and the solution is a simple ! in the conditional on line 13. Here is an example of the output when I use the parser and, after doing something with the JSON, use the toHTML method to get back to HTML:
Original HTML:

<button @click="$store.dispatch('INCREMENT')" class="increment"> Increment</button>

and the output is this:

<button @click='$store.dispatch('INCREMENT')' class='increment'> Increment</button>

This causes the browser to interpret the attribute as @click='$store.dispatch(' followed by increment')', which is an error.
I am using this library and I hope the issue can be solved soon. Thanks for the great work.
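
A minimal sketch of quote selection that avoids the clash described above. This is not the code in translate.js; serializeAttribute is a hypothetical name, and the fallback of escaping double quotes as &quot; is an assumption about what a serializer could reasonably do.

```javascript
// Hypothetical serializer: pick a quote character that does not occur in
// the value; if both kinds occur, escape double quotes as &quot;.
function serializeAttribute(key, value) {
  const text = String(value)
  if (!text.includes('"')) return key + '="' + text + '"'
  if (!text.includes("'")) return key + "='" + text + "'"
  return key + '="' + text.replace(/"/g, '&quot;') + '"'
}

console.log(serializeAttribute('@click', "$store.dispatch('INCREMENT')"))
// prints: @click="$store.dispatch('INCREMENT')"
```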

Issue with line breaks, is it a problem?

I really don't know whether this should work or not because, in the end, there's no space between the attributes and that's not right, I mean, as valid HTML. But we're so used to browsers fixing things up for us (code like the example below does work in browsers) that I ask myself: should this work in himalaya as well?

Thanks :)

var html = require("himalaya")

var markup = `<button custom-attr-one="Hello world"
custom-attr-two="Hello title">Button</button>`;

html.parse(markup)

output:

[
    {
        "type": "Element",
        "tagName": "button",
        "attributes": {
            "customAttrOne": "Hello world\"\ncustom-attr-two=\"Hello title"
        },
        "children": [
            {
                "type": "Text",
                "content": "Button"
            }
        ]
    }
]

Problem with attribute name

Hello. I am having a problem when converting JSON to HTML, but only with inputs and their name attribute.
For example, I translate this HTML to JSON:
<input type="text" name="name">
and I get this JSON:

[
  {
    "type": "Element",
    "tagName": "input",
    "attributes": {
      "type": "text",
      "name": "name"
    },
    "children": []
  }
]

But when I try to translate this JSON back to HTML, I get something like this:

<input type='text' name>

As you can see, the name attribute lost its value. Could you help me fix this issue?
Thank you.

formatStyles error

if the style is

background-image:url("https://wd.geilicdn.com/bj-vshop-216085684-1496287337836-345156715_900_900.jpg.webp?w=400&")

the formatStyles function will return

{
     "backgroundImage":"url(&quot",
     "http":"//wd.geilicdn.com/bj-vshop-216085684-1496287337836-345156715_900_900.jpg.webp?w=400&amp"
}

I debugged the code and found this:

function formatStyles(str) {
     return str.trim().split(';').map(function (rule) {
          return rule.trim().split(':');
     }).reduce(function (styles, keyValue) {
……
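
The snippet above splits on every ';' and every ':', which is what breaks URLs and entities such as &quot;. A hedged sketch of an alternative follows: it splits declarations only on semicolons outside quotes and parentheses, then splits each declaration at its first colon. parseStyle is a hypothetical name, it keeps property names as written (no camelCasing), and it does not handle backslash-escaped quotes.

```javascript
// Hypothetical replacement sketch for the splitting logic.
function parseStyle(str) {
  const declarations = []
  let current = ''
  let quote = null // currently open quote character, if any
  let depth = 0 // nesting depth of parentheses

  for (const ch of str) {
    if (quote) {
      if (ch === quote) quote = null
    } else if (ch === '"' || ch === "'") {
      quote = ch
    } else if (ch === '(') {
      depth++
    } else if (ch === ')') {
      depth = Math.max(0, depth - 1)
    } else if (ch === ';' && depth === 0) {
      // Top-level semicolon: end of one declaration.
      declarations.push(current)
      current = ''
      continue
    }
    current += ch
  }
  declarations.push(current)

  const styles = {}
  for (const declaration of declarations) {
    const colon = declaration.indexOf(':') // first colon only
    if (colon === -1) continue
    const property = declaration.slice(0, colon).trim()
    const value = declaration.slice(colon + 1).trim()
    if (property) styles[property] = value
  }
  return styles
}

console.log(parseStyle('background-image:url("https://example.com/a.jpg?w=400&");color:red'))
```

Because semicolons inside url(...) are ignored, the same sketch also keeps a value like url(&quot;...&amp;&quot;) in one piece.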

no sanitizing or validation

There is no validation in this code, is this going to be added with a good performance?

E.g. tagName should be validated. I never heard about tagName like ''#DIV=()" or "sPAn" etc. Or Chinese or Arabic letters.

As suggested here: #5 Also validation would be needed. Components are upperCase letters.

performance issues

Use of lastIndexOf is terribly slow. Same with split() and slice().

A better solution would be to skip this, and just iterate through the string. Should be faster.

Use of reduce(), map() and now in your latest change - filter() - are all performance killers.

Test it out on jsperf(). There exist better options!

Should support components?

Components are a must these days, and if Himalaya is to be used with virtual DOM libraries, I suggest you add support for components. Right now you have Text, Element, and Comment nodes.

What about adding a 4th node type, Component? I think that would draw users to this script.

inlineStyle and serializeAttr in translate.js has some errors

I pass in an AST that has been transformed:

{
    "type": "Element",
    "tagName": "body",
    "children": [
        {
            "type": "Element",
            "tagName": "view",
            "attributes": {
                "className": [
                    "div"
                ]
            },
            "children": [
                {
                    "type": "Text",
                    "content": "\n    &lt;     &gt; a ' \" &amp;  \n    "
                }
            ]
        }
    ]
}

Then I get the error:

TypeError: Cannot convert undefined or null to object
    at inlineStyle (/projec_root/node_modules/himalaya/lib/translate.js:38:17)

It happens in the inlineStyle function:

function inlineStyle(style) {
  return Object.keys(style).reduce(function (css, key) {
    return css + '; ' + dasherize(key) + ': ' + style[key];
  }, '').slice(2);
}

serializeAttr also has a problem if value is null or undefined:

function serializeAttr(attr, value, isXml) {
  if (!isXml && attr === value) return attr;

  try{
    console.log(value)
    //-------------- here 
    var text = value.toString();
  
    var quoteEscape = text.indexOf('\'') !== -1;
    var quote = quoteEscape ? '"' : '\'';
    return attr + '=' + quote + text + quote;
  }catch(e){
    console.log(value)
    console.log(e)
  }
  return ''
}
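
A sketch of how inlineStyle could guard against a missing style object instead of throwing. It is an illustration only: inlineStyleSafe is a hypothetical name, and the dasherize shown here is an assumed stand-in for the library's own helper, whose implementation is not quoted in the issue.

```javascript
// Assumed stand-in for the library's dasherize: backgroundColor -> background-color.
function dasherize(name) {
  return name.replace(/[A-Z]/g, (letter) => '-' + letter.toLowerCase())
}

// Hypothetical guarded variant: a null or undefined style serializes to ''.
function inlineStyleSafe(style) {
  if (style == null) return ''
  return Object.keys(style)
    .map((key) => dasherize(key) + ': ' + style[key])
    .join('; ')
}

console.log(inlineStyleSafe(undefined)) // prints an empty line
console.log(inlineStyleSafe({ backgroundColor: 'red', color: 'blue' }))
// prints: background-color: red; color: blue
```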

TypeError: str.charAt is not a function

Getting the following error when running the basic setup:

v7.5.0/lib/node_modules/himalaya/lib/lexer.js:30
var isText = str.charAt(state.cursor) !== '<';

TypeError: str.charAt is not a function

MicroOp

Just flicking through your code and noticed this line:
https://github.com/andrejewski/himalaya/blob/master/index.js#L42
if(!str.indexOf(commentStart)) {
I take it that line is checking whether str starts with commentStart. I just wanted to point out that that's the slowest way of doing it (by a lot). Check this:
https://jsperf.com/string-startswith/25
Notice the last test? It might be ugly, but the speed increases are huge. I used it once myself and it made a huge difference. lastIndexOf used to win (aside from the hardcoded version), so I was surprised to see slice and substring winning.

While I'm here:
Those !str.indexOf, !~, etc. tricks are really clever, but it's hard to know what's going on. I had to run tests in the console to figure it out, because what the hell do you google?
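
For anyone else puzzling over the idiom, the sketch below puts it next to two more readable equivalents. The value of commentStart is assumed to be '<!--' (the issue does not show it), the function names are made up for illustration, and no performance claim is intended.

```javascript
// Assumed value; the issue does not show what commentStart holds.
const commentStart = '<!--'

// The idiom from the issue: indexOf returns 0 (falsy) only when the
// match is at position 0, so negating it means "starts with".
function startsWithViaIndexOf(str, prefix) {
  return !str.indexOf(prefix)
}

// Built-in String.prototype.startsWith (ES2015).
function startsWithBuiltin(str, prefix) {
  return str.startsWith(prefix)
}

// Compare only the leading slice, without scanning the rest of the string.
function startsWithViaSlice(str, prefix) {
  return str.slice(0, prefix.length) === prefix
}

for (const input of ['<!-- note -->', '<p><!-- note --></p>', 'plain text']) {
  console.log(
    JSON.stringify(input),
    startsWithViaIndexOf(input, commentStart),
    startsWithBuiltin(input, commentStart),
    startsWithViaSlice(input, commentStart)
  )
}
```

All three agree on every input; they differ only in readability and in how much of the string they may scan when the prefix is absent.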

Not working

Tried to convert below html:

Spectacular Mountain

Mountain View

Got below response:
[ { type: 'Element', tagName: '!doctype', attributes: { html: 'html' }, children: [] }, { type: 'Text', content: '\n' }, { type: 'Element', tagName: 'html', attributes: {}, children: [ [Object], [Object], [Object] ] }, { type: 'Text', content: '\n' } ]

Parser Ability To Handle Embedded Tables

Hi There,

I have a minimum set of instructions to reproduce an issue I believe is with the parser. The set of steps to reproduce are the following:

  1. Read Sample File Using i.e.
    HtmlTemplate = fs.readFileSync(TemplateHtmlPath, {encoding: 'utf8'});
  2. Convert To Json i.e.
    JsonTemplate = himalaya.parse(HtmlTemplate );
  3. Convert the Json Back To Html & Write To File i.e.
    let Html = toHTML(JsonTemplate );
    fs.writeFileSync(FinalHtmlPath, Html);

The Html returned from toHTML(JsonTemplate) is very different to the HTML that was originally loaded into the program. I don't think the module currently handles embedded tables. Thoughts? I attach sample files (Test.html, script.js & style.css). As you can see from the sample html it contains embedded tables.

Test.zip

only one instance of babel-polyfill is allowed

Throwing this error

/home/yogesh/yogesum/api/node_modules/babel-polyfill/lib/index.js:10
  throw new Error("only one instance of babel-polyfill is allowed");
  ^

Error: only one instance of babel-polyfill is allowed
    at Object.<anonymous> (/home/yogesh/yogesum/api/node_modules/babel-polyfill/lib/index.js:10:9)
    at Module._compile (module.js:570:32)
    at Module._extensions..js (module.js:579:10)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:214:7)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/yogesh/yogesum/api/node_modules/himalaya/lib/index.js:5:1)
    at Module._compile (module.js:570:32)
    at Module._extensions..js (module.js:579:10)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:214:7)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.require (module.js:497:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/yogesh/yogesum/api/server/components/pdfgen/htmlParse.js:2:18)
    at Module._compile (module.js:570:32)
    at normalLoader (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:199:5)
    at Object.require.extensions.(anonymous function) [as .js] (/home/yogesh/yogesum/api/node_modules/babel-core/lib/api/register/node.js:216:7)

content of himalaya/lib/index.js:

// line number 5
require('babel-polyfill');

var _lexer = require('./lexer');

himalaya version: 0.2.5 (project using babel for development)

Error: Cannot find module './lexer'

Hi!
meteor npm install himalaya

import himalaya  from 'himalaya';
const parsed = himalaya.parse(html)
return parsed;

Error: Cannot find module './lexer'

Empty attribute values are converted to 0

from a html string to json

input:

<div custom-attr=''></div>

output:

[{"tagName":"div","attributes":{"customAttr":0},"children":[],"type":"Element"}]

So when converting back to html:

<div custom-attr='0'></div>
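
The issue does not say where the 0 comes from. One plausible cause, stated here purely as an assumption, is a numeric cast: in JavaScript, Number('') evaluates to 0. The sketch below shows a cast that guards against that case; castValue is a hypothetical name and is not taken from the library.

```javascript
// Hypothetical cast: turn numeric-looking strings into numbers, but leave
// empty or whitespace-only strings alone, because Number('') is 0.
function castValue(text) {
  const trimmed = text.trim()
  if (trimmed === '') return text
  const number = Number(trimmed)
  return Number.isNaN(number) ? text : number
}

console.log(Number('')) // prints 0
console.log(JSON.stringify(castValue(''))) // prints ""
console.log(castValue('42')) // prints 42
```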

xml and xml prolog issues

This compiles totally wrong!!

<?xml version="1.0" ?>

and xml doesn't seem to be supported either. And this is not case sensitive

How to parse html strings

I have data fetched from a website in HTML format, like {"data" :"<p>How are you?</p>"}

How do I convert this data to JSON using himalaya?

require('himalaya') doesn't work for webpack builds in browser

Trying to use himalaya in a browser, but require('himalaya') is returning an empty object. This is due to the code in index.js:

if (typeof window === 'undefined') {
  module.exports = {default: lib, ...lib}
} else {
  window.himalaya = lib
}

I'm guessing this is to deal with the case where the file is being loaded from a script tag e.g. from a CDN. My advice would be to always do the module.exports code path, and then handle the script tag case with different build settings. If you wanted, you could build a dist/himalaya.js and dist/himalaya.min.js that people could use for this purpose.

I can make a PR if you like?

While loop break condition (possibly) wrong when trying to find closing tag in 'parse' method

Hey Chris,
Love the library. I am experiencing an issue when trying to crawl certain websites (e.g. )

In the parse method I see this snippet:

while (--_len > -1) {
 if (tagName === stack[_len].tagName) {
     stack = stack.slice(0, _len);
     nodes = stack[_len - 1].children;
     break;
  }
}

I admit I don't fully understand what's going on, but wouldn't that break for _len = 0 (after the decrement)?
It will still enter the while loop and, provided it enters the if condition, stack will always be an empty array. On top of that, we then try to look up stack[-1], which clearly doesn't have a children prop.

Am I missing something?
Thanks again
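
To make the concern concrete, here is a sketch of the same search with the index-0 case guarded. It is not the library's code: closeTag and rootNodes are hypothetical names, and falling back to a top-level node list when the stack empties is an assumption about the intended behaviour.

```javascript
// Hypothetical rewrite of the loop: find the nearest open element with the
// given tagName, pop it and everything above it, and return the node list
// that subsequent nodes should be appended to.
function closeTag(stack, rootNodes, tagName) {
  for (let i = stack.length - 1; i >= 0; i--) {
    if (stack[i].tagName === tagName) {
      const remaining = stack.slice(0, i)
      // When i is 0 the stack is now empty, so fall back to the root list
      // instead of reading remaining[-1].
      const nodes = i > 0 ? remaining[i - 1].children : rootNodes
      return { stack: remaining, nodes }
    }
  }
  // No matching open tag: leave the stack untouched.
  const top = stack[stack.length - 1]
  return { stack, nodes: top ? top.children : rootNodes }
}

const rootList = []
const openDiv = { tagName: 'div', children: [] }
const closed = closeTag([openDiv], rootList, 'div')
console.log(closed.stack.length, closed.nodes === rootList) // prints 0 true
```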

Don't force lowercase on anchors

Hello,

First, I would like to thank you for this awesome library. Works great.

I would like to add support for camel-cased anchors, which is useful for special HTML templating systems.
Right now, if we put camel case in HTML, himalaya cuts it down to lower case:

var himalaya = require('himalaya');
var toHTML = require('himalaya/translate').toHTML;
toHTML( himalaya.parse('<div><specialAnchor><specialValue1>1</specialValue1><specialValue2>2</specialValue2></div>') )
// returns "<div><specialanchor><specialvalue1>1</specialvalue1><specialvalue2>2</specialvalue2></specialanchor></div>"

Since JSON supports (and uses) camelCase, I think himalaya should provide an option to support it.
From looking into the source, it seems that this line is the problem:

const tagName = tagToken.content.toLowerCase() // parser.js:44

Since HTML supports uppercase for anchors, I think himalaya should return exactly what was parsed through toHTML(himalaya.parse(someString))

toHTML( himalaya.parse('<P>some text</P>') )
// returns '<p>some text</p>' but should return '<P>some text</P>'

What's your opinion about this? Is there another reason justifying the forced lower case on anchors?
