html-dom-parser's Introduction

html-dom-parser

HTML to DOM parser that works on both the server (Node.js) and the client (browser):

HTMLDOMParser(string[, options])

The parser converts an HTML string to a JavaScript object that describes the DOM tree.

Example

import parse from 'html-dom-parser';

parse('<p>Hello, World!</p>');

Output:

[
  Element {
    type: 'tag',
    parent: null,
    prev: null,
    next: null,
    startIndex: null,
    endIndex: null,
    children: [
      Text {
        type: 'text',
        parent: [Circular],
        prev: null,
        next: null,
        startIndex: null,
        endIndex: null,
        data: 'Hello, World!'
      }
    ],
    name: 'p',
    attribs: {}
  }
]
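
Because the result is a plain array of nested node objects, it can be walked with ordinary recursion. The sketch below is only an illustration: it uses a hand-written stand-in for the parser output (so it runs without the library installed), and `collectText` is a hypothetical helper name, not part of html-dom-parser.

```javascript
// Hand-written stand-in for the array returned by
// parse('<p>Hello, <em>World</em>!</p>'). Only the fields used below are
// included; real nodes carry more properties (parent, prev, next, ...).
const nodes = [
  {
    type: 'tag',
    name: 'p',
    attribs: {},
    children: [
      { type: 'text', data: 'Hello, ' },
      { type: 'tag', name: 'em', attribs: {}, children: [{ type: 'text', data: 'World' }] },
      { type: 'text', data: '!' },
    ],
  },
];

// Hypothetical helper: concatenate the data of every text node, depth-first.
function collectText(nodeList) {
  let text = '';
  for (const node of nodeList) {
    if (node.type === 'text') {
      text += node.data;
    } else if (Array.isArray(node.children)) {
      text += collectText(node.children);
    }
  }
  return text;
}

console.log(collectText(nodes)); // "Hello, World!"
```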

Replit | JSFiddle | Examples

Install

NPM:

npm install html-dom-parser --save

Yarn:

yarn add html-dom-parser

CDN:

<script src="https://unpkg.com/html-dom-parser@latest/dist/html-dom-parser.min.js"></script>
<script>
  window.HTMLDOMParser(/* string */);
</script>

Usage

Import with ES Modules:

import parse from 'html-dom-parser';

Require with CommonJS:

const parse = require('html-dom-parser').default;

Parse empty string:

parse('');

Output:

[]

Parse string:

parse('Hello, World!');

Output:

[
  Text {
    type: 'text',
    parent: null,
    prev: null,
    next: null,
    startIndex: null,
    endIndex: null,
    data: 'Hello, World!'
  }
]

Parse element with attributes:

parse('<p class="foo" style="color: #bada55">Hello, <em>world</em>!</p>');

Output:

[
  Element {
    type: 'tag',
    parent: null,
    prev: null,
    next: null,
    startIndex: null,
    endIndex: null,
    children: [ [Text], [Element], [Text] ],
    name: 'p',
    attribs: { class: 'foo', style: 'color: #bada55' }
  }
]

The server parser is a wrapper around htmlparser2's parseDOM, but with the root parent node excluded. The next section shows the options available for the server parser.

The client parser mimics the server parser by using the DOM API to parse the HTML string.
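
The "root parent node excluded" behavior of the server parser can be pictured roughly as follows. This is a minimal sketch using plain objects rather than real htmlparser2 nodes, and the helper shown here is an illustration, not the library's actual source.

```javascript
// Hypothetical helper: clear the parent reference on each top-level node
// so the synthetic root node is not exposed to callers.
function unsetRootParent(nodes) {
  for (const node of nodes) {
    node.parent = null;
  }
  return nodes;
}

// Stand-in for a parsed tree whose top-level nodes point at a root node.
const root = { type: 'root', children: [] };
const p = { type: 'tag', name: 'p', parent: root, children: [] };
const text = { type: 'text', data: 'Hello', parent: p };
p.children.push(text);
root.children.push(p);

const result = unsetRootParent(root.children);
console.log(result[0].parent); // null
console.log(result[0].children[0].parent === p); // true
```

Only the top-level nodes are touched; nested nodes keep their parent references, which matches the `parent: [Circular]` entries in the example output.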

Options (server only)

Because the server parser is a wrapper of htmlparser2, which implements domhandler, you can alter how the server parser parses your code with the following options:

/**
 * These are the default options being used if you omit the optional options object.
 * htmlparser2 will use the same options object for its domhandler so the options
 * should be combined into a single object like so:
 */
const options = {
  /**
   * Options for the domhandler class.
   * https://github.com/fb55/domhandler/blob/master/src/index.ts#L16
   */
  withStartIndices: false,
  withEndIndices: false,
  xmlMode: false,
  /**
   * Options for the htmlparser2 class.
   * https://github.com/fb55/htmlparser2/blob/master/src/Parser.ts#L104
   */
  xmlMode: false, // Will overwrite what is used for the domhandler, otherwise inherited.
  decodeEntities: true,
  lowerCaseTags: true, // !xmlMode by default
  lowerCaseAttributeNames: true, // !xmlMode by default
  recognizeCDATA: false, // xmlMode by default
  recognizeSelfClosing: false, // xmlMode by default
  Tokenizer: Tokenizer,
};
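
The comments in the listing above say that several flags default to xmlMode or to its negation. A small sketch of that relationship, assuming nothing beyond those comments (`resolveParserFlags` is a hypothetical name, and htmlparser2 resolves these internally rather than through such a helper):

```javascript
// Hypothetical helper: fill in the parser flags whose defaults depend on
// xmlMode, following the comments in the options listing.
function resolveParserFlags(options = {}) {
  const xmlMode = options.xmlMode ?? false;
  return {
    xmlMode,
    decodeEntities: options.decodeEntities ?? true,
    lowerCaseTags: options.lowerCaseTags ?? !xmlMode,
    lowerCaseAttributeNames: options.lowerCaseAttributeNames ?? !xmlMode,
    recognizeCDATA: options.recognizeCDATA ?? xmlMode,
    recognizeSelfClosing: options.recognizeSelfClosing ?? xmlMode,
  };
}

console.log(resolveParserFlags().lowerCaseTags); // true
console.log(resolveParserFlags({ xmlMode: true }).recognizeCDATA); // true
```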

If you're parsing SVG, you can set lowerCaseTags to false without having to enable xmlMode. This returns tag names with their original casing (for example, camelCase SVG tags) instead of lowercasing them per the HTML standard.

Note

If you're parsing code client-side (in-browser), you cannot control the parsing options. Client-side parsing automatically handles returning some HTML tags in camelCase, such as specific SVG elements, but returns all other tags lowercased according to the HTML standard.
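
One way to picture how specific tags end up in camelCase on the client is a lookup from the lowercased name back to a known case-sensitive form. The sketch below is an assumption about the general technique, with a deliberately short tag list and a hypothetical `formatTagName` helper; the library's own list and function names may differ.

```javascript
// Small, illustrative subset of camelCase SVG tag names; a real list is longer.
const CASE_SENSITIVE_TAG_NAMES = [
  'clipPath',
  'foreignObject',
  'linearGradient',
  'radialGradient',
];

// Map each lowercased name back to its camelCase form.
const caseSensitiveTagNamesMap = new Map(
  CASE_SENSITIVE_TAG_NAMES.map((name) => [name.toLowerCase(), name]),
);

// Hypothetical helper: restore the camelCase form when one is known,
// otherwise fall back to the lowercase name.
function formatTagName(tagName) {
  const lower = tagName.toLowerCase();
  return caseSensitiveTagNamesMap.get(lower) ?? lower;
}

console.log(formatTagName('CLIPPATH')); // "clipPath"
console.log(formatTagName('DIV')); // "div"
```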

Migration

v5

Migrated to TypeScript. CommonJS imports require the .default key:

const parse = require('html-dom-parser').default;

v4

Upgraded htmlparser2 to v9.

v3

Upgraded domhandler to v5. Parser options like normalizeWhitespace have been removed.

v2

Removed Internet Explorer (IE11) support.

v1

Upgraded domhandler to v4 and htmlparser2 to v6.

Release

Release and publish are automated by Release Please.

Special Thanks

License

MIT

html-dom-parser's People

Contributors

albertaz1992, andrewleedham, blizzardengle, dependabot[bot], dreierf, github-actions[bot], mergify[bot], remarkablemark, russiancow

html-dom-parser's Issues

Parser completely removes body elements if the HTML string has opening <html> and <body> tags but no closing </body> and </html> tags

Expected Behavior

The parser should try to keep all existing HTML tags if the HTML string has opening <html> and <body> tags but no closing </body> and </html> tags.

Actual Behavior

The parser completely removes all children of <html> if the HTML string has opening <html> and <body> tags but no closing tags.

Steps to Reproduce

Just try to parse following html string:

 <html>
        <body>
          <h1 style="font-family: Arial;">
            html-react-parser
          </h1>

Reproducible Demo

https://jsfiddle.net/d2g59ch4/

Environment

  • Version: 0.2.2
  • Platform: Mac OS
  • Browser: Chrome 77

<head> and <body> with whitespace doesn't get parsed correctly

Relates to remarkablemark/html-react-parser#624

Expected Behavior

When I parse HTML with html-dom-parser on the client:

<head></head>
<body
>text</body>

I should get both head and body elements.

Actual Behavior

I get only head element and no body:

[
  {
    "parent": null,
    "prev": null,
    "next": null,
    "startIndex": null,
    "endIndex": null,
    "children": [],
    "name": "head",
    "attribs": {},
    "type": "tag"
  }
]

Steps to Reproduce

import parse from 'html-dom-parser'

parse(`
<head></head>
<body
>text</body>
`)

The bug occurs because the catch-all regex for head and body in domparser does not match newlines; it needs to be DOTALL so that it includes them.
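
The effect of DOTALL can be seen with a simplified pattern. The regex below is illustrative only and is not the pattern used in domparser; it just shows how the `s` flag changes whether `.` crosses a line break inside the `<body` tag.

```javascript
const html = '<head></head>\n<body\n>text</body>';

// Illustrative pattern only; the library's actual regex may differ.
const withoutDotAll = /<body.*?>/i;
// The "s" (dotAll) flag lets "." match line terminators as well.
const withDotAll = /<body.*?>/is;

console.log(withoutDotAll.test(html)); // false
console.log(withDotAll.test(html)); // true
```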

Reproducible Demo

https://codesandbox.io/s/html-react-parsser-624-p0i5pd?file=/src/App.js

Environment

  • Version: 3.0.0

Bug in Nodes implementation of case sensitive tag names

Expected Behavior

When parsing a complex SVG in browser or in Node the tagName (name) of many elements should be case sensitive. Here is the spec list indicating which tags.

Actual Behavior

When attempting to parse a complex SVG element (this camera) in Node, several elements have their names lowercased instead of being returned in camelCase. Specifically, the radialGradient and linearGradient tags are affected in this example.

Steps to Reproduce

Attempt to parse just the svg code from this camera in Node.

Reproducible Demo

html-dom-parser works as expected in the browser: https://jsfiddle.net/nx7h6wvz/

html-dom-parser DOES NOT work in Node. Here is how you can test this:

  1. Have Node installed and install html-dom-parser
  2. Create a file named test.js with the following code:
const Parse = require('html-dom-parser');
const Fs = require('fs');

function printTagNames(parent) {
    let html = '';
    parent.children.forEach((child) => {
        // Skip if not an actual element.
        if (child.type !== 'tag') {
            return;
        }
        // Get and show the tag name.
        const tagName = child.name;
        console.log(tagName);
        // Recurse through the children.
        if (child.children) {
            html += printTagNames(child);
        }
    });
    return html;
}

const code = Parse(Fs.readFileSync('camera.svg').toString());

// Locate SVG:
let svg;
code.forEach((ref) => {
    if (ref.type === 'tag') {
        if (ref.name === 'svg') {
            svg = ref;
        }
    }
});

printTagNames(svg);
  3. Download the SVG (this camera) and place it in the same directory as the test.js file. Make sure to rename the SVG file to camera.svg.
  4. Run the test in your terminal with: node test.js

Environment

  • Version: 3.1.5
  • Platform: Linux, Ubuntu Debian, Node v18.13.0
  • Browser: Latest Chromium (Not Applicable)
  • OS: Pop!_OS

Keywords

  • name
  • tagName
  • incorrect name case
  • case sensitivity
  • CASE_SENSITIVE_TAG_NAMES

Add support for edge environments like Vercel Edge, Cloudflare workers, etc

Problem

When I tried to use this library (indirectly, via html-react-parser), I got the error: "Error: This browser does not support document.implementation.createHTMLDocument". I believe this is because the browser version of the library was being loaded instead of the server version.

This is basically same underlying issue as #181.

Suggested Solution

Ideally the bundling would work correctly such that the server version would be selected automatically in edge environments.

Short of that, you could provide an explicit server export, so that one could write something like import parse from 'html-dom-parser/server' to force loading the server version. This would then need to be propagated to html-react-parser as well, so that one could force that library to use the server version of this library.

Workarounds

I verified that the underlying htmlparser2 library works fine in edge environment, so my workaround for now is to use that directly to parse the html to DOM, manually unset the parent for the nodes, and then call domToReact from html-react-parser directly with the parsed dom nodes. This works fine and is simple, so this is not high urgency - but would be nice to have for others trying to use this library (or more likely html-react-parser) in edge environments.

Method of removing title parameter causes crash in IE9

This surfaced through my use of html-react-parser.

Console throws SCRIPT600: Invalid target element for this operation.

Debugger points to doc.documentElement.innerHTML = ''; in domparser.js

https://msdn.microsoft.com/en-us/library/ms533897(v=vs.85).aspx

The innerHTML property is read-only on the col, colGroup, frameSet, html, head, style, table, tBody, tFoot, tHead, title, and tr objects.
You can change the value of the title element using the document.title property.

Prepare for Chrome User-Agent reduction

Description

Chrome is changing how much userAgent data they expose to the browser. Usage of navigator.userAgent in https://github.com/remarkablemark/html-dom-parser/blob/master/lib/client/utilities.js#L138 will throw browser warnings in the future.

Possible solution

Something like:

function isIE() {
  // Build a UA-like string from userAgentData brands when available (Chrome).
  const reducedUAString = navigator?.userAgentData?.brands
    ?.map?.(({ brand, version }) => `${brand}/${version}`)
    ?.join?.(' ');

  // Fall back to navigator.userAgent for other browsers.
  return /(MSIE |Trident\/|Edge\/)/.test(reducedUAString || navigator.userAgent);
}

Additional context

User-Agent (UA) reduction is the effort to minimize the identifying information shared in the User-Agent string which may be used for passive fingerprinting. As these changes are rolled out, all resource requests will have a reduced User-Agent header. As a result, the return values from certain Navigator interfaces will be reduced, including: navigator.userAgent, navigator.appVersion, and navigator.platform.

https://developer.chrome.com/docs/privacy-sandbox/user-agent/

Resolution error when using TypeScript with `moduleResolution` set to `node16` or `nodenext`

Expected Behavior

html-dom-parser to resolve correctly when using TypeScript with moduleResolution set to node16 or nodenext (effectively: in pure ESM environment).

Actual Behavior

html-dom-parser is not resolved correctly:

https://arethetypeswrong.github.io/?p=html-dom-parser%405.0.7

This also causes the types to be non-functional in VSCode.

Steps to Reproduce

Create a project using TypeScript with moduleResolution set to node16 or nodenext.
Create a file with the following content:

import htmlToDOM from 'html-dom-parser';

When you hover over htmlToDOM, you'll see that the types are not loaded correctly.

Reproducible Demo

Environment

  • Version: 5.1.2
  • Platform: Node.js
  • Browser: not applicable
  • OS: macOS Sonoma 14.2.1

Keywords

Relates to remarkablemark/html-react-parser#1305

Can't import the named export 'formatDOM' from non EcmaScript module (only default export is available)

Expected Behavior

Actual Behavior

I'm receiving this error when I simply try to use the library.

error - /app/node_modules/html-dom-parser/lib/client/html-to-dom.mjs
Can't import the named export 'formatDOM' from non EcmaScript module (only default export is available)

Steps to Reproduce

I am actually using html-react-parser which is using html-dom-parser. But as I checked in node_modules, I have the latest version of both libraries:

"html-dom-parser": "^3.1.1",
"html-react-parser": "^3.0.3",

which means it includes this fix as well: #335
But I am still having this issue.

Reproducible Demo

This is actually simply how I am using this library:

import parse from 'html-react-parser';

parse(someString);

But this problem is happening inside NextJS.

Environment

  • Version: 3.1.1
  • Platform: I am using NextJs, that may be causing the issue as well!
  • Browser: Not Applicable, it happens in server

Usage of html-dom-parser-server in browser

Hi,

I'm using your html-react-parser npm lib which in turn has a dependency on this repo.
I'd like to make use of the htmlparser2 options but this doesn't work within the browser as the implementation within this repo is swapped out for the html-dom-parser-client implementation.

I've quickly changed the package.json in my local version to use the html-dom-parser-server implementation and it works fine in the browser (when webpacked) and allows me to provide the options I require.

I'm wondering if the client implementation was/is still needed (given the server implementation wrapping htmlparser2 appears to work fine)?
If so, is there a way I can opt into using the server one regardless?
I couldn't see a way but I could also easily be missing something.

Thanks

Iain

SVG <clipPath> tag is erroneously lowercased

Any SVG <clipPath> elements in the parsed document have their tag names lowercased. Unfortunately, SVG is case-sensitive, so this is incorrect.

Smallest reproducible example:

const htmlDomParser = require('html-dom-parser')
const dom = htmlDomParser('<svg><clipPath></clipPath></svg>')
dom[0].children[0].name == 'clippath' // true

I can try to submit a PR, but I haven't gotten far enough to pin-point the source of the issue yet.

Carriage return is stripped in client parser

Expected Behavior

Carriage return preserved in client parser:

import parse from 'html-dom-parser';

parse('\r\n'); // '\r\n'

Actual Behavior

Carriage return stripped in client parser:

import parse from 'html-dom-parser';

parse('\r\n'); // '\n'

Steps to Reproduce

See above

Reproducible Demo

https://jsfiddle.net/remarkablemark/a8zqgp4s/

Environment

  • Version: 3.1.5
  • Platform: Browser
  • Browser: Chrome
  • OS:

Keywords

carriage return, newline, client parser, innerHTML
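
One possible way to keep carriage returns through a parser that normalizes newlines is to swap them for a placeholder before parsing and restore them afterwards. The sketch below is an assumption about how such a workaround could look, not a description of the library's fix: `parseLikeBrowser` is a stand-in that only mimics the newline normalization reported above, and the placeholder string is hypothetical.

```javascript
// Hypothetical placeholder token; any string unlikely to occur in the input works.
const CR_PLACEHOLDER = '__CARRIAGE_RETURN_PLACEHOLDER__';

// Stand-in for a parser that normalizes newlines as described in this issue
// (CRLF and lone CR become LF).
function parseLikeBrowser(text) {
  return text.replace(/\r\n?/g, '\n');
}

// Escape carriage returns, parse, then restore them.
function parsePreservingCR(text) {
  const escaped = text.replace(/\r/g, CR_PLACEHOLDER);
  const parsed = parseLikeBrowser(escaped);
  return parsed.split(CR_PLACEHOLDER).join('\r');
}

console.log(JSON.stringify(parseLikeBrowser('\r\n'))); // "\n"
console.log(JSON.stringify(parsePreservingCR('\r\n'))); // "\r\n"
```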

fails to parse complex html with script tags

Expected Behavior

all tags should be parsed

Actual Behavior

the last 2 tags p and audio are omitted and instead show up in the above div.

Steps to Reproduce

see PR that adds a failing test case

#24

Reproducible Demo

see PR #24

Environment

  • Version: 0.2.3
  • Platform: Mac OSX (mocha)
  • Browser: N/A

`<template>` children are not parsed correctly on the client

Expected Behavior

import parse from 'html-dom-parser';

parse('<template>test</template>');

Output:

[
  {
    "parent": null,
    "prev": null,
    "next": null,
    "startIndex": null,
    "endIndex": null,
    "children": [
      {
        "parent": "[Circular]",
        "prev": null,
        "next": null,
        "startIndex": null,
        "endIndex": null,
        "data": "test",
        "type": "text"
      }
    ],
    "name": "template",
    "attribs": {},
    "type": "tag"
  }
]

Actual Behavior

import parse from 'html-dom-parser';

parse('<template>test</template>');

Output:

[
  {
    "parent": null,
    "prev": null,
    "next": null,
    "startIndex": null,
    "endIndex": null,
    "children": [],
    "name": "template",
    "attribs": {},
    "type": "tag"
  }
]

Steps to Reproduce

Open ./examples/index.html in a browser and enter the HTML:

<template><article><p>Test</p></article></template>

Observe that children in the output is [].
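
A likely factor here is that, in the DOM, a template element keeps its parsed children in its content fragment rather than in its own childNodes. The sketch below illustrates reading from that fragment; it uses plain-object stand-ins so it runs outside a browser, `getChildNodes` is a hypothetical helper, and whether the library's client parser should be changed this way is an assumption.

```javascript
// Hypothetical helper: pick the node list that actually holds an element's children.
function getChildNodes(element) {
  if (element.nodeName.toLowerCase() === 'template' && element.content) {
    return element.content.childNodes;
  }
  return element.childNodes;
}

// Plain-object stand-ins for DOM nodes.
const templateEl = {
  nodeName: 'TEMPLATE',
  childNodes: [],
  content: { childNodes: [{ nodeName: '#text', nodeValue: 'test' }] },
};
const divEl = {
  nodeName: 'DIV',
  childNodes: [{ nodeName: '#text', nodeValue: 'hi' }],
};

console.log(getChildNodes(templateEl).length); // 1
console.log(getChildNodes(divEl)[0].nodeValue); // "hi"
```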

Reproducible Demo

Environment

  • Version: 3.1.3
  • Platform:
  • Browser: Chrome
  • OS:

Keywords

template

lib/client/html-to-dom.mjs is not a valid ECMAScript Module

0c4c2b6 has added https://github.com/remarkablemark/html-dom-parser/blob/master/lib/client/html-to-dom.mjs with

module.exports = HTMLDOMParser;
module.exports.default = HTMLDOMParser;

These are not valid ECMAScript export statements (valid syntax); export and export default are, as in index.mjs.

The end result is that my project crashes in the browser with ReferenceError: module is not defined.

Could you either revert that change or make it ECMAScript compliant?

Bug: Assumes all attributes are lower case

Some attributes are not supposed to be lowercase, for example SVG attributes.

Example
viewBox attribute is made lower case.

const htmlToDOM = require('html-dom-parser');

const html = `<svg viewBox="foo"></svg>`;
const output = htmlToDOM(html);

console.log(output);

// Output
// [ { type: 'tag',
//     name: 'svg',
//     attribs: { viewbox: 'foo' },
//     children: [],
//     next: null,
//     prev: null,
//     parent: null } ]

Add Deno support

Expected Behavior

html-dom-parser should use a server DOM parser implementation on Deno, similar to Node.js, with guards in place for when no document is defined.

Actual Behavior

Throwing exception that document.implementation doesn't exist. There is no document defined on deno.
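
A guard of the kind requested could check for the DOM API before choosing the client parser. This is a minimal sketch under that assumption; `canUseClientParser` is a hypothetical name, and the objects below are stand-ins for a browser-like and a Deno-like global scope.

```javascript
// Hypothetical check: does this global scope expose the DOM API
// that the client parser relies on?
function canUseClientParser(globalScope) {
  return (
    typeof globalScope?.document?.implementation?.createHTMLDocument === 'function'
  );
}

// Stand-ins for a browser-like and a Deno-like global object.
const browserLike = {
  document: { implementation: { createHTMLDocument() { return {}; } } },
};
const denoLike = {};

console.log(canUseClientParser(browserLike)); // true
console.log(canUseClientParser(denoLike)); // false
```

In real code the argument would typically be globalThis, and a false result would select the server implementation instead of throwing.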

Steps to Reproduce

Import html-dom-parser on deno. In my case I used a library that uses html-react-parser, which uses html-dom-parser underneath.

Reproducible Demo

Save this to a file like test.js and then run it on deno: deno run test.js

import parse from 'https://cdn.skypack.dev/html-dom-parser';

console.log(parse('<p class="foo" style="color: #bada55">Hello, <em>world</em>!</p>'));

Convert back DOM elements to string

Hi,

Is it possible to convert back the parsed string?

Because I'd like to delete the last empty tags from a string, and I really don't know how to do it if not by parsing the text into dom elements, remove empty tags, and then convert it all back.

Thanks :)
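
For illustration, the node shape shown in the README can be turned back into a string with a small recursive function. The sketch below is deliberately minimal and makes several assumptions: it does no entity escaping, does not special-case void elements or comments, and both helper names are hypothetical. A dedicated serializer such as the dom-serializer package is probably the better choice for real use.

```javascript
// Minimal sketch: no entity escaping and no special handling of void
// elements, comments, or other node types.
function serialize(nodes) {
  return nodes
    .map((node) => {
      if (node.type === 'text') return node.data;
      if (node.type !== 'tag') return '';
      const attrs = Object.entries(node.attribs ?? {})
        .map(([key, value]) => ` ${key}="${value}"`)
        .join('');
      return `<${node.name}${attrs}>${serialize(node.children ?? [])}</${node.name}>`;
    })
    .join('');
}

// Hypothetical helper: drop trailing top-level tags that have no children.
function dropTrailingEmptyTags(nodes) {
  const result = [...nodes];
  while (result.length > 0) {
    const last = result[result.length - 1];
    if (last.type === 'tag' && (last.children ?? []).length === 0) {
      result.pop();
    } else {
      break;
    }
  }
  return result;
}

// Stand-in for parser output: a paragraph followed by an empty paragraph.
const tree = [
  { type: 'tag', name: 'p', attribs: { class: 'foo' }, children: [{ type: 'text', data: 'Hi' }] },
  { type: 'tag', name: 'p', attribs: {}, children: [] },
];

console.log(serialize(dropTrailingEmptyTags(tree))); // <p class="foo">Hi</p>
```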

Self Closing Tags

If you submit a self-closing tag, e.g. <img />, the function stops parsing at that point.

Suggest ESM support

domhandler has supported ESM since v5 fb55/domhandler@5582477

But html-dom-parser still imports the CommonJS format of domhandler, even with the latest version (v5.0.3): https://github.com/remarkablemark/html-dom-parser/blob/master/package.json#L45.

This causes TypeErrors when you try to build a package that uses html-dom-parser as ESM: because domhandler is imported as ESM, usages like require('domhandler').Text are undefined: https://github.com/remarkablemark/html-dom-parser/blob/master/lib/client/utilities.js#L9

So I think html-dom-parser should support an ESM build.

TypeError: Cannot read properties of undefined (reading 'length') at formatDOM

Expected Behavior

Parses without throwing error:

import parse from 'html-dom-parser';

parse('<meta name="author" content="John Doe Mason />');

Actual Behavior

Throws error:

TypeError: Cannot read properties of undefined (reading 'length')
 at formatDOM (lib/client/utilities.js)

Steps to Reproduce

import parse from 'html-dom-parser';

parse('<meta name="author" content="John Doe Mason />');

This is caused by the newly added line: https://github.com/remarkablemark/html-dom-parser/blob/v3.1.4/lib/client/utilities.js#L88

Reproducible Demo

Environment

  • Version: 3.1.4
  • Platform:
  • Browser:
  • OS:

Keywords
