
xml-parser's Introduction

XML Parser, Stringifier and DOM

Parse XML, HTML and more with a very tolerant XML parser and convert it into a DOM.

The three components (parser, stringifier and DOM) are separate modules.

Component    Size
Parser       4.7 KB
Stringifier  1.3 KB
DOM          3.1 KB

Install

npm install xml-parse

Require

const xml = require("xml-parse");

Parser

Parsing takes a single call: pass the XML string to the parse method of the xml-parse module.

const xml = require("xml-parse");

// Valid XML string
var parsedXML = xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                          '<root>Root Element</root>');
console.log(parsedXML);

// Invalid XML string (two root elements and one unclosed tag)
var parsedInvalidXML = xml.parse('<root></root>' +
                                 '<secondRoot>' +
                                   '<notClosedTag>' +
                                 '</secondRoot>');
console.log(parsedInvalidXML);

Parsed Object Structure

The result of parse is an array of node objects. For the valid XML string in the example above, it looks like this:

[
  {
    type: 'element',
    tagName: '?xml',
    attributes: {
      version: '1.0',
      encoding: 'UTF-8'
    },
    childNodes: [],
    innerXML: '>',
    closing: false,
    closingChar: '?'
  },
  {
    type: 'element',
    tagName: 'root',
    attributes: {},
    childNodes: [
      {
        type: 'text',
        text: 'Root Element'
      }
    ],
    innerXML: 'Root Element',
    closing: true,
    closingChar: null
  }
]

The top-level result is always an array, because the parser also accepts invalid XML with more than one root element.
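Because the result is an array, a caller who expects exactly one root has to check for it. The following is a minimal sketch of such a check. `rootElements` is a hypothetical helper, not part of xml-parse, and the `parsed` array is hand-written in the documented shape on the assumption that declarations such as `<?xml ...?>` are the only top-level elements with a non-null `closingChar`.

```javascript
// Hypothetical helper (not part of xml-parse): returns the top-level
// elements that look like real roots, skipping declarations whose
// closingChar is "?" or "!".
function rootElements(parsed) {
  return parsed.filter(function (node) {
    return node.type === "element" && node.closingChar === null;
  });
}

// Hand-written stand-in for a parse result with two root elements.
var parsed = [
  { type: "element", tagName: "?xml", attributes: { version: "1.0" },
    childNodes: [], innerXML: ">", closing: false, closingChar: "?" },
  { type: "element", tagName: "root", attributes: {},
    childNodes: [], innerXML: "", closing: true, closingChar: null },
  { type: "element", tagName: "secondRoot", attributes: {},
    childNodes: [], innerXML: "", closing: true, closingChar: null }
];

var roots = rootElements(parsed);
if (roots.length !== 1) {
  console.warn("Expected one root element, found " + roots.length);
}
```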

Object Nodes

There are two kinds of object nodes: element and text. Every node has a type property; the remaining keys depend on that type.

'Element' Object Node

{
  type: [String], // "element"
  tagName: [String], // The tag name of the tag
  attributes: [Object], // Object containing attributes as properties
  childNodes: [Array], // Array containing child nodes as object nodes ("element" or "text")
  innerXML: [String], // The inner XML of the tag
  closing: [Boolean], // If the tag is closed typically (</tagName>)
  closingChar: [String] || null // If it is not closed typically, the char that is used to close it ("!" or "?")
}

'Text' Object Node

{
  type: [String], // "text"
  text: [String] // Text contents of the text node
}
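Since every node carries a type, code that walks a parse result can branch on it. As a minimal sketch, the hypothetical helper `collectText` below (not part of xml-parse) gathers the text of a tree recursively; the `parsed` array is copied from the example structure shown earlier.

```javascript
// Hypothetical helper: concatenates the text of all text nodes,
// descending into the childNodes of element nodes.
function collectText(nodes) {
  return nodes.map(function (node) {
    if (node.type === "text") {
      return node.text;
    }
    return collectText(node.childNodes || []);
  }).join("");
}

// The parse result documented above.
var parsed = [
  { type: "element", tagName: "?xml",
    attributes: { version: "1.0", encoding: "UTF-8" },
    childNodes: [], innerXML: ">", closing: false, closingChar: "?" },
  { type: "element", tagName: "root", attributes: {},
    childNodes: [{ type: "text", text: "Root Element" }],
    innerXML: "Root Element", closing: true, closingChar: null }
];

console.log(collectText(parsed)); // prints "Root Element"
```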

Stringifier

The stringifier is the simplest component. Just pass a parsed object structure.

const xml = require("xml-parse");

var xmlDoc = [
  {
    type: 'element',
    tagName: '?xml',
    attributes: {
      version: '1.0',
      encoding: 'UTF-8'
    },
    childNodes: [],
    innerXML: '>',
    closing: false,
    closingChar: '?'
  },
  {
    type: 'element',
    tagName: 'root',
    attributes: {},
    childNodes: [
      {
        type: 'text',
        text: 'Root Element'
      }
    ],
    innerXML: 'Root Element',
    closing: true,
    closingChar: null
  }
]

var xmlStr = xml.stringify(xmlDoc, 2); // 2 spaces

console.log(xmlStr);
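To make the mapping from object nodes back to markup concrete, here is a simplified serializer. It is an illustrative sketch, not the library's stringify: `toXml` is a hypothetical name, it does no indentation or escaping, and it assumes that a `closingChar` of "?" means the tag ends with `?>`.

```javascript
// Illustrative serializer (not xml-parse's own implementation).
function toXml(nodes) {
  return nodes.map(function (node) {
    if (node.type === "text") {
      return node.text;
    }
    var attributes = node.attributes || {};
    var attrs = Object.keys(attributes).map(function (name) {
      return " " + name + '="' + attributes[name] + '"';
    }).join("");
    var open = "<" + node.tagName + attrs;
    var children = toXml(node.childNodes || []);
    if (node.closing) {
      return open + ">" + children + "</" + node.tagName + ">";
    }
    // Not closed by a matching end tag: emit "?>" for declarations
    // (assumption), otherwise leave the tag open.
    return open + (node.closingChar === "?" ? "?>" : ">") + children;
  }).join("");
}

var xmlDoc = [
  { type: "element", tagName: "?xml",
    attributes: { version: "1.0", encoding: "UTF-8" },
    childNodes: [], innerXML: ">", closing: false, closingChar: "?" },
  { type: "element", tagName: "root", attributes: {},
    childNodes: [{ type: "text", text: "Root Element" }],
    innerXML: "Root Element", closing: true, closingChar: null }
];

console.log(toXml(xmlDoc));
// prints <?xml version="1.0" encoding="UTF-8"?><root>Root Element</root>
```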

DOM

The DOM constructor of the xml-parse module returns a Document Object Model with a few methods. It is modeled on the official W3C DOM but is less complex than the original.

const xml = require("xml-parse");

var xmlDoc = new xml.DOM(xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                                     '<root>Root Element</root>')); // Can also be a file path.

xmlDoc.document; // Document Object. (Root)

'Element' Object Node

// An element (e.g. the 'document' object) has the following prototype methods and properties:

var objectNode = xmlDoc.document.childNodes[1]; // Just an example

// An element object node looks like this:

objectNode = {
  type: 'element',
  tagName: 'tagName',
  attributes: [Object],
  childNodes: [Object],
  innerXML: 'innerXML',
  closing: true,
  closingChar: null,
  getElementsByTagName: [Function], // Returns all child nodes with a specific tagName
  getElementsByAttribute: [Function], // Returns all child nodes with a specific attribute value
  removeChild: [Function], // Removes a child node
  appendChild: [Function], // Appends a child node
  insertBefore: [Function], // Inserts a child node
  getElementsByCheckFunction: [Function], // Returns all child nodes that are validated by validation function
  parentNode: [Circular] // Parent Node
}

Handling child nodes

The appendChild and insertBefore methods of every object node add a child node. There is no need for anything like createElement.

Because a child node is just an object literal with properties such as type, tagName and attributes, you only have to pass such an object to the function.

appendChild

element.appendChild(childNode); // 'childNode' is just an object node

Example

const xml = require('xml-parse');

var xmlDoc = new xml.DOM(xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                                     '<root>Root Element</root>'));

var root = xmlDoc.document.getElementsByTagName("root")[0];

root.appendChild({
  type: "element",
  tagName: "appendedElement",
  childNodes: [
    {
      type: "text",
      text: "Hello World :) I'm appended!"
    }
  ]
});
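Since child nodes are plain object literals, a small factory function can cut down on repetition when building many of them. The sketch below is an assumption-laden convenience, not part of xml-parse: `el` is a hypothetical helper that produces the same kind of literal as the example above and turns bare strings into text nodes.

```javascript
// Hypothetical helper: builds an element object literal.
// Strings in `children` become text nodes; objects are used as-is.
function el(tagName, attributes, children) {
  return {
    type: "element",
    tagName: tagName,
    attributes: attributes || {},
    childNodes: (children || []).map(function (child) {
      return typeof child === "string" ? { type: "text", text: child } : child;
    })
  };
}

var node = el("appendedElement", { id: "a1" }, [
  "Hello World :) ",
  el("b", null, ["I'm appended!"])
]);

console.log(JSON.stringify(node, null, 2));
```

An object built this way could then be passed to appendChild or insertBefore in place of a hand-written literal.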

insertBefore

element.insertBefore(childNode, elementAfter); // 'childNode' is just an object literal; 'elementAfter' is an existing child node of the parent element

Example

const xml = require('xml-parse');

var xmlDoc = new xml.DOM(xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                                     '<root>Root Element</root>'));

var root = xmlDoc.document.getElementsByTagName("root")[0];

root.insertBefore({
  type: "element",
  tagName: "insertedElement",
  childNodes: [
    {
      type: "text",
      text: "Hello World :) I'm inserted!"
    }
  ]
}, root.childNodes[0]);

removeChild

element.removeChild(childNode); // 'childNode' is a child of the parent element ('element')

Example

const xml = require('xml-parse');

var xmlDoc = new xml.DOM(xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                                     '<root>Root Element</root>'));

var root = xmlDoc.document.getElementsByTagName("root")[0];

root.removeChild(root.childNodes[0]);

parentNode

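The parentNode of an object node refers to its parent element. Because the parent in turn lists the node in its childNodes, this is a circular reference, which console.log prints as [Circular].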

const xml = require('xml-parse');

var xmlDoc = new xml.DOM(xml.parse('<?xml version="1.0" encoding="UTF-8"?>' +
                                     '<root>Root Element</root>'));

var root = xmlDoc.document.getElementsByTagName("root")[0];

console.log(root.childNodes[0].parentNode); // Returns the 'root' element
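One practical consequence of a circular reference is that JSON.stringify throws on such a node. A minimal sketch of a workaround is a replacer function that skips the parentNode key. The nodes below are built by hand to mimic the DOM shape, and `toJSON` is a hypothetical helper name.

```javascript
// Hand-built nodes with the same circular shape as DOM object nodes.
var root = { type: "element", tagName: "root", attributes: {}, childNodes: [] };
var text = { type: "text", text: "Root Element", parentNode: root };
root.childNodes.push(text);

// Hypothetical helper: serializes a node while omitting parentNode,
// which breaks the cycle root -> childNodes -> text -> parentNode -> root.
function toJSON(node) {
  return JSON.stringify(node, function (key, value) {
    return key === "parentNode" ? undefined : value;
  });
}

try {
  JSON.stringify(root); // throws a TypeError because of the cycle
} catch (err) {
  console.log(err instanceof TypeError); // prints true
}

console.log(toJSON(root));
```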

Get child nodes

getElementsByTagName

element.getElementsByTagName("myTagName"); // Returns all elements whose tag name is 'myTagName'

getElementsByAttribute

element.getElementsByAttribute("myAttribute", "myAttributeValue"); // Returns all elements whose attribute 'myAttribute' is 'myAttributeValue'

getElementsByCheckFunction

// Pass a custom check function to build your own 'get' method.
element.getElementsByCheckFunction(function(element) {
  return element.type === "element" && element.childNodes.length === 30;
}); // Returns all elements that have exactly 30 childNodes
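The same idea can be applied to a plain parse result without the DOM wrapper. The sketch below is not the library's implementation: `findNodes`, `byTagName` and `byAttribute` are hypothetical names, and the search is assumed to be a depth-first walk over childNodes.

```javascript
// Hypothetical helper: returns every node (at any depth) for which
// check(node) is truthy, in document order.
function findNodes(nodes, check) {
  var found = [];
  nodes.forEach(function (node) {
    if (check(node)) {
      found.push(node);
    }
    if (node.childNodes) {
      found = found.concat(findNodes(node.childNodes, check));
    }
  });
  return found;
}

// Check functions equivalent in spirit to the two built-in getters.
function byTagName(tagName) {
  return function (node) {
    return node.type === "element" && node.tagName === tagName;
  };
}

function byAttribute(name, value) {
  return function (node) {
    return node.type === "element" && !!node.attributes &&
      node.attributes[name] === value;
  };
}

var tree = [
  { type: "element", tagName: "root", attributes: {}, childNodes: [
    { type: "element", tagName: "item", attributes: { id: "a" },
      childNodes: [{ type: "text", text: "A" }] },
    { type: "element", tagName: "item", attributes: { id: "b" }, childNodes: [] }
  ] }
];

console.log(findNodes(tree, byTagName("item")).length);       // prints 2
console.log(findNodes(tree, byAttribute("id", "b")).length);  // prints 1
```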

xml-parser's People

Contributors

benedictchen, mauriceconrad, muwum


xml-parser's Issues

QNames and Whitespace

Try parsing the following XML

<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:leader>00000cam a2200000 4500</marc:leader>
<marc:controlfield tag="001">25</marc:controlfield>
<marc:controlfield tag="005">20150304</marc:controlfield>
<marc:controlfield tag="008">950620p19821982||||||||||||||||||||nor|||</marc:controlfield>
<marc:datafield tag="015" ind1="" ind2="">
<marc:subfield code="a">82,A49,0102</marc:subfield>
</marc:datafield>
<marc:datafield tag="020" ind1="" ind2="">
<marc:subfield code="9">3-7678-0565-0</marc:subfield>
<marc:subfield code="c">Pp. : DM 9.80</marc:subfield>
</marc:datafield>
</marc:record>

  1. There is a CR/LF sequence after the processing instruction that turns the entire XML into a text node.

  2. CR/LF seems to make the parser stop parsing - so you have to globally remove them

  3. QNames are not recognised (marc:record is a QName) so you have to remove the namespace prefix.
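The two manual workarounds described in points 2 and 3 can be sketched as a preprocessing step that runs before the string is handed to the parser. `preprocess` is a hypothetical helper; the prefix stripping is a crude regular expression that only touches tag names, so the xmlns:marc attribute stays in place and two differently prefixed tags with the same local name would collide.

```javascript
// Hypothetical workaround: remove line breaks and namespace prefixes
// from tag names before parsing.
function preprocess(xmlString) {
  return xmlString
    .replace(/\r?\n/g, "")              // drop CR/LF and LF line breaks
    .replace(/<(\/?)[\w.-]+:/g, "<$1"); // <marc:x> -> <x>, </marc:x> -> </x>
}

var input =
  '<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">\r\n' +
  '<marc:controlfield tag="001">25</marc:controlfield>\r\n' +
  '</marc:record>';

console.log(preprocess(input));
// prints <record xmlns:marc="http://www.loc.gov/MARC21/slim"><controlfield tag="001">25</controlfield></record>
```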

Then it works, and it wasn't too slow. But how on earth can one handle the result? Make one small change and you get the same sort of tree without any indication that something went wrong.
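One way to get at least some indication from the result is to scan the tree for elements that were never closed. The sketch below rests on an assumption drawn from the README's field descriptions: an unclosed tag is reported with closing set to false and closingChar set to null. `findUnclosed` is a hypothetical helper and the tree is hand-written, not real parser output.

```javascript
// Hypothetical check: lists the tag names of elements that have neither
// a matching end tag (closing) nor a special closing character.
function findUnclosed(nodes) {
  var unclosed = [];
  nodes.forEach(function (node) {
    if (node.type !== "element") {
      return;
    }
    if (node.closing === false && node.closingChar === null) {
      unclosed.push(node.tagName);
    }
    unclosed = unclosed.concat(findUnclosed(node.childNodes || []));
  });
  return unclosed;
}

// Hand-written tree in the documented shape (assumed, not parser output).
var tree = [
  { type: "element", tagName: "?xml", attributes: {}, childNodes: [],
    innerXML: ">", closing: false, closingChar: "?" },
  { type: "element", tagName: "secondRoot", attributes: {}, childNodes: [
    { type: "element", tagName: "notClosedTag", attributes: {},
      childNodes: [], innerXML: "", closing: false, closingChar: null }
  ], innerXML: "<notClosedTag>", closing: true, closingChar: null }
];

console.log(findUnclosed(tree)); // prints [ 'notClosedTag' ]
```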

And there ought to be an option to ignore whitespace.
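Until such an option exists, whitespace can be filtered out of the parse result after the fact. This is a minimal sketch with a hypothetical helper, `dropBlankText`; it returns a new tree without whitespace-only text nodes and does not update innerXML.

```javascript
// Hypothetical post-processing step: removes text nodes that contain
// only whitespace, at every depth, without mutating the input.
function dropBlankText(nodes) {
  return nodes
    .filter(function (node) {
      return !(node.type === "text" && node.text.trim() === "");
    })
    .map(function (node) {
      if (node.type !== "element") {
        return node;
      }
      return Object.assign({}, node, {
        childNodes: dropBlankText(node.childNodes || [])
      });
    });
}

var tree = [
  { type: "element", tagName: "record", attributes: {}, childNodes: [
    { type: "text", text: "\r\n  " },
    { type: "element", tagName: "leader", attributes: {},
      childNodes: [{ type: "text", text: "00000cam a2200000 4500" }] },
    { type: "text", text: "\r\n" }
  ] }
];

var cleaned = dropBlankText(tree);
console.log(cleaned[0].childNodes.length); // prints 1
```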

Extremely slow execution

Trying to parse this XML:

<td >Mexico</td><td style="max-width: 768px; white-space: normal;">123456789012345678901234567</td><td></td><td>&nbsp;<span class="mx" data-price="N/A"></span></td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>

takes up to 6 seconds. Pretty slow.

If you add 5 more digits to the second <td>'s content:

<td >Mexico</td><td style="max-width: 768px; white-space: normal;">12345678901234567890123456789012</td><td></td><td>&nbsp;<span class="mx" data-price="N/A"></span></td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>

the execution time is now 180 seconds.

It seems like each added character roughly doubles the execution time: five extra digits raised it from 6 to 180 seconds, a factor of 30, which is close to 2^5 = 32.
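A small timing harness makes this kind of growth easy to reproduce. The sketch below is self-contained and uses a trivial stand-in parser so that it runs without the package installed; to measure the real behaviour, pass xml.parse in place of the hypothetical `standInParse`. `timeParse`, `growthTable` and `makeInput` are also hypothetical names.

```javascript
const { performance } = require("perf_hooks");

// Returns the wall-clock time in milliseconds of one parseFn(input) call.
function timeParse(parseFn, input) {
  const start = performance.now();
  parseFn(input);
  return performance.now() - start;
}

// Times parseFn on inputs of increasing size.
function growthTable(parseFn, makeInput, sizes) {
  return sizes.map(function (size) {
    return { size: size, ms: timeParse(parseFn, makeInput(size)) };
  });
}

// Builds a <td> whose content is `digits` digits long, as in the report.
function makeInput(digits) {
  var content = "1234567890".repeat(Math.ceil(digits / 10)).slice(0, digits);
  return "<td>" + content + "</td>";
}

// Stand-in for xml.parse so the sketch runs on its own.
function standInParse(input) {
  return input.split("<");
}

var results = growthTable(standInParse, makeInput, [27, 28, 29, 30]);
results.forEach(function (row) {
  console.log(row.size + " digits: " + row.ms.toFixed(3) + " ms");
});
```

If the time roughly doubles from one row to the next, the growth is exponential in the length of the text content.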

TypeScript support

It would be nice to be able to use xml-parse in TypeScript with ECMAScript modules.
