SwiftHTMLParser

SwiftHTMLParser is a library for parsing and traversing HTML and XML, written in Swift. It parses plaintext HTML or XML into an object tree (DOM) and allows easy traversal and searching of the tree's nodes, similar to a CSS selector or XPath.

Installation

To depend on SwiftHTMLParser in your own project, add it to the dependencies clause in your Package.swift file:

dependencies: [
    .package(url: "https://github.com/rnantes/swift-html-parser.git", from: "1.0.0")
]
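
The clause above only declares the package. A target also has to list the library as a dependency before it can be imported. A minimal manifest sketch follows. The package and target name `MyScraper` are hypothetical, and the product name `SwiftHTMLParser` and package identity `swift-html-parser` are assumptions that should be checked against the library's own Package.swift:

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "MyScraper", // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/rnantes/swift-html-parser.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "MyScraper", // hypothetical target name
            dependencies: [
                // Assumption: the library product is called "SwiftHTMLParser" and
                // the package is referenced by the last component of its URL.
                .product(name: "SwiftHTMLParser", package: "swift-html-parser")
            ]
        )
    ]
)
```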

Basic Structure

Object naming is based on the HTML Standard. Easy-to-follow introductions are also available from W3Schools and the W3C.

  • Node, a protocol: - consists of a start and a closing Tag. (Closing tags may be omitted in some special cases.)
  • Tag, a struct: - contains the tag's name; the opening tag also contains any of the node's Attributes
  • Attribute, a struct: - consists of a name and an associated value

Nodes

  • Element, a struct: - a Node that may contain nested nodes.
  • TextNode, a struct: - a Node that represents a block of text.
  • Comment, a struct: - a Node that represents a single or multi-line comment within an element.
  • CData, a struct: - a Node that represents a CData section and its associated text.
  • DocumentTypeNode, a struct: - a Node which provides metadata on how to parse the document
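
To make the relationship between the protocol and the structs concrete, here is a simplified sketch. The `Sketch…` types are hypothetical stand-ins written for this illustration, not the library's actual declarations. They show why a generic node has to be downcast before its concrete properties can be read:

```swift
// Simplified stand-ins for illustration only; NOT the library's declarations.
protocol SketchNode {}

struct SketchAttribute {
    let name: String
    let value: String?
}

struct SketchTextNode: SketchNode {
    let text: String
}

struct SketchComment: SketchNode {
    let text: String
}

struct SketchElement: SketchNode {
    let tagName: String
    var attributes: [SketchAttribute] = []
    var childNodes: [SketchNode] = []
}

// Models: <p class="essay-paragraph">This is the first paragraph.<!-- draft --></p>
let paragraph = SketchElement(
    tagName: "p",
    attributes: [SketchAttribute(name: "class", value: "essay-paragraph")],
    childNodes: [
        SketchTextNode(text: "This is the first paragraph."),
        SketchComment(text: " draft ")
    ]
)

// Children are stored as the protocol type, so a concrete member such as
// `text` is only reachable after downcasting to the concrete struct.
let texts = paragraph.childNodes.compactMap { ($0 as? SketchTextNode)?.text }
print(texts)
```

Storing children as the protocol type lets one array hold elements, text, and comments side by side, at the cost of an `as?` cast when a specific node type is needed.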

Using the API

Read in Plaintext HTML from a File

let fileURL = URL(fileURLWithPath: "/some/absolute/path/simple.html")
let htmlString = try String(contentsOf: fileURL, encoding: .utf8)
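
Reading the file can fail, for example when the path is wrong or the contents are not valid UTF-8. The sketch below wraps the read in a small helper that reports the error instead of propagating it. `readHTML(at:)` and the temporary file name are hypothetical and only serve the illustration; just Foundation calls are used:

```swift
import Foundation

/// Reads a file as UTF-8 text; returns nil (after logging) if that fails.
/// `readHTML` is a hypothetical helper, not part of SwiftHTMLParser.
func readHTML(at fileURL: URL) -> String? {
    do {
        return try String(contentsOf: fileURL, encoding: .utf8)
    } catch {
        // Thrown when the file is missing, unreadable, or not valid UTF-8.
        print("could not read \(fileURL.path): \(error.localizedDescription)")
        return nil
    }
}

// Hypothetical temporary file so the sketch runs on any machine.
let sampleURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("simple-sketch.html")
let sampleHTML = "<html><body><p>Hello</p></body></html>"
try? sampleHTML.write(to: sampleURL, atomically: true, encoding: .utf8)

if let htmlString = readHTML(at: sampleURL) {
    print("read \(htmlString.count) characters")
}
```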

Parse the HTML String Into a Tree of Node Objects (DOM)

let nodeTree = try HTMLParser.parse(htmlString)

Alternatively, to parse an XML file:

let nodeTree = try XMLParser.parse(xmlString)

Create a Node Selector Path Then Traverse the Node Tree to Find Matching Nodes

Element, Text, Comment, and CData selectors are available.

// create a node selector path to describe what nodes to match in the nodeTree
let nodeSelectorPath: [NodeSelector] = [
    ElementSelector().withTagName("html"),
    ElementSelector().withTagName("body"),
    ElementSelector().withTagName("div").withClassName("essay"),
    ElementSelector().withTagName("p").atPosition(0)
]

// find the nodes that match the nodeSelectorPath
let matchingNodes = HTMLTraverser.findNodes(in: nodeTree, matching: nodeSelectorPath)
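
One way to picture how a selector path is applied is as one selector per tree level: the first selector filters the root nodes, the next filters the children of whatever survived, and so on. The toy model below illustrates that idea. `ToyElement`, `ToySelector`, and this `findElements` are hypothetical names written for this sketch, and the library's real algorithm may differ, in particular in how a position is counted:

```swift
// Toy model for illustration; the library's own types and algorithm may differ.
struct ToyElement {
    let tagName: String
    var classNames: [String] = []
    var children: [ToyElement] = []
}

struct ToySelector {
    let tagName: String
    var className: String? = nil
    var position: Int? = nil

    func matches(_ element: ToyElement) -> Bool {
        guard element.tagName == tagName else { return false }
        if let className = className, !element.classNames.contains(className) {
            return false
        }
        return true
    }
}

/// Applies one selector per tree level, descending into the children of each match.
func findElements(in candidates: [ToyElement], matching path: [ToySelector]) -> [ToyElement] {
    guard let selector = path.first else { return [] }
    var matched = candidates.filter { selector.matches($0) }
    if let position = selector.position {
        // Assumption: the position indexes the elements that passed the other checks.
        matched = matched.indices.contains(position) ? [matched[position]] : []
    }
    let remaining = Array(path.dropFirst())
    if remaining.isEmpty { return matched }
    return findElements(in: matched.flatMap { $0.children }, matching: remaining)
}

let essay = ToyElement(tagName: "div", classNames: ["essay"], children: [
    ToyElement(tagName: "p", classNames: ["essay-paragraph", "opening-paragraph"]),
    ToyElement(tagName: "p", classNames: ["essay-paragraph", "body-paragraph"]),
    ToyElement(tagName: "p", classNames: ["essay-paragraph", "body-paragraph"])
])
let bibliography = ToyElement(tagName: "div", classNames: ["bibliography"])
let body = ToyElement(tagName: "body", children: [essay, bibliography])
let html = ToyElement(tagName: "html", children: [body])

let path = [
    ToySelector(tagName: "html"),
    ToySelector(tagName: "body"),
    ToySelector(tagName: "div", className: "essay"),
    ToySelector(tagName: "p", position: 0)
]
let matches = findElements(in: [html], matching: path)
print(matches.map { $0.classNames })
```

In this toy tree the path ends at the first `p` inside the `essay` div, so `matches` holds a single element.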

Tutorial

The HTML File We Will Use for the Following Examples

We will use the example file: simple.html

<!DOCTYPE html>
<html>
    <head>
        <title>This is a Simple Example</title>
    </head>
    <body>
        <h1>This is a Heading</h1>

        <div class="essay">
            <p class="essay-paragraph opening-paragraph">This is the first paragraph.</p>
            <p class="essay-paragraph body-paragraph">This is the second paragraph.</p>
            <p class="essay-paragraph body-paragraph">This is the third paragraph.</p>
            <p class="essay-paragraph body-paragraph">This is the fourth paragraph.</p>
            <p class="essay-paragraph closing-paragraph">This is the fifth paragraph.</p>

            <div>
                <h3>Editor Notes</h3>
                No notes here
            </div>
        </div>

        <div class="bibliography">
            <ul>
                <li id="citation-1998">This is the first citation.</li>
                <li id="citation-1999">This is the second citation.</li>
                <li id="citation-2000">This is the third citation.</li>
            </ul>

            <div>
                <h3>Bibliography Notes</h3>
                No notes here
            </div>
        </div>

    </body>
</html>

Find Matching Elements

func parseAndTraverseSimpleHTML() throws {
    // get string from file
    let fileURL = URL(fileURLWithPath: "/some/absolute/path/simple.html")
    let htmlString = try String(contentsOf: fileURL, encoding: .utf8)

    // parse the htmlString into a tree of node objects (DOM)
    let nodeTree = try HTMLParser.parse(htmlString)

    // create a node selector path to describe what nodes to match in the nodeTree
    let nodeSelectorPath: [NodeSelector] = [
        ElementSelector().withTagName("html"),
        ElementSelector().withTagName("body"),
        ElementSelector().withTagName("div").atPosition(0),
        ElementSelector().withTagName("p").withClassName("body-paragraph")
    ]

    // find the elements that match the nodeSelectorPath
    // notice we use the findElements() function which only matches elements
    let matchingElements = HTMLTraverser.findElements(in: nodeTree, matching: nodeSelectorPath)

    // matchingElements will contain the 3 matching <p> elements with the className 'body-paragraph'
    // will print: 3
    print(matchingElements.count)
}

Find a Matching Text Node

func parseAndTraverseSimpleHTMLTextNode() throws {
    // get string from file
    let fileURL = URL(fileURLWithPath: "/some/absolute/path/simple.html")
    let htmlString = try String(contentsOf: fileURL, encoding: .utf8)

    // parse the htmlString into a tree of node objects (DOM)
    let nodeTree = try HTMLParser.parse(htmlString)

    // create a node selector path to describe what nodes to match in the nodeTree
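    // this is equivalent to the selector: html > body > div.bibliography > ul > li#citation-1999
    // or xpath: /html/body/div[@class='bibliography']/ul/li[@id='citation-1999']/text()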
    let nodeSelectorPath: [NodeSelector] = [
        ElementSelector().withTagName("html"),
        ElementSelector().withTagName("body"),
        ElementSelector().withTagName("div").withClassName("bibliography"),
        ElementSelector().withTagName("ul"),
        ElementSelector().withTagName("li").withId("citation-1999"),
        TextNodeSelector()
    ]

    // find the nodes that match the nodeSelectorPath
    // Notice we use the findNodes() function which can match with any node type
    let matchingNodes = HTMLTraverser.findNodes(in: nodeTree, matching: nodeSelectorPath)

    // matchingNodes will contain the matching generic node
    // we have to cast the Node to a TextNode to access its text property
    guard let citationTextNode = matchingNodes.first as? TextNode else {
        // could not find the citation text node
        return
    }

    // will print: This is the second citation.
    print(citationTextNode.text)
}

Find Matching Elements Using a Child Node Selector Path

func parseAndTraverseSimpleHTMLChildNodeSelectorPath() throws {
    // get string from file
    let fileURL = URL(fileURLWithPath: "/some/absolute/path/simple.html")
    let htmlString = try String(contentsOf: fileURL, encoding: .utf8)

    // parse the htmlString into a tree of node objects (DOM)
    let nodeTree = try HTMLParser.parse(htmlString)

    // create a child node selector path that will match the parent node
    // only if the childNodeSelectorPath matches the element's child nodes
    let childNodeSelectorPath: [NodeSelector] = [
        ElementSelector().withTagName("div"),
        ElementSelector().withTagName("h3"),
        TextNodeSelector().withText("Editor Notes")
    ]

    // create a node selector path to describe what nodes to match in the nodeTree
    // Notice the last ElementSelector will only match if the element contains
    // child nodes that match the childNodeSelectorPath
    let nodeSelectorPath: [NodeSelector] = [
        ElementSelector().withTagName("html"),
        ElementSelector().withTagName("body"),
        ElementSelector().withTagName("div").withChildNodeSelectorPath(childNodeSelectorPath)
    ]

    // find the elements that match the nodeSelectorPath
    // notice we use the findElements() function which only matches elements
    let matchingElements = HTMLTraverser.findElements(in: nodeTree, matching: nodeSelectorPath)

    // matchingElements should only contain the div element with the 'essay' class name
    // will print: 1
    print(matchingElements.count)

    guard let divElement = matchingElements.first else {
        // could not find the matching div element
        return
    }

    guard let firstClassName = divElement.classNames.first else {
        // divElement does not have any classnames
        return
    }

    // will print: essay
    print(firstClassName)
}
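
The key point of a child node selector path is that it only filters: the result is the parent element, not the nodes that the child path reached. The toy model below sketches that behaviour. All names are hypothetical, text nodes are folded into a `text` property for brevity, and the library's real implementation may work differently:

```swift
// Toy model for illustration; the library's own types and algorithm may differ.
struct ToyElement {
    let tagName: String
    var className: String? = nil
    var text: String? = nil            // direct text content, if any (simplification)
    var children: [ToyElement] = []
}

struct ToySelector {
    let tagName: String
    var text: String? = nil
    var childPath: [ToySelector] = []  // extra condition on the element's descendants

    func matches(_ element: ToyElement) -> Bool {
        guard element.tagName == tagName else { return false }
        if let text = text, element.text != text { return false }
        // The child path acts as a filter: the element passes when the path
        // finds at least one match starting from its children.
        return childPath.isEmpty || !find(in: element.children, matching: childPath).isEmpty
    }
}

/// Applies one selector per tree level, descending into the children of each match.
func find(in candidates: [ToyElement], matching path: [ToySelector]) -> [ToyElement] {
    guard let selector = path.first else { return [] }
    let matched = candidates.filter { selector.matches($0) }
    let remaining = Array(path.dropFirst())
    if remaining.isEmpty { return matched }
    return find(in: matched.flatMap { $0.children }, matching: remaining)
}

let essay = ToyElement(tagName: "div", className: "essay", children: [
    ToyElement(tagName: "div", children: [ToyElement(tagName: "h3", text: "Editor Notes")])
])
let bibliography = ToyElement(tagName: "div", className: "bibliography", children: [
    ToyElement(tagName: "div", children: [ToyElement(tagName: "h3", text: "Bibliography Notes")])
])
let body = ToyElement(tagName: "body", children: [essay, bibliography])

let childPath = [
    ToySelector(tagName: "div"),
    ToySelector(tagName: "h3", text: "Editor Notes")
]
let path = [
    ToySelector(tagName: "body"),
    ToySelector(tagName: "div", childPath: childPath)
]
let matches = find(in: [body], matching: path)
print(matches.map { $0.className ?? "" })
```

Only the `essay` div passes, because it is the only top-level `div` whose nested `div` holds an `h3` with the text "Editor Notes"; the returned element is that outer `div`, not the `h3`.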

Testing

Automated testing was used to validate the parsing of tags, comments, single- and double-quoted attributes, embedded JavaScript, etc. Specially created sample HTML files as well as HTML from top sites were used in testing. However, not all cases may have been covered. If you discover a bug, please open an issue on GitHub and provide sample HTML so that it can be fixed and a test case can be added.

Run Tests Via the Command Line

swift test

Run Tests Via Docker

docker build -t swift-html-parser . && docker run -it swift-html-parser
