HTML Parser

An assignment for the Advanced Programming class (2019/03/28): a C++ HTML parser that generates a simple DOM tree.

Requirements

  • C++ compiler with C++17 support
  • CMake (>= 3.0)

Sample

This project contains a sample that reads HTML input from a file or stdin and prints the colorized DOM tree to the terminal.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ../src
make

# Read HTML input from stdin
./html-parser <<< '<div><a href=/qwq title="qaq">a</a> &le; b<!-- Comment --></div>'

# Read HTML input from index.html
./html-parser index.html

The output for the first example (the one that reads from stdin) is the colorized DOM tree of that fragment (screenshot not reproduced here).

API

Include HTMLDocument.h.

HTMLDocument

The interface for parsing an HTML string and getting data from it.

HTMLDocument::HTMLDocument

Construct an HTMLDocument object from a std::istream or a string.

// explicit HTMLDocument::HTMLDocument(std::istream &)
HTMLDocument document1(std::cin);

// explicit HTMLDocument::HTMLDocument(std::istream &&)
HTMLDocument document2(std::ifstream("index.html"));

// explicit HTMLDocument::HTMLDocument(const StringEx &)
HTMLDocument document3("<div>a &le; b</div>");

HTMLDocument::parse

Parse an HTML document from a new string, replacing the current document if one exists.

HTMLDocument document(std::cin);

// void HTMLDocument::parse(const StringEx &)
document.parse("<div>a &le; b</div>");

HTMLDocument::inspect

Print the colorized DOM tree of the HTML document to the terminal.

HTMLDocument document("<div>a &le; b</div>");

// void HTMLDocument::inspect()
document.inspect();

HTMLDocument::getTextContent

Get all text in the document.

HTMLDocument document("<div>a &le; b</div><div>qwq</div>");

// StringEx HTMLDocument::getTextContent()
StringEx textContent = document.getTextContent();
// textContent = "a ≤ bqwq"
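The example above shows that character references such as &le; come back decoded (as ≤) in the returned text. As a rough illustration of that step, here is a minimal decoder for a handful of named entities. The function name decodeEntities and the tiny lookup table are assumptions made for this sketch; how the library actually decodes entities, and which ones it supports, is not stated in this README.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch: decode a few named character references to UTF-8.
// Unknown or unterminated references are copied through unchanged.
std::string decodeEntities(const std::string &text) {
    static const std::unordered_map<std::string, std::string> table = {
        {"amp", "&"},  {"lt", "<"},  {"gt", ">"},  {"quot", "\""},
        {"le", "\xE2\x89\xA4"},  // U+2264, less-than or equal to
        {"ge", "\xE2\x89\xA5"},  // U+2265, greater-than or equal to
    };
    std::string result;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            std::size_t end = text.find(';', i + 1);
            if (end != std::string::npos) {
                auto it = table.find(text.substr(i + 1, end - i - 1));
                if (it != table.end()) {
                    result += it->second;  // replace "&name;" as a whole
                    i = end + 1;
                    continue;
                }
            }
        }
        result += text[i++];  // ordinary character or unrecognized reference
    }
    return result;
}
```

A complete decoder would also need numeric references (&#8804; and &#x2264;) and the full named-entity table from the HTML standard.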

HTMLDocument::getElementById

Get the element whose id attribute equals a given string. Return a HTMLDocument::Element object if found, or a null HTMLDocument::Element object if not found.

HTMLDocument document(R"(<div id="my-div">a &le; b</div>)");

// HTMLDocument::Element HTMLDocument::getElementById(const StringEx &)
HTMLDocument::Element div = document.getElementById("my-div");

HTMLDocument::getElementsByName

Get all elements whose name attribute equals a given string. Return a std::vector<HTMLDocument::Element> that contains all matching elements.

HTMLDocument document(R"(<div name="my">a &le; b</div><span name="my">qwq</span>)");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByName(const StringEx &)
std::vector<HTMLDocument::Element> elements = document.getElementsByName("my");

HTMLDocument::getElementsByTagName

Get all elements whose tag name equals a given string. Return a std::vector<HTMLDocument::Element> that contains all matching elements.

HTMLDocument document("<div>a &le; b</div><div>qwq</div>");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByTagName(const StringEx &)
std::vector<HTMLDocument::Element> elements = document.getElementsByTagName("div");
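Conceptually, a lookup like this is a depth-first walk over the DOM tree that collects matching nodes in document order. The sketch below uses a hypothetical Node struct and collectByTagName function, neither of which is part of the library, and it assumes the usual DOM convention that only descendants of the starting node are searched. The README does not say whether the library follows that convention.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal DOM node, for illustration only.
struct Node {
    std::string tagName;
    std::vector<Node> children;
};

// Collect all descendants of `root` whose tag name equals `tagName`,
// in document (pre-order) order. `root` itself is not tested.
void collectByTagName(const Node &root, const std::string &tagName,
                      std::vector<const Node *> &out) {
    for (const Node &child : root.children) {
        if (child.tagName == tagName) out.push_back(&child);
        collectByTagName(child, tagName, out);
    }
}
```

The same traversal shape would serve for the id, name, and class lookups; only the per-node predicate changes.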

HTMLDocument::getElementsByClassName

Get all elements which have a certain class. Return a std::vector<HTMLDocument::Element> that contains all matching elements.

HTMLDocument document(R"(<div class="my-class">a &le; b</div><div class="my-class">qwq</div>)");

// std::vector<HTMLDocument::Element> HTMLDocument::getElementsByClassName(const StringEx &)
std::vector<HTMLDocument::Element> elements = document.getElementsByClassName("my-class");
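In HTML, the class attribute holds a whitespace-separated list of class names, so "has a certain class" normally means that one of those tokens equals the requested name. A minimal sketch of that test follows. The function name hasClass is hypothetical, and whether the library tokenizes the attribute this way is an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch: true if `className` is one of the
// whitespace-separated tokens in the value of a class attribute.
bool hasClass(const std::string &classAttribute, const std::string &className) {
    std::istringstream tokens(classAttribute);
    std::string token;
    while (tokens >> token) {
        if (token == className) return true;
    }
    return false;
}
```

Comparing whole tokens rather than substrings keeps "my-class" from matching an element whose class is "my-class-2".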

HTMLDocument::getTitle

Get the page title (i.e. text inside the first <title> tag) of the document.

HTMLDocument document("<title>a &le; b</title>");

// StringEx HTMLDocument::getTitle()
StringEx title = document.getTitle();
// title = "a ≤ b"

HTMLDocument::getArticleContent

Get the document's article content (i.e. the text inside all <p> tags), with the paragraphs separated by \n.

HTMLDocument document("<p>a &le; b</p><div>QAQ</div><p>qwq</p>");

// StringEx HTMLDocument::getArticleContent()
StringEx content = document.getArticleContent();
// content = "a ≤ b\nqwq"
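The joining step in the description (one piece of text per <p> tag, separated by \n) can be sketched as follows. The function joinParagraphs is a hypothetical helper that is not part of the library, and it assumes the paragraph texts have already been extracted.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: join paragraph texts with a single '\n' between
// them, without a leading or trailing separator.
std::string joinParagraphs(const std::vector<std::string> &paragraphs) {
    std::string content;
    for (std::size_t i = 0; i < paragraphs.size(); ++i) {
        if (i > 0) content += '\n';
        content += paragraphs[i];
    }
    return content;
}
```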

HTMLDocument::Element

The interface for getting data from an HTML element or its subtree.

The default constructor constructs an empty element; any operation on an empty element throws a std::invalid_argument exception. Check with if (element) before using it.
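The check-before-use pattern described above can be illustrated with a small stand-in class. ElementHandle below is hypothetical and is not the library's HTMLDocument::Element. It only mirrors the documented behaviour: a default-constructed handle is empty, it converts to false in a condition, and operations on it throw std::invalid_argument.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for HTMLDocument::Element, for illustration only.
class ElementHandle {
public:
    ElementHandle() = default;  // empty handle, like a failed lookup
    explicit ElementHandle(std::string text)
        : valid_(true), text_(std::move(text)) {}

    // Enables `if (element) { ... }`.
    explicit operator bool() const { return valid_; }

    const std::string &getTextContent() const {
        if (!valid_) throw std::invalid_argument("operation on an empty element");
        return text_;
    }

private:
    bool valid_ = false;
    std::string text_;
};
```

With the real class, the same guard would wrap the result of a lookup such as getElementById before calling any member function on it.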

HTMLDocument::Element::inspect

Print the colorized DOM tree of this element to the terminal.

HTMLDocument document(R"(<div id="wrapper"><div>a &le; b</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// void HTMLDocument::Element::inspect()
element.inspect();

HTMLDocument::Element::getTextContent

Get all text in the element.

HTMLDocument document(R"(<div id="wrapper"><div>a &le; b</div><div>qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// StringEx HTMLDocument::Element::getTextContent()
StringEx textContent = element.getTextContent();
// textContent = "a ≤ bqwq"

HTMLDocument::Element::getAttribute

Get the attribute with the specified name from the element. Return an empty string if not found.

HTMLDocument document(R"(<div id="wrapper" data-url="/qwq"></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// StringEx HTMLDocument::Element::getAttribute(const StringEx &)
StringEx value = element.getAttribute("data-url");
// value = "/qwq"
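The "empty string if not found" convention can be sketched with an ordinary map lookup. The function getAttributeOrEmpty and the use of std::map are assumptions for illustration only. The README does not describe how the library stores attributes.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: look up an attribute by name and return an empty
// string when the attribute is absent.
std::string getAttributeOrEmpty(
    const std::map<std::string, std::string> &attributes,
    const std::string &name) {
    auto it = attributes.find(name);
    return it == attributes.end() ? std::string() : it->second;
}
```

One consequence of this convention is that a missing attribute and an attribute whose value is the empty string look the same to the caller.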

HTMLDocument::Element::getElementsByTagName

Get all elements whose tag name equals a given string. Return a std::vector<HTMLDocument::Element> that contains all matching elements.

HTMLDocument document(R"(<div id="wrapper"><div>a &le; b</div><div>qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByTagName(const StringEx &)
std::vector<HTMLDocument::Element> elements = element.getElementsByTagName("div");

HTMLDocument::Element::getElementsByClassName

Get all elements which have a certain class. Return a std::vector<HTMLDocument::Element> that contains all matching elements.

HTMLDocument document(R"(<div id="wrapper"><div class="my-class">a &le; b</div><div class="my-class">qwq</div></div>)");
HTMLDocument::Element element = document.getElementById("wrapper");

// std::vector<HTMLDocument::Element> HTMLDocument::Element::getElementsByClassName(const StringEx &)
std::vector<HTMLDocument::Element> elements = element.getElementsByClassName("my-class");
