lexbor's Introduction

Lexbor: Crafting a Browser Engine with Simplicity and Flexibility

Why build yet another browser engine? Developers face a myriad of challenges in fully utilizing modern web technologies. Parsing HTML and CSS and dealing with URLs and encodings often involve slow, resource-heavy implementations or outdated solutions. Even established solutions, written in C++ and reaching tens of megabytes in size, are often not versatile enough. Meanwhile, language-specific implementations for Python, Node.js, Rust, or any other favorite of the day are slow and prone to lock-in.

The Core Requirements

Lexbor's core requirements rose from the ashes of these challenges:

Portability

Lexbor aims to adapt to different platforms and integrate into various programming languages. It's not yet another library full of quirks and idiosyncrasies; Lexbor aims to offer developers the flexibility to incorporate it into their work directly, regardless of the programming language they choose.

Modularity

Lexbor wants to keep things simple: Developers should be able to use only the parts they need. Whether it's an HTML or URL parser, the engine's code should be straightforward and easy to navigate, promoting rapid development.

Speed

In a nutshell, Lexbor wants things to happen real fast. It's not just about making a browser engine; it's about making sure that everything, even the most resource-intensive tasks such as HTML parsing, occurs swiftly to meet the real-time demands of modern web applications.

Independence

Lexbor empowers developers by giving them full control over algorithms, resources, and dimensions. By eliminating external dependencies, we let developers customize the engine without sacrificing performance or features.

Compliance

Lexbor commits to industry standards. Developers need to be sure that the code aligns with widely established specifications. The output of Lexbor's modules, be it HTML, CSS, URLs, or others, should match that of modern browsers, meeting industry specifications.

Origin Story

Having had all these goals in mind for about a decade, Alexander Borisov, whose name gave the project its title, came up with the idea of a browser engine crafted entirely in C (there's no school like the old school). The language was chosen simply because we believed it could meet all the criteria seamlessly.

Unlike heavyweights such as WebKit or Blink, Lexbor takes a lean and focused approach, delivering a nimble yet powerful browser engine. All it takes is years of top-notch developer expertise.

An important point to make: Lexbor doesn't stop at parsing and rendering modern HTML. It offers each component as a standalone entity, ready to be integrated into other people's projects. This approach sets us apart, providing a modular solution that not only meets browser needs but also empowers developers with versatile tools for their own web-related tasks.

All in all, we envision Lexbor as a promising player in the menagerie of browser technologies, pushing the boundaries and helping developers fully leverage modern web technologies.

Features

  • Modules.
  • Single or separate libraries for each module.
  • No outside dependencies.
  • Easy to port to any platform.
  • C99 support.
  • Speed.

HTML Module

CSS Module

Selectors Module

  • Search for HTML elements using CSS selectors.
  • Fast.

Encoding Module

URL Module

Punycode Module

Unicode Module

  • Unicode Standard Annex #15.
    • Support Unicode normalization forms: D (NFD), C (NFC), KD (NFKD), KC (NFKC).
    • Support chunks (stream).
  • Unicode Technical Standard #46.
  • Fast.

Development of modules in progress

  • Layout
  • Font
  • and so on

Build and Installation

Binary packages

Binaries are available for:

  • CentOS 6, 7, 8
  • Debian 8, 9, 10, 11
  • Fedora 28, 29, 30, 31, 32, 33, 34, 36, 37
  • RHEL 7, 8
  • Ubuntu 14.04, 16.04, 18.04, 18.10, 19.04, 19.10, 20.04, 20.10, 21.04, 22.04

Binaries are currently available only for the x86_64 architecture. If you need any other architecture, please write to [email protected].

vcpkg

For vcpkg users, there is a lexbor port that can be installed via vcpkg install lexbor or by adding it to the dependencies section of your vcpkg.json file.

macOS

Homebrew

To install lexbor on macOS from Homebrew:

brew install lexbor

MacPorts

To install lexbor on macOS from MacPorts:

sudo port install lexbor

Source code

To build and install the Lexbor library from source code, use CMake (an open-source, cross-platform build system).

cmake . -DLEXBOR_BUILD_TESTS=ON -DLEXBOR_BUILD_EXAMPLES=ON
make
make test

For more information, please see the documentation.

Single or separately

Single

  • liblexbor — this is a single library that includes all modules.

Separately

  • liblexbor-{module name} — libraries for each module.

You only need an HTML parser? Use liblexbor-html.

Separate modules may depend on each other. For example, dependencies for liblexbor-html: liblexbor-core, liblexbor-dom, liblexbor-tag, liblexbor-ns.

The liblexbor-html library already contains all the pointers to the required dependencies. Just link against it when building: gcc program.c -llexbor-html.

External Bindings and Wrappers

  • Elixir binding for the HTML module (since 2.0 version)
  • Crystal Fast HTML5 Parser with CSS selectors for Crystal language
  • Python binding for modest and lexbor engines.
  • D Fast HTML5 Parser with CSS selectors for D programming language
  • Ruby Fast HTML5 Parser with both CSS selectors and XPath support.

You can create a binding or wrapper for lexbor and add the link here!

Documentation

Available on lexbor.com in the Documentation section.

Roadmap

Please see the roadmap on lexbor.com.

Getting Help

AUTHOR

Alexander Borisov [email protected]

COPYRIGHT AND LICENSE

Lexbor.

Copyright 2018-2024 Alexander Borisov

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Please, see LICENSE file.

lexbor's People

Contributors

afflerbach, alexandruica, azq2, barracuda156, davidkorczynski, eltociear, helios-vmg, kostya, lanodan, lexborisov, mardy, niansa, nielsdos, nmeum, own2pwn, petk, phoerious, pospelove, rushter, searene, sonertari, timgates42, trikko, vtorri, zyc9012

lexbor's Issues

cc missing (MSYS2)

With the gcc package of MSYS2, cc is missing.

Maybe cc should be defined as gcc in the Makefile if it is not present.

Linking of tests fails on Windows (mingw-w64)

With gcc on Windows, the link command must follow a specific order :

gcc -shared -o foo bar.o -ldep1 -ldep2

(-o foo can be anywhere, though)

but when running make test:

  1. cc -shared -L/home/vtorri/gitroot/lexbor/lib -llexbor ./unit/kv.o ./unit/test.o ./unit/kv_state.o ./unit/kv_rules.o -o /home/vtorri/gitroot/lexbor/test/lexbor/unit/libtest.dll

-llexbor must come after all the .o files, not before; otherwise there are undefined references.

  2. cc -Wall -Werror -pipe -Wno-unused-function -Itest -Wno-unused-variable -Wno-unused-function -std=c99 -I. -I/home/vtorri/gitroot/lexbor/source -L/home/vtorri/gitroot/lexbor/lib -Lunit -ltest -llexbor core/array.c -o core/array

Again, move -ltest -llexbor to the end (same errors).

The order must go from the most dependent library to the least (hence -ltest -llexbor and not -llexbor -ltest).

undefined symbols when building tests

Problem: some symbols (like lxb_encoding_single_index_windows_874 or lxb_encoding_res_map) are declared as extern. On UNIX this causes no problem because the linker exports everything by default. On Windows it is the opposite: every symbol is hidden.

So they must be declared with LXB_API. Note that this will also fix the undefined references on UNIX if -fvisibility=hidden is passed to the compiler.
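The kind of export annotation discussed here can be sketched in C as follows. This is only an illustration: DEMO_API, demo_table and demo_lookup are hypothetical names, and lexbor's real LXB_API definition may be organized differently.

```c
#include <assert.h>

/*
 * DEMO_API is a hypothetical stand-in for an export macro such as LXB_API.
 * On Windows a symbol has to be exported explicitly; with GCC or Clang the
 * attribute keeps the symbol visible even under -fvisibility=hidden.
 */
#if defined(_WIN32)
#  define DEMO_API __declspec(dllexport)
#elif defined(__GNUC__)
#  define DEMO_API __attribute__((visibility("default")))
#else
#  define DEMO_API
#endif

/* A data object used across library boundaries carries the macro too. */
DEMO_API extern const int demo_table[3];

/* Definition of the exported object. */
DEMO_API const int demo_table[3] = { 10, 20, 30 };

/* Exported function that reads from the exported table. */
DEMO_API int demo_lookup(int index)
{
    assert(index >= 0 && index < 3);
    return demo_table[index];
}
```

The point of the sketch is that data symbols need the same annotation as functions; a plain extern declaration alone does not export them on Windows.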

Linker errors on Windows + CMake

I am having trouble building lexbor with my C++ project. The build fails with the error below when trying to link the executable with a statically built lexbor.

SomeInternalClass.cpp.obj : error LNK2019: unresolved external symbol __imp__lxb_html_document_create referenced in function "public: void __thiscall SomeInternalClass::func(class QString const &,class QString const &)" (?func@SomeInternalClass@@QAEXABVQString@@0@Z)
SomeInternalClass.cpp.obj : error LNK2019: unresolved external symbol __imp__lxb_html_document_destroy referenced in function "public: void __thiscall SomeInternalClass::func(class QString const &,class QString const &)" (?func@SomeInternalClass@@QAEXABVQString@@0@Z)
SomeInternalClass.cpp.obj : error LNK2019: unresolved external symbol __imp__lxb_html_document_parse referenced in function "public: void __thiscall SomeInternalClass::func(class QString const &,class QString const &)" (?func@SomeInternalClass@@QAEXABVQString@@0@Z)
SomeInternalClass.cpp.obj : error LNK2019: unresolved external symbol __imp__lxb_dom_element_qualified_name referenced in function "public: void __thiscall SomeInternalClass::func(class QString const &,class QString const &)" (?func@SomeInternalClass@@QAEXABVQString@@0@Z)
my.exe : fatal error LNK1120: 4 unresolved externals

I have added lexbor as a submodule and included it in my CMakeLists.txt as follows:

add_subdirectory(lexbor)
include_directories(lexbor/source)

target_link_libraries(my lexbor_static)

Can anyone please shed some light on this?

How to use token outside of parsing context

Hi, I'm trying the tokenizer, and it works super fast compared to myhtml. But I have a question: I want to store a token and use it later (after lxb_html_tokenizer_end, but before lxb_html_tokenizer_destroy). Right now I just save the pointer, but if I try to get tag_id it returns LXB_TAG__END_OF_FILE for any tag. I tried options other than LXB_HTML_TOKENIZER_OPT_WO_COPY, but that does not help.

Edit: actually, maybe I should store the full token struct instead of a pointer. Checking it.
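The difference between keeping a pointer and keeping a copy can be shown with a small self-contained sketch. It assumes a tokenizer that reuses one token object for every callback, which would explain the symptom but is not confirmed here; all demo_* names are hypothetical and none of this is lexbor's API.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_TAG_EOF = 1, DEMO_TAG_DIV = 2, DEMO_TAG_A = 3 };

typedef struct { int tag_id; } demo_token_t;

/* Hypothetical tokenizer that reuses a single token object (assumption). */
typedef struct { demo_token_t scratch; } demo_tokenizer_t;

typedef void (*demo_token_cb)(const demo_token_t *token, void *ctx);

typedef struct {
    const demo_token_t *by_pointer; /* aliases the tokenizer's scratch token */
    demo_token_t by_value;          /* independent copy of the token */
    int saved;
} demo_saved_t;

/* Callback: remember the first token both ways. */
void demo_save_first(const demo_token_t *token, void *ctx)
{
    demo_saved_t *saved = (demo_saved_t *) ctx;

    if (!saved->saved) {
        saved->by_pointer = token;
        saved->by_value = *token;   /* struct copy survives later reuse */
        saved->saved = 1;
    }
}

/* Emit every tag, then an end-of-file token, through the same object. */
void demo_run(demo_tokenizer_t *tkz, const int *tags, size_t count,
              demo_token_cb cb, void *ctx)
{
    size_t i;

    assert(tkz != NULL && cb != NULL);

    for (i = 0; i < count; i++) {
        tkz->scratch.tag_id = tags[i];
        cb(&tkz->scratch, ctx);
    }

    tkz->scratch.tag_id = DEMO_TAG_EOF;
    cb(&tkz->scratch, ctx);
}
```

After demo_run returns, by_value still holds the first tag, while by_pointer reads whatever was written last, which in this sketch is the end-of-file tag. Note that a shallow struct copy only helps for fields stored inline; anything the token points to would need its own copy.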

Skipping nodes in lxb_dom_node_simple_walk?

In #72 I was given a solution for walking over the DOM tree. The example includes

        case LXB_TAG__EM_COMMENT:
        case LXB_TAG_SCRIPT:
        case LXB_TAG_STYLE:
            /* Skip node and his children's. */
            return LEXBOR_ACTION_NEXT;

However, I find that LEXBOR_ACTION_NEXT does not actually skip the nodes. Indeed there is no special handling of this return value in:

https://github.com/lexbor/lexbor/blob/master/source/lexbor/dom/interfaces/node.c#L248

And it appears that the other clients of lxb_dom_node_simple_walk assume that LEXBOR_ACTION_NEXT does not skip.

I would like a solution for skipping nodes during a node walk without writing my own walking function.
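For comparison, here is a minimal sketch of a walker with an explicit "skip children" action. It runs on a hypothetical demo_node_t tree rather than on lexbor's DOM, and the action names are invented for illustration; it shows the requested behaviour, not how lxb_dom_node_simple_walk works.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_TAG_TEXT = 1, DEMO_TAG_DIV, DEMO_TAG_SCRIPT, DEMO_TAG_STYLE };

/* Hypothetical node type with parent, first-child and next-sibling links. */
typedef struct demo_node {
    int tag;
    struct demo_node *parent;
    struct demo_node *first_child;
    struct demo_node *next;
} demo_node_t;

typedef enum {
    DEMO_WALK_CONTINUE,       /* descend into children as usual */
    DEMO_WALK_SKIP_CHILDREN,  /* do not visit this node's subtree */
    DEMO_WALK_STOP            /* end the walk immediately */
} demo_action_t;

typedef demo_action_t (*demo_walk_cb)(demo_node_t *node, void *ctx);

/* Pre-order walk over the descendants of root, without recursion. */
void demo_walk(demo_node_t *root, demo_walk_cb cb, void *ctx)
{
    demo_node_t *node;

    assert(root != NULL && cb != NULL);
    node = root->first_child;

    while (node != NULL) {
        demo_action_t action = cb(node, ctx);

        if (action == DEMO_WALK_STOP) {
            return;
        }
        if (action != DEMO_WALK_SKIP_CHILDREN && node->first_child != NULL) {
            node = node->first_child;
            continue;
        }
        /* Climb until a next sibling exists or we are back at root. */
        while (node != root && node->next == NULL) {
            node = node->parent;
        }
        if (node == root) {
            return;
        }
        node = node->next;
    }
}

/* Example callback: count nodes, ignoring script/style subtrees. */
demo_action_t demo_count_unskipped(demo_node_t *node, void *ctx)
{
    int *count = (int *) ctx;

    if (node->tag == DEMO_TAG_SCRIPT || node->tag == DEMO_TAG_STYLE) {
        return DEMO_WALK_SKIP_CHILDREN;
    }
    (*count)++;
    return DEMO_WALK_CONTINUE;
}
```

The only change from a plain pre-order walk is the check before descending into first_child; everything else stays the same, so existing callers that never return the skip action are unaffected.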

[discussion] about memory allocation wrappers

Some questions:

  1. Why did you add wrappers around memory management?

  2. Why does the wrapper around free() return a void *? That seems strange.

  3. Do you plan to use memory pools?

  4. Have you considered using jemalloc?

Is there any plan to support C++?

Hi,
I just tried using this library in a C++ project. However, several errors occurred. Here is my code:

#include <iostream>

#include <lexbor/core/base.h>
#include <lexbor/core/types.h>
#include <lexbor/html/parser.h>
#include <lexbor/html/serialize.h>
#include <lexbor/html/interfaces/element.h>

int main() {
    lxb_html_document_t *document;

    const lxb_char_t html[] = "<div>V</div>";
    size_t html_len = sizeof(html) - 1;

    document = lxb_html_document_create();

    lxb_html_document_parse(document, html, html_len);
}

Here is the output.

/usr/bin/clang++   -I/home/searene/CLionProjects/CMakeTest/lexbor/source  -std=c++17 -g -O0 -Wall -g   -std=gnu++17 -o CMakeFiles/CMakeTestExecutable.dir/main.cpp.o -c /home/searene/CLionProjects/CMakeTest/main.cpp
In file included from /home/searene/CLionProjects/CMakeTest/main.cpp:5:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/parser.h:15:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tree.h:18:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/html.h:16:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/interfaces/document.h:19:
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/dom/interfaces/document.h:129:12: error: cannot initialize return object of type 'lxb_char_t *' (aka 'unsigned char *') with an rvalue of type 'void *'
    return lexbor_mraw_alloc(document->text, sizeof(lxb_char_t) * len);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/searene/CLionProjects/CMakeTest/main.cpp:5:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/parser.h:15:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tree.h:19:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tokenizer.h:19:
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/token.h:129:64: error: expected ')'
                             lexbor_str_t *name, lexbor_str_t *public,
                                                               ^
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/token.h:128:29: note: to match this '('
lxb_html_token_doctype_parse(lxb_html_token_t *token, lexbor_mraw_t *mraw,
                            ^
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/token.h:150:12: error: cannot initialize return object of type 'lxb_html_token_t *' with an rvalue of type 'void *'
    return lexbor_dobject_calloc(dobj);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/searene/CLionProjects/CMakeTest/main.cpp:5:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/parser.h:15:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tree.h:21:
In file included from /home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag.h:44:
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1888:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "altGlyph", 8}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1890:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "altGlyphDef", 11}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1892:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "altGlyphItem", 12}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1894:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "animateColor", 12}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1896:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "animateMotion", 13}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1898:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "animateTransform", 16}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1944:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "clipPath", 8}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1980:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feBlend", 7}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1982:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feColorMatrix", 13}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1984:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feComponentTransfer", 19}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1986:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feComposite", 11}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1988:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feConvolveMatrix", 16}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1990:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feDiffuseLighting", 17}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1992:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feDisplacementMap", 17}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1994:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feDistantLight", 14}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/searene/CLionProjects/CMakeTest/lexbor/source/lexbor/html/tag_res.h:1996:5: error: taking the address of a temporary object of type 'lxb_html_tag_fixname_t' [-Waddress-of-temporary]
    &((lxb_html_tag_fixname_t) {(const lxb_char_t *) "feDropShadow", 12}),
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[3]: *** [CMakeFiles/CMakeTestExecutable.dir/build.make:66: CMakeFiles/CMakeTestExecutable.dir/main.cpp.o] Error 1
make[3]: Leaving directory '/home/searene/CLionProjects/CMakeTest/cmake-build-debug'
make[2]: *** [CMakeFiles/Makefile2:76: CMakeFiles/CMakeTestExecutable.dir/all] Error 2
make[2]: Leaving directory '/home/searene/CLionProjects/CMakeTest/cmake-build-debug'
make[1]: *** [CMakeFiles/Makefile2:88: CMakeFiles/CMakeTestExecutable.dir/rule] Error 2
make[1]: Leaving directory '/home/searene/CLionProjects/CMakeTest/cmake-build-debug'
make: *** [Makefile:189: CMakeTestExecutable] Error 2

Take the first error as an example.

error: cannot initialize return object of type 'lxb_char_t *' (aka 'unsigned char *') with an rvalue of type 'void *'

This kind of implicit conversion is allowed in C but forbidden in C++. So clearly this library does not support C++. Do you have any plan to support C++ in the future?
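One common way to make such a line acceptable to both languages is an explicit cast on the allocator's result. The sketch below uses malloc and hypothetical demo_* names to mirror the shape of the failing return statement; it is not taken from lexbor and does not address the other errors in the log.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char demo_char_t;

/*
 * Allocate len characters. The explicit cast is redundant in C, but it is
 * what makes the same line compile as C++ too.
 */
demo_char_t *demo_text_alloc(size_t len)
{
    return (demo_char_t *) malloc(sizeof(demo_char_t) * len);
}

/* Copy a NUL-terminated string into freshly allocated storage. */
demo_char_t *demo_text_dup(const char *src)
{
    size_t len;
    demo_char_t *copy;

    assert(src != NULL);

    len = strlen(src) + 1;          /* include the terminating NUL */
    copy = demo_text_alloc(len);
    if (copy != NULL) {
        memcpy(copy, src, len);
    }
    return copy;
}
```

The caller owns the returned buffer and releases it with free().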

Abnormally high memory usage in exceptional cases

When I decompress the attached file broken.html.gz and execute

./html2sexpr broken.html

the example program uses more than 12GB RAM and gets killed by the kernel. The uncompressed version of the file is 5MB.

The file seems to be degenerate in a systematic way, so I tried to break it down with the following commands:

export N=100
head -n $N broken.html > new.html
tail -n $N broken.html >> new.html
/usr/bin/time -f '%M' ./html2sexpr new.html

and I plotted the result into a graph:

https://docs.google.com/spreadsheets/d/1CXq7bcITtlaiS1hh-MILxQUUzvf8yuxEG9CZagEZ5K0/edit?usp=sharing

It looks as if the memory growth is quadratic, so it can grow very quickly.

I can reproduce the problem with the latest master version, and with the release v0.4.0.

Is there a way to optimize the parsing algorithm here? Firefox and Chrome manage to open the file pretty easily.

Undefined reference

I'm trying to build a program by copying the example from the Quick Start (https://lexbor.com/docs/lexbor/),
but the build fails with:

 undefined reference to `lxb_html_document_create'
undefined reference to `lxb_html_document_parse'
undefined reference to `lxb_html_document_destroy'

Platform: Linux. IDE: NetBeans. C standard: C11.

The IDE shows a warning on the include of <lexbor/html/parser.h>:

there is an unresolved #include <stddef.h> in the included file /usr/include/stdio.h

Unexpected tokenizer results

It is quite hard to work with the tokenizer because of unexpected results.
Examples from running tokenizer/callback:

not closed <!--

HTML:
<body>  <script>    <!--      text  </script>  <a href=#></a>  </body>

Result:
Tag name: body; Tag id: 31; Is close: false
Tag name: #text; Tag id: 2; Is close: false
Tag name: script; Tag id: 161; Is close: false
Tag name: #text; Tag id: 2; Is close: false
Tag name: !--; Tag id: 4; Is close: false
Tag name: #end-of-file; Tag id: 1; Is close: false

HTML:
<body>  <script>    document.write("<a>text</a>")  </script>  <a href=#></a>  </body>

Result:
Tag name: body; Tag id: 31; Is close: false
Tag name: #text; Tag id: 2; Is close: false
Tag name: script; Tag id: 161; Is close: false
Tag name: #text; Tag id: 2; Is close: false
Tag name: a; Tag id: 6; Is close: false
Tag name: #text; Tag id: 2; Is close: false
Tag name: a; Tag id: 6; Is close: true
Tag name: #text; Tag id: 2; Is close: false
Tag name: script; Tag id: 161; Is close: true
Tag name: #text; Tag id: 2; Is close: false
Tag name: a; Tag id: 6; Is close: false
Tag name: a; Tag id: 6; Is close: true
Tag name: #text; Tag id: 2; Is close: false
Tag name: body; Tag id: 31; Is close: true
Tag name: #end-of-file; Tag id: 1; Is close: false

By the way, html2sexpr shows correct results.
Is this expected?

building with static disabled and separated libs : cmake error

The error is:

  Cannot add target-level dependencies to non-existent target
  "lexbor-css_static".

  The add_dependencies works for top-level logical targets created by the
  add_executable, add_library, or add_custom_target commands.  If you want to
  add file-level dependencies see the DEPENDS option of the add_custom_target
  and add_custom_command commands.
Call Stack (most recent call first):
  CMakeLists.txt:224 (SET_MODULE_LIB_DEPENDENCIES)


CMake Error at config.cmake:302 (target_link_libraries):
  Cannot specify link libraries for target "lexbor-css_static" which is not
  built by this project.
Call Stack (most recent call first):
  CMakeLists.txt:224 (SET_MODULE_LIB_DEPENDENCIES)

I think the problem is line 223 or 224 (I have local changes) of the top-level CMakeLists.txt:

SET_MODULE_LIB_DEPENDENCIES("${libname}_static" "${module_deps}" "_static")

It should be added only if the static library is created.

And I think it's the same around line 170:

SET_MODULE_LIB_DEPENDENCIES("${libname}_static" "${module_deps}" "_static")

Swift package

Do you think it would be possible to offer a Package.swift file for easy integration with the Swift package manager?

I see you're using Xcode for development so I figured it might be something you'd consider for the main repo.

cmake build errors

Hello! I've just included html.h and dom.h, but I can't build it...

/usr/local/include/lexbor/dom/interfaces/attr_res.h:24:7: error: expected primary-expression before ‘.’ token
     {{.u.short_str = "#undef", .length = 6, .next = NULL},

Lexbor was installed without any flags...

Not found by CSS class... why?

Hello! I'm trying to use a simple example, but it is not working... Can you check it?

/*
 * Copyright (C) 2018 Alexander Borisov
 *
 * Author: Alexander Borisov <[email protected]>
 */

#include "base.h"
#include <lexbor/dom/dom.h>
#include <string>

int main(int argc, const char *argv[]) {
    lxb_status_t status;
    lxb_dom_element_t *element;
    lxb_html_document_t *document;
    lxb_dom_collection_t *collection;

    const lxb_char_t html[] = "<div class=\"best blue some\">\n"
                              "    <div class=\"red pref_best grep\">\n"
                              "        <div class=\"red best grep\">\n"
                              "            <div class=\"ev-scoreboard__team-logo--2UDIQ red c++ best\"></div>\n"
                              "        </div>\n"
                              "    </div>\n"
                              "</div>";

    size_t html_size = sizeof(html) - 1;

    document = parse(html, html_size);

    collection = lxb_dom_collection_make(&document->dom_document, 128);
    if (collection == NULL) {
        FAILED("Failed to create Collection object");
    }

    const std::string className = "ev-scoreboard__team-logo--2UDIQ";

//    status = lxb_dom_elements_by_class_name(lxb_dom_interface_element(document->body),
//                                            collection,
//                                            reinterpret_cast<const lxb_char_t *>((unsigned char *) className.c_str()),
//                                            className.length());

    status = lxb_dom_elements_by_class_name(lxb_dom_interface_element(document->body),
                                            collection,
                                            (const lxb_char_t *) "ev-scoreboard__team-logo--2UDIQ",
                                            31);

    if (status != LXB_STATUS_OK) {
        FAILED("Failed to get elements by name");
    }

    PRINT("HTML:");
    PRINT("%s", (const char *) html);
    PRINT("%s", ("\nFind all elements by class name '" + className + "'.").c_str());
    PRINT("%s", ("Elements found: " + std::to_string(lxb_dom_collection_length(collection))).c_str());

    for (size_t i = 0; i < lxb_dom_collection_length(collection); i++) {
        element = lxb_dom_collection_element(collection, i);
        serialize_node(lxb_dom_interface_node(element));
    }

    lxb_dom_collection_destroy(collection, true);
    lxb_html_document_destroy(document);

    return 0;
}

Output:

HTML:
<div class="best blue some">
    <div class="red pref_best grep">
        <div class="red best grep">
            <div class="ev-scoreboard__team-logo--2UDIQ red c++ best"></div>
        </div>
    </div>
</div>

Find all elements by class name 'ev-scoreboard__team-logo--2UDIQ'.
Elements found: 0

Process finished with exit code 0
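For reference, the behaviour the report expects can be stated as: a class name matches when it equals one of the whitespace-separated tokens of the class attribute. The sketch below implements just that rule in plain C. It is a hypothetical helper (demo_has_class), assumes case-sensitive comparison, and says nothing about how lxb_dom_elements_by_class_name is implemented or why it returned zero elements here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII whitespace that separates tokens in a class attribute. */
static bool demo_is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* True if name equals one whole token of the class attribute value. */
bool demo_has_class(const char *attr, const char *name)
{
    size_t name_len;
    const char *p = attr;

    assert(attr != NULL && name != NULL);

    name_len = strlen(name);
    if (name_len == 0) {
        return false;
    }

    while (*p != '\0') {
        const char *start;

        /* Skip separators, then measure the next token. */
        while (*p != '\0' && demo_is_space(*p)) {
            p++;
        }
        start = p;
        while (*p != '\0' && !demo_is_space(*p)) {
            p++;
        }

        if ((size_t) (p - start) == name_len
            && memcmp(start, name, name_len) == 0)
        {
            return true;
        }
    }

    return false;
}
```

With the attribute values from the example above, the innermost div matches the searched class, and "pref_best" does not match "best" because only whole tokens count.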

Error: invalid conversion from ‘void*’ to ‘lxb_char_t*’ {aka ‘unsigned char*’}

Issue

I built lexbor from source using the following command:

cmake . -DLEXBOR_BUILD_TESTS=ON -DLEXBOR_BUILD_EXAMPLES=ON -DLEXBOR_BUILD_SEPARATELY=ON
make
sudo make install

Then I got an error from this code:

// test.cpp
#include <lexbor/html/parser.h>

int main()
{
}

Compiled it using:

g++ test.cpp

Error

The error message received is:

In file included from /usr/local/include/lexbor/dom/qualified_name.h:15,
                 from /usr/local/include/lexbor/dom/interfaces/attr.h:15,
                 from /usr/local/include/lexbor/html/tree.h:15,
                 from /usr/local/include/lexbor/html/parser.h:15,
                 from test.cpp:1:
/usr/local/include/lexbor/html/parser_char.h: In function ‘lxb_status_t lxb_html_str_append(lexbor_str_t*, lexbor_mraw_t*, const lxb_char_t*, size_t)’:
/usr/local/include/lexbor/html/parser_char.h:99:5: error: invalid conversion from ‘void*’ to ‘lxb_char_t*’ {aka ‘unsigned char*’} [-fpermissive]
   99 |     lexbor_str_check_size_arg_m(str, lexbor_str_size(str), mraw, (length + 1),
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |     |
      |     void*

System Configuration

  • Operating system
$ uname -a
Linux debian 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
  • Compiler specifications
$ gcc --version
gcc (Debian 9.2.1-21) 9.2.1 20191130

lxb_dom_node_text_content and <br> elements

Thank you for the library! I am trying to extract human-readable text from an element and its descendants similar to .textContent. However I am finding that

<div>John Lennon<br>Paul McCartney</div>

is returned as John LennonPaul McCartney when using lxb_dom_node_text_content.

I would like <br> elements replaced with newlines, and non-semantic whitespace replaced with single spaces. How would you suggest I proceed?
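One way to get this result is a custom tree walk that emits a newline for each br element and collapses runs of whitespace elsewhere. The sketch below does this on a hypothetical demo_node_t tree with a fixed-size output buffer; the node layout and every demo_* name are assumptions for illustration and would have to be mapped onto lexbor's node types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: tag is NULL for text nodes, text is NULL for elements. */
typedef struct demo_node {
    const char *tag;
    const char *text;
    struct demo_node *first_child;
    struct demo_node *next;
} demo_node_t;

typedef struct {
    char *buf;
    size_t cap;
    size_t len;
    int pending_space;   /* whitespace seen but not yet written */
} demo_out_t;

void demo_out_init(demo_out_t *out, char *buf, size_t cap)
{
    assert(cap > 0);
    out->buf = buf;
    out->cap = cap;
    out->len = 0;
    out->pending_space = 0;
    buf[0] = '\0';
}

/* Append one character, silently truncating when the buffer is full. */
static void demo_put(demo_out_t *out, char c)
{
    if (out->len + 1 < out->cap) {
        out->buf[out->len++] = c;
        out->buf[out->len] = '\0';
    }
}

static int demo_is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Walk a sibling list (and its subtrees) in document order. */
void demo_collect_text(const demo_node_t *first, demo_out_t *out)
{
    const demo_node_t *node;
    const char *p;

    for (node = first; node != NULL; node = node->next) {
        if (node->tag == NULL) {
            for (p = node->text; p != NULL && *p != '\0'; p++) {
                if (demo_is_space(*p)) {
                    out->pending_space = 1;
                    continue;
                }
                /* One space per run, none at the start of a line. */
                if (out->pending_space && out->len > 0
                    && out->buf[out->len - 1] != '\n')
                {
                    demo_put(out, ' ');
                }
                out->pending_space = 0;
                demo_put(out, *p);
            }
        } else if (strcmp(node->tag, "br") == 0) {
            out->pending_space = 0;
            demo_put(out, '\n');
        } else {
            demo_collect_text(node->first_child, out);
        }
    }
}
```

Calling demo_collect_text on the children of the div from the example would produce the two names on separate lines. Trailing whitespace is dropped because a pending space is only written when another character follows it.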

License text

The license is not available. A copyright is insufficient.

How to use this library as a git submodule?

Hi. I'm curious to know how to use this library as a git submodule. I'm going to create an application that uses this library, and I will probably create packages of my application for Void Linux, Arch, ...

So I don't want to use distribution packages. I want to embed this library as a git submodule, so that users who want to build my project will automatically fetch and build lexbor first and then build my project.

And I only want to use the liblexbor-html library, which depends on liblexbor-core, liblexbor-dom, liblexbor-tag, and liblexbor-ns according to the readme.

Copy elements/nodes

Is there a way to make a (deep?) copy of an element/node? I can't find any fast way to do this.
Copying element content and reparsing seems not the best performance option...

Cannot parse &nbsp;

Hi, I found that an extra character would be inserted into the HTML if it contained &nbsp;. Here is my code, it's very simple, just read an HTML and output it.

#include <cstdio>
#include <cstring>

#include "lexbor/core/base.h"
#include "lexbor/core/types.h"
#include "lexbor/html/parser.h"
#include "lexbor/html/serialize.h"

int main() {
    const char* html = "<div>&nbsp;</div>";
    lxb_html_parser_t* parser = lxb_html_parser_create();
    lxb_status_t status = lxb_html_parser_init(parser);

    if (status != LXB_STATUS_OK) {
        return 1;
    }

    /* Parse */
    lxb_html_document_t* document = lxb_html_parse(parser, reinterpret_cast<const lxb_char_t *>(html), strlen(html));
    if (document == nullptr) {
        return 1;
    }

    /* Create Collection for elements */
    lxb_dom_collection_t* collection = lxb_dom_collection_make(&document->dom_document, 128);
    if (collection == nullptr) {
        return 1;
    }
    /* Get BODY element (root for search) */
    lxb_html_body_element_t* body = lxb_html_document_body_element(document);
    lxb_dom_node_t* bodyNode = lxb_dom_interface_node(body);

    lexbor_str_t str {};
    lxb_status_t serialStatus = lxb_html_serialize_tree_str(bodyNode, &str);
    if (serialStatus != LXB_STATUS_OK) {
        return 1;
    }
    size_t len = str.length;
    printf("%s", str.data);
}

I expected the output would be the same as the original one, since I didn't do anything about it. However, it was not. An extra character 0xc2 was inserted before &nbsp;:

<div>�&nbsp;</div>

Since 0xc2 alone is not a valid character, it was rendered as �. Could you look at this issue and see if it could be fixed? Thanks a lot!
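A note on where the stray byte may come from: `&nbsp;` decodes to U+00A0, and the UTF-8 encoding of U+00A0 is the two bytes 0xC2 0xA0. Output of the form 0xC2 followed by `&nbsp;` would be consistent with only the second byte being escaped during serialization, although the report itself does not confirm this. The encoder below is a generic sketch (`utf8_encode` is a name chosen for illustration, not a lexbor function) that shows how those two bytes arise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written to out (1 to 4), or 0 if cp is not
 * a valid scalar value. */
static size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }

    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }

    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;   /* surrogates are not scalar values */
        }
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }

    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }

    return 0;
}
```

Calling `utf8_encode(0x00A0, buf)` writes 0xC2 and 0xA0, so a serializer that escapes non-breaking spaces has to consume both bytes before emitting `&nbsp;`.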

Add a way to iterate results one by one

Currently, functions like lxb_dom_elements_by_class_name iterate over the whole DOM using lxb_dom_node_simple_walk. In many cases this could be a suboptimal solution.

For example:

<div class="a"></div>
<div class="a"></div>
<div class="a">test</div>
<div class="a"></div>
... many more ...

If I need to find the first non-empty div, I could stop the search after the third result rather than reading all the others.

This could probably be achieved by returning a range/iterator like this:

struct lxb_dom_range_t
{
  bool empty;
  lxb_dom_element_t* front;
  void (*lxb_dom_range_next)(lxb_dom_element_t* front);
};

or something similar.

Of course, this can easily be converted to a plain array by looping through it.
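To make the proposal concrete, here is a minimal sketch of such a lazy range in plain C. Everything in it is hypothetical (`elem_t`, `elem_range_t` and the `elem_range_*` functions are illustration names, and the flat linked list stands in for the DOM); it is not how lexbor is implemented. The point is that the search advances only as far as the caller asks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical element type for this sketch (not lexbor's). */
typedef struct elem {
    const char        *class_name;
    const char        *text;
    const struct elem *next;        /* next element in document order */
} elem_t;

/* Lazy range over the elements that carry a given class. */
typedef struct {
    const elem_t *front;            /* current match, NULL when exhausted */
    const char   *class_name;
    size_t        visited;          /* elements inspected so far */
} elem_range_t;

/* Move `front` to the first match at or after `from`. */
static void
advance_to_match(elem_range_t *r, const elem_t *from)
{
    for (; from != NULL; from = from->next) {
        r->visited++;

        if (from->class_name != NULL
            && strcmp(from->class_name, r->class_name) == 0)
        {
            break;
        }
    }

    r->front = from;
}

static elem_range_t
elem_range_begin(const elem_t *first, const char *class_name)
{
    elem_range_t r = { NULL, class_name, 0 };

    assert(class_name != NULL);
    advance_to_match(&r, first);

    return r;
}

static bool
elem_range_empty(const elem_range_t *r)
{
    return r->front == NULL;
}

static void
elem_range_next(elem_range_t *r)
{
    if (r->front != NULL) {
        advance_to_match(r, r->front->next);
    }
}

/* Example consumer: the first element of the class with non-empty text.
 * The walk stops as soon as that element is found. */
static const elem_t *
first_non_empty(const elem_t *first, const char *class_name, size_t *visited)
{
    elem_range_t r = elem_range_begin(first, class_name);

    while (!elem_range_empty(&r) && r.front->text[0] == '\0') {
        elem_range_next(&r);
    }

    if (visited != NULL) {
        *visited = r.visited;
    }

    return r.front;
}
```

For the four-div example above, `first_non_empty` inspects three elements and never touches the fourth.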

Can you provide an example?

I have some simple HTML, for example:

<div class="ev">
    <p class="ev_name">Name Header</p>
    <i class="ev_name">Name Sub</i>
    <i class="ev_logo" style="background-image: url(&quot;//image_01.png&quot;);"></i>
</div>

I need to select an element with tag P using the CSS selector .ev p.ev_name and get its contents.

In the output I want to get the string Name Header... is it possible?

Update:

I found a solution, but I'm not sure about it:

void getByCss(const std::string &className) {
    this->status = lxb_dom_elements_by_class_name(lxb_dom_interface_element(this->document->body),
                                                  this->collection,
                                                  reinterpret_cast<const lxb_char_t *>(className.c_str()),
                                                  className.length());

    for (size_t i = 0; i < lxb_dom_collection_length(this->collection); i++) {
        lxb_dom_element_t *element = lxb_dom_collection_element(this->collection, i);

        std::string test = reinterpret_cast<const char *>(lxb_dom_node_text_content(lxb_dom_interface_node(element),
                                                                                    nullptr));
        std::cout << test << std::endl;
    }
}

Query or XPath support?

Does this lib support XPath queries?

To select with something like: "#id div > a[href]".

Should I use modest_finder_by_selectors_list?

Windows MSVC build error

lxb_status_t
lxb_utils_warc_parse_file(lxb_utils_warc_t *warc, FILE *fh)
{
    size_t size;
    lxb_status_t status;

    const lxb_char_t *buf_ref;
    static const size_t buffer_size = 4096 * 2;
    lxb_char_t buffer[buffer_size];

    if (fh == NULL) {
        return LXB_STATUS_ERROR_WRONG_ARGS;
    }

lxb_char_t buffer[buffer_size]; // this is not supported by MSVC
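Some background on why this line fails: in C, a `static const size_t` object is not an integer constant expression, so an array sized with it is formally a variable-length array, and MSVC's C compiler does not support VLAs. A common workaround is to use an enum constant (or a `#define`) for the size. The sketch below is a standalone illustration of that workaround, not the fix that was applied to lexbor; `BUFFER_SIZE` and `fill_buffer` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An enum constant is an integer constant expression in C, so an array
 * sized with it is an ordinary fixed-size array rather than a VLA. */
enum { BUFFER_SIZE = 4096 * 2 };

/* Hypothetical helper for illustration: fills a stack buffer and reports
 * how many bytes it holds. */
static size_t
fill_buffer(unsigned char value)
{
    unsigned char buffer[BUFFER_SIZE];   /* fixed size, no VLA needed */

    memset(buffer, value, sizeof(buffer));
    assert(buffer[BUFFER_SIZE - 1] == value);

    return sizeof(buffer);
}
```

Because `BUFFER_SIZE` is a true compile-time constant, `sizeof(buffer)` is also evaluated at compile time, which is not the case for a VLA.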

CMake build/installation

@lexborisov Hello again, sorry to open another issue... but I looked at Stack Exchange first and I can't figure out what I'm doing wrong (I'm getting linker errors).

I would really like to do the following with CMake (on Windows) but I can't figure out how...

  • With the built-in x64-debug configuration, I'm trying to change it so that I can

    • "Build examples" (to verify that my builds/installs are working before trying to integrate into my other projects)
    • "Build separately" (to allow for only using specific modules from the library in my projects)
  • Then I am also trying to add 2 more build configs for ("x64-release" and "x64-relWDebInfo")

    • I don't need examples for these but I do still want to build separately

Here is a screen shot of the problem I'm having with the debug config:
image

Here is a screen shot of the release config
image

And finally a screen shot of the release with debug info config
image

CMake error when building the shared lib without separate libs

Compiling on Windows with only the shared lib, without separate libs:

CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-core.  Perhaps it has not yet
  been created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Looking for ceil
-- Looking for ceil - found
-- Append module: core (1.3.1)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-css.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: css (0.1.0)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-dom.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: dom (1.2.1)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-encoding.  Perhaps it has not yet
  been created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: encoding (2.0.1)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-html.  Perhaps it has not yet
  been created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: html (1.4.0)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-ns.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: ns (1.2.0)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-tag.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: tag (1.2.0)
CMake Error at config.cmake:171 (set_property):
  set_property could not find TARGET lexbor-utils.  Perhaps it has not yet
  been created.
Call Stack (most recent call first):
  CMakeLists.txt:178 (INCLUDE_MODULE_CONFIG)


-- Append module: utils (0.2.1)
-- CFLAGS: -g3 -ggdb3 -Og -Wno-pedantic-ms-format -O2 -Wall -pedantic -pipe -std=c99
-- CXXFLAGS:  -O2
-- Feature ASAN: disable
-- Configuring incomplete, errors occurred!
See also "E:/Documents/programmes_x64/msys2/home/vtorri/gitroot_64/lexbor_vtorri/builddir/CMakeFiles/CMakeOutput.log".

It seems that config.cmake is included too early. But some macros (version, for example) are also needed early.

So maybe splitting config.cmake into 2 files, included at different locations, could solve the problem.

[discussion] about perf code

In perf.h, 2 declarations seem unnecessary to me: lexbor_perf_frequency() and lexbor_perf_clock().

The perf code is used to compute the time, in seconds, taken by a part of the code. So I would say that only lexbor_perf_in_sec() is necessary, and the 2 functions above could be static in perf.c.

Compiler error on package use

@lexborisov
I have successfully gotten the library to compile on my machine (x64-windows).

But when I #include <lexbor/html/html.h> I get compiler errors complaining about the use of the macro interface. These errors come from the "document.h" and "interface.h" files.

The macro itself is defined in the "combaseapi.h" file.

Here is a screen shot of the errors... on the left you see my code with the include statement, on the right you see one of the uses of the interface macro that is causing a compiler error:
image
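For context, the Windows COM headers define `interface` as a macro that expands to `struct`, so any code that uses `interface` as an ordinary identifier breaks once those headers are included first. One possible workaround on the application side is to undefine the macro before including the affected headers. The sketch below simulates this without any Windows header; the `#define` stands in for what the system header does, and `widget_t` and `widget_id` are hypothetical names. It is not a statement of how lexbor resolved the issue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the definition that comes from the Windows headers in a
 * real build; defined here only so the sketch is self-contained. */
#define interface struct

/* Workaround: drop the macro before any header or code that uses
 * `interface` as a plain identifier. */
#ifdef interface
#undef interface
#endif

typedef struct {
    int id;
} widget_t;

/* With the macro gone, `interface` is a valid parameter name again. */
static int
widget_id(const widget_t *interface)
{
    assert(interface != NULL);

    return interface->id;
}
```

Code that needs the COM meaning of `interface` after this point would have to restore the macro, so this only suits translation units that do not use COM declarations afterwards.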

Wrong tag-detection when inside <script>-state

Hi,
it looks like I found a bug. I use (only!) the tokenizer to parse HTML code and to detect tags. This works fine except for the following behavior.

You can reproduce it with the "official" tokenizer/simple example.
When the HTML code is changed to contain script code with a "<" sign (which is quite common in JS as a comparison operator), the tokenizer sometimes thinks a new tag is being opened and handles all the following code as attribute=value. Among other errors, this leads to a broken close tag.

A minimal working example for the HTML code (line 83 in the example):
<script>var a = b<c?d:e</script>

This yields the result:
<script>var a = b<c?d:e< script>

I first noticed this behavior when parsing the youtube.com page. The script code embedded in the page in that case was:
script.txt
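Some background that may explain the behavior: in the HTML Standard, the tree construction stage switches the tokenizer into the script data state when it sees a `<script>` start tag, and in that state a `<` only matters if it begins a matching `</script` end tag. When the tokenizer is used on its own, that switch may not happen automatically, so `b<c` is read as a tag open. The sketch below is a simplified, standalone illustration of the end-tag rule (`script_data_length` and `is_script_end_tag` are hypothetical helpers, not lexbor functions, and the escaped comment-like states of the real algorithm are ignored).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if `p` (with `n` bytes available) starts with "</script",
 * compared case-insensitively, followed by whitespace, '/' or '>'. */
static int
is_script_end_tag(const char *p, size_t n)
{
    static const char name[] = "</script";
    const size_t name_len = sizeof(name) - 1;
    size_t i;
    char c;

    if (n <= name_len) {
        return 0;
    }

    for (i = 0; i < name_len; i++) {
        if (tolower((unsigned char) p[i]) != name[i]) {
            return 0;
        }
    }

    c = p[name_len];

    return c == '>' || c == '/' || c == ' '
           || c == '\t' || c == '\n' || c == '\f';
}

/* Returns the number of bytes of script text before the end tag,
 * or `len` if no end tag is present. Any other '<' is plain text. */
static size_t
script_data_length(const char *data, size_t len)
{
    size_t i;

    assert(data != NULL);

    for (i = 0; i < len; i++) {
        if (data[i] == '<' && is_script_end_tag(data + i, len - i)) {
            return i;
        }
    }

    return len;
}
```

Applied to the text after the opening tag in the example, `var a = b<c?d:e</script>`, the `<` in `b<c` is skipped and the script text ends right before `</script>`.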

link errors on Windows

C:/Documents/msys2/home/vincent.torri/gitroot_64/lexbor_vtorri/source/lexbor/dom/interfaces/attr.c:258: undefined reference to `__imp_lexbor_hash_insert_lower'

and plenty of others.

I think it's related to 6e804d6.

How to use?

Can you help me with how to use this lib? I am trying to write a custom wrapper class, but I can't...

#include "lexbor/html/html.h"
#include "lexbor/dom/dom.h"
Scanning dependencies of target test
[ 50%] Building CXX object CMakeFiles/test.dir/test.cpp.o
[100%] Linking CXX executable test
CMakeFiles/test.dir/test.cpp.o: In function `lxb_dom_collection_make':
/usr/local/include/lexbor/dom/collection.h:46: undefined reference to `lxb_dom_collection_create'
/usr/local/include/lexbor/dom/collection.h:47: undefined reference to `lxb_dom_collection_init'
/usr/local/include/lexbor/dom/collection.h:50: undefined reference to `lxb_dom_collection_destroy'
CMakeFiles/test.dir/test.cpp.o: In function `ParserLexbor::initParser()':
/home/mat/dev/zserver/include/ParserLexbor.hpp:24: undefined reference to `lxb_html_parser_destroy'
/home/mat/dev/zserver/include/ParserLexbor.hpp:26: undefined reference to `lxb_html_parser_create'
/home/mat/dev/zserver/include/ParserLexbor.hpp:27: undefined reference to `lxb_html_parser_init'
CMakeFiles/test.dir/test.cpp.o: In function `ParserLexbor::setHtml(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
/home/mat/dev/zserver/include/ParserLexbor.hpp:34: undefined reference to `lxb_html_parse'
CMakeFiles/test.dir/test.cpp.o: In function `ParserLexbor::getByCss(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
/home/mat/dev/zserver/include/ParserLexbor.hpp:50: undefined reference to `lxb_dom_elements_by_class_name'
collect2: error: ld returned 1 exit status
CMakeFiles/test.dir/build.make:83: recipe for target 'test' failed
make[3]: *** [test] Error 1
CMakeFiles/Makefile2:79: recipe for target 'CMakeFiles/test.dir/all' failed
make[2]: *** [CMakeFiles/test.dir/all] Error 2
CMakeFiles/Makefile2:86: recipe for target 'CMakeFiles/test.dir/rule' failed
make[1]: *** [CMakeFiles/test.dir/rule] Error 2
Makefile:118: recipe for target 'test' failed
make: *** [test] Error 2

How to look up elements inside the DOM?

Look at the HTML source of this page:
https://www.goodreads.com/search?q=book
I want to grab all the titles. Is there a simple way to chain and look up elements like this:

body->div>div>div>span

Edit: it's just an example. I want to know whether there is any sample code that shows how to iterate over DOM elements to get specific data.
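As a rough illustration of chaining child lookups, the sketch below follows a list of tag names one child level at a time and reports every node that completes the chain. It runs on a hypothetical, simplified node struct (`node_t`, `select_child_chain` and `count_match` are assumptions for this example, not lexbor's API); with the real library the same recursion would use its own node links and tag comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node model (not lexbor's). */
typedef struct node {
    const char        *tag;
    const char        *text;
    const struct node *first_child;
    const struct node *next;         /* next sibling */
} node_t;

typedef void (*match_cb)(const node_t *node, void *ctx);

/* Follow `path` (an array of `depth` tag names) from `parent`, matching
 * one direct-child level per name, and call `cb` for every node that
 * completes the chain. */
static void
select_child_chain(const node_t *parent, const char *const *path,
                   size_t depth, match_cb cb, void *ctx)
{
    const node_t *child;

    assert(parent != NULL && cb != NULL);

    if (depth == 0) {
        cb(parent, ctx);
        return;
    }

    for (child = parent->first_child; child != NULL; child = child->next) {
        if (strcmp(child->tag, path[0]) == 0) {
            select_child_chain(child, path + 1, depth - 1, cb, ctx);
        }
    }
}

/* Example callback: count the matches. */
static void
count_match(const node_t *node, void *ctx)
{
    (void) node;
    (*(size_t *) ctx)++;
}
```

For the chain in the question, the path array would be `{"div", "div", "div", "span"}` starting from the body node, and the callback could collect each span's text instead of counting.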
