Comments (13)
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
is the same as
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 2, 0);
because of this code:
    default:
        // default MyHTML_OPTIONS_PARSE_MODE_SEPARATELY
        if(thread_count < 2)
            thread_count = 2;
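The clamp quoted above can be modelled in isolation. This is only a rough sketch, not the myhtml source: `parse_mode`, `effective_thread_count` and `check_clamp` are hypothetical names, only the `thread_count < 2` branch comes from the quoted snippet, and returning 0 for single mode is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the library's option values. */
enum parse_mode { MODE_DEFAULT, MODE_SINGLE };

/* Models the quoted default branch: requests below 2 are raised to 2.
   Returning 0 for single mode is an assumption of this sketch. */
size_t effective_thread_count(enum parse_mode mode, size_t requested)
{
    if (mode == MODE_SINGLE)
        return 0;
    return requested < 2 ? 2 : requested;
}

/* Checks the claim above: asking for 1 or 2 threads gives the same result. */
void check_clamp(void)
{
    assert(effective_thread_count(MODE_DEFAULT, 1) ==
           effective_thread_count(MODE_DEFAULT, 2));
    assert(effective_thread_count(MODE_DEFAULT, 4) == 4);
}
```

Under these assumptions, a benchmark that compares "1 thread" with "2 threads" in default mode would be measuring the same configuration twice.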
from perl-html5-dom.
What is the point of using more threads if performance degrades?
from perl-html5-dom.
Hi @zdm
How many cores does your processor have (not counting hardware threads)?
The number of threads must be equal to the number of cores + 2. If you have 4 cores, set { threads => 2 }.
from perl-html5-dom.
I have 4 cores. Performance is degraded in any case.
Benchmark: running t0, t2, t4 for at least 3 CPU seconds...
t0: 5 wallclock secs ( 3.94 usr + 0.61 sys = 4.55 CPU) @ 357.38/s (n=1625)
t2: 1 wallclock secs ( 2.69 usr + 0.39 sys = 3.08 CPU) @ 219.55/s (n=676)
t4: 1 wallclock secs ( 2.52 usr + 0.67 sys = 3.19 CPU) @ 106.34/s (n=339)
     Rate   t4   t2   t0
t4  106/s   -- -52% -70%
t2  220/s 106%   -- -39%
t0  357/s 236%  63%   --
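The percentages in the comparison table above can be reproduced from the rates in the benchmark lines. A minimal sketch, assuming each cell is (row rate / column rate - 1) x 100 rounded to the nearest integer; `relative_percent` is a hypothetical helper, not part of any library.

```c
#include <assert.h>

/* Percentage by which `rate` differs from `other`, rounded to the nearest
   integer. Positive means `rate` is faster than `other`. */
long relative_percent(double rate, double other)
{
    assert(other > 0.0);
    double p = (rate / other - 1.0) * 100.0;
    return (long)(p >= 0.0 ? p + 0.5 : p - 0.5);
}
```

For example, `relative_percent(357.38, 106.34)` compares t0 with t4 and `relative_percent(106.34, 357.38)` compares t4 with t0, using the rates printed by the benchmark.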
from perl-html5-dom.
@zdm
Please post your processor model, OS, and test files. (Are the Spectre and Meltdown patches installed?)
from perl-html5-dom.
Could you run benchmarks on your side and post results here?
from perl-html5-dom.
@zdm
I get the same result. I will look at the Perl implementation.
I tested the Modest code, and it is 2+ times faster in threaded mode than in single mode.
from perl-html5-dom.
Also, please note that async mode is not working; maybe these issues are related.
And, by the way, it would be nice if it could call a callback when parsing is done, instead of requiring a $tree->wait call.
from perl-html5-dom.
I see the same performance degradation with the original Modest C source.
$ bin/myhtml/print_tree_high_level ~/index.html
1 threads: 635.027258
2 threads: 674.421497
3 threads: 663.557751
My friends tested on AMD Ryzen and Intel i7/Xeon. Same results: single thread is faster.
"One click" test:
git clone https://github.com/lexborisov/Modest
cd Modest
curl https://html.spec.whatwg.org/ > ~/index.html
curl https://dumpz.org/aNdawx3GKn3Q/text/ > ./examples/myhtml/print_tree_high_level.c
make
bin/myhtml/print_tree_high_level ~/index.html
Test code:
#include <stdio.h>
#include <stdlib.h>
#include <myhtml/api.h>
#include <sys/time.h>

struct res_html {
    char *html;
    size_t size;
};

/* Read the whole file into memory; exit on any error. */
struct res_html load_html_file(const char* filename)
{
    FILE *fh = fopen(filename, "rb");
    if(fh == NULL) {
        fprintf(stderr, "Can't open html file: %s\n", filename);
        exit(EXIT_FAILURE);
    }

    if(fseek(fh, 0L, SEEK_END) != 0) {
        fprintf(stderr, "Can't set position (fseek) in file: %s\n", filename);
        exit(EXIT_FAILURE);
    }

    long size = ftell(fh);

    if(fseek(fh, 0L, SEEK_SET) != 0) {
        fprintf(stderr, "Can't set position (fseek) in file: %s\n", filename);
        exit(EXIT_FAILURE);
    }

    if(size <= 0) {
        fprintf(stderr, "Can't get file size or file is empty: %s\n", filename);
        exit(EXIT_FAILURE);
    }

    char *html = (char*)malloc(size + 1);
    if(html == NULL) {
        fprintf(stderr, "Can't allocate mem for html file: %s\n", filename);
        exit(EXIT_FAILURE);
    }

    size_t nread = fread(html, 1, size, fh);
    if (nread != (size_t)size) {
        /* nread is a size_t, so it is printed with %zu */
        fprintf(stderr, "could not read %ld bytes (%zu bytes done)\n", size, nread);
        exit(EXIT_FAILURE);
    }

    fclose(fh);

    struct res_html res = {html, (size_t)size};
    return res;
}

/* Wall-clock time in milliseconds. */
double current_timestamp(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double) t.tv_sec * 1000.0 + (double) t.tv_usec / 1000.0;
}

int main(int argc, const char * argv[])
{
    const char* path;

    if (argc == 2) {
        path = argv[1];
    }
    else {
        printf("Bad ARGV!\nUse: print_tree_high_level <path_to_html_file>\n");
        exit(EXIT_FAILURE);
    }

    int tries = 20;
    double start, elapsed;
    struct res_html res = load_html_file(path);

    /* Single mode: no worker threads. */
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_PARSE_MODE_SINGLE, 1, 0);

    start = current_timestamp();
    for (int i = 0; i < tries; ++i) {
        myhtml_tree_t* tree = myhtml_tree_create();
        myhtml_tree_init(tree, myhtml);
        myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
        myhtml_tree_destroy(tree);
    }
    elapsed = current_timestamp() - start;
    myhtml_destroy(myhtml);

    /* Mean time per parse, in milliseconds. */
    printf("1 threads: %f\n", elapsed / tries);

    /* Default (threaded) mode, 2 threads. */
    myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 2, 0);

    start = current_timestamp();
    for (int i = 0; i < tries; ++i) {
        myhtml_tree_t* tree = myhtml_tree_create();
        myhtml_tree_init(tree, myhtml);
        myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
        myhtml_tree_destroy(tree);
    }
    elapsed = current_timestamp() - start;
    myhtml_destroy(myhtml);

    printf("2 threads: %f\n", elapsed / tries);

    /* Default (threaded) mode, 3 threads. */
    myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 3, 0);

    start = current_timestamp();
    for (int i = 0; i < tries; ++i) {
        myhtml_tree_t* tree = myhtml_tree_create();
        myhtml_tree_init(tree, myhtml);
        myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
        myhtml_tree_destroy(tree);
    }
    elapsed = current_timestamp() - start;
    myhtml_destroy(myhtml);

    printf("3 threads: %f\n", elapsed / tries);

    free(res.html);
    return 0;
}
Currently I have no idea why. Maybe:
- I am doing something wrong when using the myhtml API.
- There is a bug in myhtml on Linux/Windows.
I will keep trying to understand the real cause of the degradation.
from perl-html5-dom.
Hi @Azq2
Try the test with myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
Thread mode with one thread:
- Tokenizer
- Tree builder
- Token process
from perl-html5-dom.
To summarize:
- The "tree construction" and "tokenizer" threads depend heavily on each other. There is no way to split HTML parsing efficiently across threads (at least in the current myhtml implementation). Speed depends heavily on the CPU, OS and HTML contents. In some cases multithreaded mode is faster, but in the other 99.9% single mode gives the maximum speed.
- This is not a problem. Single mode is fast. Very fast. Faster than other available parsers. I don't know of any reason to use threads.
- As the benchmark in comment #3 shows, this is not directly related to this module. It is an issue in https://github.com/lexborisov/myhtml.
- I added a warning about the threads option and changed the default thread count to 0.
https://github.com/Azq2/perl-html5-dom#threads
from perl-html5-dom.
I think the problem is resolved. Please reopen if not.
from perl-html5-dom.