perl-html5-dom's Issues

Calling replace method with a fragment stops responding

Steps to reproduce: run the following code:

use strict;
use warnings;

use feature qw(say);
use HTML5::DOM;

my $tree = HTML5::DOM->new->parse('<p>1</p>');
my $fragment = $tree->parseFragment('<p>2</p><p>3</p>');
$tree->at('p')->replace($fragment);
say $tree->html;

Actual result: the process stops responding.

Expected result: the process outputs <html><head></head><body><p>2</p><p>3</p></body></html>.

HTML5-DOM-1.23: Warning: the following files are missing in your kit

Hello,

Running Strawberry Perl 5.32.1 64 bit, I noticed the following message.

C:\home\sunnyday1>cpan ZHUMARIN/HTML5-DOM-1.23.tar.gz

(snip)

Configuring Z/ZH/ZHUMARIN/HTML5-DOM-1.23.tar.gz with Makefile.PL
CPAN: CPAN::Reporter loaded ok (v1.2018)
Checking if your kit is complete...
Warning: the following files are missing in your kit:
        HTML5-DOM-1.23/META.json
        HTML5-DOM-1.23/META.yml
Please inform the author.

(snip)

Thank you,

Why does a larger thread count run slower?

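use feature qw(state);
use Benchmark;
use HTML5::DOM;

# $html holds the HTML document to parse (not shown in the report).
my $t = {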
    t0 => sub {
        state $parser = HTML5::DOM->new( { threads => 0, } );

        my $tree = $parser->parse($html);
    },
    t4 => sub {
        state $parser = HTML5::DOM->new( { threads => 4, } );

        my $tree = $parser->parse($html);
    },
};

Benchmark::cmpthese( Benchmark::timethese( 1000, $t ) );

Results:

Benchmark: timing 1000 iterations of t0, t4...
        t0:  5 wallclock secs ( 3.63 usr +  0.85 sys =  4.48 CPU) @ 223.21/s (n=1000)
        t4:  8 wallclock secs ( 8.12 usr +  1.87 sys =  9.99 CPU) @ 100.10/s (n=1000)
    Rate   t4   t0
t4 100/s   -- -55%
t0 223/s 123%   --

Without threads it runs about twice as fast. Why?

How does async mode work?

Hi,
It is unclear how async mode works. For example, in the following code I expect the printed value to be "0", because $html is fairly large and is parsed in async mode. But it is always "1". Why?

my $parser = HTML5::DOM->new( {
    threads => 4,
    async   => 1,
} );
my $tree = $parser->parse($html);
say $tree->parsed ? 1 : 0;

Problems when html's charset is windows-1253

Hi, and thank you for HTML5::DOM, which has served me superbly quite a few times.

Alas, it failed me when I tried to parse a web page that declares its encoding as "charset=windows-1253" (via <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">). As a result, parse() returns nodes whose text appears as gibberish when printed on a Linux console (the typical horror of Perl's screen-of-unicode-death §Ξ΅Ξ—ΣΤΛΩΛΩ).

My eventual solution was to zap the evil windows-1253 from the HTML content and replace it with UTF-8.

How can this be solved properly (though I don't mind the zapping)?

Secondly, I tried to tell HTML5::DOM not to concern itself with Unicode at all and to return un-encoded text, which I would then encode myself, by using parse(..., {utf8=>0}). Either I made a mistake or this is not possible, because I ended up with even more gibberish. On second thought, why use utf8=>0 when the encoding is not UTF-8?

Below is a self-contained example demonstrating the problem.

Many thanks,

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML5::DOM;
use Encode;

my $ua = LWP::UserAgent->new();
my $response = $ua->request(
  HTTP::Request->new(
	'GET' => 'https://www.areiospagos.gr/proedros.htm',
	[
		'Connection' => 'keep-alive',
		'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
		'Accept-Encoding' => 'gzip, deflate',
		'Accept-Language' => 'en-GB,en;q=0.5',
		'Referer' => 'http://www.polignosi.com/cgibin/hweb',
		'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
		'Upgrade-Insecure-Requests' => '1'
	],
  )
);
die unless $response && $response->is_success;

my $html = $response->decoded_content;

print "encoding using detect(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detect($html))."\n";
print "encoding using detectUnicode(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectUnicode($html))."\n";
print "encoding using detectByPrescanStream(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectByPrescanStream($html))."\n";

# The above html contains
#  <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">
# replacing the crappy windows-1253 with UTF-8 solves my problem
#$html =~ s/charset=windows-1253/charset=UTF-8/g;

my $parser = HTML5::DOM->new();

my $tree = $parser->parse($html, {scripts => 0});

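my $is_utf8_enabled = $tree->utf8;
# report whether utf8 is enabled for this tree
# (the original line tested $tree itself, so it always printed 'true')
print "is_utf8_enabled=".($is_utf8_enabled ? "true" : "false")."\n";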
my $text = $tree->find('body table#table1 tbody tr td table#table2 tbody tr td p span')->[0]->text();
# it prints gibberish (doubly-encoded)
print $text;
# it is solved by replacing the windows-1253 charset in $html, see above
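
The "doubly-encoded" symptom can be reproduced with Perl's core Encode module alone, which may help narrow down where the extra encoding step happens. The sketch below makes no claim about HTML5::DOM's internals. The sample bytes and variable names are illustrative assumptions. It decodes windows-1253 bytes (Encode name cp1253) to characters, then shows what happens when the resulting UTF-8 bytes are mistakenly decoded as windows-1253 a second time, and how that particular damage can be reversed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: the Greek letters Alpha, Beta, Gamma as windows-1253 bytes.
my $cp1253_bytes = "\xC1\xC2\xC3";

# Correct path: bytes -> characters -> UTF-8 bytes for output.
my $chars      = decode('cp1253', $cp1253_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# Faulty path: the UTF-8 bytes are decoded as windows-1253 a second time,
# so every Greek letter turns into two unrelated characters.
my $garbled = decode('cp1253', $utf8_bytes);

# Reversing the faulty step: back to bytes via cp1253, then decode as UTF-8.
my $repaired = decode('UTF-8', encode('cp1253', $garbled));

binmode STDOUT, ':encoding(UTF-8)';
printf "correct:  %s (%d characters)\n", $chars,    length $chars;
printf "garbled:  %s (%d characters)\n", $garbled,  length $garbled;
printf "repaired: %s (%d characters)\n", $repaired, length $repaired;
```

If the text printed by the script in the report resembles the "garbled" line, a second decode or encode step somewhere between the HTTP response and the console is a likely suspect.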

Bad examples for outerHTML and innerHTML

The documentation gives the following example:

my $tree = HTML5::DOM->new->parse('<div id="test">some <b>bold</b> test</div>');
print $tree->outerHTML;                         # <div id="test">some <b>bold</b> test</div>

Running this code gives the following error:

Can't locate object method "outerHTML" via package "HTML5::DOM::Tree"

It's tempting to try $tree->root->outerHTML instead, but this doesn't give the expected output either since the root node is <html>, not <div>.
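
One possible corrected form of the documentation example is to select the element first and call the methods on it. This is a sketch, not the maintainers' fix. It assumes outerHTML and innerHTML are available on element nodes, and it uses at() with a CSS selector in the same way as the replace report above.

```perl
use strict;
use warnings;
use HTML5::DOM;

my $tree = HTML5::DOM->new->parse('<div id="test">some <b>bold</b> test</div>');

# Select the <div> itself; the tree object has no outerHTML method.
my $div = $tree->at('#test');

my $outer = $div->outerHTML;   # serialization including the <div> tag
my $inner = $div->innerHTML;   # serialization of the children only

print "$outer\n$inner\n";
```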

->text and similar methods always return an encoded string

say HTML5::DOM->new->parse('<b>тест</b>')->at('b')->text;

If I understand correctly, the parser detects the encoding and stores the tree in UTF-8 internally.
Is it possible to return text and HTML strings with the utf8 flag set, as other HTML tree builders do (for example, the HTML::TreeBuilder:: family)?
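
If the returned value is a UTF-8 encoded byte string, it can be turned into a character string with the core Encode module. A minimal sketch under that assumption: the helper name to_characters is hypothetical, and the byte string stands in for the value returned by ->text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 bytes into a Perl character string.
sub to_characters {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# Stand-in for the bytes returned by ->text: the four Cyrillic letters
# of the example above, encoded as UTF-8 (two bytes per letter).
my $bytes = "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82";
my $text  = to_characters($bytes);

printf "bytes: %d, characters: %d, utf8 flag: %s\n",
    length $bytes, length $text, (utf8::is_utf8($text) ? 'on' : 'off');
```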

Strange encoding: utf-8 instead of windows-1251

Hi! )

Thanks for this library!
But we seem to have found a bug.
We need to work with windows-1251, not UTF-8.
But every time, the result comes back in UTF-8.

I have a simple example:

index.pl (windows-1251)

use HTML5::DOM;

my $textWin1251 = "<p>Если заголовок заполнен, а подзаголовка нет – для материала все остается так же, как раньше.</p>";
my $parser = HTML5::DOM->new();

my $DOM_tree = $parser->parse($textWin1251);

print $DOM_tree->encoding; # WINDOWS-1251
print $DOM_tree->utf8; # 0
my $nodes = $DOM_tree->querySelectorAll('body > *');
my $div2 = $DOM_tree->createElement('div');

$div2->innerHTML(qq{<div>Вставляемый фрагмент кода</div>});

$nodes->[0]->after($div2);

$textWin1251 = $DOM_tree->body->innerHTML;

print $textWin1251;

Returns:

<p>Если заголовок заполнен, Р° подзаголовка нет – для материала РІСЃРµ остается так же, как раньше.</p><div>Вставляемый фрагмент кода</div>

Options like

my $parser = HTML5::DOM->new({
  encoding => "WINDOWS-1251", # and / or
  utf8 => 0, # and / or
  default_encoding => "WINDOWS-1251"
});

and combinations of them have no effect.
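
The garbled parts of the output are consistent with UTF-8 bytes being displayed as windows-1251: the Cyrillic letter "а" is D0 B0 in UTF-8, and those two bytes read as windows-1251 give "Р°". The sketch below reproduces that with the core Encode module and outlines one possible workaround: decode the windows-1251 input to characters before parsing, and encode the serialized result back afterwards. It is an assumption that the parser accepts and returns character strings. The parsing step itself is only marked by a comment, and the sample bytes are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The Cyrillic letter "a" (U+0430) as UTF-8 bytes, misread as windows-1251.
my $misread = decode('cp1251', encode('UTF-8', "\x{430}"));

# Workaround sketch: convert at the boundaries of the program.
my $input_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";      # six Cyrillic letters in windows-1251
my $characters  = decode('cp1251', $input_bytes);  # what would be handed to the parser
# ... parse and modify the document here, working with character strings ...
my $output_bytes = encode('cp1251', $characters);  # back to windows-1251 for output

my $code_points = join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $misread));
print "misread code points: $code_points\n";
print "round trip intact: ", ($output_bytes eq $input_bytes ? 'yes' : 'no'), "\n";
```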
