imangazaliev / didom Goto Github PK

2.2K 2.2K 204.0 468 KB

Simple and fast HTML and XML parser

License: MIT License

PHP 100.00%

dom html html-parser parser xml xml-parser xpath

didom's Introduction

Hi, I'm Muhammad Imangazaliev

A fullstack developer from Dagestan, Russia

My Projects

DiDOM - simple and fast HTML and XML parser
Syntax Highlighter Bot - a Telegram bot for generating screenshots with highlighted code

didom's People

Contributors

Stargazers

Watchers

Forkers

html2k foxer-zt alfach happyproff coderxlsn romeoz hasantayyar sankam-nikolya voidshah chiz-developer apelsinpro pronskiy yaroslavb t-web izabolotnev namaljayathunga rakvium nullproduction wellic sergey-kukosh dkepov p0vidl0 denisalliswell vitos8686 insiite wesavetheworld yorci lure--- diaskooo karpiks hoai inurosen valenokpc freddykr alexsj ont0shk0 tobeorla mariolima salkhwlani wlazlak piterskiy maximing seo-broweb evgenr dc-pmurtazin my-instantcms turbo liangklfang bitande7 appstacktop jekshmek shivaraj-proves toby1991 gameplayjdk magicjhon xiongamo madtaurus erengdk gsdu8g9 dimonchoo yyrcn hotarusawamura lw006 maxnamillion1 liuxu90dream vkn999 yelfive webcore-studio che1974 zabudkin shapilokk svserg anihy scasic brother-simon bokinga vove leotop baz10 guoqing1988 denisdulici hetuhui gegeriyadi anawbas karpatsky dulumao mengdianyun xuesong55 webber12 kingcastle vladzimir 123jixinyu php-rock artempitov sleuthhound wangsir0624 brizhanev subsan ryangittings warezaddict-com

didom's Issues

Timeout

Would be nice if we could specify a timeout for URL loading.
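One possible workaround, sketched with PHP's built-in stream functions rather than any DiDOM API: stack traces in other issues here show the library loading URLs through file_get_contents(), and that function accepts a stream context carrying a timeout. The helper names buildTimeoutContext and fetchWithTimeout are hypothetical.

```php
<?php

/**
 * Build a stream context whose HTTP requests give up after $seconds.
 * Hypothetical helper, not part of DiDOM.
 */
function buildTimeoutContext(float $seconds)
{
    return stream_context_create([
        'http' => ['timeout' => $seconds],
    ]);
}

/**
 * Fetch a URL with a read timeout; returns the body, or null on failure.
 */
function fetchWithTimeout(string $url, float $seconds): ?string
{
    $body = @file_get_contents($url, false, buildTimeoutContext($seconds));

    return $body === false ? null : $body;
}
```

The fetched string could then be passed to new Document($html) instead of letting the library load the URL itself.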

Not working when trying to fetch a URL that has gzip enabled

Example

$url = "https://www.chefsteps.com/activities/sous-vide-salmon--2";

$document = new Document($url, true);
print_r($document->format()->html());
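If the failure comes from the server returning a gzip-compressed body (an assumption; the report does not say), one workaround is to fetch the page yourself and decompress it before parsing. The sketch below checks for the gzip magic bytes and uses PHP's gzdecode() from the zlib extension. The helper name maybeGunzip is hypothetical.

```php
<?php

/**
 * Return $body decompressed if it looks like gzip data, unchanged otherwise.
 * Hypothetical helper; requires the zlib extension for gzdecode().
 */
function maybeGunzip(string $body): string
{
    // Every gzip stream starts with the two bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }

    $decoded = @gzdecode($body);

    // Fall back to the raw body if decompression fails.
    return $decoded === false ? $body : $decoded;
}
```

The result can be handed to new Document($html) as a string.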

Optimization question

Good afternoon.

This is my first time using the library, and it has already made a good impression.

I have a question about optimization:
Does using a "context" for searching affect performance?
For example, we load a large page with $document = new Document($link, true); and then need to build an array from elements of that page:

$el["name"] = $document->find('.element-name')[0]->getAttribute('name');
$el["value"] = $document->find('.element-value')[0]->text();

and so on.

All searches go through the $document variable. Would there be any difference in speed if a context were used instead? For example:

$document = new Document($link, true);
$section = $document->find('.elements-list');

$el["name"] = $section[0]->find('.element-name')[0]->getAttribute('name');
$el["value"] = $section[0]->find('.element-value')[0]->text();

Or does this just add an extra query?

HTML fragment is automatically wrapped with <p>

This is similar to #27 but affects HTML fragments that do not have a wrapping element.

Example:

    $content = "This is just a <b>short</b> test!";

    $document = new \DiDom\Document();
    $document->loadHtml($content, LIBXML_HTML_NOIMPLIED | LIBXML_BIGLINES | LIBXML_HTML_NODEFDTD | LIBXML_PARSEHUGE);

    echo $document->html();

Expected result

This is just a <b>short</b> test!

Current result

<p>This is just a <b>short</b> test!</p>
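A common workaround with PHP's built-in DOMDocument (shown here as a sketch, not as DiDOM's behaviour) is to wrap the fragment in a temporary element, parse it without the implied html/body, and serialize only the wrapper's children. The function name roundTripFragment is hypothetical.

```php
<?php

/**
 * Parse an HTML fragment and serialize it back without any added wrapper.
 * Hypothetical helper built on DOMDocument, not on DiDOM.
 */
function roundTripFragment(string $fragment): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);

    // The temporary <div> gives the parser a single root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $output = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }

    return $output;
}
```

Because the wrapper supplies the single root, the same approach also covers fragments with several top-level elements.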

Multiple URLs

Hey,

Really nice script; I've been messing around with it for a bit,

but I just can't figure out how to accomplish this.

I'm trying to loop through multiple URLs but cannot seem to achieve this.

    $names = ['a', 'b', 'c', 'd'];

    foreach ($names as $name) {
        $name = $name;
    }

    $document = new Document('http://www.website.com/link/' . $name, true);

    $posts = $document->find('.classname');

    foreach($posts as $post) {
        echo $post->text(), "\n";
    }

This only seems to display d

But what I would like it to do is go to all 4 links and bring back the copy from those

http://www.website.com/link/a

Copy from A

http://www.website.com/link/b

Copy from B

http://www.website.com/link/c

Copy from C

http://www.website.com/link/d

Copy from D

all on the same page.
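The snippet in this report fetches only once, after the loop has finished, so $name still holds the last value ('d'). A minimal sketch of the fix is to do the fetching inside the loop. To keep the example runnable without network access, the fetching step is passed in as a callable; collectFromAllUrls and $fetchPosts are hypothetical names. In real use the callable would wrap new Document($url, true) and find('.classname') exactly as in the report.

```php
<?php

/**
 * Visit one URL per name and collect whatever $fetchPosts returns for it.
 * Hypothetical helper; $fetchPosts stands in for the DiDOM calls.
 */
function collectFromAllUrls(array $names, callable $fetchPosts): array
{
    $results = [];

    foreach ($names as $name) {
        // Build and fetch the URL inside the loop, once per name.
        $url = 'http://www.website.com/link/' . $name;
        $results[$url] = $fetchPosts($url);
    }

    return $results;
}

// Stub fetcher for illustration only; replace it with real parsing code.
$demoPages = collectFromAllUrls(['a', 'b', 'c', 'd'], function (string $url): array {
    return ['Copy from ' . $url];
});

foreach ($demoPages as $url => $posts) {
    echo $url, "\n", implode("\n", $posts), "\n";
}
```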

Warnings (PHP 5.6.14)

Everything was fine on PHP 5.5, but 5.6.14 emits warnings:

Warning: DOMNode::cloneNode(): ID ***** already defined in ***\vendor\imangazaliev\didom\src\DiDom\Document.php on line 72

Missing pseudo class "nth-of-type"

Thank you for making this awesome library!

Ref: http://www.w3schools.com/cssref/sel_nth-of-type.asp
Location: https://github.com/Imangazaliev/DiDOM/blob/master/src/DiDom/Query.php#L251

Example

<?php

require 'vendor/autoload.php';

$content = new \DiDom\Document("HTML_CODE");
$content->find("ul:nth-of-type(2) li"); // should return an array of "second list" lis

HTML:

<html>
<head>
</head>
<body>
    <ul>
        <li>first list</li>
        <li>first list</li>
        <li>first list</li>
        <li>first list</li>
    </ul>
    <ul>
        <li>second list</li>
        <li>second list</li>
        <li>second list</li>
        <li>second list</li>
    </ul>
</body>
</html>

I would add it myself, but I don't understand the code for the other pseudo-classes.
Hope this helps you to make it possible :p
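Until such a pseudo-class exists, an XPath expression can express the same selection. In XPath, ul[2] matches a ul that is the second ul child of its parent, which is what ul:nth-of-type(2) means in CSS. The sketch below uses PHP's built-in DOMXPath; the function name secondListItems is hypothetical. Another issue in this list calls an xpath() method on the document, so the same expression could presumably be used there as well.

```php
<?php

/**
 * Return the text of every <li> inside the second <ul> of its parent.
 * Hypothetical helper using DOMXPath instead of a CSS selector.
 */
function secondListItems(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $texts = [];
    // Equivalent of the CSS selector "ul:nth-of-type(2) li".
    foreach ($xpath->query('//ul[2]//li') as $li) {
        $texts[] = $li->textContent;
    }

    return $texts;
}
```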

remove doesn't seem to work

After removing an element, calling find on the $document still returns it in the results.

HTML fragment is automatically wrapped with <html><body>

Example:

        $html = '<div>foo</div><div><span>bar</span></div>';

        $document = new \DiDom\Document($html);
        $elements = $document->find('div');

        foreach($elements as $element) {
            $element->setAttribute("foo", "bar");
        }

        echo $document->html();

Current output

<html><body><div foo="bar">foo</div><div foo="bar"><span>bar</span></div></body></html>

Expected output:

<div foo="bar">foo</div><div foo="bar"><span>bar</span></div>

Finding elements by data attributes

Hello.

First, I want to say that this library is really great and you have done excellent work.
Second, is it possible to select an element by its data attribute?

Example: a link with a data attribute:

<a href="#hello" data-element="qwerty">
    <div>Some text</div>
</a>

Currently I do this:

$html = <<<HTML
<a href="#hello" data-element="qwerty">
    <div>Some text</div>
</a>
HTML;

$document = new Document($html);
$links    = $document->find("[data-element='qwerty']");

I get an empty array.

Thank you.
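For comparison, the same lookup can be written as an XPath attribute test. This is a sketch with PHP's built-in DOMXPath, not a statement about how DiDOM translates the selector; findByAttribute is a hypothetical helper, and it assumes the attribute value contains no double quotes.

```php
<?php

/**
 * Find all elements whose attribute $name equals $value.
 * Hypothetical helper; $value must not contain a double quote.
 */
function findByAttribute(DOMDocument $dom, string $name, string $value): array
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(sprintf('//*[@%s="%s"]', $name, $value));

    return iterator_to_array($nodes);
}

$attrDoc = new DOMDocument();
libxml_use_internal_errors(true);
$attrDoc->loadHTML('<a href="#hello" data-element="qwerty"><div>Some text</div></a>');
libxml_clear_errors();

$matches = findByAttribute($attrDoc, 'data-element', 'qwerty');
echo count($matches), "\n";
```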

:not pseudoselector

Will this selector be added?

php_network_getaddresses: getaddrinfo failed: Name or service not known !!

Hi, I'm using DiDom with Laravel 4.2, and when I try to analyse a website to get its HTML I get this error:
file_get_contents(): php_network_getaddresses: getaddrinfo failed: Name or service not known' in /home/atef/public_html/seomara/vendor/imangazaliev/didom/src/DiDom/Document.php:252.
Can you help me?!

UTF-8 characters in the html() method

I call the html() method and, instead of Cyrillic text, get numeric character references like

&#1058;&#1080;&#1084; &#1050;&#1091;&#1082;

How do I convert them to normal text? It would also be nice if html() did this by default.

I found a recommendation on Stack Overflow to set the encoding on the \DOMDocument after loading, but it did not work.

$this->document->loadHtml($html);
$this->document->encoding = 'utf-8';
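As a stopgap on the caller's side, numeric character references like the ones above can be converted back to UTF-8 text with PHP's html_entity_decode(). This is only a post-processing sketch; it does not change what html() returns, and the wrapper name decodeNumericEntities is hypothetical. Note that it also decodes named entities such as &amp;, so it suits extracted text better than markup that will be parsed again.

```php
<?php

/**
 * Convert HTML character references in $text to UTF-8 characters.
 * Hypothetical wrapper around html_entity_decode().
 */
function decodeNumericEntities(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeNumericEntities('&#1058;&#1080;&#1084; &#1050;&#1091;&#1082;'), "\n";
```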

<script> breaks the XPath parser

Greetings! For some reason this code:

<td id="main_col">
    <div class="borderwrap">
        <script type="text/javascript">
        new Array(
            "~~NODIV~~<div>BREAKME</div>",
        )
        </script>
        <a href="http://test.com">TEST</a>
    </div>
</td>

breaks the XPath parser for the expression:
//td[@id="main_col"]/div[@class="borderwrap"]//a/@href

Try removing the line "~~NODIV~~<div>BREAKME</div>", and we get what we need.

Catchable fatal error

$item = $document->find('ul.menu > li')[1];
// previous element
var_dump($item->previousSibling());

Catchable fatal error: Argument 1 passed to DiDom\Element::setNode() must be an instance of DOMElement, instance of DOMText given, called in \vendor\imangazaliev\didom\src\DiDom\Element.php on line 32 and defined in \vendor\imangazaliev\didom\src\DiDom\Element.php on line 452
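The error message suggests that the node before the li is a DOMText node (typically the whitespace between tags) rather than an element. With PHP's built-in DOM classes, a caller can skip non-element siblings explicitly. The sketch below shows that idea; previousElementOf and nextElementOf are hypothetical helpers, not DiDOM methods.

```php
<?php

/**
 * Return the closest preceding sibling that is an element, or null.
 * Hypothetical helper; skips text and comment nodes.
 */
function previousElementOf(DOMNode $node): ?DOMElement
{
    for ($sibling = $node->previousSibling; $sibling !== null; $sibling = $sibling->previousSibling) {
        if ($sibling instanceof DOMElement) {
            return $sibling;
        }
    }

    return null;
}

/**
 * Return the closest following sibling that is an element, or null.
 */
function nextElementOf(DOMNode $node): ?DOMElement
{
    for ($sibling = $node->nextSibling; $sibling !== null; $sibling = $sibling->nextSibling) {
        if ($sibling instanceof DOMElement) {
            return $sibling;
        }
    }

    return null;
}
```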

Getting nextSibling element

Could this be added to the Element class?

public function nextSibling()
 {
    return $this->node->nextSibling;
 }

Memory Usage

What is the best way to optimize memory usage for this class? Is there a way to unset all variables used in the class in case I'm using this in a long-running script?

Or would just an unset($didom) do the trick? Once again, thank you for creating such an awesome class.

lastChild(), nextSibling() and previousSibling() return nothing

Here is your example:

$html = '
<ul>
    <li>Foo</li>
    <li>Bar</li>
    <li>Baz</li>
</ul>
';
$document = new Document($html);
$list = $document->first('ul');
// string(3) "Baz"
echo '<br><b>$list->child(2)->text():</b><br>'.highlight_string(print_r($list->child(2)->text(), true), true).'<br>';
// string(3) "Foo"
echo '<br><b>$list->firstChild()->text():</b><br>'.highlight_string(print_r($list->firstChild()->text(), true), true).'<br>';
// string(3) "Baz" - nothing is returned
echo '<br><b>$list->lastChild()->text():</b><br>'.highlight_string(print_r($list->lastChild()->text(), true), true).'<br>';

$document = new Document($html);
$item = $document->find('ul > li')[1];
echo '<br><b>$item->previousSibling():</b><br>'.highlight_string(print_r($item->previousSibling()->text(), true), true).'<br>';
echo '<br><b>$item->nextSibling():</b><br>'.highlight_string(print_r($item->nextSibling()->text(), true), true).'<br>';

$list->lastChild()->text() returns nothing
nextSibling() and previousSibling() return nothing either

Multiple Elements

Hi,

I'm looking to pull down SVG images and other elements from the page and loop through them. How do I go about this?

    $names = ['a', 'b', 'c', 'd'];

    foreach ($names as $name) {
        $name = $name;
    }

    $document = new Document('http://www.website.com/link/' . $name, true);

    $posts = $document->find('.classname');
    // $title = $document->find('.title'); ????????
    // Does it need to go here e.g $icon = $document->find('img[src$=svg]');

    foreach($posts as $post) {
        echo $post->text(), "\n";
        // How do i loop through the images & titles?
    }

Any help is much appreciated.

Thanks Jake.

$dom->first() returns null

Hi,
When I try to get the text of an element that does not exist in the HTML
(for example, $dom->first('#productCode')->text()),
PHP raises a fatal error
because $dom->first('#productCode') returns null.

Could $dom->first('#productCode')->text() work without raising an error in the future, like $('#someSelector').find('#productCode').text() in jQuery?
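Until the library offers something like this, the caller has to guard against null before chaining. The sketch below illustrates the pattern with PHP's built-in DOM classes, where a lookup for a missing node also yields null; textOrDefault is a hypothetical helper, and the same check applies to the value returned by first().

```php
<?php

/**
 * Return the node's text, or $default when the lookup found nothing.
 * Hypothetical helper illustrating a null guard.
 */
function textOrDefault(?DOMNode $node, string $default = ''): string
{
    return $node !== null ? $node->textContent : $default;
}

$guardDoc = new DOMDocument();
$guardDoc->loadHTML('<div id="present">hello</div>');
$guardXpath = new DOMXPath($guardDoc);

// item(0) returns null when nothing matches the query.
echo textOrDefault($guardXpath->query('//*[@id="productCode"]')->item(0), 'n/a'), "\n";
```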

The remove method does not work

Good afternoon. I saw the issue about this problem ( #39 ), but you closed it without waiting for the author to provide example code.

I made a simplified example:

    $content = '<html><div id="c">text text text <div id="e">div</div> test test</div></html>';
    $document = new Document($content);
    $div = $document->first('div#c');
    $div->first('div#e')->remove();
    dd($div->html());

Output:

<div id="c">text text text <div id="e">div</div> test test</div>

That is, the div with the id e is still in place after remove is called.

Encoding

Hello. I have a problem. I do everything with cURL.
First I print what my cURL function returned. There are no encoding problems. As an example, here is the output of part of the page:

Нет комментариев

Then I do the same thing, but create a Document object and print via the html() method. Here is the output of the same part of the page:

Ð�ÐµÑ� ÐºÐ¾Ð¼Ð¼ÐµÐ½Ñ�Ð°Ñ�Ð¸ÐµÐ²

Crash with the selector "h1.post-title a:first"

By mistake I wrote has("h1.post-title a:first") instead of has("h1.post-title a:first-child") and got a 500 server error with no PHP error output. I use DiDOM in Laravel 5.1 with the debugger enabled.
I understand that I wrote the selector incorrectly myself, but I would like to understand why a 500 error occurs instead of an exception.

Is it possible to get the text without <html><body>...</body></html>?

It would be great to get the text from the object without these tags, since people do not always parse whole pages; in my case I am editing an article.
Something like $document->html(true)
If this is already possible, I apologize.

Cookie

Could support for setting cookies with file_get_contents be added? I searched, and it seems such a feature can be implemented: http://stackoverflow.com/questions/3431160/php-send-cookie-with-file-get-contents

While I was writing a parser everything was fine, but then Firefox started showing this:
Firefox has detected that the server is redirecting the request for this address in a way that will never complete.

This problem can sometimes be caused by disabling or refusing to accept cookies.

Most likely they noticed a lot of requests to the server and turned on a cookie check.

One more question: could URLs be read via cURL, sending all the necessary headers and so on? Then it would be the best parser.
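Along the lines of the linked Stack Overflow answer, cookies can be sent by fetching the page yourself with a stream context that carries a Cookie header, and then passing the resulting HTML string to the parser. This is a sketch using PHP's stream functions only; buildCookieContext is a hypothetical helper.

```php
<?php

/**
 * Build a stream context that sends the given cookies with each request.
 * Hypothetical helper; $cookies maps cookie names to values.
 */
function buildCookieContext(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode((string) $value);
    }

    return stream_context_create([
        'http' => [
            'header' => 'Cookie: ' . implode('; ', $pairs) . "\r\n",
        ],
    ]);
}

// Usage sketch (needs network access, so it is left commented out):
// $html = file_get_contents($url, false, buildCookieContext(['session' => 'abc123']));
```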

Encoding problem

I tried various approaches; this one seems the most illustrative to me.
The gist is that Russian letters get garbled with UTF-8.

    $text = 'текст text текст text';

    dd(mb_detect_encoding($text));            // UTF-8
    dd($text);                                // "текст text текст text"

    $document = new Document($text);

    dd(mb_detect_encoding($document->html()));// UTF-8
    dd($document->html());                    // "<html><body><p>Ñ‚ÐµÐºÑ�Ñ‚ text Ñ‚ÐµÐºÑ�Ñ‚ text</p></body></html>"
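The garbled output is typical of UTF-8 bytes being read as a single-byte encoding. With PHP's built-in DOMDocument, a widely used workaround is to prepend a meta element that declares the charset before the markup is parsed. The sketch below shows that workaround on DOMDocument directly; whether DiDOM does or should do the same internally is not something this example claims, and loadUtf8Html is a hypothetical helper.

```php
<?php

/**
 * Parse $html as UTF-8 by declaring the charset ahead of the markup.
 * Hypothetical helper built on DOMDocument.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);

    // The meta element tells the parser which charset the following bytes use.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$utf8Doc = loadUtf8Html('<p>текст text текст text</p>');
echo $utf8Doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```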

Class 'DiDom\Errors' not found

Hello!
After update from "imangazaliev/didom (1.7.3)" to "imangazaliev/didom (1.8.3)"

Class 'DiDom\Errors' not found

in .../vendor/imangazaliev/didom/src/DiDom/Document.php:162

Attribute selector

`
DiDom\Element Object
(
[node:protected] => DOMElement Object
(
[tagName] => img
[schemaTypeInfo] =>
[nodeName] => img
[nodeValue] =>
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] =>
[lastChild] =>
[previousSibling] => (object value omitted)
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => img
[baseURI] =>
[textContent] =>
)

)

Fatal error: Call to a member function attr() on null in /d/functions.php on line 66
`

Any idea why I can't use the attr function? If I use ->src it works fine, but I need to get a

Incorrectly parsing <i> tag when placed within <a> tags

I've come across a problem where my <i> tags are not parsed correctly: the tag becomes incomplete when passed through DOM construction.

I have found that when I place an <i> tag within <a> tags, the <i> tag loses its closing tag.

$body = '<a href="#"><i class="fa fa-globe"></i></a>';
$document = new Document($body);
\Log::info((string) $document);

This will become:

'<html><body><a href="#"><i class="fa fa-globe"/></a></body></html>'

I also tried the following and they seemed to be OK.

'<p>He named his car <i>The lightning</i>, because it was very fast.</p>'
'<a href="#"><span>Hello</span></a>'

Really nice DOM package. :)

How do I include it?

Hello, sorry for the silly question, but how do I include the parser in my own file?
I develop on Denwer, and I do not know which file to include.

Can't set attribute in loop

Code example:

$html = new Document($file_name, true);
foreach ( $html->find($selector)[0]->find('img') as $element ) {
    $element->src = self::embed($path);
}

Expected: <img src="'data:image/jpg;base64,base64_encode_output" />
Got old src attr value.

Out of memory

What should I do with an XML file of 500+ MB that has more than 5 million lines?
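For files of that size, a DOM-based parser has to hold the whole tree in memory, so a streaming reader is usually the better fit. The sketch below uses PHP's built-in XMLReader, which moves through the file node by node; it is an alternative to DiDOM for this case, not a feature of it. The function name countElementsStreaming and the element name 'item' are hypothetical.

```php
<?php

/**
 * Count elements named $name in an XML file without loading it whole.
 * Hypothetical helper built on XMLReader.
 */
function countElementsStreaming(string $path, string $name): int
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException(sprintf('Cannot open %s', $path));
    }

    $count = 0;
    // read() advances one node at a time, so memory use stays small.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }

    $reader->close();

    return $count;
}
```

Inside the loop, an individual record could be handed to a DOM parser on its own, so that only one small subtree is in memory at a time.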

css selector

There is a problem.
Please look at the code below.

$html = <<<HTML
<select name="sort-field" class="sort_field">
    <option selected="selected" value="/en/tsuk/category/brands-4210405/adidas/N-7z2Z21o4Zdgl?Nrpp=20&amp;siteId=%2F12556&amp;sort_field=Relevance">Best Match</option>
    <option value="/en/tsuk/category/brands-4210405/adidas/N-7z2Z21o4Zdgl?Nrpp=20&amp;Ns=product.freshnessRank%7C0&amp;siteId=%2F12556&amp;sort_field=Newness">Newest</option>
</select>
HTML;

$Document = new DiDom\Document;
$Document->loadHtml($html);

$Element = $Document->first('select[name=sort-field]  option[selected=selected]'); // find nothing 
$Element = $Document->first('select[name=sort-field]')->first('option[selected=selected]'); // it's ok

file_get_contents(https://.....com): failed to open stream !!

Hey, I'm trying to parse the URL 'eloquentbyexample.com' and it fails with this exception:
file_get_contents(https://eloquentbyexample.com): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found ' in /..../vendor/imangazaliev/didom/src/DiDom/Document.php:252.
Have an idea to solve this?
thx!

Unable to parse URLs with Cyrillic letters

vendor/imangazaliev/didom/src/DiDom/Document.php
line ~225

if (filter_var($filepath, FILTER_VALIDATE_URL) === false) {
            if (!file_exists($filepath)) {
                throw new RuntimeException(sprintf('File %s not found', $filepath));
            }
        }

causes loading to fail.
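FILTER_VALIDATE_URL accepts only ASCII URLs, so a URL containing raw Cyrillic characters fails the check quoted above and is then treated as a file path. A caller-side workaround is to percent-encode the non-ASCII bytes before passing the URL in. The sketch below does that byte by byte; percentEncodeNonAscii is a hypothetical helper, and it covers paths and query strings only (a Cyrillic host name would need IDN/punycode conversion instead).

```php
<?php

/**
 * Percent-encode every byte outside printable ASCII in $url.
 * Hypothetical helper; it does not handle internationalized host names.
 */
function percentEncodeNonAscii(string $url): string
{
    // Without the /u modifier the pattern matches single bytes,
    // so each byte of a UTF-8 sequence becomes one %XX escape.
    return (string) preg_replace_callback(
        '/[^\x21-\x7E]/',
        function (array $match): string {
            return rawurlencode($match[0]);
        },
        $url
    );
}

echo percentEncodeNonAscii('http://example.com/привет'), "\n";
```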

Encoding issue

I'm using your lib to parse some HTML pages. When the task is run as crontab job, everything is OK. But once I try to parse the same page via interactive action in browser, it parses the source page in wrong encoding. Any recommendations to fix it? Attaching an example of print_r of some piece of parsed page

parent() seems to return always the $document

Instead of the actual parent element in the HTML.

Cannot run the script

I cannot get the script to run. I took the code from the example, and the server complains that the class is missing. This may be a very silly question, but how do I use the script without Composer? I tried including the files, but it did not help. Please help.
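Without Composer, the classes have to be loaded by an autoloader of your own. The paths in the stack traces above (src/DiDom/Document.php) suggest that classes in the DiDom namespace live in files of the same name under src/DiDom, so a small prefix-based autoloader may be enough. This is a sketch under that assumption; registerPrefixAutoloader and the download location are hypothetical.

```php
<?php

/**
 * Register an autoloader that maps a namespace prefix to a directory.
 * Hypothetical helper; e.g. DiDom\Document -> $baseDir/Document.php.
 */
function registerPrefixAutoloader(string $prefix, string $baseDir): void
{
    $prefix = rtrim($prefix, '\\') . '\\';
    $baseDir = rtrim($baseDir, '/\\') . DIRECTORY_SEPARATOR;

    spl_autoload_register(function (string $class) use ($prefix, $baseDir): void {
        // Ignore classes from other namespaces.
        if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
            return;
        }

        $relative = substr($class, strlen($prefix));
        $file = $baseDir . str_replace('\\', DIRECTORY_SEPARATOR, $relative) . '.php';

        if (is_file($file)) {
            require $file;
        }
    });
}

// Usage sketch, assuming the library was unpacked into ./didom:
// registerPrefixAutoloader('DiDom', __DIR__ . '/didom/src/DiDom');
```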

HTML fragment with multiple root nodes

Example

    $content = "<div>This is just a <b>short</b> test!</div><div>Another Test</div>";

    $document = new \DiDom\Document();
    $document->loadHtml($content, LIBXML_HTML_NOIMPLIED | LIBXML_BIGLINES | LIBXML_HTML_NODEFDTD | LIBXML_PARSEHUGE);

    echo $document->html();

Expected result

<div>This is just a <b>short</b> test!</div><div>Another Test</div>

Current result

<div>This is just a <b>short</b> test!</div>

Strange text

"As the expression for, you can pass a CSS selector"

in the README 👍

Searching for elements by attribute value does not work

I made a simplified example:

$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta property="og:title" content="Title" /></head></html>';

        $document = new \DiDom\Document($html);
        $elements = $document->find('meta[property=og:title]');
        $element = $elements[0];
        echo $element->getAttribute('content');

Output:

text/html; charset=UTF-8

That is, the [property=og:title] construct is ignored.

Is there a way to modify innertext/text ?

I've tried modify->text and text() but they both don't work

$element->children() and HTML Comment

There is a problem.
Please look at the code below.

$html = <<<HTML
<div>
<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<!-- comment -->
<p>text5</p>
</div>
HTML;

        $Document = new \DiDom\Document;
        $Document->loadHtml($html);

        //Fatal error: Uncaught InvalidArgumentException: Argument 1 passed to DiDom\Element::setNode must be an instance of DOMElement or DOMText, DOMComment given in
        $children = $Document->find('div')[0]->children(); //error

Selector for direct child elements

I wanted to find all direct children under $element->find( '>*' ) but it returned all children and children under children in recursion.

$document->loadHtml("<div><span></span></div>")->find( '>*' );
// returns two elements div and span, while jQuery('>*') returns only div

>* is compiled to //*/* but it should be compiled to /*/* which gives direct child elements.

*[@id='my-id[attr0="attr-value0"][attr1="attr-value1"]']
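The difference between the two axes can be seen with PHP's built-in DOMXPath when a context node is supplied: ./* selects only direct child elements of that node, while .//* selects every descendant element. The sketch below illustrates the XPath semantics only and does not show how DiDOM compiles selectors; childElementNames and descendantElementNames are hypothetical helpers.

```php
<?php

/**
 * Names of the direct child elements of $node (XPath "./*").
 * Hypothetical helper.
 */
function childElementNames(DOMNode $node): array
{
    $xpath = new DOMXPath($node->ownerDocument);
    $names = [];
    foreach ($xpath->query('./*', $node) as $child) {
        $names[] = $child->nodeName;
    }

    return $names;
}

/**
 * Names of all descendant elements of $node (XPath ".//*").
 */
function descendantElementNames(DOMNode $node): array
{
    $xpath = new DOMXPath($node->ownerDocument);
    $names = [];
    foreach ($xpath->query('.//*', $node) as $descendant) {
        $names[] = $descendant->nodeName;
    }

    return $names;
}
```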

Attribute selector

When using find multiple times in succession things like ->text() and innerhtml don't work

Error when trying to replace an element ( $post->replace() )

Trying to replace an element produces this error:

Argument 1 passed to DOMNode::isSameNode() must be an instance of DOMNode, null given, called in /...../vendor/imangazaliev/didom/src/DiDom/Element.php on line 339 and defined

$document = new Document( 'https://habrahabr.ru/', true );
$posts = $document->find( '.post' );
foreach ( $posts as $post ) {
    $post->replace( new Element( 'span', 'Working', [ 'class' => 'alert alert-success' ] ) );
}

Bug: <img..> to <img..></img>

Hello. The latest version has a bug: it outputs <img..></img> instead of <img>.

$html = '<div class="block-media"><img src="image.png"></div>';

$doc = new Document();

$doc->loadHtml($html);

$data = $doc->xpath('//div');

print_r($data[0]->html());

old version is GOOD:
<div class="block-media"><img src="image.png"/></div>

NEW version is BAD:
<div class="block-media"><img src="image.png"></img></div>
