technosophos / querypath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources.

Home Page: http://querypath.org

License: Other


querypath's Introduction

Use GravityPDF's QueryPath

QueryPath is now updated and maintained by the amazing folks at GravityPDF. https://github.com/GravityPDF/querypath. The version of QueryPath in this repository is no longer maintained. You are encouraged to use GravityPDF's version. -- Matt, Dec. 2022

QueryPath: Find your way.

Authors: Matt Butcher (lead), Emily Brand, and many others

This package is licensed under an MIT license (COPYING-MIT.txt).

At A Glance

QueryPath is a jQuery-like library for working with XML and HTML documents in PHP. It now contains support for HTML5 via the HTML5-PHP project.

Getting Started

Assuming you have successfully installed QueryPath via Composer, you can parse documents like this:

require_once "vendor/autoload.php";

// HTML5 (new)
$qp = html5qp("path/to/file.html");

// Legacy HTML via libxml
$qp = htmlqp("path/to/file.html");

// XML or XHTML
$qp = qp("path/to/file.html");

// All of the above can take string markup instead of a file name:
$qp = qp("<?xml version='1.0'?><hello><world/></hello>");

But the real power comes from chaining. Check out the example below.

Example Usage

Say we have a document like this:

<?xml version="1.0"?>
<table>
  <tr id="row1">
    <td>one</td><td>two</td><td>three</td>
  </tr>
  <tr id="row2">
    <td>four</td><td>five</td><td>six</td>
  </tr>
</table>

And say that the above is stored in the variable $xml. Now we can use QueryPath like this:

<?php
// Add the attribute "foo=bar" to every "td" element.
qp($xml, 'td')->attr('foo', 'bar');

// Print the contents of the third TD in the second row:
print qp($xml, '#row2>td:nth(3)')->text();

// Append another row to the XML and then write the
// result to standard output:
qp($xml, 'tr:last')->after('<tr><td/><td/><td/></tr>')->writeXML();

?>

(This example is in examples/at-a-glance.php.)

With over 60 functions and robust support for chaining, you can accomplish sophisticated XML and HTML processing using QueryPath.

QueryPath Installers

The preferred method of installing QueryPath is via Composer.

You can also download the package from GitHub.

Composer (Preferred)

To add QueryPath as a library in your project, add this to the 'require' section of your composer.json:

{
  "require": {
    "querypath/QueryPath": ">=3.0.0"
  }
}

Then run php composer.phar install in that directory.

To stay up to date on stable code, you can use dev-master instead of >=3.0.0.

Manual Install

You can either download a stable release from the GitHub Tags page or you can use git to clone this repository and work from the code.

Including QueryPath

As of QueryPath 3.x, QueryPath uses the Composer autoloader if you installed with composer:

<?php
require 'vendor/autoload.php';
?>

Without Composer, you can include QueryPath like this:

<?php
require 'QueryPath/src/qp.php';
?>

QueryPath can also be compiled into a Phar and then included like this:

<?php
require 'QueryPath.phar';
?>

From there, the main functions you will want to use are qp() (alias of QueryPath::with()) and htmlqp() (alias of QueryPath::withHTML()). Start with the API docs.


querypath's Issues

:contains() should strip quotes from strings

The filters :contains("foo"), :contains('foo') and :contains(foo) should all be treated the same. This is to conform to jQuery's implementation of the CSS 3 Selector specification.

Retrieving a node's text content concatenates all text without a separator

When retrieving the text content of a node, portions of text that should be separated by a space or a dot are instead concatenated without any separator.

This happens in situations such as retrieving the text of a node like this one, for instance:

2ND FLOOR
SIXTY CIRCULAR ROAD
DOUGLAS IM1 1SA.

Applying the text() method to the td node will result in: "2ND FLOORSIXTY CIRCULAR ROADDOUGLAS IM1 1SA" instead of something like "2ND FLOOR, SIXTY CIRCULAR ROAD, DOUGLAS IM1 1SA"

To obtain such a result, obviously we could use something like this:
qp($node)->xpath('descendant::text()')->textImplode($separator);

Where $separator could be a whitespace, a dot, a comma, and so on...

Unfortunately, this won't work for node attributes, as attributes are not considered child nodes.

Creating a new method dedicated to this feature would be a good idea.
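
With plain DOM, the same idea can be sketched as follows. This is a minimal illustration rather than QueryPath's API: the `<br/>`-separated markup is an assumed stand-in for the address cell above, and `implodeText()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: join every non-empty descendant text node of $node
// with $separator. $node must belong to a document (not be the document itself).
function implodeText(DOMNode $node, string $separator): string
{
    $xpath = new DOMXPath($node->ownerDocument);
    $parts = [];
    foreach ($xpath->query('descendant::text()', $node) as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

// Assumed sample markup: an address cell whose lines are separated by <br/>.
$doc = new DOMDocument();
$doc->loadXML('<td>2ND FLOOR<br/>SIXTY CIRCULAR ROAD<br/>DOUGLAS IM1 1SA.</td>');

$joined = implodeText($doc->documentElement, ', ');
echo $joined, PHP_EOL; // 2ND FLOOR, SIXTY CIRCULAR ROAD, DOUGLAS IM1 1SA.
```

Attribute values are not text nodes, so they would still need to be collected separately.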

QueryPathImpl parseXMLFile() assumes remote HTML files always end in .html

I'm trying to traverse an HTML document but the URI doesn't end in .html and it's remote so I get the failed to load file QueryPathException.

I'm just getting my feet wet with QueryPath, and it's definitely sweet, but I'm not sure of the fix here. Perhaps check the response headers to decide whether it's XML or HTML?

Write unit tests for QPXSL

The QPXSL extension doesn't seem to have any unit tests associated. Write some.

Coding errors in QueryPath.php

I found some problems in the QueryPath code, so I want to ask you about them:

In lines 450 to 456 of the file QueryPath.php:
    If no match is found, we set an empty.
but the setting is not outside the foreach!

In line 469:
    $vals = explode(' ', $nl->item($i)->getAttribute('class'));
    $nl->item($i) is a DOMNode, and there is no getAttribute method in class DOMNode.

In line 2956:
    if ($lastDot !== FALSE && (strtolower(substr($filename, $lastDot)) == '.html'
    but many files have the extension ".htm" rather than ".html". What can I do?

In line 2773:
    $test = substr($string, 0, 255);
    The variable $test is never used again after this point.

Hope to hear from you soon!
Thanks!
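
For the ".htm" versus ".html" point, one possible approach is to compare the lower-cased extension against a small list instead of a single suffix. The sketch below is an assumption about how such a check could look; `hasHtmlExtension()` is a hypothetical name, not a QueryPath function.

```php
<?php
// Hypothetical helper: treat both ".html" and ".htm" (in any letter case)
// as HTML file names.
function hasHtmlExtension(string $filename): bool
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return in_array($extension, ['html', 'htm'], true);
}

var_dump(hasHtmlExtension('index.HTM'));  // bool(true)
var_dump(hasHtmlExtension('page.html'));  // bool(true)
var_dump(hasHtmlExtension('feed.xml'));   // bool(false)
```

This only looks at the name, so remote URLs without an extension would still need a different signal, such as the response headers.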

Quotes not always escaped when XML is rendered

Sometimes single and double quotes are decoded and rendered in an unencoded state.

See http://drupal.org/node/686966 for an example.

This does not seem to cause PHP's libxml to choke, but it might cause problems for other parsers.

css() method overwrites all styles, not just named one

Given a document with the following element:

...

The following code:

$qp->find('h1')->css('font-size', 'large')->css('font-weight', 'bold');

produces the following:

...

The css() method apparently overwrites all styles that came before it, which makes it difficult to use since one does not always know what existing styles there are in the element.
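
The expected behavior, merging one property into the existing style attribute instead of replacing it, might look like the following sketch. `mergeStyle()` is a hypothetical helper written for illustration, not the library's implementation, and it assumes simple declarations with no semicolons inside values.

```php
<?php
// Hypothetical helper: set one CSS property inside an existing style string
// while keeping every other declaration.
function mergeStyle(string $existing, string $name, string $value): string
{
    $styles = [];
    foreach (explode(';', $existing) as $declaration) {
        if (strpos($declaration, ':') === false) {
            continue; // skip empty or malformed fragments
        }
        [$property, $current] = explode(':', $declaration, 2);
        $styles[trim($property)] = trim($current);
    }
    $styles[$name] = $value; // add the property or overwrite only this one

    $out = [];
    foreach ($styles as $property => $current) {
        $out[] = $property . ': ' . $current;
    }
    return implode('; ', $out) . ';';
}

$style = mergeStyle('color: red', 'font-size', 'large');
$style = mergeStyle($style, 'font-weight', 'bold');
echo $style, PHP_EOL; // color: red; font-size: large; font-weight: bold;
```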

nth-of-type selector fails on absolute index value

The "nth-of-type" pseudo-class selector (and probably others) seems to fail on something like "p:nth-of-type(5)", which should select the fifth "p" element. It comes up with a message:

Warning: Missing argument 3 for QueryPathCssEventHandler::nthOfTypeChild(), called in ...CssEventHandler.php on line 469 and defined in ...CssEventHandler.php on line 911

and doesn't seem to select the desired element. I tried "0n+5" as an alternative with the same result.

This syntax should be a valid CSS selector (http://www.w3.org/TR/css3-selectors/#nth-child-pseudo), and browsers like Safari seem to accept it. I'll post an example when I get a chance.

CSS parser chokes on malformed XML namespaces

This test fails.

/**
 * @expectedException QueryPathException
 */
public function testFailedElementNS() {
  $mock = $this->getMock('TestCssEventHandler', array('elementNS'));
  $mock->expects($this->once())
    ->method('elementNS')
    ->with($this->equalTo('mytest'), $this->equalTo('myns'));

  // Test a failed assumption about what an NS looks like.
  $parser = new CssParser('myns\:mytest', $mock);
  $parser->parse();
}

:contains not working

Using both 2.0.1 and current HEAD, the :contains filter is not working. Try the following examples:

qp('http://php.net/', 'h1.summary a:contains("Released")') Should result in 4 (at this time) results, but has none.
qp('http://httpd.apache.org/', 'a:contains("Released")') Should result in 4 results, but has none.

XHTML exceptions for tags

Certain tags in XHTML cannot use a unary (self-closing) form (). QueryPath needs to handle those cases in the xhtml() method.

Examples:
<h?></h?>

QP 2.1 Alpha 1 template problem

I'm trying out QP 2.1 to use the new abbr() option but I think this
version is causing a problem with templates.
The template documentation shows an example of filling a table. This
works just great in 2.0 but when I try it in 2.1 I get:

Warning: SplObjectStorage::attach() expects parameter 1 to be object,
null given in C:\wamp\www\cms\querypath\QueryPath.php on line 1206

Fatal error: Call to a member function createDocumentFragment() on a
non-object in C:\wamp\www\cms\querypath\QueryPath.php on line 1717

The simpler example of filling a list works fine in 2.1

Thanks, Steve.

Code:

// Define the data
$data['.header1'][] = 'Header One';
$data['.header2'][] = 'Header Two';
$data['.table-row'][] = array(
'.cell1' => 'Cell One',
'.cell2' => 'Cell Two',
);
$data['.table-row'][] = array(
'.cell1' => 'Cell Three',
'.cell2' => 'Cell Four',
);
$data['.table-row'][] = array(
'.cell1' => 'Cell Five',
'.cell2' => 'Cell Six',
);

// Merge the template and write out the results.
qp()->tpl($tpl, $data)->writeHTML();

Attribute doubling in XHTML root element

Under some circumstances, reading an XHTML document and then writing it out results in the doubling of certain special-use attributes on the root element -- namely, xmlns and xml:lang.

This can be reproduced in plain DOM with something like this:

$doc = new DomDocument("1.0");
$doc->loadHTML($_html);
print $doc->saveXML();

It is unclear whether this has any truly negative side-effects, but it is incorrect nonetheless. Since it does appear to be a DOM bug, for QueryPath to fix it, it would have to be done as a post-processing step.

DOS-formatted XML files have newlines converted to &#13;

Sending a DOS file through QueryPath will, under certain circumstances, have newlines converted to entities.

Circumstances:

Output is in XML (writeXML(), xml, innerXML())
Output is sub-portion of HTML document.

Pyrus builds incorrectly package documentation

Pyrus is not correctly setting the paths for documentation, which results in PEAR installation failures pretty much every time.

We either need a fixed Pyrus or we need to remove docs from the PEAR package.

Chokes on HTML5

Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Tag header invalid in...

Unless I am misreading this, QP is choking on the HTML5 "header" tag (which is what was located at the line the error pointed to). I am running 2.0.1.

Entity escaping from innerHTML() and html()

When getting fragments of a document with html() or innerHTML(), some entities (NBSP, BULL) are not escaped on output, but are left in as UTF-8 character sequences.

This causes other PHP functions (like htmlentities()) to do weird things when the encoding argument is passed in.

Example code:

<?php
$html = "<!DOCTYPE html><html><body>This is a string with a
non&nbsp;breaking space in it</body></html>";
$QP = htmlqp($html, 'html', array('convert_to_encoding' => 'utf-8'));
//$QP = htmlqp($html, 'html');
$QP = qp($html, 'html');

echo '1. ' . htmlentities($QP->html()) . PHP_EOL;
echo '2. ' . htmlentities($QP->top('body')->html()) . PHP_EOL;
echo '3. ' . htmlentities($QP->innerhtml()) . PHP_EOL;

echo '4. ' . htmlentities($QP->top('body')->html(), ENT_COMPAT, 'utf-8') . PHP_EOL;
?>

Only 1 and 4 encode the entities as expected.

Can't parse urls with "&"

When I parse a file containing a URL like http://www. example.com/index.html&attr=x, I get the following message:
Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadXML(): EntityRef: expecting ';' in Entity

If I remove "&attr=x", it parses successfully.
As the URLs need to remain unchanged, I can't just replace the string.
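
In XML, a bare "&" inside an attribute or text must be written as "&amp;", and the parser hands back the original character when the value is read. The following plain DOM sketch shows that round trip; the `<link>` element and variable names are assumptions made for illustration.

```php
<?php
// The URL as it should reach the application, with a raw ampersand.
$url = 'http://www.example.com/index.html&attr=x';

// Escape the value before placing it into XML markup.
$escaped = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
$xml = '<link href="' . $escaped . '"/>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// The DOM returns the unescaped value, so the URL itself is unchanged.
$href = $doc->documentElement->getAttribute('href');
echo $href, PHP_EOL; // http://www.example.com/index.html&attr=x
```

This helps only when the markup is generated by code you control. Markup that already contains unescaped ampersands is not well-formed XML and would need an HTML parser or a cleanup pass.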

Adding textAfter() and textBefore()

$qp->textAfter() would be something like :

$qp->get(0)->nextSibling->nodeValue;

It would be a good thing to have this in the core of querypath.

Combinators + and ~ do not work correctly

According to the spec, + and ~ are directional -- they only match elements after the left-side match.

QueryPath currently allows them to match any siblings.
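
The directional behavior can be illustrated with XPath's following-sibling axis, which only looks forward in document order. This is a sketch of the expected semantics using plain DOM, not QueryPath's implementation; the sample document and the `ids()` helper are assumptions.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<r><p id="a"/><h1/><p id="b"/><p id="c"/></r>');
$xpath = new DOMXPath($doc);

// Hypothetical helper: collect the id attribute of each matched element.
function ids(DOMNodeList $nodes): array
{
    $ids = [];
    foreach ($nodes as $node) {
        $ids[] = $node->getAttribute('id');
    }
    return $ids;
}

// "h1 + p": the element immediately after an h1, if it is a p.
$adjacent = ids($xpath->query('//h1/following-sibling::*[1][self::p]'));

// "h1 ~ p": every p sibling that comes after an h1.
$general = ids($xpath->query('//h1/following-sibling::p'));

print_r($adjacent); // contains only "b"
print_r($general);  // contains "b" and "c"; "a" precedes the h1 and is excluded
```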

strip_low_ascii generating invalid html

If you have newlines in your HTML ("\n"), strip_low_ascii will replace these with &#10;.

Run from the command line for a demo:

php -r "echo filter_var(\"\n\", FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW);"

&#10; is causing loadHTML() to fall over in a heap.

QPDB's dbInit() method does not pass options correctly

No options (including username and password) are passed to the PDO object when the dbInit() method is executed. The QPDB::baseDB method works correctly.

Large files read into strings result in a seg fault

Medium and large files will cause a segmentation fault under cases like this:


$contents = file_get_contents('big.xml');
qp($contents); // segmentation fault

Things will not segfault when the file is read directly by QueryPath:


qp('big.xml'); // Parses fine. No errors.

Compressed version of QP crashes my debugger

From sdboyer:

Compressed version of QP can crash some debuggers, and doesn't necessarily add anything to QueryPath. Maybe the code compression should be removed.

Typo in CssEventHandler.php

In line 1124 of CssEventHandler.php, there is a typo in the function emptyElement(): "$kid->nodType" instead of "$kid->nodeType". This causes the :empty selector to generate PHP warnings and incorrectly select elements containing only text nodes.

The backslash is not correctly removed from pseudo-class values.

The backslash is used to escape characters inside of pseudo-classes:

:contains(\))

The parser correctly ignores the backslash. However, it is not correctly removed by QueryPath. Thus, the string \) comes through as \) instead of as ).

Bug in Phar package

The Phar distribution does not have a proper alias set. Something in the deployment scripts is not working.

nth-last-of-type(-n+b) doesn't work correctly

The "nth-last-of-type" CSS pseudo-class doesn't seem to work correctly for the case (-n+b). The example below should delete the last three "div" elements, but instead QueryPath deletes only the single "third last" element. The CSS style shows the same syntax properly interpreted by a browser, coloring the last three div elements in red. I tested this with the QueryPath 2.1 release.

<?php
require_once("QueryPath/QueryPath.php");
$html = <<<"EOD"
<!DOCTYPE html>
<html>
<head>
    <title>nth-last-of-type bug in QueryPath</title>
    <style type="text/css">
<!--
        .last3:nth-last-of-type(-n+3) {
            color:red;
        }
//-->
    </style>
</head>
<body>
The "fifth", "sixth" and "seventh" div elements should be removed, and the last three ("second", "third", "fourth") be in red.<br>
QueryPath 2.1 deletes only the single "fifth" div, even though the browser handles the same syntax correctly.<hr>
<div class="last3">I am the first div.</div>
<div class="last3">I am the second div.</div>
<div class="last3">I am the third div.</div>
<div class="last3">I am the fourth div.</div>
<div class="last3">I am the fifth div.</div>
<div class="last3">I am the sixth div.</div>
<div class="last3">I am the seventh div.</div>
</body>
</html>
EOD;

    $qp = htmlqp($html, NULL);

    if (true) {  // 
        $qp->remove("div:nth-last-of-type(-n+3)");  // should remove last three div elements, but doesn't
        $html2 = $qp->top()->html();
        $html2 = html_entity_decode($html2);  // any way to avoid the entity encoding?
    } else {  // just output original html with no qp editing
        $html2 = $html;
    }

echo $html2;

?>
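
For reference, the an+b rule from the CSS 3 Selectors specification says that a position matches when it equals a*n + b for some integer n >= 0. A minimal sketch of that test follows; `matchesNth()` is a hypothetical function, not part of QueryPath. It also covers the plain-number case such as nth-of-type(5), where a is 0.

```php
<?php
// Hypothetical helper: does a 1-based position satisfy "an+b"?
// A position matches if position = a*n + b for some integer n >= 0.
function matchesNth(int $a, int $b, int $position): bool
{
    if ($a === 0) {
        return $position === $b; // e.g. nth-of-type(5)
    }
    $difference = $position - $b;
    return $difference % $a === 0 && intdiv($difference, $a) >= 0;
}

// "-n+3" over seven siblings, counting from the end for nth-last-of-type:
$matched = array_values(array_filter(
    range(1, 7),
    fn (int $position): bool => matchesNth(-1, 3, $position)
));
print_r($matched); // positions 1, 2 and 3 from the end
```

Counted from the end, positions 1 to 3 correspond to the seventh, sixth and fifth div in the example above.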

Bug in cdata()

If there are utf8 characters in the cdata section,
unexpected trailing characters will appear in the cdata() output.
The utf8 characters are correctly handled.
Tested on v2.0.1 and the latest source.

String:< div>test< /div>
Output:< div>test< /div>

String:< div>お< /div>
Output:< div>お< /div>v>

String:< div>おお< /div>
Output:< div>おお< /div>div>

String:< div>おおお< /div>
Output:< div>おおお< /div>< /div>

textarea with name and id attributes

One issue I'm hitting is that it seems to be complaining about errors in the HTML that it did not care about in the past. Some of the errors are legit, but others are surprising. For example, it complains about the following HTML snippet:

I get an error here basically telling me that "somename" is already declared. If I change either the name or id, it works.

Improper tag selection

Here's the relevant code:

https://gist.github.com/7ad314681162a48f0f2f

Basically, it's ignoring the second td in the tr, not sure why. If i simply do:

var_dump($result->children('td')->text());

I get both td's lodged together.

is() and not() should accept DOMElement

To remain compatible with jQuery 1.6, is() and not() should take DOMNode or DOMElement objects.

Maybe also callables.

http://bugs.jquery.com/ticket/2773

PEAR package puts QueryPath.php in wrong place

The current PEAR package (2.1.0beta2) puts QueryPath.php in QueryPath/QueryPath/QueryPath.php. That's one directory too many.

Add XInclude support

Add XInclude support for DOMDocument so that includes are automatically brought in.

Missing the "length" property

It would be really nice if QueryPath implemented the length property for matches. For example:

echo qp($doc, $path)->length;

As far as I can tell, there is no simple way to determine how many elements have been matched.

Undefined variable error when using "convert_to_encoding" option

When using convert_to_encoding, the following errors occur:

Notice: Undefined variable: to_encoding in QueryPath/QueryPath.php on line 3607
Notice: Undefined variable: from_encoding in QueryPath/QueryPath.php on line 3607
Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified in QueryPath/QueryPath.php on line 3607

The proper variable names are $to_enc and $from_enc.

warning: Invalid argument supplied for foreach()

I spotted a little bug.
When I call contents() and there are no contents (I suppose), I get "warning: Invalid argument supplied for foreach()", and it shows me line 1044 of QueryPath.php in my QueryPath Drupal module. This corresponds to line 2227 in your documentation (http://api.querypath.org/docs/class_query_path.html#a3c3eeaf9ed289e55cd34926feb82eabf)

I can't find out the size of contents() beforehand, because the size() function always gives me 1. Trying $source->contents()->size() does not help either, because it calls contents() again and produces the same warning.

xpath optimizations causing incorrect behavior

Matt,

I seem to have run into a bug in Querypath. If find() is used with a
.class or #id , it seems to search not just in the descendants/context
of the current match[es], but in the whole document. This seems to be
an issue in your special xpath optimization code for these two
selectors in find(). Using div.class or div.id etc. works fine because
they seem to get done by your CssParser .

I'm appending a simple testcase for it.

This seems pretty basic.. surprised it wasn't found earlier, unless
I'm doing something wrong.

Thanks
Hari
-- sorry, I'm not a member of github so I'm just sending this by email.

append('

'); $out->append('

') ->children(':last-child') ->append('

') ->parent() ->append('

') ->children(':last-child') ->append('

') ;

// look for class b inside a2. should find just one node, with id=b2
echo "find using selector '.b' on a2. should find 1 match only\n";
$out->find('.b') ;
printQp($out);
$out->end();

echo "find using selector 'div.b' on a2. should find 1 match only\n";
$out->find('div.b') ;
printQp($out);
$out->end();

// look for id b1 inside a2. should find no matches
echo "find using selector '#b1' on a2. should find no matches!\n";
$out->find('#b1') ;
printQp($out);
$out->end();

echo "find using selector 'div#b1' on a2. should find no matches!\n";
$out->find('div#b1') ;
printQp($out);
$out->end();

//$out->writeHTML();
return;

function printQp($qp) {
  $i = 0 ;
  echo "Qp: size=" . $qp->size(). " : " ;
  foreach ($qp as $match) {
    echo "match[$i] " ;
    $i++;
    //var_dump($match->text());
    $nodeType = $match->attr('nodeType') ;
    if (!empty($nodeType)){
      echo "nodeType=$nodeType " ;
    }
    $id=$match->attr('id');
    if (!empty($id)){
      echo "id=$id " ;
    }
    $class=$match->attr('class');
    if (!empty($class)){
      echo "class=$class " ;
    }
    echo " ; " ;
  }
  echo "\n
" ;
  return $qp ;
}
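
The difference between a context-scoped search and a document-wide one can be reproduced with plain DOMXPath. The sketch below is an assumption about the kind of structure the report describes (two "a" containers, each with one "b" child), not the original test case: a relative path starting with "." stays inside the context node, while an absolute path starting with "//" searches the whole document even when a context node is supplied.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<root>'
    . '<div id="a1"><div id="b1" class="b"/></div>'
    . '<div id="a2"><div id="b2" class="b"/></div>'
    . '</root>'
);
$xpath = new DOMXPath($doc);

// Use the element with id="a2" as the search context.
$a2 = $xpath->query('//*[@id="a2"]')->item(0);

// XPath test for "has class b", tolerant of multiple class names.
$hasClassB = 'contains(concat(" ", normalize-space(@class), " "), " b ")';

// Relative path: only descendants of a2 are considered.
$scoped = $xpath->query('.//*[' . $hasClassB . ']', $a2);

// Absolute path: the context node is ignored and the whole document is searched.
$global = $xpath->query('//*[' . $hasClassB . ']', $a2);

echo $scoped->length, PHP_EOL; // 1 (only b2)
echo $global->length, PHP_EOL; // 2 (b1 and b2)
```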

Can't parse "&nbsp;"

When I parse a file containing "&n bsp;", I get the following message:
Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadXML(): Entity 'nbsp' not defined in Entity

(The space between is added intentionally to prevent the automatic markdown)
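
XML predefines only five named entities (amp, lt, gt, quot, apos), so the HTML entity for a non-breaking space is unknown to an XML parser unless a DTD declares it. One possible workaround, sketched here with plain DOM, is to substitute the numeric character reference before parsing. The sample markup is an assumption.

```php
<?php
// Assumed sample markup containing an HTML-only named entity.
$markup = '<p>non&nbsp;breaking</p>';

// Replace the named entity with its numeric reference, which XML accepts.
$safe = str_replace('&nbsp;', '&#160;', $markup);

$doc = new DOMDocument();
$doc->loadXML($safe);

// The parsed text contains the real U+00A0 character.
$text = $doc->documentElement->textContent;
var_dump($text === "non\u{00A0}breaking"); // bool(true)
```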

Better HTML handling

HTML is often malformed and often requires some extra steps to parse. Can we add a method that encapsulates all of the additional handling?

CssEventHandler::removeQuotes does not check last character

The removeQuotes method does not check if value is quoted properly, it only checks that the first character is a quote. Both the first and last character should match for quotes to be removed.
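
The stricter check described above could look like the following sketch. `removeMatchingQuotes()` is a hypothetical standalone function for illustration, not the actual CssEventHandler method.

```php
<?php
// Hypothetical helper: strip surrounding quotes only when the first and last
// characters are the same quote character.
function removeMatchingQuotes(string $value): string
{
    $length = strlen($value);
    if ($length >= 2) {
        $first = $value[0];
        $last = $value[$length - 1];
        if (($first === '"' || $first === "'") && $first === $last) {
            return substr($value, 1, -1);
        }
    }
    return $value; // unquoted or improperly quoted: leave untouched
}

echo removeMatchingQuotes('"foo"'), PHP_EOL; // foo
echo removeMatchingQuotes("'foo'"), PHP_EOL; // foo
echo removeMatchingQuotes('"foo'), PHP_EOL;  // "foo
```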

remove() sets the wrong matches on the QueryPath object

The remove() method should remove the matched elements, return a QueryPath of the matched elements that were removed, but not change the matches for the present QueryPath.

Instead, remove() sets the current matches to the removed matches.

Workaround: use branch()->remove();

Force file to be parsed as HTML (or as XML)

Sometimes the autodetection in QP causes QP to parse with the non-desired parser.

Can we add a flag to allow developers to manually specify which parser should be used?

Poorly formed HTML is being treated as a filename instead of markup

Since QueryPath 2.0 beta 2, poorly formed markup is sometimes treated as a path instead of as a document. This seems to be a result of the change to QueryPath::isXMLish();

html() should check entities

In the method html(), it should check to see if replace entities needs to be called before calling:

$doc->appendXML($markup);

This seems to be handled correctly in append() and prepend() because they call prepareInsert() which does this.

Force parsing the cdata section

I need a way to parse the content in the cdata section.
I think it's good to have an option to force querypath to parse the cdata section.

querypath failing on importing xml with binary

On trying to import an xml file with ~30kb of binary data, query path returns this error:

QueryPathParseException: DOMDocument::loadXML(): CData section not finished in Entity, line: 179 (/home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php: 2789) in /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php on line 3324

Call Stack:
0.0009 100712 1. {main}() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:0
0.0352 2328460 2. mailnode_send_email() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:85
1.5770 2455976 3. qp() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:105
1.5770 2457192 4. QueryPath->__construct() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php:164
1.5771 2458124 5. QueryPath->parseXMLString() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php:351

Need a way to set encoding in constructor

From a message to the QueryPath list:

About XML encoding, I had a look at the QueryPath.php code and it
seems that every time 'new DOMDocument()' is being used, it is done so
without specifying any encoding parameter. Thus, I guess there is no
way to specify the encoding to be used when using an xml string as the
QueryPath constructor parameter.

...

I was wondering if it would not be easier to be able to specify the
encoding parameters as part of the QueryPath constructor option
parameter, so that we could do something like:

$options = array('encoding' => 'UTF-8');
$qp = qp($xmlString, $options);

(Feature) top() should take a CSS selector

Like branch(), top() should take a CSS selector.

When a selector is given, QP should search from the top of the document, equivalent to top()->find('selector').

Child selector not working correctly

Just discovered this issue: I have the following HTML snippet:

$snippet = '<p>Some text with a <strong>nested child</strong> surrounded by text.</p>';

Since there is only a single tag at the top level, one would expect that after creating a QueryPath object thus: $qp = qp($snippet); the query $qp->top('body > *') should return a single element (the paragraph), since the > selector should only return direct children of the body container. However, what is actually returned is both the P and the STRONG elements, which is wrong. This would be expected had the query been 'body *', but it is incorrect for the query 'body > *'.
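
The expected distinction between the child and descendant combinators can be checked against plain DOM using the equivalent XPath axes. This sketch loads the same snippet with DOMDocument::loadHTML(), which wraps the fragment in html and body elements by default. It illustrates the expected result and is not a reproduction of QueryPath's internals.

```php
<?php
$snippet = '<p>Some text with a <strong>nested child</strong> surrounded by text.</p>';

$doc = new DOMDocument();
$doc->loadHTML($snippet);
$xpath = new DOMXPath($doc);

// "body > *": direct children of body only.
$children = $xpath->query('//body/*');

// "body *": every descendant element of body.
$descendants = $xpath->query('//body//*');

echo $children->length, PHP_EOL;    // 1 (the p element)
echo $descendants->length, PHP_EOL; // 2 (p and strong)
```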