
querypath's Introduction

Use GravityPDF's QueryPath

QueryPath is now updated and maintained by the amazing folks at GravityPDF. https://github.com/GravityPDF/querypath. The version of QueryPath in this repository is no longer maintained. You are encouraged to use GravityPDF's version. -- Matt, Dec. 2022

QueryPath: Find your way.

Stability: Maintenance

Authors: Matt Butcher (lead), Emily Brand, and many others

Website | API Docs | VCS and Issue Tracking | Support List | Developer List | Pear channel |

This package is licensed under an MIT license (COPYING-MIT.txt).

At A Glance

QueryPath is a jQuery-like library for working with XML and HTML documents in PHP. It now contains support for HTML5 via the HTML5-PHP project.

Getting Started

Assuming you have successfully installed QueryPath via Composer, you can parse documents like this:

require_once "vendor/autoload.php";

// HTML5 (new)
$qp = html5qp("path/to/file.html");

// Legacy HTML via libxml
$qp = htmlqp("path/to/file.html");

// XML or XHTML
$qp = qp("path/to/file.html");

// All of the above can take string markup instead of a file name:
$qp = qp("<?xml version='1.0'?><hello><world/></hello>");

But the real power comes from chaining. Check out the example below.

Example Usage

Say we have a document like this:

<?xml version="1.0"?>
<table>
  <tr id="row1">
    <td>one</td><td>two</td><td>three</td>
  </tr>
  <tr id="row2">
    <td>four</td><td>five</td><td>six</td>
  </tr>
</table>

And say that the above is stored in the variable $xml. Now we can use QueryPath like this:

<?php
// Add the attribute "foo=bar" to every "td" element.
qp($xml, 'td')->attr('foo', 'bar');

// Print the contents of the third TD in the second row:
print qp($xml, '#row2>td:nth(3)')->text();

// Append another row to the XML and then write the
// result to standard output:
qp($xml, 'tr:last')->after('<tr><td/><td/><td/></tr>')->writeXML();

?>

(This example is in examples/at-a-glance.php.)

With over 60 functions and robust support for chaining, you can accomplish sophisticated XML and HTML processing using QueryPath.

QueryPath Installers

The preferred method of installing QueryPath is via Composer.

You can also download the package from GitHub.

Composer (Preferred)

To add QueryPath as a library in your project, add this to the 'require' section of your composer.json:

{
  "require": {
    "querypath/QueryPath": ">=3.0.0"
  }
}

Then run php composer.phar install in that directory.

To stay up to date on stable code, you can use dev-master instead of >=3.0.0.

Manual Install

You can either download a stable release from the GitHub Tags page or you can use git to clone this repository and work from the code.

Including QueryPath

As of QueryPath 3.x, QueryPath uses the Composer autoloader if you installed with composer:

<?php
require 'vendor/autoload.php';
?>

Without Composer, you can include QueryPath like this:

<?php
require 'QueryPath/src/qp.php';
?>

QueryPath can also be compiled into a Phar and then included like this:

<?php
require 'QueryPath.phar';
?>

From there, the main functions you will want to use are qp() (alias of QueryPath::with()) and htmlqp() (alias of QueryPath::withHTML()). Start with the API docs.

querypath's People

Contributors

danielbachhuber, eabrand, faryshta, fiveminuteargument, gdmac, hakre, mattfarina, noisan, sandeepshetty, technosophos, tomorrowtoday, yaph, zackkatz, zemistr


querypath's Issues

:contains() should strip quotes from strings

The filters :contains("foo"), :contains('foo') and :contains(foo) should all be treated the same. This is to conform to jQuery's implementation of the CSS 3 Selector specification.
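
One way to treat the three forms alike is to normalize the filter argument before comparing it. The following is a minimal sketch only: normalizeContainsArgument() is a hypothetical helper, not part of QueryPath, and it assumes the argument arrives as a raw string.

```php
<?php
// Hypothetical helper (not part of QueryPath): remove one pair of matching
// surrounding quotes from a :contains() argument, if present.
function normalizeContainsArgument(string $arg): string
{
    $arg = trim($arg);
    $len = strlen($arg);
    if ($len >= 2) {
        $first = $arg[0];
        $last = $arg[$len - 1];
        // Only strip when the first and last characters are the same quote.
        if (($first === '"' || $first === "'") && $first === $last) {
            return substr($arg, 1, $len - 2);
        }
    }
    return $arg;
}

// All three spellings normalize to the same string.
foreach (['"foo"', "'foo'", 'foo'] as $raw) {
    echo normalizeContainsArgument($raw), PHP_EOL;
}
```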

Retrieving a node's text content concatenates all text without a separator

Basically, when retrieving the text content of a given node, we want to avoid text portions that should be separated by a space or dot being concatenated without any separator.

This issue arises in situations such as retrieving the text of a node like this, for instance:

2ND FLOOR
SIXTY CIRCULAR ROAD
DOUGLAS IM1 1SA.

Applying the text() method to the td node will result in: "2ND FLOORSIXTY CIRCULAR ROADDOUGLAS IM1 1SA" instead of something like "2ND FLOOR, SIXTY CIRCULAR ROAD, DOUGLAS IM1 1SA"

To obtain such a result, obviously we could use something like this:
qp($node)->xpath('descendant::text()')->textImplode($separator);

Where $separator could be a whitespace, a dot, a comma, and so on...

Unfortunately, this won't work for a node's attributes, as attributes are not considered child nodes.

Creating a new method dedicated to this feature would be a good idea.
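
For illustration, such a method could look roughly like the sketch below, written against PHP's plain DOM extension rather than QueryPath. The function name textWithSeparator() is hypothetical, and the sample markup (line breaks as br elements inside a td) is an assumption, since the original markup was not preserved in the report.

```php
<?php
// Hypothetical helper: join the non-empty descendant text nodes of $node
// with $separator instead of concatenating them directly.
function textWithSeparator(DOMNode $node, string $separator = ', '): string
{
    $xpath = new DOMXPath($node->ownerDocument);
    $parts = [];
    foreach ($xpath->query('descendant::text()', $node) as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

// Assumed markup: the address lines are separated by <br/> elements.
$doc = new DOMDocument();
$doc->loadXML('<td>2ND FLOOR<br/>SIXTY CIRCULAR ROAD<br/>DOUGLAS IM1 1SA</td>');
echo textWithSeparator($doc->documentElement), PHP_EOL;
// prints "2ND FLOOR, SIXTY CIRCULAR ROAD, DOUGLAS IM1 1SA"
```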

QueryPathImpl parseXMLFile() assumes remote HTML files always end in .html

I'm trying to traverse an HTML document but the URI doesn't end in .html and it's remote so I get the failed to load file QueryPathException.

I'm just getting my feet wet with QueryPath, it's definitely sweet, but I'm not sure the fix here. Perhaps check the response headers to decide whether it's XML or HTML?
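
As a rough sketch of the header-based idea, the function below maps a Content-Type header value to a parser choice. parserForContentType() and its return values are hypothetical, and fetching the header from the remote server is left out.

```php
<?php
// Hypothetical helper: guess which parser to use from a Content-Type value.
// Returns 'html', 'xml', or null when the header is not conclusive.
function parserForContentType(string $contentType): ?string
{
    // Drop parameters such as "; charset=UTF-8" and normalize case.
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    if ($mime === 'text/html') {
        return 'html';
    }
    if ($mime === 'text/xml' || $mime === 'application/xml'
        || substr($mime, -4) === '+xml') {
        return 'xml';
    }
    return null;
}

echo parserForContentType('text/html; charset=UTF-8'), PHP_EOL;
echo parserForContentType('application/xhtml+xml'), PHP_EOL;
```

When the result is null, a caller could fall back to the existing file-extension check.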

Coding errors in QueryPath.php

I found some problems in your QueryPath code, so I want to ask you:

In lines 450 to 456 of the file QueryPath.php:
    If no match is found, we set an empty.
but the setting is not outside the foreach!

In the line 469:
    $vals = explode(' ', $nl->item($i)->getAttribute('class')); 
    $nl->item($i) is a DOMNode and there is no method getAttribute in class DOMNode.

In line 2956:
    if ($lastDot !== FALSE && (strtolower(substr($filename, $lastDot)) == '.html'
    but many files have the extension ".htm" rather than ".html". What can I do?

In line 2773
    $test = substr($string, 0, 255);
    The variable $test is never used again after this.

Hope to hear from you soon!
Thanks!
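
For the extension question, a check that accepts both ".htm" and ".html" might look like the sketch below. looksLikeHtmlFile() is a hypothetical name; this is not the code at line 2956, only an illustration of the idea for plain file names without query strings.

```php
<?php
// Hypothetical helper: treat both .htm and .html (in any case) as HTML files.
function looksLikeHtmlFile(string $filename): bool
{
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return in_array($ext, ['html', 'htm'], true);
}

var_dump(looksLikeHtmlFile('index.htm'));
var_dump(looksLikeHtmlFile('data.xml'));
```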

css() method overwrites all styles, not just named one

Given a document with the following element:

...

The following code:

$qp->find('h1')->css('font-size', 'large')->css('font-weight', 'bold');

produces the following:

...

The css() method apparently overwrites all styles that came before it, which makes it difficult to use since one does not always know what existing styles there are in the element.
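
The expected behavior amounts to merging one declaration into the existing style value instead of replacing the whole attribute. Below is a minimal sketch of that merge on plain strings. mergeStyle() is a hypothetical helper, the starting value "color: red" is an assumed example, and the naive split on ";" would not cope with values that themselves contain semicolons.

```php
<?php
// Hypothetical helper: set one CSS property inside an existing style value
// while keeping every other declaration.
function mergeStyle(string $style, string $name, string $value): string
{
    $declarations = [];
    foreach (explode(';', $style) as $declaration) {
        if (strpos($declaration, ':') === false) {
            continue; // skip empty or malformed fragments
        }
        [$prop, $val] = explode(':', $declaration, 2);
        $declarations[strtolower(trim($prop))] = trim($val);
    }
    // Add the new property, or overwrite only the property with the same name.
    $declarations[strtolower(trim($name))] = trim($value);

    $out = [];
    foreach ($declarations as $prop => $val) {
        $out[] = $prop . ': ' . $val;
    }
    return implode('; ', $out) . ';';
}

$style = mergeStyle('color: red', 'font-size', 'large');
$style = mergeStyle($style, 'font-weight', 'bold');
echo $style, PHP_EOL;
// prints "color: red; font-size: large; font-weight: bold;"
```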

nth-of-type selector fails on absolute index value

The "nth-of-type" pseudo-class selector (and probably others) seems to fail on something like "p:nth-of-type(5)", which should select the fifth "p" element. It comes up with a message:

Warning: Missing argument 3 for QueryPathCssEventHandler::nthOfTypeChild(), called in ...CssEventHandler.php on line 469 and defined in ...CssEventHandler.php on line 911

and doesn't seem to select the desired element. I tried "0n+5" as an alternative with the same result.

This syntax should be a valid CSS selector (http://www.w3.org/TR/css3-selectors/#nth-child-pseudo), and browsers like Safari seem to accept it. I'll post an example when I get a chance.
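
To make the expected behavior concrete, here is a sketch of how an argument such as "5" or "0n+5" can be reduced to the (a, b) pair of the an+b notation. parseAnB() is a hypothetical function, not QueryPath's parser, and it is more lenient about whitespace than the specification.

```php
<?php
// Hypothetical helper: parse "5", "odd", "even", "2n+1", "-n+3" or "0n+5"
// into the pair [a, b] of the CSS an+b notation.
function parseAnB(string $expr): array
{
    $expr = strtolower(str_replace(' ', '', $expr));
    if ($expr === 'odd') {
        return [2, 1];
    }
    if ($expr === 'even') {
        return [2, 0];
    }
    // A bare integer is an absolute index: a = 0, b = the integer.
    if (preg_match('/^[+-]?\d+$/', $expr)) {
        return [0, (int) $expr];
    }
    if (preg_match('/^([+-]?\d*)n([+-]\d+)?$/', $expr, $m)) {
        if ($m[1] === '' || $m[1] === '+') {
            $a = 1;
        } elseif ($m[1] === '-') {
            $a = -1;
        } else {
            $a = (int) $m[1];
        }
        $b = isset($m[2]) ? (int) $m[2] : 0;
        return [$a, $b];
    }
    throw new InvalidArgumentException("Invalid an+b expression: $expr");
}

// Both spellings from the report describe the same absolute index.
var_dump(parseAnB('5') === parseAnB('0n+5'));
```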

CSS parser chokes on malformed XML namespaces

This test fails.

/**
 * @expectedException QueryPathException
 */
public function testFailedElementNS() {
$mock = $this->getMock('TestCssEventHandler', array('elementNS'));
$mock->expects($this->once())
  ->method('elementNS')
  ->with($this->equalTo('mytest'), $this->equalTo('myns'));


// Test a failed assumption about what an NS looks like.
$parser = new CssParser('myns\:mytest', $mock);
$parser->parse();
}

:contains not working

Using both 2.0.1 and current HEAD, the :contains filter is not working. Try the following examples:

  1. qp('http://php.net/', 'h1.summary a:contains("Released")') Should result in 4 (at this time) results, but has none.
  2. qp('http://httpd.apache.org/', 'a:contains("Released")') Should result in 4 results, but has none.

XHTML exceptions for tags

Certain tags in XHTML cannot use the unary (self-closing) form. QueryPath needs to handle those cases in the xhtml() method.

Examples:
<h?></h?>

<script></script>

QP 2.1 Alpha 1 template problem

I'm trying out QP 2.1 to use the new abbr() option but I think this
version is causing a problem with templates.
The template documentation shows an example of filling a table. This
works just great in 2.0 but when I try it in 2.1 I get:

Warning: SplObjectStorage::attach() expects parameter 1 to be object,
null given in C:\wamp\www\cms\querypath\QueryPath.php on line 1206

Fatal error: Call to a member function createDocumentFragment() on a
non-object in C:\wamp\www\cms\querypath\QueryPath.php on line 1717

The simpler example of filling a list works fine in 2.1

Thanks, Steve.

Code:

';

// Define the data
$data['.header1'][] = 'Header One';
$data['.header2'][] = 'Header Two';
$data['.table-row'][] = array(
'.cell1' => 'Cell One',
'.cell2' => 'Cell Two',
);
$data['.table-row'][] = array(
'.cell1' => 'Cell Three',
'.cell2' => 'Cell Four',
);
$data['.table-row'][] = array(
'.cell1' => 'Cell Five',
'.cell2' => 'Cell Six',
);

// Merge the template and write out the results.
qp()->tpl($tpl, $data)->writeHTML();

?>

Attribute doubling in XHTML root element

Under some circumstances, reading an XHTML document and then writing it out results in the doubling of certain special-use attributes on the root element -- namely, xmlns and xml:lang.

This can be reproduced in plain DOM with something like this:

$doc = new DomDocument("1.0");
$doc->loadHTML($_html);
print $doc->saveXML();

It is unclear whether this has any truly negative side-effects, but it is incorrect nonetheless. Since it does appear to be a DOM bug, for QueryPath to fix it, it would have to be done as a post-processing step.

Pyrus builds incorrectly package documentation

Pyrus is not correctly setting the paths for documentation, which results in PEAR installation failures pretty much every time.

We either need a fixed Pyrus or we need to remove docs from the PEAR package.

Chokes on HTML5

Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Tag header invalid in...

Unless I am misreading this, QP is choking on the HTML5 header tag (which is what was located at the line the error pointed to). I am running 2.0.1

Entity escaping from innerHTML() and html()

When getting fragments of a document with html() or innerHTML(), some entities (NBSP, BULL) are not escaped on output, but are left in as UTF-8 character sequences.

This causes other PHP functions (like htmlentities()) to do weird things when the encoding argument is passed in.

Example code:

<?php
$html = "<!DOCTYPE html><html><body>This is a string with a
non&nbsp;breaking space in it</body></html>";
$QP = htmlqp($html, 'html', array('convert_to_encoding' => 'utf-8'));
//$QP = htmlqp($html, 'html');
$QP = qp($html, 'html');

echo '1. ' . htmlentities($QP->html()) . PHP_EOL;
echo '2. ' . htmlentities($QP->top('body')->html()) . PHP_EOL;
echo '3. ' . htmlentities($QP->innerhtml()) . PHP_EOL;

echo '4. ' . htmlentities($QP->top('body')->html(), ENT_COMPAT, 'utf-8') . PHP_EOL;
?>

Only 1 and 4 encode the entities as expected.

Can't parse urls with "&"

When I parse a file containing a URL like this -> http://www. example.com/index.html&attr=x
I get the following message.
Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadXML(): EntityRef: expecting ';' in Entity

If I remove "&attr=x", it parses successfully.
As the URLs need to remain unchanged, I can't just replace the string.
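
One possible workaround, sketched here with plain DOM rather than QueryPath, is to escape only the bare ampersands before parsing. The XML parser turns "&amp;" back into "&", so the URL read from the parsed document is unchanged. escapeBareAmpersands() is a hypothetical helper, the sample markup is assumed, and the regular expression would also rewrite ampersands inside CDATA sections, which this sketch does not handle.

```php
<?php
// Hypothetical helper: escape "&" unless it already starts an entity or
// character reference such as &amp;, &#38; or &#x26;.
function escapeBareAmpersands(string $markup): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $markup
    );
}

// Assumed sample markup containing the URL from the report.
$xml = '<a href="http://www.example.com/index.html&attr=x">link</a>';
$doc = new DOMDocument();
$doc->loadXML(escapeBareAmpersands($xml));

// The parsed attribute value still holds the original URL.
echo $doc->documentElement->getAttribute('href'), PHP_EOL;
// prints "http://www.example.com/index.html&attr=x"
```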

Adding textAfter() and textBefore()

$qp->textAfter() would be something like :

$qp->get(0)->nextSibling->nodeValue;

It would be a good thing to have this in the core of querypath.
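
In plain DOM terms, the two proposed methods could be sketched as below. The free functions textAfter() and textBefore() are hypothetical stand-ins for the requested methods; unlike the one-liner above, they return an empty string when the adjacent sibling is missing or is not a text node.

```php
<?php
// Hypothetical helpers: text of the sibling directly after or before $node,
// or an empty string when that sibling is not a text node.
function textAfter(DOMNode $node): string
{
    $next = $node->nextSibling;
    return ($next instanceof DOMText) ? $next->nodeValue : '';
}

function textBefore(DOMNode $node): string
{
    $previous = $node->previousSibling;
    return ($previous instanceof DOMText) ? $previous->nodeValue : '';
}

$doc = new DOMDocument();
$doc->loadXML('<p>before<b>bold</b>after</p>');
$b = $doc->getElementsByTagName('b')->item(0);

echo textBefore($b), ' | ', textAfter($b), PHP_EOL;
// prints "before | after"
```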

strip_low_ascii generating invalid html

If you have new lines in your html ("\n"), strip_low_ascii will replace these with &#10;

run from command line for demo:

php -r "echo filter_var(\"\n\", FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW);"

&#10; is causing loadHTML() to fall over in a heap.
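
A different approach, shown here only as a sketch and not as what strip_low_ascii does, is to remove the low control characters outright while leaving tab, line feed, and carriage return alone. stripControlCharacters() is a hypothetical helper.

```php
<?php
// Hypothetical helper: remove C0 control characters except tab (0x09),
// line feed (0x0A) and carriage return (0x0D), so newlines survive as-is.
function stripControlCharacters(string $markup): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $markup);
}

$clean = stripControlCharacters("<p>line one\nline\x01 two</p>");
echo $clean, PHP_EOL;
```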

Large files read into strings result in a seg fault

Medium and large files will cause a segmentation fault under cases like this:

$contents = file_get_contents('big.xml');
qp($contents); // segmentation fault

Things will not segfault when the file is read directly by QueryPath:

qp('big.xml'); // Parses fine. No errors.

Typo in CssEventHandler.php

In line 1124 of CssEventHandler.php, there is a typo in the function emptyElement(): "$kid->nodType" instead of "$kid->nodeType". This causes the :empty selector to generate PHP warnings and incorrectly select elements containing only text nodes.

Bug in Phar package

The Phar distribution does not have a proper alias set. Something in the deployment scripts is not working.

nth-last-of-type(-n+b) doesn't work correctly

The "nth-last-of-type" CSS pseudo-class doesn't seem to work correctly for the case (-n+b). The example below should delete the last three "div" elements, but instead QueryPath deletes only the single "third last" element. The CSS style shows the same syntax properly interpreted by a browser, coloring the last three div elements in red. I tested this with the QueryPath 2.1 release.


<?php
require_once("QueryPath/QueryPath.php");
$html = <<<"EOD"
<!DOCTYPE html>
<html>
<head>
    <title>nth-last-of-type bug in QueryPath</title>
    <style type="text/css">
<!--
        .last3:nth-last-of-type(-n+3) {
            color:red;
        }
//-->
    </style>
</head>
<body>
The "fifth", "sixth" and "seventh" div elements should be removed, and the last three ("second", "third", "fourth") be in red.<br>
QueryPath 2.1 deletes only the single "fifth" div, even though the browser handles the same syntax correctly.<hr>
<div class="last3">I am the first div.</div>
<div class="last3">I am the second div.</div>
<div class="last3">I am the third div.</div>
<div class="last3">I am the fourth div.</div>
<div class="last3">I am the fifth div.</div>
<div class="last3">I am the sixth div.</div>
<div class="last3">I am the seventh div.</div>
</body>
</html>
EOD;

    $qp = htmlqp($html, NULL);

    if (true) {  // 
        $qp->remove("div:nth-last-of-type(-n+3)");  // should remove last three div elements, but doesn't
        $html2 = $qp->top()->html();
        $html2 = html_entity_decode($html2);  // any way to avoid the entity encoding?
    } else {  // just output original html with no qp editing
        $html2 = $html;
    }

echo $html2;

?>
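
To spell out which elements the selector is expected to match, the sketch below applies the an+b rule to positions counted from the end. matchesNth() is a hypothetical function, not QueryPath code; it assumes a = -1 and b = 3 for "-n+3" and seven sibling div elements, as in the test page above.

```php
<?php
// Hypothetical helper: does a 1-based position satisfy position = a*n + b
// for some integer n >= 0?
function matchesNth(int $a, int $b, int $position): bool
{
    if ($a === 0) {
        return $position === $b;
    }
    $diff = $position - $b;
    return $diff % $a === 0 && intdiv($diff, $a) >= 0;
}

$total = 7;      // seven div elements, as in the test page
$matched = [];
for ($i = 1; $i <= $total; $i++) {
    $fromEnd = $total - $i + 1;   // nth-last-of-type counts from the end
    if (matchesNth(-1, 3, $fromEnd)) {
        $matched[] = $i;
    }
}
echo implode(', ', $matched), PHP_EOL;
// prints "5, 6, 7"
```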

Bug in cdata()

If there are utf8 characters in the cdata section,
unexpected trailing characters will appear in the cdata() output.
The utf8 characters are correctly handled.
Tested on v2.0.1 and the latest source.

String:< div>test< /div>
Output:< div>test< /div>

String:< div>お< /div>
Output:< div>お< /div>v>

String:< div>おお< /div>
Output:< div>おお< /div>div>

String:< div>おおお< /div>
Output:< div>おおお< /div>< /div>

textarea with name and id attributes

One issue I'm hitting is that it seems to be complaining about errors in the HTML that it did not care about in the past. Some of the errors are legit, but others are surprising. For example, it complains about the following HTML snippet:

<textarea name="somename" id="somename"></textarea>

I get an error here basically telling me that "somename" is already declared. If I change either the name or id, it works.

Add XInclude support

Add XInclude support for DOMDocument so that includes are automatically brought in.

Missing the "length" property

It would be really nice if QueryPath implemented the length property for matches. For example:

echo qp($doc, $path)->length;

As far as I can tell, there is no simple way to determine how many elements have been matched.

Undefined variable error when using "convert_to_encoding" option

When using convert_to_encoding, the following errors occur:

Notice: Undefined variable: to_encoding in QueryPath/QueryPath.php on line 3607
Notice: Undefined variable: from_encoding in QueryPath/QueryPath.php on line 3607
Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified in QueryPath/QueryPath.php on line 3607

The proper variable names are $to_enc and $from_enc.

warning: Invalid argument supplied for foreach()

I spotted a little bug.
When I call contents() and there are no contents (I suppose) I get "warning: Invalid argument supplied for foreach()" and it shows me the line 1044 in QueryPath.php in my QueryPath drupal module. This points to the line 2227 in your documentation (http://api.querypath.org/docs/class_query_path.html#a3c3eeaf9ed289e55cd34926feb82eabf)

I can't check the size of the contents beforehand because the size() function always gives me 1. Trying $source->contents()->size() doesn't help either, because that calls contents() again and gives me the same warning.

xpath optimizations causing incorrect behavior

Matt,

I seem to have run into a bug in QueryPath. If find() is used with a
.class or #id, it seems to search not just in the descendants/context
of the current match[es], but in the whole document. This seems to be
an issue in your special xpath optimization code for these two
selectors in find(). Using div.class or div#id etc. works fine because
they seem to get done by your CssParser.

I'm appending a simple testcase for it.

This seems pretty basic.. surprised it wasn't found earlier, unless
I'm doing something wrong.

Thanks
Hari
-- sorry, I'm not a member of github so I'm just sending this by email.


append('
'); $out->append('
') ->children(':last-child') ->append('
') ->parent() ->append('
') ->children(':last-child') ->append('
') ;

// look for class b inside a2. should find just one node, with id=b2
echo "find using selector '.b' on a2. should find 1 match only\n";
$out->find('.b') ;
printQp($out);
$out->end();

echo "find using selector 'div.b' on a2. should find 1 match only\n";
$out->find('div.b') ;
printQp($out);
$out->end();

// look for id b1 inside a2. should find no matches
echo "find using selector '#b1' on a2. should find no matches!\n";
$out->find('#b1') ;
printQp($out);
$out->end();

echo "find using selector 'div#b1' on a2. should find no matches!\n";
$out->find('div#b1') ;
printQp($out);
$out->end();

//$out->writeHTML();
return;

function printQp($qp) {
  $i = 0 ;
  echo "Qp: size=" . $qp->size(). " : " ;
  foreach ($qp as $match) {
    echo "match[$i] " ;
    $i++;
    //var_dump($match->text());
    $nodeType = $match->attr('nodeType') ;
    if (!empty($nodeType)){
      echo "nodeType=$nodeType " ;
    }
    $id=$match->attr('id');
    if (!empty($id)){
      echo "id=$id " ;
    }
    $class=$match->attr('class');
    if (!empty($class)){
      echo "class=$class " ;
    }
    echo " ; " ;
  }
  echo "\n
" ;
  return $qp ;
}

Can't parse "&nbsp;"

When I parse a file with "&n bsp;", I get the following message.
Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadXML(): Entity 'nbsp' not defined in Entity

(The space between is added intentionally to prevent the automatic markdown)
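
A possible pre-processing step, sketched with plain DOM, is to replace HTML named entities that XML does not define with numeric character references before calling loadXML(). namedEntitiesToNumeric() is a hypothetical helper and its three-entry map is only an illustrative subset.

```php
<?php
// Hypothetical helper: replace a few HTML named entities, which plain XML
// does not define, with equivalent numeric character references.
function namedEntitiesToNumeric(string $markup): string
{
    $map = [
        '&nbsp;' => '&#160;',
        '&bull;' => '&#8226;',
        '&copy;' => '&#169;',
    ];
    return strtr($markup, $map);
}

$doc = new DOMDocument();
$doc->loadXML(namedEntitiesToNumeric('<p>non&nbsp;breaking</p>'));

// The text now contains U+00A0 (bytes C2 A0 in UTF-8) between the words.
echo bin2hex($doc->documentElement->textContent), PHP_EOL;
```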

Better HTML handling

HTML is often malformed and often requires some extra steps to parse. Can we add a method that encapsulates all of the additional handling?

remove() sets the wrong matches on the QueryPath object

The remove() method should remove the matched elements, return a QueryPath of the matched elements that were removed, but not change the matches for the present QueryPath.

Instead, remove() sets the current matches to the removed matches.

Workaround: use branch()->remove();

html() should check entities

In the method html(), it should check to see if replace entities needs to be called before calling:

$doc->appendXML($markup);

This seems to be handled correctly in append() and prepend() because they call prepareInsert() which does this.

Force parsing the cdata section

I need a way to parse the content in the cdata section.
I think it's good to have an option to force querypath to parse the cdata section.

querypath failing on importing xml with binary

On trying to import an xml file with ~30kb of binary data, query path returns this error:

QueryPathParseException: DOMDocument::loadXML(): CData section not finished in Entity, line: 179 (/home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php: 2789) in /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php on line 3324

Call Stack:
0.0009 100712 1. {main}() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:0
0.0352 2328460 2. mailnode_send_email() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:85
1.5770 2455976 3. qp() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/newsystemtest.php:105
1.5770 2457192 4. QueryPath->__construct() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php:164
1.5771 2458124 5. QueryPath->parseXMLString() /home/kyle/workspace/www/edully/sites/all/modules/custom_modules/mailnode/QueryPath/QueryPath.php:351

Need a way to set encoding in constructor

From a message to the QueryPath list:

About XML encoding, I had a look at the QueryPath.php code and it
seems that every time 'new DOMDocument()' is being used, it is done so
without specifying any encoding parameter. Thus, I guess there is no
way to specify the encoding to be used when using an xml string as the
QueryPath constructor parameter.

...

I was wondering if it would not be easier to be able to specify the
encoding parameters as part of the QueryPath constructor option
parameter, so that we could do something like:

$options = array('encoding' => 'UTF-8');
$qp = qp($xmlString, $options);

Child selector not working correctly

Just discovered this issue: I have the following HTML snippet:

$snippet = '<p>Some text with a <strong>nested child</strong> surrounded by text.</p>';

Since there is only a single tag at the top level, one would expect that after creating a QueryPath object thus: $qp = qp($snippet); the query $qp->top('body > *') should return a single element (the paragraph), since the > selector should only return direct children of the body container. However, what is actually returned is both the P and the STRONG elements, which is wrong. This would be expected had the query been 'body *', but it is incorrect for the query 'body > *'.
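
The expected difference between the two queries can be checked with plain DOM and XPath, independent of QueryPath. In this sketch the XPath expressions //body/* and //body//* stand in for the CSS selectors 'body > *' and 'body *'. It relies on libxml wrapping the fragment in html and body elements when loadHTML() is called without extra options.

```php
<?php
$snippet = '<p>Some text with a <strong>nested child</strong> surrounded by text.</p>';

$doc = new DOMDocument();
$doc->loadHTML($snippet);   // libxml adds the implied <html><body> wrapper
$xpath = new DOMXPath($doc);

// Child axis: direct children of body only (the equivalent of 'body > *').
$children = $xpath->query('//body/*');

// Descendant axis: every element below body (the equivalent of 'body *').
$descendants = $xpath->query('//body//*');

echo $children->length, ' ', $descendants->length, PHP_EOL;
// prints "1 2"
```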
