
duzun / hquery.php


An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.

Home Page: https://duzun.me/playground/hquery

License: MIT License

HTML 0.17% PHP 95.65% JavaScript 0.37% Shell 3.81%
hquery crawler scraper html parser psr-4 psr-0 php selectors domcrawler xml-parser html-parser xml fast broken-html invalid-html jquery-like jquery-selectors css-selectors

hquery.php's Issues

503 error code

I tried to run hQuery, but I receive a 503 error code.
What does this mean?

How do you get a node's text without its children?

Hi, first of all, thanks for the nice lib.

For example, we have code like this:

<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>

We need to get only "This is some text". The text() method gives us "This is some textFirst span textSecond span text".

There's a solution for jQuery: http://stackoverflow.com/questions/3442394/jquery-using-text-to-retrieve-only-text-not-nested-in-child-tags

Is it possible with hQuery?
Thanks.
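
One library-independent way to get only an element's own text is to select its direct text-node children. The sketch below uses PHP's bundled DOM extension rather than hQuery, so it says nothing about hQuery's own API. The helper name ownText is made up for illustration, and it assumes the id contains no quote characters.

```php
<?php
// Sketch using PHP's bundled DOM extension (not hQuery).
$html = <<<'HTML'
<li id="listItem">
    This is some text
    <span id="firstSpan">First span text</span>
    <span id="secondSpan">Second span text</span>
</li>
HTML;

// Hypothetical helper: returns the text that belongs directly to the element
// with the given id, skipping the text of its child elements.
function ownText($html, $id)
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $text = '';
    // text() selects only text nodes that are direct children of the element.
    foreach ($xpath->query("//*[@id='$id']/text()") as $node) {
        $text .= $node->nodeValue;
    }
    return trim($text);
}

echo ownText($html, 'listItem'), PHP_EOL;
```

The XPath step /text() is what excludes the two spans: their text lives in text nodes one level deeper.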

Exception Error warning

I use Symfony 3.2 and got this notice:

php.DEBUG: Notice: Undefined index: CONTENT_ENCODING {"exception":"[object] (Symfony\Component\Debug\Exception\SilencedErrorContext: {"severity":8,"file":"/../USER/MYPROJECT/vendor/duzun/hquery/hquery.php","line":2967})"}

Notice: Undefined index: method also appears in line 2751.
Can you please check this?

How to remove a node/element?

Many HTML pages contain scripts or other unwanted elements that need to be deleted before fetching text, so a way to search for and delete elements is needed. If there is a way to delete a node, please describe how; if not, please plan to implement it or guide me in adding this option.

Thanks
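
As a workaround that does not depend on hQuery, unwanted elements can be dropped with PHP's bundled DOM extension before the text is read. This is only a sketch under that assumption; visibleText is a hypothetical helper name, and the choice of script and style as the elements to remove is an example.

```php
<?php
// Sketch with PHP's bundled DOM extension (not hQuery).
// visibleText() is a hypothetical helper name.
function visibleText($html)
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Snapshot the matches into an array before modifying the tree.
    $unwanted = iterator_to_array($xpath->query('//script | //style'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }
    // What remains is the text of the elements that were kept.
    return trim($dom->documentElement->textContent);
}

echo visibleText('<div><p>Visible text</p><script>var hidden = 1;</script></div>'), PHP_EOL;
```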

Add conditions to element

How do I filter which element is being processed?

For example, I have multiple classes:
$sels = "h2, .pnlDescription, .address";

I want to have a condition with the ".address" since I want to alter the text within it.

if(".address" == "somestring")
{
----code---
}

Where and how can I achieve this?

There's an error on test

Fatal error: Using $this when not in object context in C:\xampp\htdocs\scraper\duzun.me_playground_hquery.php on line 13

Scrape in background

I love what you've done with this! I was wondering if there was any way to have hQuery queue up as a background process. I built a tool with this API, but while it's scraping, no other pages on my local server will load until it is completely finished. Is there some sort of functionality for this?

Redirection Follow: problem

Hello,

I have noticed some problems with redirection. The library fails on me and also on the demo: https://duzun.me/playground/hquery

Example URLs to check:
http://sorellhotels.com
http://bücher.ch
http://ipet.ch

As an example to explain the problem and my debugging, I will use http://sorellhotels.com

This url has 3 redirections as follows:

hQuery does this:

It changes the host incorrectly: instead of "sorellhotels.com" it uses "tls".

Can you please check this? Thx

Best regards

DS not defined

Line 1890 uses DS which is not defined?

Where should this be done?

Selector .class1.class2 seems to not work

If I want to get all elements that have both class1 and class2, it seems class2 is ignored and all results that have class1 are returned.

I'm using it like this:
$item = $doc->find( '.header-product-info--price.margin-bottom-10' );

I get all the items that have class header-product-info--price, even the ones that don't have class margin-bottom-10.

Use proxy with fromURL

Hello,
How can I use a proxy with fromUrl?

I use this code:

$doc = hQuery::fromUrl(
                $url
                , [
                    'Accept'     => $accept_html,
                    'User-Agent' => $user_agent,
                    'Referer'    => $referer,
                ]
 );

thank you
Jochen

Cache needed?

When using this library, is setting the cache folder absolutely necessary? What would happen if we didn't set it?

"Cannot redeclare class duzun\hQuery" right after install with composer

I'm installing hQuery in laravel with composer:

dev@laravel:~/apps/testapp$ composer require duzun/hquery
Using version ^1.5 for duzun/hquery
./composer.json has been updated
Warning: You should avoid overwriting already defined auth settings for github.com.
Loading composer repositories with package information
Updating dependencies (including require-dev)
  - Installing duzun/hquery (1.5.0)
    Downloading: 100%

Writing lock file
Generating autoload files
> Illuminate\Foundation\ComposerScripts::postUpdate
> php artisan optimize
Generating optimized class loader
dev@laravel:~/apps/testapp$ php artisan runcommand

  [ErrorException]
  Cannot redeclare class duzun\hQuery

runcommand contents:

use duzun\hQuery;
hQuery::$cache_path = '/home/dev/apps/testapp/storage/cache';

I've commented out the line //class_alias('hQuery', 'duzun\\hQuery'); in psr-4\hQuery.php and that solved the issue, but I'm not sure whether that is OK :)

Get Attribute value

How can we get an attribute value from this element?

<div id="cerberus-data-metrics" style="display: none;" data-asin="B00EAHSBV4" data-asin-price="24.55" data-asin-shipping="0" data-asin-currency-code="USD" data-substitute-count="-1" data-device-type="WEB" data-display-code="Asin is not eligible because it has a retail offer" ></div>

The following does not work:
$p = $doc->find_text('#cerberus-data-metrics','data-asin-price');
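
Another issue in this list reads attributes through ->attr('src'), which hints that hQuery exposes attributes via an attr() call on the found element; that is not verified here. Independently of hQuery, the value can be read with PHP's bundled DOM extension, as in the sketch below. The markup is a shortened copy of the element from the question.

```php
<?php
// Sketch with PHP's bundled DOM extension (not hQuery).
$html = '<div id="cerberus-data-metrics" style="display: none;" '
      . 'data-asin="B00EAHSBV4" data-asin-price="24.55"></div>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// item(0) is null when no element has this id.
$div = $xpath->query("//*[@id='cerberus-data-metrics']")->item(0);
$price = $div !== null ? $div->getAttribute('data-asin-price') : null;

echo $price, PHP_EOL;
```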

javascript

Hi,
Is it possible to evaluate JavaScript in the page before the page navigates?
Thanks

How to retrieve next element?

foreach($prod->find('.product-options dt') as $v) {
    echo str_replace('*','', $v->text()).' ';
    echo $v->next();
}
The next element is a dd, but echo next returns nothing. How do I get the next element object? I need to get the values for dt and dd within every dl, but dd is a select element with multiple values I need.

How do I get a value from a span?

Hi, great stuff and very fast!!

But how can I get a value from a span?
I have this line in the HTML page:
<span class="price_value" itemprop="price">3,850 ₪</span>

How do I get the value (that is, 3,850 ₪)?
I need the fastest way to get the value.

I tried this:
$doc->find('span .price_value')->text()

and it works,
but maybe there is a better or faster way to get the info.

And if I refresh 3-5 times, I get this error:
Fatal error: Call to a member function text() on null in

How can I fix that?

thanks :)
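
The fatal error means the lookup produced null on that request, so text() was called on nothing. A plain PHP guard avoids the crash regardless of which library returned the value. In the sketch below, textOrDefault and FakeNode are hypothetical names; FakeNode only stands in for a found element so the example runs without hQuery. Separately, in standard CSS 'span .price_value' selects a .price_value element nested inside a span, while 'span.price_value' selects the span itself.

```php
<?php
// Hypothetical helper: returns the node's text, or a default when the
// lookup found nothing (null) or returned something without a text() method.
function textOrDefault($node, $default = '')
{
    if (is_object($node) && method_exists($node, 'text')) {
        return $node->text();
    }
    return $default;
}

// Stand-in for a found element, used only so the sketch runs on its own.
class FakeNode
{
    private $text;

    public function __construct($text)
    {
        $this->text = $text;
    }

    public function text()
    {
        return $this->text;
    }
}

echo textOrDefault(new FakeNode('3,850 ₪'), 'n/a'), PHP_EOL;
echo textOrDefault(null, 'n/a'), PHP_EOL;
```

The guard hides the symptom only; why the element is missing on some requests (a failed download, different markup, a block page) still needs checking.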

Find element by data attribute

Is it possible to find elements by their data attribute?

I have tried:

$dom->find('span[data-price]')

But this doesn't find any of the spans with that data attribute, even though they are there!

Does it work? multiple class

I'm trying to get information from .typeHighlight class from link below:
trulia.com/property/1061429905-West-End-Heights-273-Barfield-Ave-SW-Atlanta-GA-30310

Instead of 11 nodes, I get 5. It's odd because if I use a simple html, I get the right results.

ErrorException: reset() expects parameter 1 to be array

public function hasClass($className) {
    $ret = $this->doc()->hasClass($this, $className);
    if ( count($this) < 2 ) return reset($ret);
    return max($ret);
}

I get the following error:

ErrorException: reset() expects parameter 1 to be array, boolean given in /home/vagrant/sellercrew/vendor/duzun/hquery/src/hQuery/Element.php:163

My best guess here is that this assumes the class to be present in at least one place in the whole DOM. If not present, the $ret becomes false.

can't get a website (help, question)

Hey, I have this code, it works perfectly on localhost, but it doesn't on my server

Can you help me?

Note: I know this is not a cute code, it is just for the example

$doc = hQuery::fromUrl(
    'http://www.submanga.com/Naruto'
  , array(
        'Accept'     => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    )
);

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

How is hasClass implemented?

I see mentions in the code. Could you please give an example of how I should use it?
Thanks! Great lib and I still use it :)

get_cache

How to utilize get_cache method? i.e., how to determine the particular cache file name/$fn argument?

Running with JS enabled

Guessing I know the answer to this, but the site I was scraping data from recently required JS to be enabled to view the main contents.

Do you know of any way around this?

Support for HTTP proxies

Hi @duzun,

Great work on the library. Are you open to adding support for HTTP proxies to it?

I've written a test PHP Gist that demos what's involved (currently without proxy authentication).

Alternatively, is there a way to use hQuery::fromFile() with stream_context_create()?

I'll investigate, and update this ticket ;).

Thanks,

Nick
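
For reference, a proxy-enabled stream context in plain PHP can be built as sketched below. The proxy address and User-Agent string are placeholders, and whether hQuery::fromFile() accepts such a context is exactly the open question in this issue, so the sketch stops at PHP's own stream functions.

```php
<?php
// Hypothetical proxy address; replace with a real one.
$proxy = 'tcp://127.0.0.1:8080';

$context = stream_context_create(array(
    'http' => array(
        'proxy'           => $proxy,
        // Many HTTP proxies expect the full URL in the request line.
        'request_fulluri' => true,
        // Placeholder User-Agent, chosen only for this example.
        'header'          => "User-Agent: example-scraper/1.0\r\n",
    ),
));

// Inspect what was configured; no request is sent here.
$options = stream_context_get_options($context);
echo $options['http']['proxy'], PHP_EOL;

// The context can then be passed to stream functions, for example:
// $html = file_get_contents('http://example.com/', false, $context);
```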

Make faster when looping through many pages

Hi, I have successfully used your package and it is really great. With only one URL it's fairly fast, but when I loop over multiple URLs, 50 URLs can take up to 10 minutes.

How do I make it work faster?
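
Fifty URLs in ten minutes is about twelve seconds per page, which suggests the time goes into waiting for each download in turn rather than into parsing; that is a guess, not a measurement. One common remedy is to download pages concurrently with PHP's curl_multi functions and parse the results afterwards. The sketch below assumes the curl extension is available; fetchAll is a hypothetical helper and is not part of hQuery.

```php
<?php
// Hypothetical helper: downloads all URLs concurrently and returns the
// response bodies keyed like the input array. Requires the curl extension.
function fetchAll(array $urls, $timeout = 10)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = array();
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $bodies;
}

// Usage (placeholder URLs):
// $pages = fetchAll(array('http://example.com/a', 'http://example.com/b'));
```

The returned HTML strings can then be handed to whichever parser is in use. Limiting the batch size to a few dozen URLs at a time keeps the load on the target server reasonable.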

Error in file permissions

[2] An error occurred in file /var/www/html/app/vendor/duzun/hquery/src/hQuery.php on line 629: filemtime(): stat failed

I am using a custom framework with the Blade templating engine. The cache files for Blade are stored without any issues, but I get this error in your plugin. I have set the permissions for the cache folder to rwx for all users, but I still get the error.

Error on similar attributes

Hi!
I found an error when there are similar attributes on the same tag.

If you have src and src2 attributes on an img tag and try to use ->attr('src'), it finds nothing. If you use ->attr('src2'), it finds the attribute text as expected.

Replace

Hi!
How can I:

  1. Replace tag? For ex: <title>News</title> replace to <title>My Site</title>
  2. Find all HTML code from to and delete them?

:nth-child support

Hi,

I tried to use :nth-child but it seems to not be working.

The rule I'm using is this one:
.product > .row > .col-12 > .row:nth-child(2)

Thanks

Undefined variable: te

When requesting $data = hQuery::fromUrl("https://www.google.ru/");

I get these warnings!

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2907

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2909

Notice:  Undefined variable: te in ...\duzun\hquery\hquery.php on line 2915
