PHP Scraper is a scraper library for PHP, built with simplicity in mind. The main goal is to get stuff done without getting distracted by XPath selectors, preparing data structures, and so on. You can just "go to a website" and get an array with all the details relevant to your scraping project.
Under the hood, it uses Goutte and a few other packages. See composer.json.
Want to sponsor this project? Contact me.
Here are a few impressions of the way the library works. More examples are on the project website.
All scraping functionality can be accessed either as a function call or a property call. For example, the title can be accessed in two ways:
// Prep
$web = new \spekulatius\phpscraper;
$web->go('https://google.com');
// Returns "Google"
echo $web->title;
// Also returns "Google"
echo $web->title();
The following example shows how to collect links along with meta information:
$web = new \spekulatius\phpscraper;
/**
* Navigate to the test page. This page contains several links with different rel attributes. To save space, only the first one is shown:
*
* <a href="https://placekitten.com/432/287" rel="nofollow">external kitten</a>
*/
$web->go('https://test-pages.phpscraper.de/links/rel.html');
// Get the first link on the page.
$firstLink = $web->linksWithDetails[0];
/**
* $firstLink now contains:
*
* [
* 'url' => 'https://placekitten.com/432/287',
* 'protocol' => 'https',
* 'text' => 'external kitten',
* 'title' => null,
* 'target' => null,
* 'rel' => 'nofollow',
* 'isNofollow' => true,
* 'isUGC' => false,
* 'isNoopener' => false,
* 'isNoreferrer' => false,
* ]
*/
If there aren't any matching elements (here links) on the page, an empty array will be returned.
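Because each entry is a plain associative array, the result can be processed with ordinary PHP array functions. The following is a minimal sketch rather than part of the library: `followableUrls()` is a hypothetical helper name, the second sample entry is made up for illustration, and the sample data only mirrors the array shape shown above so that the snippet runs without a network request. It assumes PHP 7.4 or newer for the arrow function.

```php
<?php

/**
 * Hypothetical helper (not part of PHPScraper): returns the URLs of all
 * links that are not flagged as nofollow.
 *
 * @param array<int, array<string, mixed>> $links Entries shaped like $web->linksWithDetails.
 * @return string[]
 */
function followableUrls(array $links): array
{
    // Keep only the entries whose 'isNofollow' flag is missing or false.
    $followable = array_filter(
        $links,
        fn (array $link): bool => empty($link['isNofollow'])
    );

    // array_column() returns a freshly indexed list of the 'url' values.
    return array_column($followable, 'url');
}

// Sample data mirroring the documented structure; the second entry is invented.
$links = [
    ['url' => 'https://placekitten.com/432/287', 'rel' => 'nofollow', 'isNofollow' => true],
    ['url' => 'https://example.com/', 'rel' => null, 'isNofollow' => false],
];

print_r(followableUrls($links));

// A page without links yields an empty array, so the helper returns one too.
print_r(followableUrls([]));
```

Since an empty array is returned when nothing matches, code like this needs no special case for pages without links.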
Scraping the images, including the attributes of the img tags:
// Prep
$web = new \spekulatius\phpscraper;
/**
* Navigate to the test page.
*
* This page contains the image "cat.jpg" twice:
* once with a relative path and once with an absolute path.
*/
$web->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html');
var_dump($web->imagesWithDetails);
/**
* Contains:
*
* [
* 'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
* 'alt' => 'absolute path',
* 'width' => null,
* 'height' => null,
* ],
* [
* 'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
* 'alt' => 'relative path',
* 'width' => null,
* 'height' => null,
* ]
*/
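As the output above shows, the relative and the absolute path both come back as the same absolute URL, so duplicates can be removed with standard array functions. This is a small sketch under that assumption; the sample data is copied from the documented output so that no request is needed, and the variable names are illustrative only.

```php
<?php

// Sample data copied from the documented output above.
$images = [
    [
        'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
        'alt' => 'absolute path',
        'width' => null,
        'height' => null,
    ],
    [
        'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
        'alt' => 'relative path',
        'width' => null,
        'height' => null,
    ],
];

// Collect each distinct image URL once; array_values() re-indexes the result.
$uniqueUrls = array_values(array_unique(array_column($images, 'url')));

// Find images that do not declare both a width and a height.
$missingDimensions = array_filter(
    $images,
    fn (array $image): bool => $image['width'] === null || $image['height'] === null
);

var_dump($uniqueUrls);
echo count($missingDimensions), PHP_EOL;
```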
You can configure proxy support with setConfig:
$web->setConfig(['proxy' => 'http://user:password@127.0.0.1:3128']);
You can set the timeout using setConfig:
$web->setConfig(['timeout' => 15]);
Setting the timeout to zero will disable it.
Although it is not recommended, you might need to disable SSL checks. You can do so using:
$web->setConfig(['disable_ssl' => true]);
You can call setConfig multiple times. It stores the config and merges it with previous settings. Keep this in mind in the unlikely case that you need to unset a value.
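To illustrate what merging means for repeated calls, here is a minimal sketch. `ConfigStore` is a hypothetical stand-in, not PHPScraper's code, and it assumes the merge behaves like PHP's `array_merge()`; the library's actual implementation may differ.

```php
<?php

/**
 * Hypothetical stand-in for illustration only; this is not PHPScraper's code.
 * It assumes the stored config is merged the way array_merge() would do it.
 */
final class ConfigStore
{
    /** @var array<string, mixed> */
    private array $config = [];

    public function setConfig(array $config): self
    {
        // Later values override earlier ones; keys not mentioned are kept.
        $this->config = array_merge($this->config, $config);

        return $this;
    }

    public function getConfig(): array
    {
        return $this->config;
    }
}

$store = new ConfigStore();
$store->setConfig(['timeout' => 15]);
$store->setConfig(['disable_ssl' => true]);

// The second call did not mention 'timeout', so it is still set.
// Under this model, a value has to be overridden explicitly.
$store->setConfig(['timeout' => 0]);

var_dump($store->getConfig());
```

Under this model, leaving a key out of a later call never removes it, which is why unsetting a value takes an explicit override.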
See the full documentation on the website for more information and many more examples.
Composer is used to install PHPScraper:
composer require spekulatius/phpscraper
After the installation, the package will be picked up by the Composer autoloader. If you are using a typical PHP application or a framework such as Laravel or Symfony, you can start scraping right away. You can use any of the examples on the website or the examples in the tests/ folder.
Please consider supporting PHPScraper with a star or sponsorship:
composer thanks
Thank you 💪