PHP Scraper

PHP Scraper is a scraping library for PHP, built with simplicity in mind. Its goal is to let you get things done without being distracted by XPath selectors, preparing data structures, and similar chores. You simply "go to a website" and get back an array with the details relevant to your scraping project.

Under the hood, it uses Goutte and a few other packages. See composer.json.

Examples

The following examples give an impression of how the library works. More examples are on the project website.

Basics: Get the Title of a Website

All scraping functionality can be accessed either as a method call or as a property. For example, the title can be accessed in two ways:

// Prep
$web = new \spekulatius\phpscraper;

$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();

Links

The following example shows how to collect links along with meta information:

$web = new \spekulatius\phpscraper;

/**
 * Navigate to the test page. This page contains several links with different rel attributes. To save space, only the first one is shown:
 *
 * <a href="https://placekitten.com/432/287" rel="nofollow">external kitten</a>
 */
$web->go('https://test-pages.phpscraper.de/links/rel.html');

// Get the first link on the page.
$firstLink = $web->linksWithDetails[0];

/**
 * $firstLink now contains:
 *
 * [
 *     'url' => 'https://placekitten.com/432/287',
 *     'protocol' => 'https',
 *     'text' => 'external kitten',
 *     'title' => null,
 *     'target' => null,
 *     'rel' => 'nofollow',
 *     'isNofollow' => true,
 *     'isUGC' => false,
 *     'isNoopener' => false,
 *     'isNoreferrer' => false,
 * ]
 */

If there aren't any matching elements (here links) on the page, an empty array will be returned.
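
Because the result is a plain PHP array, it can be processed with the standard array functions. The following is a minimal sketch, not part of the library: it uses hard-coded sample data shaped like the output above (the second entry and the variable names are made up for illustration) and keeps only the links that are not marked as nofollow.

```php
<?php

// Sample data shaped like the linksWithDetails output shown above.
// The second entry is hypothetical and exists only for illustration.
$links = [
    [
        'url' => 'https://placekitten.com/432/287',
        'text' => 'external kitten',
        'rel' => 'nofollow',
        'isNofollow' => true,
    ],
    [
        'url' => 'https://example.com/about',
        'text' => 'about',
        'rel' => null,
        'isNofollow' => false,
    ],
];

// Keep only the links that are not marked as nofollow.
// array_values() re-indexes the result so it starts at 0 again.
$followed = array_values(array_filter(
    $links,
    static function (array $link): bool {
        return !$link['isNofollow'];
    }
));

// Collect just the URLs of the remaining links.
$urls = array_column($followed, 'url');

print_r($urls);
```

The same code also works on an empty array, so the "no links found" case needs no special handling.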

Scrape the Images from a Website

The following example scrapes the images along with the attributes of their img tags:

// Prep
$web = new \spekulatius\phpscraper;

/**
 * Navigate to the test page.
 *
 * This page contains the image "cat.jpg" twice:
 * once with a relative path and once with an absolute path.
 */
$web->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html');

var_dump($web->imagesWithDetails);
/**
 * Contains:
 *
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'absolute path',
 *     'width' => null,
 *     'height' => null,
 * ],
 * [
 *     'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *     'alt' => 'relative path',
 *     'width' => null,
 *     'height' => null,
 * ]
 */
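
Both entries above resolve to the same absolute URL. If each image should appear only once, the result can be de-duplicated by URL. The sketch below is one possible approach, using hard-coded data copied from the output above rather than a live request; the variable names are illustrative.

```php
<?php

// Sample data copied from the imagesWithDetails output shown above.
$images = [
    [
        'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
        'alt' => 'absolute path',
        'width' => null,
        'height' => null,
    ],
    [
        'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
        'alt' => 'relative path',
        'width' => null,
        'height' => null,
    ],
];

// Index the images by URL, keeping the first entry seen for each URL.
$byUrl = [];
foreach ($images as $image) {
    if (!isset($byUrl[$image['url']])) {
        $byUrl[$image['url']] = $image;
    }
}

// Drop the URL keys again to get a plain list.
$uniqueImages = array_values($byUrl);

var_dump(count($uniqueImages));
```

Keeping the first occurrence is an arbitrary choice here; the other attributes (such as alt) of later duplicates are discarded.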

Proxy Support

You can configure proxy support with setConfig:

$web->setConfig(['proxy' => 'http://user:password@127.0.0.1:3128']);

Timeout

You can set the timeout using setConfig:

$web->setConfig(['timeout' => 15]);

Setting the timeout to zero will disable it.

Disabling SSL

Although it is not recommended, you may need to disable SSL checks. You can do so using:

$web->setConfig(['disable_ssl' => true]);

You can call setConfig multiple times. It stores the config and merges it with the previous settings, so a value set earlier stays in effect until it is overwritten. Keep this in mind in the unlikely case that you need to unset a value.
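
The merging behaviour can be pictured with plain arrays. This is only a sketch under the assumption that the merge behaves like PHP's array_merge; it does not call the library, and the library's internal implementation may differ. The $config variable is a hypothetical stand-in for the stored settings.

```php
<?php

// Hypothetical stand-in for the stored configuration (not the library's internals).
$config = [];

// Each "call" merges the new values over the stored ones.
$config = array_merge($config, ['timeout' => 15]);
$config = array_merge($config, ['disable_ssl' => true]);

// Earlier keys survive later calls, so both settings are present now.
// To change a value, pass the same key again with its new value.
$config = array_merge($config, ['timeout' => 0]);

var_export($config);
```

Under this assumption, leaving a key out of a later call does not reset it; only passing the key again changes its value.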

See the full documentation on the website for more information and many more examples.

Installation

Install PHPScraper with Composer:

composer require spekulatius/phpscraper

After installation, the Composer autoloader picks up the package. If you are using a typical PHP application or a framework such as Laravel or Symfony, you can start scraping right away, using any of the examples on the website or in the tests/ folder.

Please consider supporting PHPScraper with a star or sponsorship:

composer thanks

Thank you 💪
