phpscraper's Introduction

PHP Scraper is a scraper library for PHP, built with simplicity in mind. The main goal is to get things done without getting distracted by XPath selectors, preparing data structures, and so on. You simply "go to a website" and get an array with all the details relevant to your scraping project.

Under the hood, it uses Goutte and a few other packages. See composer.json.

Sponsors

Want to sponsor this project? Contact me.

Examples

Here are a few examples of how the library works. More examples are on the project website.

Basics: Get the Title of a Website

All scraping functionality can be accessed either as a method call or as a property. For example, the title can be accessed in two ways:

// Prep
$web = new \spekulatius\phpscraper;

$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
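The dual access style shown above can be pictured with PHP's `__get` magic method. The following is only a minimal sketch under assumptions: `DualAccessPage` is a hypothetical stand-in class, not PHPScraper's actual implementation, and the regular-expression title lookup is a simplification used to keep the example self-contained (it needs PHP 8.0 or later).

```php
<?php
// Hypothetical stand-in class; NOT PHPScraper's actual implementation.
final class DualAccessPage
{
    public function __construct(private string $html)
    {
    }

    // Method-style access: $page->title()
    public function title(): ?string
    {
        // Simplified lookup for illustration only.
        if (preg_match('~<title>(.*?)</title>~is', $this->html, $matches) === 1) {
            return trim($matches[1]);
        }

        return null;
    }

    // Property-style access: $page->title is forwarded to title().
    public function __get(string $name): mixed
    {
        if (method_exists($this, $name)) {
            return $this->$name();
        }

        throw new LogicException("Unknown property: {$name}");
    }
}

$page = new DualAccessPage('<html><head><title>Google</title></head></html>');

echo $page->title, PHP_EOL;   // property-style access
echo $page->title(), PHP_EOL; // method-style access
```

Both lines print the same value because the property access is routed to the method of the same name.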

Links

The following example shows how to collect links along with meta information:

$web = new \spekulatius\phpscraper;

/**
 * Navigate to the test page. This page contains several links with different rel attributes. To save space, only the first one is shown:
 *
 * <a href="https://placekitten.com/432/287" rel="nofollow">external kitten</a>
 */
$web->go('https://test-pages.phpscraper.de/links/rel.html');

// Get the first link on the page.
$firstLink = $web->linksWithDetails[0];

/**
 * $firstLink now contains:
 *
 * [
 *     'url' => 'https://placekitten.com/432/287',
 *     'protocol' => 'https',
 *     'text' => 'external kitten',
 *     'title' => null,
 *     'target' => null,
 *     'rel' => 'nofollow',
 *     'isNofollow' => true,
 *     'isUGC' => false,
 *     'isNoopener' => false,
 *     'isNoreferrer' => false,
 * ]
 */

If there aren't any matching elements (here links) on the page, an empty array will be returned.
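Because an empty array is returned when nothing matches, indexing `[0]` directly is only safe when the page is known to contain links. The sketch below shows one way to guard against that. It works on sample data in the shape shown above rather than on a live page, and `firstLinkOrNull` and `nofollowUrls` are hypothetical helper names, not part of the library.

```php
<?php
// Sample data in the shape shown above; in real use this would
// come from $web->linksWithDetails.
$links = [
    [
        'url' => 'https://placekitten.com/432/287',
        'protocol' => 'https',
        'text' => 'external kitten',
        'title' => null,
        'target' => null,
        'rel' => 'nofollow',
        'isNofollow' => true,
        'isUGC' => false,
        'isNoopener' => false,
        'isNoreferrer' => false,
    ],
];

// Hypothetical helper: the first link, or null when the list is empty.
function firstLinkOrNull(array $links): ?array
{
    return $links[0] ?? null;
}

// Hypothetical helper: the URLs of all links flagged as nofollow.
function nofollowUrls(array $links): array
{
    $nofollow = array_filter(
        $links,
        fn (array $link): bool => $link['isNofollow'] === true
    );

    return array_values(array_column($nofollow, 'url'));
}

var_dump(firstLinkOrNull($links)['text'] ?? null);
var_dump(nofollowUrls($links));

// With an empty result, both helpers degrade gracefully.
var_dump(firstLinkOrNull([]));
var_dump(nofollowUrls([]));
```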

Scrape the Images from a Website

The following example scrapes the images along with the attributes of their img tags:

// Prep
$web = new \spekulatius\phpscraper;

/**
 * Navigate to the test page.
 *
 * This page contains the image "cat.jpg" twice:
 * once with a relative path and once with an absolute path.
 */
$web->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html');

var_dump($web->imagesWithDetails);
/**
 * Contains:
 *
 * [
 *     [
 *         'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *         'alt' => 'absolute path',
 *         'width' => null,
 *         'height' => null,
 *     ],
 *     [
 *         'url' => 'https://test-pages.phpscraper.de/assets/cat.jpg',
 *         'alt' => 'relative path',
 *         'width' => null,
 *         'height' => null,
 *     ],
 * ]
 */

Proxy Support

You can configure proxy support with setConfig:

$web->setConfig(['proxy' => 'http://user:[email protected]:3128']);

Timeout

You can set the timeout using setConfig:

$web->setConfig(['timeout' => 15]);

Setting the timeout to zero will disable it.

Disabling SSL

Although it is not recommended, you may need to disable SSL checks. You can do so using:

$web->setConfig(['disable_ssl' => true]);

You can call setConfig multiple times. It stores the config and merges it with the previous settings. Keep this in mind in the unlikely case that you need to unset a value: a setting stays in place until a later call overrides it.
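The merging behaviour can be pictured with a plain `array_merge`. This is a sketch under assumptions: `ConfigHolder` is a hypothetical stand-in class, and the exact merge strategy PHPScraper uses internally is not confirmed here. Only the option names `timeout` and `disable_ssl` are taken from the examples above.

```php
<?php
// Hypothetical stand-in for the config handling; the real merge
// strategy inside the library is an assumption.
final class ConfigHolder
{
    /** @var array<string, mixed> */
    private array $config = [];

    public function setConfig(array $options): self
    {
        // Later calls override matching keys; untouched keys are kept.
        $this->config = array_merge($this->config, $options);

        return $this;
    }

    public function getConfig(): array
    {
        return $this->config;
    }
}

$holder = new ConfigHolder();
$holder->setConfig(['timeout' => 15]);
$holder->setConfig(['disable_ssl' => true]); // 'timeout' is still set
$holder->setConfig(['timeout' => 0]);        // overrides the earlier timeout

print_r($holder->getConfig());
```

In this model, a key set in an earlier call never disappears on its own. It has to be overridden explicitly, which is the caveat about unsetting values.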

See the full documentation on the website for more information and many more examples.

Installation

Install PHPScraper with Composer:

composer require spekulatius/phpscraper

After installation, the Composer autoloader picks up the package. If you are using a typical PHP application or a framework such as Laravel or Symfony, you can start scraping right away. Try any of the examples on the website or in the tests/ folder.

Please consider supporting PHPScraper with a star or sponsorship:

composer thanks

Thank you 💪

phpscraper's People

Contributors

spekulatius, dependabot[bot], nathabonfim59, imgbotapp, datlechin, tacman, vitormattos, fumiya5863
