docker-puppeteer-api's Introduction

Puppeteer API

Headless Chrome scraper API based on puppeteer.

This image contains:

  1. Google Chrome running in headless mode.
  2. An ExpressJS application exposing an HTTP API that returns the contents of a website using puppeteer and headless Chrome.
  3. An alternative command line interface (CLI) for the above.

Features:

  1. Fetching the content a real browser generates for a URL once the load event fires.
  2. Fetching the content a real browser generates for a URL once a given element appears in the DOM. This is especially useful when the page loads dynamic data through Ajax and inserts it into the DOM after the load event.
  3. A simple REST API endpoint with hash-based security.
  4. A CLI to get the data with a single command.

How to use

With REST API

Start the docker container using the following command:

docker run -d -p 8000:8000 --restart unless-stopped --name puppeteer-api -e "SALT=abcdef" l0coful/puppeteer-api

Where:

  1. Container port 8000 is bound to host port 8000, making the API accessible at http://localhost:8000 (and from other hosts on the network).
  2. The SALT environment variable should be set to a random string; it is used to sign requests.
  3. Alternatively, set the SALT_FILE environment variable to the location of a file to read the salt value from.
  4. To use a specific User-Agent string, provide it in the USER_AGENT environment variable.
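
As an illustration of items 2 and 3, the sketch below generates a random salt, stores it in a file, and starts the container with SALT_FILE pointing at a read-only mount of that file. The helper names, the host path, and the in-container path /run/secrets/salt are assumptions made for this example. The README does not say whether a trailing newline in the file is trimmed, so the file is written without one.

```shell
#!/bin/sh
# Hypothetical helper: write a 32-character hex salt to the given file,
# without a trailing newline.
make_salt_file() {
    salt_file=$1
    # 16 random bytes rendered as 32 hex characters.
    head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$salt_file"
}

# Hypothetical helper: start the container reading the salt from a file.
# /run/secrets/salt is an arbitrary in-container path chosen for this sketch.
start_puppeteer_api() {
    salt_file=$1
    docker run -d -p 8000:8000 --restart unless-stopped --name puppeteer-api \
        -v "$salt_file:/run/secrets/salt:ro" \
        -e "SALT_FILE=/run/secrets/salt" \
        l0coful/puppeteer-api
}

# Usage (the bind mount needs an absolute host path):
#   make_salt_file "$PWD/puppeteer-salt"
#   start_puppeteer_api "$PWD/puppeteer-salt"
```

Clients then need the same salt value to sign their requests, so the file has to be readable by whatever computes the hashes.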

You can check that everything is OK by inspecting the container logs:

$ docker logs puppeteer-api
[...]
Using 'abcdef' as salt
[local] Scraper API version 2.0.0 is listening on port: 8000
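
The startup check can also be scripted. The sketch below polls a log-producing command until the "is listening on port" line from the sample output above appears. The function name wait_for_listening, the attempt count, and the one-second delay are assumptions made for this example, and the match relies on the log wording staying as shown.

```shell
#!/bin/sh
# Hypothetical helper: run the given log command repeatedly until its output
# contains the "is listening on port" line, or give up after ATTEMPTS tries.
# Usage: wait_for_listening ATTEMPTS COMMAND [ARGS...]
wait_for_listening() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        # Merge stderr into stdout so the line is found on either stream.
        if "$@" 2>&1 | grep -q 'is listening on port'; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Pause between attempts, but not after the last one.
        if [ "$attempts" -gt 0 ]; then sleep 1; fi
    done
    return 1
}

# Usage:
#   wait_for_listening 30 docker logs puppeteer-api && echo "API is up"
```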

Fetching URL content

TL;DR

POST the following JSON to http://localhost:8000/scrape:

{
	"url": "http://example.com",
	"selector": "h1",
	"hash": "129f2756eac7b62b5b7f428175e5a4e3"
}

Where:

  1. url is the URL to fetch.
  2. selector is an optional CSS selector. If provided, the content is returned only after this selector matches at least one element. Otherwise the content is returned on the page load event.
  3. hash is the request signature, computed as md5(`${url}:${SALT}`).
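
A minimal shell sketch of the request format described above: it derives hash from url and the salt with md5sum, then assembles the JSON body. The function names sign_url, json_escape, and build_payload are hypothetical, and the escaping covers only backslashes and double quotes, which is assumed to be enough for typical URLs and CSS selectors.

```shell
#!/bin/sh
# Hypothetical helper: signature = md5("<url>:<salt>"), printed as hex.
sign_url() {
    printf '%s:%s' "$1" "$2" | md5sum | cut -d ' ' -f 1
}

# Hypothetical helper: escape backslashes and double quotes for a JSON string.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print the request body for URL, SALT and an
# optional SELECTOR.
build_payload() {
    url=$1 salt=$2 selector=${3:-}
    hash=$(sign_url "$url" "$salt")
    if [ -n "$selector" ]; then
        printf '{"url":"%s","selector":"%s","hash":"%s"}\n' \
            "$(json_escape "$url")" "$(json_escape "$selector")" "$hash"
    else
        printf '{"url":"%s","hash":"%s"}\n' "$(json_escape "$url")" "$hash"
    fi
}

# Usage:
#   build_payload "http://example.com" "abcdef" "h1"
```

Note that the signature covers only the URL and the salt, so the same hash is valid with or without a selector.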

On page load

Without the API, you can fetch the DOM untouched by JavaScript using curl:

$ curl -s http://example.com | grep "h1"
<h1>Example Domain</h1>

To do the same using this API, first calculate the hash for the URL you want to fetch, using the SALT configured earlier:

$ echo -n "http://example.com:abcdef" | md5sum
129f2756eac7b62b5b7f428175e5a4e3 -

With this signature you can now call the /fetch API endpoint for the URL:

$ curl \
	-s \
	-X POST \
	-H "Content-Type: application/json" \
	-d '{"url": "http://example.com","hash":"129f2756eac7b62b5b7f428175e5a4e3"}' \
	http://localhost:8000/fetch \
| grep "h1"

<h1>Example Domain</h1>

In the logs you can see that the HTML was returned after the load event:

$ docker logs puppeteer-api
[...]
[68825600] requesting from: ::ffff:172.17.0.1 to fetch: http://example.com
[68825600] starting chrome browser
[68825600] going to: http://example.com
[68825600] page loaded, resolving content immediately
[68825600] closing chrome browser
[68825600] sending data with: 1262 bytes

On element appearance in DOM

In the second mode the API returns the scraped content only after an element matching the given CSS selector appears in the DOM. This is done with the /scrape API endpoint (here for the h1 selector):

$ curl \
	-s \
	-X POST \
	-H "Content-Type: application/json" \
	-d '{"url": "http://example.com","selector":"h1","hash":"129f2756eac7b62b5b7f428175e5a4e3"}' \
	http://localhost:8000/scrape 

<h1>Example Domain</h1>

Related logs:

$ docker logs puppeteer-api
[51b45bd7] requesting from: ::ffff:172.17.0.1 to fetch: http://example.com
[51b45bd7] starting chrome browser
[51b45bd7] going to: http://example.com
[51b45bd7] page loaded, setting 1000 ms refresh interval
[51b45bd7] element with selector: 'h1' appeared, resolving content
[51b45bd7] clearing refresh interval
[51b45bd7] closing chrome browser
[51b45bd7] sending data with: 1262 bytes
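
Because the second mode waits until the selector matches, a scripted client may want its own time limit and a non-zero exit status on HTTP errors. The wrapper below is a sketch under stated assumptions: the name api_post and the API_BASE, TIMEOUT, and CURL variables are hypothetical, and whether the server enforces a timeout of its own is not stated in this README. It adds curl's -f and --max-time options to the invocation shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the curl calls shown above.
# Usage: api_post ENDPOINT JSON_BODY
# Assumed variables (all optional):
#   API_BASE  base URL of the API       (default http://localhost:8000)
#   TIMEOUT   client-side limit, in s   (default 60)
#   CURL      command to run as curl    (default curl; useful for dry runs)
api_post() {
    endpoint=$1 body=$2
    # -f makes curl exit non-zero on HTTP errors (status 400 and above);
    # --max-time aborts the request if no response arrives in time.
    "${CURL:-curl}" -s -f --max-time "${TIMEOUT:-60}" \
        -X POST \
        -H "Content-Type: application/json" \
        -d "$body" \
        "${API_BASE:-http://localhost:8000}/$endpoint"
}

# Usage:
#   api_post scrape '{"url": "http://example.com","selector":"h1","hash":"129f2756eac7b62b5b7f428175e5a4e3"}' \
#       || echo "request failed or timed out" >&2
```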

With command line

The same operations can be run from the command line on the host OS using a temporary docker container.

On page load:

$ docker run --rm -it --entrypoint "/bin/bash" l0coful/puppeteer-api puppeteer fetch http://example.com | grep "h1"
<h1>Example Domain</h1>

On h1 element appearance in DOM:

$ docker run --rm -it --entrypoint "/bin/bash" l0coful/puppeteer-api puppeteer scrape -s "h1" http://example.com
<h1>Example Domain</h1>

How to build

docker build --tag l0coful/puppeteer-api .

docker-puppeteer-api's Issues

Questions

Hi, thanks for this project. Need something like this for internal analysis. I got a few questions:

  • Multiple instances supported concurrently?
  • Proxy supported?

Thanks

image support?

It would be super cool if there could be an API for getting a thumbnail or screenshot of a page.

Many selectors

Hi, I have a question: how can I scrape, for example, three elements of a website? I tried the h1 sample and everything worked fine, but how could I get, for example, a specific h1, a span, and an h3?
I always get the second selector, but not the first one. Btw, congrats on this project, it's so useful.