Scraper

Node.js-based scraper using headless Chrome

Installation

$ npm install @jonstuebe/scraper

Features

  • Scrape top ecommerce sites (Amazon, Walmart, Target)
  • Return basic product information (title, price, image, description)
  • Easy to use API to scrape any website

API

Require (or import) the package and pass it a URL. The scraper returns a promise that resolves to the scraped data, so use it with await or .then().

CommonJS (require)

const Scraper = require("@jonstuebe/scraper");

// run inside of an async function
(async () => {
  const data = await Scraper.scrapeAndDetect(
    "http://www.amazon.com/gp/product/B00X4WHP5E/"
  );
  console.log(data);
})();

es6

import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const data = await Scraper("http://www.amazon.com/gp/product/B00X4WHP5E/");
  console.log(data);
})();

with promises

import Scraper from "@jonstuebe/scraper";

Scraper("http://www.amazon.com/gp/product/B00X4WHP5E/").then(data => {
  console.log(data);
});

shared scraper instance

If you plan to run the scraper several times in succession, it's recommended to share a single Chromium instance across all sequential or parallel scrapes.

import puppeteer from "puppeteer";
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const browser = await puppeteer.launch();
  let products = [
    "https://www.target.com/p/corinna-angle-leg-side-table-wood-threshold-8482/-/A-53496420",
    "https://www.target.com/p/glasgow-metal-end-table-black-project-62-8482/-/A-52343433"
  ];

  let productsData = [];
  for (const product of products) {
    const productData = await Scraper(product, browser);
    productsData.push(productData);
  }

  // Close the browser when done; otherwise the Chromium instances keep running in the background.
  await browser.close();

  console.table(productsData);
})();
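
The loop above scrapes one URL at a time. For the parallel case, one possible approach is sketched below. The scrapeAll helper and its scrapeFn parameter are hypothetical names, not part of the package; scrapeFn stands in for Scraper and is assumed to accept (url, browser) as in the example above.

```javascript
// Hypothetical helper (not part of @jonstuebe/scraper): scrape several URLs
// in parallel against one shared browser instance.
// `scrapeFn` is assumed to have the shape (url, browser) => Promise<data>.
async function scrapeAll(urls, browser, scrapeFn) {
  // Promise.allSettled waits for every scrape, even if some of them fail.
  const settled = await Promise.allSettled(
    urls.map(url => scrapeFn(url, browser))
  );

  // Pair each URL with its data or its error message, in input order.
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { url: urls[i], data: result.value }
      : { url: urls[i], error: String(result.reason) }
  );
}

// Usage sketch (assumes puppeteer and the package are installed):
//   const browser = await puppeteer.launch();
//   try {
//     console.table(await scrapeAll(products, browser, Scraper));
//   } finally {
//     await browser.close();
//   }
```

Opening many pages at once can use a lot of memory, so for long URL lists it may be worth processing them in smaller batches.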

emulate devices

To emulate a device, pass a puppeteer device descriptor as the third argument:

import puppeteer from "puppeteer";
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const data = await Scraper(
    "http://www.amazon.com/gp/product/B00X4WHP5E/",
    null,
    puppeteer.devices["iPhone SE"]
  );
  console.log(data);
})();
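
Where the device map lives depends on the installed Puppeteer version. The sketch below is a hypothetical helper, not part of this package, that looks a device up by name. It assumes newer Puppeteer releases expose the map as KnownDevices while older ones expose it as devices, as in the example above.

```javascript
// Hypothetical helper: look up a device descriptor by name.
// Assumption: newer Puppeteer releases expose the device map as `KnownDevices`,
// while older ones (as in the example above) expose it as `devices`.
function resolveDevice(puppeteerModule, name) {
  const devices = puppeteerModule.KnownDevices || puppeteerModule.devices || {};
  const device = devices[name];
  if (!device) {
    // Fail early with a clear message instead of passing undefined along.
    throw new Error(`Unknown device: ${name}`);
  }
  return device;
}

// Usage sketch (assumes puppeteer and the package are installed):
//   const device = resolveDevice(require("puppeteer"), "iPhone SE");
//   const data = await Scraper(url, null, device);
```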

custom scrapers

To scrape a site that isn't supported out of the box, define a site object with a name, a list of hosts, and an async scrape function that receives the page, then pass it to Scraper.scrape along with the URL.

const Scraper = require("@jonstuebe/scraper");

(async () => {
  const site = {
    name: "npm",
    hosts: ["www.npmjs.com"],
    scrape: async page => {
      const name = await Scraper.getText("div.content-column > h1 > a", page);
      const version = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li:nth-child(2) > strong",
        page
      );
      const author = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li.last-publisher > a > span",
        page
      );

      return {
        name,
        version,
        author
      };
    }
  };

  const data = await Scraper.scrape(
    "https://www.npmjs.com/package/lodash",
    site
  );
  console.log(data);
})();
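
This README does not show how the package uses the hosts field internally. One plausible reading is that a site definition applies when the URL's hostname appears in its hosts list. The findSite helper below is a hypothetical sketch of that matching step, useful if you keep several custom site definitions and want to pick one yourself before calling Scraper.scrape.

```javascript
// Hypothetical helper (not part of @jonstuebe/scraper): pick the site
// definition whose `hosts` list contains the hostname of the given URL.
// Returns undefined when no definition matches.
function findSite(url, sites) {
  const { hostname } = new URL(url); // URL is a global in modern Node.js
  return sites.find(site => site.hosts.includes(hostname));
}

// Usage sketch (assumes `site` is the object defined in the example above):
//   const match = findSite("https://www.npmjs.com/package/lodash", [site]);
//   if (match) {
//     const data = await Scraper.scrape("https://www.npmjs.com/package/lodash", match);
//   }
```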

Contributing

If you want to add a site or have an idea for a feature, fork this repo and send me a pull request. I'll be happy to take a look when I can and get back to you.

Issues

For any issues or bugs, please post a description and a code sample that reproduces the problem on the issues page.

License

MIT
