Google News Scraper

A lightweight package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.

Installation

# Install via NPM
npm install google-news-scraper

# Install via Yarn
yarn add google-news-scraper

Usage

// Require the package
const googleNewsScraper = require('google-news-scraper')

// In CommonJS, await is only valid inside an async function, so wrap the call
// and pass a config object (further documentation below)
async function main() {
    const articles = await googleNewsScraper({
        searchTerm: "The Oscars",
        prettyURLs: false,
        queryVars: {
            hl: "en-US",
            gl: "US",
            ceid: "US:en"
        },
        timeframe: "5d",
        puppeteerArgs: []
    })
    console.log(articles)
}

main().catch(console.error)

Config

The config object passed to the function above has the following properties:

Search Term (required)

This is the search query you'd like to find articles for.

Query Vars (optional)

Additional query params to add to the URL.
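
To illustrate what "additional query params" means in practice, here is a minimal sketch that appends a queryVars object to a search URL using the standard URL API. The base URL and the q parameter name are assumptions made for illustration, and buildSearchUrl is a hypothetical helper rather than part of the package; the scraper may build its URL differently.

```javascript
// Hypothetical helper: shows how a queryVars object could map onto a query string.
// The base URL and the "q" parameter are assumptions, not taken from the package source.
function buildSearchUrl(searchTerm, queryVars = {}) {
  const url = new URL('https://news.google.com/search');
  url.searchParams.set('q', searchTerm);
  // Each key/value pair in queryVars becomes one URL-encoded query parameter.
  for (const [key, value] of Object.entries(queryVars)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

const exampleUrl = buildSearchUrl('The Oscars', { hl: 'en-US', gl: 'US', ceid: 'US:en' });
console.log(exampleUrl);
```

Values are URL-encoded automatically, so a ceid such as US:en does not need to be escaped by hand.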

Pretty URLs (required)

The URLs that Google News supplies for each article are "ugly" redirect links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen").

You can optionally ask the scraper to follow the redirect and retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html").

As you can imagine, this results in lots of additional HTTP requests, which negatively impact the scraper's performance. In testing, following redirects took around five times longer on average.

Timeframe (optional)

The results can be filtered to articles published within a given timeframe prior to the request. The default is 7 days.

Puppeteer Arguments (optional)

An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the --no-sandbox and --disable-setuid-sandbox flags, as explained in this SO answer.
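
As a rough sketch of the Heroku case, the snippet below only adds the two sandbox flags when the process appears to be running on a Heroku dyno. puppeteerArgsFor is a hypothetical helper, and detecting Heroku through the DYNO environment variable is an assumption; adjust the check to suit your deployment.

```javascript
// The two flags named above for Heroku deployments.
const HEROKU_FLAGS = ['--no-sandbox', '--disable-setuid-sandbox'];

// Hypothetical helper: returns the puppeteerArgs array for a given environment.
// Using the DYNO environment variable to detect Heroku is an assumption.
function puppeteerArgsFor(env = process.env) {
  return env.DYNO ? [...HEROKU_FLAGS] : [];
}

// Config object in the shape shown under Usage.
const herokuAwareConfig = {
  searchTerm: 'The Oscars',
  prettyURLs: false,
  puppeteerArgs: puppeteerArgsFor(),
};

console.log(herokuAwareConfig.puppeteerArgs);
```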

The timeframe value (see Timeframe above) is a string made up of a number followed by a letter representing the time operator. For example, 1y signifies 1 year. The full list of operators is below:

  • h = hours (eg: 12h)
  • d = days (eg: 7d)
  • m = months (eg: 6m)
  • y = years (eg: 1y)
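
The timeframe format described above can be checked before calling the scraper. The following is a minimal sketch; parseTimeframe is a hypothetical helper and not something the package exports, and the package may validate the value differently (or not at all).

```javascript
// Operators listed above, mapped to their meaning.
const TIMEFRAME_UNITS = { h: 'hours', d: 'days', m: 'months', y: 'years' };

// Hypothetical helper: splits a timeframe string such as "12h" into its parts
// and throws if it is not a number followed by one of h, d, m or y.
function parseTimeframe(timeframe) {
  const match = /^(\d+)([hdmy])$/.exec(timeframe);
  if (!match) {
    throw new Error(`Invalid timeframe "${timeframe}": expected a number followed by h, d, m or y`);
  }
  return { amount: Number(match[1]), unit: TIMEFRAME_UNITS[match[2]] };
}

console.log(parseTimeframe('12h')); // { amount: 12, unit: 'hours' }
console.log(parseTimeframe('1y'));  // { amount: 1, unit: 'years' }
```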

Output

The output is an array of JSON objects, with each article following the structure below:

[
    {
        "title": "Article title",
        "subtitle": "Article subtitle",
        "link": "http://url-to-website.com/path/to/article",
        "image": "http://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "time": "Time/date published (human-readable)"
    }
]
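
Because the result is a plain array, it can be processed with ordinary array code. As a small sketch, the hypothetical countBySource helper below tallies articles per publication using the source field from the structure above. The sample articles are made-up placeholders, not real scraper output.

```javascript
// Hypothetical helper: counts how many articles each publication contributed.
function countBySource(articles) {
  const counts = {};
  for (const article of articles) {
    counts[article.source] = (counts[article.source] || 0) + 1;
  }
  return counts;
}

// Placeholder data in the documented shape (not real results).
const sampleArticles = [
  { title: 'First article', subtitle: '', link: 'http://url-to-website.com/a', image: '', source: 'Publication A', time: '2 hours ago' },
  { title: 'Second article', subtitle: '', link: 'http://url-to-website.com/b', image: '', source: 'Publication B', time: '3 hours ago' },
  { title: 'Third article', subtitle: '', link: 'http://url-to-website.com/c', image: '', source: 'Publication A', time: 'Yesterday' },
];

console.log(countBySource(sampleArticles)); // { 'Publication A': 2, 'Publication B': 1 }
```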

Performance

My test query returned 104 results, which took 1.566 seconds without redirects, and 7.36 seconds with redirects. I'm on a fibre connection, and other queries may return a different number of results, so your mileage may vary.
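
For a sense of scale, the figures above can be turned into a slowdown factor and a per-article cost. This is only arithmetic on the single test run quoted here, not a benchmark.

```javascript
// Figures from the test run quoted above.
const resultCount = 104;
const secondsWithoutRedirects = 1.566;
const secondsWithRedirects = 7.36;

// How much slower the run was when following redirects.
const slowdown = secondsWithRedirects / secondsWithoutRedirects;

// Average time spent per article, in milliseconds.
const msPerArticleWithout = (secondsWithoutRedirects / resultCount) * 1000;
const msPerArticleWith = (secondsWithRedirects / resultCount) * 1000;

console.log(slowdown.toFixed(1));            // "4.7"
console.log(msPerArticleWithout.toFixed(1)); // "15.1"
console.log(msPerArticleWith.toFixed(1));    // "70.8"
```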

Upkeep

Please note that this is a web-scraper, which relies on DOM selectors, so any fundamental changes in the markup on the Google News site will probably break this tool. I'll try my best to keep it up-to-date, but changes to the markup on Google News will be silent and therefore difficult to keep track of. Feel free to submit an issue if the tool stops working.

Issues

Please report bugs via the issue tracker.

Contribute

Feel free to submit a PR if you've fixed an open issue. Thank you.

