
search-engines-scraper's Introduction

search_engines

A Python library that queries Google, Bing, Yahoo and other search engines and collects the results from multiple search engine results pages.
Please note that web-scraping may be against the TOS of some search engines, and may result in a temporary ban.

Supported search engines

Google
Bing
Yahoo
Duckduckgo
Startpage
Aol
Dogpile
Ask
Mojeek
Brave
Torch

Features

  • Creates output files (html, csv, json).
  • Supports search filters (url, title, text).
  • HTTP and SOCKS proxy support.
  • Collects dark web links with Torch.
  • Easy to add new search engines. You can add a new engine by creating a new class in search_engines/engines/ and adding it to the search_engines_dict dictionary in search_engines/engines/__init__.py. The new class should subclass SearchEngine and override the following methods: _selectors, _first_page, _next_page.
  • Python2 - Python3 compatible.
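The steps above can be sketched as follows. This is a hedged illustration, not a real engine: `SearchEngine` normally comes from search_engines/engine.py (a minimal stand-in is defined here so the sketch is self-contained), and "example-engine.com" and all CSS selectors are made-up placeholders.

```python
class SearchEngine:
    '''Minimal stand-in for search_engines.engine.SearchEngine, so this
    sketch runs on its own; the real base class lives in the library.'''
    def __init__(self, proxy=None, timeout=10):
        self._base_url = ''
        self._query = ''

class ExampleEngine(SearchEngine):
    '''Hypothetical engine for example-engine.com; all selectors below
    are placeholders, not a real site's markup.'''
    def __init__(self, proxy=None, timeout=10):
        super().__init__(proxy, timeout)
        self._base_url = 'https://www.example-engine.com'

    def _selectors(self, element):
        '''Map generic element names to this engine's CSS selectors.'''
        selectors = {
            'url': 'a.result-link',
            'title': 'a.result-link',
            'text': 'p.result-snippet',
            'links': 'div#results div.result',
            'next': 'a.next-page',
        }
        return selectors[element]

    def _first_page(self):
        '''Return the URL and POST data (none) of the first results page.'''
        url = '{}/search?q={}'.format(self._base_url, self._query)
        return {'url': url, 'data': None}

    def _next_page(self, tags):
        '''Return the URL of the next results page, or None when done.'''
        tag = tags.select_one(self._selectors('next'))
        url = self._base_url + tag['href'] if tag else None
        return {'url': url, 'data': None}
```

The new class would then be registered in search_engines_dict so the CLI can find it by name.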

Requirements

Python 2.7 - 3.x with
Requests and
BeautifulSoup

Installation

Run the setup file: $ python setup.py install.
Done!

Usage

As a library:

from search_engines import Google

engine = Google()
results = engine.search("my query")
links = results.links()

print(links)

As a CLI script:

$ python search_engines_cli.py -e google,bing -q "my query" -o json,print

search-engines-scraper's People

Contributors

csecht, hnrkcode, nikolasj5, tasos-py


search-engines-scraper's Issues

How to stop repeating results?

You said that if I use the "-i" flag, results will not be repeated, but that's not happening. Can you please guide me on how to use this feature?
Thank you!

Limit search results.

Is there any way I can limit the number of output results? I need only 10 Google search results, for example.
I used the code below, and it gave too many results...
from search_engines import Google

engine = Google()
results = engine.search("my query")
links = results.links()

Thanks in advance.
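One way to cap the output, assuming the snippet above: limit the fetch with the pages parameter (e.g. engine.search("my query", pages=1)) and then slice the returned list. A minimal sketch, with a placeholder list standing in for results.links():

```python
# Placeholder list standing in for results.links(); with the real library
# you would first limit the fetch with engine.search("my query", pages=1).
links = ['https://example.com/page{}'.format(i) for i in range(25)]

top_10 = links[:10]  # keep only the first 10 results
print(len(top_10))   # 10
```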

duplicate search results

When I run multiple search jobs, the CSV files for different queries contain duplicate results: the second search's output includes the results of the first search.
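Until this is fixed upstream, duplicates can also be stripped after the fact. A small order-preserving dedupe helper (stdlib only, example links are placeholders):

```python
def dedupe(links):
    '''Remove duplicate links while preserving their original order.'''
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique

print(dedupe(['https://a.com', 'https://b.com', 'https://a.com']))
# ['https://a.com', 'https://b.com']
```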

Bing can't get a result

The Google engine can be used successfully, while Bing can't.
I wrote test code like this:

engine = Bing()
results = engine.search("apple",pages=1)
links = results.links()
print(links)

However, the result is an empty list. I tested Google as well, it was fine. So I am really confused about this problem.
Thanks for your reply!

Qwant request

This tool has helped me a lot, I would like you to consider whether you can add Qwant, thank you

Host Search-Engines-Scraper on PyPI

Hi there

I'd like to use Search-Engines-Scraper on a web application that I'm working on.

To avoid having to upload the 233KB file with my web application, run setup.py, and redeploy the code with each update, will this project be hosted on PyPI anytime soon?

It'd be easier to run updates through the requirements.txt file.

Thanks

Can't search multiple pages while working with Google()

from search_engines import Google, Bing, Yahoo, Ask, Dogpile, Duckduckgo, Aol, Mojeek, Startpage

google = {}
action_words = ['affimative action']

def search_word(words):
    for word in words:
        engine = Google()
        results = engine.search(word, pages=10)
        links = results.links()
        num = 1
        for link in links:
            google[num] = link
            num = num + 1

search_word(action_words)

The output shows only the first 10 links. It would be really helpful if you can let me know what I am doing wrong. I would like to extract all the links in the first 10 pages.

I checked with bing and yahoo which are working fine and I am able to get all the links in the first 10 pages.

Duckduckgo and Startpage are not working

Bing empty results

I am getting empty results for Bing searches; is its scraping logic still up to date?

Feature request: return status code (and error msg)

First of all, many thanks for your work!
Currently, if a search engine returns a non-200 code, there is only a printed message. It would be great if the status code and error message were accessible, to detect bans (and other issues) by search engines.

page endpoint

Hello! Thank you for your amazing contribution!

I wanted to ask about using a proxy with a username & password.
How can I configure that?

Many thanks!


I got it. Just enter it like this: "http://user:[email protected]:3128/"
Please reply to the second question ;)
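For reference, a proxy URL with credentials can be built safely by percent-encoding the username and password, so characters like @ and : don't break the URL. The credentials and host/port below are placeholders:

```python
from urllib.parse import quote

user = 'user'                   # placeholder credentials
password = 'p@ss:word'
host, port = '10.0.0.1', 3128   # placeholder proxy host/port

# Percent-encode the credentials so '@' and ':' in them can't be
# confused with the URL's own separators.
proxy = 'http://{}:{}@{}:{}'.format(
    quote(user, safe=''), quote(password, safe=''), host, port)
print(proxy)  # http://user:p%40ss%3Aword@10.0.0.1:3128
```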

Proxy setting issue

I'm trying to use this great tool in my office. My company uses a proxy for browsers. I tried entering our server/port in the config file and ran into an SSL error.
Is there a way to disable verification?

Increasing search results

Sorry if this is a basic question, but how do I change the number of search results? When using Bing there are ~170 results, Google 10 results, Ask ~60 results; I'm not sure how to modify this.

Thanks

Add sleeping

Add sleeping to avoid lockouts (especially for Google).
BTW. Amazing tool! ^^
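Until such a feature lands, a pause can be added around the library's calls. A minimal sketch with a jittered delay; polite_search and its parameters are hypothetical names, and search would be e.g. Google().search:

```python
import random
import time

def polite_search(search, queries, base_delay=10.0):
    '''Run queries with a jittered pause between them to reduce the
    chance of a temporary ban. search is any callable taking a query
    string (e.g. Google().search).'''
    results = []
    for i, query in enumerate(queries):
        if i:  # no need to sleep before the first query
            time.sleep(base_delay + random.uniform(0, base_delay / 2))
        results.append(search(query))
    return results
```

Randomizing the delay (rather than sleeping a fixed interval) makes the request pattern look less mechanical.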

How does the filter argument work?

How can I filter to exclude two hosts (wikipedia.org and facebook.com)?

According to the docs, filtering is done via the -f argument.
'-f', filter results [url, title, text, host] is what I find in the script.

Since -o json will output JSON and is described as '-o', help='output file [html, csv, json]', I expected something along the lines of -f host REGEX, but that does not work.
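As a workaround, excluded hosts can be filtered out after the search with a few lines of stdlib code (the two domains below come from the question; the example links are placeholders):

```python
from urllib.parse import urlparse

EXCLUDED = ('wikipedia.org', 'facebook.com')

def exclude_hosts(links, excluded=EXCLUDED):
    '''Drop links whose host matches, or is a subdomain of, an
    excluded domain.'''
    kept = []
    for link in links:
        host = urlparse(link).netloc
        if not any(host == d or host.endswith('.' + d) for d in excluded):
            kept.append(link)
    return kept

links = ['https://en.wikipedia.org/wiki/Foo', 'https://example.com/bar']
print(exclude_hosts(links))  # ['https://example.com/bar']
```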

Filename argument in search_engines_cli.py

Hello, thank you very much for this amazing project!

I'm using search_engines_cli.py, and when the query text is too long it raises an error because of the file system's limit on filename length. I fixed it by adding an argument:

ap.add_argument('-n', help='filename for output file', default=cfg.OUTPUT_DIR+'output')

I also added a new requirement and an argument in engine.output to make it run correctly.

Making a pull request as suggestion.

Thank you o/

ERROR HTTP 429

Hi, your code is working great; however, I am facing a too-many-requests error (429) after running 3 or 4 queries. I changed proxies as well, but it still gives the same error. Is it related to user agents?
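Besides slowing down between queries, a common pattern for handling HTTP 429 is exponential backoff with jitter. A hedged sketch (with_backoff is a hypothetical helper; fetch is assumed to return an object with a status_code attribute, as this library's HTTP responses do):

```python
import random
import time

def with_backoff(fetch, max_retries=5, base=1.0):
    '''Retry fetch() with exponential backoff while it keeps
    returning HTTP 429. fetch is any zero-argument callable returning
    an object with a status_code attribute.'''
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait 1x, 2x, 4x, ... the base delay, plus random jitter.
        time.sleep(base * (2 ** attempt) + random.uniform(0, base))
        response = fetch()
    return response
```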

Yahoo title parsing improving

I noticed that the titles of Yahoo results are extracted incorrectly:

URL: https://gist.github.com/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a
Title: gist.github.com › soxoj › 9d65c2f4d3bec5dd25949197eamaigret.ipynb · GitHub

The title should be maigret.ipynb · GitHub

I made some fixes for this in my other project here.

Showing zero results on bing

My Code:

from search_engines import Bing
engine = Bing()
results = engine.search("lums")
links = results.links()
print(links)
# Searching Bing                                                                 
# []  

I get a zero-length list of links, whatever I search for.

DuckDuckGo scraper fails

Hi there!

I noticed an issue with the DuckDuckGo scraper. The response HTML doesn't contain any search results, only some Javascript, so the selectors don't match anything.

I created a temporary fix below (based on the Google scraper), which uses html.duckduckgo.com instead of duckduckgo.com.

Thanks!

output:

Searching Duckduckgo
Traceback (most recent call last):
  File "C:\Users\default\scrape.py", line 10, in <module>
    engine.search('test')
  File "C:\Users\default\search_engines\engine.py", line 162, in search
    response = self._get_page(request['url'], request['data'])
  File "C:\Users\default\search_engines\engines\duckduckgo.py", line 44, in _get_page
    response = self._http_client.get(page)
  File "C:\Users\default\search_engines\http_client.py", line 21, in get
    page = self._quote(page)
  File "C:\Users\default\search_engines\http_client.py", line 41, in _quote
    if utl.decode_bytes(utl.unquote_url(url)) == utl.decode_bytes(url):
  File "C:\Users\default\search_engines\utils.py", line 15, in unquote_url
    return decode_bytes(requests.utils.unquote(url))
  File "C:\Users\default\AppData\Local\Programs\Python\Python310\lib\urllib\parse.py", line 655, in unquote
    if '%' not in string:

fix:

from ..engine import SearchEngine
from ..config import PROXY, TIMEOUT, FAKE_USER_AGENT
from ..utils import unquote_url, quote_url

class Duckduckgo(SearchEngine):
    '''Searches duckduckgo.com'''
    def __init__(self, proxy=PROXY, timeout=TIMEOUT):
        super(Duckduckgo, self).__init__(proxy, timeout)
        self._base_url = u'https://html.duckduckgo.com'
        self._current_page = 1
        self.set_headers({'User-Agent':FAKE_USER_AGENT})

    def _selectors(self, element):
        '''Returns the appropriate CSS selector.'''
        selectors = {
            'url': 'a.result__a', 
            'title': 'a.result__a', 
            'text': 'a.result__snippet',
            'links': 'div#links div.result',
            'next': 'input[value="next"]'
        }
        return selectors[element]
    
    def _first_page(self):
        '''Returns the initial page and query.'''
        url = u'{}/html/?q={}'.format(self._base_url, quote_url(self._query, ''))
        return {'url':url, 'data':None}
    
    def _next_page(self, tags):
        '''Returns the next page URL and post data (if any)'''
        self._current_page += 1
        selector = self._selectors('next').format(page=self._current_page)
        next_page = self._get_tag_item(tags.select_one(selector), 'href')
        url = None
        if next_page:
            url = self._base_url + next_page
        return {'url':url, 'data':None}

    def _get_url(self, tag, item='href'):
        '''Returns the URL of search results item.'''
        selector = self._selectors('url')
        url = self._get_tag_item(tag.select_one(selector), item)

        if url.startswith(u'/url?q='):
            url = url.replace(u'/url?q=', u'').split(u'&sa=')[0]
        return unquote_url(url)

Startpage search not working?

Hello, while other engines do provide results, Startpage gives no results for even the simplest search terms. Can you please check whether it still works? Maybe something changed with Startpage in the meantime, so it no longer works.

Unsupported or invalid CSS selector

When searching with Google I get this error:

Searching Google

page: 1 links: 9
Traceback (most recent call last):
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38/scrapesearch.py", line 4, in <module>
    results = engine.search("my query")
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38\search_engines\engine.py", line 171, in search
    request = self._next_page(tags)
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38\search_engines\engines\google.py", line 36, in _next_page
    next_page = self._get_tag_item(tags.select_one(selector), 'href')
  File "C:\Users\pegag\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\element.py", line 1340, in select_one
    value = self.select(selector, limit=1)
  File "C:\Users\pegag\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\element.py", line 1476, in select
    raise ValueError(
ValueError: Unsupported or invalid CSS selector: "a[href][aria-label=Page 2]"

Google text attribute is empty

Many thanks again for your work, it works great. Recently I noticed that when searching with Google, the text field is empty for all results.

se = Google()
res = se.search("news", 1)
res.results()

[{'host': 'bbc.com',
  'link': 'https://www.bbc.com/news/world',
  'title': 'World - BBC Newshttps://www.bbc.com › news › world',
  'text': ''},
 {'host': 'edition.cnn.com',
  'link': 'https://edition.cnn.com/world',
  'title': 'World news – breaking news, videos and headlines - CNNhttps://edition.cnn.com › world',
  'text': ''},
 {'host': 'theguardian.com',
  'link': 'https://www.theguardian.com/world',
  'title': 'Latest news from around the world | The Guardianhttps://www.theguardian.com › world',
  'text': ''},
 {'host': 'hindustantimes.com',
  'link': 'https://www.hindustantimes.com/world-news',
  'title': 'World News, Latest World News, Breaking News and ...https://www.hindustantimes.com › World News',
  'text': ''},
 {'host': 'reuters.com',
  'link': 'https://www.reuters.com/news/archive/worldNews',
  'title': 'World News Headlines | Reutershttps://www.reuters.com › news › archive › worldNews',
  'text': ''},
 {'host': 'abcnews.go.com',
  'link': 'https://abcnews.go.com/International/',
  'title': 'International News | Latest World News, Videos & Photos ...https://abcnews.go.com › International',
  'text': ''},
 {'host': 'news.sky.com',
  'link': 'https://news.sky.com/world',
  'title': 'World News - Breaking international news and headlines | Sky ...https://news.sky.com › world',
  'text': ''},
 {'host': 'nytimes.com',
  'link': 'https://www.nytimes.com/section/world',
  'title': 'World News - The New York Timeshttps://www.nytimes.com › section › world',
  'text': ''},

How to stop the search

Sorry for asking a rudimentary question, but there is no real instruction or documentation.

After running the example, it keeps searching Google non-stop. How am I supposed to interrupt it or set other options?

engine = Google()
results = engine.search("my query")
links = results.links()

Error an ssl error occurred.

Hi, I like the code you wrote. Unfortunately, I am now experiencing the following error when scraping Bing:
ERROR An SSL error occurred.

Are there any search engines that don't ban their users for scraping?

Kind regards,
Mart

Feature request; toggle console output for searches

Hi there!

In engine.py there are three uses of out.console(...) within search(). Can this output be made toggleable? It should be relatively easy, and it would avoid console spam when using the search engine in e.g. a tqdm loop.

Bests
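As a stopgap until such a toggle exists, the printed output can be suppressed from the caller's side with the standard library. quiet is a hypothetical helper name; fn would be e.g. Google().search:

```python
import contextlib
import io

def quiet(fn, *args, **kwargs):
    '''Call fn while discarding anything it prints to stdout.'''
    with contextlib.redirect_stdout(io.StringIO()):
        return fn(*args, **kwargs)
```

For example, quiet(engine.search, "my query") would return the results without the "Searching ..." console messages.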

Startpage Ban Not Detected

Startpage.com doesn't send a 429 or another error status code; it just returns a different page instead of the results, so the ban isn't detected.

Can I use this project for image search?

Hi there,
Thank you for your contribution.
I have a question while using this script.
Can I use this script for image search?
For example, instead of a query string, can I pass the byte string of an image to perform an image search?
Kindly let me know what you think.
Thank you so much.

Possible fake data from Qwant API

Looks like all URLs from Qwant are fake for me:

Qwant results
1   http://ef.bf/teheuhi
2   http://cez.lr/di
3   http://veoca.la/gured
4   http://vamibif.th/unror
5   http://hicakda.hk/izwe
6   http://nofusive.mr/aco
7   http://ismuaka.pe/welagses
8   http://luho.mg/desler

Can you check if it is not only my problem?

Please support python 3.8

With Python 3.8.5 on Ubuntu, this scraper cannot be used:
Error:

Traceback (most recent call last):
  File "search.py", line 4, in <module>
    results = engine.search("Something")
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/engine.py", line 161, in search
    response = self._get_page(request['url'], request['data'])
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/engine.py", line 66, in _get_page
    return self._http_client.get(page)
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/http_client.py", line 21, in get
    page = self._quote(page)
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/http_client.py", line 41, in _quote
    if utl.decode_bytes(utl.unquote_url(url)) == utl.decode_bytes(url):
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/utils.py", line 15, in unquote_url
    return decode_bytes(requests.utils.unquote(url))
AttributeError: module 'requests.utils' has no attribute 'unquote'

Content of file search.py

from search_engines import Bing

engine = Bing()
results = engine.search("Something")
links = results.links()
print(links)
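The failing call is requests.utils.unquote, which newer requests releases no longer appear to re-export. The standard library's urllib.parse.unquote behaves the same way for this purpose and could be substituted in search_engines/utils.py (a local patch, not an official fix):

```python
# Drop-in replacement for the missing requests.utils.unquote.
from urllib.parse import unquote

print(unquote('my%20query'))  # my query
```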

many queries issue

When I run multiple queries in a loop, the results of each query contain the results of the previous one.
How can I clear the previous results?
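One way to avoid the carry-over is to construct a fresh engine instance per query, so no state accumulates between searches. A sketch (search_many is a hypothetical helper; make_engine would be e.g. the Google class itself, or any zero-argument factory):

```python
def search_many(make_engine, queries):
    '''Run each query on a fresh engine instance so earlier results
    don't carry over. make_engine is any zero-argument factory whose
    product has a search() method returning an object with links().'''
    return {q: make_engine().search(q).links() for q in queries}
```

With the real library this would be called as search_many(Google, ["query one", "query two"]).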

Results appear different than the browser

If I search for something on DuckDuckGo in the browser in Incognito mode, I get one list of results. When running this scraper, I get a different set of results. Is there an explanation for this?

Thanks!

404 SERP links

Hi,

I am not sure if this is the right place to post this question, so my apologies in advance. I am scraping Google and getting the top-n link results for a collection of queries. Now I am trying to request those links to scrape the resulting pages. However, sometimes the links are broken (404 error). Isn't the search engine supposed to filter out broken links?

On the other hand, sometimes my request gives me a 404 Client Error even though the webpage actually exists. Could anyone give me some guidance on this?

Thank you very much.
Best,
Marcos

list proposal

Would it be possible to add a list option instead of a single query? This could make it easier to automate some cases.
Thank you

Different results between the search engine scraper and google

Hello,

Thanks for the outstanding library!

I recently faced an issue with different results when using the scraper.

Search query is PROCTER & GAMBLE CO sustainability report.
From a Google web query, I get the results shown in the attached screenshot.

However, when I use scraper,

from search_engines import Google

query = 'PROCTER & GAMBLE CO sustainability report'
engine = Google()
results = engine.search(query, 1)
links = results.links()

The output links are:

https://us.pg.com/ 
https://twitter.com/ProcterGamble?ref_src=twsrc^google|twcamp^serp|twgr^author 
https://en.wikipedia.org/wiki/Procter_&_Gamble 
https://www.pgcareers.com/ 
https://www.linkedin.com/company/procter-and-gamble 
https://www.facebook.com/proctergamble/ 
https://pginvestor.com/ 

May I know why this happens? How can I get consistent results?

Many thanks!

Locale Handling

Great Project, thanks a ton!
Wonder how to set the locale, e.g. en-US, given a search engine object?

What I currently do is set
self.session.headers.update({"Accept-Language": "en-US"})

in the http_client, e.g. right here.

Is there any other or maybe preferred way?

Difficulties installing on Macbook Air macOS Big Sur M1 Chip

I recently switched to a Mac machine and am having trouble installing the library here.

  • MacBook Air
  • macOS Big Sur
  • M1 Chip

Maybe anyone can help out?

running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 2] No such file or directory: '/Library/Python/3.8/site-packages/test-easy-install-41837.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /Library/Python/3.8/site-packages/

This directory does not currently exist.  Please create it and try again, or
choose a different installation directory (using the -d or --install-dir
option).
