
search-engines-scraper's Introduction

search_engines

A Python library that queries Google, Bing, Yahoo and other search engines and collects the results from multiple search engine results pages.
Please note that web-scraping may be against the TOS of some search engines, and may result in a temporary ban.

Supported search engines

Google
Bing
Yahoo
Duckduckgo
Startpage
Aol
Dogpile
Ask
Mojeek
Brave
Torch

Features

  • Creates output files (html, csv, json).
  • Supports search filters (url, title, text).
  • HTTP and SOCKS proxy support.
  • Collects dark web links with Torch.
  • Easy to add new search engines. You can add a new engine by creating a new class in search_engines/engines/ and adding it to the search_engines_dict dictionary in search_engines/engines/__init__.py. The new class should subclass SearchEngine and override the following methods: _selectors, _first_page, _next_page.
  • Python2 - Python3 compatible.
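The steps above can be sketched as follows. This is a hedged illustration, not a real engine: `SearchEngine` normally comes from search_engines/engine.py (a minimal stand-in is defined here so the sketch is self-contained), and "example-engine.com" and all CSS selectors are made-up placeholders.

```python
class SearchEngine:
    '''Minimal stand-in for search_engines.engine.SearchEngine, so this
    sketch runs on its own; the real base class lives in the library.'''
    def __init__(self, proxy=None, timeout=10):
        self._base_url = ''
        self._query = ''

class ExampleEngine(SearchEngine):
    '''Hypothetical engine for example-engine.com; all selectors below
    are placeholders, not a real site's markup.'''
    def __init__(self, proxy=None, timeout=10):
        super().__init__(proxy, timeout)
        self._base_url = 'https://www.example-engine.com'

    def _selectors(self, element):
        '''Map generic element names to this engine's CSS selectors.'''
        selectors = {
            'url': 'a.result-link',
            'title': 'a.result-link',
            'text': 'p.result-snippet',
            'links': 'div#results div.result',
            'next': 'a.next-page',
        }
        return selectors[element]

    def _first_page(self):
        '''Return the URL and POST data (none) of the first results page.'''
        url = '{}/search?q={}'.format(self._base_url, self._query)
        return {'url': url, 'data': None}

    def _next_page(self, tags):
        '''Return the URL of the next results page, or None when done.'''
        tag = tags.select_one(self._selectors('next'))
        url = self._base_url + tag['href'] if tag else None
        return {'url': url, 'data': None}
```

The new class would then be registered in search_engines_dict so the CLI can find it by name.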

Requirements

Python 2.7 - 3.x with
Requests and
BeautifulSoup

Installation

Run the setup file: $ python setup.py install.
Done!

Usage

As a library:

from search_engines import Google

engine = Google()
results = engine.search("my query")
links = results.links()

print(links)

As a CLI script:

$ python search_engines_cli.py -e google,bing -q "my query" -o json,print

search-engines-scraper's People

Contributors

csecht, hnrkcode, nikolasj5, tasos-py


search-engines-scraper's Issues

How to stop repeating results?

You said that if I use the "-i" flag, results will not be repeated, but that's not happening. Can you please guide me on how to use this feature?
Thank you!

Limit search results.

Is there any way I can limit the number of output results? I need only 10 Google search results, for example.
I used the code below, and it gave too many results...
from search_engines import Google

engine = Google()
results = engine.search("my query")
links = results.links()

Thanks in advance.
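One way to cap the output, assuming the snippet above: limit the fetch with the pages parameter (e.g. engine.search("my query", pages=1)) and then slice the returned list. A minimal sketch, with a placeholder list standing in for results.links():

```python
# Placeholder list standing in for results.links(); with the real library
# you would first limit the fetch with engine.search("my query", pages=1).
links = ['https://example.com/page{}'.format(i) for i in range(25)]

top_10 = links[:10]  # keep only the first 10 results
print(len(top_10))   # 10
```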

duplicate search results

When I run multiple search jobs, the CSV files for different queries contain duplicate results: the second search's output includes the results of the first search.
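Until this is fixed upstream, duplicates can also be stripped after the fact. A small order-preserving dedupe helper (stdlib only, example links are placeholders):

```python
def dedupe(links):
    '''Remove duplicate links while preserving their original order.'''
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique

print(dedupe(['https://a.com', 'https://b.com', 'https://a.com']))
# ['https://a.com', 'https://b.com']
```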

Bing can't get a result

The Google engine can be used successfully, while Bing can't.
I wrote test code like this:

engine = Bing()
results = engine.search("apple",pages=1)
links = results.links()
print(links)

However, the result is an empty list. I tested Google as well, it was fine. So I am really confused about this problem.
Thanks for your reply!

Qwant request

This tool has helped me a lot, I would like you to consider whether you can add Qwant, thank you

Host Search-Engines-Scraper on PyPI

Hi there

I'd like to use Search-Engines-Scraper on a web application that I'm working on.

To avoid having to upload the 233KB file with my web application, run setup.py, and redeploy the code with each update, will this project be hosted on PyPI anytime soon?

It'd be easier to run updates through the requirements.txt file.

Thanks

Can't search multiple pages while working with Google()

from search_engines import Google, Bing, Yahoo, Ask, Dogpile, Duckduckgo, Aol, Mojeek, Startpage

google = {}
action_words = ['affimative action']

def search_word(words):
    for word in words:
        engine = Google()
        results = engine.search(word, pages=10)
        links = results.links()
        num = 1
        for link in links:
            google[num] = link
            num = num + 1

search_word(action_words)

The output shows only the first 10 links. It would be really helpful if you can let me know what I am doing wrong. I would like to extract all the links in the first 10 pages.

I checked with bing and yahoo which are working fine and I am able to get all the links in the first 10 pages.

Duckduckgo and Startpage are not working

Bing empty results

I am getting empty results for Bing searches; is its scraping logic still up to date?

Feature request: return status code (and error msg)

First of all, many thanks for your work!
Currently, if a search engine returns a non-200 code, there is only a printed message. It would be great if the status code and error message were accessible, to detect bans (and other issues) by search engines.

page endpoint

Hello! Thank you for your amazing contribution!

I wanted to ask about using a proxy with a username & password.
How can I configure that?

Many thanks!


I got it. Just enter it like this: "http://user:[email protected]:3128/"
Please reply to the second question ;)
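For reference, a proxy URL with credentials can be built safely by percent-encoding the username and password, so characters like @ and : don't break the URL. The credentials and host/port below are placeholders:

```python
from urllib.parse import quote

user = 'user'                   # placeholder credentials
password = 'p@ss:word'
host, port = '10.0.0.1', 3128   # placeholder proxy host/port

# Percent-encode the credentials so '@' and ':' in them can't be
# confused with the URL's own separators.
proxy = 'http://{}:{}@{}:{}'.format(
    quote(user, safe=''), quote(password, safe=''), host, port)
print(proxy)  # http://user:p%40ss%3Aword@10.0.0.1:3128
```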

Proxy setting issue

I'm trying to use this great tool in my office. My company uses a proxy for browsers. I tried entering our server/port in the config file and ran into an SSL error.
Is there a way to disable verification?

Increasing search results

Sorry if this is a basic question, but how do I change the number of search results? When using Bing there are ~170 results, Google 10 results, Ask ~60 results; I'm not sure how to modify this.

Thanks

Add sleeping

Add sleeping to avoid lockouts (especially for Google).
BTW. Amazing tool! ^^
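Until such a feature lands, a pause can be added around the library's calls. A minimal sketch with a jittered delay; polite_search and its parameters are hypothetical names, and search would be e.g. Google().search:

```python
import random
import time

def polite_search(search, queries, base_delay=10.0):
    '''Run queries with a jittered pause between them to reduce the
    chance of a temporary ban. search is any callable taking a query
    string (e.g. Google().search).'''
    results = []
    for i, query in enumerate(queries):
        if i:  # no need to sleep before the first query
            time.sleep(base_delay + random.uniform(0, base_delay / 2))
        results.append(search(query))
    return results
```

Randomizing the delay (rather than sleeping a fixed interval) makes the request pattern look less mechanical.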

How does the filter argument work?

How can I filter to exclude two hosts (wikipedia.org and facebook.com)?

According to the docs, filtering is done via the -f argument.
'-f', filter results [url, title, text, host] is what I find in the script.

Since -o json will output JSON and is described as '-o', help='output file [html, csv, json]', I expected something along the lines of -f host REGEX, but that does not work.
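As a workaround, excluded hosts can be filtered out after the search with a few lines of stdlib code (the two domains below come from the question; the example links are placeholders):

```python
from urllib.parse import urlparse

EXCLUDED = ('wikipedia.org', 'facebook.com')

def exclude_hosts(links, excluded=EXCLUDED):
    '''Drop links whose host matches, or is a subdomain of, an
    excluded domain.'''
    kept = []
    for link in links:
        host = urlparse(link).netloc
        if not any(host == d or host.endswith('.' + d) for d in excluded):
            kept.append(link)
    return kept

links = ['https://en.wikipedia.org/wiki/Foo', 'https://example.com/bar']
print(exclude_hosts(links))  # ['https://example.com/bar']
```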

Filename argument in search_engines_cli.py

Hello, thank you very much for this amazing project!

I'm using search_engines_cli.py, and when the query text is too long it raises an error because of the file system's limit on filename length. I fixed it by adding an argument:

ap.add_argument('-n', help='filename for output file', default=cfg.OUTPUT_DIR+'output')

I also added a new requirement and an argument in engine.output to make it run correctly.

Making a pull request as suggestion.

Thank you o/

ERROR HTTP 429

Hi, your code is working great; however, I am facing a too-many-requests error (429) after running 3 or 4 queries. I changed proxies as well, but it still gives the same error. Is it related to user agents?
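Besides slowing down between queries, a common pattern for handling HTTP 429 is exponential backoff with jitter. A hedged sketch (with_backoff is a hypothetical helper; fetch is assumed to return an object with a status_code attribute, as this library's HTTP responses do):

```python
import random
import time

def with_backoff(fetch, max_retries=5, base=1.0):
    '''Retry fetch() with exponential backoff while it keeps
    returning HTTP 429. fetch is any zero-argument callable returning
    an object with a status_code attribute.'''
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait 1x, 2x, 4x, ... the base delay, plus random jitter.
        time.sleep(base * (2 ** attempt) + random.uniform(0, base))
        response = fetch()
    return response
```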

Yahoo title parsing improving

I noticed that the titles of Yahoo results are extracted incorrectly:

URL: https://gist.github.com/soxoj/9d65c2f4d3bec5dd25949197ea73cf3a
Title: gist.github.com › soxoj › 9d65c2f4d3bec5dd25949197eamaigret.ipynb · GitHub

The title should be maigret.ipynb · GitHub

I made some fixes for this in my other project here.

Showing zero results on bing

My Code:

from search_engines import Bing
engine = Bing()
results = engine.search("lums")
links = results.links()
print(links)
# Searching Bing                                                                 
# []  

I get a zero-length list of links, whatever I search for.

DuckDuckGo scraper fails

Hi there!

I noticed an issue with the DuckDuckGo scraper. The response HTML doesn't contain any search results, only some Javascript, so the selectors don't match anything.

I created a temporary fix below (based on the Google scraper), which uses html.duckduckgo.com instead of duckduckgo.com.

Thanks!

output:

Searching Duckduckgo
Traceback (most recent call last):
  File "C:\Users\default\scrape.py", line 10, in <module>
    engine.search('test')
  File "C:\Users\default\search_engines\engine.py", line 162, in search
    response = self._get_page(request['url'], request['data'])
  File "C:\Users\default\search_engines\engines\duckduckgo.py", line 44, in _get_page
    response = self._http_client.get(page)
  File "C:\Users\default\search_engines\http_client.py", line 21, in get
    page = self._quote(page)
  File "C:\Users\default\search_engines\http_client.py", line 41, in _quote
    if utl.decode_bytes(utl.unquote_url(url)) == utl.decode_bytes(url):
  File "C:\Users\default\search_engines\utils.py", line 15, in unquote_url
    return decode_bytes(requests.utils.unquote(url))
  File "C:\Users\default\AppData\Local\Programs\Python\Python310\lib\urllib\parse.py", line 655, in unquote
    if '%' not in string:

fix:

from ..engine import SearchEngine
from ..config import PROXY, TIMEOUT, FAKE_USER_AGENT
from ..utils import unquote_url, quote_url

class Duckduckgo(SearchEngine):
    '''Searches duckduckgo.com'''
    def __init__(self, proxy=PROXY, timeout=TIMEOUT):
        super(Duckduckgo, self).__init__(proxy, timeout)
        self._base_url = u'https://html.duckduckgo.com'
        self._current_page = 1
        self.set_headers({'User-Agent':FAKE_USER_AGENT})

    def _selectors(self, element):
        '''Returns the appropriate CSS selector.'''
        selectors = {
            'url': 'a.result__a', 
            'title': 'a.result__a', 
            'text': 'a.result__snippet',
            'links': 'div#links div.result',
            'next': 'input[value="next"]'
        }
        return selectors[element]
    
    def _first_page(self):
        '''Returns the initial page and query.'''
        url = u'{}/html/?q={}'.format(self._base_url, quote_url(self._query, ''))
        return {'url':url, 'data':None}
    
    def _next_page(self, tags):
        '''Returns the next page URL and post data (if any)'''
        self._current_page += 1
        selector = self._selectors('next').format(page=self._current_page)
        next_page = self._get_tag_item(tags.select_one(selector), 'href')
        url = None
        if next_page:
            url = self._base_url + next_page
        return {'url':url, 'data':None}

    def _get_url(self, tag, item='href'):
        '''Returns the URL of search results item.'''
        selector = self._selectors('url')
        url = self._get_tag_item(tag.select_one(selector), item)

        if url.startswith(u'/url?q='):
            url = url.replace(u'/url?q=', u'').split(u'&sa=')[0]
        return unquote_url(url)

Startpage search not working?

Hello, while other engines do provide results, Startpage gives no results for even the simplest search terms. Can you please check whether it still works? Maybe something changed with Startpage in the meantime, so it no longer works.

Unsupported or invalid CSS selector

When searching with Google I get this error:

Searching Google

page: 1 links: 9
Traceback (most recent call last):
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38/scrapesearch.py", line 4, in <module>
    results = engine.search("my query")
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38\search_engines\engine.py", line 171, in search
    request = self._next_page(tags)
  File "C:/Users/pegag/AppData/Local/Programs/Python/Python38\search_engines\engines\google.py", line 36, in _next_page
    next_page = self._get_tag_item(tags.select_one(selector), 'href')
  File "C:\Users\pegag\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\element.py", line 1340, in select_one
    value = self.select(selector, limit=1)
  File "C:\Users\pegag\AppData\Local\Programs\Python\Python38\lib\site-packages\bs4\element.py", line 1476, in select
    raise ValueError(
ValueError: Unsupported or invalid CSS selector: "a[href][aria-label=Page 2]"

Google text attribute is empty

Many thanks again for your work, it works great. Recently I noticed that when searching with Google, the text field is empty for all results.

se = Google()
res = se.search("news", 1)
res.results()

[{'host': 'bbc.com',
  'link': 'https://www.bbc.com/news/world',
  'title': 'World - BBC Newshttps://www.bbc.com › news › world',
  'text': ''},
 {'host': 'edition.cnn.com',
  'link': 'https://edition.cnn.com/world',
  'title': 'World news – breaking news, videos and headlines - CNNhttps://edition.cnn.com › world',
  'text': ''},
 {'host': 'theguardian.com',
  'link': 'https://www.theguardian.com/world',
  'title': 'Latest news from around the world | The Guardianhttps://www.theguardian.com › world',
  'text': ''},
 {'host': 'hindustantimes.com',
  'link': 'https://www.hindustantimes.com/world-news',
  'title': 'World News, Latest World News, Breaking News and ...https://www.hindustantimes.com › World News',
  'text': ''},
 {'host': 'reuters.com',
  'link': 'https://www.reuters.com/news/archive/worldNews',
  'title': 'World News Headlines | Reutershttps://www.reuters.com › news › archive › worldNews',
  'text': ''},
 {'host': 'abcnews.go.com',
  'link': 'https://abcnews.go.com/International/',
  'title': 'International News | Latest World News, Videos & Photos ...https://abcnews.go.com › International',
  'text': ''},
 {'host': 'news.sky.com',
  'link': 'https://news.sky.com/world',
  'title': 'World News - Breaking international news and headlines | Sky ...https://news.sky.com › world',
  'text': ''},
 {'host': 'nytimes.com',
  'link': 'https://www.nytimes.com/section/world',
  'title': 'World News - The New York Timeshttps://www.nytimes.com › section › world',
  'text': ''},

How to stop the search

Sorry for asking a rudimentary question, but there is no real instruction or documentation.

After running the example, it keeps searching Google non-stop. How am I supposed to interrupt it or set other options?

engine = Google()
results = engine.search("my query")
links = results.links()

Error an ssl error occurred.

Hi, I like the code you wrote. Unfortunately, I am now experiencing the following error when scraping Bing:
ERROR An SSL error occurred.

Are there any search engines that don't ban their users for scraping?

Kind regards,
Mart

Feature request; toggle console output for searches

Hi there!

In engine.py there are three uses of out.console(...) within search(). Can this output be made toggleable? It should be relatively easy, and it would avoid console spam when using the search engine in e.g. a tqdm loop.

Bests
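As a stopgap until such a toggle exists, the printed output can be suppressed from the caller's side with the standard library. quiet is a hypothetical helper name; fn would be e.g. Google().search:

```python
import contextlib
import io

def quiet(fn, *args, **kwargs):
    '''Call fn while discarding anything it prints to stdout.'''
    with contextlib.redirect_stdout(io.StringIO()):
        return fn(*args, **kwargs)
```

For example, quiet(engine.search, "my query") would return the results without the "Searching ..." console messages.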

Startpage Ban Not Detected

Startpage.com doesn't send a 429 or another error status code; it just returns a different page instead of the results, so the ban isn't detected.

Can I use this project for image search?

Hi there,
Thank you for your contribution.
I have a question while using this script.
Can I use this script for image search?
For example, instead of a query string, can I pass the byte string of an image to perform an image search?
Kindly let me know what you think.
Thank you so much.

Possible fake data from Qwant API

Looks like all URLs from Qwant are fake for me:

Qwant results
1   http://ef.bf/teheuhi
2   http://cez.lr/di
3   http://veoca.la/gured
4   http://vamibif.th/unror
5   http://hicakda.hk/izwe
6   http://nofusive.mr/aco
7   http://ismuaka.pe/welagses
8   http://luho.mg/desler

Can you check if it is not only my problem?

Please support python 3.8

With Python 3.8.5 on Ubuntu, this scraper cannot be used:
Error:

Traceback (most recent call last):
  File "search.py", line 4, in <module>
    results = engine.search("Something")
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/engine.py", line 161, in search
    response = self._get_page(request['url'], request['data'])
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/engine.py", line 66, in _get_page
    return self._http_client.get(page)
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/http_client.py", line 21, in get
    page = self._quote(page)
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/http_client.py", line 41, in _quote
    if utl.decode_bytes(utl.unquote_url(url)) == utl.decode_bytes(url):
  File "/usr/local/lib/python3.8/dist-packages/search_engines-0.5-py3.8.egg/search_engines/utils.py", line 15, in unquote_url
    return decode_bytes(requests.utils.unquote(url))
AttributeError: module 'requests.utils' has no attribute 'unquote'

Content of file search.py

from search_engines import Bing

engine = Bing()
results = engine.search("Something")
links = results.links()
print(links)
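The failing call is requests.utils.unquote, which newer requests releases no longer appear to re-export. The standard library's urllib.parse.unquote behaves the same way for this purpose and could be substituted in search_engines/utils.py (a local patch, not an official fix):

```python
# Drop-in replacement for the missing requests.utils.unquote.
from urllib.parse import unquote

print(unquote('my%20query'))  # my query
```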

many queries issue

When I run multiple queries in a loop, the results of each query contain the results of the previous one.
How can I clear the previous results?
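One way to avoid the carry-over is to construct a fresh engine instance per query, so no state accumulates between searches. A sketch (search_many is a hypothetical helper; make_engine would be e.g. the Google class itself, or any zero-argument factory):

```python
def search_many(make_engine, queries):
    '''Run each query on a fresh engine instance so earlier results
    don't carry over. make_engine is any zero-argument factory whose
    product has a search() method returning an object with links().'''
    return {q: make_engine().search(q).links() for q in queries}
```

With the real library this would be called as search_many(Google, ["query one", "query two"]).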

Results appear different than the browser

If I search for something on DuckDuckGo in the browser in Incognito mode, I get one list of results. When running this scraper, I get a different set of results. Is there an explanation for this?

Thanks!

404 SERP links

Hi,

I am not sure if this is the right place to post this question, so my apologies in advance. I am scraping Google and getting the top-n link results for a collection of queries. Now I am trying to request those links to scrape the resulting pages. However, sometimes the links are broken (404 error). Isn't the search engine supposed to filter out broken links?

On the other hand, sometimes my request gives me a 404 Client Error even though the webpage actually exists. Could anyone give me some guidance on this?

Thank you very much.
Best,
Marcos

list proposal

Would it be possible to add a list option instead of a single query? This could make it easier to automate some cases.
Thank you

Different results between the search engine scraper and google

Hello,

Thanks for the outstanding library!

I recently faced an issue with different results when using the scraper.

Search query is PROCTER & GAMBLE CO sustainability report.
From a Google web query, I get the results shown in the attached screenshot.

However, when I use scraper,

from search_engines import Google

query = 'PROCTER & GAMBLE CO sustainability report'
engine = Google()
results = engine.search(query, 1)
links = results.links()

The output links are:

https://us.pg.com/ 
https://twitter.com/ProcterGamble?ref_src=twsrc^google|twcamp^serp|twgr^author 
https://en.wikipedia.org/wiki/Procter_&_Gamble 
https://www.pgcareers.com/ 
https://www.linkedin.com/company/procter-and-gamble 
https://www.facebook.com/proctergamble/ 
https://pginvestor.com/ 

May I know why this happens? How can I get consistent results?

Many thanks!

Locale Handling

Great Project, thanks a ton!
Wonder how to set the locale, e.g. en-US, given a search engine object?

What I currently do is set
self.session.headers.update({"Accept-Language": "en-US"})

in the http_client, e.g. right here.

Is there any other or maybe preferred way?

Difficulties installing on Macbook Air macOS Big Sur M1 Chip

I recently switched to a Mac machine and am having trouble installing the library here.

  • MacBook Air
  • macOS Big Sur
  • M1 Chip

Maybe anyone can help out?

running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 2] No such file or directory: '/Library/Python/3.8/site-packages/test-easy-install-41837.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /Library/Python/3.8/site-packages/

This directory does not currently exist.  Please create it and try again, or
choose a different installation directory (using the -d or --install-dir
option).
