opsdisk / yagooglesearch
Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.
License: BSD 3-Clause "New" or "Revised" License
The file result_languages.txt is not being installed with the package, resulting in this print statement every time the library is loaded:
There was an issue loading the result languages file. Exception: [Errno 2] No such file or directory: '/Users/nippur/src/linkedin-scraper/venv/lib/python3.11/site-packages/yagooglesearch/result_languages.txt'
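Until the data file ships with the wheel (typically a missing package_data or MANIFEST.in entry), a defensive loader can degrade gracefully instead of printing the error at import time. This is only a sketch; load_result_languages and its ["lang_en"] fallback are my own, not the library's API:

```python
import importlib.resources
import logging

def load_result_languages(package="yagooglesearch", resource="result_languages.txt"):
    """Load the valid result-language codes shipped with the package,
    falling back to a minimal default if the data file was not installed."""
    try:
        text = importlib.resources.files(package).joinpath(resource).read_text()
        return [line.strip() for line in text.splitlines() if line.strip()]
    except (FileNotFoundError, ModuleNotFoundError):
        # Data file (or package) missing: warn once and fall back.
        logging.getLogger(__name__).warning(
            "Could not load %s from %s; using default.", resource, package
        )
        return ["lang_en"]
```

The real fix is to declare the .txt file in the package's build configuration so it is installed alongside the module.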
Hello,
I want to ask a question about proxies. I have a paid proxy that uses authentication.
Fetching Google with pure "requests" in a blank project works fine, but after adding the proxy to yagooglesearch I get this error:
Traceback (most recent call last):
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connectionpool.py", line 700, in urlopen
    self._prepare_proxy(conn)
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connectionpool.py", line 996, in _prepare_proxy
    conn.connect()
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connection.py", line 369, in connect
    self._tunnel()
  File "C:\Users\budi\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 924, in _tunnel
    raise OSError(f"Tunnel connection failed: {code} {message.strip()}")
OSError: Tunnel connection failed: 407 Proxy Authentication Required
I checked several times and the same proxy works in plain requests code.
Is there any clue for this?
Hello.
Please tell me the format for specifying private proxies, i.e. proxies protected by a login and password.
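Private proxies are normally written in URL form, scheme://login:password@host:port, and as far as I can tell that same string is what yagooglesearch's proxy argument expects (e.g. SearchClient(query, proxy=proxy_url)). A sketch with hypothetical credentials:

```python
from urllib.parse import urlsplit

# Authenticated proxy in standard URL form: scheme://login:password@host:port.
# The credentials and host below are hypothetical placeholders.
proxy_url = "http://myuser:mypass@proxy.example.com:3128"

# urlsplit shows how the components are parsed out of that one string:
parts = urlsplit(proxy_url)
print(parts.username, parts.hostname, parts.port)
```

The same string also works as the value in a requests-style proxies dict ({"http": proxy_url, "https": proxy_url}), which is useful for testing the proxy outside the library first.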
It only gives 400 search URLs.
How can we maximize it?
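The cap is the client's max_search_result_urls_to_return setting, which appears in the library's own log output (max_search_result_urls_to_return=100); raising it should let search() keep paging, though Google itself rarely serves much more than a few hundred results for any query. As a rough sketch of the paging arithmetic involved (paging_offsets is my own helper, not part of the library):

```python
def paging_offsets(max_urls, num=100):
    """Google `start` offsets needed to page through max_urls results,
    requesting num results per page."""
    return list(range(0, max_urls, num))

# 400 URLs at 100 per page means four requests, at start=0, 100, 200, 300.
print(paging_offsets(400))
```

So if results stop at 400, it is worth checking both the client's configured maximum and whether Google simply stopped returning further pages.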
The current hl language option refers to the language of the HTML UI. However, to search for results in a particular language, the lr parameter must be set.
Please add support for both language options:
hl => HTML Language (UI)
lr => Language of Results (Linked Content)
For reference, the enumeration of supported languages for results is as follows:
lang_af=Afrikaans
lang_ar=Arabic
lang_hy=Armenian
lang_be=Belarusian
lang_bg=Bulgarian
lang_ca=Catalan
lang_zh-CN=Chinese (Simplified)
lang_zh-TW=Chinese (Traditional)
lang_hr=Croatian
lang_cs=Czech
lang_da=Danish
lang_nl=Dutch
lang_en=English
lang_eo=Esperanto
lang_et=Estonian
lang_tl=Filipino
lang_fi=Finnish
lang_fr=French
lang_de=German
lang_el=Greek
lang_iw=Hebrew
lang_hi=Hindi
lang_hu=Hungarian
lang_is=Icelandic
lang_id=Indonesian
lang_it=Italian
lang_ja=Japanese
lang_ko=Korean
lang_lv=Latvian
lang_lt=Lithuanian
lang_no=Norwegian
lang_fa=Persian
lang_pl=Polish
lang_pt=Portuguese
lang_ro=Romanian
lang_ru=Russian
lang_sr=Serbian
lang_sk=Slovak
lang_sl=Slovenian
lang_es=Spanish
lang_sw=Swahili
lang_sv=Swedish
lang_th=Thai
lang_tr=Turkish
lang_uk=Ukrainian
lang_vi=Vietnamese
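To illustrate the distinction, here is a sketch of how both parameters would appear in a raw search URL (the query and values are my own example, not the library's output):

```python
from urllib.parse import urlencode

# hl sets the language of the HTML UI; lr restricts the language of the
# results themselves and takes the lang_* codes listed above.
params = {
    "q": "jeu de paume",
    "hl": "en",        # interface language: English
    "lr": "lang_fr",   # only return French-language results
    "num": 100,
}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```

With only hl set, the UI is English but the results can be in any language; lr is what actually filters the linked content.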
Some valid lang_result values have capitalized parts, for example:
lang_zh-CN=Chinese (Simplified)
lang_zh-TW=Chinese (Traditional)
The code in __init__.py at L147 and L169-L175 causes these lang_result values to incorrectly fall back to lang_en:
...
self.lang_result = lang_result.lower()
...
# Argument checks.
if self.lang_result not in result_languages_list:
    ROOT_LOGGER.error(
        f"{self.lang_result} is not a valid language result. See {result_languages_file} for the list of valid "
        'languages. Setting lang_result to "lang_en".'
    )
    self.lang_result = "lang_en"
...
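One possible fix (a sketch only; normalize_lang_result and the abbreviated list are mine, not the library's code) is to compare case-insensitively and map back to the canonical casing, so codes like lang_zh-CN survive the lowercasing:

```python
# Abbreviated canonical list for illustration; the real one comes from
# result_languages.txt.
result_languages_list = ["lang_en", "lang_fr", "lang_zh-CN", "lang_zh-TW"]

# Map the lowercased form of each code back to its canonical casing.
canonical = {lang.lower(): lang for lang in result_languages_list}

def normalize_lang_result(lang_result):
    """Return the canonical lang_* code, case-insensitively; fall back to lang_en."""
    return canonical.get(lang_result.lower(), "lang_en")

print(normalize_lang_result("lang_zh-cn"))  # canonical "lang_zh-CN"
print(normalize_lang_result("bogus"))       # falls back to "lang_en"
```

This keeps the existing lang_en fallback for genuinely invalid input while accepting any casing of a valid code.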
Hey, I was working on a project that uses requestium, so it combines the requests Session object with the Selenium driver. It can solve the reCAPTCHA using the audio-to-speech selenium-recaptcha package, transferring the cookies from the driver into the session so no block happens. But I'm having some problems with the cookies part, so if you're willing to help with that, it would be appreciated.
Here is the main parsing code:
jar = http.cookiejar.CookieJar()
cookies = {}
isCookiesSet = False
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Chromium";v="106", "Not.A/Brand";v="24", "Opera GX";v="92"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version': '"106.0.5249.119"',
    'sec-ch-ua-full-version-list': '"Chromium";v="106.0.5249.119", "Opera GX";v="106.0.5249.119", "Not;A=Brand";v="99.0.0.0"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"8.0.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': UserAgent.random
}
class ConcurrencyManager:
    def __init__(self, initial_limit=5):
        self.limit = initial_limit
        self.semaphore = threading.BoundedSemaphore(initial_limit)

    def update_limit(self, new_limit):
        self.limit = new_limit
        self.semaphore = threading.BoundedSemaphore(new_limit)

    def acquire(self):
        return self.semaphore.acquire()

    def release(self):
        return self.semaphore.release()
class SeleniumGoogleFetcher:
    def __init__(self, urls, Window, concurrency_limit=5):
        self.drivers = []

        def getData():
            self.urls = urls
            self.concurrency_limit = ConcurrencyManager(concurrency_limit)
            self.Window = Window
            self.window = Window
            self.queue = queue.Queue()
            self.headless = Window.checkBox_37.isChecked()
            self.port = Window.spinBox_15.value()
            self.customArguments = Window.lineEdit_20.text()
            self.log = Window.checkBox_36.isChecked()
            self.useStealth = Window.checkBox_38.isChecked()
            self.chromeDriverPath = Window.lineEdit_21.placeholderText()
            self.bravePath = Window.lineEdit_22.placeholderText()
            self.browser = Window.comboBox_19.currentText()
            self.useProfiles = Window.checkBox_39.isChecked()
            self.isPaused = False
            self.isTerminated = False

        inmain(getData)

    @ErrorWrapper
    def signInToFirstProfile(self, options: ChromeOptions):
        options.add_argument(f"--user-data-dir=C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data")
        if path.exists(f"C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data\Profile 2"):
            options.add_argument("--profile-directory=Profile 2")
        else:
            file = ZipFile("./Profile 2.zip")
            file.extractall(f"C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data")
            options.add_argument("--profile-directory=Profile 2")
    def MakeDriver(self) -> Session:
        sleepStart = None
        sleepEnd = None

        def getData():
            nonlocal sleepStart
            nonlocal sleepEnd
            sleepStart = self.Window.doubleSpinBox_2.value()
            sleepEnd = self.Window.doubleSpinBox_3.value()

        inmain(getData)
        sleep(uniform(sleepStart, sleepEnd))
        Service = ChromeService(self.chromeDriverPath, port=self.port)
        Options = ChromeOptions()
        Options.accept_insecure_certs = True
        if self.customArguments != "":
            Options.add_argument(self.customArguments)
        Options.headless = self.headless
        if self.browser != "Chrome":
            Options.binary_location = self.bravePath
        if self.useProfiles:
            self.signInToFirstProfile(Options)
        driver = Chrome(service=Service, options=Options)
        if self.useStealth:
            print("USING STEALTH")
            stealth(
                driver,
                languages=["en-US", "en"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True,
            )
        session = Session(driver=driver)
        self.drivers.append(session)
        return session
    def pause(self):
        self.isPaused = True

    def resume(self):
        self.isPaused = False

    def setItem(self, item, list_widget, label):
        def main():
            if item != "" and item != " " and item != len(item) > 3:
                list_widget.addItem(item)
                label.setText(str(list_widget.count()))

        inmain(main)

    def addOne(self):
        def main():
            self.window.label_36.setText(str(int(self.window.label_36.text()) + 1))

        inmain(main)

    def terminate(self):
        for driver in self.drivers:
            try:
                driver.driver.quit()
                driver.close()
            except:
                conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                conn.request("GET", "/shutdown")
                conn.close()
            try:
                del driver.driver
                del driver
            except:
                pass
            return
        self.isTerminated = True
    @ErrorWrapper
    def fetch(self, url):
        sleepStart = None
        sleepEnd = None

        def getData():
            nonlocal sleepStart
            nonlocal sleepEnd
            sleepStart = self.Window.doubleSpinBox_2.value()
            sleepEnd = self.Window.doubleSpinBox_3.value()

        inmain(getData)
        try:
            sleep(uniform(sleepStart - 1, sleepEnd - 1))
        except:
            sleep(uniform(sleepStart, sleepEnd))
        while self.isPaused:
            sleep(1)
        if self.isTerminated:
            return
        print(driver.driver.get_cookies())
        try:
            if self.isTerminated:
                return
            try:
                driver = self.MakeDriver()
                driver.transfer_driver_cookies_to_session()
            except:
                print(traceback.format_exc())
                try:
                    driver = self.MakeDriver()
                except:
                    return
            print(f"SENDING GET REQUEST TO {url}")
            print(driver.cookies.items())
            driver.transfer_driver_cookies_to_session()
            response = driver.get(url, headers=headers)
            wait = WebDriverWait(driver, 1)

            def isCap():
                return """Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>""" in response.response.text

            while isCap():
                print(response.response.text)
                driver.transfer_session_cookies_to_driver()
                driver.driver.get(url)
                print("Solving reCaptcha")
                Recaptcha_Solver(
                    driver=driver.driver,
                    debug=True
                ).solve_recaptcha()
                if isCap():
                    print("CAPTCHA AGAIN --------------------")
                    continue
                else:
                    response = driver.get(url)
                    break
            print(response.response.text)
            driver.transfer_driver_cookies_to_session()
            try:
                parser = GoogleParser(response.response.text)
                main_class = parser.GetMainResultsClass()
                results = parser.ParseAllResults(main_class)
                for result in results.values():
                    self.setItem(result["url"], self.window.listWidget_9, self.window.label_18)
                    self.setItem(result["title"], self.window.listWidget_10, self.window.label_29)
                    self.setItem(result["description"], self.window.listWidget_11, self.window.label_29)
                try:
                    driver.driver.quit()
                    driver.close()
                except:
                    conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                    conn.request("GET", "/shutdown")
                    conn.close()
                try:
                    del driver.driver
                    del driver
                except:
                    pass
                return
            except:
                print(traceback.format_exc())
        except Exception as e:
            print(url)
            print(traceback.format_exc())
            try:
                driver.driver.quit()
                driver.close()
            except:
                conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                conn.request("GET", "/shutdown")
                conn.close()
            try:
                del driver.driver
                del driver
            except:
                pass
            return
        return
    def fetch_all(self):
        with ThreadPoolExecutor(max_workers=self.concurrency_limit.limit) as executor:
            print("FETCHING ALL")
            executor.map(self.worker, self.urls)

    def worker(self, url):
        if self.isTerminated:
            self.terminate()
            return
        try:
            self.fetch(url)
        except:
            print(traceback.format_exc())

    @ErrorWrapper
    def main(self):
        self.fetch_all()
uule = "w+CAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA"
but in the URL we have
&uule=w%2BCAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA
The "+" symbol is encoded as "%2B", which is not correct for this parameter.
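The difference between the two forms comes down to whether "+" is treated as a character to percent-encode. With Python's stdlib, both behaviors can be demonstrated (a sketch; whether Google expects the literal "+" in uule is the question raised here, not something I can confirm):

```python
from urllib.parse import quote

uule = "w+CAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA"

# Default percent-encoding turns "+" into "%2B":
print(quote(uule))

# Marking "+" as safe keeps it literal, if the raw character is required:
print(quote(uule, safe="+"))
```

So if the library is building the URL with a plain quote/urlencode call, adding "+" to the safe set would preserve the original uule value.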
I used a proxy IP for the search, and sometimes when using the exact same keywords, the search results are empty, while other times there are results. I suspect it is due to the quality of the proxy IP, but I am not sure how to confirm the cause of the problem or how to resolve it.
Could this library add a return value similar to HTTP_429_DETECTED that signals network issues, for example HTTP_ERROR? This could be used in a script to handle the result, such as changing the proxy.
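A caller-side sketch of handling such sentinels: "HTTP_429_DETECTED" is the string the library already appends to the result list, while "HTTP_ERROR" is the hypothetical sentinel proposed here, and the deque rotation is my own illustrative recovery strategy, not the library's API:

```python
from collections import deque

def handle_results(urls, proxies):
    """Rotate to the next proxy whenever an error sentinel shows up in the
    result list; return the results unchanged otherwise."""
    if "HTTP_429_DETECTED" in urls or "HTTP_ERROR" in urls:
        proxies.rotate(-1)  # move the next proxy to the front (hypothetical strategy)
        return None
    return urls

proxies = deque(["http://p1:8080", "http://p2:8080"])
print(handle_results(["HTTP_429_DETECTED"], proxies))  # None; proxy rotated
print(proxies[0])
```

With a distinct sentinel for generic network errors, the caller could distinguish "Google rate-limited me" from "my proxy is broken" and react differently.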
Traceback (most recent call last):
  File "C:\Users\sadd\Downloads\pwandname\crawl.py", line 31, in <module>
    client = yagooglesearch.SearchClient(
  File "C:\Users\sadd\AppData\Local\Programs\Python\Python310\lib\site-packages\yagooglesearch\__init__.py", line 183, in __init__
    self.proxy = proxy.lower()
AttributeError: 'dict' object has no attribute 'lower'
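The traceback suggests the proxy argument was given a requests-style proxies dict, while SearchClient appears to expect a single URL string (it calls proxy.lower() on it). A sketch of the adjustment, with a hypothetical local proxy address:

```python
# A requests-style proxies dict causes the AttributeError above:
proxies_dict = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

# Pick one entry out of the dict and pass that string instead:
proxy = proxies_dict.get("https") or proxies_dict.get("http")
print(proxy)
# client = yagooglesearch.SearchClient("query", proxy=proxy)  # hypothetical usage
```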
I wrote this simple function and ran it, and it returned an empty array. This was my first time using the module, so it shouldn't be a rate-limiting thing. I also waited a long time and retried; still no results.
gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: DEBUG (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
gg_search.assign_random_user_agent()
urls = gg_search.search()
Result:
2021-11-06 09:07:23,558 [MainThread ] [INFO] Requesting URL: https://www.google.com/
2021-11-06 09:07:23,727 [MainThread ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-06 09:07:23,727 [MainThread ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=en&filter=0
2021-11-06 09:07:23,906 [MainThread ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page. That implies there won't be any search results on the next page either. Moving on...
whereas with the original googlesearch library, I get results with this code:
from googlesearch import search
for url in search(gg_query, stop=20):
    print(url)
Thanks for the great library, but as far as I understand, it does not provide a way to handle a 429 response yourself (or at least I didn't find one); it just makes another request after a certain cool-off time. It would be great if there were a parameter like "retry" that could be passed to the client, something like retry=False, to make it raise an error if a 429 response was received.
How do I get only the main links and not the additional sitelinks attached to each main link, @opsdisk?
Can you suggest an open-source library for Google Images, similar to yagooglesearch? Thank you, @opsdisk.
First of all, thanks for the great tool. However, it seems to fail to perform the search for any 2nd page: "No valid search results found on this page. Moving on..."
Hello, is there any chance of adding CAPTCHA handling with the unicaps library?
It might be nice to have an alternative way to escape being blocked by Google.
What do you think about this?
I just need to fetch the URLs of sites.
Proxies with self-signed certs won't work. See opsdisk/pagodo#60 (comment)
Hello,
I'm using version 1.10.0 of the package (Python 3.12), on Windows, from Belgium.
Each time I call the search() function, it returns an empty result list.
When I try in my browser, it works well and does return some results.
And when I try with the package https://github.com/MarioVilas/googlesearch, it's working too.
I managed to reproduce the issue by opening the link in a private window, and noticed that it was because the content of the page is:
I found my problem similar to issue #5, but not exactly the same. I guess this has something to do with cookies but don't really know how to solve it. I tried multiple configurations of the SearchClient, but it's always the same problem.
Here is the logs.txt file.
Do you have an idea?