opsdisk / yagooglesearch
Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches.
License: BSD 3-Clause "New" or "Revised" License
The file result_languages.txt is not being installed with the package, resulting in this print statement every time the library is loaded:
There was an issue loading the result languages file. Exception: [Errno 2] No such file or directory: '/Users/nippur/src/linkedin-scraper/venv/lib/python3.11/site-packages/yagooglesearch/result_languages.txt'
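Until the data file ships with the wheel (typically a missing package_data or MANIFEST.in entry), a defensive loader can degrade gracefully instead of printing the error at import time. This is only a sketch; load_result_languages and its ["lang_en"] fallback are my own, not the library's API:

```python
import importlib.resources
import logging

def load_result_languages(package="yagooglesearch", resource="result_languages.txt"):
    """Load the valid result-language codes shipped with the package,
    falling back to a minimal default if the data file was not installed."""
    try:
        text = importlib.resources.files(package).joinpath(resource).read_text()
        return [line.strip() for line in text.splitlines() if line.strip()]
    except (FileNotFoundError, ModuleNotFoundError):
        # Data file (or package) missing: warn once and fall back.
        logging.getLogger(__name__).warning(
            "Could not load %s from %s; using default.", resource, package
        )
        return ["lang_en"]
```

The real fix is to declare the .txt file in the package's build configuration so it is installed alongside the module.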
Hello,
I want to ask a question about proxies. I have a paid proxy that uses authentication.
Fetching Google with pure "requests" in a blank project works fine, but after adding the proxy to yagooglesearch I get this error:
Traceback (most recent call last):
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connectionpool.py", line 700, in urlopen
    self._prepare_proxy(conn)
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connectionpool.py", line 996, in _prepare_proxy
    conn.connect()
  File "C:\Users\budi\.virtualenvs\article-30Om8FEk\lib\site-packages\urllib3\connection.py", line 369, in connect
    self._tunnel()
  File "C:\Users\budi\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 924, in _tunnel
    raise OSError(f"Tunnel connection failed: {code} {message.strip()}")
OSError: Tunnel connection failed: 407 Proxy Authentication Required
I checked several times and the same proxy works in plain requests code.
Is there any clue for this?
Hello.
Please tell me the format for specifying private proxies, i.e. proxies protected by a login and password.
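Private proxies are normally written in URL form, scheme://login:password@host:port, and as far as I can tell that same string is what yagooglesearch's proxy argument expects (e.g. SearchClient(query, proxy=proxy_url)). A sketch with hypothetical credentials:

```python
from urllib.parse import urlsplit

# Authenticated proxy in standard URL form: scheme://login:password@host:port.
# The credentials and host below are hypothetical placeholders.
proxy_url = "http://myuser:mypass@proxy.example.com:3128"

# urlsplit shows how the components are parsed out of that one string:
parts = urlsplit(proxy_url)
print(parts.username, parts.hostname, parts.port)
```

The same string also works as the value in a requests-style proxies dict ({"http": proxy_url, "https": proxy_url}), which is useful for testing the proxy outside the library first.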
It only gives 400 search URLs.
How can we maximize it?
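The cap is the client's max_search_result_urls_to_return setting, which appears in the library's own log output (max_search_result_urls_to_return=100); raising it should let search() keep paging, though Google itself rarely serves much more than a few hundred results for any query. As a rough sketch of the paging arithmetic involved (paging_offsets is my own helper, not part of the library):

```python
def paging_offsets(max_urls, num=100):
    """Google `start` offsets needed to page through max_urls results,
    requesting num results per page."""
    return list(range(0, max_urls, num))

# 400 URLs at 100 per page means four requests, at start=0, 100, 200, 300.
print(paging_offsets(400))
```

So if results stop at 400, it is worth checking both the client's configured maximum and whether Google simply stopped returning further pages.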
The current hl language option refers to the language of the HTML UI. However, to search for results in a particular language, the lr parameter must be set.
Please add support for both language options:
hl => HTML Language (UI)
lr => Language of Results (Linked Content)
For reference, the enumeration of supported languages for results is as follows:
lang_af=Afrikaans
lang_ar=Arabic
lang_hy=Armenian
lang_be=Belarusian
lang_bg=Bulgarian
lang_ca=Catalan
lang_zh-CN=Chinese (Simplified)
lang_zh-TW=Chinese (Traditional)
lang_hr=Croatian
lang_cs=Czech
lang_da=Danish
lang_nl=Dutch
lang_en=English
lang_eo=Esperanto
lang_et=Estonian
lang_tl=Filipino
lang_fi=Finnish
lang_fr=French
lang_de=German
lang_el=Greek
lang_iw=Hebrew
lang_hi=Hindi
lang_hu=Hungarian
lang_is=Icelandic
lang_id=Indonesian
lang_it=Italian
lang_ja=Japanese
lang_ko=Korean
lang_lv=Latvian
lang_lt=Lithuanian
lang_no=Norwegian
lang_fa=Persian
lang_pl=Polish
lang_pt=Portuguese
lang_ro=Romanian
lang_ru=Russian
lang_sr=Serbian
lang_sk=Slovak
lang_sl=Slovenian
lang_es=Spanish
lang_sw=Swahili
lang_sv=Swedish
lang_th=Thai
lang_tr=Turkish
lang_uk=Ukrainian
lang_vi=Vietnamese
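To illustrate the distinction, here is a sketch of how both parameters would appear in a raw search URL (the query and values are my own example, not the library's output):

```python
from urllib.parse import urlencode

# hl sets the language of the HTML UI; lr restricts the language of the
# results themselves and takes the lang_* codes listed above.
params = {
    "q": "jeu de paume",
    "hl": "en",        # interface language: English
    "lr": "lang_fr",   # only return French-language results
    "num": 100,
}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```

With only hl set, the UI is English but the results can be in any language; lr is what actually filters the linked content.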
Some valid lang_result values have capitalized parts, for example:
lang_zh-CN=Chinese (Simplified)
lang_zh-TW=Chinese (Traditional)
The code in __init__.py at L147 and L169-L175 causes these lang_result values to incorrectly fall back to lang_en:
...
self.lang_result = lang_result.lower()
...
# Argument checks.
if self.lang_result not in result_languages_list:
    ROOT_LOGGER.error(
        f"{self.lang_result} is not a valid language result. See {result_languages_file} for the list of valid "
        'languages. Setting lang_result to "lang_en".'
    )
    self.lang_result = "lang_en"
...
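One possible fix (a sketch only; normalize_lang_result and the abbreviated list are mine, not the library's code) is to compare case-insensitively and map back to the canonical casing, so codes like lang_zh-CN survive the lowercasing:

```python
# Abbreviated canonical list for illustration; the real one comes from
# result_languages.txt.
result_languages_list = ["lang_en", "lang_fr", "lang_zh-CN", "lang_zh-TW"]

# Map the lowercased form of each code back to its canonical casing.
canonical = {lang.lower(): lang for lang in result_languages_list}

def normalize_lang_result(lang_result):
    """Return the canonical lang_* code, case-insensitively; fall back to lang_en."""
    return canonical.get(lang_result.lower(), "lang_en")

print(normalize_lang_result("lang_zh-cn"))  # canonical "lang_zh-CN"
print(normalize_lang_result("bogus"))       # falls back to "lang_en"
```

This keeps the existing lang_en fallback for genuinely invalid input while accepting any casing of a valid code.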
Hey, I was working on a project that uses requestium, so it combines the requests Session object with the Selenium driver. It can solve the reCAPTCHA using the audio-to-speech selenium-recaptcha package, transferring the cookies from the driver into the session so no block happens. But I'm having some problems with the cookies part, so if you're willing to help with that, it would be appreciated.
Here is the main parsing code:
jar = http.cookiejar.CookieJar()
cookies = {}
isCookiesSet = False
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Chromium";v="106", "Not.A/Brand";v="24", "Opera GX";v="92"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version': '"106.0.5249.119"',
    'sec-ch-ua-full-version-list': '"Chromium";v="106.0.5249.119", "Opera GX";v="106.0.5249.119", "Not;A=Brand";v="99.0.0.0"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"8.0.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': UserAgent.random
}
class ConcurrencyManager:
    def __init__(self, initial_limit=5):
        self.limit = initial_limit
        self.semaphore = threading.BoundedSemaphore(initial_limit)

    def update_limit(self, new_limit):
        self.limit = new_limit
        self.semaphore = threading.BoundedSemaphore(new_limit)

    def acquire(self):
        return self.semaphore.acquire()

    def release(self):
        return self.semaphore.release()
class SeleniumGoogleFetcher:
    def __init__(self, urls, Window, concurrency_limit=5):
        self.drivers = []

        def getData():
            self.urls = urls
            self.concurrency_limit = ConcurrencyManager(concurrency_limit)
            self.Window = Window
            self.window = Window
            self.queue = queue.Queue()
            self.headless = Window.checkBox_37.isChecked()
            self.port = Window.spinBox_15.value()
            self.customArguments = Window.lineEdit_20.text()
            self.log = Window.checkBox_36.isChecked()
            self.useStealth = Window.checkBox_38.isChecked()
            self.chromeDriverPath = Window.lineEdit_21.placeholderText()
            self.bravePath = Window.lineEdit_22.placeholderText()
            self.browser = Window.comboBox_19.currentText()
            self.useProfiles = Window.checkBox_39.isChecked()
            self.isPaused = False
            self.isTerminated = False

        inmain(getData)

    @ErrorWrapper
    def signInToFirstProfile(self, options: ChromeOptions):
        options.add_argument(f"--user-data-dir=C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data")
        if path.exists(f"C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data\Profile 2"):
            options.add_argument("--profile-directory=Profile 2")
        else:
            file = ZipFile("./Profile 2.zip")
            file.extractall(f"C:/Users/{getuser()}/AppData/Local/Google/Chrome/User Data")
            options.add_argument("--profile-directory=Profile 2")
    def MakeDriver(self) -> Session:
        sleepStart = None
        sleepEnd = None

        def getData():
            nonlocal sleepStart
            nonlocal sleepEnd
            sleepStart = self.Window.doubleSpinBox_2.value()
            sleepEnd = self.Window.doubleSpinBox_3.value()

        inmain(getData)
        sleep(uniform(sleepStart, sleepEnd))
        Service = ChromeService(self.chromeDriverPath, port=self.port)
        Options = ChromeOptions()
        Options.accept_insecure_certs = True
        if self.customArguments != "":
            Options.add_argument(self.customArguments)
        Options.headless = self.headless
        if self.browser != "Chrome":
            Options.binary_location = self.bravePath
        if self.useProfiles:
            self.signInToFirstProfile(Options)
        driver = Chrome(service=Service, options=Options)
        if self.useStealth:
            print("USING STEALTH")
            stealth(
                driver,
                languages=["en-US", "en"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True,
            )
        session = Session(driver=driver)
        self.drivers.append(session)
        return session
    def pause(self):
        self.isPaused = True

    def resume(self):
        self.isPaused = False

    def setItem(self, item, list_widget, label):
        def main():
            if item != "" and item != " " and item != len(item) > 3:
                list_widget.addItem(item)
                label.setText(str(list_widget.count()))

        inmain(main)

    def addOne(self):
        def main():
            self.window.label_36.setText(str(int(self.window.label_36.text()) + 1))

        inmain(main)

    def terminate(self):
        for driver in self.drivers:
            try:
                driver.driver.quit()
                driver.close()
            except:
                conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                conn.request("GET", "/shutdown")
                conn.close()
            try:
                del driver.driver
                del driver
            except:
                pass
            return
        self.isTerminated = True
    @ErrorWrapper
    def fetch(self, url):
        sleepStart = None
        sleepEnd = None

        def getData():
            nonlocal sleepStart
            nonlocal sleepEnd
            sleepStart = self.Window.doubleSpinBox_2.value()
            sleepEnd = self.Window.doubleSpinBox_3.value()

        inmain(getData)
        try:
            sleep(uniform(sleepStart - 1, sleepEnd - 1))
        except:
            sleep(uniform(sleepStart, sleepEnd))
        while self.isPaused:
            sleep(1)
        if self.isTerminated:
            return
        print(driver.driver.get_cookies())
        try:
            if self.isTerminated:
                return
            try:
                driver = self.MakeDriver()
                driver.transfer_driver_cookies_to_session()
            except:
                print(traceback.format_exc())
                try:
                    driver = self.MakeDriver()
                except:
                    return
            print(f"SENDING GET REQUEST TO {url}")
            print(driver.cookies.items())
            driver.transfer_driver_cookies_to_session()
            response = driver.get(url, headers=headers)
            wait = WebDriverWait(driver, 1)

            def isCap():
                return """Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>""" in response.response.text

            while isCap():
                print(response.response.text)
                driver.transfer_session_cookies_to_driver()
                driver.driver.get(url)
                print("Solving reCaptcha")
                Recaptcha_Solver(
                    driver=driver.driver,
                    debug=True
                ).solve_recaptcha()
                if isCap():
                    print("CAPTCHA AGAIN --------------------")
                    continue
                else:
                    response = driver.get(url)
                    break
            print(response.response.text)
            driver.transfer_driver_cookies_to_session()
            try:
                parser = GoogleParser(response.response.text)
                main_class = parser.GetMainResultsClass()
                results = parser.ParseAllResults(main_class)
                for result in results.values():
                    self.setItem(result["url"], self.window.listWidget_9, self.window.label_18)
                    self.setItem(result["title"], self.window.listWidget_10, self.window.label_29)
                    self.setItem(result["description"], self.window.listWidget_11, self.window.label_29)
                try:
                    driver.driver.quit()
                    driver.close()
                except:
                    conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                    conn.request("GET", "/shutdown")
                    conn.close()
                try:
                    del driver.driver
                    del driver
                except:
                    pass
                return
            except:
                print(traceback.format_exc())
        except Exception as e:
            print(url)
            print(traceback.format_exc())
            try:
                driver.driver.quit()
                driver.close()
            except:
                conn = http.client.HTTPConnection(driver.driver.service.service_url.split("//")[1])
                conn.request("GET", "/shutdown")
                conn.close()
            try:
                del driver.driver
                del driver
            except:
                pass
            return
        return
    def fetch_all(self):
        with ThreadPoolExecutor(max_workers=self.concurrency_limit.limit) as executor:
            print("FETCHING ALL")
            executor.map(self.worker, self.urls)

    def worker(self, url):
        if self.isTerminated:
            self.terminate()
            return
        try:
            self.fetch(url)
        except:
            print(traceback.format_exc())

    @ErrorWrapper
    def main(self):
        self.fetch_all()
uule = "w+CAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA"
but in the URL we have
&uule=w%2BCAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA
The "+" symbol is encoded as "%2B", which is not correct for this parameter.
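The difference between the two forms comes down to whether "+" is treated as a character to percent-encode. With Python's stdlib, both behaviors can be demonstrated (a sketch; whether Google expects the literal "+" in uule is the question raised here, not something I can confirm):

```python
from urllib.parse import quote

uule = "w+CAIQICIiV2Fyc2F3LE1hc292aWFuIFZvaXZvZGVzaGlwLFBvbGFuZA"

# Default percent-encoding turns "+" into "%2B":
print(quote(uule))

# Marking "+" as safe keeps it literal, if the raw character is required:
print(quote(uule, safe="+"))
```

So if the library is building the URL with a plain quote/urlencode call, adding "+" to the safe set would preserve the original uule value.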
I used a proxy IP for the search, and sometimes when using the exact same keywords, the search results are empty, while other times there are results. I suspect it is due to the quality of the proxy IP, but I am not sure how to confirm the cause of the problem or how to resolve it.
Could this library add a return value similar to HTTP_429_DETECTED that signals network issues, for example HTTP_ERROR? This could be used in a script to handle the result, such as changing the proxy.
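A caller-side sketch of handling such sentinels: "HTTP_429_DETECTED" is the string the library already appends to the result list, while "HTTP_ERROR" is the hypothetical sentinel proposed here, and the deque rotation is my own illustrative recovery strategy, not the library's API:

```python
from collections import deque

def handle_results(urls, proxies):
    """Rotate to the next proxy whenever an error sentinel shows up in the
    result list; return the results unchanged otherwise."""
    if "HTTP_429_DETECTED" in urls or "HTTP_ERROR" in urls:
        proxies.rotate(-1)  # move the next proxy to the front (hypothetical strategy)
        return None
    return urls

proxies = deque(["http://p1:8080", "http://p2:8080"])
print(handle_results(["HTTP_429_DETECTED"], proxies))  # None; proxy rotated
print(proxies[0])
```

With a distinct sentinel for generic network errors, the caller could distinguish "Google rate-limited me" from "my proxy is broken" and react differently.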
Traceback (most recent call last):
  File "C:\Users\sadd\Downloads\pwandname\crawl.py", line 31, in <module>
    client = yagooglesearch.SearchClient(
  File "C:\Users\sadd\AppData\Local\Programs\Python\Python310\lib\site-packages\yagooglesearch\__init__.py", line 183, in __init__
    self.proxy = proxy.lower()
AttributeError: 'dict' object has no attribute 'lower'
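The traceback suggests the proxy argument was given a requests-style proxies dict, while SearchClient appears to expect a single URL string (it calls proxy.lower() on it). A sketch of the adjustment, with a hypothetical local proxy address:

```python
# A requests-style proxies dict causes the AttributeError above:
proxies_dict = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

# Pick one entry out of the dict and pass that string instead:
proxy = proxies_dict.get("https") or proxies_dict.get("http")
print(proxy)
# client = yagooglesearch.SearchClient("query", proxy=proxy)  # hypothetical usage
```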
I wrote this simple function and ran it, and it returned an empty array. This was my first time using the module, so it shouldn't be a rate-limiting thing. I also waited a long time and retried; still no results.
gg_query = "topic cluster"
gg_search = yagooglesearch.SearchClient(
    gg_query,
    verbosity=4,  # Logging level: DEBUG (CRITICAL: 50, ERROR: 40, WARNING: 30, INFO: 20 -> 4, DEBUG: 10 -> 5, NOTSET: 0 -> 6)
)
gg_search.assign_random_user_agent()
urls = gg_search.search()
Result:
2021-11-06 09:07:23,558 [MainThread ] [INFO] Requesting URL: https://www.google.com/
2021-11-06 09:07:23,727 [MainThread ] [INFO] Stats: start=0, num=100, total_valid_links_found=0 / max_search_result_urls_to_return=100
2021-11-06 09:07:23,727 [MainThread ] [INFO] Requesting URL: https://www.google.com/search?hl=en&q=topic+cluster&num=100&btnG=Google+Search&tbs=0&safe=off&cr=en&filter=0
2021-11-06 09:07:23,906 [MainThread ] [INFO] The number of valid search results (0) was not the requested max results to pull back at once num=(100) for this page. That implies there won't be any search results on the next page either. Moving on...
whereas with the original googlesearch library, I get results with this code:
from googlesearch import search
for url in search(gg_query, stop=20):
    print(url)
Thanks for the great library, but as far as I understand, it does not provide a way to handle a 429 response yourself (or at least I didn't find one); it just makes another request after a certain cool-off time. It would be great if there were a parameter like "retry" that could be passed to the client, something like retry=False, to make it raise an error if a 429 response was received.
How do I get only the main links and not the additional sitelinks attached to each main link, @opsdisk?
Can you suggest an open-source library for Google Images, similar to yagooglesearch? Thank you, @opsdisk.
First of all, thanks for the great tool. However, it seems to fail to perform the search for any 2nd page: "No valid search results found on this page. Moving on..."
Hello, is there any chance of adding CAPTCHA handling with the unicaps library?
It might be nice to have an alternative way to escape being blocked by Google.
What do you think about this?
I just need to fetch the URLs of sites.
Proxies with self-signed certs won't work. See opsdisk/pagodo#60 (comment)
Hello,
I'm using version 1.10.0 of the package (Python 3.12), on Windows, from Belgium.
Each time I call the search() function, it returns an empty result list.
When I try in my browser, it works well and does return some results.
And when I try with the package https://github.com/MarioVilas/googlesearch, it's working too.
I managed to reproduce the issue by opening the link in a private window, and noticed that it was because the content of the page is:
I found my problem similar to issue #5, but not exactly the same. I guess this has something to do with cookies but don't really know how to solve it. I tried multiple configurations of the SearchClient, but it's always the same problem.
Here is the logs.txt file.
Do you have an idea?