eliasdabbas / advertools
advertools - online marketing productivity and analysis tools
Home Page: https://advertools.readthedocs.io
License: MIT License
My apologies in advance, I am still a Python padawan. I swept through the documentation but couldn't find an answer.
I am assuming that after I run:
import advertools as adv
adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)
and then
import pandas as pd
crawl_df = pd.read_json('my_output_file.jl', lines=True)
I am supposed to see the data from the scrape? I did put in a real URL, and a 1.6 MB .jl output file was created. But when I run that cell, nothing happens: no errors, but no data either. I am testing this in a Jupyter notebook, and all requirements are installed.
Also, if I may, as an SEO practitioner, how do I output these results into a CSV file that can be viewed in Excel or Google Sheets for further analysis? If you're willing, can you provide an example command showing how to convert the .jl file to .csv? I tried to install json-lines but apparently it's no longer supported. I assume what I need is to import csv, but structuring it to encapsulate the data with column headers is a bit intimidating.
Thanks
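A minimal sketch of the .jl to .csv conversion asked about above, using only the standard library. The function name jl_to_csv and the file names are made up for illustration; nothing here is part of advertools. It assumes each line of the .jl file is one JSON object and uses the union of all keys as column headers.

```python
import csv
import json


def jl_to_csv(jl_path, csv_path):
    """Convert a JSON-lines file to CSV, using the union of all keys as headers."""
    with open(jl_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Crawled pages can yield different fields, so collect every key
    # in the order it is first seen.
    headers = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(rows)  # missing fields are written as empty cells
    return headers
```

If the data is already loaded with pandas as in the question, crawl_df.to_csv('my_output_file.csv', index=False) produces a CSV directly. Note also that a notebook cell ending in an assignment prints nothing; putting crawl_df or crawl_df.head() on the last line of the cell displays the data.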
After applying adv.url_to_df(logs_df['request']) to my dataset, the DataFrame explodes to more than 120 columns with names like:
'query_template',
'query_archive',
'query_key',
'query_per',
'query_x',
"query_[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0][
Applying it to the referer produces another 40 columns. Is this behavior intended?
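To illustrate why the column count grows with the data, here is a rough sketch that gives every distinct query parameter name its own query_<name> column. It uses only urllib.parse and is not advertools' actual implementation; the helper query_columns and the sample request paths are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def query_columns(urls):
    """Return the sorted 'query_<name>' columns needed to hold every parameter."""
    columns = set()
    for url in urls:
        # parse_qs maps each parameter name to the list of its values.
        for name in parse_qs(urlsplit(url).query, keep_blank_values=True):
            columns.add(f"query_{name}")
    return sorted(columns)


sample_requests = [
    "/search?template=a&archive=2020",
    "/item?key=1&per=10",
    "/item?x=5&key=2",
]
print(query_columns(sample_requests))
```

Under this scheme, a log file with many distinct (or malformed) parameter names needs one column per name, which would be consistent with the wide result described above.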
Is there a way to specify the number of threads used for crawling? Currently it uses all the threads on the system, which causes problems when refreshing the page.
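One possible approach, sketched here with illustrative values: adv.crawl accepts a custom_settings dictionary that is passed on to Scrapy, and Scrapy's concurrency settings limit how many requests run at once. Scrapy is asynchronous rather than thread-per-request, so these settings cap concurrent requests, not threads as such. The wrapper name throttled_crawl and the numbers are assumptions.

```python
# Scrapy settings that limit how aggressively the crawler works.
# The values are examples only; tune them for the machine and the site.
custom_settings = {
    "CONCURRENT_REQUESTS": 4,             # total simultaneous requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # simultaneous requests per domain
    "DOWNLOAD_DELAY": 1,                  # seconds to wait between requests
}


def throttled_crawl(url, output_file):
    """Hypothetical wrapper: crawl with the reduced concurrency defined above."""
    import advertools as adv  # imported here so the settings can be inspected without it

    adv.crawl(url, output_file, follow_links=True, custom_settings=custom_settings)
```

Calling throttled_crawl('https://example.com', 'my_output_file.jl') would then run the same crawl with fewer simultaneous requests.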
Currently, versions are not specified for the dependencies listed in setup.py.
Current config:
requirements = [
'pandas',
'pyasn1',
'scrapy',
'twython',
'pyarrow',
]
So, on every install, the latest versions of these required packages will be downloaded and installed.
Instead, we can adopt a common practice and specify the required versions, to avoid breaking changes in newer versions of these dependencies.
requirements = [
'pandas==1.4.2',
'pyasn1==0.4.8',
'scrapy==2.6.1',
'twython==3.9.1',
'pyarrow==8.0.0',
]
Hi. I'm working on a Streamlit app and I'm having an issue with scraped data, specifically the jsonld_sameAs column. It has mixed datatypes, and Streamlit throws an error when I try to show the table.
Here is my code
adv.crawl(urls, 'pages.jl', follow_links=False)
crawl_df = pd.read_json('pages.jl', lines=True)
st.dataframe(crawl_df)
Here is the error.
{StreamlitAPIException}('cannot mix list and non-list, non-null values', 'Conversion failed for column jsonld_sameAs with type object')
When I checked the column, its values are a mix of str and list.
They should all be lists, not str, to avoid issues.
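As a possible workaround until the column type is consistent, the values can be normalized after loading. This is a sketch only: the helper as_list is hypothetical and is not part of advertools or Streamlit.

```python
import math


def as_list(value):
    """Wrap a lone value in a list; leave lists and missing values untouched."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value  # keep missing values as they are
    if isinstance(value, list):
        return value
    return [value]


print(as_list("https://twitter.com/example"))
print(as_list(["https://a.example", "https://b.example"]))
```

With the DataFrame from the code above, this could be applied as crawl_df['jsonld_sameAs'] = crawl_df['jsonld_sameAs'].apply(as_list) before calling st.dataframe(crawl_df).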
Some field observations with a small data set of only 12 sites:
8 place all their JSON-LD tags in a hierarchy under @graph (wrapped in a single script tag), and 4 place their items in separate script tags.
advertools treats each script tag as a separate entity, so for those in a hierarchy there is a single JSON-LD @graph column with a nested object, and those without a hierarchy get spread out over multiple columns (json_1_ etc...).
I'm building a scraper that regularly and often scrapes the same sites. With the current functionality of advertools I will need to inspect each site and write conditions to scrape columns depending on how the JSON-LD tags have been stored.
I think it would be helpful to treat JSON-LD items as equal, whether they are in a hierarchy under @graph or in distinct script tags. As far as I can tell, the only difference apart from the nesting is that the @context entry appears as a sibling to @graph and so only appears once in the whole scheme.
NESTED:
{
"@context": "https://schema.org",
"@graph": [
{
"@type": "WebSite",
...
NON-NESTED:
{
"@context": "http://schema.org",
"@type": "Website",
...
Then as a bonus break out each @type to a column with the whole record within.
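A rough sketch of the proposal, assuming the JSON-LD blocks have already been parsed into Python dictionaries. The function names flatten_jsonld and group_by_type are hypothetical and this is not how advertools currently behaves.

```python
def flatten_jsonld(blocks):
    """Return one flat list of items, whether or not a block uses @graph."""
    items = []
    for block in blocks:
        if "@graph" in block:
            context = block.get("@context")
            for node in block["@graph"]:
                item = dict(node)
                # Copy the shared @context down so every item is self-describing.
                if context is not None:
                    item.setdefault("@context", context)
                items.append(item)
        else:
            items.append(block)
    return items


def group_by_type(items):
    """Group items by @type, so each type could become its own column."""
    grouped = {}
    for item in items:
        types = item.get("@type", "unknown")
        if not isinstance(types, list):
            types = [types]
        for type_name in types:
            grouped.setdefault(type_name, []).append(item)
    return grouped


blocks = [
    {"@context": "https://schema.org",
     "@graph": [{"@type": "WebSite", "name": "A"},
                {"@type": "Organization", "name": "B"}]},
    {"@context": "http://schema.org", "@type": "Website", "name": "C"},
]
print(sorted(group_by_type(flatten_jsonld(blocks))))
```

Note that the grouping is case-sensitive, so "WebSite" and "Website" from the two examples above end up under different keys unless the type names are normalized first.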
adv.crawl('https://cettire.com', 'my_output_file.jl', follow_links=True)
crawl_df = pd.read_json('my_output_file.jl', lines=True)
Hi,
Thanks so much for providing such a valuable Python package to marketing researchers.
I tried running the robotstxt_test() function as described in the documentation, but the example does not work correctly. I propose the following solution (3rd example below) and a change to the documentation:
# 1. current example includes wrong syntax
robotstxt_test('https://www.example.com/robots.txt',
useragents=['Googlebot', 'baiduspider', 'Bingbot']
urls=['/', '/hello', '/some-page.html']])
# 2. example includes right syntax but does not return meaningful information (i.e., returns HTTP Error 404)
adv.robotstxt_test('https://www.example.com/robots.txt',
user_agents=['Googlebot', 'baiduspider', 'Bingbot'],
urls=['/', '/hello', '/some-page.html'])
# 3. example includes right syntax and returns meaningful information
adv.robotstxt_test('https://www.amazon.com/robots.txt',
user_agents=['Googlebot', 'baiduspider', 'Bingbot'],
urls=['/', '/hello', '/some-page.html'])
Video sitemaps allow multiple <video:tag> elements in a sitemap entry. sitemap_to_df only handles the last given tag. It should extract all of them.
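For illustration, here is a small standard-library sketch that collects every <video:tag> per URL instead of only the last one. The sample sitemap and the function name video_tags_by_url are made up; this is not the sitemap_to_df implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample entry with two <video:tag> elements.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://example.com/videos/1</loc>
    <video:video>
      <video:title>Sample</video:title>
      <video:tag>first</video:tag>
      <video:tag>second</video:tag>
    </video:video>
  </url>
</urlset>"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
}


def video_tags_by_url(xml_text):
    """Map each <loc> to the list of all its <video:tag> values."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        # findall returns every matching element, not just the last one.
        result[loc] = [tag.text for tag in url.findall(".//video:tag", NS)]
    return result


print(video_tags_by_url(SAMPLE))
```

Keeping the tags as a list per URL (or joining them with a separator) avoids later tags overwriting earlier ones.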
Hi, can we have a small README on how to get started with development?
Hi Elias,
Comments on Instagram that include mentions with periods are currently truncated on advertools.__version__ = 0.14.2.
Example: @elias.dabbas -> [@elias]
I propose adding a . to the MENTION pattern in the regex module.
import re

MENTION = re.compile(
r"""(?i) # case-insensitive
(?<!\w) # word character doesn't precede mention
([@＠] # either of two @ signs (ASCII or fullwidth)
[a-z0-9_.]+) # A to Z, numbers, underscores, and periods only
\b # end with a word boundary
""", re.VERBOSE)
This change works for me, but I haven't tested edge cases or other social media platforms.
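A quick way to probe a few edge cases of the proposed pattern is sketched below. The pattern is written on one line here and assumes the second character in the class is the fullwidth ＠; the sample texts are made up.

```python
import re

# Proposed pattern: same as above, with "." added to the character class.
MENTION = re.compile(r"(?i)(?<!\w)([@＠][a-z0-9_.]+)\b")

samples = [
    "thanks @elias.dabbas for the package",  # period inside the handle
    "great work @elias. See you soon",       # period ending the sentence
    "mail me at someone@example.com",        # email address, not a mention
]
for text in samples:
    print(text, "->", MENTION.findall(text))
```

Because the pattern must end at a word boundary, a trailing sentence period is dropped from the match, while a period between two word characters is kept. The lookbehind also prevents email addresses from being picked up as mentions.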
(base) wenke@wenkedeMac-mini gradio-demo % python zapier.py
2023-12-03 00:40:48 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2023-12-03 00:40:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.16 (main, Mar 8 2023, 04:29:44) - [Clang 14.0.6 ], pyOpenSSL 23.0.0 (OpenSSL 1.1.1t 7 Feb 2023), cryptography 39.0.1, Platform macOS-10.15.7-x86_64-i386-64bit
2023-12-03 00:40:49 [scrapy.addons] INFO: Enabled addons:
[]
2023-12-03 00:40:49 [py.warnings] WARNING: /Users/wenke/miniconda3/lib/python3.9/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2023-12-03 00:40:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-03 00:40:49 [scrapy.extensions.telnet] INFO: Telnet Password: 4f579800aa59aff0
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-12-03 00:40:50 [scrapy.crawler] INFO: Overridden settings:
{'ROBOTSTXT_OBEY': True,
'SPIDER_LOADER_WARN_ONLY': True,
'USER_AGENT': 'advertools/0.13.5'}
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-12-03 00:40:50 [scrapy.core.engine] INFO: Spider opened
2023-12-03 00:40:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-03 00:40:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://zapier.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://zapier.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Traceback (most recent call last):
File "/Users/wenke/miniconda3/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://zapier.com> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:51 [seo_spider] ERROR: <twisted.python.failure.Failure twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]>
2023-12-03 00:40:51 [scrapy.core.scraper] DEBUG: Scraped from [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:51 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-03 00:40:51 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: zapier.jl
2023-12-03 00:40:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
'downloader/request_bytes': 1248,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 0.71527,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 12, 2, 16, 40, 51, 34244, tzinfo=datetime.timezone.utc),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/ERROR': 4,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 123461632,
'memusage/startup': 123461632,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2023, 12, 2, 16, 40, 50, 318974, tzinfo=datetime.timezone.utc)}
2023-12-03 00:40:51 [scrapy.core.engine] INFO: Spider closed (finished)
import advertools as adv
adv.crawl('https://zapier.com', 'zapier.jl', follow_links=True)
Weblog files may contain the domain name as a 10th column (e.g. when a system hosts several web servers). logs_to_df() expects this domain name to be missing.
But sometimes the field appears in a weblog file, either as a value or as an empty entry (e.g. '-' or '"-"'). logs_to_df() cannot handle this extra field and ignores these entries.
In weblogs, quotes delimit fields. Sometimes quotes are part of a field string and are escaped with "\". logs_to_df() does not handle the escaped character \" and ignores these entries.
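One way both cases could be handled is sketched below with a regular expression for combined-format log lines. It is a standalone illustration, not the pattern logs_to_df() uses; the group names, the helper parse_line, and the sample lines are assumptions.

```python
import re

# A quoted field that may contain escaped characters such as \" inside it.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

LOG_LINE = re.compile(
    r'^(?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<datetime>[^\]]+)\] '
    + QUOTED.replace('(', '(?P<request>', 1) + ' '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    + QUOTED.replace('(', '(?P<referer>', 1) + ' '
    + QUOTED.replace('(', '(?P<user_agent>', 1)
    + r'(?: (?P<domain>"[^"]*"|\S+))?$'  # optional 10th field
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


base = r'1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
with_domain = base + ' example.com'
escaped = (r'1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] '
           r'"GET /search?q=\"x\" HTTP/1.1" 200 512 "-" "Mozilla/5.0" "-"')

for line in (base, with_domain, escaped):
    print(parse_line(line))
```

The optional trailing group accepts a missing domain, a bare value, or a quoted placeholder, and the quoted-field pattern treats a backslash plus the following character as part of the field instead of as its end.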
If you try sitemap_to_df for wrangler.com, you will notice that there is a recursion in the sitemap. It calls the sitemap index again and again without terminating. There should be a check to keep track of visited Sitemaps.
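The suggested check could look roughly like the sketch below. The function collect_sitemaps and the get_children callback are hypothetical stand-ins for fetching a sitemap and listing the sitemaps it points to; no network access is involved.

```python
def collect_sitemaps(start_url, get_children):
    """Walk a sitemap index, visiting each sitemap URL at most once."""
    visited = set()
    order = []
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in visited:
            continue  # already processed: this is what breaks the recursion
        visited.add(url)
        order.append(url)
        stack.extend(get_children(url))
    return order


# Hypothetical structure where the index lists itself as a child.
SITEMAPS = {
    "index.xml": ["a.xml", "index.xml"],
    "a.xml": [],
}
print(collect_sitemaps("index.xml", lambda url: SITEMAPS.get(url, [])))
```

With the visited set in place, a self-referencing index is fetched once and the traversal terminates.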
Is there a backlink checker feature?
Is it possible to add functionality so we don't have to write to disk before being able to analyze the results?
Writing directly to a DataFrame or some other Python object would be great!
Hi @eliasdabbas
The sitemap_to_df function is throwing the following warning, so I thought it would be a good idea to bring it to your notice.
2022-04-21 18:46:26,560 | INFO | sitemaps.py:419 | sitemap_to_df | Getting https://xyz.com/sitemap/site.xml
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/advertools/sitemaps.py:421:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version.
Use pandas.concat instead.
sitemap_df = sitemap_df.append(elem_df, ignore_index=True)
Regards.
Hello Elias,
I hope everything goes well.
Any hints on how to get past protection like Cloudflare?
I am trying to scrape pages like https://www.welcometothejungle.com/fr/jobs?query=assistant&sortBy=mostRecent
In the terminal, with scrapy shell, when I launch view(response) it appears that there is an error on the page.
Any help is welcome,
best,
caro
Hi Everyone,
I am trying to run advertools in a Python venv on Ubuntu.
I tried with the standard Python that comes with this Ubuntu version (3.10.12). I also tried installing Python 3.9.18, because I saw someone with a similar issue suggesting this version, but the result is the same.
This is what I did:
mkdir /home/abc/advertools/
cd /home/abc/advertools/
python3 -m venv .
source bin/activate
python3 -m pip install advertools
When I try to type:
adv
or
advertools --version
I get
Illegal instruction (core dumped)
Do you guys have some suggestions?
For Python 3.9 I tried:
add-apt-repository ppa:deadsnakes/ppa
apt install python3.9
apt install python3.9-venv
python3.9 -m venv .
source bin/activate
python3 -m pip install advertools
same error: Illegal instruction (core dumped)
Thank you very much
Here is the list of URLs I'm trying to scrape. The crawl gets stuck and never finishes.
https://www.si.com/showcase/fitness/best-boxing-gloves
https://www.verywellfit.com/best-boxing-gloves-4158917
https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/
https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/
https://boxingglovesreviews.com/top-ten-boxing-gloves/
https://sweetscienceoffighting.com/best-boxing-gloves/
https://www.shape.com/fitness/gear/best-boxing-gloves
https://www.t3.com/features/best-boxing-gloves
https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros
https://www.youtube.com/watch?v=tWoucO2nIlE
https://expertboxing.com/best-boxing-gloves-review
https://thekarateblog.com/best-boxing-gloves/
https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why
https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131
https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/
https://myboxinglife.com/best-boxing-gloves-for-beginners/
https://www.youtube.com/watch?v=rHepbZOCxfY
https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/
https://www.hayabusafight.com/products/t3-boxing-gloves
https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work
https://revgear.com/gear/boxing-gloves/
https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/
https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751
https://cletoreyesboxing.com/
https://www.walmart.com/c/lists/top-rated-boxing-gloves
https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1
https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves
https://m.timesofindia.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms
https://www.everlast.com/fight/boxing/gloves
https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world
https://www.quora.com/What-companies-make-the-best-quality-boxing-gloves
https://www.titleboxing.com/gloves
https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms
https://skilspo.com/gb/blog/1_how-to-choose-the-best-boxing-gloves.html
https://bravose.com/collections/training-gloves
https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training
https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves
https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/
https://mmagearaddict.com/best-boxing-gloves/
https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan
https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf
https://yokkao.com/pages/boxing-gloves-guide
https://topboxer.com/collections/boxing-gloves
https://warriorpunch.com/best-boxing-gloves-for-beginners/
https://nypost.com/article/best-boxing-equipment-per-experts/
https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves
https://www.infinitudefight.com/buy-the-best-boxing-gloves/
https://cashkaro.com/blog/best-boxing-gloves-in-india/201246
https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473
https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/
https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves
https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners
https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2
https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/
https://fightquality.com/2018/10/12/best-custom-gloves/
https://fightingadvice.com/best-boxing-gloves-under-200/
https://glovesaddict.com/best-boxing-gloves-on-amazon/
https://www.k2promos.com/best-beginner-boxing-gloves/
https://absolutelymartialarts.com/best-boxing-gloves-beginners/
https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/
https://breakinggrips.com/best-kids-boxing-gloves/
https://www.proboxingequipment.com/Boxing-Gloves_c_196.html
https://www.mmahive.com/best-boxing-gloves-for-wrist-support/
https://bwsgym.com/etiquette-produit/best-boxing-gloves/
https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/
https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/
https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/
https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/
https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/
https://www.fightingking.com/boxing-gloves-brands-reviews/
https://www.mightyfighter.com/top-10-best-boxing-gloves/
https://www.stylecraze.com/articles/best-heavy-bag-gloves/
https://linealboxing.com/best-boxing-glove-brands-2022/
https://blackbeltmag.com/best-boxing-gloves
https://smartmma.com/best-boxing-gloves-for-heavy-bag/
https://www.fullcontactway.com/best-sparring-gloves/
https://www.attacktheback.com/best-cheap-boxing-gloves/
https://www.boxingear.com/shop-2/grant-gloves/lace-up/best-boxing-gloves-for-sparring-grant-gloves/
https://www.kreedon.com/best-boxing-gloves-brands/
https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves
https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners
https://www.fitnessbaddies.com/amateur-boxing-gloves/
https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/
https://boxingready.com/ringside/best-boxing-gloves-wrist-support/
https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews
https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/
https://thewiredshopper.com/best-boxing-gloves-to-buy/
https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html
https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/
https://www.lowkickmma.com/best-boxing-gloves/
https://www.sportsdirect.com/boxing/boxing-gloves
https://themmaguru.com/best-youth-boxing-gloves/
https://brawlbros.com/best-boxing-gloves-on-amazon/
https://thechamplair.com/sports/best-beginners-boxing-gloves/
https://www.dmarge.com/best-boxing-gloves
https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html
https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/
https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide
https://www.gearhungry.com/best-boxing-gloves/
https://hiconsumption.com/best-boxing-gloves/
Here is the log
/home/irfan/.pyenv/versions/TES/bin/python /home/irfan/PycharmProjects/TES-SAAS/tests/scprapping.py
2023-05-05 06:52:32 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2023-05-05 06:52:32 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.9 (default, Jan 23 2022, 07:32:51) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.0-148-generic-x86_64-with-debian-bullseye-sid
2023-05-05 06:52:32 [scrapy.crawler] INFO: Overridden settings:
{'ROBOTSTXT_OBEY': True,
'SPIDER_LOADER_WARN_ONLY': True,
'USER_AGENT': 'advertools/0.13.2'}
2023-05-05 06:52:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2023-05-05 06:52:32 [scrapy.extensions.telnet] INFO: Telnet Password: 2dcb88ca688b5e23
2023-05-05 06:52:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-05 06:52:33 [scrapy.core.engine] INFO: Spider opened
2023-05-05 06:52:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-05 06:52:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sweetscienceoffighting.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rollingstone.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [filelock] DEBUG: Attempting to acquire lock 140227121181328 on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Lock 140227121181328 acquired on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Attempting to release lock 140227121181328 on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Lock 140227121181328 released on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.t3.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.si.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearpatrol.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shape.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.verywellfit.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.si.com/showcase/fitness/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingglovesreviews.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.t3.com/features/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bleacherreport.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.si.com/showcase/fitness/best-boxing-gloves>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://expertboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.t3.com/features/best-boxing-gloves>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.verywellfit.com/best-boxing-gloves-4158917> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.verywellfit.com/best-boxing-gloves-4158917>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shape.com/fitness/gear/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thekarateblog.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.shape.com/fitness/gear/best-boxing-gloves>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sweetscienceoffighting.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://sweetscienceoffighting.com/best-boxing-gloves/>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxupnation.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tabletenniscoach.me.uk/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=tWoucO2nIlE> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingglovesreviews.com/top-ten-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=tWoucO2nIlE>
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros>
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxingglovesreviews.com/top-ten-boxing-gloves/>
2023-05-05 06:52:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why> (referer: None)
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why>
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wayofmartialarts.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thekarateblog.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://myboxinglife.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://thekarateblog.com/best-boxing-gloves/>
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://expertboxing.com/best-boxing-gloves-review> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.dickssportinggoods.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://expertboxing.com/best-boxing-gloves-review>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=rHepbZOCxfY> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hayabusafight.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://revgear.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=rHepbZOCxfY>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://myboxinglife.com/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://myboxinglife.com/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.joinfightcamp.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ringsport.com.au/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (referer: None) ['partial']
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <429 https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://made4fighters.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1> (referer: None)
2023-05-05 06:52:38 [seo_spider] ERROR: Invalid control character at: line 5 column 19 (char 78) 200 https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 5 column 19 (char 78)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msmfightshop.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751>
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.quora.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.quora.com/What-companies-make-the-best-quality-boxing-gloves>
2023-05-05 06:52:39 [seo_spider] ERROR: Invalid control character at: line 20 column 226 (char 698) 200 https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 20 column 226 (char 698)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hayabusafight.com/products/t3-boxing-gloves> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.everlast.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesboxing.com/> (referer: None)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hayabusafight.com/products/t3-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cletoreyesboxing.com/>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://revgear.com/gear/boxing-gloves/> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.timesofindia.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/c/lists/top-rated-boxing-gloves> (referer: None)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://revgear.com/gear/boxing-gloves/>
2023-05-05 06:52:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr> from <GET https://m.timesofindia.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.walmart.com/c/lists/top-rated-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.titleboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bravose.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sanabulsports.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://anthonyjoshua.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.everlast.com/fight/boxing/gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nakmuaywholesale.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.everlast.com/fight/boxing/gloves>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mmagearaddict.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bravose.com/collections/training-gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves>
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bravose.com/collections/training-gloves>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://issuu.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tufwear-germany.de/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.titleboxing.com/gloves> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yokkao.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.titleboxing.com/gloves>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mmagearaddict.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://topboxer.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://nypost.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://mmagearaddict.com/best-boxing-gloves/>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://warriorpunch.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yokkao.com/pages/boxing-gloves-guide> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://topboxer.com/collections/boxing-gloves> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://yokkao.com/pages/boxing-gloves-guide>
2023-05-05 06:52:41 [seo_spider] ERROR: Invalid control character at: line 15 column 21 (char 385) 200 https://topboxer.com/collections/boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 15 column 21 (char 385)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://topboxer.com/collections/boxing-gloves>
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://nypost.com/article/best-boxing-equipment-per-experts/> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kdvr.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cashkaro.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://nypost.com/article/best-boxing-equipment-per-experts/>
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://origympersonaltrainercourses.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.popsugar.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.expertreviews.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cashkaro.com/blog/best-boxing-gloves-in-india/201246> (referer: None)
2023-05-05 06:52:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cashkaro.com/blog/best-boxing-gloves-in-india/201246>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://warriorpunch.com/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://warriorpunch.com/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473>
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://branded.disruptsports.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightquality.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.flipkart.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.infinitudefight.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 35 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 69 without any user agent to enforce it on.
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://absolutelymartialarts.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.k2promos.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.infinitudefight.com/buy-the-best-boxing-gloves/> (referer: None)
2023-05-05 06:52:44 [seo_spider] ERROR: Expecting value: line 1 column 1 (char 0) 200 https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightingadvice.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.infinitudefight.com/buy-the-best-boxing-gloves/>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.proboxingequipment.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.proboxingequipment.com/Boxing-Gloves_c_196.html> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://glovesaddict.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.proboxingequipment.com/Boxing-Gloves_c_196.html>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://absolutelymartialarts.com/best-boxing-gloves-beginners/> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.healthyprinciples.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://absolutelymartialarts.com/best-boxing-gloves-beginners/>
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mmahive.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bwsgym.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightquality.com/2018/10/12/best-custom-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightingadvice.com/best-boxing-gloves-under-200/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.k2promos.com/best-beginner-boxing-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fightquality.com/2018/10/12/best-custom-gloves/>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dontwasteyourmoney.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fightingadvice.com/best-boxing-gloves-under-200/>
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.k2promos.com/best-beginner-boxing-gloves/>
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bwsgym.com/etiquette-produit/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://middleeasy.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bwsgym.com/etiquette-produit/best-boxing-gloves/>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bestproducts.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mmahive.com/best-boxing-gloves-for-wrist-support/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.momjunction.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mmahive.com/best-boxing-gloves-for-wrist-support/>
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://glovesaddict.com/best-boxing-gloves-on-amazon/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://breakinggrips.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://glovesaddict.com/best-boxing-gloves-on-amazon/>
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/>
2023-05-05 06:52:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/robots.txt> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:46 [py.warnings] WARNING: /home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/scrapy/core/engine.py:276: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
  return self.download(result, spider) if isinstance(result, Request) else result
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/robots.txt> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://breakinggrips.com/best-kids-boxing-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mightyfighter.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/>
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://breakinggrips.com/best-kids-boxing-gloves/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.stylecraze.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.fightingking.com/robots.txt> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.fightingking.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 6 without any user agent to enforce it on.
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://linealboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wbcme.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blackbeltmag.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mightyfighter.com/top-10-best-boxing-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://smartmma.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://linealboxing.com/best-boxing-glove-brands-2022/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mightyfighter.com/top-10-best-boxing-gloves/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.stylecraze.com/articles/best-heavy-bag-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://linealboxing.com/best-boxing-glove-brands-2022/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:48 [seo_spider] ERROR: Invalid control character at: line 28 column 64 (char 1740) 200 https://www.stylecraze.com/articles/best-heavy-bag-gloves/
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 28 column 64 (char 1740)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.stylecraze.com/articles/best-heavy-bag-gloves/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kreedon.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.kreedon.com/best-boxing-gloves-brands/>
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.attacktheback.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <429 https://www.fightingking.com/boxing-gloves-brands-reviews/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingear.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blackbeltmag.com/best-boxing-gloves> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blackbeltmag.com/best-boxing-gloves>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesuk.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.attacktheback.com/best-cheap-boxing-gloves/> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.attacktheback.com/best-cheap-boxing-gloves/>
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://sites.google.com/view> from <GET https://www.boxingear.com/shop-2/grant-gloves/lace-up/best-boxing-gloves-for-sparring-grant-gloves/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fullcontactway.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fitnessbaddies.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bestreviews.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingison.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://thewiredshopper.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 38 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 50 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 51 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 52 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 54 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 55 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 56 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 57 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 58 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 59 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 60 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 61 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 67 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 72 without any user agent to enforce it on.
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msn.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fullcontactway.com/best-sparring-gloves/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://thewiredshopper.com/best-boxing-gloves-to-buy/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.fullcontactway.com/best-sparring-gloves/>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://smartmma.com/best-boxing-gloves-for-heavy-bag/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <403 https://thewiredshopper.com/best-boxing-gloves-to-buy/>
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://smartmma.com/best-boxing-gloves-for-heavy-bag/>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gloveworx.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews>
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fitnessbaddies.com/amateur-boxing-gloves/> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.standard.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sites.google.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.fitnessbaddies.com/amateur-boxing-gloves/>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pragmaticmom.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lowkickmma.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingready.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sportsdirect.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lowkickmma.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://sites.google.com/view> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lowkickmma.com/best-boxing-gloves/>
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <404 https://sites.google.com/view>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://themmaguru.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dmarge.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dmarge.com/best-boxing-gloves> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dmarge.com/best-boxing-gloves>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sportsdirect.com/boxing/boxing-gloves> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thechamplair.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://brawlbros.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://themmaguru.com/best-youth-boxing-gloves/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.sportsdirect.com/boxing/boxing-gloves>
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/>
2023-05-05 06:52:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://themmaguru.com/best-youth-boxing-gloves/>
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nytimes.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearhungry.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hungry4fitness.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://findbestboxinggloves.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hiconsumption.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thechamplair.com/sports/best-beginners-boxing-gloves/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://thechamplair.com/sports/best-beginners-boxing-gloves/>
2023-05-05 06:52:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://brawlbros.com/best-boxing-gloves-on-amazon/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://brawlbros.com/best-boxing-gloves-on-amazon/>
2023-05-05 06:52:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hiconsumption.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://hiconsumption.com/best-boxing-gloves/>
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide> (referer: None)
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearhungry.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide>
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gearhungry.com/best-boxing-gloves/>
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingready.com/ringside/best-boxing-gloves-wrist-support/> (referer: None)
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxingready.com/ringside/best-boxing-gloves-wrist-support/>
2023-05-05 06:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html> (referer: None)
2023-05-05 06:52:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html>
2023-05-05 06:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/> (referer: None)
2023-05-05 06:52:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/>
2023-05-05 06:53:33 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 196 pages/min), scraped 97 items (at 97 items/min)
2023-05-05 06:54:33 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 0 pages/min), scraped 97 items (at 0 items/min)
2023-05-05 06:54:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://skilspo.com/robots.txt> (failed 1 times): TCP connection timed out: 110: Connection timed out.
When trying to retrieve a large recursive sitemap, I get HTTP error 429 (Too Many Requests). There currently seems to be no way to limit the number of requests, specify a cooldown period, or limit the request rate, so nothing is ever retrieved with that function.
Is there a way to extract only the information I want? By default the crawler extracts too much, and if the web page is large, the JSON lines file becomes very large.
For example, I want to extract just the title.
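One possible workaround on the reading side, offered as a sketch rather than an advertools feature: load the JSON lines file in chunks and keep only the columns of interest. This does not shrink the file that the crawl writes; it only limits what ends up in memory. The helper name `read_columns`, the sample rows, and the file name `sample_crawl.jl` are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def read_columns(jl_path, columns, chunksize=1000):
    """Read a JSON-lines file in chunks, keeping only the requested columns."""
    chunks = pd.read_json(jl_path, lines=True, chunksize=chunksize)
    # reindex() keeps the requested columns and tolerates chunks where one is missing.
    kept = [chunk.reindex(columns=columns) for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


# Tiny stand-in file so the sketch runs without a real crawl.
rows = [
    {"url": "https://example.com/", "title": "Home", "body_text": "long text ..."},
    {"url": "https://example.com/about", "title": "About", "body_text": "more text ..."},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_crawl.jl")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    titles_df = read_columns(path, ["url", "title"])

print(titles_df)
```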
Do you plan to add a feature for creating sitemaps? I have a lot of big websites that need a sitemap created.
For the function word_frequency, mainly as the default value for the rm_words param.
Get the full list of stopwords for several languages, to be imported and potentially used in function calls, or separately, e.g.:
import advertools as adv
adv.stop_words['en']
adv.stop_words['fr']
Hi all,
I'm following the documentation with this line of code
adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)
But it returns this error:
FileNotFoundError: [WinError 2] The system cannot find the file specified
Even though my directory looks like this:
- SEO.py
- my_output_file.jl
Here is the complete trace:
Traceback (most recent call last):
File "c:/Users/Henrique/Desktop/SEO/SEO.py", line 6, in <module>
adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)
File "C:\Users\Henrique\AppData\Roaming\Python\Python38\site-packages\advertools\spider.py", line 971, in crawl
subprocess.run(command)
File "C:\Python38\lib\subprocess.py", line 489, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Python38\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Python38\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
As you can see it doesn't specify which file was not found but I assume it is the output file.
Any help is greatly appreciated!
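A note on reading the trace: it ends in `subprocess.run(command)` and `_winapi.CreateProcess`, and on Windows that call raises `[WinError 2]` when the program being launched cannot be found, not when an output file is missing. A plausible cause, which is an assumption and not confirmed in this thread, is that the `scrapy` command is not on the PATH of the interpreter running the script. A minimal check (the helper name `find_executable` is hypothetical):

```python
import shutil
import sys


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


scrapy_path = find_executable("scrapy")
if scrapy_path is None:
    # Assumption: the crawl launches the `scrapy` command as a subprocess.
    print("'scrapy' was not found on PATH; interpreter in use:", sys.executable)
else:
    print("'scrapy' resolves to:", scrapy_path)
```

If the command is missing, installing Scrapy into the same environment as the script, or adding that environment's Scripts folder to PATH, would be the first things to try.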
Originally posted by @henriquearaujo-98 in #247
In the sitemaps extraction script there's a relatively new warning
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
sitemap_df = sitemap_df.append(elem_df, ignore_index=True)
Thanks,
Bill
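For reference, the replacement that the warning suggests looks roughly like this. It is a generic pandas sketch, not the actual change in the sitemaps module; the frames in `elem_frames` are hypothetical stand-ins for the per-element frames.

```python
import pandas as pd

# Hypothetical stand-ins for the per-element frames built while parsing a sitemap.
elem_frames = [
    pd.DataFrame({"loc": ["https://example.com/a"], "lastmod": ["2023-01-01"]}),
    pd.DataFrame({"loc": ["https://example.com/b"], "lastmod": ["2023-01-02"]}),
]

# Deprecated pattern: sitemap_df = sitemap_df.append(elem_df, ignore_index=True)
# Replacement: collect the pieces in a list and concatenate once.
sitemap_df = pd.concat(elem_frames, ignore_index=True)
print(sitemap_df)
```

Concatenating once at the end also avoids copying the growing frame on every iteration.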
Not fatal, but just an issue note:
There seems to be an issue with Python 3.10/3.11
python/cpython#103142
Mac Intel
and containers using
FROM python:3.11-slim
FROM python:3.10-slim
This url:
https://opentopography.org/sitemap.xml
gets redirected to:
https://portal.opentopography.org/sitemap.xml
If I just use https://portal.opentopography.org/sitemap.xml, it works fine.
File "/Users//development/dev_earthcube/earthcube_utilities/venv311/lib/python3.11/site-packages/advertools/sitemaps.py", line 491, in sitemap_to_df
xml_text = urlopen(Request(sitemap_url, headers=headers))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 519, in open
response = self._open(req, data)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 496, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1002)>
hi sir,
I am having problems with the start parameter on serp_goog. I want to query from the first page to the 4th.
Please advise.
thank you
packages version:
pandas
In [5]: print(pd.__version__)
0.25.0
advertools
In [6]: print(adv.__version__)
0.7.3
----> 4 next_result_1=adv.serp_goog(cx=cx, key=key, q=queri, gl=['id'],start=[1, 11, 21])
KeyError: "['start'] not in index"
C:\Anaconda3\lib\site-packages\advertools\serp.py in serp_goog(q, cx, key, c2coff, cr, dateRestrict, exactTerms, excludeTerms, fileType, filter, gl, highRange, hl, hq, imgColorType, imgDominantColor, imgSize, imgType, linkSite, lowRange, lr, num, orTerms, relatedSite, rights, safe, searchType, siteSearch, siteSearchFilter, sort, start)
700 specified_cols)
701 non_ordered = result_df.columns.difference(set(ordered_cols))
--> 702 final_df = result_df[ordered_cols + list(non_ordered)]
703 return final_df
704
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2979 if is_iterator(key):
2980 key = list(key)
-> 2981 indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
2982
2983 # take() does not accept boolean indexers
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
1269 # When setting, missing keys are not allowed, even with .loc:
1270 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1271 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
1272 else:
1273 try:
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1076
1077 self._validate_read_indexer(
-> 1078 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1079 )
1080 return keyarr, indexer
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1169 if not (self.name == "loc" and not raise_missing):
1170 not_found = list(set(key) - set(ax))
-> 1171 raise KeyError("{} not in index".format(not_found))
1172
1173 # we skip the warning on Categorical/Interval
KeyError: "['start'] not in index"
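On the pagination part of the question: `start` is the 1-based index of the first result to return, and the search API returns at most 10 results per request, so pages one to four correspond to start values 1, 11, 21 and 31. A small sketch for generating them (the helper name `page_starts` is hypothetical, and this does not address the KeyError raised in version 0.7.3 above):

```python
def page_starts(pages, results_per_page=10):
    """Return the 1-based `start` offset for each of the first `pages` result pages."""
    return [1 + results_per_page * page for page in range(pages)]


starts = page_starts(4)
print(starts)  # [1, 11, 21, 31]
```

The resulting list can be passed the same way as in the call above, e.g. `start=starts`.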
Hello Elias,
Advertools is a really great package ! Many thanks for the splendid work.
Nevertheless I have a little problem for which I have not found a good workaround (except crawling urls one by one)
I was wondering how to get the initial url that is crawled.
Advertools returns the URL after redirection and not the URL before. So when you want to merge data it can become tricky if you have no "reference".
Would it also be possible to "inject" specific user parameters as strings so that they appear in the output? I tried to do so with xpath_selectors, but failed completely.
Many thanks,
Caro
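One direction that may help, sketched under assumptions: if the crawl output contains a `redirect_urls` column holding the redirect chain joined with "@@", the originally requested URL would be the first item of that chain. Both the column name and the separator are assumptions here and should be checked against your own output file; the data below is a toy example, and `requested_url` is a hypothetical helper.

```python
import pandas as pd

# Toy data standing in for a crawl output; the `redirect_urls` column name and the
# "@@" separator are assumptions to verify against your own file.
crawl_df = pd.DataFrame({
    "url": ["https://example.com/new", "https://example.com/same"],
    "redirect_urls": ["http://example.com/old@@https://example.com/old", None],
})


def requested_url(row):
    """Return the first URL of the redirect chain, or the final URL if there was no redirect."""
    chain = row["redirect_urls"]
    if isinstance(chain, str) and chain:
        return chain.split("@@")[0]
    return row["url"]


crawl_df["requested_url"] = crawl_df.apply(requested_url, axis=1)
print(crawl_df[["requested_url", "url"]])
```

The new `requested_url` column can then serve as the merge key against the original URL list.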
Hello Elias,
I had already posted the topic some time ago on #328, but I don't think you had seen it.
Thank you for the fantastic work you're doing with advertools.
However, I have an issue with websites that have a cookie wall, like on https://www.interflora.fr/p/roses-passion.
When I do
scrapy shell view(response)
I can clearly see that I am blocked.
There is absolutely no element like the title, the button or body_text
So, I was wondering if you might have a fantastic idea to work around this issue.
Thanks a million !
sitemap_to_df only returns the first image loc in the sitemap. If there are multiple images, it ignores the rest. Example: https://www.levi.in/sitemap_0.xml
How can we make this work for multiple images?
Method: adv.url_to_df()
advertools/urlytics.py:198: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
.assign(last_dir=dirs_df
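The replacement named in the warning is a direct swap. The exact expression in urlytics.py is not shown above, so the following is only the generic pandas pattern on a toy frame, with `dirs_df` borrowed as a variable name and a forward fill across columns assumed:

```python
import pandas as pd

# Toy frame of URL directory levels; the second URL has only one directory.
dirs_df = pd.DataFrame({"dir_1": ["blog", "shop"], "dir_2": ["post", None]})

# Deprecated: dirs_df.fillna(method="ffill", axis=1)
# Replacement with the same effect: forward-fill across columns.
filled = dirs_df.ffill(axis=1)

# After the fill, the last column holds the deepest available directory per row.
last_dir = filled.iloc[:, -1]
print(last_dir.tolist())
```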
The new (and amazing!) crawl_headers function seems quite aggressive.
Although it is very fast, I noticed that quite a few URLs fail and respond with an error code in column 'status' (although they load quite fine in a browser).
Is there a way to throttle or retry?
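A possible approach, assuming `crawl_headers` accepts a `custom_settings` dictionary of Scrapy settings (worth confirming in the documentation for your installed version): slow the crawl down and allow more retries. The setting names below are standard Scrapy settings; the values are illustrative only, and `crawl_headers_politely` is a hypothetical wrapper.

```python
# Standard Scrapy setting names; the values here are illustrative, not recommendations.
polite_settings = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # fewer parallel requests per site
    "DOWNLOAD_DELAY": 1,                  # seconds to wait between requests to a site
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "RETRY_TIMES": 5,                     # retry failed requests more often
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
}


def crawl_headers_politely(url_list, output_file):
    """Run a header crawl with the settings above (assumes advertools is installed)."""
    import advertools as adv

    # Assumption: crawl_headers takes a `custom_settings` keyword argument.
    adv.crawl_headers(url_list, output_file, custom_settings=polite_settings)
```

A call would look like `crawl_headers_politely(urls, 'headers_output.jl')`, where both arguments are placeholders.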
I've made it as far as examining the DataFrame returned from a crawl!
Looking through the docs, I was expecting separately labelled columns for the various bits of JSON lines data. Instead I'm seeing a column with lots of objects in it. Is this a quirk of viewing a DataFrame within VSCode? Or are the docs out of date? Or something else? I'm a little stuck!
Thanks!