advertools's People

Contributors

amrrs, andypayne, bilalmirza74, danielp77, dreadedhamish, eliasdabbas, lgtm-migrator, pyup-bot, takluyver

advertools's Issues

opening .jl file command doesn't show 'my_output_file.jl'

My apologies in advance, I am still a python padawan. I swept through the documentation but couldn't find an answer.

I am assuming that after I run:

import advertools as adv
adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)

and then

import pandas as pd
crawl_df = pd.read_json('my_output_file.jl', lines=True)

Am I supposed to see the data from the scrape? I did put in a real URL, and a 1.6 MB .jl output file was created. But when I run that cell, nothing happens: no errors, but no data either. I am testing this in a Jupyter notebook, and all requirements are installed.

Also, if I may, as an SEO practitioner, how do I output these results into a CSV file that can be viewed in Excel or Google Sheets for further analysis? If you're willing, can you provide an example command showing how to convert the .jl file to .csv? I tried to install json-lines but apparently it's no longer supported. I assume what I need is to import csv, but structuring it to encapsulate the data with column headers etc. is a bit intimidating.

Thanks
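A note on both questions, offered as a sketch and not as the maintainer's answer. In a notebook, an assignment such as crawl_df = pd.read_json(...) displays nothing by itself; putting crawl_df or crawl_df.head() on the last line of the cell shows the data. For the CSV export, crawl_df.to_csv('my_output_file.csv', index=False) is the usual pandas route. The example below does the same conversion with only the standard library, assuming each line of the .jl file holds one JSON object. The function name jl_to_csv and the file names are illustrative.

```python
import csv
import json


def jl_to_csv(jl_path, csv_path):
    """Convert a JSON-lines file to CSV and return the number of rows written."""
    rows = []
    with open(jl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))

    # Crawled pages can have different keys, so use the union of all keys
    # (in first-seen order) as the CSV header.
    columns = list(dict.fromkeys(key for row in rows for key in row))

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Assumed usage with the file name from the question:
# jl_to_csv('my_output_file.jl', 'my_output_file.csv')
```

The resulting file opens directly in Excel or Google Sheets; cells that held lists or nested objects are written as their text representation.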

request_url_df creates wide list?

After applying adv.url_to_df(logs_df['request']) to my dataset, the dataframe explodes to more than 120 columns with names like:

'query_template',
'query_archive',
'query_key',
'query_per',
'query_x',
"query_[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0]['[0][

Applying it to the referer column produces another 40 columns. Is this behavior intended?
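The column names above suggest one column per distinct query parameter, which grows quickly when parameter names are unusual or malformed. As a possible workaround (a sketch using only the standard library, not advertools' own behavior), the query string can be kept as a single dictionary per URL. The helper name query_as_dict is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def query_as_dict(url):
    """Return the query parameters of a URL as one dict of lists."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


# Assumed usage on a pandas Series of URLs, giving one column of dicts
# instead of one column per parameter:
# logs_df['query'] = logs_df['request'].map(query_as_dict)
```

Individual parameters of interest can then be pulled out of that one column as needed.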

Number of threads for crawling

Is there a way to specify the number of threads for crawling? Currently it uses all the threads on the system, which makes it hard to refresh the page.
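Scrapy, which the logs elsewhere on this page show running underneath the crawler, limits load through concurrency and delay settings, not through a thread count. The sketch below builds such a settings dictionary. It assumes that adv.crawl accepts a custom_settings dictionary of Scrapy settings; the helper name and the values are illustrative.

```python
def gentle_crawl_settings(max_requests=4, delay_seconds=1.0):
    """Build Scrapy settings that limit how hard a crawl hits the machine and the site."""
    return {
        # Upper bound on requests in flight at the same time.
        "CONCURRENT_REQUESTS": max_requests,
        # Tighter bound for any single domain.
        "CONCURRENT_REQUESTS_PER_DOMAIN": max(1, max_requests // 2),
        # Seconds to wait between requests to the same site.
        "DOWNLOAD_DELAY": delay_seconds,
        # Let Scrapy slow down further when the server responds slowly.
        "AUTOTHROTTLE_ENABLED": True,
    }


# Assumed usage, relying on the custom_settings parameter of crawl:
# import advertools as adv
# adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True,
#           custom_settings=gentle_crawl_settings())
```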

Freezing versions for dependencies

@eliasdabbas

Currently, no versions are specified for the dependencies listed in setup.py.
Current config:
requirements = [
'pandas',
'pyasn1',
'scrapy',
'twython',
'pyarrow',
]

So, on every install, the latest versions of these required packages will be downloaded and installed.
Instead, we can adopt the common practice of specifying the required versions, to avoid breaking changes in newer versions of these dependencies.

requirements = [
'pandas==1.4.2',
'pyasn1==0.4.8',
'scrapy==2.6.1',
'twython==3.9.1',
'pyarrow==8.0.0',
]

jsonld_sameAs data is mixed with str and list.

Hi. I'm working on a Streamlit app and I'm having an issue with the scraped data, specifically the jsonld_sameAs column. It has a mixed datatype, and Streamlit throws an error when I try to show the table.

Here is my code

adv.crawl(urls, 'pages.jl', follow_links=False)
crawl_df = pd.read_json('pages.jl', lines=True)
st.dataframe(crawl_df)

Here is the error.

{StreamlitAPIException}('cannot mix list and non-list, non-null values', 'Conversion failed for column jsonld_sameAs with type object')

When I checked the datatype of the column, it turned out to be a mix of str and list.

It should always be a list, not a str, to avoid issues.
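Until the column is uniform at the source, one possible workaround is to normalize it after loading. The sketch below wraps single values in a list and turns missing values into empty lists. The helper name as_list is made up for illustration, and whether Streamlit then renders the table has not been verified here.

```python
import math


def as_list(value):
    """Return value as a list: lists pass through, missing becomes [], scalars are wrapped."""
    if isinstance(value, list):
        return value
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    return [value]


# Assumed usage on the crawl DataFrame from the snippet above:
# crawl_df['jsonld_sameAs'] = crawl_df['jsonld_sameAs'].map(as_list)
# st.dataframe(crawl_df)
```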

Suggestion - don't treat jsonld items in distinct script tags as distinct.

Some field observations from a small data set of only 12 sites:
8 place all their JSON-LD items in a hierarchy under @graph (wrapped in a single script tag), and 4 place their items in separate script tags.
advertools treats each script tag as a separate entity, so for those in a hierarchy there is a single JSON-LD @graph column with nested objects, and those without a hierarchy get spread out over multiple columns (json_1_ etc...).

I'm building a scraper that regularly and often scrapes the same sites. With the current functionality of advertools I will need to inspect each site and write conditions to scrape columns depending on how its JSON-LD tags have been stored.

I think it would be helpful to treat JSON-LD items, whether they are in a hierarchy under @graph or in distinct script tags, as being equal. As far as I can tell, the only difference apart from being nested is that the @context entry appears as a sibling to @graph and so only appears once in the whole scheme.

NESTED:
{
"@context": "https://schema.org",
"@graph": [
{
"@type": "WebSite",
...

NON-NESTED:
{
"@context": "http://schema.org",
"@type": "Website",
...

Then as a bonus break out each @type to a column with the whole record within.
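The suggestion can be sketched in a few lines, assuming the content of each script tag has already been parsed into a Python dict. The function names are hypothetical and this is not how advertools currently works; script tags whose top level is a list are not handled.

```python
def flatten_jsonld(blocks):
    """Flatten parsed JSON-LD blocks into one list of items.

    blocks: list of dicts, one per JSON-LD script tag on the page.
    Items under "@graph" and stand-alone items come out in the same shape,
    each carrying the "@context" of the block it came from.
    """
    items = []
    for block in blocks:
        context = block.get("@context")
        children = block["@graph"] if "@graph" in block else [block]
        for child in children:
            item = dict(child)  # copy, so the input is not modified
            if context is not None:
                item.setdefault("@context", context)
            items.append(item)
    return items


def group_by_type(items):
    """Group flattened items by their "@type" (the bonus: one bucket per type)."""
    grouped = {}
    for item in items:
        types = item.get("@type", "unknown")
        for type_name in types if isinstance(types, list) else [types]:
            grouped.setdefault(type_name, []).append(item)
    return grouped
```

With this, the NESTED and NON-NESTED layouts above both yield a flat list of items, so downstream code no longer needs per-site conditions.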

Bulk robots.txt Tester Documentation

Hi,

Thanks so much for providing such a valuable Python package to marketing researchers.

I tried running the robotstxt_test() function as described in the documentation, but the example does not work correctly. I propose the following solution (3rd example below) and a change to the documentation:

# 1. current example includes wrong syntax
robotstxt_test('https://www.example.com/robots.txt',
               useragents=['Googlebot', 'baiduspider', 'Bingbot']
               urls=['/', '/hello', '/some-page.html']])

# 2. example includes right syntax but does not return meaningful information (i.e., returns HTTP Error 404)
adv.robotstxt_test('https://www.example.com/robots.txt',
               user_agents=['Googlebot', 'baiduspider', 'Bingbot'],
               urls=['/', '/hello', '/some-page.html'])

# 3. example includes right syntax and returns meaningful information
adv.robotstxt_test('https://www.amazon.com/robots.txt',
               user_agents=['Googlebot', 'baiduspider', 'Bingbot'],
               urls=['/', '/hello', '/some-page.html'])
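For documentation purposes, an example that does not depend on any live robots.txt file may be more stable. The sketch below uses Python's standard urllib.robotparser, not advertools, with an inline robots.txt made up for illustration, to show the kind of user-agent by URL result such a test produces.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Bingbot
Disallow: /
"""


def check_urls(robots_text, user_agents, urls):
    """Return (user_agent, url, can_fetch) for every combination."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        (agent, url, parser.can_fetch(agent, url))
        for agent in user_agents
        for url in urls
    ]


for row in check_urls(ROBOTS_TXT, ["Googlebot", "Bingbot"], ["/hello", "/private/page"]):
    print(row)
```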

Instagram Mentions Allows Periods

Hi Elias,

Comments on Instagram that include mentions with periods are currently truncated on advertools.__version__ = 0.14.2.
Example: @elias.dabbas -> [@elias]

I propose adding a . to the MENTION pattern in the regex module.

MENTION = re.compile(
    r"""(?i)     # case-insensitive
    (?<!\w)      # word character doesn't precede mention
    ([@＠]       # either of two @ signs
    [a-z0-9_.]+)  # A to Z, numbers and underscores AND PERIODS only
    \b           # end with a word boundary
    """, re.VERBOSE)

This change works for me, but I haven't tested edge cases or other social media platforms.
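A few edge cases can be checked quickly. The sketch below uses a simplified copy of the proposed pattern (ASCII @ only, an assumption made to keep the example short) and a hypothetical helper, find_mentions.

```python
import re

# Simplified version of the proposed pattern: ASCII "@" only.
MENTION = re.compile(
    r"""(?i)         # case-insensitive
    (?<!\w)          # word character doesn't precede mention
    (@[a-z0-9_.]+)   # letters, numbers, underscores and periods
    \b               # end with a word boundary
    """, re.VERBOSE)


def find_mentions(text):
    """Return all mentions found in text."""
    return MENTION.findall(text)


print(find_mentions("thanks @elias.dabbas!"))   # ['@elias.dabbas']
print(find_mentions("well said @elias."))       # ['@elias']
print(find_mentions("mail me at a@b.com"))      # []
```

A period that ends the sentence is not swallowed, because the match has to end at a word boundary, and email addresses are still skipped by the lookbehind.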

Browser can load https://zapier.com, but the crawl fails

(base) wenke@wenkedeMac-mini gradio-demo % python zapier.py 
2023-12-03 00:40:48 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2023-12-03 00:40:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.4, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.16 (main, Mar  8 2023, 04:29:44) - [Clang 14.0.6 ], pyOpenSSL 23.0.0 (OpenSSL 1.1.1t  7 Feb 2023), cryptography 39.0.1, Platform macOS-10.15.7-x86_64-i386-64bit
2023-12-03 00:40:49 [scrapy.addons] INFO: Enabled addons:
[]
2023-12-03 00:40:49 [py.warnings] WARNING: /Users/wenke/miniconda3/lib/python3.9/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-12-03 00:40:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-03 00:40:49 [scrapy.extensions.telnet] INFO: Telnet Password: 4f579800aa59aff0
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-03 00:40:50 [scrapy.crawler] INFO: Overridden settings:
{'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True,
 'USER_AGENT': 'advertools/0.13.5'}
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-03 00:40:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-12-03 00:40:50 [scrapy.core.engine] INFO: Spider opened
2023-12-03 00:40:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-03 00:40:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://zapier.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://zapier.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Traceback (most recent call last):
  File "/Users/wenke/miniconda3/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://zapier.com> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://zapier.com> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:51 [seo_spider] ERROR: <twisted.python.failure.Failure twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]>
2023-12-03 00:40:51 [scrapy.core.scraper] DEBUG: Scraped from [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2023-12-03 00:40:51 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-03 00:40:51 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: zapier.jl
2023-12-03 00:40:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1248,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'elapsed_time_seconds': 0.71527,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 2, 16, 40, 51, 34244, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/ERROR': 4,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 123461632,
 'memusage/startup': 123461632,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
 "robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2023, 12, 2, 16, 40, 50, 318974, tzinfo=datetime.timezone.utc)}
2023-12-03 00:40:51 [scrapy.core.engine] INFO: Spider closed (finished)

import advertools as adv
adv.crawl('https://zapier.com', 'zapier.jl', follow_links=True)

logs_to_df() Limitation

1) Missing domain_entry

Weblog files may contain the domain name (e.g. when a system hosts several web servers) as a 10th column. logs_to_df() expects this domain name to be absent.

But sometimes the field appears in a weblog file, either as an entry or as an empty entry (e.g. '-' or '"-"'). logs_to_df() cannot handle this extra field and ignores these entries.

2) Failure on escaped quotes (\")

In weblogs, quotes delimit fields. Sometimes quotes are part of a field string and are escaped with a backslash (\"). logs_to_df() does not handle the escaped character \" and ignores these entries.
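Both limitations can be illustrated with a regular expression that accepts an optional trailing field and treats \" as part of a quoted field. This is a sketch, not the pattern logs_to_df() uses: it assumes the nine-field layout of the common "combined" log format followed by the optional domain, and the field names are made up for illustration.

```python
import re

# A quoted field in which a backslash-escaped quote does not end the field.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

LOG_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] '               # client, ident, user, [datetime]
    + QUOTED + r' (\d{3}) (\S+) '                      # "request", status, size
    + QUOTED + r' ' + QUOTED                           # "referer", "user agent"
    + r'(?: "?([^"\s]*)"?)?\s*$'                       # optional 10th field: domain
)

FIELDS = ["client", "ident", "user", "datetime", "request",
          "status", "size", "referer", "user_agent", "domain"]


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    row = dict(zip(FIELDS, match.groups()))
    # A missing domain and the placeholders - and "-" all become None.
    if row["domain"] in (None, "", "-"):
        row["domain"] = None
    return row
```

Lines with a domain, with an empty placeholder, or with no tenth field at all are all parsed into the same ten keys.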

sitemap_to_df cannot handle recursion well

If you try sitemap_to_df for wrangler.com, you will notice that there is a recursion in the sitemap: the sitemap index is requested again and again without terminating. There should be a check that keeps track of visited sitemaps.
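The requested check could look roughly like the sketch below: a set of visited URLs guarantees that each sitemap is fetched once, even when indexes point at each other. The fetch_children callable is hypothetical; it stands in for whatever downloads a sitemap index and returns the sitemap URLs listed in it.

```python
def walk_sitemaps(start_url, fetch_children):
    """Visit every reachable sitemap exactly once and return the visit order.

    fetch_children: hypothetical callable that takes a sitemap URL and returns
    the sitemap URLs it lists (an empty list for a regular sitemap).
    """
    visited = set()
    order = []
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in visited:
            continue  # already handled: this is what breaks the loop
        visited.add(url)
        order.append(url)
        stack.extend(fetch_children(url))
    return order


# A sitemap index that lists itself, as in the report above:
graph = {"index.xml": ["a.xml", "index.xml"], "a.xml": ["index.xml"]}
print(walk_sitemaps("index.xml", lambda url: graph.get(url, [])))
```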

Feature Request - Alternative Crawl Output

Is it possible to add functionality so we don't have to write to disk before being able to analyze the results?

Directly to a df or some other python object would be great!
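Until something like this exists, a thin wrapper can hide the file. The sketch below still writes to disk, but into a temporary directory that is deleted afterwards. It takes the crawl function as an argument and assumes that function has returned only once the output file is complete; the helper name crawl_to_records is made up for illustration.

```python
import json
import os
import tempfile


def crawl_to_records(crawl_func, url_list, **crawl_kwargs):
    """Run crawl_func into a temporary .jl file and return its rows as dicts."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_file = os.path.join(tmp_dir, "crawl.jl")
        crawl_func(url_list, output_file, **crawl_kwargs)
        if not os.path.exists(output_file):
            return []  # nothing was written
        with open(output_file, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


# Assumed usage with advertools and pandas:
# import advertools as adv
# import pandas as pd
# crawl_df = pd.DataFrame(crawl_to_records(adv.crawl, 'https://example.com',
#                                          follow_links=True))
```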

pandas frame.append method is deprecated

Hi @eliasdabbas

The sitemap_to_df function is throwing the following warning, so I thought it would be a good idea to bring it to your notice.

2022-04-21 18:46:26,560 | INFO | sitemaps.py:419 | sitemap_to_df | Getting https://xyz.com/sitemap/site.xml
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/advertools/sitemaps.py:421: 
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. 
Use pandas.concat instead.
  sitemap_df = sitemap_df.append(elem_df, ignore_index=True)

Regards.
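The replacement named in the warning can be sketched as follows: collect the per-sitemap frames in a list and concatenate them once at the end, instead of appending inside the loop. The helper name combine_frames is illustrative and is not part of advertools.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate a list of DataFrames in one call, renumbering the index."""
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Instead of: sitemap_df = sitemap_df.append(elem_df, ignore_index=True)
frames = [
    pd.DataFrame({"loc": ["https://xyz.com/a", "https://xyz.com/b"]}),
    pd.DataFrame({"loc": ["https://xyz.com/c"]}),
]
sitemap_df = combine_frames(frames)
print(sitemap_df)
```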

Advertools in Ubuntu in a Venv (Python 3.10.12 and Python 3.9.18)

Hi Everyone,

I am trying to run advertools in a Python venv on Ubuntu.

I tried with the standard Python that comes with this Ubuntu version (3.10.12). I also installed Python 3.9.18, because I saw someone with a similar issue suggest that version, but the result is the same.

This is what I did:

mkdir /home/abc/advertools/
cd /home/abc/advertools/
python3 -m venv .
source bin/activate
python3 -m pip install advertools

When I try to type:
adv
or
advertools --version

I get

Illegal instruction (core dumped)

Do you guys have some suggestions?

For Python 3.9 I tried:
add-apt-repository ppa:deadsnakes/ppa
apt install python3.9
apt install python3.9-venv
python3.9 -m venv .
source bin/activate
python3 -m pip install advertools

same error: Illegal instruction (core dumped)

Thank you very much
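An "Illegal instruction" crash usually comes from a compiled dependency and not from advertools' own Python code, so a first step could be to find out which import triggers it. The sketch below imports each module in a separate interpreter, so a crash in one does not kill the check. The helper name is made up, and which package (if any) is responsible on this machine is not known.

```python
import subprocess
import sys


def import_exit_codes(modules):
    """Import each module in a fresh interpreter and return its exit code.

    0 means the import worked, 1 means a Python exception (for example the
    module is not installed), and a negative value on Linux means the
    interpreter was killed by a signal (-4 corresponds to SIGILL).
    """
    results = {}
    for name in modules:
        proc = subprocess.run(
            [sys.executable, "-c", f"import {name}"],
            capture_output=True,
        )
        results[name] = proc.returncode
    return results


# Assumed usage inside the activated venv, with the dependencies named in
# setup.py elsewhere on this page:
# print(import_exit_codes(["pandas", "pyarrow", "scrapy", "twython", "advertools"]))
```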

Scrapes forever

Here is the list of URLs I'm trying to scrape; the crawl gets stuck and never finishes.

https://www.si.com/showcase/fitness/best-boxing-gloves
https://www.verywellfit.com/best-boxing-gloves-4158917
https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/
https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/
https://boxingglovesreviews.com/top-ten-boxing-gloves/
https://sweetscienceoffighting.com/best-boxing-gloves/
https://www.shape.com/fitness/gear/best-boxing-gloves
https://www.t3.com/features/best-boxing-gloves
https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros
https://www.youtube.com/watch?v=tWoucO2nIlE
https://expertboxing.com/best-boxing-gloves-review
https://thekarateblog.com/best-boxing-gloves/
https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why
https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131
https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/
https://myboxinglife.com/best-boxing-gloves-for-beginners/
https://www.youtube.com/watch?v=rHepbZOCxfY
https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/
https://www.hayabusafight.com/products/t3-boxing-gloves
https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work
https://revgear.com/gear/boxing-gloves/
https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/
https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751
https://cletoreyesboxing.com/
https://www.walmart.com/c/lists/top-rated-boxing-gloves
https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1
https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves
https://m.timesofindia.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms
https://www.everlast.com/fight/boxing/gloves
https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world
https://www.quora.com/What-companies-make-the-best-quality-boxing-gloves
https://www.titleboxing.com/gloves
https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms
https://skilspo.com/gb/blog/1_how-to-choose-the-best-boxing-gloves.html
https://bravose.com/collections/training-gloves
https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training
https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves
https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/
https://mmagearaddict.com/best-boxing-gloves/
https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan
https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf
https://yokkao.com/pages/boxing-gloves-guide
https://topboxer.com/collections/boxing-gloves
https://warriorpunch.com/best-boxing-gloves-for-beginners/
https://nypost.com/article/best-boxing-equipment-per-experts/
https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves
https://www.infinitudefight.com/buy-the-best-boxing-gloves/
https://cashkaro.com/blog/best-boxing-gloves-in-india/201246
https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473
https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/
https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves
https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners
https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2
https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/
https://fightquality.com/2018/10/12/best-custom-gloves/
https://fightingadvice.com/best-boxing-gloves-under-200/
https://glovesaddict.com/best-boxing-gloves-on-amazon/
https://www.k2promos.com/best-beginner-boxing-gloves/
https://absolutelymartialarts.com/best-boxing-gloves-beginners/
https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/
https://breakinggrips.com/best-kids-boxing-gloves/
https://www.proboxingequipment.com/Boxing-Gloves_c_196.html
https://www.mmahive.com/best-boxing-gloves-for-wrist-support/
https://bwsgym.com/etiquette-produit/best-boxing-gloves/
https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/
https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/
https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/
https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/
https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/
https://www.fightingking.com/boxing-gloves-brands-reviews/
https://www.mightyfighter.com/top-10-best-boxing-gloves/
https://www.stylecraze.com/articles/best-heavy-bag-gloves/
https://linealboxing.com/best-boxing-glove-brands-2022/
https://blackbeltmag.com/best-boxing-gloves
https://smartmma.com/best-boxing-gloves-for-heavy-bag/
https://www.fullcontactway.com/best-sparring-gloves/
https://www.attacktheback.com/best-cheap-boxing-gloves/
https://www.boxingear.com/shop-2/grant-gloves/lace-up/best-boxing-gloves-for-sparring-grant-gloves/
https://www.kreedon.com/best-boxing-gloves-brands/
https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves
https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners
https://www.fitnessbaddies.com/amateur-boxing-gloves/
https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/
https://boxingready.com/ringside/best-boxing-gloves-wrist-support/
https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews
https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/
https://thewiredshopper.com/best-boxing-gloves-to-buy/
https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html
https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/
https://www.lowkickmma.com/best-boxing-gloves/
https://www.sportsdirect.com/boxing/boxing-gloves
https://themmaguru.com/best-youth-boxing-gloves/
https://brawlbros.com/best-boxing-gloves-on-amazon/
https://thechamplair.com/sports/best-beginners-boxing-gloves/
https://www.dmarge.com/best-boxing-gloves
https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html
https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/
https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide
https://www.gearhungry.com/best-boxing-gloves/
https://hiconsumption.com/best-boxing-gloves/

Here is the log

/home/irfan/.pyenv/versions/TES/bin/python /home/irfan/PycharmProjects/TES-SAAS/tests/scprapping.py 
2023-05-05 06:52:32 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2023-05-05 06:52:32 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.9 (default, Jan 23 2022, 07:32:51) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.4.0-148-generic-x86_64-with-debian-bullseye-sid
2023-05-05 06:52:32 [scrapy.crawler] INFO: Overridden settings:
{'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True,
 'USER_AGENT': 'advertools/0.13.2'}
2023-05-05 06:52:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2023-05-05 06:52:32 [scrapy.extensions.telnet] INFO: Telnet Password: 2dcb88ca688b5e23
2023-05-05 06:52:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-05 06:52:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-05 06:52:33 [scrapy.core.engine] INFO: Spider opened
2023-05-05 06:52:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-05 06:52:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sweetscienceoffighting.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rollingstone.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [filelock] DEBUG: Attempting to acquire lock 140227121181328 on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Lock 140227121181328 acquired on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Attempting to release lock 140227121181328 on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [filelock] DEBUG: Lock 140227121181328 released on /home/irfan/.cache/python-tldextract/3.7.9.final__TES__f2586e__tldextract-3.3.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.t3.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.si.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearpatrol.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shape.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.verywellfit.com/robots.txt> (referer: None)
2023-05-05 06:52:33 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.si.com/showcase/fitness/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingglovesreviews.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.t3.com/features/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bleacherreport.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.si.com/showcase/fitness/best-boxing-gloves>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://expertboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.t3.com/features/best-boxing-gloves>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/robots.txt> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rollingstone.com/product-recommendations/lifestyle/best-boxing-gloves-1234690811/>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.verywellfit.com/best-boxing-gloves-4158917> (referer: None)
2023-05-05 06:52:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.verywellfit.com/best-boxing-gloves-4158917>
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.shape.com/fitness/gear/best-boxing-gloves> (referer: None)
2023-05-05 06:52:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thekarateblog.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.shape.com/fitness/gear/best-boxing-gloves>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sweetscienceoffighting.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://sweetscienceoffighting.com/best-boxing-gloves/>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxupnation.com/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gearpatrol.com/fitness/g40446087/best-boxing-gloves/>
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tabletenniscoach.me.uk/robots.txt> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=tWoucO2nIlE> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros> (referer: None)
2023-05-05 06:52:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingglovesreviews.com/top-ten-boxing-gloves/> (referer: None)
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=tWoucO2nIlE>
2023-05-05 06:52:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bleacherreport.com/articles/1286577-breaking-down-different-brands-of-boxing-gloves-worn-by-the-pros>
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxingglovesreviews.com/top-ten-boxing-gloves/>
2023-05-05 06:52:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why> (referer: None)
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxupnation.com/blogs/news/my-top-5-favorite-boxing-glove-brands-and-why>
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wayofmartialarts.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thekarateblog.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://myboxinglife.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://thekarateblog.com/best-boxing-gloves/>
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://expertboxing.com/best-boxing-gloves-review> (referer: None)
2023-05-05 06:52:36 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.dickssportinggoods.com/robots.txt> (referer: None)
2023-05-05 06:52:36 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tabletenniscoach.me.uk/sport-equipment-guides/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://expertboxing.com/best-boxing-gloves-review>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=rHepbZOCxfY> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hayabusafight.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://revgear.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=rHepbZOCxfY>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.dickssportinggoods.com/o/best-boxing-gloves-for-pad-work>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://myboxinglife.com/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://myboxinglife.com/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.joinfightcamp.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ringsport.com.au/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:37 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:37 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131> (referer: None) ['partial']
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <429 https://www.amazon.com/Best-Sellers-Boxing-Training-Gloves/zgbs/sporting-goods/3400131>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://made4fighters.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1> (referer: None)
2023-05-05 06:52:38 [seo_spider] ERROR: Invalid control character at: line 5 column 19 (char 78) 200 https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 5 column 19 (char 78)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ringsport.com.au/blogs/ringsport-blog/boxing-glove-guide-part-1>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.joinfightcamp.com/boxing-equipment/how-to-choose-the-best-boxing-gloves-for-beginners/>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msmfightshop.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ebay.com/t/Boxing-Gloves/30102/bn_1943751>
2023-05-05 06:52:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://wayofmartialarts.com/best-boxing-gloves-worth-your-money/>
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves> (referer: None)
2023-05-05 06:52:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.quora.com/robots.txt> (referer: None)
2023-05-05 06:52:38 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.quora.com/What-companies-make-the-best-quality-boxing-gloves>
2023-05-05 06:52:39 [seo_spider] ERROR: Invalid control character at: line 20 column 226 (char 698) 200 https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 20 column 226 (char 698)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://made4fighters.com/blogs/default-blog/top-womens-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hayabusafight.com/products/t3-boxing-gloves> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.everlast.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesboxing.com/> (referer: None)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hayabusafight.com/products/t3-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.msmfightshop.com/blogs/news/top-3-boxing-gloves-in-the-world>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cletoreyesboxing.com/>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://revgear.com/gear/boxing-gloves/> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.timesofindia.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/c/lists/top-rated-boxing-gloves> (referer: None)
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://revgear.com/gear/boxing-gloves/>
2023-05-05 06:52:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr> from <GET https://m.timesofindia.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms>
2023-05-05 06:52:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.walmart.com/c/lists/top-rated-boxing-gloves>
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.titleboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bravose.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sanabulsports.com/robots.txt> (referer: None)
2023-05-05 06:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://anthonyjoshua.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.everlast.com/fight/boxing/gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://sanabulsports.com/blogs/news/the-best-boxing-gloves-for-training>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nakmuaywholesale.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.everlast.com/fight/boxing/gloves>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mmagearaddict.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bravose.com/collections/training-gloves> (referer: None)
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://anthonyjoshua.com/blogs/news/anthony-joshua-how-to-choose-the-best-boxing-gloves>
2023-05-05 06:52:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bravose.com/collections/training-gloves>
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://issuu.com/robots.txt> (referer: None)
2023-05-05 06:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tufwear-germany.de/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.titleboxing.com/gloves> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yokkao.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.titleboxing.com/gloves>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mmagearaddict.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://topboxer.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-professionals/articleshow/97128538.cms>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://tufwear-germany.de/en/blogs/news/was-sind-die-besten-boxhandschuhe-der-boxhandschuh-guide-fur-deinen-kauf>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://nypost.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://mmagearaddict.com/best-boxing-gloves/>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nakmuaywholesale.com/top-3-boxing-gloves-for-small-hands-2022/>
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://issuu.com/punchequipment/docs/get_the_best_boxing_gloves_for_a_winning_performan>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://warriorpunch.com/robots.txt> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://timesofindia.indiatimes.com/most-searched-products/sports-equipment/boxing-gloves-for-beginners-best-picks/articleshow/97912567.cms?from=mdr>
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yokkao.com/pages/boxing-gloves-guide> (referer: None)
2023-05-05 06:52:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://topboxer.com/collections/boxing-gloves> (referer: None)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://yokkao.com/pages/boxing-gloves-guide>
2023-05-05 06:52:41 [seo_spider] ERROR: Invalid control character at: line 15 column 21 (char 385) 200 https://topboxer.com/collections/boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 15 column 21 (char 385)
2023-05-05 06:52:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://topboxer.com/collections/boxing-gloves>
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://nypost.com/article/best-boxing-equipment-per-experts/> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kdvr.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cashkaro.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://nypost.com/article/best-boxing-equipment-per-experts/>
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://origympersonaltrainercourses.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.popsugar.com/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.expertreviews.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cashkaro.com/blog/best-boxing-gloves-in-india/201246> (referer: None)
2023-05-05 06:52:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cashkaro.com/blog/best-boxing-gloves-in-india/201246>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://warriorpunch.com/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://warriorpunch.com/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.popsugar.com/fitness/Best-Boxing-Gloves-Women-45472473>
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://kdvr.com/reviews/br/sports-fitness-br/boxing-br/best-title-boxing-gloves/>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://branded.disruptsports.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.expertreviews.co.uk/health-and-grooming/1407584/best-boxing-gloves>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightquality.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://branded.disruptsports.com/blogs/blog/which-boxing-gloves-to-buy-for-beginners>
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.flipkart.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.infinitudefight.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 35 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2023-05-05 06:52:43 [protego] DEBUG: Rule at line 69 without any user agent to enforce it on.
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://absolutelymartialarts.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.k2promos.com/robots.txt> (referer: None)
2023-05-05 06:52:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.infinitudefight.com/buy-the-best-boxing-gloves/> (referer: None)
2023-05-05 06:52:44 [seo_spider] ERROR: Expecting value: line 1 column 1 (char 0) 200 https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://origympersonaltrainercourses.co.uk/blog/best-boxing-gloves>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightingadvice.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.infinitudefight.com/buy-the-best-boxing-gloves/>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.proboxingequipment.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.proboxingequipment.com/Boxing-Gloves_c_196.html> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://glovesaddict.com/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.proboxingequipment.com/Boxing-Gloves_c_196.html>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://absolutelymartialarts.com/best-boxing-gloves-beginners/> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/> (referer: None)
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.healthyprinciples.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://absolutelymartialarts.com/best-boxing-gloves-beginners/>
2023-05-05 06:52:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/amateur_boxing/comments/2ykhau/the_top_15_best_boxing_gloves_ranking_the_best/>
2023-05-05 06:52:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mmahive.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bwsgym.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightquality.com/2018/10/12/best-custom-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fightingadvice.com/best-boxing-gloves-under-200/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.k2promos.com/best-beginner-boxing-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fightquality.com/2018/10/12/best-custom-gloves/>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dontwasteyourmoney.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://fightingadvice.com/best-boxing-gloves-under-200/>
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.k2promos.com/best-beginner-boxing-gloves/>
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.flipkart.com/sports/boxing/boxing-gloves/pr?sid=abc%2Cppq%2Cbb6&page=2>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bwsgym.com/etiquette-produit/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://middleeasy.com/robots.txt> (referer: None)
2023-05-05 06:52:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bwsgym.com/etiquette-produit/best-boxing-gloves/>
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/> (referer: None)
2023-05-05 06:52:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bestproducts.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.healthyprinciples.co.uk/best-boxing-gloves-for-kids-review/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mmahive.com/best-boxing-gloves-for-wrist-support/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.momjunction.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mmahive.com/best-boxing-gloves-for-wrist-support/>
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dontwasteyourmoney.com/products/hawk-sports-heavy-bag-boxing-gloves/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://glovesaddict.com/best-boxing-gloves-on-amazon/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://breakinggrips.com/robots.txt> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://glovesaddict.com/best-boxing-gloves-on-amazon/>
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://middleeasy.com/reviews/gear/gloves-cardio-kickboxing/>
2023-05-05 06:52:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/robots.txt> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:46 [py.warnings] WARNING: /home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/scrapy/core/engine.py:276: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
  return self.download(result, spider) if isinstance(result, Request) else result

2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/> (referer: None)
2023-05-05 06:52:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.momjunction.com/articles/best-boxing-gloves-for-kids_00514921/>
2023-05-05 06:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/robots.txt> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://breakinggrips.com/best-kids-boxing-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mightyfighter.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bestproducts.com/fitness/equipment/g1009/boxing-gloves-mitts/>
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://breakinggrips.com/best-kids-boxing-gloves/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.stylecraze.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.fightingking.com/robots.txt> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.fightingking.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 6 without any user agent to enforce it on.
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://linealboxing.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wbcme.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 1 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blackbeltmag.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mightyfighter.com/top-10-best-boxing-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://smartmma.com/robots.txt> (referer: None)
2023-05-05 06:52:47 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://linealboxing.com/best-boxing-glove-brands-2022/> (referer: None)
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mightyfighter.com/top-10-best-boxing-gloves/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.stylecraze.com/articles/best-heavy-bag-gloves/> (referer: None)
2023-05-05 06:52:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 2 times): 429 Unknown Status
2023-05-05 06:52:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://linealboxing.com/best-boxing-glove-brands-2022/>
2023-05-05 06:52:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/> (referer: None)
2023-05-05 06:52:48 [seo_spider] ERROR: Invalid control character at: line 28 column 64 (char 1740) 200 https://www.stylecraze.com/articles/best-heavy-bag-gloves/
Traceback (most recent call last):
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 761, in parse
    response.css('script[type="application/ld+json"]::text').getall()]
  File "/home/irfan/.pyenv/versions/TES/lib/python3.7/site-packages/advertools/spider.py", line 760, in <listcomp>
    ld = [json.loads(s.replace('\r', '')) for s in
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/irfan/.pyenv/versions/3.7.9/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 28 column 64 (char 1740)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.stylecraze.com/articles/best-heavy-bag-gloves/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kreedon.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.kreedon.com/best-boxing-gloves-brands/>
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.wbcme.co.uk/ringside/best-boxing-gloves-for-beginners/>
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (failed 3 times): 429 Unknown Status
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.fightingking.com/boxing-gloves-brands-reviews/> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.attacktheback.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <429 https://www.fightingking.com/boxing-gloves-brands-reviews/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingear.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blackbeltmag.com/best-boxing-gloves> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blackbeltmag.com/best-boxing-gloves>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesuk.com/robots.txt> (referer: None)
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.attacktheback.com/best-cheap-boxing-gloves/> (referer: None)
2023-05-05 06:52:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.attacktheback.com/best-cheap-boxing-gloves/>
2023-05-05 06:52:48 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://sites.google.com/view> from <GET https://www.boxingear.com/shop-2/grant-gloves/lace-up/best-boxing-gloves-for-sparring-grant-gloves/>
2023-05-05 06:52:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fullcontactway.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fitnessbaddies.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bestreviews.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingison.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cletoreyesuk.com/blogs/news/what-are-the-best-boxing-gloves-for-beginners>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://thewiredshopper.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 38 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 50 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 51 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 52 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 54 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 55 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 56 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 57 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 58 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 59 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 60 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 61 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 67 without any user agent to enforce it on.
2023-05-05 06:52:49 [protego] DEBUG: Rule at line 72 without any user agent to enforce it on.
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msn.com/robots.txt> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fullcontactway.com/best-sparring-gloves/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://thewiredshopper.com/best-boxing-gloves-to-buy/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.fullcontactway.com/best-sparring-gloves/>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://smartmma.com/best-boxing-gloves-for-heavy-bag/> (referer: None)
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <403 https://thewiredshopper.com/best-boxing-gloves-to-buy/>
2023-05-05 06:52:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://smartmma.com/best-boxing-gloves-for-heavy-bag/>
2023-05-05 06:52:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gloveworx.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.msn.com/en-gb/lifestyle/rf-best-products-uk/best-boxing-gloves-for-men-12oz-reviews>
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bestreviews.com/sports-fitness/boxing/best-boxing-gloves>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fitnessbaddies.com/amateur-boxing-gloves/> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.standard.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sites.google.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.fitnessbaddies.com/amateur-boxing-gloves/>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pragmaticmom.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lowkickmma.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html> (referer: None)
2023-05-05 06:52:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.standard.co.uk/shopping/esbest/health-fitness/fitness-wear/best-womens-boxing-gloves-for-beginners-a4272321.html>
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingready.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sportsdirect.com/robots.txt> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lowkickmma.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://sites.google.com/view> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lowkickmma.com/best-boxing-gloves/>
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <404 https://sites.google.com/view>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://themmaguru.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dmarge.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.pragmaticmom.com/2019/11/best-boxing-gloves-for-women/>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dmarge.com/best-boxing-gloves> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.dmarge.com/best-boxing-gloves>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sportsdirect.com/boxing/boxing-gloves> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.boxingison.com/best-boxing-gloves-for-training-and-sparring/>
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thechamplair.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://brawlbros.com/robots.txt> (referer: None)
2023-05-05 06:52:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://themmaguru.com/best-youth-boxing-gloves/> (referer: None)
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <403 https://www.sportsdirect.com/boxing/boxing-gloves>
2023-05-05 06:52:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gloveworx.com/blog/how-choose-best-boxing-gloves-beginners/>
2023-05-05 06:52:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://themmaguru.com/best-youth-boxing-gloves/>
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nytimes.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearhungry.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hungry4fitness.co.uk/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://findbestboxinggloves.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hiconsumption.com/robots.txt> (referer: None)
2023-05-05 06:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thechamplair.com/sports/best-beginners-boxing-gloves/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://thechamplair.com/sports/best-beginners-boxing-gloves/>
2023-05-05 06:52:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://brawlbros.com/best-boxing-gloves-on-amazon/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://brawlbros.com/best-boxing-gloves-on-amazon/>
2023-05-05 06:52:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hiconsumption.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://hiconsumption.com/best-boxing-gloves/>
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide> (referer: None)
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gearhungry.com/best-boxing-gloves/> (referer: None)
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hungry4fitness.co.uk/post/10-best-boxing-mitts-an-ultimate-guide>
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gearhungry.com/best-boxing-gloves/>
2023-05-05 06:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://boxingready.com/ringside/best-boxing-gloves-wrist-support/> (referer: None)
2023-05-05 06:52:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://boxingready.com/ringside/best-boxing-gloves-wrist-support/>
2023-05-05 06:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html> (referer: None)
2023-05-05 06:52:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nytimes.com/video/style/1194840632119/gear-test-boxing-gloves.html>
2023-05-05 06:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/> (referer: None)
2023-05-05 06:52:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://findbestboxinggloves.com/best-boxing-gloves-for-heavy-bag-the-complete-guide/>
2023-05-05 06:53:33 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 196 pages/min), scraped 97 items (at 97 items/min)
2023-05-05 06:54:33 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 0 pages/min), scraped 97 items (at 0 items/min)
2023-05-05 06:54:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://skilspo.com/robots.txt> (failed 1 times): TCP connection timed out: 110: Connection timed out.

Need some way to rate limit requests for sitemap_to_df

When trying to retrieve a large recursive sitemap, I get HTTP error 429 (Too Many Requests). There currently seems to be no way to limit the number of requests, specify a cooldown period, or slow down the request rate, so the function never retrieves anything.
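
Until something like this exists in the library, one possible workaround is to download the sitemap files yourself with a retry-and-wait loop. The sketch below is not part of advertools: `fetch_with_backoff` and its parameters are hypothetical names, and it assumes the server signals throttling with a plain 429 and, optionally, a numeric `Retry-After` header.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, fetch=None, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Fetch `url`, waiting and retrying whenever the server answers 429."""
    if fetch is None:
        # Default fetcher: a plain urllib GET that returns the body as bytes.
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()

    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only 429 is retried; other errors, or a 429 on the last attempt, propagate.
            if err.code != 429 or attempt == max_retries:
                raise
            # Prefer a numeric Retry-After hint, otherwise back off exponentially.
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

Calling `fetch_with_backoff(sitemap_url)` for each sitemap in the index, with a short `time.sleep` between calls, keeps the request rate low at the cost of a slower download.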

Getting NaN values for serp_goog function

I tried running the following query:
df = adv.serp_goog(q=search_term, cx=cse_id, key=api_key)

However, it just returns a bunch of NaN values for rank, title, snippet, displaylink, link

Questions about custom spider

Is there a way to extract only the information I want? By default the crawler extracts too much; if the web page is large, the JSON lines file becomes very large.
For example, I want to extract just the title.
Do you plan to add a feature for creating sitemaps? I have a lot of big websites that need a sitemap.
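
As a stopgap, the output file can be trimmed after the crawl. The sketch below does not change what the crawler extracts; it only copies a JSON lines file while keeping selected keys. `keep_fields` is a hypothetical helper, and the field names `url` and `title` are assumed to be present in the crawl output.

```python
import json


def keep_fields(jl_path, out_path, fields=("url", "title")):
    """Copy a JSON lines file, keeping only the selected keys of each record."""
    kept = 0
    with open(jl_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only the requested keys that actually exist in this record.
            slim = {k: record[k] for k in fields if k in record}
            dst.write(json.dumps(slim, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

Because the file is processed one line at a time, memory use stays small even when the original file is very large.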

Stop words list very limited

For the function word_frequency mainly, as the default value for rm_words param.
Get the full list of stopwords for several languages, to be imported and potentially used in function calls, or separately, e.g.:

import advertools as adv
adv.stop_words['en'] 
adv.stop_words['fr']

File not found on crawl method

  Hi all,

I'm following the documentation with this line of code

adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)

But it returns this error:

FileNotFoundError: [WinError 2] The system cannot find the file specified

Even though my directory looks like this:

- SEO.py
- my_output_file.jl

Here is the complete trace:

Traceback (most recent call last):
  File "c:/Users/Henrique/Desktop/SEO/SEO.py", line 6, in <module>
    adv.crawl('https://example.com', 'my_output_file.jl', follow_links=True)
  File "C:\Users\Henrique\AppData\Roaming\Python\Python38\site-packages\advertools\spider.py", line 971, in crawl
    subprocess.run(command)
  File "C:\Python38\lib\subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Python38\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Python38\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

As you can see it doesn't specify which file was not found but I assume it is the output file.

Any help is greatly appreciated!

Originally posted by @henriquearaujo-98 in #247
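
A note on the trace above: it ends inside `subprocess.run(command)`, and on Windows `[WinError 2]` at that point usually means the program being launched could not be found, rather than a data file. Assuming the command starts the `scrapy` executable (the trace does not show this), a quick check is whether it is on `PATH`. `find_executable` is a hypothetical helper name.

```python
import shutil


def find_executable(name="scrapy"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("scrapy was not found on PATH; subprocess.run cannot launch it.")
    else:
        print(f"scrapy found at {path}")
```

If the check prints that nothing was found, adding the Python `Scripts` directory to `PATH`, or installing inside an activated virtual environment, may resolve the error.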

Python 3.10/11 SSL: SSLV3_ALERT_HANDSHAKE_FAILURE

Not fatal, but just an issue note:

Seems there is an issue with Python 3.10/3.11:
python/cpython#103142

Mac Intel

and containers using
FROM python:3.11-slim
FROM python:3.10-slim

This url:

https://opentopography.org/sitemap.xml

gets redirected to:

https://portal.opentopography.org/sitemap.xml

If I just use https://portal.opentopography.org/sitemap.xml it works fine.

File "/Users//development/dev_earthcube/earthcube_utilities/venv311/lib/python3.11/site-packages/advertools/sitemaps.py", line 491, in sitemap_to_df
    xml_text = urlopen(Request(sitemap_url, headers=headers))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1002)>

serp_goog next page

Hi,
I am having problems with the start parameter of serp_goog. I want to query from the first page to the fourth.
Please advise.
Thank you.
Package versions:
pandas
In [5]: print(pd.__version__)
0.25.0
advertools
In [6]: print(adv.__version__)
0.7.3
----> 4 next_result_1=adv.serp_goog(cx=cx, key=key, q=queri, gl=['id'],start=[1, 11, 21])
KeyError: "['start'] not in index"

C:\Anaconda3\lib\site-packages\advertools\serp.py in serp_goog(q, cx, key, c2coff, cr, dateRestrict, exactTerms, excludeTerms, fileType, filter, gl, highRange, hl, hq, imgColorType, imgDominantColor, imgSize, imgType, linkSite, lowRange, lr, num, orTerms, relatedSite, rights, safe, searchType, siteSearch, siteSearchFilter, sort, start)
700 specified_cols)
701 non_ordered = result_df.columns.difference(set(ordered_cols))
--> 702 final_df = result_df[ordered_cols + list(non_ordered)]
703 return final_df
704

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2979 if is_iterator(key):
2980 key = list(key)
-> 2981 indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
2982
2983 # take() does not accept boolean indexers

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
1269 # When setting, missing keys are not allowed, even with .loc:
1270 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1271 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
1272 else:
1273 try:

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1076
1077 self._validate_read_indexer(
-> 1078 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1079 )
1080 return keyarr, indexer

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1169 if not (self.name == "loc" and not raise_missing):
1170 not_found = list(set(key) - set(ax))
-> 1171 raise KeyError("{} not in index".format(not_found))
1172
1173 # we skip the warning on Categorical/Interval

KeyError: "['start'] not in index"
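
For reference, the `start` values for pages one to four can be computed rather than typed by hand. This is a minimal sketch assuming 10 results per page, which matches the `[1, 11, 21]` pattern in the call above; `page_starts` is a hypothetical helper, and the sketch does not address the `KeyError` itself.

```python
def page_starts(pages, per_page=10):
    """Return the 1-based `start` offset for each results page."""
    return [1 + per_page * i for i in range(pages)]


print(page_starts(4))  # [1, 11, 21, 31]
```

The resulting list could then be passed as `start=page_starts(4)` in the same way as the list in the original call.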

How to get initial url in output.jl ?

Hello Elias,

Advertools is a really great package ! Many thanks for the splendid work.

Nevertheless I have a little problem for which I have not found a good workaround (except crawling urls one by one)

I was wondering how to get the initial url that is crawled.
Advertools returns the url after redirection and not the url before. So when you want to merge data it can become tricky if you have no "reference".

Can we also imagine to "inject" specific user params as string to get them in the output ? I've tried to do so in the xpath_selectors, but completely failed.

Many thanks,

Caro
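
One possible way to recover the starting URL, sketched under assumptions: it supposes each output record has a `url` key and, for redirected pages, a `redirect_urls` key holding the chain of earlier URLs, either as a list or as a single string joined with `@@`. Both the column name and the separator are assumptions to check against your own output file, and `initial_url` is a hypothetical helper.

```python
def initial_url(record, sep="@@"):
    """Return the first URL of the redirect chain, or the final URL if none."""
    redirects = record.get("redirect_urls")
    # A joined string such as "a@@b" is split back into its parts.
    if isinstance(redirects, str) and redirects:
        redirects = redirects.split(sep)
    if isinstance(redirects, (list, tuple)) and redirects:
        return redirects[0]
    # No usable redirect information (missing, NaN, empty): keep the crawled URL.
    return record.get("url")
```

Applied to every record, this gives a column that can serve as the merge key against the original URL list.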

Bypass a cookie wall

Hello Elias,
I had already posted the topic some time ago on #328, but I don't think you had seen it.

Thank you for the fantastic work you're doing with advertools.

However, I have an issue with websites that have a cookie wall, like on https://www.interflora.fr/p/roses-passion.

When I run view(response) in the scrapy shell, I can clearly see that I am blocked.
There is absolutely no element like the title, the button or body_text

So, I was wondering if you might have a fantastic idea to work around this issue.

Thanks a million !

Pandas FutureWarning "fillna" in url_to_df()

Method: adv.url_to_df()

advertools/urlytics.py:198: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
.assign(last_dir=dirs_df

crawl dataFrame - jsonld objects

I've made it as far as examining the dataFrame returned from a crawl!

Looking through the docs I was expecting separately labelled columns for the various bits of jsonld data. Instead I'm seeing a column with lots of Objects within it. Is this a quirk of viewing a DataFrame within VSCode? Or are the docs out of date? Or something else? I'm a little stuck!

Thanks!
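
If the column really does hold nested objects, they can be flattened into separately labelled keys by hand. The sketch below is a generic recursive flattener, not advertools' own logic; `flatten`, the `jsonld` prefix and the `_` separator are illustrative choices and are not claimed to match the column names in the docs.

```python
def flatten(obj, prefix="jsonld", sep="_"):
    """Flatten nested dicts and lists into a single-level dict of scalar values."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{sep}{key}", sep))
    elif isinstance(obj, list):
        # List items are labelled by their position.
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{sep}{i}", sep))
    else:
        flat[prefix] = obj
    return flat
```

Each flattened dict can then become one row of a new DataFrame, for example by passing a list of them to `pd.DataFrame`.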
