big-data-fc / scraper
sofifa.com scraper, built to scrape data needed for our project of Big Data Computing 2021-22 at Sapienza University of Rome
License: MIT License
Hi,
I like your scraper the most by far because it is quite easy to use. However, the potential value is missing from the output. Is there any way to add it?
Thanks in advance
Any suggestions on how to overcome an HTTP 429 code for too many requests?
2023-03-18 20:35:58 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): sofifa.com:443
2023-03-18 20:35:58 [urllib3.connectionpool] DEBUG: https://sofifa.com:443 "GET /team/110404/banfield/?r=200025 HTTP/1.1" 429 None
2023-03-18 20:35:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=200025&offset=1380> (referer: https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=200025&offset=1320)
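One possible way to approach the 429, offered as a sketch rather than the project's actual code: the rate-limited request in the log is reported by urllib3, which suggests the team page may be fetched outside Scrapy's own downloader, so Scrapy's throttling settings might not cover it. In that case the call can be wrapped in a retry loop that waits longer after each 429 and honors a numeric `Retry-After` header when the server sends one. The names `backoff_delay` and `fetch_with_retry`, the `(status, retry_after, body)` tuple shape, and the delay values are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 2.0, cap: float = 120.0) -> float:
    """Seconds to wait before retrying, for a 0-based attempt number."""
    # Prefer the server's hint when it is a plain number of seconds.
    # (Retry-After may also be an HTTP date; that form is ignored here.)
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    # Otherwise double the wait on every attempt, up to the cap.
    return min(base * (2 ** attempt), cap)


def fetch_with_retry(fetch: Callable[[], Tuple[int, Optional[str], str]],
                     max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until it stops answering 429.

    fetch is a hypothetical callable returning (status, retry_after, body).
    """
    for attempt in range(max_attempts):
        status, retry_after, body = fetch()
        if status != 429:
            return body
        sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError("still rate-limited after %d attempts" % max_attempts)
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting. Lowering the overall request rate is likely to help as well.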
The scraper should get data from sofifa for years [12, 13, 14]
Is there a way to extend beyond the 60 entries on the players base url? What if I want to scrape every player?
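A rough sketch of one way to page through the listing, assuming it is paginated by an `offset` query parameter in steps of 60 (the URLs in the error log above end in `offset=1320` and `offset=1380`, which fits that pattern). The helper name `page_url` and the `PAGE_SIZE` constant are hypothetical and not part of the project.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PAGE_SIZE = 60  # assumption: one listing page holds 60 players


def page_url(base_url: str, page: int, page_size: int = PAGE_SIZE) -> str:
    """Return base_url with its offset parameter set for a 0-based page."""
    parts = urlsplit(base_url)
    # Keep every existing query parameter except a previous offset.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "offset"]
    query.append(("offset", str(page * page_size)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A spider could request `page_url(base, 0)`, `page_url(base, 1)`, and so on, and stop at the first page that yields no player rows.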
Sorry for opening another issue, but the old one is closed.
In the meantime, thank you as always for your availability.
I first tried adding your portion of code immediately below and ran the command, but it reported that "player_url" was not declared.
So I went to the top of the code, where team_url is defined, and also added:
"player_url": player.css("td:nth-child(2) a::attr(href)").get(),
which was in the item list immediately below.
In the item I also added this:
"player_meta": self.parse_player(player_url),
and ran the script, but unfortunately the player_meta column is completely empty.
Hi again. Unfortunately I'm not experienced in writing scripts like this, so I'm here to ask you about yours, which I consider the best there is!
Your script exports the data shown in the sofifa list pages, but not the values inside each individual player's page. Those pages contain a value that is very important to me, namely the YEAR OF BIRTH. The sofifa site is not precise with the ages shown in its table, which can be off by as much as two years, but with the actual date of birth I could fix this with an Excel formula after exporting. Do you have any ideas? Thank you very much in advance if you find the time even just to reply.
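If the player page text can be fetched, the birth date could be pulled out with a small parser along these lines. This is only a sketch: the text format `"36y.o. (Jun 24, 1987) 170cm 72kg"` is an assumption about how the player page presents the date, and `parse_birth_date` is a hypothetical helper name.

```python
import re
from datetime import date
from typing import Optional

# English month abbreviations, mapped by hand so the result does not
# depend on the machine's locale.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Assumed shape of the date inside the player info text: "(Jun 24, 1987)".
BIRTH_RE = re.compile(
    r"\((?P<mon>[A-Z][a-z]{2}) (?P<day>\d{1,2}), (?P<year>\d{4})\)")


def parse_birth_date(info_text: str) -> Optional[date]:
    """Return the date found in info_text, or None if the pattern is absent."""
    match = BIRTH_RE.search(info_text)
    if match is None or match["mon"] not in MONTHS:
        return None
    return date(int(match["year"]), MONTHS[match["mon"]], int(match["day"]))
```

The `.year` attribute of the result gives the birth year, which could be written to its own output column.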
Hi, congratulations on your scraper, which works like a dream. But the dream was shattered when I noticed that, of all the values, two very important fields are unfortunately missing: the overall and the potential. There is BOV, i.e. BEST OVERALL, but it is not useful, since it is the top value a player has ever reached and often does not match the current one.
I checked the code, but I cannot manage to add the field. In Sofifa.py I added the line:
"ove": player.css("td.col-oa::text").get(),
I verified the td in the site's source, but unfortunately the field stays empty.
What am I doing wrong, or what can be done? Thanks.
Hi, unfortunately today I wanted to scrape the data again, but at the moment the script doesn't seem to extract anything anymore.
Here is what it prints:
`2024-01-04 16:10:59 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: sofifa)
2024-01-04 16:10:59 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr 5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2024-01-04 16:10:59 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 32,
'BOT_NAME': 'sofifa',
'NEWSPIDER_MODULE': 'sofifa.spiders',
'SPIDER_MODULES': ['sofifa.spiders'],
'USER_AGENT': 'sofifa (+http://www.gof.com)'}
2024-01-04 16:10:59 [py.warnings] WARNING: /Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2024-01-04 16:10:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet Password: 58a2503b8ae9f86c
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2024-01-04 16:10:59 [sofifa] INFO: Scraping year 240019
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-04 16:10:59 [scrapy.core.engine] INFO: Spider opened
2024-01-04 16:10:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-04 16:10:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B%5D=pi&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=bo&showCol%5B%5D=bp&showCol%5B%5D=gu&showCol%5B%5D=jt&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc&showCol%5B%5D=ta&showCol%5B%5D=cr&showCol%5B%5D=fi&showCol%5B%5D=he&showCol%5B%5D=sh&showCol%5B%5D=vo&showCol%5B%5D=ts&showCol%5B%5D=dr&showCol%5B%5D=cu&showCol%5B%5D=fr&showCol%5B%5D=lo&showCol%5B%5D=bl&showCol%5B%5D=to&showCol%5B%5D=ac&showCol%5B%5D=sp&showCol%5B%5D=ag&showCol%5B%5D=re&showCol%5B%5D=ba&showCol%5B%5D=tp&showCol%5B%5D=so&showCol%5B%5D=ju&showCol%5B%5D=st&showCol%5B%5D=sr&showCol%5B%5D=ln&showCol%5B%5D=te&showCol%5B%5D=ar&showCol%5B%5D=in&showCol%5B%5D=po&showCol%5B%5D=vi&showCol%5B%5D=pe&showCol%5B%5D=cm&showCol%5B%5D=td&showCol%5B%5D=ma&showCol%5B%5D=sa&showCol%5B%5D=sl&showCol%5B%5D=tg&showCol%5B%5D=gd&showCol%5B%5D=gh&showCol%5B%5D=gk&showCol%5B%5D=gp&showCol%5B%5D=gr&showCol%5B%5D=tt&showCol%5B%5D=bs&showCol%5B%5D=wk&showCol%5B%5D=sk&showCol%5B%5D=aw&showCol%5B%5D=dw&showCol%5B%5D=ir&showCol%5B%5D=pac&showCol%5B%5D=sho&showCol%5B%5D=pas&showCol%5B%5D=dri&showCol%5B%5D=def&showCol%5B%5D=phy&r=240019> (referer: None)`
And it keeps showing the same screen without downloading any data. What could be the cause?
The script works great, and the missing Overall value has been resolved too.
But now, for some reason, when I launch the scraper it suddenly downloads only the first 60 players, in practice just the first page. I am attaching the output returned by the terminal, which will surely be clearer to you.