scraper's People

Contributors

chayhuixiang, davquar, dependabot[bot]

scraper's Issues

Player potential?

Hi,

By far I like your scraper the most because it is quite easy to use. However, the potential value is missing from the output. Is there any way to add it?

Thanks in advance

HTTP 429

Any suggestions on how to overcome an HTTP 429 code for too many requests?

2023-03-18 20:35:58 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): sofifa.com:443
2023-03-18 20:35:58 [urllib3.connectionpool] DEBUG: https://sofifa.com:443 "GET /team/110404/banfield/?r=200025 HTTP/1.1" 429 None
2023-03-18 20:35:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=200025&offset=1380> (referer: https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=200025&offset=1320)
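
A possible starting point, not a confirmed fix: slow the crawl down and back off when a 429 arrives. The setting names below are real Scrapy settings, but the values are guesses to tune. In the log the 429 is reported by urllib3 for a team page, which suggests that request is made outside Scrapy's downloader; if so, Scrapy's throttling would not cover it, and a manual wait such as the hypothetical `backoff_delay` helper would be needed around that call.

```python
import random

# Real Scrapy setting names; the values are assumptions to tune, not tested
# against sofifa. Scrapy's default RETRY_HTTP_CODES already includes 429.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_MAX_DELAY": 120,
    "RETRY_TIMES": 5,
}


def backoff_delay(attempt, retry_after=None, base=2.0, cap=300.0):
    """Return the seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds.
            return min(float(retry_after), cap)
        except ValueError:
            pass  # header held an HTTP date; fall back to exponential backoff
    delay = min(base * 2 ** attempt, cap)
    # Add up to 25% jitter so parallel workers do not retry in lockstep.
    return delay + random.uniform(0, delay / 4)
```

For requests made with a plain HTTP client, the caller would sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` seconds before trying again.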

Add scraper

The scraper should get data from sofifa for the years [12, 13, 14].
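
A minimal sketch of how a year could be turned into the `r` query parameter, assuming the values seen in the logs on this page (`r=200025`, `r=240019`) are a two-digit year followed by a four-digit update number. That reading is an inference, and `roster_param` is a hypothetical helper. The update numbers for years 12, 13 and 14 do not appear here and would have to be looked up on the site.

```python
def roster_param(year, update):
    """Build the `r` value as YY followed by a zero-padded update number.

    Assumption: this matches sofifa's scheme, inferred from 200025 and 240019.
    """
    if not 0 <= year <= 99:
        raise ValueError("year must be a two-digit game year, e.g. 20")
    if not 0 <= update <= 9999:
        raise ValueError("update must fit in four digits")
    return f"{year:02d}{update:04d}"


# Only the two values that appear in the logs are checked here.
print(roster_param(20, 25))
print(roster_param(24, 19))
```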

60 entries max?

Is there a way to extend beyond the 60 entries on the players base url? What if I want to scrape every player?
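
One way to go past the first page, sketched under assumptions: the log in the HTTP 429 issue shows consecutive requests with `offset=1320` and `offset=1380`, so the list appears to be paged in steps of 60 through an `offset` query parameter. The helper name `page_urls`, the reduced parameter set, and the `total_players` argument are illustrative only.

```python
from urllib.parse import urlencode

BASE_URL = "https://sofifa.com/players"
PAGE_SIZE = 60  # step between the offsets seen in the logs


def page_urls(total_players, roster="200025"):
    """Yield one list-page URL per block of PAGE_SIZE players."""
    for offset in range(0, total_players, PAGE_SIZE):
        params = {"col": "oa", "sort": "desc", "r": roster, "offset": offset}
        yield f"{BASE_URL}?{urlencode(params)}"


for url in page_urls(150):
    print(url)
```

When the total is not known in advance, an alternative is to keep requesting the next offset until a page comes back with fewer than 60 rows.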

Birth date export

Sorry for opening another issue, but the old one is closed.

First of all, thank you as always for your availability.

I first tried adding your snippet right below and ran the command, but it reported that "player_url" was not declared.

So I went to the top of the code, where team_url is defined, and also added:
"player_url": player.css("td:nth-child(2) a::attr(href)").get(),
which was in the item list immediately below.

I also added this to the item:
"player_meta": self.parse_player(player_url),

I then ran the script, but unfortunately the player_meta column is completely empty.

I attach a screenshot of the entire code.
[image]
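
A likely cause, offered as a sketch rather than a diagnosis of the actual code in the screenshot: a Scrapy callback such as `parse_player` needs the downloaded player page, so calling it directly with a URL cannot return that page's data. The usual pattern is to yield a follow-up request and finish the item in the callback. `response.follow` and `cb_kwargs` are real Scrapy features. The class name and the `div.meta::text` selector are hypothetical, and in the real project these methods would live on the spider class.

```python
class PlayerMetaSketch:
    """Callback methods only; no Scrapy import is needed to read the pattern."""

    def parse_row(self, response, player):
        # Build the item from the table row as before.
        item = {
            "player_url": player.css("td:nth-child(2) a::attr(href)").get(),
        }
        # Schedule a request for the player page instead of calling
        # self.parse_player(...) directly; the partial item travels along.
        yield response.follow(
            item["player_url"],
            callback=self.parse_player,
            cb_kwargs={"item": item},
        )

    def parse_player(self, response, item):
        # Hypothetical selector: adjust it to the real player-page markup.
        item["player_meta"] = response.css("div.meta::text").getall()
        # The item is emitted only here, once both pages have been read.
        yield item
```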

Birth date export

Hi again. Unfortunately I'm not experienced in writing scripts like this, so I'm asking you about yours, which I consider the best by far!
Your script exports the data shown on the sofifa list pages, but not the values found on each individual player's page. Those pages contain a value that is very important to me: the year of birth. The ages shown in sofifa's list view are not precise and can be off by as much as two years, whereas with the actual birth date I could compute the age with an Excel formula after exporting. Do you have any ideas? Thank you very much in advance if you find the time even just to reply.
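
Once the text of a player page is available, the date could be pulled out with a regular expression. This is a sketch that assumes the page shows the birth date in parentheses in a form like `(Jun 24, 1987)`; the sample string and the helper name `parse_birth_date` are illustrative, not taken from the scraper.

```python
import re
from datetime import date

# Assumed format: three-letter English month, day, four-digit year, in parentheses.
BIRTH_DATE = re.compile(r"\((\w{3}) (\d{1,2}), (\d{4})\)")
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_birth_date(text):
    """Return the first birth date found in `text`, or None if there is none."""
    match = BIRTH_DATE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    if month not in MONTHS:
        return None
    return date(int(year), MONTHS[month], int(day))


sample = "35y.o. (Jun 24, 1987) 170cm 72kg"  # made-up example of the assumed format
print(parse_birth_date(sample))
```

Exporting the full date, or just its `.year`, leaves the age calculation to the spreadsheet.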

Overall and Potential

Hello, congratulations on your scraper, which works like a charm. The charm broke, though, when I noticed that two very important fields are missing from the values: overall and potential. There is BOV, i.e. the best overall, but it is not useful because it is the highest value a player has ever reached and often does not match the current one.

I went through the code but cannot manage to add the field. In Sofifa.py I added the line:
"ove": player.css("td.col-oa::text").get(),
I checked the td in the site's page source, but unfortunately the field stays empty.

What am I doing wrong, or what can be done? Thanks.
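
One possible explanation, assuming the number sits inside a child element of the cell (the markup below is made up for illustration): `td.col-oa::text` only matches text placed directly in the `td`, while `td.col-oa ::text` (with a space) also reaches text in nested elements. The same distinction can be shown with the standard library alone:

```python
import xml.etree.ElementTree as ET

# Assumed markup: the value is wrapped in a <span> inside the cell.
ROW = '<tr><td class="col-oa"><span>93</span></td></tr>'


def direct_text(cell):
    """Rough analogue of `td.col-oa::text`: text directly inside the cell."""
    return (cell.text or "").strip()


def nested_text(cell):
    """Rough analogue of `td.col-oa ::text`: all text inside the cell."""
    return "".join(cell.itertext()).strip()


cell = ET.fromstring(ROW).find("td")
print(repr(direct_text(cell)))  # empty: the td itself holds no text
print(repr(nested_text(cell)))  # the value from the nested <span>
```

If the real page is structured this way, a selector such as `player.css("td.col-oa ::text").get()` would be worth trying.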

Data is no longer extracted [Referer: None]

Hi, unfortunately I wanted to scrape the data again today, but at the moment the script no longer seems to extract anything.
Here is what it prints:

`2024-01-04 16:10:59 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: sofifa)
2024-01-04 16:10:59 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr 5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2024-01-04 16:10:59 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 32,
'BOT_NAME': 'sofifa',
'NEWSPIDER_MODULE': 'sofifa.spiders',
'SPIDER_MODULES': ['sofifa.spiders'],
'USER_AGENT': 'sofifa (+http://www.gof.com)'}
2024-01-04 16:10:59 [py.warnings] WARNING: /Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

2024-01-04 16:10:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet Password: 58a2503b8ae9f86c
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2024-01-04 16:10:59 [sofifa] INFO: Scraping year 240019
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-04 16:10:59 [scrapy.core.engine] INFO: Spider opened
2024-01-04 16:10:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-04 16:10:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B%5D=pi&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=bo&showCol%5B%5D=bp&showCol%5B%5D=gu&showCol%5B%5D=jt&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc&showCol%5B%5D=ta&showCol%5B%5D=cr&showCol%5B%5D=fi&showCol%5B%5D=he&showCol%5B%5D=sh&showCol%5B%5D=vo&showCol%5B%5D=ts&showCol%5B%5D=dr&showCol%5B%5D=cu&showCol%5B%5D=fr&showCol%5B%5D=lo&showCol%5B%5D=bl&showCol%5B%5D=to&showCol%5B%5D=ac&showCol%5B%5D=sp&showCol%5B%5D=ag&showCol%5B%5D=re&showCol%5B%5D=ba&showCol%5B%5D=tp&showCol%5B%5D=so&showCol%5B%5D=ju&showCol%5B%5D=st&showCol%5B%5D=sr&showCol%5B%5D=ln&showCol%5B%5D=te&showCol%5B%5D=ar&showCol%5B%5D=in&showCol%5B%5D=po&showCol%5B%5D=vi&showCol%5B%5D=pe&showCol%5B%5D=cm&showCol%5B%5D=td&showCol%5B%5D=ma&showCol%5B%5D=sa&showCol%5B%5D=sl&showCol%5B%5D=tg&showCol%5B%5D=gd&showCol%5B%5D=gh&showCol%5B%5D=gk&showCol%5B%5D=gp&showCol%5B%5D=gr&showCol%5B%5D=tt&showCol%5B%5D=bs&showCol%5B%5D=wk&showCol%5B%5D=sk&showCol%5B%5D=aw&showCol%5B%5D=dw&showCol%5B%5D=ir&showCol%5B%5D=pac&showCol%5B%5D=sho&showCol%5B%5D=pas&showCol%5B%5D=dri&showCol%5B%5D=def&showCol%5B%5D=phy&r=240019> (referer: None)`

It keeps showing the same output without downloading any data. What could it be?
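
The deprecation warning in the log is probably not the cause: by its own text it is normal whenever `REQUEST_FINGERPRINTER_IMPLEMENTATION` is unset. For Scrapy 2.9, the version shown in the log, it can be silenced with one line in the project's `settings.py`. This removes the warning only and is not expected to change what gets extracted.

```python
# settings.py (fragment): opt in to the newer request fingerprinting so that
# Scrapy 2.9 stops emitting the ScrapyDeprecationWarning shown in the log.
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
```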

Only the first 60 players are downloaded

The script works great, and the missing Overall value is now fixed as well.
However, for some reason, when I run the scraper it now suddenly downloads only the first 60 players, in practice only the first page. I attach the terminal output, which will surely be clearer to you.
[Screenshot, 2023-02-06 16:41:25]
