themm1 / procyclingstats

procyclingstats scraper

License: MIT License

Python 100.00%

cycling html-parsing python python-package scraper sports-analytics web-scraping

procyclingstats's Issues

climbs() method for Stage scraper produces empty list

procyclingstats/procyclingstats/stage_scraper.py

Line 258 in 14c4921

climbs_html = self.html.css_first("div > ul.list.circle")

Was following along with the examples/climbs_by_stages.py example, but all values in the stages_climbs dict were empty lists. Found that it was the climbs() method for Stage scraper not grabbing them from their respective pages. I resolved this locally by changing the CSS selector:

climbs_html = self.html.css_first("ul.list.circle")

Worth noting that I installed procyclingstats from pip, and when doing so the version of selectolax that was installed in my virtual environment was 0.3.12, which is different than the version listed in requirements.txt (0.3.8). Not sure if related.

Results of one day race

Dear themm1, thank you for all the great work you already did on this Procyclingstats scraper.

I was wondering if it is possible to retrieve the results of a certain one day race? In my specific case, I am only interested in obtaining the winner.

I have already tried several things, such as scraping the URL (e.g. race/paris-roubaix/2024/result) and calling Race.stage_winners(). Unfortunately, none of these worked.

Do you perhaps know of any method to retrieve the winner of a one day race?

Best regards.

Race results

Is it possible to add the results of a race? That is, parsing the table shown at, for example, /race/gp-samyn/2024/result.

One day races ranking scraping

Hello,

Your work is fantastic.
Unfortunately, the one-day race ranking is identified as a race ranking, and the parsing doesn't work.
Can you adapt the code for one-day races?

I have the issue with the default URL:
https://www.procyclingstats.com/rankings/me/uci-one-day-races

or with a filtered request:
https://www.procyclingstats.com/rankings.php?date=2022-12-31&nation=&age=&zage=&page=smallerorequal&team=&offset=0&filter=Filter&p=me&s=one-day-races

here is the error:

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/pavz/Library/Python/3.9/lib/python/site-packages/procyclingstats/scraper.py", line 112, in parse
    parsed_data[method_name] = method()
  File "/Users/pavz/Library/Python/3.9/lib/python/site-packages/procyclingstats/ranking_scraper.py", line 212, in races_ranking
    table_parser.parse(fields)
  File "/Users/pavz/Library/Python/3.9/lib/python/site-packages/procyclingstats/table_parser.py", line 102, in parse
    raise UnexpectedParsingError(message)
procyclingstats.errors.UnexpectedParsingError: Field 'stage_name' wasn't parsed correctly

Thank you

Error with Stage when no profile score

pcs.Stage('https://www.procyclingstats.com/race/tour-du-gevaudan-languedoc-roussillon/2015/stage-2').parse()

Produces the error

IndexError: list index out of range

ValueError while parsing weight of rider

Hey, for some riders (e.g. https://www.procyclingstats.com/rider/nikias-arndt) the weight is given as a float. When the weight() method is called while parsing all data for such a rider, it raises a ValueError because the method expects the weight to be an integer.
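One way to tolerate both forms is to parse the value as a float instead of an int. The sketch below is only an illustration: parse_weight is a hypothetical helper, not part of procyclingstats, and it assumes the weight has already been extracted from the page as a bare numeric string.

```python
def parse_weight(raw: str) -> float:
    """Convert a weight string such as '77' or '77.5' to a float.

    Hypothetical helper for illustration; not the library's weight() method.
    """
    # Accept a decimal comma as well, in case the text uses one (assumption).
    cleaned = raw.strip().replace(",", ".")
    return float(cleaned)


print(parse_weight("77"))
print(parse_weight("77.5"))
```

Returning a float for every rider changes the return type for callers that expect an int, so this is a trade-off rather than a drop-in fix.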

Feature request: temperature of a stage

ProCyclingStats now shows the temperature of a stage. Would it be possible to add this to the scraper?

Invalid URL

When I run even basic commands, like:

from procyclingstats import RiderResults

results = RiderResults("rider/tadej-pogacar")
print(results.results())

I get the following error:

ValueError                                Traceback (most recent call last)
Input In [1], in <cell line: 3>()
      1 from procyclingstats import RiderResults
----> 3 results = RiderResults("rider/tadej-pogacar")
      4 print(results.results())

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/procyclingstats/scraper.py:50, in Scraper.__init__(self, url, html, update_html)
     48 self.update_html()
     49 if not self._html_valid():
---> 50     raise ValueError(
     51         f"HTML from given URL is invalid: '{self.url}'")
     52 self._set_up_html()

ValueError: HTML from given URL is invalid: 'https://www.procyclingstats.com/rider/tadej-pogacar'

The URL seems correct, so what is the problem?

Keyerror for getattr(stage, classification)

The line data = getattr(stage, classification)() seems to raise an error inside the procyclingstats code:

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\Users%username%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\stage_scraper.py", line 298, in results
table = join_tables(table, table_parser.table, "rider_url")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%username%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 162, in join_tables
table.append({**table2_dict[row[join_key]], **row})
~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'rider/laurens-de-plus'

race.stages() throws UnexpectedParsingError

I am trying to parse the stages of a specific Race with the code provided in your repo, in the file examples\climbs_by_stages.py:

RACE_URL = "race/tour-de-france/2022"
race = Race(f"{RACE_URL}/overview")
race_climbs = RaceClimbs(f"{RACE_URL}/route/climbs")
stages = race.stages()

This throws an UnexpectedParsingError with the message

procyclingstats.errors.UnexpectedParsingError: Field 'profile_icon' wasn't parsed correctly

I took a look at the error, and it seems that the code can't parse the stages table because of one extra 'sum' row that totals the kilometers of all stages. I suggest removing this row before the table is parsed.

Feature Request to add "age" method to Rider class

It would be nice to scrape the age from a given rider, rather than having to calculate it after the fact from their birthdate. To that end I have created a pull request with the addition of an age() method in the Rider class.
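Until such a method is available, the age can be derived from the birthdate. The sketch below is not the code from the pull request: age_from_birthdate is a hypothetical helper, and it assumes the birthdate is a "YYYY-M-D" string such as "1998-9-21", the form shown for Rider.birthdate() elsewhere on this page.

```python
from datetime import date


def age_from_birthdate(birthdate: str, today: date) -> int:
    """Return the age in whole years on the given day.

    Hypothetical helper; assumes a 'YYYY-M-D' birthdate string.
    """
    year, month, day = (int(part) for part in birthdate.split("-"))
    born = date(year, month, day)
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


print(age_from_birthdate("1998-9-21", date(2024, 9, 20)))  # 25
print(age_from_birthdate("1998-9-21", date(2024, 9, 21)))  # 26
```

Passing today explicitly keeps the function deterministic; in a script, date.today() can be supplied instead.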

startlist() fails

Thanks for all the great work on this!
Happy to help with some MRs when I get time this weekend if you're open to it

Currently, startlist() parsing appears to be broken. When I run the example code

race_startlist = RaceStartlist("race/tour-de-france/2022/startlist")
race_startlist.startlist()

I get the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 2
      1 race_startlist = RaceStartlist("race/tour-de-france/2022/startlist")
----> 2 race_startlist.startlist()

File ~/racestats/.venv/lib/python3.8/site-packages/procyclingstats/race_startlist_scraper.py:96, in RaceStartlist.startlist(self, *args)
     93 startlist_html = self.html.css_first(".startlist_v3")
     94 # startlist is individual startlist e.g.
     95 # race/tour-de-pologne/2009/gc/startlist
---> 96 if startlist_html.css_first("li.team") is None:
     97 startlist_html = self.html.css_first(".page-content > div")
     98 startlist_table = []

AttributeError: 'NoneType' object has no attribute 'css_first'

stage results Table Parser fails on results with empty teams or ages

The Stage Results parser fails when some rows have empty values for age or team.

In some races riders do not have a team, or PCS does not know all ages. In that case a blank value is displayed.

For example: https://www.procyclingstats.com/race/nc-germany-we/2023/result

For the age, it fails because an empty string can't be cast to an int (table_parser.py:212).
This can be fixed by not casting the age to int and leaving it as a string instead. I don't know if this is safe.

For the teams it fails because the teams are found using _filter_a_elements. But an empty cell has no a_element and thus raises an UnexpectedParsingError.
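An alternative to leaving the age as a string is to map blank cells to None and cast everything else. The sketch below is an illustration only: parse_age is a hypothetical helper, not the code at table_parser.py:212, and it assumes the cell text is available as a plain string.

```python
from typing import Optional


def parse_age(raw: str) -> Optional[int]:
    """Return the age as an int, or None when the cell is blank.

    Hypothetical helper for illustration; not part of procyclingstats.
    """
    text = raw.strip()
    # An empty cell means the age is unknown, so there is nothing to cast.
    return int(text) if text else None


print(parse_age("27"))
print(parse_age(""))
```

The same idea could apply to the team column: when a cell contains no link, a None value could be stored instead of raising, at the cost of callers having to handle None.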

Img

Would it be possible to also scrape the images of riders?

Thanks

PCI and UCI points in rider results are all 0s

Currently, it looks like all UCI and PCI points in the rider results are parsed as 0s.

E.g.
rider_results = RiderResults("rider/rui-oliveira/results")
rider_results.results()

Shows 0s for the first stage: race/tour-of-slovenia/2023/stage-3
despite earning 4 and 3 points respectively: https://www.procyclingstats.com/rider/rui-oliveira/results

Vertical_meters returns all NaN

stage.vertical_meters has returned only NaN for the past few days. Before that, it worked correctly. Could this be fixed, please?

Problem scraping when trying to run the method team.riders()

When running procyclingstats==0.1.7, this code:

from procyclingstats import Team


team = Team("team/bora-hansgrohe-2022")
print(team.riders())

results in this error:

Traceback (most recent call last):
  File "C:\Users\user\Documenten\test.py", line 12, in <module>
    print(team.riders())
  File "C:\Users\user\anaconda3\lib\site-packages\procyclingstats\team_scraper.py", line 183, in riders
    table_parser = TableParser(career_points_table_html)
  File "C:\Users\user\anaconda3\lib\site-packages\procyclingstats\table_parser.py", line 31, in __init__
    table_body = html_table.css_first("tbody")
AttributeError: 'NoneType' object has no attribute 'css_first'

Feature request to add the "Hills" speciality.

PCS have introduced a new speciality; "Hills". Would be nice to scrape this as well. Thanks.

Question about your api

Hi, how exactly does your API work? For example, I use this in my Python application:

from procyclingstats import Rider
rider = Rider("rider/tadej-pogacar")
rider.birthdate()
"1998-9-21"

When does the webscraping happen? Is it scraping locally from my device or am I making calls to a db/server that you created using the scraping? When is the data updated?

Thanks for your answer

Problem scraping if rider DNF first stage

$ python test.py
[
Traceback (most recent call last):
  File "/Users/colin/marketcetera/workspaces/procyclingstats/code/procyclingstats/examples/test.py", line 27, in
    pprint(stage.parse())
  File "/Users/colin/Library/Python/3.10/lib/python/site-packages/procyclingstats/scraper.py", line 112, in parse
    parsed_data[method_name] = method()
  File "/Users/colin/Library/Python/3.10/lib/python/site-packages/procyclingstats/stage_scraper.py", line 298, in results
    table = join_tables(table, table_parser.table, "rider_url")
  File "/Users/colin/Library/Python/3.10/lib/python/site-packages/procyclingstats/utils.py", line 162, in join_tables
    table.append({**table2_dict[row[join_key]], **row})
KeyError: 'rider/laurens-de-plus'

In the first stage of the '23 Vuelta, Laurens de Plus crashed and did not finish (DNF). I'm not sure whether this is the source of the issue. A sample script to reproduce it is attached.

test.py.txt
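The KeyError in the traceback comes from looking up every row's join key in the second table without checking that it exists there. The sketch below shows one possible lenient join. join_tables_lenient is a hypothetical function, not the library's join_tables, and which table is iterated and which is indexed is an assumption based only on the failing line shown above. The rider URL "rider/example-rider" and the field names are made up for the example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_tables_lenient(table1: List[Row], table2: List[Row],
                        join_key: str) -> List[Row]:
    """Join two tables on join_key, keeping rows missing from table2.

    Hypothetical sketch; not the library's join_tables implementation.
    """
    table2_by_key = {row[join_key]: row for row in table2}
    joined = []
    for row in table1:
        # Fall back to an empty dict instead of raising KeyError when
        # the rider is absent from the second table (e.g. after a DNF).
        extra = table2_by_key.get(row[join_key], {})
        joined.append({**extra, **row})
    return joined


results = [
    {"rider_url": "rider/example-rider", "rank": 1},
    {"rider_url": "rider/laurens-de-plus", "rank": None},
]
gc = [{"rider_url": "rider/example-rider", "gc_rank": 1}]
print(join_tables_lenient(results, gc, "rider_url"))
```

Rows without a match simply lack the extra fields, so downstream code would need to handle the missing keys.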

Max number of races is limited to 100?

Hi!

I have created a small custom function

def fetch_rider_results(rider_url):
    rider_results = RiderResults(rider_url + "/results")
    rider_results_JSON = rider_results.parse()
    return rider_results_JSON

When I call this function, for instance

st.write(fetch_rider_results("rider/jonas-vingegaard"))

I get a JSON object, but the races and stages are only numbered 0-99. Would it be possible instead to fetch, for instance, the results for the last 5 years?

Best regards
Kasper

team.wins_count() throws a ValueError when the team has no wins

The wins_count() function located in team_scraper.py assumes that the retrieved html text will be a valid integer, but throws an error when there are no wins because procyclingstats.com displays that with a dash instead of a zero. You can see what I mean below.

return int(wins_count_html.text())

I will be submitting a pull request with a simple fix if you would like to merge it.
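For illustration, the dash could be handled before the cast. This is a sketch only and not the fix from the pull request: parse_wins_count is a hypothetical helper that assumes the displayed text has already been extracted as a string.

```python
def parse_wins_count(raw: str) -> int:
    """Return the number of wins, treating a dash placeholder as zero.

    Hypothetical helper for illustration; not the library's wins_count().
    """
    text = raw.strip()
    # The site displays a dash instead of 0 when a team has no wins.
    if text in {"-", ""}:
        return 0
    return int(text)


print(parse_wins_count("12"))
print(parse_wins_count("-"))
```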

Error with proxy connection

I get an error indicating an issue with the proxy connection when trying to access the URL "https://www.procyclingstats.com/rider/tadej-pogacar". The error message says "Max retries exceeded".

error with RaceStartlist

Getting this error when I use RaceStartlist.

     94 for team_html in startlist_html.css(".ridersCont"):
     95 riders_table = team_html.css_first("ul")
---> 96 table_parser = TableParser(riders_table)
     97 rider_f_to_parse = [f for f in casual_rider_fields if f in fields]
     98 table_parser.parse(rider_f_to_parse)

AttributeError: 'NoneType' object has no attribute 'css_first'

I see this has come up before, but I am using 0.1.6.

Thanks

rider_number in Stage.results, points, kom, gc, youth

Hi @themm1, after many years of not coding I feel like quite a rookie in Python, but I have succeeded in writing some code to scrape all stage results for an xlsx-driven cycling game I always play with some 100 friends. It is extremely useful to be able to load all results.

However, sometime after the recent Giro (all scraping worked perfectly during the Giro), the rider_number column appears to have disappeared from Stage.results, points, kom, gc and youth. Also, startlist seems empty. From the API documentation I read that rider_number should only be available in RaceStartlist.startlist, but it was in results as well. In the HTML code, the BIBs still seem to be available in the results pages.

As a BIB number is much easier and more consistent to match, my Excel sheet is driven by the BIB number rather than rider_name. A workaround that looks up rider_number in RaceStartlist.startlist() and puts it in the dataframe I need for the results is possible, but not desirable if the data is also available in the results. Can you explain this issue?

many thx!!

race.stages() fails

Seems like the site has changed. I get the following when using race.stages()

AttributeError                            Traceback (most recent call last)

----> 8         race.stages()

in stages(self, *args)

----> 172         if self.is_one_day_race():
    173             return []

in is_one_day_race(self)

     73         one_day_race_html = self.html.css_first("div.sub > span.blue")
---> 74         return "stage" not in one_day_race_html.text().lower()
     75 

AttributeError: 'NoneType' object has no attribute 'text'
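Errors of this kind arise when a selector matches nothing and the resulting None is used as if it were an element. The sketch below shows one way to turn that into a clearer failure. text_of, MissingElementError and FakeNode are all hypothetical names for this example; FakeNode only stands in for a parsed HTML node with a text() method and is not a selectolax class.

```python
from typing import Any, Optional


class MissingElementError(Exception):
    """Hypothetical error raised when an expected element is absent."""


def text_of(node: Optional[Any], selector: str) -> str:
    """Return node.text(), or raise a descriptive error if node is None."""
    if node is None:
        # Name the selector so a site layout change is easy to diagnose.
        raise MissingElementError(f"No element matched selector {selector!r}")
    return node.text()


class FakeNode:
    """Stand-in for a parsed HTML node, used only in this example."""

    def __init__(self, text: str) -> None:
        self._text = text

    def text(self) -> str:
        return self._text


print(text_of(FakeNode("Stage race"), "div.sub > span.blue").lower())
try:
    text_of(None, "div.sub > span.blue")
except MissingElementError as err:
    print(err)
```

A guard like this does not fix the selector itself; it only reports which selector stopped matching after the site changed.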