lbenz730 / ncaahoopr
An R package for working with NCAA Basketball Play-by-Play Data
License: MIT License
Seems to be an issue with pulling box scores for a matchup where there is a team name containing a hyphen. It will return a list with the other team's data missing and with a name of ''
Can be fixed by just changing:
matchup <- unlist(strsplit(pagetext, "-"))[[1]][1]
to
matchup <- unlist(strsplit(pagetext, " - "))[[1]][1]
so it splits on the correct hyphen and not on the potential team name hyphen.
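A minimal sketch of the difference between the two splits. The page title below is made up for illustration; the text actually scraped from ESPN may be formatted differently.

```r
# Hypothetical page title; the actual text scraped from ESPN may differ.
pagetext <- "Arkansas-Pine Bluff vs. Texas - Box Score - November 6, 2018 - ESPN"

# Splitting on a bare hyphen cuts the hyphenated team name in half.
naive_matchup <- unlist(strsplit(pagetext, "-"))[1]

# Splitting on " - " (a hyphen padded by spaces) keeps the team name intact.
fixed_matchup <- unlist(strsplit(pagetext, " - ", fixed = TRUE))[1]

print(naive_matchup)
print(fixed_matchup)
```

Passing fixed = TRUE treats the separator as a literal string rather than a regular expression, which is a small safeguard and not something the original fix requires.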
Hi Luke - Two quick items:
get_master_schedule might be having an issue pulling up the first day of the 2018-19 season for example:
get_master_schedule("2018-11-06") ----> isn't working
get_master_schedule("2018-12-06") ----> for example is working
Am I doing something incorrect, or any guidance on how to get the game_ids from that date?
Do you possibly have code or a link to get the full season of game_ids/schedule for prior years? I'm still new to R and have been trying to come up with various codes to get at this in different ways, but from a QC perspective would be interested to know if there was a faster way as well.
Thanks so much!
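One possible way to build a season-long schedule is to loop over dates and skip the ones that fail. This is only a sketch: collect_schedules() is a hypothetical helper, it assumes get_master_schedule() accepts a date string as in the calls above and returns data frames with matching columns, and the date range in the example is illustrative.

```r
# Collect schedules over a range of dates, skipping dates that fail.
# `fetch` defaults to ncaahoopR::get_master_schedule; pass a stub to try it offline.
collect_schedules <- function(dates, fetch = ncaahoopR::get_master_schedule) {
  results <- lapply(as.character(dates), function(d) {
    tryCatch(
      fetch(d),
      error = function(e) {
        message("Skipping ", d, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop failed dates and stack the remaining data frames.
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Example (requires network access and the ncaahoopR package):
# season_dates <- seq(as.Date("2018-11-06"), as.Date("2018-11-12"), by = "day")
# schedule <- collect_schedules(season_dates)
```

Logging the skipped dates makes it easier to see which days need a second look, such as the 2018-11-06 case above.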
I'm guessing this is due to player name discrepancies. Could these be assigned via the possession_before column?
df$shot_team = ifelse(is.na(df$shot_team) & !is.na(df$shot_outcome), df$possession_before, df$shot_team)
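To illustrate what that line does, here is the same fallback applied to a toy data frame. The rows are invented; only the column names come from the snippet above.

```r
# Toy play-by-play rows; column names follow the snippet above.
df <- data.frame(
  shot_team = c("Duke", NA, NA),
  shot_outcome = c("made", "missed", NA),
  possession_before = c("Duke", "Michigan State", "Duke"),
  stringsAsFactors = FALSE
)

# Fill shot_team from possession_before only where a shot was recorded
# but the shooting team could not be determined.
needs_fill <- is.na(df$shot_team) & !is.na(df$shot_outcome)
df$shot_team[needs_fill] <- df$possession_before[needs_fill]

print(df)
```

Rows with no shot at all (the third row here) are left as NA, which matches the intent of the ifelse() version.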
I'll randomly get an error with these types of queries (pulling historical game ids) - the most recent to pop up is:
ex: get_game_ids(team = 'Virginia Tech', season ='2017-18')
Error in if (reg_flag > 1) { : argument is of length zero
Safer to just download archives? Deploying a Shiny app that allows for querying of shot location data by team, year and player and want to be able to support historical querying while also keeping app size down, but seems that historical pulling has its drawbacks w/r to ESPN changing things on their backend and causing random errors to pop up 🤷♂️
The free_throw column appears to be NA or FALSE only (it never shows up as TRUE, and all of the shot information for the free throw attempt in shot_team and shot_outcome is NA as well). For example, this issue occurs when running ncaahoopR::get_pbp_game("401260005") for the Duke vs. Mich St. game.
Would love to get box score function working
wp_chart and gg_wp_chart
Myself and some others are having issues with the get_schedule and get_pbp functions. Everything else is working fine, but something is off with these two. They produce the same error message:
Error in get_schedule("Penn State", "2020-21") :
No team schedule available for Penn State / 2020-21. Current ESPN season = "2020-21". If you are trying to find the most recent season (2019-20), please supply season = "2019-20" argument.
Any help or solution would be appreciated!
Log of issues in PBP Parsing errors:
game_id: 401082334, play_id 209 (player not known)
game_id: 401082334, play_id 87 (defensive rebound at same time as shot)
I noticed some of the PBP names are off in the 'dict' dataframe. I have been scraping boxscores for the 2016-17 and 2017-18 seasons, and it looks like the name in the boxscore list is the same as the name in the pbp data (I only spot checked a few of these in pbp, but the boxscores for sure use this name). These are the changes that I have reconciled:
dict$ESPN_PBP[dict$ESPN_PBP == 'Arkansas State'] = 'Arkansas St'
dict$ESPN_PBP[dict$ESPN_PBP == 'Fort Wayne'] = 'Purdue Fort Wayne'
dict$ESPN_PBP[dict$ESPN_PBP == 'Georgia State'] = 'Georgia St'
dict$ESPN_PBP[dict$ESPN_PBP == 'Little Rock'] = 'Arkansas-Little Rock'
dict$ESPN_PBP[dict$ESPN_PBP == 'Louisiana'] = 'Lafayette'
dict$ESPN_PBP[dict$ESPN_PBP == 'Loyola-Chicago'] = 'Loyola Chicago'
dict$ESPN_PBP[dict$ESPN_PBP == 'McNeese'] = 'Mcneese St'
dict$ESPN_PBP[dict$ESPN_PBP == "Mt. St. Mary's"] = "Mt. St. Mary'S"
dict$ESPN_PBP[dict$ESPN_PBP == "San José St"] = "San José State"
dict$ESPN_PBP[dict$ESPN_PBP == 'Southeast Missouri'] = 'Southeast Missouri State'
dict$ESPN_PBP[dict$ESPN_PBP == 'Seattle'] = 'Seattle U'
dict$ESPN_PBP[dict$ESPN_PBP == 'SIU-Edwardsville'] = 'SIU Edwardsville'
dict$ESPN_PBP[dict$ESPN_PBP == 'UL Monroe'] = 'Ul Monroe'
dict$ESPN_PBP[dict$ESPN_PBP == 'UMKC'] = 'UM Kansas City'
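The same renames could also be expressed as a single lookup vector, which may be easier to maintain as the list grows. A sketch on a toy stand-in for dict (the real data frame has more columns and rows), showing only a few of the renames:

```r
# Toy stand-in for the package's `dict` data frame.
dict <- data.frame(
  ESPN_PBP = c("Arkansas State", "Duke", "UMKC"),
  stringsAsFactors = FALSE
)

# Old name -> new name; only a few of the renames above are shown.
renames <- c(
  "Arkansas State" = "Arkansas St",
  "Georgia State"  = "Georgia St",
  "UMKC"           = "UM Kansas City"
)

# Replace only the entries that appear in the lookup.
hit <- dict$ESPN_PBP %in% names(renames)
dict$ESPN_PBP[hit] <- unname(renames[dict$ESPN_PBP[hit]])

print(dict)
```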
> get_pbp_game(400915704)
Scraping Data for Game: 1 of 1
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open URL 'http://barttorvik.com/2017_rosters.json': HTTP status was '403 Forbidden'
When running get_pbp_game(gameid), it begins scraping and then gets cut off with an error that reads "Error in if (length(tmp) < ncol(tmp[[1]]) | length(tmp) == 0) { : argument is of length zero".
Hi there, awesome library!
I have encountered this issue when trying to get pbp data:
Error in get_pbp_game(game_ids) : object 'wp_hoops' not found
Not sure if you saw my comment under the recently closed issue #54, so I will open this new issue here: The shot_team column is still blank for free throws. I think adding the following two lines to the end of the if (extra_parse == T) statement should fix this:
pbp$shot_team[(made_shots | missed_shots) & tolower(pbp$shooter) %in% tolower(home_roster)] <- pbp$home[1]
pbp$shot_team[(made_shots | missed_shots) & tolower(pbp$shooter) %in% tolower(away_roster)] <- pbp$away[1]
I think these free throw related problems are occurring this year because ESPN is no longer marking the shot location of free throws as right underneath the basket: (25, 87.94222222222221) or (25, 4.177777777777778). Thanks for your hard work maintaining the package!
When I am trying to generate shot charts, I am unable to get these charts because the variable "court" doesn't seem to be defined. Is there an issue with my environment?
edit: closing issue, I was using ncaahoopR:: rather than importing the library. Importing the library solved the problem.
Better parsing in pbp logs of possession, fouls, shot clock etc. Requires historical rosters from #17
Choose HTML element 66 instead of 65 due to ESPN web page refresh.
Dates are getting parsed to NA. Will probably have to change the way I determine date of game.
I'm trying to get data from UVA (Virginia Cavaliers) for 2019-20 and I see this Error:
get_pbp("UVA","2019-20")
Getting Game IDs: UVA
Scraping Data for Game: 1 of 13
Error in if (extra_parse & (pbp$date[1] >= "2007-11-01")) { :
missing value where TRUE/FALSE needed
A similar message appears for all teams and seasons when I try to execute the code.
I'm sorry for the inconvenience, but what am I doing wrong?
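This error message is what R produces when the condition inside if() evaluates to NA, which would happen here if the game date failed to parse. A small sketch of the failure mode and one possible guard; the variable values are hypothetical and this is not necessarily how the package resolves it:

```r
# Hypothetical values mimicking a game whose date failed to parse.
extra_parse <- TRUE
game_date <- as.Date(NA)

# Comparing an NA date yields NA, and if (NA) raises
# "missing value where TRUE/FALSE needed".
raw_condition <- extra_parse & (game_date >= "2007-11-01")

# isTRUE() treats NA as FALSE, so the branch is skipped instead of erroring.
if (isTRUE(raw_condition)) {
  message("Running extra parsing")
} else {
  message("Skipping extra parsing: game date is missing or too early")
}
```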
Four new teams have moved to D1 this year, and it doesn't look like your create_ids_df function has been updated to reflect this. To start, when you initiate the "ids" data frame, it should have 357 observations instead of 353. Then, it looks like there is a bug when you fill in the rest of the teams at the end of the function. Some of the teams were mismatched as seen in the snip below:
It seems like there's no substitution info on ESPN. I found that the play by play from NCAA.org contains this information, for example here. I wonder if you could add this feature. Thank you so much
Hey @lbenz730.
SEC charts just bounced back on me, and I realized today that get_master_schedule throws an error for canceled/postponed games. The reason for this is right here.
When you filter out in completed, the scores are filled with NA. So this if statement with the seq throws an error. The scores variable is just a vector of length 1 with NA in it.
if(length(scores) > 0) {
winning_scores <- scores[seq(1, length(scores) - 1, 2)]
losing_scores <- scores[seq(2, length(scores), 2)]
index <- sapply(completed$away_anchor, function(y) { y %in% winners })
completed$home_score[index] <- losing_scores[index]
completed$home_score[!index] <- winning_scores[!index]
completed$away_score[!index] <- losing_scores[!index]
completed$away_score[index] <- winning_scores[index]
}
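The check length(scores) > 0 passes for a single NA, and seq(1, 0, 2) is then an error in R because the step has the wrong sign. One possible guard, sketched with a hypothetical helper split_scores() that is not part of the package:

```r
# Split a flat vector of scores (winner, loser, winner, loser, ...) into pairs.
# Returns NULL when there is nothing usable, e.g. a single NA for an
# unplayed game.
split_scores <- function(scores) {
  if (length(scores) < 2 || all(is.na(scores))) {
    return(NULL)
  }
  list(
    winning = scores[seq(1, length(scores) - 1, by = 2)],
    losing  = scores[seq(2, length(scores), by = 2)]
  )
}

split_scores(NA_real_)            # NULL: nothing to assign
split_scores(c(80, 70, 65, 60))   # winning = 80, 65; losing = 70, 60
```

The caller would then skip the score assignment whenever the helper returns NULL, leaving home_score and away_score as NA for canceled or postponed games.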
It's unclear how to interpret the shot location data, so I've got a few questions:
These are probably worth adding to the docs as well. Cheers.
Hi,
I had a couple of errors trying to install ncaahoopR, and I'm wondering if they're problems on my end or with the package.
If I try to install the package using 'install.packages' I get the following error:
Warning in install.packages :
package ‘ncaahoopR’ is not available (for R version 4.0.2)
If I try to install the package using the github repo + devtools I get this error:
Homebrew 2.1.4
Homebrew/homebrew-core (git revision 3d75; last commit 2019-06-03)
Using PKG_CFLAGS=-I/usr/local/opt/[email protected]/include -I/usr/local/opt/openssl/include
--------------------------- [ANTICONF] --------------------------------
Configuration failed because openssl was not found. Try installing:
ERROR: configuration failed for package ‘openssl’
Thanks for any help,
Aditya
2019 conference changes--change in ids dataset:
School | Former Conference | New Conference |
---|---|---|
Merrimack | Northeast-10 Conference (D-II) | Northeast Conference |
Savannah State | Mid-Eastern Athletic Conference | Southern Intercollegiate Athletic Conference (D-II) |
Hi Luke,
It looks like ESPN has updated all the rosters for the upcoming season, but now when I try to get the rosters from last year I am getting an error. The link to the .json data (https://barttorvik.com/2020_rosters.json) is working, but it looks like it is breaking at the fromJSON function. Is this an issue on my end, or something new on Bart's site?
Thanks,
Justin
Greetings!
First of all, thank you so much for this package. I've used it a ton through this season.
I've noticed the "get_schedule()" function isn't working properly when you pass a previous season argument. It does, however, still work flawlessly for the current season. Am I correct in that the format is "2019-20"?
Thanks!
Authors@R: c(person("Luke", "Benz", email = "[email protected]", role = c("aut", "cre")),
person("Meyappan", "Subbaiah", role = c("ctb")),
person("Luke", "Morris", role = c("ctb")),
person("Jared", "Andrews", role = c("ctb")),
person("C. Ryan", "Campbell", role = c("ctb")),
person("Ty", "Walters", role = c("ctb")),
person("Jack", "Lichtenstein", role = c("ctb")),
person("Kurt", "Wirth", role = c("ctb")),
person("Jason", "Maddox", role = c("ctb")),)
As can be seen above, there is a superfluous comma at the end of the list, which breaks installs.
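The effect can be reproduced outside of DESCRIPTION: in R, a trailing comma leaves an empty final argument, and the call fails when evaluated. A shortened sketch with two of the entries, with email addresses omitted:

```r
# A trailing comma leaves an empty final argument, so evaluating the call fails.
broken <- tryCatch(
  eval(parse(text = 'c(person("Luke", "Benz"), person("Jason", "Maddox"),)')),
  error = function(e) e
)
inherits(broken, "error")  # TRUE

# Without the trailing comma the same call builds a two-person vector.
authors <- c(
  person("Luke", "Benz", role = c("aut", "cre")),
  person("Jason", "Maddox", role = "ctb")
)
length(authors)  # 2
```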
Hi Luke,
When I try to pull schedules for teams I get the following error:
get_schedule("Duke")
Error in XML::readHTMLTable(RCurl::getURL(url))[[1]] : subscript out of bounds
Additionally when I try to get a team's PBP I get the following error:
get_pbp("Duke")
Getting Game IDs: Duke
Error in XML::readHTMLTable(RCurl::getURL(url))[[1]] : subscript out of bounds
If I feed in a game id then I don't get the same error, but I get the following error if I try to get historical game ids:
get_game_ids("Duke", season = "2017-18")
Error in if (reg_flag > 1) { : argument is of length zero
Appreciate any help,
Aditya
Fix bug in home/away timeout indicator variables.
DESCRIPTION lists this package as having a license of MIT + file. However, the LICENSE file is not the standard MIT text. I'm planning to fork this package to add women's data. I can update the LICENSE file in the process if you really intend this to be MIT.
Great library, man -- thanks for putting this all together and enabling better analysis of CBB!
Currently trying to iterate through a set of teams from the library's team dictionary and pull in some play-by-play data, and I'm running into an issue with get_pbp('Campbell'). Looks like on Campbell's 21st game, the PBP parser bails out with the error: Error in if (length(tmp) < ncol(tmp[[1]]) | length(tmp) == 0) { : argument is of length zero.
Working backward through the stacktrace, I'm fairly sure this line is the culprit. Looks like the URL for the game in question (https://www.espn.com/mens-college-basketball/game?gameId=401171028) just redirects to the NCAA scoreboard, rather than displaying any game data.
Let me know if you need any other information to debug this!
Edit: Looks like Coastal Carolina has the same issue in its second game, based on its schedule. Hampden-Sydney is not a D1 school (from what I can tell), so it doesn't have an ESPN game summary associated with games against it.
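For what it's worth, the "argument is of length zero" message is consistent with tmp[[1]] not being a table: ncol() then returns NULL, length(tmp) < NULL is logical(0), and the vectorized | keeps it zero-length. A hedged sketch of a more defensive check, using a hypothetical helper name and assuming tmp is the list of tables parsed from the game page:

```r
# Returns TRUE when the scraped tables look unusable.
# `tmp` stands in for the list of tables parsed from the game page.
pbp_tables_missing <- function(tmp) {
  # || short-circuits, so tmp[[1]] is only touched when tmp is non-empty.
  length(tmp) == 0 ||
    is.null(ncol(tmp[[1]])) ||
    length(tmp) < ncol(tmp[[1]])
}

pbp_tables_missing(list())                          # TRUE: no tables at all
pbp_tables_missing(list(NULL))                      # TRUE: first entry is not a table
pbp_tables_missing(list(data.frame(a = 1, b = 2),
                        data.frame(a = 3, b = 4)))  # FALSE
```

With a check like this, a game page that redirects to the scoreboard could be skipped with a warning instead of stopping the whole get_pbp() loop.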
Hey Luke,
Nifty package, quite a fan. Thanks for making it.
I've run into an error when trying to create an assist_network and have it saved as a pdf. I use Jupyter notebook and R 3.4.4. Plotting within the notebook works fine, but the error occurs in R sessions opened from the command line as well. Not a huge deal, but I have been struggling with text getting cut off and wanted to bring it back into frame in Illustrator or something rather than distort the entire image by screwing with image dimensions.
To replicate:
pdf("tester.pdf")
assist_net("Louisville", 401082663, louisville$primary_color, message = "Assist Network - Louisville v Clemson 02/16/2019")
dev.off()
Error (repeated a few dozen times):
Warning message in text.default(x, y, labels = labels, col = label.color, family = label.family, :
“font family 'Arial Black' not found in PostScript font database”
Clearly, the Arial Black font is missing, but I'm not sure how to rectify it. I tried mucking about with the extrafont package, but with no luck. Any hints or solutions would be much appreciated. Thanks!
Hey Luke -
Came across another issue when trying to pull the Army PBP data after re-installing the 1.5.2 version of ncaahoopR.
Getting Game IDs: LIU
Scraping Data for Game: 1 of 22
Error in get_roster(dict$ESPN[dict$ESPN_PBP == pbp$away[1]], year) :
Invalid team. Please consult the ids data frame for a list of valid teams, using data(ids).
Seems to trigger when trying to pull LIU data.
Decrease the number of scraping requests and speed things up with memoise (https://github.com/r-lib/memoise). Memoise can cache function results (locally or in the cloud), which is useful for functions called multiple times with the same input. For example, functions relying on get_pbp_game() will benefit from caching get_pbp_game() results.
Check out googleAnalyticsR (http://code.markedmondson.me/googleAnalyticsR/) to see memoise in action.
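A rough sketch of what the caching could look like. slow_fetch() is a stand-in for an expensive scraper and is not part of the package; the commented line at the end shows how the real function might be wrapped, which is an assumption rather than tested code.

```r
library(memoise)

# Stand-in for an expensive scraper such as get_pbp_game(); it counts calls
# so the effect of caching is visible.
call_count <- 0
slow_fetch <- function(game_id) {
  call_count <<- call_count + 1
  Sys.sleep(0.1)  # simulate network latency
  data.frame(game_id = game_id, play_id = 1:3)
}

# memoise() wraps the function and caches results keyed by its arguments.
cached_fetch <- memoise(slow_fetch)

first  <- cached_fetch("401082334")  # runs slow_fetch
second <- cached_fetch("401082334")  # served from the cache

call_count  # 1

# In practice one might wrap the real function instead:
# cached_get_pbp_game <- memoise(ncaahoopR::get_pbp_game)
```

The default cache lives in memory for the current R session, so repeated calls within one session avoid re-scraping, but a fresh session starts cold.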
Rows sometimes seem to get shuffled out of order. Need to do some post processing on scores
https://twitter.com/jayau92/status/1196640237478436865
For future dates w/out any games
Error in if (reg_flag > 1) { : argument is of length zero
Fix functions when ESPN does summer archiving. Add historical data for past years as well.
Similar to what team_shot_chart() does for mapping a team's offense, it would be great if you could input a vector of game ids for a particular team and get an aggregated shot chart or shot heatmap of their opponents' shooting.
I've attempted using team != "Kansas" within team_shot_chart() to execute something like this with no luck.
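Until something like this is supported, the filtering step could be done by hand on the shot data before plotting. A sketch on invented rows; shot_team mirrors the play-by-play column of that name, while game_id, x, and y are illustrative and may not match the package's actual shot location columns.

```r
# Toy shot data; `shot_team` mirrors the play-by-play column of that name,
# while the other columns are illustrative.
shots <- data.frame(
  game_id   = c("1", "1", "2", "2"),
  shot_team = c("Kansas", "Duke", "Kansas", "Baylor"),
  x = c(10, 25, 30, 5),
  y = c(20, 15, 40, 8),
  stringsAsFactors = FALSE
)

# Keep only shots taken against Kansas in the chosen games.
kansas_games <- c("1", "2")
opponent_shots <- shots[shots$game_id %in% kansas_games &
                          shots$shot_team != "Kansas", ]

print(opponent_shots)
```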
Certain functions that rely on rosters and/or schedules are limited in scope to the current season. It would be nice to extend them to past seasons.
I tried to do ncaahoopR::get_roster("Kansas") and it spit out the Kennesaw State roster. It looks like ESPN might have changed the links in their system. For example "kennesaw-st-owls" is now "kennesaw-state-owls". Now the create_ids_df function appears to be the source of this error. I went ahead and updated the "ids.csv" file from lbenz730/NCAA_Hoops_Play_By_Play by changing some of the links that appeared to be changed, and now it appears to be working.
ids_JE.zip
Hey @lbenz730
Looks like get_master_schedule works fine for dates in the past but struggles with future dates. After playing with the source code a bit, it seems like the issue is in these lines:
scores <- as.numeric(gsub("[^0-9]", "", gsub("\\(.*\\)", "", unlist(strsplit(completed$result, ",")))))
winning_scores <- scores[seq(1, length(scores) - 1, 2)]
losing_scores <- scores[seq(2, length(scores), 2)]
The length of scores (for future dates) is 1 and that is throwing an error when trying to find winning scores and losing scores. On first thought maybe an if...else statement could clarify that issue. I should have submitted a PR but was too lazy to fork the code -___-.
I could also be using the function outside of the intended scope.
Use skipNul = T in scan and force it to try multiple times to read. This may be an issue in ESPN HTML. Check if it resolves on its own; if not, I will need to make each call to scan() in the package more robust with trial-and-error catching.
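A minimal sketch of the retry idea. with_retries() is a hypothetical helper, not something in the package, and the scan() call in the comment uses a placeholder url.

```r
# Call `fn` up to `max_tries` times, pausing between attempts.
# Re-raises the last error if every attempt fails.
with_retries <- function(fn, max_tries = 3, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop(last_error)
}

# Example with scan(); `url` is a placeholder for the page being read.
# lines <- with_retries(function() scan(url, what = "", sep = "\n", skipNul = TRUE))
```

Wrapping each read in a closure keeps the retry logic in one place, so individual scan() calls would not need their own error handling.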
Disclosure: I am scraping every game for every team since the 2016-17 season. I have finished 2016-17, and Winthrop was the only team that this happened for. For the 2018-19 season, Cal and BC have thrown an error so far. I will update if I find any others.
> sched = get_schedule('Winthrop', '2016-17')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("400916662", "400916669", :
replacement has 33 rows, data has 34
> sched = get_schedule('Cal', '2018-19')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("401087057", "401087058", :
replacement has 31 rows, data has 32
> sched = get_schedule('Boston College', '2018-19')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("401082613", "401082614", :
replacement has 31 rows, data has 32