lbenz730 / ncaahoopr

An R package for working with NCAA Basketball Play-by-Play Data

License: MIT License

R 100.00%

basketball college-basketball espn rstats sports

ncaahoopr's Issues

get_boxscore for teams with hyphens

Seems to be an issue with pulling box scores for a matchup where there is a team name containing a hyphen. It will return a list with the other team's data missing and with a name of ''

Can be fixed by just changing:

matchup <- unlist(strsplit(pagetext, "-"))[[1]][1]

to:

matchup <- unlist(strsplit(pagetext, " - "))[[1]][1]

so it splits on the correct hyphen and not on the potential team name hyphen.
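
A minimal sketch of why the separator matters. The page title below is a made-up stand-in (the real ESPN text may look different), and `bad_matchup` / `good_matchup` are illustrative names, not package variables:

```r
# Hypothetical page title; the real ESPN text may differ.
pagetext <- "Arkansas-Pine Bluff vs. Duke - Box Score - ESPN"

# Splitting on a bare hyphen cuts the hyphenated team name in half.
bad_matchup <- strsplit(pagetext, "-", fixed = TRUE)[[1]][1]

# Splitting on a hyphen surrounded by spaces keeps the team name intact.
good_matchup <- strsplit(pagetext, " - ", fixed = TRUE)[[1]][1]

print(bad_matchup)   # "Arkansas"
print(good_matchup)  # "Arkansas-Pine Bluff vs. Duke"
```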

Prior Year Schedules

Hi Luke - Two quick items:

1. get_master_schedule might be having an issue pulling up the first day of the 2018-19 season. For example:
   get_master_schedule("2018-11-06") ----> isn't working
   get_master_schedule("2018-12-06") ----> is working
   Am I doing something incorrect, or is there any guidance on how to get the game_ids from that date?
2. Do you possibly have code or a link to get the full season of game_ids/schedule for prior years? I'm still new to R and have been trying to come up with various ways to get at this, but from a QC perspective I would be interested to know if there is a faster way as well.

Thanks so much!

Michael

shot_team NA in middle of game

I'm guessing this is due to player name discrepancies. Could these be assigned via the possession_before column?

df$shot_team = ifelse(is.na(df$shot_team) & !is.na(df$shot_outcome), df$possession_before, df$shot_team)
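
For illustration, here is the same idea on a toy data frame. The rows are invented; only the column names come from the snippet above. It uses logical indexing instead of ifelse(), which should give the same result for character columns:

```r
# Toy play-by-play rows; values are invented for illustration.
df <- data.frame(
  shot_team = c("Duke", NA, NA),
  shot_outcome = c("made", "missed", NA),
  possession_before = c("Duke", "UNC", "UNC"),
  stringsAsFactors = FALSE
)

# Fill shot_team only where a shot happened but no team was assigned.
needs_fill <- is.na(df$shot_team) & !is.na(df$shot_outcome)
df$shot_team[needs_fill] <- df$possession_before[needs_fill]

print(df$shot_team)  # "Duke" "UNC" NA
```

The third row stays NA because it has no shot_outcome, so it is not a shot at all.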

Clean Up Shot Locations

Move Free Throws --> FT line (currently at basket)
Create scheme to move 3's closer to line
Move 3PT Line back 1 foot in graphic if possible.

get_game_ids : Historical year

I'll randomly get an error with these types of queries (pulling historical game ids) - the most recent to pop up is:

ex: get_game_ids(team = 'Virginia Tech', season ='2017-18')

Error in if (reg_flag > 1) { : argument is of length zero

Is it safer to just download archives? I'm deploying a Shiny app that allows querying of shot location data by team, year and player, and I want to support historical querying while also keeping app size down. It seems that historical pulling has its drawbacks, though, since ESPN changes things on their backend and random errors pop up 🤷‍♂️

`free_throw` column incorrect

The free_throw column appears to be NA or FALSE only (it never shows up as being TRUE and all of the shot information of the free throw attempt in shot_team, shot_outcome is NA as well). For example, when running for the Duke vs. Mich St. game: ncaahoopR::get_pbp_game("401260005"), this issue occurs.

Box Scores

Would love to get box score function working

Option for Naive Win Probability in Win Prob Functions

GEI calculations
wp_chart and gg_wp_chart

get_schedule function error

Some others and I are having issues with the get_schedule and get_pbp functions. Everything else is working fine, but something is off with these two. They produce the same error message:

Error in get_schedule("Penn State", "2020-21") :
No team schedule available for Penn State / 2020-21. Current ESPN season = "2020-21". If you are trying to find the most recent season (2019-20), please supply season = "2019-20" argument.

Any help or solution would be appreciated!

PBP Scrape Issue with Games on 1/2/21

Using the commands below I'm trying to get PBP data for 1/2/21. Error causes no data to be saved.

Other recent dates seem to be fine.

Thanks
-Mike

PBP Parse Errors

Log of PBP parsing errors:

game_id: 401082334, play_id 209 (player not known)
game_id: 401082334, play_id 87 (defensive rebound at same time as shot)

dict ESPN_PBP name fixes

I noticed some of the PBP names are off in the 'dict' dataframe. I have been scraping boxscores for the 2016-17 and 2017-18 season but it looks like the name in the boxscore list is the same as the name in pbp data (I only spot checked a few of these in pbp but the boxscores for sure use this name). These are the changes that I have reconciled:

dict$ESPN_PBP[dict$ESPN_PBP == 'Arkansas State'] = 'Arkansas St'
dict$ESPN_PBP[dict$ESPN_PBP == 'Fort Wayne'] = 'Purdue Fort Wayne'
dict$ESPN_PBP[dict$ESPN_PBP == 'Georgia State'] = 'Georgia St'
dict$ESPN_PBP[dict$ESPN_PBP == 'Little Rock'] = 'Arkansas-Little Rock'
dict$ESPN_PBP[dict$ESPN_PBP == 'Louisiana'] = 'Lafayette'
dict$ESPN_PBP[dict$ESPN_PBP == 'Loyola-Chicago'] = 'Loyola Chicago'
dict$ESPN_PBP[dict$ESPN_PBP == 'McNeese'] = 'Mcneese St'
dict$ESPN_PBP[dict$ESPN_PBP == "Mt. St. Mary's"] = "Mt. St. Mary'S"
dict$ESPN_PBP[dict$ESPN_PBP == "San JosÃ© St"] = "San José State"
dict$ESPN_PBP[dict$ESPN_PBP == 'Southeast Missouri'] = 'Southeast Missouri State'
dict$ESPN_PBP[dict$ESPN_PBP == 'Seattle'] = 'Seattle U'
dict$ESPN_PBP[dict$ESPN_PBP == 'SIU-Edwardsville'] = 'SIU Edwardsville'
dict$ESPN_PBP[dict$ESPN_PBP == 'UL Monroe'] = 'Ul Monroe'
dict$ESPN_PBP[dict$ESPN_PBP == 'UMKC'] = 'UM Kansas City'
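
If the list keeps growing, a named vector might be easier to maintain than one assignment per team. A rough sketch on a toy `dict` (the real one ships with the package and has more columns). `fixes` and `hit` are names I made up:

```r
# Toy stand-in for the package's dict data frame.
dict <- data.frame(
  ESPN_PBP = c("Arkansas State", "Duke", "UMKC"),
  stringsAsFactors = FALSE
)

# Old name on the left, reconciled name on the right (a subset of the list above).
fixes <- c(
  "Arkansas State" = "Arkansas St",
  "UMKC" = "UM Kansas City"
)

# Only touch rows whose current name has an entry in the lookup.
hit <- dict$ESPN_PBP %in% names(fixes)
dict$ESPN_PBP[hit] <- unname(fixes[dict$ESPN_PBP[hit]])

print(dict$ESPN_PBP)  # "Arkansas St" "Duke" "UM Kansas City"
```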

GET_PBP Parse Error

Pretty sure I am up to date on all packages, but am new to using R/ncaahoopR.

Getting this error when pulling full season PBP:

Thanks for all that you do Luke!
-Mike

Error 403 forbidden get_pbp_game

> get_pbp_game(400915704)
Scraping Data for Game: 1 of 1
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'http://barttorvik.com/2017_rosters.json': HTTP status was '403 Forbidden'

Improve shot chart graphics

See here https://themockup.blog/posts/2020-08-28-heatmaps-in-ggplot2/#dont-bury-the-lede

Errors with objects

get_pbp_game not working

When running get_pbp_game(gameid), it begins scraping and then gets cut off with an error that reads "Error in if (length(tmp) < ncol(tmp[[1]]) | length(tmp) == 0) { :
argument is of length zero".
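
One way this message can arise (a guess, since I don't know what `tmp` actually holds when the scrape fails): if the first element of `tmp` is not a table, ncol() returns NULL and the whole condition collapses to length zero. The sketch below reproduces that and shows a hypothetical guard, `is_unusable()`, which is not part of the package:

```r
# A stand-in for a failed scrape: a list whose first element is not a table.
tmp <- list(character(0))

# ncol() of a plain vector is NULL, so the comparison has length zero.
cond <- length(tmp) < ncol(tmp[[1]]) | length(tmp) == 0
print(length(cond))  # 0

msg <- tryCatch(
  if (cond) "bad" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)  # "argument is of length zero"

# Hypothetical guard: check the cheap cases first and short-circuit with ||.
is_unusable <- function(tmp) {
  length(tmp) == 0 ||
    is.null(ncol(tmp[[1]])) ||
    length(tmp) < ncol(tmp[[1]])
}

print(is_unusable(tmp))     # TRUE
print(is_unusable(list()))  # TRUE
```

The same pattern probably applies to the `if (reg_flag > 1)` errors reported elsewhere: a zero-length value reaches an if() that expects a single TRUE/FALSE.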

object 'wp_hoops' not found

Hi there, awesome library!
I have encountered this issue when trying to get pbp data:

Error in get_pbp_game(game_ids) : object 'wp_hoops' not found

`shot_team` for free throws missing

Not sure if you saw my comment under the recently closed issue #54, so I will open this new issue here: The shot_team column is still blank for free throws. I think adding

pbp$shot_team[(made_shots | missed_shots) & tolower(pbp$shooter) %in% tolower(home_roster)] <- pbp$home[1]
pbp$shot_team[(made_shots | missed_shots) & tolower(pbp$shooter) %in% tolower(away_roster)] <- pbp$away[1]

to the end of the if (extra_parse == T) statement should fix this. I think these free throw related problems are occurring this year because ESPN is no longer marking the shot location of free throws as right underneath the basket, at (25, 87.94222222222221) or (25, 4.177777777777778). Thanks for your hard work maintaining the package!
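
To make the roster-matching idea concrete, here is a toy version. Player names and rows are invented, and `is_shot` is a simplified stand-in for the package's `made_shots | missed_shots` mask, which I have not reproduced:

```r
# Toy play-by-play; player names and rows are invented.
pbp <- data.frame(
  home = "Duke",
  away = "Michigan State",
  shooter = c("ALEX HOME", "Sam Away", NA),
  shot_team = NA_character_,
  stringsAsFactors = FALSE
)
home_roster <- c("Alex Home")
away_roster <- c("Sam Away")

# Simplified stand-in for (made_shots | missed_shots).
is_shot <- !is.na(pbp$shooter)

# Case-insensitive roster match decides which team took the shot.
pbp$shot_team[is_shot & tolower(pbp$shooter) %in% tolower(home_roster)] <- pbp$home[1]
pbp$shot_team[is_shot & tolower(pbp$shooter) %in% tolower(away_roster)] <- pbp$away[1]

print(pbp$shot_team)  # "Duke" "Michigan State" NA
```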

Object 'court' not found

When I am trying to generate shot charts, I am unable to get these charts because the variable "court" doesn't seem to be defined. Is there an issue with my environment?

edit: closing issue, I was using ncaahoopR:: rather than importing the library. Importing the library solved the problem.

Possession Parsing

Better parsing in pbp logs of possession, fouls, shot clock etc. Requires historical rosters from #17

Issue w/ old get_game_ids

Choose HTML element 66 instead of 65 due to ESPN web page refresh.

Date

Dates are getting parsed to NA. Will probably have to change the way I determine date of game.

Error when scraping data (get_pbp)

I'm trying to get data from UVA (Virginia Cavaliers) for 2019-20 and I see this Error:

get_pbp("UVA","2019-20")

Getting Game IDs: UVA
Scraping Data for Game: 1 of 13
Error in if (extra_parse & (pbp$date[1] >= "2007-11-01")) { :
missing value where TRUE/FALSE needed

A similar message appears for all teams and seasons when I try to execute the code.

I'm sorry for the inconvenience, but what am I doing wrong?
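
For what it's worth, this message is what R gives when the condition inside if() evaluates to NA, which would fit a game date that failed to parse. A small reproduction, with `game_date` standing in for `pbp$date[1]` and a hypothetical isTRUE() guard that is not in the package:

```r
# Stand-ins for the values inside get_pbp_game(); a date that failed to parse.
extra_parse <- TRUE
game_date <- as.Date(NA)

# TRUE & NA is NA, and if (NA) raises the error from the report.
msg <- tryCatch(
  if (extra_parse & (game_date >= "2007-11-01")) "parse" else "skip",
  error = function(e) conditionMessage(e)
)
print(msg)  # "missing value where TRUE/FALSE needed"

# Hypothetical guard: isTRUE() treats NA as FALSE instead of failing.
safe <- if (isTRUE(extra_parse && game_date >= as.Date("2007-11-01"))) "parse" else "skip"
print(safe)  # "skip"
```

The guard only hides the symptom. The date would still need to be parsed correctly for the extra parsing to run.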

New teams to D1 - create_ids_df

Four new teams have moved to D1 this year, and it doesn't look like your create_ids_df function has been updated to reflect this. To start, when you initiate the "ids" data frame, it should have 357 observations instead of 353. Then, it looks like there is a bug when you fill in the rest of the teams at the end of the function. Some of the teams were mismatched as seen in the snip below:

Is there any way to get substitution information in the play by play?

It seems like there's no substitution info on ESPN. I found that the play by play from NCAA.org contains this information, for example here. I wonder if you could add this feature. Thank you so much

get_master_schedule - cancelled games returns error (because of invalid scores)

Hey @lbenz730.

SEC charts just bounced back on me, and I realized today that get_master_schedule throws an error for canceled/postponed games. The reason for this is right here.

When you filter down to completed games, the scores are filled with NA, so this if statement with the seq throws an error.

The score variable is just a vector of length 1 with NA in it.

 if(length(scores) > 0) {
     winning_scores <- scores[seq(1, length(scores) - 1, 2)]
     losing_scores <- scores[seq(2, length(scores), 2)]
     
     index <- sapply(completed$away_anchor, function(y) { y %in% winners })
     completed$home_score[index] <- losing_scores[index]
     completed$home_score[!index] <- winning_scores[!index]
     completed$away_score[!index] <- losing_scores[!index]
     completed$away_score[index] <- winning_scores[index]
 }
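
A sketch of the failure and one possible guard. `split_scores()` is a hypothetical helper, not a package function, and the assumption that `scores` is a single NA comes from the description above:

```r
# What a cancelled game leaves behind, per the description above.
scores <- NA_real_

# length(scores) - 1 is 0, so seq(1, 0, 2) fails before any indexing happens.
msg <- tryCatch(
  scores[seq(1, length(scores) - 1, 2)],
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical helper: only split when there is at least one full,
# non-missing pair of scores; otherwise return NULL.
split_scores <- function(scores) {
  if (length(scores) < 2 || all(is.na(scores))) {
    return(NULL)
  }
  list(
    winning = scores[seq(1, length(scores) - 1, 2)],
    losing = scores[seq(2, length(scores), 2)]
  )
}

print(split_scores(scores))             # NULL
print(split_scores(c(80, 75, 66, 60)))  # winning: 80 66, losing: 75 60
```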

Shot location orientation and scale

It's unclear how to interpret the shot location data, so I've got a few questions:

What's the correct perspective? Staring down at the width of the court like a typical broadcast?
Where is (0, 0)? The bottom left corner of the court with the above perspective?
What's the scale? Feet?

These are probably worth adding to the docs as well. Cheers.

Error installing package with R v4.0.2

Hi,

I had a couple of errors trying to install ncaahoopR, and I'm wondering if they're problems on my end or with the package.

If I try to install the package using 'install.packages' I get the following error:
Warning in install.packages :
package ‘ncaahoopR’ is not available (for R version 4.0.2)

If I try to install the package using the github repo + devtools I get this error:

Homebrew 2.1.4
Homebrew/homebrew-core (git revision 3d75; last commit 2019-06-03)
Using PKG_CFLAGS=-I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/openssl/include
--------------------------- [ANTICONF] --------------------------------
Configuration failed because openssl was not found. Try installing:

deb: libssl-dev (Debian, Ubuntu, etc)
rpm: openssl-devel (Fedora, CentOS, RHEL)
csw: libssl_dev (Solaris)
brew: openssl@1.1 (Mac OSX)
If openssl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a openssl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
-------------------------- [ERROR MESSAGE] ---------------------------
tools/version.c:1:10: fatal error: 'openssl/opensslv.h' file not found
#include <openssl/opensslv.h>
^~~~~~~~~~~~~~~~~~~~
1 error generated.

ERROR: configuration failed for package ‘openssl’

removing ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/openssl’
Error: Failed to install 'ncaahoopR' from GitHub:
(converted from warning) installation of package ‘openssl’ had non-zero exit status

Thanks for any help,
Aditya

New conference changes

2019 conference changes--change in ids dataset:

School	Former Conference	New Conference
Merrimack	Northeast-10 Conference (D-II)	Northeast Conference
Savannah State	Mid-Eastern Athletic Conference	Southern Intercollegiate Athletic Conference (D-II)

Getting rosters from previous years

Hi Luke,

It looks like ESPN has updated all the rosters for the upcoming season, but now when I try to get the rosters from last year I am getting an error. The link is working to get to the .json data (https://barttorvik.com/2020_rosters.json), but it looks like it is breaking at the fromJSON function. Is this an internal issue on my end, or something new on Bart's site?

Thanks,
Justin

Previous Season Schedules

Greetings!

First of all, thank you so much for this package. I've used it a ton through this season.

I've noticed the "get_schedule()" function isn't working properly when you pass a previous season argument. It does, however, still work flawlessly for the current season. Am I correct in that the format is "2019-20"?

Thanks!

Extra Comma in DESCRIPTION Author Field

Authors@R: c(person("Luke", "Benz", email = "[email protected]", role = c("aut", "cre")), 
             person("Meyappan", "Subbaiah", role = c("ctb")),
             person("Luke", "Morris", role = c("ctb")), 
             person("Jared", "Andrews", role = c("ctb")),
             person("C. Ryan", "Campbell", role = c("ctb")),
             person("Ty", "Walters", role = c("ctb")),
             person("Jack", "Lichtenstein", role = c("ctb")),
             person("Kurt", "Wirth", role = c("ctb")),
             person("Jason", "Maddox", role = c("ctb")),)

As can be seen above, there is a superfluous comma at the end of the list, which breaks installs.
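
The Authors@R field is evaluated as R code, so the trailing comma leaves c() with an empty last argument. The sketch below shows the same kind of failure with a plain character vector. The exact message for person() objects may differ, and `expr` / `msg` are just illustrative names:

```r
# A call with a trailing comma parses fine but fails when evaluated.
expr <- parse(text = 'c("aut", "cre", )')[[1]]

msg <- tryCatch(
  eval(expr),
  error = function(e) conditionMessage(e)
)
print(msg)  # "argument 3 is empty"

# Without the trailing comma the same call evaluates normally.
ok <- eval(parse(text = 'c("aut", "cre")')[[1]])
print(ok)  # "aut" "cre"
```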

Errors with Get Schedule/ Get PBP

Hi Luke,

When I try to pull schedules for teams I get the following error--

get_schedule("Duke")
Error in XML::readHTMLTable(RCurl::getURL(url))[[1]] : subscript out of bounds

Additionally when I try to get a team's PBP I get the following error:

get_pbp("Duke")
Getting Game IDs: Duke
Error in XML::readHTMLTable(RCurl::getURL(url))[[1]] : subscript out of bounds

If I feed in a game id then I don't get the same error, but I get the following error if I try to get historical game ids:

get_game_ids("Duke", season = "2017-18")
Error in if (reg_flag > 1) { : argument is of length zero

Appreciate any help,
Aditya

Timeout Indicators

Fix bug in home/away timeout indicator variables.

LICENSE file is not MIT text

DESCRIPTION lists this package as having a license of MIT + file. However, the LICENSE file is not the standard MIT text. I'm planning to fork this package to add women's data. I can update the LICENSE file in the process if you really intend this to be MIT.

Parsing error on Campbell game

Great library, man -- thanks for putting this all together and enabling better analysis of CBB!

Currently trying to iterate through a set of teams from the library's team dictionary and pull in some play-by-play data, and I'm running into an issue with get_pbp('Campbell'). Looks like on Campbell's 21st game, the PBP parser bails out with the error: Error in if (length(tmp) < ncol(tmp[[1]]) | length(tmp) == 0) { : argument is of length zero.

Working backward through the stacktrace, I'm fairly sure this line is the culprit. Looks like the URL for the game in question (https://www.espn.com/mens-college-basketball/game?gameId=401171028) just redirects to the NCAA scoreboard, rather than displaying any game data.

Let me know if you need any other information to debug this!

Edit: Looks like Coastal Carolina has the same issue in its second game, based on its schedule. Hampden-Sydney is not a D1 school (from what I can tell), so it doesn't have an ESPN game summary associated with games against it.

Font Error in assist_net function

Hey Luke,

Nifty package, quite a fan. Thanks for making it.

I've run into an error when trying to create an assist_network and have it saved as a pdf. I use Jupyter notebook and R 3.4.4. Plotting within the notebook works fine, but the error occurs in R sessions opened from the command line as well. Not a huge deal, but I have been struggling with text getting cut off and wanted to bring it back into frame in Illustrator or something rather than distort the entire image by screwing with image dimensions.

To replicate:

pdf("tester.pdf")
assist_net("Louisville", 401082663, louisville$primary_color, message = "Assist Network - Louisville v Clemson 02/16/2019")
dev.off()

Error (repeated a few dozen times):

Warning message in text.default(x, y, labels = labels, col = label.color, family = label.family, :
“font family 'Arial Black' not found in PostScript font database”

Clearly, the Arial Black font is missing, but I'm not sure how to rectify it. I tried mucking about with the extrafont package, but with no luck. Any hints or solutions would be much appreciated. Thanks!

get_pbp issue

Hey Luke -

Came across another issue when trying to pull the Army PBP data after re-installing the 1.5.2 version of ncaahoopR.

Getting Game IDs: LIU
Scraping Data for Game: 1 of 22
Error in get_roster(dict$ESPN[dict$ESPN_PBP == pbp$away[1]], year) :
Invalid team. Please consult the ids data frame for a list of valid teams, using data(ids).

Seems to trigger when trying to pull LIU data.

Reduce scraping requests with memoise

Decrease the number of scraping requests and speed things up with memoise (https://github.com/r-lib/memoise). Memoise can cache function results (locally or in the cloud), useful for functions called multiple times with the same input. For example, functions relying on get_pbp_game() will benefit from caching get_pbp_game() results.

Check out googleAnalyticsR (http://code.markedmondson.me/googleAnalyticsR/) to see memoise in action.
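
With the package itself this would presumably be a one-liner along the lines of memoise::memoise(get_pbp_game). To show the caching idea without extra dependencies or live requests, here is a hand-rolled base R version. `make_cached()` and `slow_fetch()` are hypothetical stand-ins, not memoise or ncaahoopR functions:

```r
# Hand-rolled cache keyed by a single argument (a stand-in for memoise).
make_cached <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    k <- as.character(key)
    # Only call the underlying function the first time a key is seen.
    if (!exists(k, envir = cache, inherits = FALSE)) {
      assign(k, f(key), envir = cache)
    }
    get(k, envir = cache, inherits = FALSE)
  }
}

# Stand-in for an expensive scrape; counts how often it really runs.
n_calls <- 0
slow_fetch <- function(game_id) {
  n_calls <<- n_calls + 1
  data.frame(game_id = game_id, stringsAsFactors = FALSE)
}

cached_fetch <- make_cached(slow_fetch)
first <- cached_fetch("401082663")
second <- cached_fetch("401082663")  # served from the cache

print(n_calls)  # 1
```

Caching only helps for finished games. Anything in progress would need to bypass or expire the cache.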

Scores misrepresented in rows

Rows sometimes seem to get shuffled out of order. Need to do some post processing on scores
https://twitter.com/jayau92/status/1196640237478436865

Fix get_master_schedule

For future dates w/out any games

Fix historical schedules

Error in if (reg_flag > 1) { : argument is of length zero

Summer HTML mixup

Fix functions when ESPN does summer archiving. Add historical data for past years as well.

Single shot charts/shot heatmaps for series of opponents

Similar to what team_shot_chart() does for mapping a team's offense, it would be great if you could input a vector of game ids for a particular team and get an aggregated shot chart or shot heatmap of their opponents' shooting.

I've attempted using team != "Kansas" within team_shot_chart() to execute something like this with no luck.

Add Historical Rosters and Schedules

Certain functions that rely on rosters and/or schedules are limited in their scope to the current season. Would be nice to extend to past seasons.

error running get_roster

I tried to do ncaahoopR::get_roster("Kansas") and it spit out the Kennesaw State roster. It looks like ESPN might have changed the links in their system. For example "kennesaw-st-owls" is now "kennesaw-state-owls". Now the create_ids_df function appears to be the source of this error. I went ahead and updated the "ids.csv" file from lbenz730/NCAA_Hoops_Play_By_Play by changing some of the links that appeared to be changed, and now it appears to be working.
ids_JE.zip

get_master_schedule doesn't work for dates in advance

Hey @lbenz730

Looks like get_master_schedule works fine for dates in the past but struggles with future dates. After playing with the source code a bit, it seems like the issue is in these lines:

scores <- as.numeric(gsub("[^0-9]", "", gsub("\\(.*\\)", "", unlist(strsplit(completed$result, ",")))))
winning_scores <- scores[seq(1, length(scores) - 1, 2)]
losing_scores <- scores[seq(2, length(scores), 2)]

The length of scores (for future dates) is 1 and that is throwing an error when trying to find winning scores and losing scores. On first thought maybe an if...else statement could clarify that issue. I should have submitted a PR but was too lazy to fork the code -___-.

I could also be using the function outside of the intended scope.
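
As an alternative to an if...else, here is a sketch that avoids seq() entirely by using recycled logical indexes, which do not fail on short or empty vectors. The result strings are invented (the real ESPN format may differ), and `parse_scores()` is a hypothetical wrapper around the regex from the lines above:

```r
# Same regex steps as the lines above, wrapped in a hypothetical helper.
parse_scores <- function(result) {
  parts <- unlist(strsplit(result, ","))
  no_parens <- gsub("\\(.*\\)", "", parts)   # drop things like "(OT)"
  as.numeric(gsub("[^0-9]", "", no_parens))  # keep digits only
}

# Invented result strings for two completed games.
results <- c("Duke 80, UNC 75 (OT)", "Kansas 66, Baylor 60")
scores <- parse_scores(results)

# Odd positions are winners, even positions are losers.
winning_scores <- scores[c(TRUE, FALSE)]
losing_scores <- scores[c(FALSE, TRUE)]
print(winning_scores)  # 80 66
print(losing_scores)   # 75 60

# With no completed games nothing errors; the vectors are just empty.
no_games <- parse_scores(character(0))
print(length(no_games[c(TRUE, FALSE)]))  # 0
```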

Issue w/ Nulls and Not Scanning Correctly

Use skipNul = T in scan and force it to try multiple times to read. This may be an issue in ESPN HTML. Check if resolved on its own; if not, I will need to make each call to scan() in the package more robust with trial and error catching.
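
A rough sketch of what such a wrapper could look like. `scan_with_retry()` is a hypothetical name and the retry count and wait time are placeholders. It is demonstrated on a temp file rather than a live ESPN page:

```r
# Hypothetical wrapper: retry scan() a few times, skipping embedded nuls.
scan_with_retry <- function(path, max_tries = 3, wait_secs = 0) {
  for (attempt in seq_len(max_tries)) {
    out <- tryCatch(
      suppressWarnings(
        scan(path, what = "", sep = "\n", skipNul = TRUE, quiet = TRUE)
      ),
      error = function(e) NULL  # treat any read error as "try again"
    )
    if (!is.null(out)) {
      return(out)
    }
    Sys.sleep(wait_secs)
  }
  stop("could not read ", path, " after ", max_tries, " tries")
}

# Demonstration on a local temp file instead of a URL.
tmp_file <- tempfile(fileext = ".html")
writeLines(c("<html>", "</html>"), tmp_file)
page <- scan_with_retry(tmp_file)
print(page)  # "<html>" "</html>"
```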

get_sched error for certain teams in certain years

Disclosure: I am scraping every game for every team since the 2016-17 season. I have finished 16-17, and Winthrop was the only team that this happened for. For the 18-19 season, Cal and BC have thrown an error so far. I will update if I find any others.

> sched = get_schedule('Winthrop', '2016-17')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("400916662", "400916669",  : 
  replacement has 33 rows, data has 34

> sched = get_schedule('Cal', '2018-19')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("401087057", "401087058",  : 
  replacement has 31 rows, data has 32

> sched = get_schedule('Boston College', '2018-19')
Error in `$<-.data.frame`(`*tmp*`, "game_id", value = c("401082613", "401082614",  : 
  replacement has 31 rows, data has 32