cdalzell / lahman


R Package Containing Sean Lahman's Baseball Database

Home Page: https://cran.r-project.org/web/packages/Lahman/Lahman.pdf

R 100.00%


lahman's Issues

Awards for Managers and Players don't appear to be updated

It looks like both Awards tables need to be updated.

library(tidyverse)
packageVersion("Lahman")
#> [1] '8.0.1'

Lahman::AwardsManagers %>% 
  .$yearID %>% 
  max()
#> [1] 2016


Lahman::AwardsPlayers %>% 
  .$yearID %>% 
  max()
#> [1] 2017
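The same check can be run over several tables at once. The following is a minimal base-R sketch; `latest_year()` is a hypothetical helper name, and the toy data frames only stand in for the real Lahman tables.

```r
# Hypothetical helper: latest season found in each table of a named list.
latest_year <- function(tables, year_col = "yearID") {
  vapply(
    tables,
    function(df) as.numeric(max(df[[year_col]], na.rm = TRUE)),
    numeric(1)
  )
}

# Toy stand-ins for the real tables (values chosen for illustration only)
toy_tables <- list(
  AwardsManagers = data.frame(yearID = c(2015L, 2016L)),
  AwardsPlayers  = data.frame(yearID = c(2016L, 2017L))
)
latest_year(toy_tables)

# With the package installed, the same call works on the real tables:
# latest_year(list(AwardsManagers = Lahman::AwardsManagers,
#                  AwardsPlayers  = Lahman::AwardsPlayers))
```

Any table whose latest year lags behind the others is a candidate for a missed update.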

Delete MD5 file, add source_data/ to .Rbuildignore

The MD5 file is outdated, and should be deleted because it causes errors/warnings in package build/check.

The source_data/ directory is useful to preserve, but should be added to .Rbuildignore because
it gives warnings in build/check.
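For reference, `.Rbuildignore` holds one regular expression per line, and matching paths are left out of the built tarball. The sketch below appends an entry from R without duplicating it; `add_build_ignore()` is a hypothetical helper, and the demo runs in a throwaway directory rather than the real repository.

```r
# Hypothetical helper: append a regex to .Rbuildignore unless it is already there.
add_build_ignore <- function(pattern, pkg_dir = ".") {
  ignore_file <- file.path(pkg_dir, ".Rbuildignore")
  existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character(0)
  if (!(pattern %in% existing)) {
    writeLines(c(existing, pattern), ignore_file)
  }
  invisible(readLines(ignore_file))
}

# Demo in a temporary directory standing in for the package root
demo_pkg <- tempfile("pkgroot")
dir.create(demo_pkg)
add_build_ignore("^source_data$", demo_pkg)
add_build_ignore("^source_data$", demo_pkg)  # second call is a no-op
readLines(file.path(demo_pkg, ".Rbuildignore"))
```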

move baseballdatabank-master/ out of data/

I ran R CMD check in a local clone of the feature/2015-data-update branch. It gives a warning on baseballdatabank-master:

* checking contents of 'data' directory ... WARNING
Files not of a type allowed in a 'data' directory:
  'baseballdatabank-master'
Please use e.g. 'inst/extdata' for non-R data files
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking examples ... OK
* DONE
Status: 1 WARNING, 1 NOTE

Vignettes: payroll

The "payroll.Rmd" file does not knit; there is an error at line 210, in the car::Boxplot() function. I'm not familiar with the package, and can't diagnose the source of the error.

Update README To Provide devtools#install_github Instructions

Add Missing Examples

There are currently no examples in the following man files:

FieldingOF
FieldingPost
PitchingPost
SeriesPost
TeamsFranchises
TeamsHalf

It would be nice to have some added, even to the ones that are just separate tables for post season play!

Missing table - FieldingOFsplit

It looks like the table FieldingOFsplit is missing. I think you need it to cover OF splits for years subsequent to 1955.

Migrate package url from R-Forge

Now that we have vignettes, perhaps it is time to abandon the old URL: http://lahman.r-forge.r-project.org/ in the package DESCRIPTION.

One main thing that might be interesting there is the link http://lahman.r-forge.r-project.org/doc/ to an old version of the package documentation, with results of all examples. This could be redone using pkgdown, and then the URL changed to a github.io based version.

There might be a few other links worth preserving.

My sense is that if we do this, it should be after the CRAN release of v.7.0-0
Thoughts?

Add Unit Testing

There are a number of tests and code examples in the documentation. It would probably be helpful to turn these into actual unit tests, to speed up data validation and increase confidence in it.

There are a number of packages out there, but this one (http://r-pkgs.had.co.nz/tests.html) seems to have CRAN integration.
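As a rough illustration of what such tests could look like, here is a sketch using testthat (the framework covered by the linked chapter). The helper `has_unique_keys()` and the toy table are assumptions made for this example. In the package itself the test would live under tests/testthat/ and run against the real tables.

```r
# Hypothetical check: TRUE when no two rows share the same key values.
has_unique_keys <- function(df, keys) {
  !any(duplicated(df[, keys, drop = FALSE]))
}

# Toy stand-in for a Lahman table
toy_teams <- data.frame(
  yearID = c(2013L, 2013L, 2014L),
  teamID = c("CHA", "CHN", "CHA"),
  stringsAsFactors = FALSE
)

# Only run the testthat part when the package is available
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("toy Teams table is well formed", {
    testthat::expect_true(all(c("yearID", "teamID") %in% names(toy_teams)))
    testthat::expect_true(has_unique_keys(toy_teams, c("yearID", "teamID")))
  })
}
```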

Incorrect birthDate for johnsbi01

Data issue reported via email, confirmed in latest version.

library(Lahman)
library(dplyr)

Lahman::Master %>%
  filter(playerID == "johnsbi01") %>%
  select(playerID, nameFirst, nameLast, birthDate, debut) %>%
  mutate(debut_age = as.Date(debut) - as.Date(birthDate))

I'm reasonably sure no one has ever debuted in the MLB at -28216 days old.. :)

I'm seeing that there's a ton of work currently underway in our upstream data source. I'll submit a fix PR if that's not already fixed before this year's update is submitted to CRAN.
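A check along these lines could catch similar records in bulk. This is only a sketch: `flag_implausible_debut()`, the 14-year threshold, and the toy rows (including their IDs and dates) are assumptions for illustration, not values from the database.

```r
# Hypothetical helper: rows whose debut falls before a minimum plausible age.
flag_implausible_debut <- function(people, min_age_years = 14) {
  age_days <- as.numeric(as.Date(people$debut) - as.Date(people$birthDate))
  too_young <- !is.na(age_days) & age_days < min_age_years * 365.25
  people[too_young, , drop = FALSE]
}

# Toy data with made-up player IDs and dates
toy_people <- data.frame(
  playerID  = c("okplayer01", "badplayer01"),
  birthDate = c("1970-05-01", "1961-11-01"),
  debut     = c("1992-04-10", "1884-07-29"),
  stringsAsFactors = FALSE
)
flag_implausible_debut(toy_people)$playerID
```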

Cubs and White Sox Are Reversed In Teams Table

Issue reported via email:

I'm not sure if this error goes upstream, but in the most recent version of the R package, the names
for the Cubs and White Sox are reversed in the Teams table.
require(dplyr)
require(Lahman)

Teams %>%
 filter(teamID %in% c("CHA", "CHN") & yearID > 2010) %>%
 select(yearID, teamID, name)
The names should be reversed for 2013 and 2014.

object ‘Teams’ is not exported by 'namespace:Lahman'

Hello Lahman contributors,

I'm developing an R package (bbgraphsR) that uses the Teams dataframe from your package in one of my functions (viz_standings), but either when running devtools::load_all() or devtools::check(), I receive the following warning:

Warning: object ‘Teams’ is not exported by 'namespace:Lahman'

I indeed checked the NAMESPACE file and it makes sense because there is no line exporting that table.

In that case, can I save the table data and use it in my package?
Or is it expected to export that table in the future?

Thanks in advance for your feedback.

Regards.

Daniel.
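For context: datasets in a data package are normally lazy-loaded rather than listed in NAMESPACE, so a warning like this usually comes from an `importFrom(Lahman, Teams)` directive in the dependent package (an assumption about bbgraphsR, which is not shown here). Referring to the table as `Lahman::Teams`, or loading it explicitly, avoids the import. The helper below is a hypothetical sketch, demonstrated on `datasets::mtcars` so that it runs without Lahman installed.

```r
# Hypothetical helper: fetch a dataset from a package without importing it.
load_pkg_data <- function(name, package) {
  env <- new.env()
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env)
}

# Demo with a dataset that ships with base R
cars <- load_pkg_data("mtcars", "datasets")
dim(cars)

# The same pattern for Lahman (requires the package to be installed):
# teams <- load_pkg_data("Teams", "Lahman")
# or simply: teams <- Lahman::Teams
```

With this approach Lahman is listed under Imports or Suggests in DESCRIPTION, and no `importFrom()` line for the table is needed.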

Testing the Release v.6.0-0 branch

Within RStudio, I ran Build -> Check. Looks good, but there is one NOTE and a comment that may trigger a reply from the CRAN gatekeeper:

* checking installed package size ... NOTE
  installed size is  7.4Mb
  sub-directories of 1Mb or more:
    data   7.2Mb

This is standard, since it is mainly a data package. It is already addressed in the cran-comments.md file.

We also get a comment at the end that the Pitching examples take longer than 5 sec to run.

* checking examples ... OK
Examples with CPU or elapsed time > 5s
         user system elapsed
Pitching 4.96   0.22    5.18
* DONE

Appearances Example Returns Bad Data

The last example in appearances doc returns 207 players as having played 162 games that season, only four of whom have a player ID associated with them. The rest are NA.

This can be reproduced in the current version available on CRAN (3.0-1)

This will reproduce the issue:

library(Lahman)
library(plyr)

all162 <- ddply(subset(Appearances, yearID > 1960), .(yearID),
                summarise, allGamers = playerID[G_all == 162])
table(all162$yearID)
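Without diagnosing why the `ddply()` call yields NA values, one way to sidestep the summarise step is to filter the rows directly. The sketch below uses base R only; `all_gamers()` and the toy rows are hypothetical, and the column names are taken from the example above.

```r
# Hypothetical helper: players who appeared in exactly `games` games in a season.
all_gamers <- function(appearances, games = 162, after = 1960) {
  keep <- !is.na(appearances$G_all) &
    appearances$G_all == games &
    appearances$yearID > after
  appearances[keep, c("yearID", "playerID"), drop = FALSE]
}

# Toy stand-in for Appearances (made-up player IDs)
toy_app <- data.frame(
  yearID   = c(1961L, 1961L, 1962L, 1950L),
  playerID = c("aaa01", "bbb01", "ccc01", "ddd01"),
  G_all    = c(162L, 150L, 162L, 162L),
  stringsAsFactors = FALSE
)
res <- all_gamers(toy_app)
table(res$yearID)
```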

Inconsistent date format in Master table

In the Master table, two of the date fields (debut and finalGame) are represented inconsistently. Dates prior to January 1, 1900 are shown as yyyy-mm-dd (i.e. following the ISO 8601 standard). Dates after that are in m/d/yyyy.

(Note that a third date field in the table, birthDate, is in yyyy-mm-dd.)

This issue exists in the source csv version of the Lahman database (version 2015), and is not peculiar to the R version.

I have not checked other tables at this point.
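Until the source is fixed, the two formats can be normalized on the R side. The following is a minimal sketch; `parse_mixed_date()` is a hypothetical helper that assumes only the two formats described above occur and returns NA for anything else.

```r
# Hypothetical helper: parse a vector mixing yyyy-mm-dd and m/d/yyyy dates.
parse_mixed_date <- function(x) {
  x <- as.character(x)
  out <- rep(as.Date(NA), length(x))
  iso <- grepl("^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}$", x)
  us  <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)
  out[iso] <- as.Date(x[iso], format = "%Y-%m-%d")
  out[us]  <- as.Date(x[us], format = "%m/%d/%Y")
  out
}

# Made-up sample values covering both formats and one unparseable entry
dates <- parse_mixed_date(c("1899-10-01", "4/15/1905", "unknown"))
dates
```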

Pre-release: inform maintainers of packages that depend/suggest Lahman

Lahman has
Reverse suggests: broom, dplyr, poplite

Maintainers of those packages should be notified before release to CRAN. (I recall there was
something that caused dplyr to break in a previous release of Lahman.)

Updating for the v 7.0 release

2017 data update

Hi,

I wanted to try and update the data so that it included the 2017 season.
It seems Sean hasn't had a chance to upload to his site yet, but mentioned that the CSV files exist on github (https://twitter.com/seanlahman/status/974796328130744320)

They're at this location: https://github.com/chadwickbureau/baseballdatabank/

I was going to try and open a PR to add these in. It seems I should be trying to use the inst/scripts to ensure they're correctly saved as Rdata files and appropriate columns set to factors. Is that correct?
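As a rough picture of that conversion step (not a reproduction of what the inst/scripts files actually do), the sketch below reads one CSV, converts chosen columns to factors, and saves the result as an .RData file. The function name, the choice of factor columns, and the demo file are all assumptions.

```r
# Hypothetical helper: CSV -> data frame with factor columns -> <name>.RData
csv_to_rdata <- function(csv_path, name, factor_cols = character(0), out_dir = "data") {
  df <- read.csv(csv_path, stringsAsFactors = FALSE)
  for (col in intersect(factor_cols, names(df))) {
    df[[col]] <- factor(df[[col]])
  }
  assign(name, df)  # bind under the table's name so save() stores that name
  out_file <- file.path(out_dir, paste0(name, ".RData"))
  save(list = name, file = out_file, envir = environment())
  invisible(out_file)
}

# Demo with a made-up one-row CSV in a temporary directory
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(yearID = 2017L, lgID = "AL", teamID = "BOS"),
          tmp_csv, row.names = FALSE)
out <- csv_to_rdata(tmp_csv, "TeamsDemo",
                    factor_cols = c("lgID", "teamID"), out_dir = tempdir())

# Load it back into a fresh environment to inspect the result
e <- new.env()
load(out, envir = e)
str(e$TeamsDemo)
```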

Documentation for lgID

Can someone tell me what "ML" in lgID stands for? AL & NL are self-explanatory.

teamID errors in Salaries table for 2014

I teach a statistics course and assigned some data wrangling using the Salaries and Teams tables from the Lahman package for a midterm. An astute student pointed out some weirdness in the Salaries table for 2014. There are two extra teamIDs, NYM and SFG, in 2014 (leading to 32 national league teams total for that year). These codes appear to be the franchIDs as opposed to the teamIDs (NYN being the teamID and NYM the franchID; SFN being the teamID and SFG the franchID). But, in 2014, NYN and SFN (the correct teamIDs) also appear, with some salary data present. Code is below. I also pulled the .csv files from Sean Lahman's site and the error is there as well.

library(Lahman)
Teams <- Lahman::Teams
Salaries <- Lahman::Salaries
library(dplyr)

# create new payroll dataframe summing over players within teams for each year from Salaries table
payroll <- Salaries %>%
  group_by(yearID, teamID) %>%
  summarise(payroll = sum(salary))

# look at just 2013:2014...
payroll %>% 
  filter(yearID %in% c(2013, 2014)) %>%
  group_by(teamID) %>%
  summarise(n = n()) %>% 
  print(n = 32)

payroll %>%
  filter(yearID %in% c(2013, 2014), teamID %in% c("NYM", "NYN", "SFG", "SFN")) 

# note 32 teams total in NL for 2014: NYM and SFG each appear just in 2014
# by comparison, in the Teams table...
Teams %>% 
  filter(yearID %in% c(2013, 2014), teamID %in% c("NYM", "NYN", "SFG", "SFN")) %>%
  select(yearID:franchID)

#3 rows for NYN teamID in 2014; 24 for NYM team ID same year
Salaries %>% 
  filter(yearID == 2014, teamID %in% c("NYM", "NYN")) %>%
  group_by(teamID) %>%
  tally()

#1 row for SFN teamID in 2014; 27 for SFG team ID same year
Salaries %>% 
  filter(yearID == 2014, teamID %in% c("SFG", "SFN")) %>%
  group_by(teamID) %>%
  tally()

# looks like in Salaries table, two franchIDs (NYM and SFG) were mistakenly entered for a few rows instead of the teamID (NYN and SFN)
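Until the source data is corrected, the two codes can be recoded before summing. This is a sketch under the mapping described above (NYM to NYN, SFG to SFN); `fix_salary_team_ids()` is a hypothetical helper and the salary figures are made up.

```r
# Hypothetical helper: replace franchIDs that were entered in the teamID column.
fix_salary_team_ids <- function(salaries, recode = c(NYM = "NYN", SFG = "SFN")) {
  ids <- as.character(salaries$teamID)
  hit <- ids %in% names(recode)
  ids[hit] <- unname(recode[ids[hit]])
  salaries$teamID <- if (is.factor(salaries$teamID)) factor(ids) else ids
  salaries
}

# Toy stand-in for the 2014 rows of Salaries
toy_salaries <- data.frame(
  yearID = 2014L,
  teamID = c("NYM", "NYN", "SFG", "SFN"),
  salary = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
fixed <- fix_salary_team_ids(toy_salaries)
aggregate(salary ~ yearID + teamID, data = fixed, FUN = sum)
```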

2021 Data Update

The 2021 data update is being released & refined, time to update and release v10 of this package.

Currently watching the upstream 2022.x releases: https://github.com/chadwickbureau/baseballdatabank/releases

change of SchoolsPlayers -> CollegePlaying gives file not found errors

First attempt to build/install v. 4.0-0 gives the following error:

* installing *source* package 'Lahman' ...
files 'data/SchoolsPlayers.RData', 'man/SchoolsPlayers.Rd' are missing

I'm not sure why the name of this table was changed.

AllstarFull has missing gameNum for 1962

There were two All-Star games played in 1962, but the 8.0-0 release candidate has all data in AllstarFull from both games with zeros in the gameNum field.

SchoolsPlayers Is Now CollegePlaying

From readme2014.txt:

SchoolsPlayers has been replaced with a new table called CollegePlaying.
This reflects advances in the compilation of this data, largely led by
Ted Turocy. The old table reported college attendance for major league
players by listing a start date and end date. The new version has a
separate record for each year that a player attended. This allows
us to better account for players who attended multiple colleges or
skipped a season, as well as to identify teammates.

Need to make sure the .Rd files are created correctly and that the docs are updated appropriately.

Remove deprecated fields from the batting table

From readme2014.txt:

Removed two deprecated fields from the batting table. The G_batting and
G_old fields were rendered obsolete when we created the appearances table.
They've been removed from the batting table starting with this version

Need to:

Ensure the fields are removed from batting
Ensure appearances are correct
Update the documentation accordingly

dplyr::lahman opportunities??

I just noticed in vignette("databases", package="dplyr"), that Hadley has defined a bunch of
functions to use/illustrate dplyr operations with the Lahman database, but in sql form.
See also: ?dplyr::lahman

I have no idea whether this actually works somewhere, or is purely notional. It seems to require
a local sql database to be setup somewhere. But it seems to be something useful to explore
perhaps with Hadley's cooperation, for a future release.

CollegePlaying not merging

Hi,

I noticed that the playerID for the CollegePlaying dataset does not merge. Did you change identifier after 2014? Thank you

2019 Data Update

The 2019 data update has been released, time to update and release v8 of this package.

Feature branch: https://github.com/cdalzell/Lahman/tree/feature/2019-data-update

Preparing for v. 7.0 release

I'm looking at the branches with a view to preparing for the v. 7.0 release. A major feature will be the inclusion of vignettes, now all on the vignettes branch. This branch is now ~11 commits ahead of master and 6 commits behind. There will be more work on the vignettes branch as we go forward.

I'm not sure how to manage this on GitHub with a PR to master. Hopefully, there are no conflicts, so a fast-forward will work.

I think, but am not sure, that Chris' initial work with the 2017 database is on a feature/ branch.

Add Project Badges

Ex: http://www.r-pkg.org/services#badges

Wrong playerID for Rob Thomson in Managers

(Cross-posting from https://github.com/chadwickbureau/baseballdatabank/issues/143)

In the Managers table, the playerID given for Rob Thomson is thompro01 (which is Robby Thompson's) instead of thomsro99.

The other rows for Thomson (2008 NYA) and Thompson (2005 CLE and 2013 SEA) list the correct playerIDs.

library(dplyr, warn.conflicts = FALSE)
library(Lahman)

Managers %>% 
  filter(playerID == "thompro01" | playerID == "thomsro99")
#>    playerID yearID teamID lgID inseason   G  W  L rank plyrMgr
#> 1 thompro01   2005    CLE   AL        3   1  1  0    2       N
#> 2 thomsro99   2008    NYA   AL        2   3  1  2    3       N
#> 3 thompro01   2013    SEA   AL        2  28 13 15    4       N
#> 4 thompro01   2022    PHI   NL        2 111 65 46    3       N

Created on 2023-07-31 with reprex v2.0.2

Test against ggplot2 1.1.0 pre-release

Per email from Hadley Wickham:

I am starting the ggplot2 release process, aiming for a CRAN release
on November 13. This is biggest ggplot2 release in a while so there
are a lot of improvements.

Key points:

You can install the dev version with
devtools::install_github("hadley/ggplot2") and
you can see what's changed at
https://github.com/hadley/ggplot2/blob/master/NEWS.md.

Please run R CMD check on your package with the
development version of ggplot2 installed. You can
see the results from my runs at
https://github.com/hadley/ggplot2/blob/master/revdep/summary.md.

2020 Data Update

The 2020 data update has been released, time to update and release v9 of this package.

https://github.com/cdalzell/Lahman/tree/feature/2020-data-update

Issue with AwardsPlayers table

I'm seeing an issue with AwardsPlayers in Version 11.0-0. Jake Peavy (peavyja01) has a "gold glove" award in 2012. This doesn't show up in the AwardsPlayers table or any other table

The award was shared with Jeremy Hellickson ( hellije01 ). But only hellije01 is credited in AwardsPlayers.

Other years where OFs have shared "Gold Gloves" seem to be correct (example 1987 or 1985 AL)

Entity-Relationship (ER) diagram for Lahman

Many people find it hard to see how the various Lahman data frames are related: what are the keys that link the different data sets? One possibility is an ER diagram, commonly used to describe relational data bases.

A quick Google search turned up the datamodelr package. Perhaps this could be used here in some way:

In a vignette
In the README.md as a package overview

OOTP Data Integration

I think it would be neat if it were possible to load and append OOTP (Out Of The Park Baseball) save game stats.

This is an extreme stretch goal and I have no idea if it's even feasible as I'm not even slightly familiar with the internals of OOTP saves.

Prune some old branches

Late Spring cleaning:
We have quite a few old dead branches on the repo tree.

Time to clean them up. I can do this, but wanted to check with @cdalzell first, in case any old ones should be preserved for any reason.

Dennis' notes re pull request #23

@djmurphy420 made the following notes re pull request #23

Note 1: Check the graph in the Batting table re labeling of .400 hitters - dplyr changes the ordering of hitters so Michael's adjustments don't work as before. I just tried to make a few vertical adjustments. If the graph isn't acceptable, I can try it again using the ggrepel package, which uses an algorithm to spread out overlapping labels.

Note 2: The zipcode data object wasn't found when I tested the existing example from Schools.Rd. I tried some small geocoding experiments, but in several cases, the geocodes from Google were far off the mark so I abandoned the effort. Someone might want to look at that more closely.

Note 3: I made a couple of changes in the Lahman-package file, but there need to be a few more updates re maintainer and references. I'll let Chris and Michael sort that out.

Workflow for vignettes

Yesterday, I made a pull request for my vignettes branch that I created in my repo. By mistake, I merged it into master. It didn't do any harm, because it just created a vignettes/ directory with a dummy file.

Going forward, especially if there will be others working on vignettes, we need some instructions for how this should work. In the repo for my datavis site, https://github.com/friendly/datavis, we have the following:

Workflow for development

git checkout master
git pull
git branch {new-development-branch-name}
git checkout {new-development-branch-name}
...git add and git commit code to the new branch
...develop, test, develop, test
git push -u origin {new-development-branch-name}
...once the development is finished, create a pull request for the branch
to be merged into master.

[These should probably be described in RStudio-friendly terms, rather than just git commands.]

Questions:

How should we modify these here?
Should everyone involved have their own vignettes branch in their local repo?
Should we bring the develop branch up to date, and merge there?

Check on removing Master

The Master data table was replaced by People last year. It still exists in data/. We should check that there are no references to it anywhere and then remove it.
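A search along these lines could list the remaining references. It is a sketch only: `find_references()` is a hypothetical helper, the set of subdirectories and file extensions is an assumption, and the demo runs on a throwaway directory rather than the repository.

```r
# Hypothetical helper: lines in R, Rd and Rmd files that mention `word` as a whole word.
find_references <- function(word, pkg_dir = ".",
                            subdirs = c("R", "man", "vignettes", "inst")) {
  dirs <- file.path(pkg_dir, subdirs)
  files <- list.files(dirs[dir.exists(dirs)], pattern = "\\.(R|Rd|Rmd)$",
                      recursive = TRUE, full.names = TRUE)
  pattern <- paste0("\\b", word, "\\b")
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines, perl = TRUE)
    if (length(idx) > 0) {
      data.frame(file = f, line = idx, text = lines[idx], stringsAsFactors = FALSE)
    }
  })
  do.call(rbind, hits)  # NULL when nothing matches
}

# Demo: one real reference and one near miss ("Mastery")
demo_dir <- tempfile("pkg")
dir.create(file.path(demo_dir, "man"), recursive = TRUE)
dir.create(file.path(demo_dir, "R"))
writeLines("\\name{Master}", file.path(demo_dir, "man", "Master.Rd"))
writeLines("# Mastery of dplyr", file.path(demo_dir, "R", "notes.R"))
refs <- find_references("Master", demo_dir)
refs
```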

erroneous dates in Master

johnsbi01 was born in 1961 and died in 1942? According to (https://en.wikipedia.org/wiki/Lefty_Johnson) he was born in 1862.

downloading 2019

I was unable to download 2019 data with the provided link: install_github("cdalzell/Lahman", ref="insert_branch_name_here") after trying various branches. I received this error message: Error in utils::download.file(url, path, method = method, quiet = quiet, :
cannot open URL 'https://api.github.com/repos/cdalzell/Lahman/tarball/...'
I WAS able to download 2019 simply using: install_github("cdalzell/Lahman")

It is time to release v8.0-0

We had some conversation about the new release, but this got lost with covid & other things. Baseball season is on hold, but the Lahman pkg can still go forward.

Perhaps @cdalzell can remind us of the steps we need to do. AFAICS, the current version, with all updated data since the last release, is on the release v8.0-0 branch.

replace plyr with dplyr; add dplyr examples

dplyr is now so much better than plyr that the examples using the latter should be replaced with dplyr. For the most part, I think this entails just using

library(dplyr)

rather than

library(plyr)

in examples, but this should be checked. Also, requires change in DESCRIPTION -- Suggests:

As well, further useful examples of dplyr could be added. Some examples of queries on the Batting table are given in vignette("window-functions", package="dplyr").
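Note that examples calling plyr functions such as `ddply()` need rewriting, not just a different `library()` call, because dplyr has no `ddply()`. The sketch below shows one such translation on a made-up table; the column names mirror Batting, but the data are toy values.

```r
# Runs only when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Toy stand-in for Batting
  toy_batting <- data.frame(
    yearID = c(2000L, 2000L, 2001L),
    HR     = c(10L, 20L, 5L)
  )

  # plyr version, for comparison:
  #   ddply(toy_batting, .(yearID), summarise, HR = sum(HR))
  hr_by_year <- toy_batting %>%
    group_by(yearID) %>%
    summarise(HR = sum(HR))

  print(hr_by_year)
}
```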

CRAN version of Lahman requires 3.5

Which is breaking our dplyr automated checks which we test on 3.3 and 3.4. But the DESCRIPTION in this repo only mentions 2.10, so I'm a little mystified.
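One plausible explanation, not confirmed in this thread: data files saved in serialization format version 3 can only be read by R 3.5.0 or later, and R CMD build can add a matching `Depends: R (>= 3.5.0)` to the built package even when the DESCRIPTION in the repository says otherwise. The sketch below shows how to save with `version = 2` and how to inspect an existing file; `rdata_format_version()` is a hypothetical helper that reads the file's magic header.

```r
# Hypothetical helper: report the serialization version of an .RData/.rda file.
rdata_format_version <- function(path) {
  con <- gzfile(path, "rb")  # also reads bzip2-, xz- and uncompressed files
  on.exit(close(con))
  header <- rawToChar(readBin(con, "raw", n = 5))
  switch(header, "RDX2\n" = 2L, "RDX3\n" = 3L, NA_integer_)
}

# Save a small object in the older format, readable by R < 3.5.0
x <- data.frame(a = 1:3)
f2 <- tempfile(fileext = ".RData")
save(x, file = f2, version = 2)
rdata_format_version(f2)
```

Running the helper over every file in data/ would show whether any of them is in the newer format.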

Package Build Issues

Encountering a few errors and warnings while building & checking. This may just be some merge debris.. I'm currently investigating.

Found the following significant warnings:
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:59: unexpected '}'
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:49: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:50: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:51: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:52: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:54: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:55: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:56: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:57: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:58: All text must be in a section

WARNING
Files not of a type allowed in a 'data' directory:
'baseballdatabank-master'
Please use e.g. 'inst/extdata' for non-R data files

ERROR
Error in eval(expr, envir, enclos) :
could not find function "battingStats"
Calls: %>% -> eval -> eval

2015 update available

2015 data is now available, either through Github or my website: http://seanlahman.com

Salaries data stops at 2016

I'm using Lahman package 9.0.0. It appears that Salaries data is stuck at 2016.

> library(Lahman)
> summary(Salaries$yearID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1985    1994    2001    2001    2009    2016

I can't see any documentation on this in R package documentation or at Sean Lahman's web site. Is this part of the dataset no longer being updated in the master dataset? If so, it is unfortunate given Michael Friendly's nice vignette, Team Payroll and the World Series.

Branches need some pruning

Looking at the GitHub repo, there are an awful lot of cruft branches that go back to release/v.3.0-0.
I guess we should maintain the release branches, but what about the various feature/ branches
and dm- branches?

2022 Data Update

Looks like the 2022 data was released a couple of weeks ago: https://github.com/chadwickbureau/baseballdatabank/releases/tag/v2023.1

Time to start in on v11 as well as perhaps address some of the outstanding tickets if possible.
