cdalzell / lahman
R Package Containing Sean Lahman's Baseball Database
Home Page: https://cran.r-project.org/web/packages/Lahman/Lahman.pdf
It looks like both Awards tables need to be updated.
library(tidyverse)
packageVersion("Lahman")
#> [1] '8.0.1'
Lahman::AwardsManagers %>%
.$yearID %>%
max()
#> [1] 2016
Lahman::AwardsPlayers %>%
.$yearID %>%
max()
#> [1] 2017
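A quick way to see which tables lag behind is to compute the latest yearID for several tables at once. The sketch below uses small toy data frames (the values are made up for illustration); with the package installed, the same hypothetical helper latest_year() could be given list(AwardsManagers = Lahman::AwardsManagers, AwardsPlayers = Lahman::AwardsPlayers).

```r
# Hypothetical helper: latest season present in each table of a named list.
latest_year <- function(tables) {
  vapply(tables, function(d) max(d$yearID, na.rm = TRUE), numeric(1))
}

# Toy stand-ins for the real tables (values are illustrative only).
toy_tables <- list(
  AwardsManagers = data.frame(yearID = c(2015, 2016)),
  AwardsPlayers  = data.frame(yearID = c(2016, 2017))
)

years <- latest_year(toy_tables)
# Tables whose latest season is older than the newest one found.
names(years)[years < max(years)]
```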
The MD5 file is outdated, and should be deleted because it causes errors/warnings in package build/check.
The source_data/ directory is useful to preserve, but should be added to .Rbuildignore because it gives warnings in build/check.
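For reference, entries in .Rbuildignore are Perl-style regular expressions, one per line, matched case-insensitively against paths relative to the package root. A minimal sketch of the entry that would exclude source_data/, checked against an illustrative list of top-level names (the list is an assumption, not the actual repo contents):

```r
# Line one might add to .Rbuildignore to keep source_data/ out of the build.
ignore_patterns <- c("^source_data$")

# Illustrative top-level entries of a package directory (assumed, not real).
top_level <- c("DESCRIPTION", "NAMESPACE", "R", "data", "man", "source_data")

# Mark every entry that matches at least one ignore pattern.
ignored <- vapply(
  top_level,
  function(p) any(vapply(ignore_patterns, grepl, logical(1),
                         x = p, perl = TRUE, ignore.case = TRUE)),
  logical(1)
)
top_level[ignored]
```

If usethis is available, usethis::use_build_ignore("source_data") writes an equivalent entry.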
I tried R CMD check in a local clone of the feature/2015-data-update branch. It gives a warning on baseballdatabank-master:
* checking contents of 'data' directory ... WARNING
Files not of a type allowed in a 'data' directory:
'baseballdatabank-master'
Please use e.g. 'inst/extdata' for non-R data files
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking examples ... OK
* DONE
Status: 1 WARNING, 1 NOTE
The "payroll.Rmd" file does not knit; there is an error at line 210, in the car::Boxplot() function. I'm not familiar with the package, and can't diagnose the source of the error.
There are currently no examples in the following man files:
It would be nice to have some added, even to the ones that are just separate tables for post season play!
It looks like the table FieldingOFsplit is missing. I think you need it to cover OF splits for years subsequent to 1955.
Now that we have vignettes, perhaps it is time to abandon the old URL: http://lahman.r-forge.r-project.org/ in the package DESCRIPTION.
One main thing that might be interesting there is the link http://lahman.r-forge.r-project.org/doc/ to an old version of the package documentation, with results of all examples. This could be redone using pkgdown, and then the URL changed to a github.io based version.
There might be a few other links worth preserving.
My sense is that if we do this, it should be after the CRAN release of v.7.0-0
Thoughts?
There are a number of tests and code examples in the documentation. It would probably be helpful to turn these into actual unit tests, to speed up validation of the package data and increase confidence in it.
There are a number of packages out there, but this one (http://r-pkgs.had.co.nz/tests.html) seems to have CRAN integration.
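As a rough idea of what such a test could look like, here is a minimal sketch using testthat (the package covered at the link above). The helper has_unique_keys() and the toy table are hypothetical; a real test would run against the package tables, and the key columns chosen here are an assumption.

```r
# Hypothetical validation helper: TRUE when no combination of the key
# columns is repeated in the table.
has_unique_keys <- function(d, keys) {
  !any(duplicated(d[, keys, drop = FALSE]))
}

# Toy batting-style table (made-up rows, for illustration only).
toy_batting <- data.frame(
  playerID = c("aaa01", "aaa01", "bbb01"),
  yearID   = c(2001, 2002, 2001),
  stint    = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Wrap the check in a testthat test when testthat is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("key columns identify rows uniquely", {
    testthat::expect_true(
      has_unique_keys(toy_batting, c("playerID", "yearID", "stint"))
    )
  })
}
```

Tests of this shape would normally live in tests/testthat/ so that R CMD check runs them.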
Data issue reported via email, confirmed in latest version.
library(Lahman)
library(dplyr)
Lahman::Master %>%
filter(playerID == "johnsbi01") %>%
select(playerID, nameFirst, nameLast, birthDate, debut) %>%
mutate(debut_age = as.Date(debut) - as.Date(birthDate))
I'm reasonably sure no one has ever debuted in the MLB at -28216 days old.. :)
I'm seeing that there's a ton of work currently underway in our upstream data source. I'll submit a fix PR if that's not already fixed before this year's update is submitted to CRAN.
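To look for other rows with the same problem, the check can be generalized to the whole table. This is only a sketch on made-up rows; find_negative_debut_age() is a hypothetical helper, and it assumes birthDate and debut are both in yyyy-mm-dd form (or already Date).

```r
# Hypothetical helper: rows whose debut date precedes their birth date.
find_negative_debut_age <- function(people) {
  age_days <- as.numeric(as.Date(people$debut) - as.Date(people$birthDate))
  people[!is.na(age_days) & age_days < 0, , drop = FALSE]
}

# Toy rows (IDs and dates are invented for illustration).
toy_people <- data.frame(
  playerID  = c("toy01", "toy02", "toy03"),
  birthDate = c("1961-01-01", "1970-05-05", "1980-02-02"),
  debut     = c("1884-04-17", "1992-04-07", NA),
  stringsAsFactors = FALSE
)

suspect <- find_negative_debut_age(toy_people)
suspect$playerID
```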
Issue reported via email:
I'm not sure if this error goes upstream, but in the most recent version of the R package, the names for the Cubs and White Sox are reversed in the Teams table.
require(dplyr)
require(Lahman)
Teams %>%
  filter(teamID %in% c("CHA", "CHN") & yearID > 2010) %>%
  select(yearID, teamID, name)
The names should be reversed for 2013 and 2014.
Hello Lahman contributors,
I'm developing an R package (bbgraphsR) that uses the Teams dataframe from your package in one of my functions (viz_standings), but when running either devtools::load_all() or devtools::check(), I receive the following warning:
Warning: object ‘Teams’ is not exported by 'namespace:Lahman'
I indeed checked the NAMESPACE file and it makes sense because there is no line exporting that table.
In that case, can I save the table data and use it in my package?
Or is it expected to export that table in the future?
Thanks in advance for your feedback.
Regards.
Daniel.
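For context, data sets shipped in a package's data/ directory are not namespace exports, so they cannot be pulled in with an importFrom() directive. One approach is to load the data set into a local environment with utils::data(); referring to Lahman::Teams directly at the point of use is another commonly used option. The sketch below uses a hypothetical helper get_pkg_data() and demonstrates it on the built-in datasets package so that it runs without Lahman installed.

```r
# Hypothetical helper: fetch a data set from a package without attaching it.
get_pkg_data <- function(name, pkg) {
  e <- new.env()
  utils::data(list = name, package = pkg, envir = e)
  get(name, envir = e)
}

# Demonstration with a data set that ships with base R.
cars_df <- get_pkg_data("mtcars", "datasets")
dim(cars_df)

# With Lahman installed, the same call would fetch the Teams table.
if (requireNamespace("Lahman", quietly = TRUE)) {
  teams <- get_pkg_data("Teams", "Lahman")
}
```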
Within R Studio, I ran build->check. Looks good, but there is one NOTE and a comment that may trigger a reply from the CRAN gateway-keeper:
* checking installed package size ... NOTE
installed size is 7.4Mb
sub-directories of 1Mb or more:
data 7.2Mb
This is standard, since it is mainly a data package. This is already addressed in the cran-comments.md file.
We also get a comment at the end that the Pitching examples take longer than 5 sec to run.
* checking examples ... OK
Examples with CPU or elapsed time > 5s
user system elapsed
Pitching 4.96 0.22 5.18
* DONE
The last example in appearances doc returns 207 players as having played 162 games that season, only four of whom have a player ID associated with them. The rest are NA.
This can be reproduced in the current version available on CRAN (3.0-1)
This will reproduce the issue:
library(Lahman)
library(plyr)  # provides ddply(), .() and summarise()
all162 <- ddply(subset(Appearances, yearID > 1960), .(yearID),
                summarise, allGamers = playerID[G_all == 162])
table(all162$yearID)
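One way to sidestep the problem is to filter the rows directly instead of summarising within groups. A minimal sketch in base R on toy rows (the IDs and game counts are invented; with the package loaded, Appearances would take the place of toy_app):

```r
# Toy stand-in for the Appearances table (invented values).
toy_app <- data.frame(
  yearID   = c(1961, 1961, 1962, 1950),
  playerID = c("a01", "b01", "c01", "d01"),
  G_all    = c(162, 150, 162, 162),
  stringsAsFactors = FALSE
)

# Keep only post-1960 rows with exactly 162 games; one row per player-season.
all162 <- subset(toy_app, yearID > 1960 & G_all == 162,
                 select = c(yearID, playerID))

# Count of such players per season.
table(all162$yearID)
```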
In the table Master, two of the date fields (debut and finalGame) are represented inconsistently. Dates prior to January 1, 1900 are shown as yyyy-mm-dd (i.e., following the ISO 8601 standard). Dates after that are in m/d/yyyy.
(Note that a third date field in the table, birthDate, is in yyyy-mm-dd.)
This issue exists in the source csv version of the Lahman database (version 2015), and is not peculiar to the R version.
I have not checked other tables at this point.
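Until this is fixed at the source, the two formats can be normalized when the data are read. A minimal sketch, assuming only the two formats described above occur; parse_mixed_date() is a hypothetical helper and the sample dates are made up.

```r
# Hypothetical helper: parse a character vector that mixes yyyy-mm-dd
# and m/d/yyyy into a single Date vector.
parse_mixed_date <- function(x) {
  iso <- as.Date(x, format = "%Y-%m-%d")
  us  <- as.Date(x, format = "%m/%d/%Y")
  out <- iso
  # Fill in the values the ISO format could not parse.
  missing_iso <- is.na(out)
  out[missing_iso] <- us[missing_iso]
  out
}

mixed  <- c("1899-10-15", "4/6/1901", NA)
parsed <- parse_mixed_date(mixed)
format(parsed)
```

Each format is tried on every element separately, so the result does not depend on which format happens to come first in the column.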
Lahman has
Reverse suggests: broom, dplyr, poplite
Maintainers of those packages should be notified before release to CRAN. (I recall there was something that caused dplyr to break in a previous release of Lahman.)
Hi,
I wanted to try and update the data so that it included the 2017 season.
It seems Sean hasn't had a chance to upload to his site yet, but mentioned that the CSV files exist on github (https://twitter.com/seanlahman/status/974796328130744320)
They're at this location: https://github.com/chadwickbureau/baseballdatabank/
I was going to try and open a PR to add these in. It seems I should be trying to use the inst/scripts to ensure they're correctly saved as Rdata files and appropriate columns set to factors. Is that correct?
Can someone tell me what "ML" in lgID stands for? AL & NL are self-explanatory.
I teach a statistics course and assigned some data wrangling using the Salaries and Teams tables from the Lahman package for a midterm. An astute student pointed out some weirdness in the Salaries table for 2014. There are two extra teamIDs, NYM and SFG, in 2014 (leading to 32 national league teams total for that year). These codes appear to be the franchIDs as opposed to the teamIDs (NYN being the teamID and NYM the franchID; SFN being the teamID and SFG the franchID). But, in 2014, NYN and SFN (the correct teamIDs) also appear, with some salary data present. Code is below. I also pulled the .csv files from Sean Lahman's site and the error is there as well.
library(Lahman)
Teams <- Lahman::Teams
Salaries <- Lahman::Salaries
library(dplyr)
# create new payroll dataframe summing over players within teams for each year from Salaries table
payroll <- Salaries %>%
group_by(yearID, teamID) %>%
summarise(payroll = sum(salary))
# look at just 2013:2014...
payroll %>%
filter(yearID %in% c(2013, 2014)) %>%
group_by(teamID) %>%
summarise(n = n()) %>%
print(n = 32)
payroll %>%
filter(yearID %in% c(2013, 2014), teamID %in% c("NYM", "NYN", "SFG", "SFN"))
# note 32 teams total in NL for 2014: NYM and SFG each appear just in 2014
# by comparison, in the Teams table...
Teams %>%
filter(yearID %in% c(2013, 2014), teamID %in% c("NYM", "NYN", "SFG", "SFN")) %>%
select(yearID:franchID)
#3 rows for NYN teamID in 2014; 24 for NYM team ID same year
Salaries %>%
filter(yearID == 2014, teamID %in% c("NYM", "NYN")) %>%
group_by(teamID) %>%
tally()
#1 row for SFN teamID in 2014; 27 for SFG team ID same year
Salaries %>%
filter(yearID == 2014, teamID %in% c("SFG", "SFN")) %>%
group_by(teamID) %>%
tally()
# looks like in Salaries table, two franchIDs (NYM and SFG) were mistakenly entered for a few rows instead of the teamID (NYN and SFN)
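As a local workaround until the source data are corrected, the two stray codes can be mapped back to the teamIDs before summarising. A minimal sketch; fix_team_ids() is a hypothetical helper, and the lookup covers only the two codes identified above.

```r
# Hypothetical helper: replace franchIDs that were entered in place of teamIDs.
fix_team_ids <- function(team_id, lookup = c(NYM = "NYN", SFG = "SFN")) {
  team_id <- as.character(team_id)       # teamID may be stored as a factor
  hit <- team_id %in% names(lookup)
  team_id[hit] <- unname(lookup[team_id[hit]])
  team_id
}

fixed <- fix_team_ids(factor(c("NYM", "NYN", "SFG", "BOS")))
fixed
```

With the package loaded, something like Salaries$teamID <- fix_team_ids(Salaries$teamID) would apply it to a local copy of the table; note that this returns a character vector rather than a factor.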
The 2021 data update is being released & refined, time to update and release v10 of this package.
Currently watching the upstream 2022.x releases: https://github.com/chadwickbureau/baseballdatabank/releases
First attempt to build/install v. 4.0-0 gives the following error:
* installing *source* package 'Lahman' ...
files 'data/SchoolsPlayers.RData', 'man/SchoolsPlayers.Rd' are missing
I'm not sure why the name of this table was changed.
There were two All-Star games played in 1962, but the 8.0-0 release candidate has all data in AllstarFull from both games with zeros in the gameNum field.
From readme2014.txt:
SchoolsPlayers has been replaced with a new table called CollegePlaying.
This reflects advances in the compilation of this data, largely led by
Ted Turocy. The old table reported college attendance for major league
players by listing a start date and end date. The new version has a
separate record for each year that a player attended. This allows
us to better account for players who attended multiple colleges or
skipped a season, as well as to identify teammates.
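To illustrate the difference between the two layouts, the sketch below expands a start/end record into one record per year. The column names yearMin and yearMax for the old table, and all the rows, are assumptions for illustration. Expanding a range this way assumes continuous attendance, which is the very limitation the new table removes.

```r
# Toy rows in the old start/end layout (column names and values assumed).
old_style <- data.frame(
  playerID = c("aaa01", "bbb01"),
  schoolID = c("schoolx", "schooly"),
  yearMin  = c(1990, 2001),
  yearMax  = c(1992, 2001),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per player, school and year.
expand_years <- function(d) {
  rows <- lapply(seq_len(nrow(d)), function(i) {
    data.frame(
      playerID = d$playerID[i],
      schoolID = d$schoolID[i],
      yearID   = seq(d$yearMin[i], d$yearMax[i]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

new_style <- expand_years(old_style)
new_style
```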
Need to make sure the .Rd files are created correctly and that the docs are updated appropriately.
From readme2014.txt:
Removed two deprecated fields from the batting table. The G_batting and
G_old fields were rendered obsolete when we created the appearances table.
They've been removed from the batting table starting with this version
Need to:
I just noticed in vignette("databases", package="dplyr"), that Hadley has defined a bunch of functions to use/illustrate dplyr operations with the Lahman database, but in sql form.
See also: ?dplyr::lahman
I have no idea whether this actually works somewhere, or is purely notional. It seems to require a local sql database to be set up somewhere. But it seems to be something useful to explore, perhaps with Hadley's cooperation, for a future release.
Hi,
I noticed that the playerID for the CollegePlaying dataset does not merge. Did you change identifier after 2014? Thank you
The 2019 data update has been released, time to update and release v8 of this package.
Feature branch: https://github.com/cdalzell/Lahman/tree/feature/2019-data-update
I'm looking at the branches with a view to preparing for the v. 7.0 release. A major feature will be the inclusion of vignettes, now all on the vignettes branch. This branch is now ~11 commits ahead of master and 6 commits behind. There will be more work on the vignettes branch as we go forward.
I'm not sure how to manage this on GitHub with a PR to master. Hopefully, there are no conflicts, so a fast-forward will work.
I think, but am not sure, that Chris' initial work with the 2017 data is on a feature/ branch.
(Cross-posting from https://github.com/chadwickbureau/baseballdatabank/issues/143)
In the Managers table, the playerID given for Rob Thomson is thompro01 (which is Robby Thompson's) instead of thomsro99.
The other rows for Thomson (2008 NYA) and Thompson (2005 CLE and 2013 SEA) list the correct playerIDs.
library(dplyr, warn.conflicts = FALSE)
library(Lahman)
Managers %>%
filter(playerID == "thompro01" | playerID == "thomsro99")
#>    playerID yearID teamID lgID inseason   G  W  L rank plyrMgr
#> 1 thompro01   2005    CLE   AL        3   1  1  0    2       N
#> 2 thomsro99   2008    NYA   AL        2   3  1  2    3       N
#> 3 thompro01   2013    SEA   AL        2  28 13 15    4       N
#> 4 thompro01   2022    PHI   NL        2 111 65 46    3       N
Created on 2023-07-31 with reprex v2.0.2
Per email from Hadley Wickham:
I am starting the ggplot2 release process, aiming for a CRAN release on November 13. This is biggest ggplot2 release in a while so there are a lot of improvements.
Key points:
- You can install the dev version with devtools::install_github("hadley/ggplot2") and you can see what's changed at https://github.com/hadley/ggplot2/blob/master/NEWS.md.
- Please run R CMD check on your package with the development version of ggplot2 installed. You can see the results from my runs at https://github.com/hadley/ggplot2/blob/master/revdep/summary.md.
The 2020 data update has been released, time to update and release v9 of this package.
https://github.com/cdalzell/Lahman/tree/feature/2020-data-update
I'm seeing an issue with AwardsPlayers in Version 11.0-0. Jake Peavy (peavyja01) has a "gold glove" award in 2012. This doesn't show up in the AwardsPlayers table or any other table.
The award was shared with Jeremy Hellickson (hellije01), but only hellije01 is credited in AwardsPlayers.
Other years where OFs have shared "Gold Gloves" seem to be correct (example 1987 or 1985 AL)
Many people find it hard to see how the various Lahman data frames are related: what are the keys that link the different data sets? One possibility is an ER diagram, commonly used to describe relational databases.
A quick Google search turned up the datamodelr package. Perhaps this could be used here in some way:
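Independently of any diagramming package, a first pass at finding candidate keys is to list the column names each pair of tables has in common. The sketch below does this in base R on empty toy tables whose columns are only a small, illustrative subset of the real ones; shared_columns() is a hypothetical helper.

```r
# Hypothetical helper: for every pair of tables, the column names they share.
shared_columns <- function(tables) {
  pairs <- utils::combn(names(tables), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) {
    intersect(names(tables[[p[1]]]), names(tables[[p[2]]]))
  })
  names(out) <- vapply(pairs, paste, character(1), collapse = " ~ ")
  out
}

# Empty toy tables carrying only a few illustrative columns.
toy <- list(
  People  = data.frame(playerID = character(), nameLast = character()),
  Batting = data.frame(playerID = character(), yearID = integer(),
                       teamID = character()),
  Teams   = data.frame(yearID = integer(), teamID = character(),
                       name = character())
)

links <- shared_columns(toy)
links
```

Shared names are only candidates: a column such as yearID appears in many tables without being a key on its own.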
I think it would be neat if it were possible to load and append OOTP (Out Of The Park Baseball) save game stats.
This is an extreme stretch goal and I have no idea if it's even feasible as I'm not even slightly familiar with the internals of OOTP saves.
Late Spring cleaning:
We have quite a few old dead branches on the repo tree.
Time to clean them up. I can do this, but wanted to check with @cdalzell first, in case any old ones should be preserved for any reason.
@djmurphy420 made the following notes re pull request #23
Note 1: Check the graph in the Batting table re labeling of .400 hitters - dplyr changes the ordering of hitters so Michael's adjustments don't work as before. I just tried to make a few vertical adjustments. If the graph isn't acceptable, I can try it again using the ggrepel package, which uses an algorithm to spread out overlapping labels.
Note 2: The zipcode data object wasn't found when I tested the existing example from Schools.Rd. I tried some small geocoding experiments, but in several cases, the geocodes from Google were far off the mark so I abandoned the effort. Someone might want to look at that more closely.
Note 3: I made a couple of changes in the Lahman-package file, but there need to be a few more updates re maintainer and references. I'll let Chris and Michael sort that out.
Yesterday, I made a pull request for my vignettes branch that I created in my repo. By mistake, I merged it into master. It didn't do any harm, because it just created a vignettes/ directory with a dummy file.
Going forward, especially if there will be others working on vignettes, we need some instructions for how this should work. In the repo for my datavis site, https://github.com/friendly/datavis, we have the following:
[These should probably be described in RStudio-friendly terms, rather than just git commands.]
vignettes branch in their local repo?
develop branch up to date, and merge there?
The Master data table was replaced by People last year. It still exists in data/. Should check that there are no references to it anywhere and then remove it.
johnsbi01 was born in 1961 and died in 1942? According to (https://en.wikipedia.org/wiki/Lefty_Johnson) he was born in 1862.
I was unable to download 2019 data with the provided link: install_github("cdalzell/Lahman", ref="insert_branch_name_here") after trying various branches. I received this error message: Error in utils::download.file(url, path, method = method, quiet = quiet, :
cannot open URL 'https://api.github.com/repos/cdalzell/Lahman/tarball/...'
I WAS able to download 2019 simply using: install_github("cdalzell/Lahman")
We had some conversation about the new release, but this got lost with covid & other things. Baseball season is on hold, but the Lahman pkg can still go forward.
Perhaps @cdalzell can remind us of the steps we need to do. AFAICS, the current version, with all updated data since the last release, is on the release v8/0-0 branch.
dplyr is now so much better than plyr that the examples using the latter should be replaced with dplyr. For the most part, I think this entails just using
library(dplyr)
rather than
library(plyr)
in examples, but this should be checked. Also, requires change in DESCRIPTION -- Suggests:
As well, further useful examples of dplyr could be added. Some examples of queries on the Batting table are given in vignette("window-functions", package="dplyr").
Which is breaking our dplyr automated checks which we test on 3.3 and 3.4. But the DESCRIPTION in this repo only mentions 2.10, so I'm a little mystified.
Encountering a few errors and warnings while building & checking. This may just be some merge debris.. I'm currently investigating.
Found the following significant warnings:
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:59: unexpected '}'
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:49: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:50: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:51: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:52: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:54: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:55: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:56: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:57: All text must be in a section
Warning: d:/temp/Rtmp4snZfi/R.INSTALL1780fd14f1b/Lahman/man/TeamsFranchises.Rd:58: All text must be in a section
WARNING
Files not of a type allowed in a 'data' directory:
'baseballdatabank-master'
Please use e.g. 'inst/extdata' for non-R data files
ERROR
Error in eval(expr, envir, enclos) :
could not find function "battingStats"
Calls: %>% -> eval -> eval
2015 data is now available, either through Github or my website: http://seanlahman.com
I'm using Lahman package 9.0.0. It appears that Salaries data is stuck at 2016.
> library(Lahman)
> summary(Salaries$yearID)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1985 1994 2001 2001 2009 2016
I can't see any documentation on this in R package documentation or at Sean Lahman's web site. Is this part of the dataset no longer being updated in the master dataset? If so, it is unfortunate given Michael Friendly's nice vignette, Team Payroll and the World Series.
Looking at the github repo, there are an awful lot of cruft branches that go back to release/v.3.0-0.
I guess we should maintain the release branches, but what about the various feature/ branches and dm- branches?
Looks like the 2022 data was released a couple of weeks ago: https://github.com/chadwickbureau/baseballdatabank/releases/tag/v2023.1
Time to start in on v11 as well as perhaps address some of the outstanding tickets if possible.