ropengov / psdata

An R package to download regularly maintained political science data sets and make commonly used, but infrequently updated variables based on this data.

Home Page: https://ropengov.github.io/psData/


psdata's Introduction

psData

Started by Christopher Gandrud

This R package includes functions for gathering commonly used and regularly maintained political science data sets. It also includes functions for combining components from these data sets into variables that have been suggested in the political science literature, but are not regularly updated.

psData includes two primary function types: Getters and Variable Builders. Getter functions automate the gathering and cleaning of particular data sets so that they can easily be merged with other data. They do not transform the underlying data. Variable Builders use Getters to gather data and then transform it into new variables suggested by the political science literature. The functions currently part of psData include:

Getters

DpiGet: a function to download the Database of Political Institutions data set. It keeps specified variables and creates a standard country ID variable that can be used for merging the data with other data sets.
PolityGet: a function to download the Polity IV data set. It keeps specified variables and creates a standard country ID variable that can be used for merging the data with other data sets.
RRCrisisGet: download and combine Reinhart and Rogoff’s (2010) crisis dummy variables into one data frame.
WB_IMFGet: a function to download Axel Dreher’s data set of IMF programs and World Bank projects (1970-2011). It keeps specified variables and creates a standard country ID variable that can be used for merging the data with other data sets.

Variable Builders

WinsetCreator: Creates the winset (W) and a modified version of the selectorate (S) variable from Bueno de Mesquita et al. (2003) using the most recent data available from Polity IV and the Database of Political Institutions.

Others

Other functions included that might be useful to people working with political science data:

CountryID: Function for creating standardised country names and ID variables. This builds on countrycode and includes extra capabilities for reporting and dealing with duplicates.

Updates

Most of the Getter functions currently included in psData download data from a specific URL that links to a data file. Hopefully, the data sets’ authors will keep their data up-to-date. When they make updates, they will likely link to the updated file with a new URL. All of the functions in psData that gather data from a file at a specific URL allow the user to specify a new URL, if they want to.
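
The overridable-URL pattern described above can be sketched in base R. This is only an illustration: toy_get and its arguments are hypothetical stand-ins rather than psData functions, and local temporary CSV files take the place of remote data files so that the example runs offline.

```r
# Hypothetical getter whose default source can be replaced by the caller.
# read.csv() accepts local paths as well as URLs, so temporary files
# stand in for an "old" and an "updated" remote data file.
old_source <- tempfile(fileext = ".csv")
new_source <- tempfile(fileext = ".csv")
write.csv(data.frame(country = "A", year = 2012), old_source, row.names = FALSE)
write.csv(data.frame(country = "A", year = 2012:2015), new_source, row.names = FALSE)

toy_get <- function(url = old_source, vars = NULL) {
  data <- read.csv(url, stringsAsFactors = FALSE)
  # Keep only the requested variables, if any were specified
  if (!is.null(vars)) data <- data[, vars, drop = FALSE]
  data
}

default_data <- toy_get()                  # reads the default source
updated_data <- toy_get(url = new_source)  # caller points to a newer file

nrow(default_data)
nrow(updated_data)
```

Keeping the source as an argument with a default means a moved or updated file can be handled by the user without waiting for a new package release.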

If you notice an updated version of one of the data sets, feel free to submit a Pull Request with the new URL. Please check that the function still works, as the data set’s authors may have changed the format, which would break the Getter function.

Suggestions

Please feel free to suggest other data set downloading and variable creating functions. To do this just leave a note on the package’s Issues page.

Also feel free to make a pull request with a new Getter or Variable Builder. Please make the pull request on a branch other than master.

Examples

To download only the polity2 variable from Polity IV:

library(psData)
PolityData <- PolityGet(vars = "polity2")

head(PolityData)
#>   iso2c standardized_country     country year polity2
#> 1    AF          Afghanistan Afghanistan 1800      -6
#> 2    AF          Afghanistan Afghanistan 1801      -6
#> 3    AF          Afghanistan Afghanistan 1802      -6
#> 4    AF          Afghanistan Afghanistan 1803      -6
#> 5    AF          Afghanistan Afghanistan 1804      -6
#> 6    AF          Afghanistan Afghanistan 1805      -6

Note that the iso2c variable is the ISO two-letter country code. This standardised country identifier can be used to merge the Polity IV data with another data set. A different country ID can be selected with the OutCountryID argument. See the package documentation for details.
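
As a minimal sketch of such a merge, the base R code below joins two toy data frames on iso2c and year. The data frames, column names other than iso2c and year, and the placeholder codes "AA" and "BB" are invented for illustration and are not real Polity IV values.

```r
# Two toy country-year data frames sharing a standardised ID and a year
scores_toy <- data.frame(
  iso2c = c("AA", "AA", "BB"),
  year = c(2000, 2001, 2000),
  score = c(1, 2, 3),
  stringsAsFactors = FALSE
)
other_toy <- data.frame(
  iso2c = c("AA", "BB", "BB"),
  year = c(2000, 2000, 2001),
  other = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Inner join: keeps only country-years present in both data sets
merged <- merge(scores_toy, other_toy, by = c("iso2c", "year"))
merged
```

Setting all = TRUE in merge() would instead keep every country-year from either data set, filling the gaps with NA.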

To create winset (W) and selectorate (ModS) data use the following code:

library(psData)

WinData <- WinsetCreator()

head(WinData)
#>   iso2c              country year    W ModS
#> 1    AE United Arab Emirates 1975 0.25 0.25
#> 2    AE United Arab Emirates 1976 0.25 0.25
#> 3    AE United Arab Emirates 1977 0.25 0.25
#> 4    AE United Arab Emirates 1978 0.25 0.25
#> 5    AE United Arab Emirates 1979 0.25 0.25
#> 6    AE United Arab Emirates 1980 0.25 0.25

psdata's People

ulfelder nataliabueno anhqle ee-in xfim tris-sondon samclifford taigaaltai onurgitmez

psdata's Issues

Data recipe for PITF Worldwide Atrocities Dataset?

[T]he Political Instability Task Force’s Worldwide Atrocities Dataset [] records information from several international press sources about situations in which five or more civilians are deliberately killed in the context of some wider political conflict. (Jay Ulfelder)

The blog post already links to some R code. The recipe method is described in #10.

Travis CI fail

Travis CI fails on loading xlsx dependency.

Quality of Governance Indicators

@briatte Thanks for suggesting placing your QoG work together in one place with psData.

First, attribution: This would be a really big contribution, so I'd be more than happy to add you as a contributor to the repository and make you an author on the package overall. Just let me know.

Things to discuss/work on:

Unified syntax: I think to make the package really user friendly we should work on unifying the syntax across the commands that gather the different data sets. I've worked out kind of a start, but I'm not particularly wedded to anything. We should discuss this further.

Unified data object class and data handling methods: I really like what you've done with adding additional capabilities to lag/lead/graph the data. It would be great to apply this across the other sources.

Thoughts on moving forward:

If you want to merge psData and your work together I think:

I should first give you contributor status on the psData repo.
We start a qogDev branch in psData into which you can port your code.
We should start co-writing a JSS style package description paper while we unify the syntax and data handling methods across the commands. I think writing a paper while doing this will (a) help us clearly delineate the package structure and usage, (b) provide a guide to other possible contributors (who may also become co-authors), and (c) satisfy our need to appease the academic gods who are hungry for papers.

Quick question:

Does the current version of qogdata pass CRAN check?

Just let me know what you think about this plan as a way forward.

Error in read.dta(tmpfile) : a binary read error occurred

Occurs with following command

DpiData <- DpiGet()

Also, there is what appears to be a typo on the main page: it refers to pdData as not yet being on CRAN.

2015 edition of Polity IV has been released and the default 2012 URL does not work

The default url in PolityGet() points to the p4v2012.sav file which does not exist anymore, and therefore it produces an error. It must point to the latest release, which currently is p4v2015.sav.

panel_lag and shift

Add improvements to panel lag and shift based on recent updates in DataCombine. These are mostly error handling improvements, but they also include capabilities for a number of different moving averages and a 'spread' for dummy variables.
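
For readers unfamiliar with the operation, a within-unit lag can be sketched in base R as follows. The function panel_lag_sketch and its arguments are hypothetical and do not reproduce the psData or DataCombine implementations; the sketch also assumes there are no gaps in the time variable within a unit.

```r
# Lag a variable by one period within each panel unit.
# Assumes consecutive time periods within each unit (no gaps).
panel_lag_sketch <- function(data, var, group, time,
                             new_name = paste0(var, "_lag1")) {
  # Sort by unit and time so that "previous row" means "previous period"
  data <- data[order(data[[group]], data[[time]]), , drop = FALSE]
  # Shift values down by one within each unit; the first period becomes NA
  data[[new_name]] <- ave(
    data[[var]], data[[group]],
    FUN = function(v) c(NA, head(v, -1))
  )
  data
}

toy_panel <- data.frame(
  id = c("A", "A", "A", "B", "B"),
  year = c(2002, 2000, 2001, 2001, 2000),
  x = c(3, 1, 2, 20, 10),
  stringsAsFactors = FALSE
)

lagged <- panel_lag_sketch(toy_panel, var = "x", group = "id", time = "year")
lagged
```

Grouping before shifting is what prevents the last value of one unit from leaking into the first period of the next.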

csv,conf Presentation

I've put together the first draft of the presentation for csv,conf next week.

The slide deck is here: http://christophergandrud.github.io/psDataCSVConf/psDataPresentation.html

Feel free to make a pull request here: https://github.com/christophergandrud/psDataCSVConf

ECPR Summer School

Noting this down as an idea for when the production version is out: it might be relevant to run an ECPR Summer School course with psData one day. They have courses on other software, probably teaching with the same kind of data.

Overhaul of psData

Following discussions in #1 and #3, there seems to be a consensus emerging that we should focus on:

packages for downloading/cleaning individual data sets
at the same time, at least for data in country-year format, a common syntax/capabilities should be developed that makes it easy for users to download data sets with these individual packages and returns data frames that can be easily merged with one another.

As such, I think what we might want to do is

(a) Create a new text document repo that would be used to collaborate on a common psCountryData framework. (Probably starting with an .md document that would develop a framework checklist for individual country-year data packages to follow. Also, because my main professional incentive right now is to publish papers, this could be developed into a JSS style article laying out the framework and giving examples from packages that implement it. Any interested person could of course co-author.)

(b) Create a new package called psCountryData that would contain core functions shared by the individual data Getter/Variable Builder packages. For example, psData currently includes a CountryID function. This is a modified version of countrycode that is handy for creating merge ready country identifiers. It looks like there is some good stuff in @briatte's QoG package that could go in there too.

(c) Break up psData into individual Getter and Variable Builder packages that implement this framework. Similarly the two QoG packages could (depending on the authors' preferences), implement the unified syntax.

Any thoughts?

Is this package still usable?

Hello, I am not sure if this package is still usable?
I tried to run the code, but R shows a warning and says there is no "WinsetCreator".

Suggestions for Data sets/Variables to Add

Feel free to add data sets and variables that would be useful to include in psData. Code contributions are also always very helpful.

Recipes

In terms of repository structure, I think it would be beneficial to split each data source into separate files. The idea would be to create a standardized "recipe" format that would include all info about the dataset (e.g. where to download, bibtex cite, name of cleaning script, date updated), and then a cleaning script that does all the magic we need.

I use something like that locally, where I have a YAML file that specifies all the info and then an accompanying python script that I use for cleaning.

This makes user contributions very easy. They just cut and paste another "recipe" and include an R script that does the cleaning. The only thing psData has to do is provide a proper API to parse the recipe, download the data, and activate the cleaning script.

Think of something like the homebrew install for mac and its library of "formulas":

https://github.com/Homebrew/homebrew/tree/master/Library/Formula
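
One possible shape for such a recipe, sketched in base R under several assumptions: the recipe is held in an R list rather than a YAML file, the field names (name, source, cite, clean) and the run_recipe helper are hypothetical, and a local temporary CSV file stands in for the download location.

```r
# A local file stands in for the remote data source
recipe_source <- tempfile(fileext = ".csv")
write.csv(data.frame(Country = c("A", "B"), Value = c(1, NA)),
          recipe_source, row.names = FALSE)

# Hypothetical recipe: metadata plus a cleaning function
toy_recipe <- list(
  name = "toy_data",
  source = recipe_source,
  cite = "placeholder citation key",
  clean = function(raw) {
    names(raw) <- tolower(names(raw))        # standardise column names
    raw[!is.na(raw$value), , drop = FALSE]   # drop rows with missing values
  }
)

# Generic runner: read the source, then apply the recipe's cleaning step
run_recipe <- function(recipe) {
  raw <- read.csv(recipe$source, stringsAsFactors = FALSE)
  recipe$clean(raw)
}

recipe_result <- run_recipe(toy_recipe)
recipe_result
```

With this split, a contributor only supplies the metadata and the cleaning function, while the runner stays the same for every data set.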

Implement psData Overhaul

Following the consensus in #5 a number of major changes will be made to psData for the version 1 release. These include:

Separating the present getter and variable builder functions into their own packages.
Implementing the framework laid out in the psData guidelines, including:
get_data: a function for downloading panel data sets by calling getter and variable builder functions from associated packages.
panel_set: a function for cleaning the downloaded data into a political science panel-series (psData) object.
panel_merge: a function for merging psData objects.

Cache data locally after download

Allow users to pass a flag that will cache the data locally after it is downloaded. Goals:

1- Reproducibility. Even if the online version disappears, the user will still have a local copy
2- Quicker download on second request.

Should be accompanied by a standalone cleaning script if applicable.
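
A minimal sketch of the caching idea in base R follows. The cached_get function, its arguments, and the cache location are hypothetical and not part of psData; the download step is passed in as a function so that the example runs without network access.

```r
# Return cached data if present; otherwise fetch, store, and return it.
cached_get <- function(key, fetch,
                       cache_dir = file.path(tempdir(), "psdata_cache_sketch")) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  cache_file <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # second and later requests skip the download
  }
  data <- fetch()
  saveRDS(data, cache_file)      # keep a local copy for reproducibility
  data
}

# Stand-in for a real download; counts how often it is called
fetch_calls <- 0
toy_fetch <- function() {
  fetch_calls <<- fetch_calls + 1
  data.frame(iso2c = "AA", year = 2000, stringsAsFactors = FALSE)
}

demo_dir <- tempfile("cache_demo")
first_request <- cached_get("toy", toy_fetch, cache_dir = demo_dir)
second_request <- cached_get("toy", toy_fetch, cache_dir = demo_dir)
fetch_calls
```

The second request is served from the saved .rds file, so the fetch function runs only once.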

CPS Special Issue thoughts

This isn't really a psData topic, but I was wondering if anyone is interested in putting together a proposal for the CPS special issue on transparency in the social sciences.

I'm thinking of a project on how social scientists use version control and (primarily) GitHub to make their research more transparent.

Anyone interested?

PolityGet duplicate year for Russia/USSR

Thanks for a very useful package! One tiny bug. In the Polity IV Data import, Russia and USSR both have a 1922 year. This might not normally be a problem but they have the same iso2c code, so it creates a duplicate. Only 1922 has this problem.
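
The duplicate can be located with a check on the country ID and year pair, as in the base R sketch below. The toy data frame is invented for illustration, and the shared code "RU" is an assumption about which iso2c value both entries receive.

```r
# Toy rows mimicking the reported overlap (no real Polity IV scores)
toy_polity <- data.frame(
  country = c("Russia", "Russia", "USSR", "USSR"),
  iso2c = c("RU", "RU", "RU", "RU"),
  year = c(1921, 1922, 1922, 1923),
  stringsAsFactors = FALSE
)

id_year <- toy_polity[, c("iso2c", "year")]

# Flag every row that shares its ID-year pair with another row
dup_rows <- toy_polity[duplicated(id_year) |
                         duplicated(id_year, fromLast = TRUE), ]
dup_rows

# One possible resolution: keep the last occurrence of each pair
deduped <- toy_polity[!duplicated(id_year, fromLast = TRUE), ]
deduped
```

Which of the two 1922 rows to keep is a substantive choice; keeping the last occurrence is used here only to show the mechanics.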

S4 development branch

Hi,

Here's a draft S4 class that creates a subclass of data.frame to implement functions for panel data and time series manipulation, or for any dyadic data.

CountryID by @christophergandrud (wraps around countrycode by @vincentarelbundock, with edits)
MoveFront from DataCombine by @christophergandrud
class-HomogList and subclass_homog_list from DataFrameConstr by @jrnold
class-psData (hacked from DataFrameConstr by @jrnold)
constrained_data_frame from DataFrameConstr by @jrnold
debt.rda (demo Reinhart and Rogoff data)
get_data and a bunch of get (download) functions for e.g. Polity 4 and QOG data
panel with functions by @zmjones
utils (with snippets by more authors)

I'm sure many more people have written useful code for working with panel data, including for visualization purposes—the package should probably also look in that direction, and the draft includes a Reinhart and Rogoff plot.

Output

# install
devtools::install_github("briatte/psData")
library(psData)

# example
data(debt)

The object is of class data.frame:

> head(debt)
    Country Year    growth     ratio
1 Australia 1946 -3.557951 190.41908
2 Australia 1947  2.459475 177.32137
3 Australia 1948  6.437534 148.92981
4 Australia 1949  6.611994 125.82870
5 Australia 1950  6.920201 109.80940
6 Australia 1951  4.272612  87.09448

Now add panel design variables to quickly create a psData panel object:

> as.panel(debt, "Country", "Year")
 Panel data frame [ 1171 rows x  4 columns, 20 Country x 64 Year ]

    Country Year    growth     ratio
1 Australia 1946 -3.557951 190.41908
2 Australia 1947  2.459475 177.32137
3 Australia 1948  6.437534 148.92981
4 Australia 1949  6.611994 125.82870
5 Australia 1950  6.920201 109.80940
6 Australia 1951  4.272612  87.09448

The object is now a data.frame to which the package can pass functions similar to the xt and ts commands in Stata. Country codes and basic date formats are automatically detected and usable for safe merges between multiple panel data.

Need a verbose message on duplicated and dropped observations after countryid standardization

When using psData, the user would frequently see messages such as:

37 duplicated values were created when standardising the country ID with iso2c.
59 observations dropped based on missing values of the standardised ID variable.

I think it would be a good idea to add a verbose option that lets the user see exactly which IDs were dropped or duplicated when one type of country ID is converted to another. It would add a lot of peace of mind. In my experience, I always go back and check manually what has been dropped.

If @christophergandrud thinks this is a good idea, I would be happy to contribute a pull request.
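
A rough sketch of what such a report could contain, in base R. The function standardise_verbose, its arguments, and the toy lookup table are hypothetical; the lookup table stands in for the real country code conversion and does not reflect how CountryID works internally.

```r
# Toy name-to-ID mapping standing in for a real country code conversion
toy_lookup <- c("Russia" = "RU", "USSR" = "RU", "France" = "FR")

standardise_verbose <- function(data, country_var, time_var, lookup) {
  data$std_id <- unname(lookup[as.character(data[[country_var]])])

  # Source names that could not be converted
  dropped <- unique(data[[country_var]][is.na(data$std_id)])
  kept <- data[!is.na(data$std_id), , drop = FALSE]

  # Rows that share an ID-time pair after conversion
  key <- kept[, c("std_id", time_var), drop = FALSE]
  is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

  message(length(dropped), " source name(s) had no standardised ID: ",
          paste(dropped, collapse = ", "))
  message(sum(is_dup), " row(s) share an ID-time pair after standardisation")

  list(data = kept, dropped = dropped,
       duplicated = kept[is_dup, , drop = FALSE])
}

toy_countries <- data.frame(
  country = c("Russia", "USSR", "France", "Atlantis"),
  year = c(1922, 1922, 1922, 1922),
  stringsAsFactors = FALSE
)

report <- standardise_verbose(toy_countries, "country", "year", toy_lookup)
report$dropped
report$duplicated
```

Returning the dropped names and the duplicated rows alongside the data would let users inspect them without redoing the conversion by hand.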

dat

Do any of you have experience with dat?

I've been trying to think about how this might relate to psData. Overall, they aim to address similar issues, with psData being a bit more focused on one type of data and usage from R.

I wonder if it is worth exploring using dat as a backend, or some other integration.

Summer Hackathon

Sorry everyone, I've been kind of overwhelmed with other projects the past couple of weeks.

I was thinking that to get this thing started up again it might be good for interested parties to think of a few days to a week in the summer that they would be available to work together on this close to full time.

Any interest? Preferred times?

grep(pattern = "Northern America", x = psData::countrycode_data$region, value = T)

returns repeated values of "Northern America " with a trailing space.
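
Until the data are corrected, a possible workaround is to trim whitespace before matching. In the sketch below a toy character vector stands in for psData::countrycode_data$region.

```r
# Toy stand-in for the region column, including the trailing space
regions <- c("Northern America ", "Northern America ", "Western Europe")

regions_clean <- trimws(regions)

"Northern America" %in% regions        # FALSE: the trailing space blocks an exact match
"Northern America" %in% regions_clean  # TRUE after trimming
```

grep() still finds the untrimmed values because it matches substrings, which is why the problem only surfaces with exact comparisons or merges.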