ipums / ipumsr
Request, download, and read IPUMS data in R
Home Page: https://tech.popdata.org/ipumsr/
License: Mozilla Public License 2.0
Oct 5, 2018 @gergness:
Seems like the codebooks are in the same format as Terra, so should be straightforward
May 1, 2020 @dtburk:
In response to mnpopcenter/ipumsr#56, we added a warning message when the `lower_vars` argument to any of the `read_ipums_*` functions is ignored. As described in the discussion of that issue, the reason the argument is sometimes ignored is to make sure the case of the variable names stays in sync between the data and the `ipums_ddi` object associated with the data. Keeping these in sync is helpful if the user wants to use a function like `set_ipums_var_attributes()` that attaches metadata from the `ipums_ddi` to variables in a data.frame. However, by making these metadata-attaching functions a little smarter, we can probably allow the case of variable names to get out of sync between the `ipums_ddi` and the data.frame, while still allowing users to attach metadata if they want to. Once we make those fixes, we can allow users to convert variable names to lowercase when they read in the data, even if they have already read in the DDI.
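For context, a minimal sketch of the workflow in question (the DDI file name is illustrative):

```r
library(ipumsr)

# lower_vars works when reading from the DDI file path directly
ddi <- read_ipums_ddi("cps_00001.xml", lower_vars = TRUE)

# When a pre-read ipums_ddi object is supplied, lower_vars is ignored
# (with a warning) so the names stay in sync with the ddi object
data <- read_ipums_micro(ddi, verbose = FALSE)

# Reattach variable metadata (labels, descriptions) from the DDI
data <- set_ipums_var_attributes(data, ddi)
```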
Prepare for release:
git pull
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::revdep_check(num_workers = 4)
Update cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
git push
usethis::use_github_release()
usethis::use_dev_version()
git push
`add_to_extract()` allows arbitrary argument names for cross-product compatibility, but no check is done to warn users if they include arguments that are not relevant for the particular extract type they are working with. This produces confusing behavior. For instance:
extract <- define_extract_usa(
  samples = "us2017b",
  variables = "YEAR",
  description = "Test extract"
)

# Returns the extract with no modifications or warnings, since there is
# no "vars" field in a usa_extract
add_to_extract(
  extract,
  vars = "New Variable"
)
We do warn users for `remove_from_extract()`, so this just requires extrapolating that check to `add_to_extract()`.
Currently ipums.org is the first-listed URL in the DESCRIPTION file, which means that links to the package generated by tools such as downlit will go to that URL. It might make more sense to list the GitHub URL, or tech.popdata.org/ipumsr, first in the DESCRIPTION file for this reason.
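For illustration, the fix would amount to reordering the URL field in DESCRIPTION, something like the following sketch (the exact URL list should match what the package actually declares):

```
URL: https://tech.popdata.org/ipumsr/, https://github.com/ipums/ipumsr, https://www.ipums.org/
```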
The next version of tidyselect that I'm about to release will cause CRAN failures for ipumsr because its tests are checking for exact matches of error messages generated in tidyselect and these have now changed. Since error message contents aren't part of the tidyselect API, could you please use testthat snapshots instead?
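As a sketch of the requested change (the tested expression and function name are illustrative, not ipumsr's actual tests), a brittle exact-match test would become a snapshot test:

```r
library(testthat)

# Before: breaks whenever tidyselect rewords its error message
# expect_error(select_var_rows(ddi, "NOT_A_VAR"), "Can't subset columns that don't exist")

# After: the message is recorded in a snapshot file and changes are
# reviewed rather than causing hard failures
test_that("unknown variables produce an informative error", {
  expect_snapshot(error = TRUE, select_var_rows(ddi, "NOT_A_VAR"))
})
```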
Hi,
When I am defining an API extract with many samples (i.e. n >= 20), would it be possible to add support for dplyr select helper verbs? Something like this -
define_extract_cps(
samples = select(starts_with("2022"))
)
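Until select helpers are supported, one possible workaround is to filter the sample list returned by `get_sample_info()` (a sketch; the sample-name pattern and variable choice are illustrative):

```r
library(ipumsr)

# One row per sample, with its API name and description
cps_samples <- get_sample_info("cps")

# Keep, e.g., all samples whose name mentions 2022
samples_2022 <- cps_samples$name[grepl("2022", cps_samples$name)]

extract <- define_extract_cps(
  description = "All 2022 CPS samples",
  samples = samples_2022,
  variables = "AGE"
)
```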
If htmltools, shiny, or DT are not present when trying to call `ipums_view(ddi)`, the user is prompted with the following error:
|Error in ipums_view(ddi) :
| Please install htmltools, shiny, and DT using
| `install.packages(c('htmltools', 'shiny', 'DT')
The closing `)` as well as the single quote are missing from the end of this message, which could confuse some users.
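One possible fix (a sketch, not the package's actual implementation) is to delegate the message to `rlang::check_installed()`, which builds a well-formed install hint automatically:

```r
# Hypothetical replacement for the hand-built error message
ipums_view_check_deps <- function() {
  rlang::check_installed(
    c("htmltools", "shiny", "DT"),
    reason = "to use `ipums_view()`"
  )
}
```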
The API request in `submit_extract()` is missing the `api_key` argument, so users can only submit an extract if their API key is in their `.Renviron`. Attempting to submit an extract with the API key specified explicitly in the `api_key` argument fails.
If attempting to download an extract by providing an `ipums_extract` object that was not yet completed at the time it was generated, `download_extract()` gets, but does not successfully use, the updated status of this extract provided by `get_extract_info()`. An expired extract error is thrown. This should be addressable by updating the `is_ready` variable after getting the updated info:
if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
}
should be changed to
if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
  is_ready <- extract_is_completed_and_has_links(extract)
}
Hello. I was working on parsing a DDI file and was looking at the ipumsr source code. One thing I found a bit confusing was a portion of the `ddi_read.R` file, which seems to parse the `<CodInstr>` section of the variable node. Most of the time, the categorical information is contained within the `<catgry>` tag; however, I noticed this section of the code uses a regular expression to parse that portion of the `CodInstr` tag. The code is below. My question is: why is it necessary to parse the `CodInstr` section of the DDI file, and is this a common thing? The regular expression is very specific, so I am not sure it would generalize very well. Is this function used only for the specific "total personal income" INCTOT variable, or do other variables also have categorical information in the `CodInstr` tag?
The code from ipumsr is found in `ddi_read.R`, starting at line 907.
parse_code_regex <- function(x, vtype) {
  if (vtype %in% c("numeric", "integer")) {
    labels <- fostr_named_capture(
      x,
      "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$",
      only_matches = TRUE
    )
    labels$val <- as.numeric(fostr_replace_all(labels$val, ",", ""))
  } else {
    labels <- fostr_named_capture(
      x,
      "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$",
      only_matches = TRUE
    )
  }
  labels
}
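For intuition, the numeric branch extracts a value/label pair from free-text coding instructions such as "9999999 = N.I.U.". A base-R sketch of the same named-capture idea (`fostr_named_capture()` is an internal ipumsr helper; this is illustrative only):

```r
x <- "9999999 = N.I.U."
pat <- "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"

# perl = TRUE enables named capture groups in base R
m <- regexpr(pat, x, perl = TRUE)
starts <- attr(m, "capture.start")[1, ]
lens <- attr(m, "capture.length")[1, ]

val <- substring(x, starts["val"], starts["val"] + lens["val"] - 1)
lbl <- substring(x, starts["lbl"], starts["lbl"] + lens["lbl"] - 1)
# val: "9999999"; lbl: "N.I.U."
```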
May 14, 2019 @gergness:
When I was first writing ipumsr I did some work translating the Stata code on static pages of ipums.org to explain how to use survey weight variables. It's always been on my todo list to help projects update, but I never did get around to it.
Yesterday, two IPUMS users on Twitter were talking about this:
https://twitter.com/surlyurbanist/status/1127968834902605825
To make sure it doesn't get lost, here's the translation of CPS, USA & NHIS user notes on weights for R.
Adapted from https://cps.ipums.org/cps/repwt.shtml
In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.
# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(data = data, weights = ~WTSUPP, repweights = "REPWTP[0-9]+", type = "JK1", scale = 4/60, rscales = rep(1, 160), mse = TRUE)
# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, weights = WTSUPP, repweights = matches("REPWTP[0-9]+"), type = "JK1", scale = 4/60, rscales = rep(1, 160), mse = TRUE)
After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:
svymean(~VAR1, svy)
svy %>%
  summarize(mn = survey_mean(VAR1))
We also need to be careful to subset the replicate weights when subsetting the data. For example, to subset to persons aged 25-64:
svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)
svy %>%
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))
Adapted from: https://usa.ipums.org/usa/repwt.shtml
In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.
# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(data = data, weights = ~PERWT, repweights = "REPWTP[0-9]+", type = "Fay", rho = 0.5, mse = TRUE)
# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, weights = PERWT, repweights = matches("REPWTP[0-9]+"), type = "Fay", rho = 0.5, mse = TRUE)
After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:
svymean(~VAR1, svy)
svy %>%
  summarize(mn = survey_mean(VAR1))
We also need to be careful to subset the replicate weights when subsetting the data. For example, to subset to persons aged 25-64:
svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)
svy %>%
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))
Adapted from https://nhis.ipums.org/nhis/userNotes_variance.shtml
The following general syntax will allow users to account for sampling weights and design variables when using Stata, SAS, SAS-callable SUDAAN, or R (through the survey or srvyr package) to estimate, for example, means using IPUMS NHIS data.
...
# If not installed already: install.packages("survey")
library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)
svymean(~VAR1, svy)
# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)
svy %>%
  summarize(mn = survey_mean(VAR1))
...
library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)
svy_subset <- subset(svy, AGE >= 65)
svymean(~VAR1, svy_subset)
library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)
svy %>%
  filter(AGE >= 65) %>%
  summarize(mn = survey_mean(VAR1))
Is there a reference mapping the sample name code to the sample dataset name? I don't see it in the vignette. Thank you!
I am using the function `read_ipums_ddi` to import the ATUS. It used to work fine in the past. Now I get the following error:
Error in read_xml.character(ddi_file_load, data_layer = NULL) : Opening and ending tag mismatch: meta line 12 and head [76]
For example, the detailed race codes could be of value for looking at multiple groups, yielding a dataset that can be filtered, versus several subset pulls or a hodgepodge set that may not address questions without multiple iterations.
Is there any way to improve usability so that "detailed" can be selected and all codes returned? This can be built manually, but that is considerable tedium.
Current default behavior:
var_spec(
  "RACE",
  case_selection_type = "detailed",
  case_selections = c("must include exactly")
)
Revised default behavior:
var_spec(
  "RACE",
  case_selection_type = "detailed",
  case_selections = "all, unless you list specific codes"
)
From ipumsr created by dtburk: mnpopcenter/ipumsr#72
We shouldn't see this message when we specify a subset of variables with the `vars` argument to `read_ipums_micro()`:
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(vars_of_interest)` instead of `vars_of_interest` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
I got this message with ipumsr version 0.4.5 and tidyselect version 1.1.0.
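In the meantime, users can silence the message on their side by wrapping the vector as the message suggests (a sketch; `ddi` and the variable names are illustrative):

```r
library(ipumsr)

vars_of_interest <- c("AGE", "SEX", "RACE")

# all_of() marks the character vector as an external vector,
# which silences the tidyselect ambiguity message
data <- read_ipums_micro(ddi, vars = all_of(vars_of_interest), verbose = FALSE)
```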
Apr 19, 2021 @schmert:
Haven-labelled variables are unfamiliar to many R users. The {ipumsr} documentation even includes instructions that suggest that R users will almost always want to alter the haven-labelled variables output by read_ipums_micro before doing any real work -- with zap_values, to_character, etc.
Would it be possible to add a parameter to read_ipums_micro that allows the user to choose how labelled variables are output in the first place? For example,
output_labelled_as = c("haven", "value", "label", "factor")
with the default being the current "haven"?
This could save R users a ton of headaches. Thanks.
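The request above corresponds to a post-read conversion that users must currently do by hand (a sketch; `ddi` is illustrative, and `output_labelled_as` is the proposed, not an existing, argument):

```r
library(ipumsr)

data <- read_ipums_micro(ddi, verbose = FALSE)

# Roughly what output_labelled_as = "factor" might do internally:
data_factor <- as_factor(data)

# Roughly what output_labelled_as = "value" might do internally:
data_values <- haven::zap_labels(data)
```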
Jan 15, 2021 @edarin:
Thanks a lot for developing this useful package.
Would it be possible to create a function for downloading data directly in R from the website?
Thanks!
This package depends on (depends, imports, or suggests) raster and one or more of the retiring packages rgdal, rgeos, or maptools (https://r-spatial.org/r/2022/04/12/evolution.html). Since raster 3.6.3, all use of external FOSS library functionality has been transferred to terra, making the retiring packages very likely redundant. It would help greatly if you could remove dependencies on the retiring packages as soon as possible.
The value-labels vignette refers to `as_factor()`, and pkgdown attempts to automatically link this to the appropriate function documentation, but in this case it links to `forcats::as_factor()` instead of `haven::as_factor()`. If the vignette were just referring to a function from an external package, we could use `haven::as_factor()` explicitly. However, ipumsr re-exports `haven::as_factor()` so that ipumsr users don't have to load haven to use it, so it wouldn't be ideal if we had to write `haven::as_factor()` in the vignette just to get the pkgdown link to work properly.
One partial solution would be to replace references to `as_factor()` with [`as_factor()`](https://haven.tidyverse.org/reference/as_factor.html) in the text of the vignette, but then those links would look different from the links auto-generated by pkgdown, and we would have to manually update the URL if haven ever moved its documentation site. Moreover, that approach wouldn't work for code references to `as_factor()`.
It's possible that we should create an issue on pkgdown or downlit requesting a new feature that allows pkgdown users to manually specify which package a function is from for function names that appear in multiple packages, or alternatively, an update that checks for function name matches in re-exported functions before looking more widely.
`ipums_website()` has several issues that should be addressed. Currently, the list of supported projects is out of date and the UI is somewhat inconsistent. While this function likely does not get substantial use, it may remain useful given the current absence of a metadata API for microdata projects. We need to:
- update the list of supported projects
- reconsider the `var` argument, since some projects that do not have variable-specific websites are supported
- accept other input types (e.g. an `ipums_ddi` object)

Hi there, I am trying to submit an NHGIS data extract through ipumsr and I'm unable to locate a table that I know is available through the website. The table is B19001H and I need it for both the 2005-2009 ACS and the 2014-2018 ACS. If helpful, the titles are:
Is it possible to make this available through ipumsr, or should I manually download this extract? Thank you!