ipumsr's People

Contributors

bjornarneson, dtburk, etiennebacher, franfabrizio, gergness, hadley, jacobkap, renae-r, robe2037

ipumsr's Issues

Improve lower_vars workflow (created May 1, 2020 by @dtburk on mnpopcenter/ipumsr)

May 1, 2020 @dtburk:

In response to mnpopcenter/ipumsr#56, we added a warning message when the lower_vars argument to any of the read_ipums_* functions is ignored. As described in the discussion of that issue, the reason the argument is sometimes ignored is to make sure the case of the variable names stays in sync between the data and the ipums_ddi object associated with the data. Keeping these in sync is helpful if the user wants to use a function like set_ipums_var_attributes() that attaches metadata from the ipums_ddi to variables in a data.frame. However, by making these metadata-attaching functions a little smarter, we can probably allow the case of variable names to get out of sync between ipums_ddi and data.frame, while still allowing users to attach metadata if they want to. Once we make those fixes, we can allow users to convert variable names to lowercase when they read in the data, even if they have already read in the DDI.
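
One way the metadata-attaching step could tolerate a case mismatch is to compare names case-insensitively. The sketch below is only an illustration under stated assumptions: `attach_labels_ignore_case()` is a hypothetical helper, not an ipumsr function, and the `var_info` table is a toy stand-in for the variable information stored in an ipums_ddi object.

```r
# Hypothetical helper: attach variable labels from a metadata table to a
# data.frame, matching variable names without regard to case.
attach_labels_ignore_case <- function(data, var_info) {
  # Position of each data column in the metadata table (NA if absent)
  idx <- match(toupper(names(data)), toupper(var_info$var_name))
  for (i in seq_along(data)) {
    if (!is.na(idx[i])) {
      attr(data[[i]], "label") <- var_info$var_label[idx[i]]
    }
  }
  data
}

# Toy stand-ins: uppercase names in the metadata, lowercase names in the data
var_info <- data.frame(
  var_name = c("YEAR", "AGE"),
  var_label = c("Census year", "Age"),
  stringsAsFactors = FALSE
)
data <- data.frame(year = c(2017, 2017), age = c(34, 51))

labelled_data <- attach_labels_ignore_case(data, var_info)
attr(labelled_data$age, "label")
```

The column names in the data are left untouched; only the lookup ignores case.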

Release ipumsr 0.6.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Rebuild pkgdown site
  • git push

`add_to_extract()` silently swallows unused arguments

add_to_extract() allows arbitrary argument names for cross-product compatibility, but no check is done to warn users if they include arguments that are not relevant for the particular extract type they are working with. This produces confusing behavior. For instance:

extract <- define_extract_usa(
  samples = "us2017b",
  variables = "YEAR",
  description = "Test extract"
)

# Returns extract with no modifications or warnings since there is no "vars" field in a usa_extract
add_to_extract(
  extract,
  vars = "New Variable"
)

We already warn users in remove_from_extract(), so this just requires extending that check to add_to_extract().
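
A rough sketch of what such a check might look like follows. It assumes an extract can be treated as a named list of fields; `warn_unused_args()` is a hypothetical name and the list below is a toy stand-in for a usa_extract, not the real object.

```r
# Hypothetical check: warn about arguments that do not correspond to any
# field of the extract being modified.
warn_unused_args <- function(extract, ...) {
  dots <- list(...)
  unused <- setdiff(names(dots), names(extract))
  if (length(unused) > 0) {
    warning(
      "The following fields were not recognized and were ignored: ",
      paste0("`", unused, "`", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(unused)
}

# Toy stand-in for a usa_extract, which has no "vars" field
extract <- list(
  samples = "us2017b",
  variables = "YEAR",
  description = "Test extract"
)

# Emits a warning naming `vars` instead of silently doing nothing
warn_unused_args(extract, vars = "New Variable")
```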

ipums_view not correctly displaying table in RStudio viewer

ipums_view does not show value labels in the RStudio viewer. Instead, it will write something like "Showing 1 to 6 of 6 entries" without showing any of the entries on the page. Opening the page in a new browser window solves the issue.

Consider swapping order of package URLs in the DESCRIPTION file

Currently ipums.org is the first-listed URL in the DESCRIPTION file, which means that links to the package generated by tools such as downlit will go to that URL. It might make more sense to list the GitHub URL, or tech.popdata.org/ipumsr, first in the DESCRIPTION file for this reason.

ipumsr and tidyselect 1.2.1

The next version of tidyselect that I'm about to release will cause CRAN failures for ipumsr because its tests are checking for exact matches of error messages generated in tidyselect and these have now changed. Since error message contents aren't part of the tidyselect API, could you please use testthat snapshots instead?

error message typo in ipums_view

If htmltools, shiny, or DT are not present when trying to call ipums_view(ddi), the user is prompted with the following error:

|Error in ipums_view(ddi) :
| Please install htmltools, shiny, and DT using
| `install.packages(c('htmltools', 'shiny', 'DT')

The closing ) and the closing backtick are missing from the end of this message, which could confuse some users.
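
One way to avoid this class of typo is to build the hint from the package names rather than typing it out by hand. This is a minimal sketch, not the code used in ipums_view(); `install_hint()` is a hypothetical helper.

```r
# Hypothetical helper: build the installation hint from the package names so
# the quotes, parentheses, and backticks are always balanced.
install_hint <- function(pkgs) {
  quoted <- paste0("'", pkgs, "'", collapse = ", ")
  paste0(
    "Please install ", paste(pkgs, collapse = ", "), " using\n",
    "`install.packages(c(", quoted, "))`"
  )
}

cat(install_hint(c("htmltools", "shiny", "DT")))
```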

Cannot use `api_key` argument in `submit_extract`

The API request in submit_extract is missing the api_key argument, so users can only submit an extract if their API key is in their .Renviron. Attempting to submit an extract with the API key specified explicitly in the api_key argument fails.
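
The general shape of the fix is to forward the api_key argument to the request instead of letting the lower-level code fall back to the environment on its own. The sketch below only illustrates that pattern: `send_request()` and `submit_sketch()` are hypothetical stand-ins, and the IPUMS_API_KEY environment variable name is assumed here.

```r
# Hypothetical stand-in for the low-level request function.
send_request <- function(body, api_key) {
  if (!nzchar(api_key)) stop("No API key provided", call. = FALSE)
  list(body = body, authorization = api_key)
}

# The user-facing function forwards its api_key argument explicitly, so a key
# supplied by the caller takes precedence over the environment variable.
submit_sketch <- function(body, api_key = Sys.getenv("IPUMS_API_KEY")) {
  send_request(body, api_key = api_key)
}

req <- submit_sketch("extract definition", api_key = "my-explicit-key")
req$authorization
```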

`download_extract()` fails if provided an `ipums_extract` that has finished on server but does not have links in R

If attempting to download an extract by providing an ipums_extract object that was not yet completed at the time it was generated, download_extract() gets, but does not successfully use, the updated status of this extract provided by get_extract_info(). An expired extract error is thrown.

Should be able to be addressed by updating the is_ready variable after getting updated info:

if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
}

should be changed to

if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
  is_ready <- extract_is_completed_and_has_links(extract)
}

understanding parsing of DDI file using regular expression

Hello. I was working on parsing a DDI file and was looking at the IPUMSR source code. One thing I found a bit confusing was a portion of the ddi_read.R file, which seems to parse the <CodInstr> section of the variable node.

Most of the time, the categorical information is contained within the <catgry> tag. However, I noticed this section of the code that uses a regular expression to parse that portion of the CodInstr tag. The code is below. My question is: why is it necessary to parse the CodInstr section of the DDI file, and is this a common thing? The regular expression is very specific, so I am not sure that it would generalize very well. Is this specific function used only for the "total personal income" INCTOT variable, or are there other variables that also have categorical information in the CodInstr tag?

The code from IPUMSR is found in the specified file ddi_read.R starting at line 907.

parse_code_regex <- function(x, vtype) {
  if (vtype %in% c("numeric", "integer")) {
    labels <- fostr_named_capture(
      x,
      "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$",
      only_matches = TRUE
    )

    labels$val <- as.numeric(fostr_replace_all(labels$val, ",", ""))
  } else {
    labels <- fostr_named_capture(
      x,
      "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$",
      only_matches = TRUE
    )
  }

  labels
}
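
To see what the numeric branch of this pattern captures, it can be tried on a few lines with base R. This is only an exploratory sketch: the input lines are made up rather than taken from a real DDI file, and `regexec()` stands in for ipumsr's internal fostr_named_capture().

```r
# Numeric pattern copied from parse_code_regex() above
pattern <- "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"

# Made-up CodInstr-style lines; the last one has no leading code
lines <- c("9999999 = N.I.U.", "1,000 = One thousand", "Amounts are in dollars")

# Each match holds the full match followed by the four capture groups:
# val, the whole separator, the last separator piece, and lbl
matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
matched <- lengths(matches) > 0

labels <- data.frame(
  val = as.numeric(gsub(",", "", unname(vapply(matches[matched], `[`, character(1), 2)))),
  lbl = unname(vapply(matches[matched], `[`, character(1), 5)),
  stringsAsFactors = FALSE
)
labels
```

Lines that do not start with a numeric code are simply dropped, which mirrors the only_matches = TRUE argument in the original function.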

survey weights (created May 14, 2019 by @gergness on mnpopcenter/ipumsr)

May 14, 2019 @gergness:

When I was first writing ipumsr I did some work translating the Stata code on static pages of ipums.org to explain how to use survey weight variables. It's always been on my todo list to help projects update, but I never did get around to it.

Yesterday, two IPUMS users on twitter were talking about this:
https://twitter.com/surlyurbanist/status/1127968834902605825

To make sure it doesn't get lost, here's the translation of CPS, USA & NHIS user notes on weights for R.


CPS - Replicate Weights

Adapted from https://cps.ipums.org/cps/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.

  • The sample should be treated as a single stratum (the weights contain the relevant information from the sample design), so no PSU should be specified.
  • The full-sample weight must be specified.
  • You then specify the replicate weights in the repweights argument. Note that IPUMS-CPS data contain a variable called REPWTP, which merely indicates the presence of replicate weights and is coded 1 for every case. Therefore, make sure to use a regular expression like "REPWTP[0-9]+" to make sure you don't include REPWTP.
  • The fpc argument should not be specified.
  • The type argument should be set to "JK1", with scale set to 4/60 and rscales set to rep(1, 160)
  • The mse argument should be set to TRUE
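
The point about the regular expression can be checked on a vector of column names before building the design object. The names below are a made-up subset used only for illustration.

```r
# Made-up subset of column names from a CPS extract with replicate weights
cols <- c("WTSUPP", "REPWTP", "REPWTP1", "REPWTP2", "REPWTP160", "AGE")

# Requiring at least one digit after the prefix skips the REPWTP flag variable
rep_cols <- grep("REPWTP[0-9]+", cols, value = TRUE)
rep_cols

# Without the digit requirement, the flag variable is picked up as well
grep("REPWTP", cols, value = TRUE)
```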

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(data = data, weights = ~WTSUPP, repweights = "REPWTP[0-9]+", type = "JK1", scale = 4/60, rscales = rep(1, 160), mse = TRUE)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, weights = WTSUPP, repweights = matches("REPWTP[0-9]+"), type = "JK1", scale = 4/60, rscales = rep(1, 160), mse = TRUE)

After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))

And we need to be careful to subset the replicate weights when subsetting. For example, if we wanted to subset to persons aged 25-64, we would run this command:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

USA - Replicate weights

Adapted from: https://usa.ipums.org/usa/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.

  • The sample should be treated as a single stratum (the weights contain the relevant information from the sample design), so no PSU should be specified.
  • The full-sample weight must be specified.
  • You then specify the replicate weights in the repweights argument. Note that IPUMS-USA data contain a variable called REPWTP, which merely indicates the presence of replicate weights and is coded 1 for every case. Therefore, make sure to use a regular expression like "REPWTP[0-9]+" to make sure you don't include REPWTP.
  • The fpc argument should not be specified.
  • The type argument should be set to "Fay" and rho to 0.5
  • The mse argument should be set to TRUE

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(data = data, weights = ~PERWT, repweights = "REPWTP[0-9]+", type = "Fay", rho = 0.5, mse = TRUE)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, weights = PERWT, repweights = matches("REPWTP[0-9]+"), type = "Fay", rho = 0.5, mse = TRUE)

After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))

And we need to be careful to subset the replicate weights when subsetting. For example, if we wanted to subset to persons aged 25-64, we would run this command:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

IPUMS NHIS

Adapted from https://nhis.ipums.org/nhis/userNotes_variance.shtml

General Syntax to Account for Sample Design

The following general syntax will allow users to account for sampling weights and design variables when using STATA, SAS, SAS-callable SUDAAN, or R (through the survey or srvyr package) to estimate, for example, means using IPUMS NHIS data.

...

R (survey)

# If not installed already: install.packages("survey")
library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svymean(~VAR1, svy)

R (srvyr)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  summarize(mn = survey_mean(VAR1))

Subsetting IPUMS NHIS Data

...

R (survey)

library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svy_subset <- subset(svy, AGE >= 65)
svymean(~VAR1, svy_subset)

R (srvyr)

library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  filter(AGE >= 65) %>%
  summarize(mn = survey_mean(VAR1))

Error in read_ipums_ddi

I am using the function read_ipums_ddi to import the ATUS.
It used to work fine in the past.

I get the following error

Error in read_xml.character(ddi_file_load, data_layer = NULL) :    Opening and ending tag mismatch: meta line 12 and head [76]

Return all when case_selection_type is "detailed" and no selections specified

For example, detailed race codes are useful for looking at multiple groups: a single dataset that can be filtered is preferable to several subset pulls, or to a hodgepodge selection that may not answer the question without multiple iterations.
Is there any way to improve usability so that "detailed" can be selected and all codes are returned?
This can be manually built but that is considerable tedium.

Current default behavior:
var_spec("RACE",
  case_selection_type = "detailed", case_selections = c('must include exactly'))

Revised default behavior:
var_spec("RACE",
  case_selection_type = "detailed", case_selections = "all, unless you list specific codes")

Get rid of message related to reading in a subset of variables

From ipumsr created by dtburk: mnpopcenter/ipumsr#72

We shouldn't see this message when we specify a subset of variables with the vars argument to read_ipums_micro():

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(vars_of_interest)` instead of `vars_of_interest` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

I got this message with ipumsr version 0.4.5 and tidyselect version 1.1.0.
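
For context, the message comes from tidyselect when a character vector stored in a variable is used directly in a selection. The sketch below shows the all_of() form that the message recommends, using dplyr::select() as a stand-in for the selection that happens inside read_ipums_micro(). The data frame is a toy example, and how ipumsr should apply this internally is not settled here.

```r
library(dplyr)

# Toy data and an external character vector of variable names
df <- data.frame(YEAR = 2017, AGE = 34, SEX = 1)
vars_of_interest <- c("YEAR", "AGE")

# Wrapping the vector in all_of() makes it explicit that the names come from
# an external vector rather than from a column called vars_of_interest
subset_df <- select(df, all_of(vars_of_interest))
names(subset_df)
```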

optional parameter in read_ipums_micro to choose how haven-labelled variables are handled (created Apr 19, 2021 by @schmert on mnpopcenter/ipumsr)

Apr 19, 2021 @schmert:

Haven-labelled variables are unfamiliar to many R users. The {ipumsr} documentation even includes instructions that suggest that R users will almost always want to alter the haven-labelled variables output by read_ipums_micro before doing any real work -- with zap_values, to_character, etc.

Would it be possible to add a parameter to read_ipums_micro that allows the user to choose how labelled variables are output in the first place? For example,
output_labelled_as = c("haven", "value", "label", "factor")
with the default being the current "haven"?

This could save R users a ton of headaches. Thanks.
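
To make the request concrete, here is a rough sketch of how such a choice might be applied to a single labelled vector using existing haven functions. `convert_labelled()` is a hypothetical helper, and its output_labelled_as argument mirrors the proposal above rather than anything that exists in ipumsr.

```r
library(haven)

# Hypothetical helper: convert one haven-labelled vector according to the
# proposed output_labelled_as choice.
convert_labelled <- function(x,
                             output_labelled_as = c("haven", "value", "label", "factor")) {
  output_labelled_as <- match.arg(output_labelled_as)
  switch(output_labelled_as,
    haven = x,                                           # keep labelled class
    value = zap_labels(x),                               # plain values only
    label = as.character(as_factor(x, levels = "labels")), # labels as text
    factor = as_factor(x, levels = "labels")             # labels as a factor
  )
}

# Toy labelled vector
sex <- labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))

convert_labelled(sex, "label")
convert_labelled(sex, "value")
```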

Please remove dependencies on **rgdal**, **rgeos**, and/or **maptools**

This package depends on (depends, imports or suggests) raster and one or more of the retiring packages rgdal, rgeos or maptools (https://r-spatial.org/r/2022/04/12/evolution.html). Since raster 3.6.3, all use of external FOSS library functionality has been transferred to terra, making the retiring packages very likely redundant. It would help greatly if you could remove dependencies on the retiring packages as soon as possible.

pkgdown site links `as_factor()` to `forcats::as_factor()` instead of `haven::as_factor()`

The value-labels vignette refers to as_factor(), and pkgdown attempts to automatically link this to the appropriate function documentation, but in this case, it links to forcats::as_factor() instead of haven::as_factor(). If the vignette was just referring to a function from an external package, we could just use haven::as_factor() explicitly. However, ipumsr re-exports haven::as_factor() so that ipumsr users don't have to load haven to use it, so it wouldn't be ideal if we had to use haven::as_factor() in the vignette just to get the pkgdown link to work properly.

One partial solution would be to replace references to as_factor() with

[`as_factor()`](https://haven.tidyverse.org/reference/as_factor.html)

in the text of the vignette, but then those links would look different from the links auto-generated by pkgdown, and we would have to manually update the url if haven ever moved its documentation site. Moreover, that approach wouldn't work for code references to as_factor().

It's possible that we should create an issue on pkgdown or downlit requesting a new feature that allows pkgdown users to manually specify which package a function is from for function names that appear in multiple packages, or alternatively, an update that checks for function name matches in re-exported functions before looking more widely.

Update project info and UI for `ipums_website()`

ipums_website() has several issues that should be addressed. Currently, the list of supported projects is out of date and the UI is somewhat inconsistent. While this function likely does not get substantial use, it may remain useful given the current absence of a metadata API for microdata projects. We need to:

  • Update project names that are out of date (including hyphens)
  • Add recent IPUMS projects and remove retired ones
  • Allow use of API codes to specify projects for consistency with other functions in package
  • Allow function to work on OS other than Windows
  • Don't require var argument, since some projects that do not have variable-specific websites are supported
  • Streamline S3 dispatch, as a different argument is required if specifying project name manually (as opposed to with an ipums_ddi object)
  • Deprecate superfluous arguments and update defaults where confusing

Table not available through ipumsr?

Hi there, I am trying to submit an NHGIS data extract through ipumsr and I'm unable to locate a table that I know is available through the website. The table is B19001H and I need it for both 2005-2009 ACS and 2014-2018 ACS. If helpful, the titles are:

  • For 2005-2009 ACS: Household Income in the Past 12 Months (in 2009 Inflation-Adjusted Dollars) (White Alone, Not Hispanic or Latino Householder).
  • For 2014-2018 ACS: Household Income in the Past 12 Months (in 2018 Inflation-Adjusted Dollars) (White Alone, Not Hispanic or Latino Householder).

Is it possible to make this available through ipumsr, or should I manually download this extract? Thank you!
