censusxy's Introduction

I am a data scientist passionate about building real-world evidence data and software products that accelerate the production of insights.

Currently, I work as a Director in Pfizer’s Evidence Generation Platform, where I am an RWE Digital Platforms specialist working across all of our therapeutic areas.

I specialize in health disparities research, geospatial mapping and analyses, the development of open source research software, and data acquisition.

Follow My Work

  • You can follow my work on my website

censusxy's People

Contributors

bransonf, chris-prener, christopherkenny

censusxy's Issues

cxy_geocode issues within a function (non standard evaluation issue?)

I'm having some issues with cxy_geocode when it is called within a function, I think due to the non-standard evaluation of inputs. Regardless, the function below attempts to geocode, link with an sf object created from a TIGER shapefile (to extract the full GEOID), and end with a nice data frame containing the original data and the GEOIDs of their census blocks. This idea works outside of a function, but within one it always errors out because some of the variables can't be found. I think I see the issue (the way the function and cxy_geocode handle quoted strings) but haven't been able to resolve it. Thanks for any thoughts!

### SOME DATA
data<-data.frame(Address = c("4643 Lindell Blvd", "3545 Lafayette Ave"), City = c("St. Louis", "St. Louis"), State = c("MO", "MO"), ZIP.Code = c(63108, 63104) )

#### BEGIN FUNCTION
geocodeme<-function(data, StateSub, Street.col, City.col, State.col, Zip.col){

#### state subset

sub<-subset(data, toupper(data[,rlang::quo_name(rlang::enquo(State.col))])%in%toupper(StateSub))

require("censusxy") || install.packages("censusxy")
library(censusxy)

##### GEOCODE

results <- cxy_geocode(sub, address = Street.col, city = City.col, state = State.col, zip=Zip.col, style="full", output="sf")

require("tigris") || install.packages("tigris")
library(tigris)

require("sf") || install.packages("sf")
library(sf)

### pull state
shp<-block_groups(toupper(StateSub), class="sf")

#### set projection
results <- sf::st_transform(results, "+proj=longlat +datum=WGS84")
shp <- sf::st_transform(shp, "+proj=longlat +datum=WGS84")

require("spatialEco") || install.packages("spatialEco")
library(spatialEco)

out<-point.in.poly(results, shp)
rm(shp)
rm(results)
return(out)

}
#### END FUNCTION

### run function
blah<-geocodeme(data, StateSub="MO", Street.col=Address, City.col=City, State.col=State, Zip.col=ZIP.Code)
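One way to sidestep the quoting problem, sketched here under the assumption that the geocoding function accepts column names as character strings (as the calls in later issues on this page do), is to have the wrapper take strings as well and index columns with `[[`. The helper names `check_columns()` and `subset_by_state()` are hypothetical and cover only the state-subsetting step; nothing below calls censusxy.

```r
# Hypothetical helpers: the wrapper receives column names as character
# strings, so no non-standard evaluation is involved.
check_columns <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

subset_by_state <- function(data, state_col, states) {
  check_columns(data, state_col)
  # Case-insensitive match on the state column, looked up by name
  keep <- toupper(data[[state_col]]) %in% toupper(states)
  data[keep, , drop = FALSE]
}

# The third row is made up purely to show a row being filtered out.
addresses <- data.frame(
  Address = c("4643 Lindell Blvd", "3545 Lafayette Ave", "1 Main St"),
  City = c("St. Louis", "St. Louis", "Chicago"),
  State = c("MO", "mo", "IL"),
  stringsAsFactors = FALSE
)

mo_only <- subset_by_state(addresses, state_col = "State", states = "MO")
```

The same string arguments could then be forwarded unchanged to the geocoding call, which avoids mixing quoted and unquoted column references inside the wrapper.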

cxy_geocode arguments returning "object ___ not found" for existing objects

Hello,

I cannot figure out why the columns containing the names of address components in my dataframe continuously return the object ___ not found error.

Here's the code:

# load libraries
library(tidyverse)
library(rvest)
library(censusxy)

# import agencies from TS website
active_agencies <- read_html("https://docs.google.com/spreadsheets/d/e/2PACX-1vRnKxfwMxfnc1tg6iEoSuKw5obUvxxDiBV-WDRYuuLy_xX6FWLb27cjsq3DuSZs7cAg1FxRdWjCpvW6/pubhtml?gid=0&single=true") %>%
  html_node("table") %>%
  html_table()

# set up agencies list as df
active_agencies <- active_agencies[-c(1:10),]
names(active_agencies) <- as.matrix(active_agencies[1,])
active_agencies <- active_agencies[-c(1),]
active_agencies$`Zip Code` <- sapply(active_agencies$`Zip Code`, as.numeric)
active_agencies <- filter(active_agencies, `Zip Code` > 80000)

# geocode agencies list
geocode <- cxy_geocode(active_agencies, street = Address, city = City, state = "Nevada", zip = `Zip Code`, return = 'locations', class = 'dataframe', output = 'simple')

In this code, the "city" argument is the first one to throw an error. I've double-checked the column name by ensuring that active_agencies$City returns the city names. I've also changed the name of the column and tried running cxy_geocode() using the new column name but still got the object ___ not found error.

I've also tried skipping the "city" argument to see if that would allow the function to run. Unfortunately, the object ___ not found error seems to shift to another field. In my case, the object ___ not found error appeared in the order of city --> state --> zip --> address (i.e., if I omitted the "city" argument but included all other arguments, then the "state" argument would be the next one to return the object ___ not found error).

I'm relatively new to R, so it's entirely possible I'm missing something obvious. If it helps:
-- R version 4.1.0
-- RStudio version 1.4.1106
-- Windows 10, 64-bit

Integrate GitHub Actions

GitHub Actions seems like the path of least resistance in the long term. It supports all operating systems and is much faster.

Linux is failing due to an outdated version of GDAL <2.0.1

MacOS is failing due to missing the udunits system dependency

We need to add the code coverage secret to integrate coverage with GitHub Actions. It is also possible to build the site automatically on commit. That may be useful, especially if we want to separate the pkgdown website and serve it from a gh-pages branch as opposed to master.

GEOID extraction from geocoding call

Is there a way to pull the GEOID (up to at least block group FIPS codes) from the cxy_geocode() function? The API may not allow this, but you can get it if you geocode manually on the website. Thanks!

Issue with address batch n > 3

Hi,
I am currently trying to get the census tracts for a list of addresses.
When I run this test with 3 addresses, I am able to get all the information I need:
geoids <- cxy_geocode(
  data.frame(
    street = c("2584 Haverford Dr", "3964 Alemeny Blvd", "3269 Mission St"),
    city = c("Troy", "San Francisco", "San Francisco"),
    state = c("MI", "CA", "CA"),
    postal_code = c("48098", "94132", "94110"),
    Id = c(1, 2, 3)
  ),
  street = 'street', city = 'city', state = 'state', zip = 'postal_code',
  return = 'geographies', class = 'dataframe', output = 'full',
  benchmark = "Public_AR_Census2020", vintage = "Census2020_Census2020"
)

However, when I try the same code with a list of 851 addresses I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) : Empty reply from server

Any idea as to why I am getting this error? Any help will be greatly appreciated

General feedback

  • what are the vintage options?
  • help files need to be clarified and expanded
  • make sure return() appears at the end of all the single line functions

CRAN issue

Please correct before 2021-02-15 to safely retain your package on CRAN.

It seems we need to remind you of the CRAN policy:

'Packages which use Internet resources should fail gracefully with an informative message
if the resource is not available or has changed (and not give a check warning nor error).'
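
The policy quoted above can be met with a pattern like the following base R sketch. `safe_request()` is a hypothetical wrapper, not part of censusxy; it assumes the network call is passed in as a function, and it turns any error or warning into a message plus a `NULL` return value.

```r
# Hypothetical wrapper: run a function that touches the network and
# degrade to a message + NULL instead of an error or warning.
safe_request <- function(request_fun, ...) {
  on_failure <- function(cond) {
    message("Resource unavailable: ", conditionMessage(cond))
    NULL
  }
  tryCatch(request_fun(...), error = on_failure, warning = on_failure)
}

# Stand-ins for a real request, used only to exercise the wrapper.
failing_request <- function() stop("Empty reply from server")
working_request <- function() list(status = "Match")

failed <- safe_request(failing_request)  # emits a message, returns NULL
ok <- safe_request(working_request)      # returns the list unchanged
```

Callers then need to check for `NULL` before using the result, which keeps examples, vignettes, and tests from failing when the remote service is down.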

Non-ASCII characters lead to NA geocoding records

Maybe this is obvious, but I am seeing some strange behavior with non-ASCII characters on Mac OS. Here's my version info:

> packageVersion("censusxy")
[1] ‘1.0.0’
> getRversion()
[1] ‘4.0.2’

library(censusxy)
library(dplyr)
library(stringi)
library(sf)

> # this works (as does the web ui)
> g<-cxy_single('412 45th Strèet','Oakland','CA','94609', return = 'geographies', vintage = 'Current_Current')
> summary(as.factor(g$cxy_status))
integer(0)
> 
> 
> # this breaks
> my_df <- data.frame(street= '412 45th Strèet', city = 'Oakland', state='CA', zip ='94609')
> geocoded_data <- cxy_geocode(my_df,
+                                     street = "street", 
+                                     city = "city", 
+                                     state = "state", 
+                                     zip = "zip",
+                                     output = "full", 
+                                     class = "dataframe", 
+                                     return="geographies",
+                                     vintage ='Current_Current')
> 
> summary(as.factor(geocoded_data$cxy_status))
NA's 
   1 
> 
> my_df <- my_df %>% 
+   mutate_if(is.character,
+             stri_trans_general,
+             id = "latin-ascii")
> 
> # this works once the non-ASCII characters have been transliterated
> geocoded_data <- cxy_geocode(my_df,
+                              street = "street", 
+                              city = "city", 
+                              state = "state", 
+                              zip = "zip",
+                              output = "full", 
+                              class = "dataframe", 
+                              return="geographies",
+                              vintage ='Current_Current')
> 
> summary(as.factor(geocoded_data$cxy_status))
Match 
    1 
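
A small base R check, sketched here as a pre-flight step rather than a fix inside censusxy, can flag which address strings contain non-ASCII characters before a batch is sent. `has_non_ascii()` is a hypothetical helper; it relies on `iconv()` returning `NA` for strings that cannot be converted to ASCII, and it assumes the input can be represented as UTF-8.

```r
# Hypothetical helper: TRUE for strings containing non-ASCII characters.
has_non_ascii <- function(x) {
  x <- enc2utf8(as.character(x))
  # iconv() yields NA when a string cannot be converted to ASCII;
  # genuine missing values are excluded so they are not flagged.
  !is.na(x) & is.na(iconv(x, from = "UTF-8", to = "ASCII"))
}

streets <- c("412 45th Street", "412 45th Str\u00e8et", NA)
flagged <- has_non_ascii(streets)
```

Rows where `flagged` is `TRUE` could then be transliterated (for example with `stringi`, as in the report above) or reviewed by hand before geocoding.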

tie

From what I can tell when cxy_status = 'Tie' no geocode is listed, though it would make sense that there are two results. Is there a way to get at least one of those results to show up?

Malformed Response Data

hey @bransonf - I've been working with some particularly exciting address data, and managed to break censusxy. With the attached data, I load them up in R and then:

censusxy::cxy_geocode(test2, address = "gw_i_address", city = "gw_i_city",
            state  = "gw_i_state", zip = "gw_i_zip",
            style = "minimal", output = "tibble", timeout = 30)

This results in a warning:

Warning message:
Expected 4 pieces. Missing pieces filled with `NA` in 3 rows [21, 33, 42]. 

When you look at what gets returned by commenting out lines 144 to 156 of cxy_geocode.R, you'll see that things aren't being put back where we expect them to be:

image

Can you take a shot at figuring out what is causing the unexpected response?

test2.zip

Missing Pieces Warning

@bransonf - I'm getting a warning with censusxy that I am hoping you can run down the source of. When I geocode, I get #> Warning: Expected 2 pieces. Missing pieces filled with NA in 1 rows [7]. This same warning actually shows up on our package website. Thanks for taking a look at this!

reprex to create it:

test <- dplyr::tibble(
  address = c("1416 Delmar Blvd", "3803 E Dr Martin Luther King Dr", "3700 Lindell Blvd", "3700 Lindell",
              "Lindell Blvd at S Vandeventer Ave", "1420 Delmar Blvd", "Calvary Cemetery"),
  city = c("St. Louis", "St. Louis", "St. Louis", "St. Louis", "St. Louis", "St. Louis", "St. Louis"),
  state = c("MO", "MO", "MO", "MO", "MO", "MO", "MO"),
  zip = c(NA, NA, 63108, 63108, NA, NA , NA)
)

censusxy::cxy_geocode(test, address = "address", city = "city", state  = "state", zip = "zip",
            style = "minimal", output = "tibble", timeout = 30)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [7].
#> # A tibble: 7 x 9
#>   address   city  state   zip cxy_status cxy_quality cxy_match    lon   lat
#>   <chr>     <chr> <chr> <dbl> <chr>      <chr>       <chr>      <dbl> <dbl>
#> 1 1416 Del… St. … MO       NA Match      Exact       1416 DELM… -90.2  38.6
#> 2 3803 E D… St. … MO       NA Match      Exact       3803 DR M… -90.2  38.6
#> 3 3700 Lin… St. … MO    63108 Match      Exact       3700 LIND… -90.2  38.6
#> 4 3700 Lin… St. … MO    63108 Match      Non_Exact   3700 LIND… -90.2  38.6
#> 5 Lindell … St. … MO       NA Match      Non_Exact   LINDELL B… -90.2  38.6
#> 6 1420 Del… St. … MO       NA Match      Exact       1420 DELM… -90.2  38.6
#> 7 Calvary … St. … MO       NA No_Match   ""          ""          NA    NA

Created on 2019-09-27 by the reprex package (v0.3.0)

To Do List

  • Finish Tests
  • Write Vignette
  • Update README
  • Pass CRAN Checks
  • Implement CI
  • Build Pkgdown Site
  • create landing page for pkgdown site as index.Rmd - use other openGIS pkgs as template
  • the .rda files are not properly serialized for pre r v 3.5 support - https://cran.r-project.org/doc/manuals/r-devel/NEWS.html under r 3.6.0 news
  • remove use of tidyr::separate_() - the _ functions are no longer recommended
  • description of .data for cxy_geocode() is confusing
  • create two exported sample data sets - stl_homicide.rda and stl_homicide_small.rda - with associated documentation
  • build out unit tests using stl_homicide_small.rda to decrease testing time
  • Improved join by UID to ID
  • send to winbuilder
  • initial release with Zenodo
  • fork and add CRAN instructions to README.Rmd, index.Rmd, and censusxy.Rmd
  • rebuild pkgdown site and README on the fork
  • submit to CRAN

Parallel processing fails on macOS

I'm having an issue with parallel processing on macOS, potentially related to changes to CoreFoundation and/or forking in recent macOS updates. When I set parallel = 2 (or any number > 1) I receive the following error:

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
In addition: Warning messages:
1: In parallel::mclapply(batches, batch_geocoder, return, timeout,  :
  scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected

I can reproduce this error on an M1 Mac and Intel Mac, both running macOS 12.4. I didn't have any issues with parallel processing in censusxy circa February 2022.

Reprex below:

library(censusxy)

data <- stl_homicides

## Works
homicide_sf <- cxy_geocode(data, street = "street_address", city = "city", state = "state", output="full", vintage = 'Current_Current', benchmark = "Public_AR_Current", return="geographies", class="dataframe", parallel=1)

## Breaks with 'The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().'
homicide_sf <- cxy_geocode(data, street = "street_address", city = "city", state = "state", output="full", vintage = 'Current_Current', benchmark = "Public_AR_Current", return="geographies", class="dataframe", parallel=2)
  • OS: macOS 12.4
  • Version of R: 4.2.0
  • Version of RStudio: 2022.02.2
  • Version of censusxy: 1.0.2.9000 (also failed under CRAN version)
  • Session info:
R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] censusxy_1.0.2.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3       magrittr_2.0.3     units_0.8-0        tidyselect_1.1.2   R6_2.5.1           rlang_1.0.2       
 [7] fansi_1.0.3        httr_1.4.3         dplyr_1.0.9        tools_4.2.0        parallel_4.2.0     grid_4.2.0        
[13] KernSmooth_2.23-20 utf8_1.2.2         cli_3.3.0          e1071_1.7-9        DBI_1.1.2          ellipsis_0.3.2    
[19] class_7.3-20       assertthat_0.2.1   tibble_3.1.7       lifecycle_1.0.1    sf_1.0-7           crayon_1.5.1      
[25] purrr_0.3.4        vctrs_0.4.1        curl_4.3.2         mime_0.12          glue_1.6.2         proxy_0.4-26      
[31] compiler_4.2.0     pillar_1.7.0       generics_0.1.2     classInt_0.4-3     pkgconfig_2.0.3

cxy_geocode function no longer returning county names when output field specified as "full"

Describe the bug
I've been using the cxy_geocode function for ~4 months and have just encountered a bug that I haven't run into before. When entering in a given address (including state, zip, city, etc.), the function is still returning the coordinates of matches, but has stopped returning the counties and census tracts of those matches. This is a recent phenomenon as the function was previously returning counties when the output field was specified as "full".

Expected behavior
Previously the cxy_geocode function would return the coordinates and county names of given addresses it was able to match, now it only returns coordinates of matches and returns "null" for county name (as well as census tract).

To Reproduce
library(censusxy)
sample_df = data.frame(street_name = "241 Cary Ave", state_name = "IL", zip_code = 60035, city_name = "Highland Park")
output <- cxy_geocode(sample_df, street="street_name", state = "state_name", zip="zip_code", city = "city_name", return = "geographies", benchmark = "Public_AR_Census2020", vintage = "910", output = "full")

Desktop (please complete the following information):

  • OS: Windows 10
  • R Version 3.6.1

GitHub Actions

Tasks:

  1. Set-up CI build on GitHub actions
  2. Address build errors
  3. Add pkgdown GitHub action for site build
  4. README.md and repo housekeeping (remove old CI files, update badges)
  5. Ensure code coverage is working correctly

Track + cache progress

I have just stumbled on this package and it's brilliant! Huge congrats and appreciation to the authors.

I'm currently in the process of re-writing an old project into R, and have already integrated cxy_geocode into my workflow. I have two small ideas / suggestions for new features:

  1. It'd be great to know how many addresses have been geocoded already, either as a count or as a % of all addresses. I need to geocode approx. 400k addresses and have set the timeout value to something very large, so it should be fine, but it'd be good to know where I am in the process.
  2. It'd also be great if there was a cache feature. I ran into the same problem with my original code for this project (written in Python) using censusgeocode. Like you, I was able to overcome the batch limit of a single call and parallelize, but I also cached progress just in case the API call breaks. See here for the Python code on this in case that's of interest/use.

I would write a PR for these suggestions but I am sadly still woefully inept in R. Appreciate they're probably pretty low down on your priority list but just wanted to throw them out there! Many thanks again @chris-prener
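
Both suggestions can be approximated outside the package. The following is a minimal sketch, not a censusxy feature: `geocode_with_cache()` is a hypothetical helper that splits the data into batches, saves each batch result to an `.rds` file, skips batches whose file already exists, and reports progress with `message()`. The `geocode_fun` argument stands in for whatever geocoding call is used; here it is replaced by a dummy function so the example runs offline.

```r
# Hypothetical helper: batch, cache each batch on disk, report progress.
geocode_with_cache <- function(data, geocode_fun, batch_size, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  batch_id <- ceiling(seq_len(nrow(data)) / batch_size)
  batches <- split(data, batch_id)
  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    cache_file <- file.path(cache_dir, sprintf("batch_%05d.rds", i))
    if (file.exists(cache_file)) {
      results[[i]] <- readRDS(cache_file)  # reuse work from an earlier run
    } else {
      results[[i]] <- geocode_fun(batches[[i]])
      saveRDS(results[[i]], cache_file)
    }
    done <- sum(vapply(batches[seq_len(i)], nrow, integer(1)))
    message(sprintf("%d of %d addresses done (%.0f%%)",
                    done, nrow(data), 100 * done / nrow(data)))
  }
  do.call(rbind, results)
}

# Dummy geocoder for illustration only: adds a placeholder status column.
fake_geocoder <- function(batch) {
  batch$status <- "done"
  batch
}

addresses <- data.frame(id = 1:5, street = paste(1:5, "Main St"),
                        stringsAsFactors = FALSE)
cache_dir <- file.path(tempdir(), "geocode_cache")
first_run <- geocode_with_cache(addresses, fake_geocoder, 2, cache_dir)
second_run <- geocode_with_cache(addresses, fake_geocoder, 2, cache_dir)
```

The second call reads every batch from disk instead of calling the geocoder again, so an interrupted run can be restarted without repeating finished batches. The cache is keyed only by batch position, so it is valid only while the input data and batch size stay the same.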

Inconsistent geocoding (single vs batch)

I've run into an issue with the package. I have a large dataset to geocode. When using cxy_geocode, I end up with many NA for location. But when I try cxy_single for a random sample from those addresses, locations are returned. Any idea why this might be?

Here's one example of an address where this is happening.

g<-cxy_single('3256 n halsted st','chicago','il',return = 'geographies', vintage = 'Current_Current')

Issues geocoding large datasets

I am trying to geocode several thousand records (about 75,000 addresses or more). I get error messages, and they are very difficult to debug. For example:
Error in eval_tidy(enquo(var), var_env) : object 'V2' not found

When the error occurs, the geocoding fails and so does the variable assignment. In other words, the hour or two it spent geocoding was useless because I can't retain the results, nor can I determine which records might be giving the geocoder issues.

Any thoughts on this issue and how to resolve it?
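
One way to avoid losing hours of work to a single failure, sketched here with base R only, is to geocode in batches and wrap each batch in `tryCatch()`, keeping the successful batches and recording which ones failed. `geocode_in_batches()` and its `geocode_fun` argument are hypothetical; the dummy geocoder below fails on purpose for one batch so the behaviour can be seen without network access.

```r
# Hypothetical helper: isolate failures to the batch they occur in.
geocode_in_batches <- function(data, geocode_fun, batch_size) {
  batch_id <- ceiling(seq_len(nrow(data)) / batch_size)
  batches <- split(data, batch_id)
  succeeded <- list()
  failed <- list()
  for (name in names(batches)) {
    # Capture the error object instead of letting it abort the loop
    outcome <- tryCatch(geocode_fun(batches[[name]]), error = function(e) e)
    if (inherits(outcome, "error")) {
      failed[[name]] <- batches[[name]]
      message("Batch ", name, " failed: ", conditionMessage(outcome))
    } else {
      succeeded[[name]] <- outcome
    }
  }
  list(results = do.call(rbind, unname(succeeded)),
       failed = do.call(rbind, unname(failed)))
}

# Dummy geocoder for illustration only: errors on missing street values.
picky_geocoder <- function(batch) {
  if (any(is.na(batch$street))) stop("malformed input")
  batch$status <- "done"
  batch
}

addresses <- data.frame(
  id = 1:6,
  street = c("1 A St", "2 B St", NA, "4 D St", "5 E St", "6 F St"),
  stringsAsFactors = FALSE
)
out <- geocode_in_batches(addresses, picky_geocoder, batch_size = 2)
```

`out$results` holds the rows that were processed and `out$failed` holds the input rows of the batch that errored, which narrows down where the problematic records are.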

Cannot Get Sample to Run

When running sample directly from "cxy_geocode" documentation

x <- stl_homicides[1:10,]

# geocode
cxy_geocode(x, street = 'street_address', city = 'city', state = 'state', zip = 'postal_code',
return = 'locations', class = 'dataframe', output = 'simple')

I get this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to 127.0.0.1 port 1080: Connection refused

Update Docs/News with Parallel Support

@chris-prener when you have a moment will you update the news and everything to censusxy 1.0.2 with "Added parallel support on Windows"

Maybe wait a bit before we resubmit to CRAN, let any extraneous bugs pop up?

Thanks!
