theonaunheim / surgeo
Open Source Proxy Demographic module written in Python
License: MIT License
Build out test_runner.py script for TravisCI usage.
Hey - great idea and implementation, thanks for putting this together!
I'm noticing that when I get probabilities with the BIFSG model, I get null probabilities if any one of my input features is either null or doesn't show up in the census data. It would be great if there were an option to override the null probabilities that get introduced in these intermediate steps.
Example: if the ZCTA is absent but the first name and surname are present in the census data, then before we combine the probabilities, we fill the null zip code data with the population-level statistics and calculate the combined probability from that (alternatively, we could just leave it out of the calculation; I'm not sure which is preferable). Perhaps a 'backfill with aggregate statistics' flag parameter for each of the components would be good.
The output currently looks like this, but we could definitely eke out some information here instead of leaving it null:
   zcta5  first_name  surname  white     black     api       native    multiple  hispanic
0  90210  RANDALL     ZZZZZZ   NaN       NaN       NaN       NaN       NaN       NaN
1  90210  QQQQQQ      AARON    NaN       NaN       NaN       NaN       NaN       NaN
2  99999  RANDALL     AARON    NaN       NaN       NaN       NaN       NaN       NaN
3  90210  RANDALL     AARON    0.972583  0.004928  0.000934  0.000053  0.020869  0.000633
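The backfill idea above could be sketched roughly as follows. This is not surgeo's code: the function name `combine_probabilities`, the `POPULATION_PRIOR` table, and its numbers are all hypothetical placeholders for illustration. The sketch assumes the usual BIFSG-style combination (race given surname, times first name given race, times geography given race, then normalized), falls back to a population-level prior when the surname lookup fails, and simply skips a first-name or geography factor that is missing.

```python
RACES = ("white", "black", "api", "native", "multiple", "hispanic")

# Placeholder shares for illustration only; these are NOT census figures.
POPULATION_PRIOR = {
    "white": 0.60, "black": 0.12, "api": 0.06,
    "native": 0.01, "multiple": 0.03, "hispanic": 0.18,
}


def combine_probabilities(race_given_surname, first_given_race, geo_given_race,
                          prior=POPULATION_PRIOR):
    """Combine component probabilities, tolerating missing components.

    Each argument is a dict mapping race -> probability, or None when the
    lookup failed (name or ZCTA not found in the reference data).
    """
    # Backfill: with no surname information, start from the aggregate prior.
    base = race_given_surname if race_given_surname is not None else prior
    scores = {race: base[race] for race in RACES}

    for likelihood in (first_given_race, geo_given_race):
        if likelihood is None:
            continue  # a missing component contributes no information
        for race in RACES:
            scores[race] *= likelihood[race]

    total = sum(scores.values())
    if total == 0:
        return {race: float("nan") for race in RACES}
    return {race: scores[race] / total for race in RACES}
```

Skipping a missing likelihood is the "just leave it out" option; backfilling the surname term with aggregate shares is the "population-level statistics" option. Whether either is statistically preferable is the open question raised in this issue.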
Including:
Zip and surname get overwritten with the error value. Everything except zip and surname should be set to error.
Currently returns NaN because it is normalized to "ALLOTHERNAMES".
Hey, thanks for putting this together! Any plans to modify this to work at the census block group or tract level? Thanks!
Would there be any interest in attempting to implement the improved BIFSG model that includes first name data as well?
See https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012
While the overall magnitude of the improvement associated with BIFSG is somewhat modest, the largest improvements occur for NH Blacks, which is the group for which BISG is least accurate. Moreover, the improvement for NH Blacks is much higher where geography has low ability to distinguish NH Blacks. This aspect is particularly important as much of the research on the topic of racial/ethnic differences focuses on specific geographic areas rather than the entire United States. It is also worthwhile to note that the improvements of BIFSG over BISG are generally comparable to the improvements of BISG over simpler methods. Last but not least, when assessing the degree of improvement from BIFSG, one should consider that even the most advanced methods are likely to result in incremental improvements for Hispanics and NH Asians, given that surnames alone are highly predictive for these particular groups.
I wouldn't mind submitting a PR on this if there is interest.
Like it says.
Now that surgeo has both first name and surname models, it makes sense to disambiguate between these names in the data and source code. Every variable/column that is specific to a surname should be styled "surname" and every one that is a first name should be styled "first_name".
Now that surgeo has both BISG and BIFSG models, the class of SurgeoModel should become BISGModel for consistency.
Misapplied to surnames rather than "other race". Minor skew, but requires fix.
Need to create executable and setup.py
The CFPB Model uses 2010 geocode data. This uses 2000. Minimal skew, but requires update.
Hi, I've been trying to reconcile how to switch between the two files mentioned in the title. Can you confirm that the formulation used in your implementation, using the AARON/white entry as an example, is
prob(first_name = AARON | race = WHITE) = prob(race = WHITE | first_name = AARON) / sum_i[prob(race = WHITE | first_name = i)]?
I was able to replicate moving from one file to the other using the above formula, and wanted to make sure it is consistent with what you did.
The reason I'm asking is that the Harvard file provides the number of observations for each first name, and when I used that in my own calculations I arrived at different probabilities. In particular, my formulation is
obs(first name = AARON)*prob(race = WHITE | first name = AARON) / sum[obs(first name = i) * prob(race = WHITE | first name = i)].
Have you considered this alternative formulation that includes the observation count information? If the choice not to use the observation counts is deliberate, I would love to learn the rationale behind the decision.
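To make the difference between the two formulations concrete, here is a small sketch. The function names and the toy numbers are made up for illustration and are not taken from either file. By Bayes' rule, prob(name | race) is proportional to prob(race | name) * prob(name), so the count-weighted version is the Bayes form when prob(name) is estimated from the observation counts; the unweighted version matches it only when every name is equally frequent.

```python
def name_given_race_unweighted(prob_race_given_name, race):
    """Normalize prob(race | name) across names, ignoring name frequency."""
    total = sum(probs[race] for probs in prob_race_given_name.values())
    return {name: probs[race] / total
            for name, probs in prob_race_given_name.items()}


def name_given_race_weighted(prob_race_given_name, observations, race):
    """Weight prob(race | name) by the observation count of each name."""
    weighted = {name: observations[name] * probs[race]
                for name, probs in prob_race_given_name.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Toy inputs, invented for this example only.
toy_probs = {"AARON": {"white": 0.9}, "MARIA": {"white": 0.3}}
toy_obs = {"AARON": 100, "MARIA": 300}

unweighted = name_given_race_unweighted(toy_probs, "white")
weighted = name_given_race_weighted(toy_probs, toy_obs, "white")
```

With these toy inputs the unweighted value for AARON is 0.9 / 1.2 = 0.75, while the weighted value is 90 / 180 = 0.5, so the two approaches can diverge noticeably when name frequencies differ.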
People are more likely to have forename and surnames together than they are to have surname and geo data together.
Since the data is already there, examine the possibility of updating the surname probability with the forename.
Tkinter-based
Hi all! This is an awesome tool, thanks for building this.
Now that 2020 Census data is available, is it possible to update the data this pulls from? I'm happy to help in any way, including data cleaning and making it an optional keyword to prevent people from having their surgeo predictions change unexpectedly.
Any information you have about where you sourced the data/any special data cleaning you needed to format it would be helpful, and I can open a pull request with full test coverage as well.
Appears to be those that cross state lines. Examples: 69201, 51360, 59270
In the "TRACT" version of the surgeo model, the official documentation calls for the probability of tract given race, but the model code calls "_get_prob_race_given_tract()". The variable name is right, but the data assigned to it is wrong.
    self.geo_level = geo_level.upper()
    if geo_level == "TRACT":
        self._PROB_GEO_GIVEN_RACE = self._get_prob_race_given_tract()
    else:
        self._PROB_GEO_GIVEN_RACE = self._get_prob_zcta_given_race()
For ZCTA it is right, but for tract the wrong probability file is pulled.
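To illustrate why the two tables are not interchangeable, here is a minimal sketch that derives both from the same population counts. The counts and function names are invented for this example and say nothing about how surgeo stores its data. Normalizing each tract's row gives the probability of race given tract; normalizing each race's column gives the probability of tract given race.

```python
# Invented counts: tract -> race -> number of residents.
counts = {
    "tract_a": {"white": 80, "black": 20},
    "tract_b": {"white": 120, "black": 180},
}


def prob_race_given_tract(counts):
    """Row-normalize: within each tract, shares across races sum to 1."""
    return {tract: {race: n / sum(row.values()) for race, n in row.items()}
            for tract, row in counts.items()}


def prob_tract_given_race(counts):
    """Column-normalize: within each race, shares across tracts sum to 1."""
    races = list(next(iter(counts.values())))
    totals = {race: sum(row[race] for row in counts.values()) for race in races}
    return {tract: {race: row[race] / totals[race] for race in races}
            for tract, row in counts.items()}


race_given_tract = prob_race_given_tract(counts)
tract_given_race = prob_tract_given_race(counts)
```

Here the probability of white given tract_a is 80 / 100 = 0.8, while the probability of tract_a given white is 80 / 200 = 0.4, so substituting one table for the other changes the result.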