briatte / dsr
Introduction to Data Science with R (Sciences Po, Paris, 2023)
Home Page: https://f.briatte.org/teaching/syllabus-dsr.pdf
Move this document to the wiki, and include a link in the 'Help' section of exercises, with a short paragraph on getting help and avoiding plagiarism when coding.
https://github.com/briatte/dsr/blob/master/dsr-03/02-eu-mood/eu-mood.r
Look at student feedback -- 'confusing wording' and all.
europe dummy, minus Albania, Bosnia, Croatia, Kosovo. See #27.
See #28.
… when to have that?
Shown both in dsr-03/01-covid-income and in dsr-04/01-debt. The first one needs to be properly documented, whereas the second one can probably go.
This one is complex enough to be its own issue…
https://www.europeansocialsurvey.org/methodology/ess_methodology/data_processing_archiving/weighting.html
https://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data_1_1.pdf
From the weighting guide, v1.1 (2020), page 7:
From round 9 onwards, all the necessary sample design indicators and weights are already included in the integrated (second release) data file, but if you are working with data from earlier rounds you will first need to merge the sample design indicators on to the main data file. For rounds 7 and 8, the sample design indicators are in the integrated SDDF (sample design data file), so you need to merge this file with the main integrated (questionnaire data) file. For rounds 1 to 6, sample design indicators are stored in a separate file for each country (and files are missing for some countries in some rounds), so you would need to merge several files. Furthermore, for these rounds the indicators psu and stratify have not been recoded in a manner suitable for cross-country analysis, so you will need to do this if you are analysing data from more than one country. Follow the guidance in section 2 of Kaminska & Lynn (2017) and ensure that each value is exclusive to one country.
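A minimal sketch of that rounds-7/8 merge, with toy data frames standing in for the real integrated and SDDF files (values and contents are made up): the point is that respondents must be matched on both cntry and idno.

```r
# Toy stand-ins for the ESS questionnaire file and the SDDF: the sample
# design indicators (psu, stratify, prob) live in the second file only.
main <- data.frame(cntry = c("FR", "FR", "GB"),
                   idno  = c(1, 2, 1),
                   vote  = c(1, 2, 1))
sddf <- data.frame(cntry    = c("FR", "FR", "GB"),
                   idno     = c(1, 2, 1),
                   psu      = c(101, 101, 201),
                   stratify = c(1, 1, 2),
                   prob     = c(0.002, 0.002, 0.001))

# idno alone is not unique across countries, so merge on both keys
ess <- merge(main, sddf, by = c("cntry", "idno"))
```

With real data, haven::read_dta() (or read_sav()) would read the two files before the merge.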
The guide asks for the creation of anweight ('analytical weights') from the following variables:
# R, data.table syntax
data1[, anweight := pspwght * pweight * 10e3]
# Stata
# gen anweight = pspwght * pweight * 10e3
Once anweight exists, the weighting guide instructs the following design:
# R
svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight, data = data1)
# Stata
# svyset psu [pweight=anweight], strata(stratum)
Quoting again from the weighting guide:
It is constructed by first deriving the design weight, then applying a post-stratification adjustment, and then a population size adjustment. Further details of how the weights are derived are documented in the round-specific report on the production of weights. Starting from Round 9, anweight is provided for you in the integrated data file. If you are using data from earlier ESS rounds, you can derive anweight yourself.
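The derivation described in the quote can be sketched in a couple of lines; the weight values below are made up toy numbers, and the 10e3 scaling factor matches the data.table snippet earlier.

```r
# Deriving anweight by hand for pre-round-9 data: pspwght already folds in
# the design weight, so the product with pweight (scaled by 10e3) suffices.
pspwght <- c(0.9, 1.1, 1.0)   # toy post-stratification weights
pweight <- c(0.5, 0.5, 0.5)   # toy population size weights

anweight <- pspwght * pweight * 10e3
```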
Full range of weighting variables, quoted from ESS9 codebook:
idno - Respondent's identification number
cntry - Country
dweight - Design weight
pspwght - Post-stratification weight including design weight
pweight - Population size weight (must be combined with dweight or pspwght)
anweight - Analysis weight
prob - Sampling probability
stratum - Sampling stratum
psu - Primary sampling unit

Notes:

pspwght includes dweight
anweight is just the product of pspwght and pweight
prob
InductiveStep/R-notes#1
ropensci/essurvey#39
ropensci/essurvey#9 (comment)
Second link right above recommends the following for ESS4:
svydesign(
ids = ~ psu + idno, # further comment at the link: specifying just `psu` would be enough
strata = ~ stratify,
weights = ~ dweight,
nest = TRUE,
data = ess4gb
)
Intermediate Quantitative Social Research, Birkbeck, University of London (2017-2020)
https://inductivestep.github.io/R-notes/complex-surveys.html
Working on a multi-country example:
# using srvyr
as_survey_design(
ids = idno, # instead of `psu` or `psu + idno` because `psu` is not in ESS9?
strata = cntry,
nest = TRUE,
weights = pspwght
)
From the text:
The nest option takes account of the ids being nested within strata: in other words, the same ID is used more than once across the dataset but only once in a country.
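The nesting point in miniature, with toy ids and base R only: idno repeats across the file but is unique within a country, which is exactly the situation that nest = TRUE handles.

```r
# Same respondent ids reused in two countries: duplicated across the file,
# unique once crossed with the stratum (country) variable.
idno  <- c(1, 2, 1, 2)
cntry <- c("FR", "FR", "GB", "GB")

any(duplicated(idno))                      # ids repeat across countries
any(duplicated(interaction(cntry, idno)))  # but not within one country
```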
Introduction to Survey Statistics, University of Heidelberg, 2018
https://federicovegetti.github.io/teaching/heidelberg_2018/lab/sst_lab_day2.html
When working on countries separately:
# using srvyr
as_survey_design(weights = c(dweight, pspwght)) %>%
group_by(cntry) %>%
# etc.
# ... doesn't pspwght include dweight?
# ... what about stratum? psu?
When working on all countries together:
# using srvyr
as_survey(weights = c(dweight, pspwght, pweight))
http://asdfree.com/european-social-survey-ess.html
Working on a single country (Belgium) after merging the data to the SDDF file:
svydesign(
ids = ~psu ,
strata = ~stratify,
probs = ~prob,
data = ess_df
)
https://vincentarelbundock.github.io/WDI/ by @vincentarelbundock (current choice)
https://github.com/gshs-ornl/wbstats by @jpiburn (more dependencies)
https://github.com/jpiburn/wbdata also by @jpiburn (unreleased?)
Does one of them also allow downloading WB shapefiles?
https://iqss.github.io/dss-workshops/
… mention in first session/slides?
Had forgotten about this…
https://www.r-bloggers.com/2014/12/cartography-with-complex-survey-data/
And especially:
https://www.r-bloggers.com/2014/12/maps-and-the-art-of-survey-weighted-maintenance/
Perhaps the code can be adapted to run again? Would have to be quick enough to be run in class, though.
Uses kriging with fields.
https://github.com/davidbrae/swmap/blob/master/how%20to%20map%20the%20european%20social%20survey.R
Uses KDE with prevR, by Joseph: see the intro vignette.
http://asdfree.com/demographic-and-health-surveys-dhs.html
https://github.com/davidbrae/swmap/blob/master/how%20to%20map%20the%20demographic%20and%20health%20surveys.R
… and the list goes on: see my blog post on the issue (kind of).
Some folks are putting ideas together on the topic.
U.S. spatial data: https://github.com/martinjhnhadley/statesRcontiguous
rlist, by @renkun-ken, looks like a brilliant way to manipulate lists. My only regret is that the functions are named list.verb instead of list_verb.
Suggestion: use it if lapply usage becomes cryptic.
pipeR looks cool too, even though I would prefer to keep things simple and stick with just %>%.
For an advanced version of the course, some notes on the stuff that was excluded due to time constraints:

dplyr 1.1.0 functions

So, basically, this advanced course could be, over 12 sessions:
(Still no space for JS libraries, fair enough.)
This neglects PCA-style stuff and ML, which should probably be its own project, centered on tidymodels, covering random forests, gradient-boosted trees, etc.
dsr-03/01-covid-income needs more structure, and some steps need to be fully documented, e.g.:

locale()
as.Date()
skip in read_excel (dsr-04/01-debt)

Also: str, head and tail, View and glimpse.

README
memberstates::memberstates$oecd$iso3c
ggsave at end

Current size of slide sets (cap at 25, except Week 1). Revise readings, practice sessions and exercises, and include screenshots of videos when relevant.
DSR-outline-2.txt to GitHub README and emails

EMSS-emails-2013
QUANTI1-2020
QUANTI2-2019
emails/ folder?

Make better use of great tutorials:
Finish digging into (and reorganising…) those:

QUANTI1-courses folder
QUANTI1-handbooks folder
QUANTI2-courses folder
QUANTI2-handbooks folder
DSR-handbooks folder
DSR-courses folder

Possible additions for the wiki:
Amelia McNamara, Nick Horton, "Wrangling categorical data in R," citing her website:
Wrangling categorical data in R, a paper co-authored with Nick Horton. This paper describes some common mistakes data analysts make when working with categorical data (factors) in R. The paper was published jointly in The American Statistician, Vol. 72, Issue 1 and as a pre-print in the Practical Data Science for Stats collection on PeerJ.
Finalize the list:
Handbooks:
… with survey and srvyr
Also show:
https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/
Code:
https://github.com/mlaviolet/Trends/blob/master/NHIS/nhis.R
http://asdfree.com/national-health-interview-survey-nhis.html
Working backwards…
QUANTI1 (also add handbooks):
QUANTI2:
In Sessions 6, 7 and 8, possibly 9?
https://easystats.github.io/parameters
https://easystats.github.io/parameters/articles/standardize_parameters_effsize.html
… to replace protein consumption in Session 11?
… to combine it with correlation analysis?
https://drsimonj.svbtle.com/how-to-create-correlation-network-plots-with-corrr-and-ggraph
https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/
Hello there,
I tried running this code to pull the most recent CPJ data into R, but the link seemingly does not work, and I can't find the CPJ database that goes up to 2017 in Excel. Would you happen to have that dataset, to update the link and/or code?
u = "https://www.cpj.org/killed/cpj-database.xls"
f = basename(u) # cpj-database.xls
if (!file.exists(f))
download.file(u, f, mode = "wb")
Thanks for taking the time to look at this pull-request.
http://asdfree.com/american-national-election-study-anes.html
http://asdfree.com/european-social-survey-ess.html
http://asdfree.com/general-social-survey-gss.html
http://asdfree.com/survey-of-health-ageing-and-retirement-in-europe-share.html
http://asdfree.com/world-values-survey-wvs.html
Also add: http://faculty.missouri.edu/williamslaro/mipdata.html
Email from Laron K. Williams to EPSA mailing-list below.
Hi everyone.
It is my pleasure to announce the release of the 'Most Important Problem' Dataset (MIPD). It is the result of a collaborative project on Americans' issue attentiveness with two Ph.D. students at the University of Missouri (Colton Heffington and Brandon Beomseob Park).
The MIPD collects all available individual-level responses to the 'most important problem facing this country' question in America from 1939 to June 2015. In addition to providing a sense of how individuals prioritize problems, the MIPD paints a picture of Americans' evolving issue attention over time. There are two datasets connected to the MIPD:
a) MIPD: This dataset contains individual-level responses (from over 670 surveys totaling almost a million respondents) to the MIP question, in addition to demographics, partisan preferences (including vote intention, previous vote, ideology and party identification), approval (general approval and approval of specific policies), economic evaluations (both retrospective and prospective, personal and national), and party competency (best party to address MIP, best party to address specific problems). We code the MIP responses into three coding schemes to ease the incorporation of other datasets: manifesto research group, the comparative agendas project, and Matt Singer's (2011) coding scheme. In addition to being the largest possible data collection of individual-level responses dealing with issue attention, we could certainly envision other uses of the dataset ranging from studies of vote intention, party competency, and partisan rationalization.
b) MIPD Aggregate: This dataset contains the aggregate percentage (weighted by population weights, if available) of respondents identifying that problem as the MIP. These aggregate data are available monthly, quarterly, and annually. This dataset would be a valuable component in an overall portrait of dynamic representation, and in observing how public opinion shifts in response to changing domestic and foreign circumstances.
The data are available from my website (http://faculty.missouri.edu/williamslaro/mipdata.html) or from my Dataverse (https://dataverse.harvard.edu/dataverse/laronwilliams).
If you use it, please cite: Colton Heffington, Brandon Beomseob Park and Laron K. Williams (forthcoming). "The 'Most Important Problem' Dataset (MIPD): A New Dataset on American Issue Importance", Conflict Management and Peace Science.
Thanks in advance. Feel free to contact me if you have any questions!
Laron
Other surveys of interest
https://github.com/tidy-survey-r/tidy-survey-book
https://tidy-survey-r.github.io/tidy-survey-book/
https://ec.europa.eu/eurostat/web/lfs
https://ec.europa.eu/eurostat/web/microdata/european-union-labour-force-survey
https://www.gesis.org/en/missy/metadata/EU-LFS/
https://ec.europa.eu/eurostat/web/microdata/public-microdata/labour-force-survey
ILO ones: https://www.ilo.org/surveyLib/index.php/collections/LFS
dsr-03/01-covid-income
dsr-04/01-debt)
README)
ggsave at end
dsr-03/02-eu-mood: countrycode dependency?
dsr-04/01-debt: README accordingly
dsr-04/03-anscombe: lattice?
… as 'bonus' demos?
https://github.com/briatte/ida
Maps (covered by #34): akima map examples?
Web scraping:
Text:
Networks:
Time series:
Reproducible/open science:
Everything below relates to discussions and experiments that happened prior to sf.

sfheaders and/or silicate, as per the author)
sf)
tmap
cartography
mapsf)

Related to an old blog post and to this old issue: #1
For the QUANTI students who ask for it. Make it a section…
https://juba.github.io/tidyverse/
https://larmarange.github.io/analyse-R/
https://larmarange.github.io/guide-R/
Videos:
https://larmarange.github.io/webin-R/
https://larmarange.github.io/webin-R/seances.html
Workload of the course (min. 2, max. 6, recommended: 4):

moments and e1071
e1071 and its alternative measures?
ggeffects or marginaleffects -- or both? use in more than one session?
car with anything that plots scatterplot matrices?
corrr entirely?
ggcorrplot entirely? (or find a package that does the same, plus scatterplot matrices?)
factoextra and ggfortify
plotly? e.g. https://plotly-r.com/d-charts.html -- or remove entirely
sf earlier and have a map in Session 3
igraph and ggraph?
modelsummary (and possibly texreg)?
pdftools optional (only recommend it in the slides?)
rvest?
WDI, but use it more than once? -- see #21
exercise-04 to use WDI but save the data
life-expectancy to use WDI and save the data

Checks:
countrycode is used in multiple sessions
broom is used in multiple sessions
ggmosaic is used in multiple sessions
ggrepel is used in multiple sessions
performance is used in multiple sessions
texreg is used in multiple sessions

https://github.com/briatte/dsr/blob/master/dsr-04/01-debt/debt.r
Right now, it's an example of data wrangling from QUANTI2-2023.
README accordingly

For the session on correlation:
https://ggforce.data-imaginist.com/reference/geom_autopoint.html
https://twitter.com/alexandreafonso/status/1658451225174753280
https://europeangovernanceandpolitics.eui.eu/eui-yougov-solidarity-in-europe-project/
https://cadmus.eui.eu/handle/1814/72778
d <- "2020 SiE year dataset/2020 SiE dataset_dataset.csv" %>%
readr::read_csv()
count(d, qcountry)
cty <- c("United Kingdom", "Denmark", "Finland", "France", "Germany", "Sweden", "Greece", "Hungary", "Italy", "Lithuania", "Netherlands", "Poland", "Romania", "Spain")
d$qcountry <- factor(d$qcountry, labels = cty)
summary(d$weight)
str_subset(names(d), "q20")
d <- mutate_at(d, vars(starts_with("q20")), ~ case_when(.x == 1 ~ 1L, .x == 2 ~ 0L, .x == 3 ~ NA_integer_))
# Austria
count(d, qcountry, q20_1)
tapply(d$q20_1, d$qcountry, mean, na.rm = TRUE)
# TODO: survey weights
# TODO: make dyadic version
Possibly present as a ‘freestyle’ exercise -- here's the data, do whatever you like with it. In this case, provide the data as a truly dyadic dataset, with q20_x merged into a single item, and additional columns corresponding to the model in the paper.
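One possible sketch of that dyadic reshape, with a toy data frame mimicking the q20_x columns (names and values are made up): each row of the result is one respondent-target pair. tidyr::pivot_longer would do the same more readably.

```r
# Pivot the q20_x items (one column per target country) into one row per
# respondent-target pair, i.e. a dyadic layout.
d <- data.frame(id = 1:2, qcountry = c("France", "Spain"),
                q20_1 = c(1L, 0L), q20_2 = c(0L, NA))

dyads <- reshape(d, direction = "long",
                 varying = list(c("q20_1", "q20_2")),
                 timevar = "target", v.names = "q20",
                 idvar = "id")
```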
Not currently covered; might be interesting to cover using https://vincentarelbundock.github.io/marginaleffects/
Do what the README of exercise-04-answer (not in the repo) recommends doing: below the actual answer (Step 2 of the preston-curve-answer.r script), show a detailed breakdown of the steps.
Some of them probably breach idiotic copyrights, and the links in the README files should do. Not including the docs is in fact a good incentive to go and read those READMEs in the first place.
Possible reorganisation:
Might be too complex, and having it in Session 3 makes sense if the code is updated to show multiple graph engines (base, lattice, ggplot2).