
dsr's Issues

Practical heuristics

Move this document to the wiki, and include a link in the 'Help' section of exercises, with a short paragraph on getting help and avoiding plagiarism when coding.

Rework Exercise 11

Look at student feedback -- 'confusing wording' and all.

  • include warning about exercise being deliberately more vague than previous ones
  • make all questions MCQs -- easier to grade, less vague!
  • clarify scenario (not a replication of the article, just inspired by it)

TODO items from class debriefs

Ideas for more graded exercises

See #27

More (bonus) examples

See #28

Teaching resources

https://education.rstudio.com/teach/

Ideas for (graded?) exercises

  • There are sessions without exercises: Week 6 (association), and Week 11 (classification).
  • The exercise for Week 10 (logit) is a follow-up to Week 8 (linear regression), but leave it that way.
  • The exercise for Week 12 is a no-code exercise.

1. 'secret weapon' on births and female education (see the sketch after this list)

  • have students start with average over years
  • use QOG, or ask students to assemble the data themselves
  • use a t-test instead of linear regression, and make it the exercise for Week 6?
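A minimal 'secret weapon' sketch of what the exercise could look like: the same model estimated separately for each year, with coefficients collected via broom. The data frame d and the births and educ columns are hypothetical placeholders.

# 'secret weapon' sketch: one regression per year, coefficients plotted over time
# (`d`, `births` and `educ` are hypothetical names)
library(tidyverse)
library(broom)

d %>%
  group_by(year) %>%
  group_modify(~ tidy(lm(births ~ educ, data = .x), conf.int = TRUE)) %>%
  filter(term == "educ") %>%
  ggplot(aes(year, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange()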

2. regression on survey + PCA on aggregated vars

  • Use regional data?
  • Make it exercise for Week 11?

3. complex data wrangling with EES 2019 / PartyFacts

  • aggregate country/party-level answers
  • complex merge of DTA/CSV (CSV has party names)
  • add a merge to e.g. PartyFacts?
  • compute PTV by party family

… when to have that?

Merging datasets

Shown both in dsr-03/01-covid-income and in dsr-04/01-debt. The first one needs to be properly documented, whereas the second one can probably go.

Surveys - ESS

This one is complex enough to be its own issue…

Weighting guide

https://www.europeansocialsurvey.org/methodology/ess_methodology/data_processing_archiving/weighting.html
https://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data_1_1.pdf

From the weighting guide, v1.1 (2020), page 7:

From round 9 onwards, all the necessary sample design indicators and weights are already included in the integrated (second release) data file, but if you are working with data from earlier rounds you will first need to merge the sample design indicators on to the main data file. For rounds 7 and 8, the sample design indicators are in the integrated SDDF (sample design data file), so you need to merge this file with the main integrated (questionnaire data) file. For rounds 1 to 6, sample design indicators are stored in a separate file for each country (and files are missing for some countries in some rounds), so you would need to merge several files. Furthermore, for these rounds the indicators psu and stratify have not been recoded in a manner suitable for cross-country analysis, so you will need to do this if you are analysing data from more than one country. Follow the guidance in section 2 of Kaminska & Lynn (2017) and ensure that each value is exclusive to one country.
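For rounds 7 and 8, that merge boils down to a single join on the country and respondent identifiers; a minimal sketch, assuming the main file and the SDDF have been read into ess_main and ess_sddf (hypothetical names):

# sketch: merge the SDDF (sample design) file onto the main questionnaire file
# (`ess_main` and `ess_sddf` are hypothetical object names; both files share
# the `cntry` and `idno` identifiers)
library(dplyr)
ess <- left_join(ess_main, ess_sddf, by = c("cntry", "idno"))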

The guide asks for the creation of anweight ('analytical weights') from the following variables:

# R, data.table syntax
library(data.table)
data1[, anweight := pspwght * pweight * 10e3]
# Stata
# gen anweight = pspwght*pweight*10e3
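Since the rest of the course leans on the tidyverse, a dplyr equivalent of the same construction (a sketch, using the same data1 data frame):

# R, dplyr syntax (sketch of the same operation)
library(dplyr)
data1 <- mutate(data1, anweight = pspwght * pweight * 10e3)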

Once anweight exists, the weighting guide instructs declaring the following survey design:

# R, survey package
library(survey)
svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight, data = data1)
# Stata
# svyset psu [pweight=anweight], strata(stratum)

Details on analytical weights (ESS9+)

Quoting again from the weighting guide:

It is constructed by first deriving the design weight, then applying a post-stratification adjustment, and then a population size adjustment. Further details of how the weights are derived are documented in the round-specific report on the production of weights. Starting from Round 9, anweight is provided for you in the integrated data file. If you are using data from earlier ESS rounds, you can derive anweight yourself.

Full range of weighting variables, quoted from ESS9 codebook:

  • idno - Respondent's identification number
  • cntry - Country
  • dweight - Design weight
  • pspwght - Post-stratification weight including design weight
  • pweight - Population size weight (must be combined with dweight or pspwght)
  • anweight - Analysis weight
  • prob - Sampling probability
  • stratum - Sampling stratum
  • psu - Primary sampling unit

Notes:

  • pspwght includes dweight
  • anweight is just the product of pspwght and pweight (up to the constant 10e3 scaling factor, which does not affect weighted estimates)
  • no obvious use for prob
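A quick sanity check of the first two notes (a sketch; ess9 is a hypothetical name for the integrated ESS9 file loaded into R, and the 10e3 factor is the constant scaling from the guide):

# sketch: check the relationships between the ESS9 weighting variables
all.equal(ess9$anweight, ess9$pspwght * ess9$pweight * 10e3)
summary(ess9$pspwght / ess9$dweight) # post-stratification adjustment applied on top of the design weight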

Discussions

InductiveStep/R-notes#1
ropensci/essurvey#39
ropensci/essurvey#9 (comment)

The second link above recommends the following for ESS4:

svydesign(
  ids = ~ psu + idno, # further comment at the link: specifying just `psu` would be enough
  strata = ~ stratify,
  weights = ~ dweight,
  nest = TRUE,
  data = ess4gb
)

Example: Andi Fugard, ESS9

Intermediate Quantitative Social Research, Birkbeck, University of London (2017-2020)
https://inductivestep.github.io/R-notes/complex-surveys.html

Working on a multi-country example:

# using srvyr (the ESS9 data frame is piped into the call)
library(srvyr)
as_survey_design(
  ids = idno, # instead of `psu` or `psu + idno` because `psu` is not in ESS9?
  strata = cntry,
  nest = TRUE,
  weights = pspwght
)

From the text:

The nest option takes account of the ids being nested within strata: in other words the same ID is used more than once across the dataset but only once in a country.
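That nesting is easy to verify on the raw data (a sketch, with d standing for the ESS9 data frame):

# sketch: `idno` values repeat across countries, but are unique within a country
library(dplyr)
count(d, idno) %>% filter(n > 1)        # duplicates across the whole dataset
count(d, cntry, idno) %>% filter(n > 1) # should return zero rows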

Example: Federico Vegetti, ESS7

Introduction to Survey Statistics, University of Heidelberg, 2018
https://federicovegetti.github.io/teaching/heidelberg_2018/lab/sst_lab_day2.html

When working on countries separately:

# using srvyr
as_survey_design(weights = c(dweight, pspwght)) %>%
  group_by(cntry) %>%
  # etc.

# ... doesn't pspwght include dweight?
# ... what about stratum? psu?

When working on all countries together:

# using srvyr
as_survey(weights = c(dweight, pspwght, pweight))

Example: Daniel Oberski, ESS7

http://asdfree.com/european-social-survey-ess.html

Working on a single country (Belgium) after merging the data with the SDDF file:

svydesign(
  ids = ~psu ,
  strata = ~stratify,
  probs = ~prob,
  data = ess_df
)

Surveys + Maps - swmap

If need be, use rlist and pipeR

rlist, by @renkun-ken, looks like a brilliant way to manipulate lists. My only regret is that the functions are named list.verb instead of list_verb.

Suggestion – Use it if lapply usage becomes cryptic.
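A quick sketch of the correspondence (hedged; see the rlist documentation for the full verb list):

# lapply vs. rlist, on a toy list
library(rlist)
x <- list(a = 1:3, b = 4:6)
lapply(x, mean)      # base R
list.map(x, mean(.)) # rlist: `.` stands for the current element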

pipeR looks cool too, even though I would prefer to keep things simple and stick with just %>%.

Under-covered topics

For an advanced version of the course, some notes on the topics that were excluded due to time constraints:

  1. Data manipulation (would deserve an extra session)
  2. Surveys
    • Labelled data (treated in passing; related to the point above re: factors)
    • Survey weights
  3. Databases: SQL etc. (with a passing mention of big data)
  4. APIs and Web scraping
  5. Models -- logit beyond logit: ordered logit, multinomial logit
  6. Models -- panel data, FE/RE, standard error corrections
  7. Models -- mixed models
  8. VCS -- Git/GitHub
  9. Dynamic documents -- R Markdown, Quarto
  10. Networks (data, viz, models)
  11. Text (this time covering e.g. topic models)
  12. JavaScript visualization libraries

So, basically, this advanced course could be, over 12 sessions:

  • 2 'refresher' sessions to get started again
    • introduce Git/GitHub
    • introduce R Markdown
  • 4 more sessions on data
  • 1 'refresher' session on (linear) models
  • 3 more sessions on models (with a refresher on logit in the first one)
  • 1 'extra' session on text
  • 1 'extra' session on networks

(Still no space for JS libraries, fair enough.)

This neglects PCA-style stuff and ML, which should probably be its own project, centered on tidymodels, covering random forests, gradient-boosted trees, etc.
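For reference, a minimal tidymodels sketch of the kind of workflow such a project would revolve around (illustrative only, using a built-in dataset; requires the ranger package):

# sketch: a random forest with tidymodels, just to show the workflow
library(tidymodels)

set.seed(1)
split <- initial_split(mtcars)

rf_spec <- rand_forest(mode = "regression", trees = 500) %>%
  set_engine("ranger")

rf_fit <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(mpg ~ .) %>%
  fit(data = training(split))

predict(rf_fit, new_data = testing(split))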

Improve Deaton 2021 (dsr-03/01-covid-income)

dsr-03/01-covid-income needs more structure, and some steps need to be fully documented, e.g.

  • better document the code (see the sketch after this list)
    • locale()
    • as.Date()
    • skip in read_excel
  • merging
    • explain merging in detail (use code and comments from dsr-04/01-debt)

Also:

  • The header of the script comes from QUANTI2 and should be changed.
  • The "quick check" part should show str, head and tail, View and glimpse.
  • hard-code OECD membership: document source in README
    • remove dependency on memberstates::memberstates$oecd$iso3c
  • add ggsave at end
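A hedged sketch of what the documented steps could look like (file names, sheet layout and column names below are placeholders, not the actual ones used in dsr-03/01-covid-income):

# sketch only -- file names, sheet layout and column names are placeholders
library(tidyverse)
library(readxl)

# `skip` drops the header rows sitting above the actual table in the Excel sheet
income <- read_excel("deaton-income.xlsx", skip = 3)

# `locale` lets readr parse e.g. French month names or comma decimal marks
parse_date("14 juillet 2020", "%d %B %Y", locale = locale(date_names = "fr"))

# `as.Date` converts text to Date objects, here assuming a day/month/year format
covid <- read_csv("covid.csv")
covid$date <- as.Date(covid$date, format = "%d/%m/%Y")

# merging: join the two datasets on their shared country identifier
d <- left_join(covid, income, by = "iso3c")

# export the last plot drawn with ggplot2
ggsave("covid-income.png", width = 8, height = 6)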

Syllabus, emails, readings, slides

Slides

Current size of slide sets (cap at 25, except Week 1). Revise readings, practice sessions and exercises, and include screenshots of videos when relevant.

  • 1. 37 -- OK, cap at ~ 40
  • 2. 24 -- OK
  • 3. 33 -- slightly too long, cut down a bit
  • 4. 25 -- OK
  • 5. 17 -- expand (description, sampling) -- add 'how to get help online' for Exercise 5
  • 6. 20 -- expand a bit (association) -- cover bootstrapping, Bayesian reasoning?
  • 7. TODO (correlation) -- cover bivariate OLS geometry
  • 8. 15 -- expand (regression)
    • expand further with slides specifically on regression output (use SRQM)
  • 9. 16 -- expand (logit) -- cover nonlinearity via LOESS, splines?
  • 10. TODO (surveys) -- cover data, again, data wrangling, again, weighting
  • 11. 30 -- slightly too long, cut down a bit (classification)
  • 12. 11 -- expand a bit

Syllabus

  • Port essentials from the current syllabus from PDF to Google Docs (done)
  • Continue moving session recaps from the syllabus (via DSR-outline-2.txt) to the GitHub README and emails
  • Fix #28
  • Possibly use stuff from EMSS-emails-2013
  • Look at more old stuff, esp.
    • QUANTI1-2020
    • QUANTI2-2019
  • Store the emails online: GitHub wiki? emails/ folder?

Other courses and tutorials

Make better use of great tutorials:

Finish digging into (and reorganising…) those:

Possible additions for the wiki:

Paper to turn into an exercise

Amelia McNamara, Nick Horton, "Wrangling categorical data in R," citing her website:

Wrangling categorical data in R, a paper co-authored with Nick Horton. This paper describes some common mistakes data analysts make when working with categorical data (factors) in R. The paper was published jointly in The American Statistician, Vol. 72, Issue 1 and as a pre-print in the Practical Data Science for Stats collection on PeerJ.

  • Final exercise is great. Use it.
  • Cite in readings.

Readings

Finalize the list:

  • Finalize
    • Sync emails with list (working backwards…)
  • Document readings in slides, with "(on Google Drive)" markers when relevant
  • List (almost) all material mentioned in emails and slides
  • Establish session-per-session roadmap
  • Link to the wiki in the syllabus (rather than copying the list into the syllabus)

Handbooks:

Export to QUANTI1 and QUANTI2

Working backwards…

QUANTI1 (also add handbooks):

  1. Software [ex: cty-level CSV]
  2. (2+3) Data + Visualization [csv via ex, CWS/xlsx, QOG-cs/dta, ex: cty]
  3. (4+5) Description + sampling (distributions) [svy, ex: svy]
  4. (6+7) Association + surveys (labels, factors) [svy, EXAM: svy]
  5. (8+9) Correlation + OLS.1 (simple, residuals) [cty, ex: cty/covid]
  6. (XX+10) OLS.2 (dummies, interactions) + Diagnostics & mfx [svy, EXAM: svy]

QUANTI2:

  • Whatever's not reused from data wrangling goes to Session 4
  • Whatever's not reused from data viz. goes to Session 5
  • Session on classification goes to Session 12 (if ever taught again)

TODO

  • Actually do the DSR → QUANTI move
  • Create the two repos

Website does not exist?

Hello there,

I tried running this code to pull the most recent CPJ data into R, but the link seemingly does not work and I can't find the CPJ database that goes up to 2017 in Excel. Would you happen to possess that dataset to update the link and/or code?

u = "https://www.cpj.org/killed/cpj-database.xls"
f = basename(u) # "cpj-database.xls"

if (!file.exists(f))
  download.file(u, f, mode = "wb")

Thanks for taking the time to look at this issue.

Data — U.S. Most Important Problem dataset

http://faculty.missouri.edu/williamslaro/mipdata.html

Email from Laron K. Williams to EPSA mailing-list below.


Hi everyone.

It is my pleasure to announce the release of the 'Most Important Problem' Dataset (MIPD). It is the result of a collaborative project on Americans' issue attentiveness with two Ph.D. students at the University of Missouri (Colton Heffington and Brandon Beomseob Park).

The MIPD collects all available individual-level responses to the 'most important problem facing this country' question in America from 1939 to June 2015. In addition to providing a sense of how individuals prioritize problems, the MIPD paints a picture of Americans' evolving issue attention over time. There are two datasets connected to the MIPD:

a) MIPD: This dataset contains individual-level responses (from over 670 surveys totaling almost a million respondents) to the MIP question, in addition to demographics, partisan preferences (including vote intention, previous vote, ideology and party identification), approval (general approval and approval of specific policies), economic evaluations (both retrospective and prospective, personal and national), and party competency (best party to address MIP, best party to address specific problems). We code the MIP responses into three coding schemes to ease the incorporation of other datasets: manifesto research group, the comparative agendas project, and Matt Singer's (2011) coding scheme. In addition to being the largest possible data collection of individual-level responses dealing with issue attention, we could certainly envision other uses of the dataset ranging from studies of vote intention, party competency, and partisan rationalization.

b) MIPD Aggregate: This dataset contains the aggregate percentage (weighted by population weights, if available) of respondents identifying that problem as the MIP. These aggregate data are available monthly, quarterly, and annually. This dataset would be a valuable component in an overall portrait of dynamic representation, and in observing how public opinion shifts in response to changing domestic and foreign circumstances.

The data are available from my website (http://faculty.missouri.edu/williamslaro/mipdata.html ) or from my Dataverse (https://dataverse.harvard.edu/dataverse/laronwilliams ).

If you use it, please cite: Colton Heffington, Brandon Beomseob Park and Laron K. Williams (forthcoming). "The 'Most Important Problem' Dataset (MIPD): A New Dataset on American Issue Importance", Conflict Management and Peace Science.

Thanks in advance. Feel free to contact me if you have any questions!

Laron

Surveys and survey weighting

Clean up early sessions

  • dsr-03/01-covid-income
    • better document the code
    • explain merging in detail (use code and comments from dsr-04/01-debt)
    • hard-code OECD membership (document source in README)
    • add ggsave at end
  • dsr-03/02-eu-mood
    • lose countrycode dependency?
  • dsr-04/01-debt
    • remove merge
    • adjust README accordingly
    • focus on building line plots, ending with small multiples
  • dsr-04/03-anscombe
    • cover base plots
    • cover lattice?

Import the best IDA examples

… as 'bonus' demos?

https://github.com/briatte/ida

  • Not all examples are very academic; some are just fun data
  • It would be nice to turn the course into a website like I did back then, but it would be too much work to write up everything.
  • Another solution would be to just update the code from IDA and republish that…

Candidates for Week 2 (functions)

Candidates for Week 3 (data)

Candidates for Week 4 (plots)

Candidates for Week 8 (linear models)

Candidates for Week 12 (extra stuff)

Maps (covered by #34)

Web scraping:

Text:

Candidates for unassigned weeks

Networks:

Time series:

Reproducible/open science:

  • That rant deserved to exist in 2013, and still does in 2023…

Maps and spatial data analysis

Shapefiles

  • World Bank
  • EuroGeographics
  • GADM data -- example
  • Other world map essentials?

Geolocation

Data

Old links

Everything below relates to discussions and experiments that happened prior to sf.

Related to an old blog post and to this old issue: #1

Document workload in slides for Session 1

Weekly workload of the course, in hours (min. 2, max. 6, recommended: 4):

  • 2 hours in class
  • 1-2 hours going through the weekly emails and a selection of the listed readings and videos
  • 1-2 hours going through the exercise (which will include going back to the readings, and Web searches)
    • should be done in groups

Minimize/optimize package dependencies

  • Choose between moments and e1071
    • Only mention e1071 and its alternative measures?
  • Session 9
    • Use ggeffects or marginaleffects -- or both? use in more than one session? (see the sketch after this list)
  • Session 11
    • Replace car with anything that plots scatterplot matrices?
    • Remove corrr entirely?
    • Remove ggcorrplot entirely? (or find a package that does the same, plus scatterplot matrices?)
    • Keep factoextra and ggfortify
    • Find a way to re-use plotly? e.g. https://plotly-r.com/d-charts.html -- or remove entirely
  • Session 12
    • Install sf earlier and have a map in Session 3
    • Replace life expectancy example from Session 12 with more complex example?
    • Add a full-fledged network example to justify installing igraph and ggraph?
    • Add a full-fledged dynamic document example using modelsummary (and possibly texreg)?
    • Make pdftools optional (only recommend it in the slides?)
  • Extras
    • Provide an additional example using rvest?
    • Keep WDI, but use it more than once? -- see #21
      • Modify exercise-04 to use WDI but save the data
      • Modify life-expectancy to use WDI and save the data
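For the Session 9 choice between ggeffects and marginaleffects, a hedged sketch of what each package call would look like, assuming a fitted logit model m with a hypothetical age predictor:

# sketch: marginal effects and predicted probabilities from a fitted logit model `m`
library(marginaleffects)
avg_slopes(m)               # average marginal effects (current marginaleffects API)

library(ggeffects)
ggpredict(m, terms = "age") # predicted probabilities over the (hypothetical) `age` term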

Checks:

  • Check that countrycode is used in multiple sessions
  • Check that broom is used in multiple sessions
  • Check that ggmosaic is used in multiple sessions
  • Check that ggrepel is used in multiple sessions
  • Check that performance is used in multiple sessions
  • Check that texreg is used in multiple sessions

Surveys - Solidarity in Europe (SiE)

https://twitter.com/alexandreafonso/status/1658451225174753280

https://europeangovernanceandpolitics.eui.eu/eui-yougov-solidarity-in-europe-project/
https://cadmus.eui.eu/handle/1814/72778

d <- "2020 SiE year dataset/2020 SiE dataset_dataset.csv" %>%
  readr::read_csv()

count(d, qcountry)
cty <- c("United Kingdom" ,"Denmark" ,"Finland" ,"France" ,"Germany" ,"Sweden" ,"Greece" ,"Hungary" ,"Italy" ,"Lithuania" ,"Netherlands" ,"Poland" ,"Romania" ,"Spain")
d$qcountry <- factor(d$qcountry, labels = cty)

summary(d$weight)

str_subset(names(d), "q20")
d <- mutate_at(d, vars(starts_with("q20")), ~ case_when(.x == 1 ~ 1L, .x == 2 ~ 0L, .x == 3 ~ NA_integer_))

# Austria
count(d, qcountry, q20_1)
tapply(d$q20_1, d$qcountry, mean, na.rm = TRUE)

# TODO: survey weights
# TODO: make dyadic version

Possibly present as a ‘freestyle’ exercise -- here's the data, do whatever you like with it. In this case, provide the data as a truly dyadic dataset, with the q20_x items merged into a single item, and additional columns corresponding to the model in the paper (see the reshaping sketch below).
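A hedged sketch of that reshaping step, building on the cleaned d from the chunk above; the item-to-country lookup below is a placeholder that would need to be filled in from the codebook:

# sketch: reshape the q20_* items into a dyadic (respondent country x target country) format
dyadic <- d %>%
  pivot_longer(starts_with("q20"), names_to = "item", values_to = "solidarity") %>%
  # placeholder lookup -- complete and check against the codebook
  left_join(tibble(item = c("q20_1", "q20_2"), target = c("Austria", "Belgium")), by = "item")

# e.g. average support by respondent country and target country
dyadic %>%
  group_by(qcountry, target) %>%
  summarise(mean_solidarity = mean(solidarity, na.rm = TRUE), .groups = "drop")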

Break down the answer to Exercise 4

Do what the README of exercise-04-answer (not in the repo) recommends doing: below the actual answer (Step 2 of the preston-curve-answer.r script), show a detailed breakdown of the steps.

Remove `docs` folders?

Some of them probably breach idiotic copyrights, and the links in the README files should suffice. Not including the docs is in fact a good incentive to go and read those READMEs in the first place.

Improve Anscombe example

Possible reorganisation:

  • Make it Exercise 1, focusing on:
    • setting the working directory correctly
    • having installed the tidyverse
    • understanding the stats (optional)
  • Make the generative art Example 1

Might be too complex, and having it in Session 3 makes sense if the code is updated to show multiple graph engines (base, lattice, ggplot2).
