Issues from the srqm repository (briatte/srqm)

Using Linux Libertine in the math appendix

The math appendix uses knitr, ggplot2 and gridExtra to produce plots with math notation in the document. If you try to add extrafont to the R script, the plots will fail to generate:

Warning in grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y,  :
  font family 'LinLibertine' not found in PostScript font database
Quitting from lines 154-194 (A_math.Rnw) 
Error in grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y,  : 
  invalid font type

I'll leave the code in a FALSE condition in case someone finds a fix.

Update to ESS Round 6

Round 6 (2012) of the ESS is now available.

QOG 2020: make sure GDP documentation has been corrected

Spotted by a smart group of students. Relevant tweet.

Package installation on restricted computers

This is still an issue, and the code in setup/srqm_pkgs.ado and utils.ado (the pkgs utility) is too complicated and not even guaranteed to work properly.

Three possible situations:

running the course on a laptop with full admin privileges – everything works fine
running the course on a computer with restricted rights, from a USB stick – issue (1)
running the course on a computer with restricted rights, from the hard drive – issue (2)

Issue (1) might be easy:

detect if the pwd either contains /Volumes/ (Mac) or does not contain c:\ (Win), or equivalent on Unix
if so, install packages locally in the srqm/pkgs folder

This might fail if the hard drive is not C: on a Windows machine.
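The detection rule for issue (1) can be sketched as follows. This is a minimal Python illustration of the logic only, not the Stata code in setup/srqm_pkgs.ado; the function name `runs_from_usb` is hypothetical, and treating any drive letter other than C: as a USB stick is exactly the fragile assumption noted above.

```python
import re


def runs_from_usb(pwd: str) -> bool:
    """Guess whether the working directory sits on a removable drive."""
    # macOS mounts external drives under /Volumes/
    if "/Volumes/" in pwd:
        return True
    # Windows: a drive-letter path that is not on C: is assumed to be a USB stick
    drive = re.match(r"^([A-Za-z]):[\\/]", pwd)
    if drive:
        return drive.group(1).lower() != "c"
    # anything else (e.g. a plain Unix path) is treated as the hard drive
    return False
```

A machine whose system drive is D: would be misclassified by this rule, which is the failure case mentioned above.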

Issue (2) is bothersome. So far, the approach is to try the PLUS folder, and if it fails, the PERSONAL folder, and if it fails, install locally.

Perhaps it would be easier and better to just try out the default option, using something as simple as ssc inst fre, and if it fails for whatever reason, to fall back on the local install.
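The "try the default, then fall back" idea can be sketched in a few lines. This is a Python illustration of the control flow only, not Stata code; `install_with_fallback`, `default_install` and `local_install` are hypothetical names, and the two installers are stand-ins that simulate a failing default and a working local install.

```python
from typing import Callable, Iterable, Optional, Tuple


def install_with_fallback(
    strategies: Iterable[Tuple[str, Callable[[], None]]]
) -> Optional[str]:
    """Run each install strategy in order; return the name of the first that works."""
    for name, install in strategies:
        try:
            install()
        except Exception:
            # any failure, for whatever reason, moves on to the next strategy
            continue
        return name
    return None


def default_install() -> None:
    # hypothetical stand-in for the default option (e.g. a test install from SSC)
    raise PermissionError("cannot write to the default folder")


def local_install() -> None:
    # hypothetical stand-in for installing into the srqm/pkgs folder
    pass


chosen = install_with_fallback([("default", default_install), ("local", local_install)])
print(chosen)
```

The point of the design is that the code never needs to know why the default failed, which removes the PLUS/PERSONAL probing described above.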

Use `desctable` for Table 1 export?

https://www.trentonmize.com/software/desctable

Requires Stata 15, though.

Link broken

On http://f.briatte.org/teaching/quanti/, the link to the syllabus is broken (since you moved it around).

Update to QOG 20 December 2013

The new version should not be very different from the 15May2013 version.

Try to list all former instructors (and admin)

Lost count around September 2018, it seems…

Fall 2023 team: Pol-Angély PESCAYRE, Alexis GRIGORIEFF and myself.

For historical research via the Wayback Machine…

http://formation.sciences-po.fr/enseignement/2018/KGLM/2015
http://formation.sciences-po.fr/enseignement/2018/KOUT/2030

GLM, Fall 2018: myself + MCAVAY, Haley
GLM, Fall 2019: myself + Antoine
PSIA, Fall 2019: tons of people
- myself
- CUPILLARD, Emilie (Chargée d'études statistiques)
- DE LAEVER, Antonin (Enseignant)
- ESLAMILOUTIJ, Siyavash (Doctorant)
- JARDIN, Antoine A. (Ingénieur de recherche CNRS)
- MEHMOOD, Sultan (Etudiant doctorant)
- MONTALBO, Adrien (Doctorant)
- PESCAYRE, Pol-Angely C. (Teacher)
- SCHNEIDER, Sarah (Statistical Reasoning)

Might also be useful to dig into:
https://moodle.sciences-po.fr/course/search.php?q=reasoning&areaids=core_course-course

… helps to find, for semester '202010' (Fall, probably)

Siyavash ESLAMILOUTIJ
Malo MOFAKHAMI

Update to (support) Stata 13, 14, 15

The course was originally written for Stata 10/11, and has been tested with Stata 12, but not Stata 13.

Reintroduce the slides

Reintroduce the slides
Credit the initial template

Slides should be an excuse to structure the first 45' of class:

1. 15' on project management
- 10' to monitor all groups from the Google Docs projects sheet
- 5' on questions and answers about the projects
2. 5' to remind everyone of how far they should be in their textbook readings
- Show course webpage with textbook and Stata Guide chapters
- Answer theoretical questions briefly, provide empirical examples during practice
3. 20' to introduce the session topic and replication code
- Every do-file contains detailed notes and is meant to be finished at home
- The solutions for the short coding exercises in the slides are on the wiki

… so the slides basically tie up everything together:

the first and second part of each session
the course documentation (website, readings, code, Google Docs, wiki)

Disclosure: count 5' to 10' for lateness to class (both teacher and student-induced). This means that the 20' for (3) are more like 15', and that the break has to be 10' at most, even when the students have interesting questions and ideas about their projects to share over coffee.

Use better `require` util?

https://github.com/sergiocorreia/stata-require

Update to GSS 2018

http://gss.norc.org/get-the-data/stata

Nice example use:
https://kieranhealy.org/blog/archives/2019/03/22/a-quick-and-tidy-look-at-the-2018-gss/

Update the wiki

Some of the linked resources are probably outdated or unavailable.

Course utilities is the most useful, at least to me. I keep rediscovering some of the stuff it lists…

Data lists stuff that I communicate to students.

The course history was never clear or accurate… Let's see if planning for 2.0 (#31) helps.

The "Code" wiki page does not exactly correspond to what I teach students. For instance, my first session focuses almost exclusively on pwd…

The "Stata" wiki page links only to English-language stuff, but I could add @mthevenin's courses in French, which are very up-to-date:

https://mthevenin.github.io/stata_fr/
https://github.com/mthevenin/stata_fr
https://github.com/mthevenin/formation_stata

There are other pages, some with close to zero usefulness, unless I put the links or references in the Stata Guide, for instance. I keep re-creating lists of courses every time I teach a new course anyway.

Bottom line — use the wiki only to document the srqm internals (utilities), move everything else to the Stata Guide.

Add Erik Gahner's PolData listing to data sources

@erikgahner – https://github.com/erikgahner/PolData

profile.do error

In the profile.do file, I encounter an error with Linux/Stata 12 at line 34. The problem seems to be with the c() instruction:

c(update_query) undefined

Not checked under Stata 13

Country-level data: World Economics and Politics Dataverse

https://ncgg.princeton.edu/wep/dataverse.html
https://ncgg.princeton.edu/wep/download.html
https://ncgg.princeton.edu/wep/IPE_Codebook.pdf (outdated)

Make the data cross-sectional 2017 ±3 years, à la QOG
- Email the original authors for feedback
- Call it wep2020? ipe2020?
Code an example merge between WEP and QOG — see 34
- e.g. with the RCS variables, which are more recent than the lp_ ones in QOG (1980s)

Update `burd` scheme?

https://github.com/sergiocorreia/stata-schemes

Should be opened in its own repo too.

world-c.dta and world-d.dta?

What are the world-c and world-d data files? I can't seem to open them, Stata says "file data/world-d.dta not Stata format".

svyplot

Related to #25 in a way: release svyplot.

https://gist.github.com/briatte/5099538

It's been used for good: https://twitter.com/PetGran/status/1046824377151619074

Reintroduce Stata Guide

Version 2 is XeTeX-coded, so the sources should be out there too.

(Once publishable, it should be easy to bundle the replication material as a Stata package, which would also be a better way to distribute the course utilities. See Mark Lunt's epidemiology course or J. Scott Long's course for examples of courses-as-packages.)

Additional do-files

Using notes from SRQM-TODO-2018. Other TODO files need to be added too.

Repeats of stuff already covered

With the aim of explaining better what the options are through a different use case:

working-with-summary-graphs.do
working-with-summary-statistics.do
working-with-survey-weights.do
working-with-regression-results.do
working-with-marginal-effects.do
working-with-clustered-standard-errors.do

Aim would be to have one 'extra' per week.

Extra stuff

1. working-with-wide-datasets-reshaping.do (will require a different dataset, e.g. OECD)
2. working-with-multiple-datasets-merging.do (use WEP? see #24)
3. Cronbach's alpha and/or factor to create synthetic indexes
4. PCA and clustering
5. multilevel models
6. panel models (will require adding a panel or CSTS dataset — see below)
7. maps! and possibly spatial models, segregation indexes (ask Antoine)?
8. Bayesian example (taken from Gelman et al.? Bread and Peace)?

Suggestion (3) might be a good 'extra'.

CSTS or panel datasets

CPDS, OECD, WDI are good candidates. Scruggs, if not too outdated?

See https://f.briatte.org/teaching/quanti/data for ideas.

Syllabus source

Could you share (with me, at least) the syllabus source? I need to change some details, such as my name, the room etc. Thanks!

Test the course with Stata 10 or 11

The course has not been tested with Stata 10 or 11 for a while, and keeping backward compatibility is important.

Using srqm_get

srqm_get fetches course material. It's useful to distribute do-files and slides, which are often edited at the last minute.

The code for srqm_get now points to srqm.briatte.org, which will redirect to srqm.apinc.org as soon as my zone file refreshes. There is a page at this address to remind students how srqm_get works.

@joelgombin: I'll send you the address and password to the FTP.

Compatibility with different versions of Stata

Closes #12 and #18 in favour of a reassessment in early ~~2021~~ 2023 (updated).

The ideal goal would be to maintain compatibility with all (SE/IC/MP) versions of Stata released in the last 10 years, with a focus on Stata SE.

(For reference, the course started running shortly after Stata 11.1 was released, and I think I remember testing it with Stata 10, from June 2007, possibly even Stata 9, April 2005.)

Currently (~~2021~~ 2023), that means going back to Stata ~~11-12~~ 13, so Stata ~~12+~~ 13+ compatibility seems like a reasonable goal. This would avoid e.g. having to enclose calls to marginsplot in "if version" conditionals.
A more reasonable goal is to support Stata 13+ only, primarily because of UTF-8 and HTTPS support. Stata 13 was released in 2013, so supporting Stata 13+ would be an ideal goal for… 2023. (Update: it's 2023, and this is now a good goal.)
Some stuff like the syntax of ci and changes to the defaults of margins (for xtlogit, re, so not affecting the course) suggest supporting only Stata 14+. That's the lazy option.

Given the above and some of the details below, the lazy objective of supporting only the last 3 versions (Stata 14+) might be more reasonable… Stata 14 was released in 2015, so that would result in a 5-year compatibility window, which is not so shabby.

Dataset format

There are comments about this in srqm_data.ado. The current format for all teaching datasets is Stata 12.

Number of variables

2,048 at most for Stata/IC.

Solution: warn (or fail?) if datasets go over 2,048 - 100 variables in srqm_data.ado (leaving 100 variables for the user).
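The threshold arithmetic is simple but easy to get wrong by a few variables, so here is a minimal sketch of the check. It is written in Python for illustration only (the real check would live in srqm_data.ado); `variable_budget` and the constant names are hypothetical.

```python
STATA_IC_MAX_VARS = 2048  # limit quoted above for Stata/IC
USER_HEADROOM = 100       # variables left for the user to create


def variable_budget(n_vars, limit=STATA_IC_MAX_VARS, headroom=USER_HEADROOM):
    """Return (ok, spare): whether the dataset is at or under the warning
    threshold, and how many variables remain below it (negative when over)."""
    threshold = limit - headroom
    spare = threshold - n_vars
    return spare >= 0, spare
```

With these defaults the threshold is 1,948 variables, so a dataset with 1,983 variables fits under the hard limit but would still be flagged, being 35 variables over the threshold.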

Requires checking the QOG dataset (updated below in 2023)
- qog2019 is fine, 1,983 vars (leaves 65 free for the Stata/IC user to create)
- 2020, 2021, 2023 editions are at ~ 1,700 vars
- see #30 to understand why sticking with qog2019 makes sense
Requires changing the GSS dataset by breaking it down to multiple years
- see #30
- see code below

GSS limited to 1976 and 2016 has ~ 1,100 vars and weighs 7.7 MB -- that should work.

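// keep only the two selected survey years
keep if year == 1976 | year == 2016
// store the list of all variables in r(varlist)
describe, varlist
// drop every variable with no observations left in the selected years
foreach i of varlist `r(varlist)' {
	di "`i'"
	count if !mi(`i')
	if r(N) == 0 {
		drop `i'
	}
}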

HTTPS

Discussed in #12. It's probably time to drop HTTP support: there is no satisfactory way to keep supporting it, the course is available outside of Sciences Po only via HTTPS-only GitHub, and all Sciences Po students are on HTTPS.

There are more comments about this (HTTP/S on my Stata access point) in srqm_grab.ado.

Import commands for non-DTA data

srqm_grab.ado contains commands to import CSV/TSV and Excel data in Stata: it will show the commands to do so for Stata 13+ (one more argument in favour of dropping Stata 12 at that stage).

Outdated commands

memory fails gracefully. Affects week1.do.

ci does not fail gracefully — it does not require mean in Stata 12 or in Stata 13, but does in Stata 14+. Affects week4.do and week5.do.

marginsplot is not supported in Stata 11-. Affects week11.do.

Check whether the permissions fix still works

This fix used to be able to set up the course on admin-restricted computers in the Sciences Po microlab. It failed today, so it needs to be tested again or modified.

Update the GitHub Page

https://briatte.github.io/srqm/

So badly outdated… Never really used it in class anyway. Deleting it entirely might be a better idea.

Stata 18: descriptive stats table

https://www.stata.com/new-in-stata/create-export-descriptive-statistic-tables/

Ten years after — v2.0

This course is now roughly ~~10~~ 13 years old (first run: Fall 2010). The pretty dirty repo history shows it. It's time to think of version 2.0, although version 1.0 never got its release tag (release tags did not even exist when we started).

And perhaps even more importantly, but (perhaps, even) more time-consumingly:

Update the Stata Guide
- Continue the LaTeX rewrite
- Take the decade-old (let's not wait for decades…) comments from Filip, Joël… into account
- Finish writing the regression bits using Bittmann 2019 and Mehmetoglu and Jakobsen
Update the course slides (let's stick with LaTeX despite love/hate)

Additional do-files

My many TODO files from 2017, (especially) 2018, 2019, 2020 have suggestions of extra do-files to create — shorter ones, ones that cover extra stuff beyond the scope of the course (e.g. merging, panel data).

I also have some very short "demo" do-files that I use in the first hour, as recaps of the previous session + introduction to the second hour of the current one.

Use that as an opportunity to…

… include and demo more datasets?
… document estout properly? (both for "Table 1" and regression tables)
… rename the do-files, week01, week02… week12 for obsessive neatness?
- ~~have week0*-recap do-files with just the essentials~~ too much work: streamline the week** ones
- have xtra01 to xtra12 -- one 'bonus' do-file per week (see below)

Bonus do-files (which will move out some stuff from the main ones, and will cover some intermediate/advanced topics):

xtra01-pca -- plot a map + demo PCA (see below)
xtra02-merge -- download additional data from online + merge
xtra03-svy -- survey weights: WVS 99-04
xtra04-bootstrap -- survey weights: NHIS 2017 (repeat?) + bootstrap
xtra05-export -- export descriptive stats with estout
xtra06-tests -- survey weights: ESS 2008 (repeat?) + other association tests with ranks
xtra07-ts -- QOG time series with (extract of) 2023 edition? (serial correlation)
xtra08-panels -- robust and clustered SEs, fixed and random effects with QOG time series
xtra09-export -- export regression results with estout
xtra10-logit -- AUC/ROC, predicted probabilities, ordinal logit, multinomial (?)
xtra11-mfx -- marginal effects, ~~bootstrap~~ (already there at end, remove)
xtra12-count -- survey weights: GSS + neg binomial, count, Poisson etc.?

PCA example:

pca popgrowth-safewater
scoreplot, ms(i) mlab(country)
// note: tried using `kountry` to convert country names, failed so far
loadingplot
// demo arch effect, no strong 2nd dimension
pca lexp-safewater
scoreplot

Leaves out:

MCA
quantile regression, L1 (lasso), L2 (ridge)
bootstrapped SEs in models
Bayesian models
multilevel models

Beyond teaching

I once considered publishing the Stata Guide, but publishing a Stata Guide, even though some publishers would take it, sounds bizarre in 2021. R is the current standard, with Julia and Python probably coming next or along.

At least look at LeanPub, like Roger D. Peng
Ask PSIA or the Presses de Sciences Po about it
Go for Sage, like some kind of updated, no-menus Mehmetoglu and Jakobsen?

Update to NHIS 1997–2014

The online data now goes up to 2014.

Update to QOG January 2020

Basically a follow-up of #14

All links to QOG Standard 2016 are now dead.
Recent QOG Standard datasets have dropped Barro & Lee variables, so the course do-files need to use the gea_ versions of educational attainment.
Remove eu_ variables (small sample size) during data preparation? The data trimming script probably already does so.

License? And Jupyter notebook example

Hello! I was just perusing Stata-related github repositories and came across this. There's no license file, so I wasn't sure if the content was openly-licensed. By default, repositories without a license file have all rights reserved.

Secondly, I've been working on a Jupyter kernel for Stata, which allows Stata to be run directly from Jupyter Notebooks. I'd been wanting to make an example notebook anyways, so I converted the week2.do file into a Jupyter Notebook. It's nice because Github includes the output and graphs when it displays Jupyter Notebook files.

If you click here: https://github.com/kylebarron/srqm/blob/master/code/week2.ipynb you'll see all the output of the week2.do file rendered next to the Markdown descriptions.

0_myboxes package?

The sty.tex file loads a "0_myboxes" package that is, however, not included in the folder. Any idea where I could get it? (I looked quickly at Taraborelli's GitHub page but didn't find it).

Dataset updates

Closes #21, #22 and #23 (copied below), #27.

Update from 2023

Stop updating the data, really.

'Freeze' as it is ~~(except for ESS, perhaps)~~
Archive the original frozen datasets/codebooks in data-raw/
Update srqm_data to use data-raw/
Slightly improve the _readme documents
- Document freezes
- Document codebook issues, e.g. #27
- Ideally, this would be in the Stata Guide…
Add WEP? #24

Detailed notes

QOG: ~~qog2023~~ -- since QOG 2023 is out
- freeze: qog2019
- would require rewriting code and looking at less clear results… see code at end of section
- only advantage would be lower codebook size → just downsample the 2019 one, it only loses the intra-doc links
- note the codebook issue! #27
- Perhaps simply drop the eu_* variables
GSS: ~~gss7221~~ -- since GSS has updated too
- freeze: gss7616 (but see below)
- not fun to keep only one year: keep ~~older years~~ one old year too
- ~~possibly break down single data into yearly ones?~~ restrict to 1976 and 2016
  - would solve "max 2,048 vars" issue from #28
  - ~~raises question as to how to zip it all (currently uses gss7616* to match files)~~
ESS: ~~ess2008~~ -- in order to continue using torture question?
- freeze: ess0816, or ess2008 and ess2016 (different codebooks, so it's fine)
- keep using Round 4 for both torture example and health services ones (results are not as clear-cut with Round 8)
- keep Round 8 to cover e.g. climate change
- problem: DTA file is too large -- divide, to avoid _merge problem
- document existence of ess2016 despite not in use anywhere in the course do-files
WVS: wvs9904 -- keep old version for sharia law question
- update to last version, check encoding
- possibly also include a more recent wave? (raises same question as ess2016)
NHIS: update to ~~nhis202* recent year~~ nhis1020?
- check if sampling frame and variables have changed first
- see below on how URL structure for fetching has changed

Note on QOG -- offers only this as a replacement in 2023, which is not ideal:

// school life expectancy
sc wdi_fertility wef_lse, ms(i) mlab(ccodealp) || lfit wdi_fertility wef_lse, ///
	name(g1, replace)
// linear fit + SSA data points only, underpredicted
sc wdi_fertility wef_lse if ht_region == 4, ms(i) mlab(ccodealp) || ///
	lfit wdi_fertility wef_lse, ///
	name(g2, replace)
// all regions
forv i = 1/10 {
	sc wdi_fertility wef_lse if ht_region == `i', ms(i) mlab(ccodealp) || ///
	lfit wdi_fertility wef_lse, ///
	name("region`i'", replace)
}

The plan for 2021:

Additional things to consider:

Dataset names

I like the initial "acronym + year" convention, but it produces strange names for multiple-year survey datasets:

ess1214 (not used) and ess0816
wvs9904 (unavoidable)
nhis1017 (unavoidable, unless we use a single year, but that removes any demo of keep if year)
gss7616 (unavoidable, unless we separate the years)

Merged datasets

Is it still a good idea to do that for e.g. ESS? Probably not, esp. if we need to limit datasets at 2,048 variables for Stata/IC.

Keep NHIS with multiple years. Use it to demo keep if year.
Keep WVS with multiple years (country-dependent).
Break down GSS.
Break down ESS.

Both WVS and ESS are used to demo keep if inlist(country, …), the other subset we want to show.

Additional datasets

It would make a lot of sense to have more datasets for the students to use than those used in the do-files.

Currently, the do-files are selective anyway: we provide ESS 2016 (Round 8) but do not use the data, even though the dependent variable also exists in that round.

GSS has a single codebook, so bundling many years would duplicate the codebook in the ZIP archives. Not ideal.
ESS could be broken down to Rounds 4 (2008), 8 (2016) and 9 (2018).

Update code to new survey datasets

ESS Rounds 4 and 8
- check Round 4 results
- use Round 8 in at least one of the do-files?
U.S. GSS (new selected years)
U.S. NHIS (new selected years)
WVS Round 4 (new data file)

See #21 for the equivalent issue with the QOG dataset.

Graph scheme

Digging in tweets and bookmarks… My own scheme-burd seems to have a few issues in recent versions of Stata, so at least update it, or switch to one of those below.

stata-scheme-modern

https://github.com/mdroste/stata-scheme-modern#screenshots

Seems most promising. Would ideally like to support BuRd diverging colours.

538

ssc install g538schemes, replace all

https://danbischof.com/2017/09/05/a-final-stata-gift-538-schemes/

Via @rivelino22.

Improve srqm_datatrim

Trim to max(n) variables (to deal with Stata 11 and limited versions of Stata).
Find the number of observations for each variable.
Lose all variables for which there are no observations.
Gradually lose all other variables, starting with those with the fewest observations.
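The steps above can be sketched as a small function. This is a Python illustration of the trimming logic, not the Stata implementation of srqm_datatrim; the function name and the dictionary input (variable name mapped to its number of non-missing observations) are assumptions made for the example.

```python
from typing import Dict, List


def trim_variables(n_obs: Dict[str, int], max_vars: int) -> List[str]:
    """Return the variables to keep, in their original order."""
    # step 1: lose all variables with no observations
    kept = [var for var, n in n_obs.items() if n > 0]
    # step 2: if still over the limit, lose those with the fewest observations
    if len(kept) > max_vars:
        # sorted() is stable, so ties are resolved in favour of earlier variables
        ranked = sorted(kept, key=lambda var: n_obs[var], reverse=True)
        survivors = set(ranked[:max_vars])
        kept = [var for var in kept if var in survivors]
    return kept
```

Ranking purely by observation count is blunt: it would happily drop a key variable with moderate coverage, so a list of protected variables might be needed in practice.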

Many QOG variables have low-N for cross-sectional analysis

library(tidyverse)
d <- haven::read_dta('/Users/fr/Documents/Teaching/SRQM/data/qog2019.dta')

tibble(
  var = names(d),
  # data sources
  src = str_extract(names(d), ".*?_"),
  n = apply(d, 2, function(x) sum(!is.na(x)))
) %>% 
  group_by(src) %>% 
  summarise(n_vars = n(), min_N = min(n), med_N = median(n), max_N = max(n)) %>%
  arrange(min_N) %>% 
  # arbitrary threshold at N = 50
  filter(!is.na(src), min_N < 50) %>% 
  print(n = 100)

PSI, EU, OECD, WWBI and a few others are particularly at fault:

# A tibble: 28 x 5
   src     n_vars min_N med_N max_N
   <chr>    <int> <int> <dbl> <int>
 1 psi_         6     1  10.5    20
 2 mad_         4    15  29     163
 3 eu_        277    16  34      48
 4 une_        47    16 146     193
 5 wwbi_       38    17  41      62
 6 oecd_      281    19  37      44
 7 wdi_       278    19 156     192
 8 dev_         4    20  20      20
 9 dpi_        70    26 160.    175
10 bs_          8    28  28      28
11 ess_         9    28  28      28
12 ideavt_      6    28 107     180
13 wel_        36    29  32     189
14 wvs_        42    29  34      34
15 aid_         6    31 139     139
16 cses_        2    31  31.5    32
17 gol_        20    33 127     129
18 wiid_       18    34  35      35
19 ucdp_        2    35  70     105
20 cpds_       49    36  36      36
21 h_          11    37 165     185
22 lis_        23    37  37      37
23 r_           5    40  98     144
24 sgi_        29    41  41      41
25 top_         2    41  41      41
26 nelda_      10    44  45      45
27 vi_         13    45  48      50
28 qs_          9    47 112     115

Not a bug, but leads students to build designs with low sample sizes.

Take a look at Trenton Mize's CDA course

https://www.trentonmize.com/teaching/cda

… and possibly others at the same location, e.g.

https://www.trentonmize.com/teaching/surveys

Also, basic Stata guide there:

https://drive.google.com/file/d/1wX0bXu7WOb3OW9eAyTCYTccsQLdGU2bF/view
