lost-stats / lost-stats.github.io
Source code for the Library of Statistical Techniques
Home Page: https://lost-stats.github.io/
License: GNU General Public License v2.0
In looking over several examples, I've come around to the idea that we should strongly encourage re-use of the same datasets across language implementations (i.e., for a specific page). Advantages include:
I recently did this for the Collapse a Dataset page and, personally, think it's a lot easier to read and compare the code examples now. @vincentarelbundock's Rdatasets is a very useful resource in this regard, since it provides a ton of datasets that can be directly read in as CSVs. (Both Julia and Python statsmodels have nice wrappers for it too.)
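As a sketch of what sharing a dataset across implementations could look like, here is a hypothetical helper that builds the direct-CSV URL Rdatasets uses (the URL pattern is inferred from the site's layout, so double-check it before relying on it):

```python
def rdatasets_url(package, dataset):
    """Direct-CSV URL for a dataset hosted on Rdatasets.
    URL pattern is an assumption based on the site's layout."""
    return (
        "https://vincentarelbundock.github.io/Rdatasets/csv/"
        f"{package}/{dataset}.csv"
    )

# A page in any language can then load the same shared dataset, e.g.:
#   import pandas as pd
#   df = pd.read_csv(rdatasets_url("datasets", "mtcars"))
print(rdatasets_url("datasets", "mtcars"))
```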
Question: Do others agree? If so, I'll add some text to the Contributing page laying this out.
PS. I also think we should discourage use of really large files, especially since this is going to start becoming a drag on our GA Actions builds. There is one big offender here that I'll try to swap out when I get a sec. (Sorry, that's my student and I should have warned her about it.)
Per #3, we would like to support syntax highlighting for all the relevant languages here. https://github.com/rouge-ruby/rouge doesn't have a Stata or GRETL lexer. I've written a Stata one, but don't want to fix up all the test issues to get it through Rouge's process, so for now it lives in the _plugins directory.
If you'd like to add a GRETL lexer, feel free to follow the same process as for the Stata lexer. :)
I think it would be great if all the examples here assumed standard environments and we added a page about it. It would also allow #6 to actually happen.
E.g., for R, you should run install.packages(c('tidyverse', 'MASS')) or whatnot.
Thoughts?
I would like to add a section at the end of the granger causality page dedicated to impulse response functions. I think they are a very helpful way to visualize granger causality. Would it be possible to make me a contributor to add these changes?
I think maybe I am messing up how I am suggesting edits to pages but I'm not sure. I know I'm a contributor, so I don't think I'm supposed to be doing PRs, but I have been submitting my edits by just clicking the green "Propose changes" button at the bottom of each page. Should I be doing anything else? Over the last few days I made some modifications on the synthetic control page, the density plots page, and the support vector machines page.
The fixest code is all broken here: https://lost-stats.github.io/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html
(Likely due to changes to i()
introduced around version 7.0.0).
The solution is something like:
```r
library(fixest)

od = read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv')

od = within(od, {
  Date = as.Date(paste(substr(Quarter, 3, 7),
                       as.integer(substr(Quarter, 2, 2))*3 - 2,
                       1,
                       sep = "-"))
  Treated = State == 'California'
})

fmod = feols(Rate ~ i(Date, Treated, ref = "2011-04-01") | State + Date,
             data = od)

coefplot(fmod)
iplot(fmod)
```
But stepping back, I actually think we should change the dataset for this page. It requires several tedious data cleaning steps across the different languages, and ultimately ends up producing an event study plot because of the time dimension (which is confusing in and of itself). Aren't we just looking for something like
mod = lm(mpg ~ factor(vs) * factor(am), mtcars)
summary(mod)
marginaleffects::plot_cap(mod, condition = "am")
? (And obvs the equivalent in other languages)
There are a few code samples which don't quite work. For Python, they are here. For R they are here.
Most likely, this is because code samples are split by text. For example, if the Markdown looked like this:
Set _x_ to 1.
```r
x <- 1
```
Then add 1 to it.
```r
x <- x + 1
```
In this case, `x` is not defined in the second code block. To fix this, you can name the code block, e.g.,
Set _x_ to 1.
```r?example=simple
x <- 1
```
Then add 1 to it.
```r?example=simple
x <- x + 1
```
The code tester will gather each of these code blocks together and run them sequentially.
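For illustration, the `?example=` scheme reads like a URL query string, so it can be parsed with standard tools. This is only a sketch of the idea, not the actual test runner's code:

```python
from urllib.parse import parse_qs

def parse_fence(info):
    """Split a fence info string like 'r?example=simple' into
    (language, params). Mirrors the page's ?key=value convention."""
    lang, _, query = info.partition("?")
    params = {k: v[0] for k, v in parse_qs(query).items()}
    return lang, params

print(parse_fence("r?example=simple"))
# ('r', {'example': 'simple'})
print(parse_fence("python?skip=true&skipReason=file_does_not_exist"))
# ('python', {'skip': 'true', 'skipReason': 'file_does_not_exist'})
```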
There are a few examples where the example isn't literally meant to work. For example,
```python
import pandas as pd
df = pd.read_csv("name_of_file")
```
In this case, `"name_of_file"` is obviously a placeholder for the path to the file. But the code doesn't literally work, so it will cause a failure.
As a solution, you can indicate that you want the system to skip it:
```python?skip=true&skipReason=file_does_not_exist
import pandas as pd
df = pd.read_csv("name_of_file")
```
Here note that we added `?skip=true`, which tells the test runner to ignore the test. We also add `&skipReason=file_does_not_exist`, which is just an optional explanation for why we are skipping the test.
If you are fixing these locally, you can (if you have `python`, `poetry`, and `docker` installed) run the tests locally with:
poetry install
poetry run py.test -n 4
Alternatively, you can rerun the tests on GitHub by going to the "Run monthly [python|r] tests" workflow and clicking the "Run workflow" button.
Currently, native syntax formatting only works for some languages (Python, R, SAS). Other languages like Stata, Matlab, GRETL, etc. are just formatted as plain text chunks. See here for some examples: https://lost-stats.github.io/Model_Estimation/ordinary_least_squares.html#implementations
Do we need to add syntax support for these other languages? (Maybe from Pygments or some other source?)
Hey,
Really cool resource guys!
There is a mistake on this tutorial explaining how to train causal forests:
https://lost-stats.github.io/Machine_Learning/causal_forest.html
In the Stata part, you split the data into test and training using
g split = runiform() > .5
and you send over the testing data with the preserve block and calling
keep if split == 0
But at no point do you call

keep if split == 1

to explicitly keep only the training data for training!
I believe after this preserve block:

```stata
preserve
* Keep the predictors from the holding data, send it over, so later we can make an X matrix to predict with
keep if split == 0
keep year prbarr prbconv prbpris avgsen polpc density taxpc regionn smsan pctmin wcon
* R needs that data pre-processed! So using the same variables as in the main model, process the variables
fvrevar year prbarr prbconv prbpris avgsen polpc density taxpc i.regionn i.smsan pctmin wcon
keep `r(varlist)'
* Then send the data to R
rcall: df.hold <- st.data()
restore
```
You just need to add one line:

```stata
keep if split == 1
```

so that you train with the training data and not the whole dataset.
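To see why the extra keep matters, here is a toy Python sketch of the same split logic (the data and variable names are made up; only the masking idea mirrors the Stata code):

```python
import random

random.seed(1)
rows = list(range(10))
# Mirrors the Stata line `g split = runiform() > .5`
split = [random.random() > 0.5 for _ in rows]

train_rows = [r for r, s in zip(rows, split) if s]      # keep if split == 1
test_rows = [r for r, s in zip(rows, split) if not s]   # keep if split == 0

# Without the `keep if split == 1` step, training would use all of
# `rows`, so every test observation would also appear in training.
print(sorted(train_rows + test_rows) == rows)  # True
```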
Thanks!
Hello, I would like to be added as a contributor to the LOST repo. I want to add a page on color palettes; I found it in the non-existent pages tab, and I think it would be really useful for others to have as a resource.
I've been working through and fixing all the broken links. There are, unsurprisingly, a bunch of links to pages that don't exist yet. This makes it annoying to fix the broken links, since you have to remember what pages exist, and also is probably annoying for anyone who clicks on them. What would be the best way to make these links "work?"
Having them go to a page that says 'this page doesn't exist yet' seems ideal. I assume it wouldn't quite work to do that automatically, Wikipedia-style, right? If not I can set something up manually.
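If it does have to be done manually, one approach would be a small script that lists relative links pointing at pages which don't exist yet. A rough sketch (the link format and directory layout are assumptions, and redirects/anchors would need more care):

```python
import re
import tempfile
from pathlib import Path

def missing_pages(site_root):
    """Collect relative markdown links whose target file doesn't exist."""
    site_root = Path(site_root)
    link_re = re.compile(r"\[[^\]]*\]\(([^)]+)\)")
    missing = set()
    for md in site_root.rglob("*.md"):
        for target in link_re.findall(md.read_text()):
            if target.startswith(("http://", "https://", "#")):
                continue  # leave external links and anchors alone
            if not (site_root / target.lstrip("/")).exists():
                missing.add(target)
    return missing

# Tiny demo in a scratch directory
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "exists.md").write_text("a page")
    (root / "index.md").write_text("[ok](exists.md) and [dead](not_yet.md)")
    dead_links = missing_pages(root)

print(dead_links)  # {'not_yet.md'}
```

Each reported link could then be pointed at a single "this page doesn't exist yet" stub.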
First, thanks a lot for putting this out for learners like us. I was wondering if I can use the following for repeated cross-sectional data as well?
Code copied and pasted from the diff-in-diff event study section:

```stata
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
* Pull out the coefficients and SEs
g coef = .
g se = .
levelsof shifted_ttt, l(times)
foreach t in `times' {
    replace coef = _b[`t'.shifted_ttt] if shifted_ttt == `t'
    replace se = _se[`t'.shifted_ttt] if shifted_ttt == `t'
}
* Make confidence intervals
g ci_top = coef + 1.96*se
g ci_bottom = coef - 1.96*se
* Limit ourselves to one observation per quarter
* now switch back to time_to_treat to get original timing
keep time_to_treat coef se ci_*
duplicates drop
sort time_to_treat
twoway (sc coef time_to_treat, connect(line)) ///
    (rcap ci_top ci_bottom time_to_treat) ///
    (function y = 0, range(time_to_treat)) ///
    (function y = 0, range(`bottom_range' `top_range') horiz), ///
    xtitle("Time to Treatment") caption("95% Confidence Intervals Shown")
```
Hi,
I made a few edits in the Data Manipulation section by adding data.table code for reshaping (both long-to-wide and wide-to-long), and also in OLS by adding Python code for fixed effects regression.
Would you please add me as a contributor so that I can push my changes to the repo and make a pull request? Currently, when I push, it consistently says "The requested URL returned error: 403".
Best,
Wensong
Many of the examples in this are broken. For example, in the balance test, the guide instructs us to do

bal.text(treat ~ foreign, data = mtcars)

despite the fact that neither `treat` nor `foreign` exists in the `mtcars` dataset. Looking through the codebase, there doesn't seem to be any automated testing for this repository. Is this something on the radar?
Didn't realize this wasn't an open issue. Just finished fixing all the broken links (except for the ones that don't lead to existing pages of course). Opening this issue to close it.
@NickCH-K and I have discussed some of these over Twitter, but I'm going to jot down a laundry list of stylistic elements that I feel could be improved/changed. These are pretty opinionated, so I'd welcome thoughts from others.
Package names mentioned in prose (e.g., "Using the lfe package we can...") shouldn't be set in code font. Code formatting should be reserved for, well, code. Similarly, we can revert to code when referring to a specific function from that package. (E.g., "Use the `lfe::felm()` function..." is fine.) See: #5 (comment)
We want to tidy up page hierarchies, starting with Model Estimation, and maybe move on to some of the other sections. The tricky thing is deciding on the best categories, although some overlap is probably unavoidable. Here's a first stab... mostly working off the existing pages, but with one or two non-existing (but obvious) counterparts thrown in. Feel free to edit or make changes.
Hi, I have used the Python package stargazer to generate a nice table showcasing all the regression output in HTML format.
I am wondering how I should save this output into a certain folder so that I can share it with my co-authors?
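For reference, a minimal sketch of writing the rendered HTML to a folder. The stargazer call is shown only in a comment (so the sketch stays self-contained), and the folder and file names are arbitrary:

```python
import tempfile
from pathlib import Path

# With the Python stargazer package, the rendered table is a string:
#   from stargazer.stargazer import Stargazer
#   html = Stargazer([model1, model2]).render_html()
# A stand-in string is used here instead of fitting real models.
html = "<table><tr><td>regression output</td></tr></table>"

folder = Path(tempfile.mkdtemp()) / "tables"   # any folder you like
folder.mkdir(parents=True, exist_ok=True)
path = folder / "regression_table.html"
path.write_text(html)

# The saved file can be opened in a browser or sent to co-authors.
print(path.exists())  # True
```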
Hello,
I am happy to have contributed to LOST. However, after I edited the new page template into the new Tobit.md, I realized I had proposed erasing the new page template and replacing it with my contribution. Also, I got this message: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository." I'm not exactly sure how to get the page I wrote into the correct place as a new page. Help, please.
I was just taking a look at the page for Combining Datasets Overview and the Markdown tables used at the top are not being shown as tables. They show up properly in the code/Markdown preview for that page, but not the final version. How can we fix this?
I couldn't find any mention of general multilevel models on the Desired Nonexistent Pages page.
Did I miss the right page, or is there a specific reason against providing a basic explanation of multilevel models?
Given the twitter conversation that some of us participated in today, and the need for reproducibility in projects, should we add a section/page on best practice for reproducibility or would that be out of scope? Personally, I think it could be really useful for site users but I'd like to hear everyone's thoughts.
I guess we might want to cover:
- reproducing a set of packages (e.g. `pip freeze > requirements.txt` or `conda env export --from-history -f environment.yml`, and the R equivalents)
- reproducing the execution sequence of a series of scripts and commands (e.g. using, and perhaps drawing, a makefile, or using a tool like ploomber)
- reproducing the operating system (Docker)
- anything else?
We could draw on the relevant sections of The Turing Way if necessary.
Hi all,
This might be tangential to #73 , but I was thinking it would be useful to have a page on basic package creation and/or namespace management.
I'd be willing to write brief R and Julia tutorials. Also, as a learner I would love to see best practices for Python!
My apologies if this already exists and I'm failing to notice it :~)
I had on my list of additions I was planning to make to LOST some of the Stata code for importing various kinds of files, but the only importing page I can currently find is the Import a Foreign Data File page, and the "Also consider" links on that page seem to be broken. I may be missing them somewhere else, but just wanted to flag it.
Hi,
I have created a 'faceted graphs' R Markdown document, but I am not able to create a pull request to submit it. Please give me access to the LOST-STATS repo so that I can push my work. My GitHub handle is pramod-dudhe.
Thanks,
Pramod
Breaking out off-topic discussion from #55 into a new issue....
In order to support better proofing and modularity, we should change all the absolute links (e.g., `https://lost-stats.github.io/Time_Series/model.html`) to Liquid-friendly relative links, i.e., `{{ "/Time_Series/model.html" | relative_url }}`
This will allow #59 to more easily pick out dead links as well as making TOC migrations easier.
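A rough sketch of how such a migration could be scripted (the regex is an assumption and would need checking against edge cases like trailing punctuation before running it over the whole site):

```python
import re

BASE = "https://lost-stats.github.io"

def liquidify(markdown):
    """Rewrite absolute site links as Liquid relative_url filters."""
    pattern = re.compile(re.escape(BASE) + r"(/[^\s\)\"']*)")
    return pattern.sub(r'{{ "\1" | relative_url }}', markdown)

src = "see https://lost-stats.github.io/Time_Series/model.html here"
print(liquidify(src))
# see {{ "/Time_Series/model.html" | relative_url }} here
```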
It looks like the recent github changes have made the site kick the can. Working on getting it back up. But just FYI may cause some more downtime. :-/
The master branch does not appear to be auto-updating. I made an update to source about six hours ago and so far there hasn't been an update in master or on the website. Does something need to be turned back on @khwilson ?
Hi all, I recently uploaded a page on color palettes and pushed the image files to the figures subdirectory instead of the images subdirectory within the figures subdirectory. Any idea on how I can move them to the proper location so that the graphs can show up on the LOST page?
Thinking of going through at least the data manipulation pages and adding data.table versions for all the R examples. Is this different enough to be worth it?
For this code:
```stata
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
```
--> For the regression in the last line, I do not see the interaction term `` treat##ib`true_neg1'.shifted_ttt ``. Is it a typo? Or is there a logic behind not including the interaction with the `treat` variable?
Thank you guys!
Having just added quantile regression to the GLS category, I'm realizing that the category name doesn't really work that well. Quantile regression is not a least-squares method but there's not really another good place for it to go. Some of the other pages in that category are pretty tenuous too.
I think the idea behind the GLS category is linear-index models that aren't just extensions of OLS. Probably the name that would make the most sense is "generalized linear models", or at least it would if that weren't already a specific term that doesn't actually apply to all of these methods.
"Other Linear Models"?
"Linear-in-Predictors Models" (ew)
"Other Regression Models"?
Drop the category and merge it back with OLS, retitling the whole thing "Regression Models"?
Do the HLM thing and call them Fixed Effects Models, thereby confusing everybody? :P
Drawing a blank for anything better than that. Any ideas? Updating every page that links there will be a bit of a chore so hoping to get it right!
I keep getting warnings that there is a security hole in the nokogiri dependency. Is this something to be worried about (or fixed) @khwilson ?
The diff-in-diff event study code in Python generates the interactions in the wrong order: INX_0, INX_1, INX_10, INX_11, INX_12, ... instead of INX_0, INX_1, INX_2, ..., INX_11, ...
The resulting figure also shows evidence of this.
I think the best place to reorder these variables in the df is right before running the regression, but I have been getting tripped up with the factors.union line.
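Independent of the factors.union detail, one way to get the natural ordering is to sort the column names by their numeric suffix. A sketch (column names here are just illustrative):

```python
import re

def natural_key(name):
    """Sort 'INX_2' before 'INX_10' by comparing the numeric suffix."""
    m = re.search(r"(\d+)$", name)
    return (int(m.group(1)) if m else -1, name)

cols = ["INX_0", "INX_1", "INX_10", "INX_11", "INX_12", "INX_2"]
ordered = sorted(cols, key=natural_key)
print(ordered)
# ['INX_0', 'INX_1', 'INX_2', 'INX_10', 'INX_11', 'INX_12']

# With a pandas DataFrame df, you could then reindex the interaction
# columns right before the regression, e.g. df = df[ordered + others]
# (variable names hypothetical).
```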
@khwilson I keep getting failed build warnings on all the new security updates, and so haven't been merging them. Is there a place I can look to figure out what's going on? I'm not quite able to figure it out from the logs.
This is fairly minor as issues go, but the code syntax highlighting on the site is subtle and, in some examples, looks like plain text. I wondered if you'd be interested in changing the settings in a way that produces more contrast between different code elements.
I think the site uses rouge for syntax highlighting but it seems like the rouge default is more colourful than what is being currently displayed on the site.
It seems like rouge settings are configured in `_config.yml`, but I'm not sure how one would change them to get the rouge default syntax highlighting to display instead of the current setup.
I made a few commits and kept getting the following error when it tries to push to Pages:
Error: Unable to process command '::set-env name=DEPLOYMENT_STATUS::success' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
Any idea how to fix this?
First of all, thank you for this wonderful resource!
I am confused by the Stata event study code, and I think it might not be totally correct. For reference, here it is:
```stata
use "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta", clear
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
```
My problem stems from the line

replace time_to_treat = 0 if missing(_nfd)

This means that states which are never treated are given 0, making them look as if they are treated in that year. This gives the following tabulation:
time_to_treat    Freq.   Percent     Cum.
-21 1 0.06 0.06
-20 2 0.12 0.19
-19 2 0.12 0.31
-18 2 0.12 0.43
-17 2 0.12 0.56
-16 3 0.19 0.74
-15 3 0.19 0.93
-14 3 0.19 1.11
-13 6 0.37 1.48
-12 7 0.43 1.92
-11 9 0.56 2.47
-10 12 0.74 3.22
-9 22 1.36 4.58
-8 25 1.55 6.12
-7 32 1.98 8.10
-6 34 2.10 10.20
-5 36 2.23 12.43
-4 36 2.23 14.66
-3 36 2.23 16.88
-2 36 2.23 19.11
-1 36 2.23 21.34
0 465 28.76 50.09
1 36 2.23 52.32
2 36 2.23 54.55
3 36 2.23 56.77
4 36 2.23 59.00
5 36 2.23 61.22
6 36 2.23 63.45
7 36 2.23 65.68
8 36 2.23 67.90
9 36 2.23 70.13
10 36 2.23 72.36
11 36 2.23 74.58
12 35 2.16 76.75
13 34 2.10 78.85
14 34 2.10 80.95
15 34 2.10 83.06
16 34 2.10 85.16
17 33 2.04 87.20
18 33 2.04 89.24
19 33 2.04 91.28
20 30 1.86 93.14
21 29 1.79 94.93
22 27 1.67 96.60
23 24 1.48 98.08
24 14 0.87 98.95
25 11 0.68 99.63
26 4 0.25 99.88
27 2 0.12 100.00
Total 1,617 100.00
It's possible that because `time_to_treat` does not vary across years in control units, the state (`stfips`) fixed effects "take care" of this. But I can't intuitively reason about what's really happening, given that 0 stands for both untreated states and treated states at year 0.

I would recommend making the `time_to_treat` variable 100 for control states, or the maximum plus 100, to avoid this confusion. The values don't matter since they are used as fixed effects anyway.
Excited about all the new PRs! @FeiyiShao @marykmcd @Evanmj7
I will be processing one PR a day, so it may take me a bit to get to yours, but don't worry, it's happening.
The simplest way to perform KNN in R is with the package class. It has a knn() function that is rather user-friendly and does not require you to compute distances yourself, as it runs everything with Euclidean distance. For more advanced kinds of nearest-neighbor matching, it would be best to use the matchit() function from the MatchIt package. To verify results, this example also uses the confusionMatrix() function from the caret package.
Given how this package is designed, the easiest mistake to make is during normalization: normalizing variables, such as character variables, that should not be normalized. Another common source of error is not including drop = TRUE for your target (y) vector, which will prevent the model from running. Finally, because of the way this example verifies results, it is vital to convert the target into a factor, as the data have to be of the same kind for R to give you an output.
```r
library(tidyverse)
library(readr)
# For KNN
library(class)
library(caret)

# Import the dataset
df <- read_csv("wdbc.csv")
view(df)

# The first column is an identifier, so remove it; anything that does not aid in classifying can be removed
df <- df[-1]

# See the count of the target: either B (benign) or M (malignant)
table(df[1])

# Normalize the dataset
normal <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }

# Apply to what needs to be normalized -- in this case, not the target
df_norm <- as.data.frame(lapply(df[2:31], normal))

# Verify that normalization has occurred
summary(df_norm[1])
summary(df_norm[3])
summary(df_norm[11])
summary(df_norm[23])

# Split the dataframe into test and train datasets - note there are two dataframes
# First, test and train for the features; here is an example of roughly a 70/30 train/test split
x_train <- df_norm[1:397,]
x_test <- df_norm[398:568,]

# Now test and train for the target - here it is important that you use ", 1" to indicate only one column
# It will not work unless you use drop = TRUE
y_train <- df[1:397, 1, drop = TRUE]
y_test <- df[398:568, 1, drop = TRUE]

# The purpose of installing those packages was to use these next functions, starting with knn()
# As the Python example states, a common default for k (unless assigned) is the square root of the number of observations
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 23)

# Confusion matrix from caret
# knn() returns a factor with two levels, so we need to make sure the test vector matches
y_test <- y_test %>% factor(levels = c("B", "M"))

# See how well the model did
confusionMatrix(y_test, pred)
```
The dataset used is from the UCI Machine Learning Repository: the Breast Cancer Wisconsin (Diagnostic) Data Set. The RDocumentation page for knn() was used while working on this example, as was Statology's "How to Create a Confusion Matrix" guide.
wdbc.csv