lost-stats / lost-stats.github.io
Source code for the Library of Statistical Techniques
Home Page: https://lost-stats.github.io/
License: GNU General Public License v2.0
In looking over several examples, I've come around to the idea that we should strongly encourage re-use of the same datasets across language implementations (i.e., for a specific page). Advantages include:
I recently did this for the Collapse a Dataset page and, personally, think it's a lot easier to read and compare the code examples now. @vincentarelbundock's Rdatasets is a very useful resource in this regard, since it provides a ton of datasets that can be directly read in as CSVs. (Both Julia and Python statsmodels have nice wrappers for it too.)
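As a sketch of what sharing a dataset across implementations could look like, here is a hypothetical helper that builds the direct-CSV URL Rdatasets uses (the URL pattern is inferred from the site's layout, so double-check it before relying on it):

```python
def rdatasets_url(package, dataset):
    """Direct-CSV URL for a dataset hosted on Rdatasets.
    URL pattern is an assumption based on the site's layout."""
    return (
        "https://vincentarelbundock.github.io/Rdatasets/csv/"
        f"{package}/{dataset}.csv"
    )

# A page in any language can then load the same shared dataset, e.g.:
#   import pandas as pd
#   df = pd.read_csv(rdatasets_url("datasets", "mtcars"))
print(rdatasets_url("datasets", "mtcars"))
```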
Question: Do others agree? If so, I'll add some text to the Contributing page laying this out.
PS. I also think we should discourage use of really large files, especially since this is going to start becoming a drag on our GA Actions builds. There is one big offender here that I'll try to swap out when I get a sec. (Sorry, that's my student and I should have warned her about it.)
Per #3, we would like to support syntax highlighting for all the relevant languages here. https://github.com/rouge-ruby/rouge doesn't have a Stata or GRETL lexer. I've written a Stata one, but don't want to fix up all the test issues to get it through Rouge's process, so for now it lives in the _plugins directory.
If you'd like to add a GRETL lexer, feel free to follow the same process as for the Stata lexer. :)
I think it would be great if all the examples here assumed standard environments and we added a page about it. It would also allow #6 to actually happen.
E.g., for R, you should run install.packages(c('tidyverse', 'MASS')) or whatnot.
Thoughts?
I would like to add a section at the end of the granger causality page dedicated to impulse response functions. I think they are a very helpful way to visualize granger causality. Would it be possible to make me a contributor to add these changes?
I think maybe I am messing up how I am suggesting edits to pages but I'm not sure. I know I'm a contributor, so I don't think I'm supposed to be doing PRs, but I have been submitting my edits by just clicking the green "Propose changes" button at the bottom of each page. Should I be doing anything else? Over the last few days I made some modifications on the synthetic control page, the density plots page, and the support vector machines page.
The fixest code is all broken here: https://lost-stats.github.io/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html
(Likely due to changes to i()
introduced around version 7.0.0).
The solution is something like:
```r
library(fixest)

od = read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv')

od = within(od, {
  Date = as.Date(paste(substr(Quarter, 3, 7),
                       as.integer(substr(Quarter, 2, 2))*3 - 2,
                       1,
                       sep = "-"))
  Treated = State == 'California'
})

fmod = feols(Rate ~ i(Date, Treated, ref = "2011-04-01") | State + Date,
             data = od)

coefplot(fmod)
iplot(fmod)
```
But stepping back, I actually think we should change the dataset for this page. It requires several tedious data cleaning steps across the different languages, and ultimately ends up producing an event study plot because of the time dimension (which is confusing in and of itself). Aren't we just looking for something like
mod = lm(mpg ~ factor(vs) * factor(am), mtcars)
summary(mod)
marginaleffects::plot_cap(mod, condition = "am")
? (And obvs the equivalent in other languages)
There are a few code samples which don't quite work. For Python, they are here. For R they are here.
Most likely, this is because code samples are split by text. For example, if the Markdown looked like this:
Set _x_ to 1.
```r
x <- 1
```
Then add 1 to it.
```r
x <- x + 1
```
In this case, `x` is not defined in the second code block. To fix this, you can name the code block, e.g.,
Set _x_ to 1.
```r?example=simple
x <- 1
```
Then add 1 to it.
```r?example=simple
x <- x + 1
```
The code tester will gather each of these code blocks together and run them sequentially.
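For illustration, the `?example=` scheme reads like a URL query string, so it can be parsed with standard tools. This is only a sketch of the idea, not the actual test runner's code:

```python
from urllib.parse import parse_qs

def parse_fence(info):
    """Split a fence info string like 'r?example=simple' into
    (language, params). Mirrors the page's ?key=value convention."""
    lang, _, query = info.partition("?")
    params = {k: v[0] for k, v in parse_qs(query).items()}
    return lang, params

print(parse_fence("r?example=simple"))
# ('r', {'example': 'simple'})
print(parse_fence("python?skip=true&skipReason=file_does_not_exist"))
# ('python', {'skip': 'true', 'skipReason': 'file_does_not_exist'})
```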
There are a few examples where the example isn't literally meant to work. For example,
```python
import pandas as pd
df = pd.read_csv("name_of_file")
```
In this case, `"name_of_file"` is obviously a placeholder for the path to the file. But the code doesn't literally work, so it will cause a failure.
As a solution, you can indicate that you want the system to skip it:
```python?skip=true&skipReason=file_does_not_exist
import pandas as pd
df = pd.read_csv("name_of_file")
```
Here note that we added `?skip=true`, which tells the test runner to ignore the test. We also add `&skipReason=file_does_not_exist`, which is just an optional explanation for why we are skipping the test.
If you are fixing these locally, you can (if you have `python`, `poetry`, and `docker` installed) run the tests locally with:
poetry install
poetry run py.test -n 4
Alternatively, you can rerun the tests on GitHub by going to the "Run monthly [python|r] tests" workflow and clicking the "Run workflow" button.
Currently, native syntax formatting only works for some languages (Python, R, SAS). Other languages like Stata, Matlab, GRETL, etc. are just formatted as plain text chunks. See here for some examples: https://lost-stats.github.io/Model_Estimation/ordinary_least_squares.html#implementations
Do we need to add syntax support for these other languages? (Maybe from Pygments or some other source?)
Hey,
Really cool resource guys!
There is a mistake on this tutorial explaining how to train causal forests:
https://lost-stats.github.io/Machine_Learning/causal_forest.html
In the Stata part, you split the data into test and training using
g split = runiform() > .5
and you send over the testing data with the preserve block and calling
keep if split == 0
But at no point do you call

keep if split == 1

to explicitly keep only the training data for training!
I believe after this preserve block:

```stata
preserve
* Keep the predictors from the holding data, send it over, so later we can make an X matrix to predict with
keep if split == 0
keep year prbarr prbconv prbpris avgsen polpc density taxpc regionn smsan pctmin wcon
* R needs that data pre-processed! So using the same variables as in the main model, process the variables
fvrevar year prbarr prbconv prbpris avgsen polpc density taxpc i.regionn i.smsan pctmin wcon
keep `r(varlist)'
* Then send the data to R
rcall: df.hold <- st.data()
restore
```
You just need to add one line:

```stata
keep if split == 1
```

so that you train with the training data and not the whole dataset.
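To see why the extra keep matters, here is a toy Python sketch of the same split logic (the data and variable names are made up; only the masking idea mirrors the Stata code):

```python
import random

random.seed(1)
rows = list(range(10))
# Mirrors the Stata line `g split = runiform() > .5`
split = [random.random() > 0.5 for _ in rows]

train_rows = [r for r, s in zip(rows, split) if s]      # keep if split == 1
test_rows = [r for r, s in zip(rows, split) if not s]   # keep if split == 0

# Without the `keep if split == 1` step, training would use all of
# `rows`, so every test observation would also appear in training.
print(sorted(train_rows + test_rows) == rows)  # True
```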
Thanks!
Hello, I would like to be added as a contributor to the LOST repo. I want to add a page on color palettes; I found it in the non-existent pages tab, and I think it would be really useful for others to have as a resource.
I've been working through and fixing all the broken links. There are, unsurprisingly, a bunch of links to pages that don't exist yet. This makes it annoying to fix the broken links, since you have to remember what pages exist, and also is probably annoying for anyone who clicks on them. What would be the best way to make these links "work?"
Having them go to a page that says 'this page doesn't exist yet' seems ideal. I assume it wouldn't quite work to do that automatically, Wikipedia-style, right? If not I can set something up manually.
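If it does have to be done manually, one approach would be a small script that lists relative links pointing at pages which don't exist yet. A rough sketch (the link format and directory layout are assumptions, and redirects/anchors would need more care):

```python
import re
import tempfile
from pathlib import Path

def missing_pages(site_root):
    """Collect relative markdown links whose target file doesn't exist."""
    site_root = Path(site_root)
    link_re = re.compile(r"\[[^\]]*\]\(([^)]+)\)")
    missing = set()
    for md in site_root.rglob("*.md"):
        for target in link_re.findall(md.read_text()):
            if target.startswith(("http://", "https://", "#")):
                continue  # leave external links and anchors alone
            if not (site_root / target.lstrip("/")).exists():
                missing.add(target)
    return missing

# Tiny demo in a scratch directory
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "exists.md").write_text("a page")
    (root / "index.md").write_text("[ok](exists.md) and [dead](not_yet.md)")
    dead_links = missing_pages(root)

print(dead_links)  # {'not_yet.md'}
```

Each reported link could then be pointed at a single "this page doesn't exist yet" stub.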
First, thanks a lot for putting this out for learners like us. I was wondering if I can use the following for repeated cross-sectional data as well?
Code copied and pasted from the diff-in-diff event study section:

```stata
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
* Pull out the coefficients and SEs
g coef = .
g se = .
levelsof shifted_ttt, l(times)
foreach t in `times' {
    replace coef = _b[`t'.shifted_ttt] if shifted_ttt == `t'
    replace se = _se[`t'.shifted_ttt] if shifted_ttt == `t'
}
* Make confidence intervals
g ci_top = coef + 1.96*se
g ci_bottom = coef - 1.96*se
* Limit ourselves to one observation per quarter
* now switch back to time_to_treat to get original timing
keep time_to_treat coef se ci_*
duplicates drop
sort time_to_treat
twoway (sc coef time_to_treat, connect(line)) ///
    (rcap ci_top ci_bottom time_to_treat) ///
    (function y = 0, range(time_to_treat)) ///
    (function y = 0, range(`bottom_range' `top_range') horiz), ///
    xtitle("Time to Treatment") caption("95% Confidence Intervals Shown")
```
Hi,
I made a few edits in the Data Manipulation section by adding data.table code for reshaping (both long-to-wide and wide-to-long), and also in OLS by adding Python code for fixed effects regression.
Would you please add me as a contributor so that I can push my changes to the repo and make a pull request? Currently, when I push, it consistently says "The requested URL returned error: 403".
Best,
Wensong
Many of the examples in this are broken. For example, in the balance test, the guide instructs us to do

bal.text(treat ~ foreign, data = mtcars)

despite the fact that neither `treat` nor `foreign` exists in the `mtcars` dataset. Looking through the codebase, there doesn't seem to be any automated testing for this repository. Is this something on the radar?
Didn't realize this wasn't an open issue. Just finished fixing all the broken links (except for the ones that don't lead to existing pages of course). Opening this issue to close it.
@NickCH-K and I have discussed some of these over Twitter, but I'm going to jot down a laundry list of stylistic elements that I feel could be improved/changed. These are pretty opinionated, so I'd welcome thoughts from others.
Package names mentioned in prose (e.g., "Using the lfe package we can...") shouldn't be set in code font. Code formatting should be reserved for, well, code. Similarly, we can revert to code when referring to a specific function from that package. (E.g., "Use the `lfe::felm()` function..." is fine.) See: #5 (comment)
We want to tidy up page hierarchies, starting with Model Estimation, and maybe move on to some of the other sections. The tricky thing is deciding on the best categories, although some overlap is probably unavoidable. Here's a first stab... mostly working off the existing pages, but with one or two non-existing (but obvious) counterparts thrown in. Feel free to edit or make changes.
Hi, I have used the Python package stargazer to generate a nice table showcasing all the regression output in HTML format.
I am wondering how I should save this output into a certain folder so that I can share it with my co-authors?
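For reference, a minimal sketch of writing the rendered HTML to a folder. The stargazer call is shown only in a comment (so the sketch stays self-contained), and the folder and file names are arbitrary:

```python
import tempfile
from pathlib import Path

# With the Python stargazer package, the rendered table is a string:
#   from stargazer.stargazer import Stargazer
#   html = Stargazer([model1, model2]).render_html()
# A stand-in string is used here instead of fitting real models.
html = "<table><tr><td>regression output</td></tr></table>"

folder = Path(tempfile.mkdtemp()) / "tables"   # any folder you like
folder.mkdir(parents=True, exist_ok=True)
path = folder / "regression_table.html"
path.write_text(html)

# The saved file can be opened in a browser or sent to co-authors.
print(path.exists())  # True
```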
Hello,
I am happy to have contributed to LOST. However, after I edited the new page template into the new Tobit.md, I realized I had proposed erasing the new page template and replacing it with my contribution. Also, I got this message: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository." I'm not exactly sure how to get the page I wrote into the correct place as a new page. Help, please.
I was just taking a look at the page for Combining Datasets Overview and the Markdown tables used at the top are not being shown as tables. They show up properly in the code/Markdown preview for that page, but not the final version. How can we fix this?
I couldn't find any mention of general multilevel models on the Desired Nonexistent Pages page.
Did I miss the right page, or is there a specific reason against providing a basic explanation of multilevel models?
Given the twitter conversation that some of us participated in today, and the need for reproducibility in projects, should we add a section/page on best practice for reproducibility or would that be out of scope? Personally, I think it could be really useful for site users but I'd like to hear everyone's thoughts.
I guess we might want to cover:
- reproducing a set of packages (e.g. `pip freeze > requirements.txt` or `conda env export --from-history -f environment.yml`, and the R equivalents)
- reproducing the execution sequence of a series of scripts and commands (e.g. using, and perhaps drawing, a makefile, or using a tool like ploomber)
- reproducing the operating system (Docker)
- anything else?
We could draw on the relevant sections of The Turing Way if necessary.
Hi all,
This might be tangential to #73 , but I was thinking it would be useful to have a page on basic package creation and/or namespace management.
I'd be willing to write brief R and Julia tutorials. Also, as a learner I would love to see best practices for Python!
My apologies if this already exists and I'm failing to notice it :~)
I had on my list of additions I was planning to make to LOST some of the Stata code for importing various kinds of files, but the only importing page I can currently find is the Import a Foreign Data File page, and the "Also consider" links on that page seem to be broken. I may be missing them somewhere else, but just wanted to flag it.
Hi,
I have created a 'faceted graphs' R Markdown document, but I am not able to create a pull request to submit it. Please give me access to the LOST-STATS repo so that I can push my work. My GitHub handle is pramod-dudhe.
Thanks,
Pramod
Breaking out off-topic discussion from #55 into a new issue....
In order to support better proofing and modularity, we should change all the absolute links (e.g., `https://lost-stats.github.io/Time_Series/model.html`) to Liquid-friendly relative links, i.e., `{{ "/Time_Series/model.html" | relative_url }}`
This will allow #59 to more easily pick out dead links as well as making TOC migrations easier.
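A rough sketch of how such a migration could be scripted (the regex is an assumption and would need checking against edge cases like trailing punctuation before running it over the whole site):

```python
import re

BASE = "https://lost-stats.github.io"

def liquidify(markdown):
    """Rewrite absolute site links as Liquid relative_url filters."""
    pattern = re.compile(re.escape(BASE) + r"(/[^\s\)\"']*)")
    return pattern.sub(r'{{ "\1" | relative_url }}', markdown)

src = "see https://lost-stats.github.io/Time_Series/model.html here"
print(liquidify(src))
# see {{ "/Time_Series/model.html" | relative_url }} here
```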
It looks like the recent github changes have made the site kick the can. Working on getting it back up. But just FYI may cause some more downtime. :-/
The master branch does not appear to be auto-updating. I made an update to source about six hours ago and so far there hasn't been an update in master or on the website. Does something need to be turned back on @khwilson ?
Hi all, I recently uploaded a page on color palettes and pushed the image files to the figures subdirectory instead of the images subdirectory within the figures subdirectory. Any idea on how I can move them to the proper location so that the graphs can show up on the LOST page?
Thinking of going through at least the data manipulation pages and adding data.table versions for all the R examples. Is this different enough to be worth it?
For this code:
```stata
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
```
--> For the regression in the last line, I do not see the interaction term `` treat##ib`true_neg1'.shifted_ttt ``. Is it a typo? Or is there a logic behind not including the interaction with the `treat` variable?
Thank you guys!
Having just added quantile regression to the GLS category, I'm realizing that the category name doesn't really work that well. Quantile regression is not a least-squares method but there's not really another good place for it to go. Some of the other pages in that category are pretty tenuous too.
I think the idea behind the GLS category is linear-index models that aren't just extensions of OLS. Probably the name that would make the most sense is "generalized linear models", or at least it would if that weren't already a specific term that doesn't actually apply to all of these methods.
"Other Linear Models"?
"Linear-in-Predictors Models" (ew)
"Other Regression Models"?
Drop the category and merge it back with OLS, retitling the whole thing "Regression Models"?
Do the HLM thing and call them Fixed Effects Models, thereby confusing everybody? :P
Drawing a blank for anything better than that. Any ideas? Updating every page that links there will be a bit of a chore so hoping to get it right!
I keep getting warnings that there is a security hole in the nokogiri dependency. Is this something to be worried about (or fixed) @khwilson ?
The diff-in-diff event study code in Python generates the interactions in the wrong order: INX_0, INX_1, INX_10, INX_11, INX_12, ... instead of INX_0, INX_1, INX_2, ..., INX_11, ...
The resulting figure also shows evidence of this.
I think the best place to reorder these variables in the df is right before running the regression, but I have been getting tripped up with the factors.union line.
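Independent of the factors.union detail, one way to get the natural ordering is to sort the column names by their numeric suffix. A sketch (column names here are just illustrative):

```python
import re

def natural_key(name):
    """Sort 'INX_2' before 'INX_10' by comparing the numeric suffix."""
    m = re.search(r"(\d+)$", name)
    return (int(m.group(1)) if m else -1, name)

cols = ["INX_0", "INX_1", "INX_10", "INX_11", "INX_12", "INX_2"]
ordered = sorted(cols, key=natural_key)
print(ordered)
# ['INX_0', 'INX_1', 'INX_2', 'INX_10', 'INX_11', 'INX_12']

# With a pandas DataFrame df, you could then reindex the interaction
# columns right before the regression, e.g. df = df[ordered + others]
# (variable names hypothetical).
```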
@khwilson I keep getting failed build warnings on all the new security updates, and so haven't been merging them. Is there a place I can look to figure out what's going on? I'm not quite able to figure it out from the logs.
This is fairly minor as issues go, but the code syntax highlighting on the site is subtle and, in some examples, looks like plain text. I wondered if you'd be interested in changing the settings in a way that produces more contrast between different code elements.
I think the site uses rouge for syntax highlighting but it seems like the rouge default is more colourful than what is being currently displayed on the site.
It seems like rouge settings are configured in `_config.yml`, but I'm not sure how one would change them to get the rouge default syntax highlighting to display instead of the current setup.
I made a few commits and kept getting the following error when it tries to push to Pages:
Error: Unable to process command '::set-env name=DEPLOYMENT_STATUS::success' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
Any idea how to fix this?
First of all, thank you for this wonderful resource!
I am confused by the Stata event study code, and I think it might not be totally correct. For reference, here it is:
```stata
use "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta", clear
* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)
* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)
* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)
```
My problem stems from the line

replace time_to_treat = 0 if missing(_nfd)

This means that states which are never treated are given 0, making them look as if they are treated in that year. This gives the following tabulation:
time_to_treat    Freq.   Percent     Cum.
-21 1 0.06 0.06
-20 2 0.12 0.19
-19 2 0.12 0.31
-18 2 0.12 0.43
-17 2 0.12 0.56
-16 3 0.19 0.74
-15 3 0.19 0.93
-14 3 0.19 1.11
-13 6 0.37 1.48
-12 7 0.43 1.92
-11 9 0.56 2.47
-10 12 0.74 3.22
-9 22 1.36 4.58
-8 25 1.55 6.12
-7 32 1.98 8.10
-6 34 2.10 10.20
-5 36 2.23 12.43
-4 36 2.23 14.66
-3 36 2.23 16.88
-2 36 2.23 19.11
-1 36 2.23 21.34
0 465 28.76 50.09
1 36 2.23 52.32
2 36 2.23 54.55
3 36 2.23 56.77
4 36 2.23 59.00
5 36 2.23 61.22
6 36 2.23 63.45
7 36 2.23 65.68
8 36 2.23 67.90
9 36 2.23 70.13
10 36 2.23 72.36
11 36 2.23 74.58
12 35 2.16 76.75
13 34 2.10 78.85
14 34 2.10 80.95
15 34 2.10 83.06
16 34 2.10 85.16
17 33 2.04 87.20
18 33 2.04 89.24
19 33 2.04 91.28
20 30 1.86 93.14
21 29 1.79 94.93
22 27 1.67 96.60
23 24 1.48 98.08
24 14 0.87 98.95
25 11 0.68 99.63
26 4 0.25 99.88
27 2 0.12 100.00
Total 1,617 100.00
It's possible that because `time_to_treat` does not vary across years in control units, the state (`stfips`) fixed effects "take care" of this. But I can't intuitively reason about what's really happening, given that 0 stands for both untreated states and treated states at year 0.

I would recommend making the `time_to_treat` variable 100 for control states, or the maximum plus 100, to avoid this confusion. The values don't matter since they are used as fixed effects anyway.
Excited about all the new PRs! @FeiyiShao @marykmcd @Evanmj7
I will be processing one PR a day, so it may take me a bit to get to yours, but don't worry, it's happening.
The simplest way to perform KNN in R is with the package class. It has a knn() function that is rather user-friendly and does not require you to compute distances yourself, as it runs everything with Euclidean distance. For more advanced kinds of nearest-neighbor matching, it would be best to use the matchit() function from the MatchIt package. To verify results, this example also uses the confusionMatrix() function from the caret package.
Given how this package is designed, the easiest mistake to make is during normalization: normalizing variables, such as character variables, that should not be normalized. Another common source of error is not including drop = TRUE for your target (y) vector, which will prevent the model from running. Finally, because of the way this example verifies results, it is vital to convert the target into a factor, as the data have to be of the same kind for R to give you an output.
```r
library(tidyverse)
library(readr)
# For KNN
library(class)
library(caret)

# Import the dataset
df <- read_csv("wdbc.csv")
view(df)

# The first column is an identifier, so remove it; anything that does not aid in classifying can be removed
df <- df[-1]

# See the count of the target: either B (benign) or M (malignant)
table(df[1])

# Normalize the dataset
normal <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }

# Apply to what needs to be normalized -- in this case, not the target
df_norm <- as.data.frame(lapply(df[2:31], normal))

# Verify that normalization has occurred
summary(df_norm[1])
summary(df_norm[3])
summary(df_norm[11])
summary(df_norm[23])

# Split the dataframe into test and train datasets - note there are two dataframes
# First, test and train for the features; here is an example of roughly a 70/30 train/test split
x_train <- df_norm[1:397,]
x_test <- df_norm[398:568,]

# Now test and train for the target - here it is important that you use ", 1" to indicate only one column
# It will not work unless you use drop = TRUE
y_train <- df[1:397, 1, drop = TRUE]
y_test <- df[398:568, 1, drop = TRUE]

# The purpose of installing those packages was to use these next functions, starting with knn()
# As the Python example states, a common default for k (unless assigned) is the square root of the number of observations
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 23)

# Confusion matrix from caret
# knn() returns a factor with two levels, so we need to make sure the test vector matches
y_test <- y_test %>% factor(levels = c("B", "M"))

# See how well the model did
confusionMatrix(y_test, pred)
```
The dataset used is from the UCI Machine Learning Repository: the Breast Cancer Wisconsin (Diagnostic) Data Set. The RDocumentation page for knn() was used while working on this example, as was Statology's "How to Create a Confusion Matrix" guide.
wdbc.csv