School Shootings Introduction - data, purpose

Utilizing R, data maintained by the Washington Post was employed to model school shooting casualties with a negative binomial model (which models a Poisson distribution with overdipersion). K-means clustering was then used to classify school shootings into three distinct groups.

Publication

The findings were published in Spire in October 2018. I reran the solution in 2020 with an updated dataset and the same findings hold.

Caveats

The dataset utilized does not contain enough observations to offer any reliable predictive power. This project attempted to be fully descriptive in nature. Although some of the findings can be disputed, through solid statistical techniques, there were interesting insights gleaned, especially about resource officers' presence and how that is suggestive of increasing casualties in a given school shooting, how poverty affects casualties, and which types of weapons are most prolific in these situations.

Research Question

Can the number of casualties per school shooting be described statistically with the factors in this dataset?

Process

The data was cleaned and missing data was given its own category of unknown. All data was converted to numeric. Independent research was perfomed to fill in all missing data, especially the age of the shooter(s), where possible. Ages of shooters were grouped into buckets to reduce error rate of filling data.

casualties	resource_officer	two_shooters	wt.known	wt.handgun	wt.shotgun	wt.rifle	ws.known	ws.friend	minority.proportion	public	reas.known	reas.targeted	reas.indiscriminate	gender.known	gender.male	age.adult	age.teen	lunch.known	lunch2550	lunch5075	lunch75.plus	shooter.deceased	shooter.race.known	shooter.white
34	1	1	1	1	1	0	1	1	0.092620865	1	1	0	1	1	1	1	0	1	0	0	0	1	1	1
1	0	0	1	1	0	0	0	0	0.991496599	1	1	1	0	1	1	0	1	1	0	0	1	0	0	0
6	1	0	1	1	0	1	0	0	0.131482834	1	1	0	1	1	1	0	1	1	0	0	0	0	1	1
1	1	0	1	1	0	0	1	1	0.933587544	1	1	1	0	1	1	0	1	1	0	1	0	0	0	0
1	0	0	0	0	0	0	0	0	0.964157706	1	1	1	0	1	1	0	0	1	1	0	0	0	0	0

The naming of the columns follow a logical convention. Most of the meanings of the columns can be deduced intutively. The following abbreviations were used:

ws: weapon source. Example: ws.relative_ext = the weapon came from an extended relative
wt: weapon type: Example: wt.shotgun = the shooter used a shotgun
reas: reason. Example: reas.accidental = the shooting was accidental, reas.targeted = the shooter was going after a specific individual(s)
lunch: the percentage of students eligibile for a reduced-priced lunch program. Proxy variable for poverty. Example: lunch5075 = between 50% and 75% of the student population is eligible for this program

The final dataset included 238 rows of data (each row represents a shooting), with 32 independent variables to predict 1 target (casualties). Timespan of the dataset covers 4/20/1999 through 4/3/2019. This should represent every school shooting within that timeframe.

A linear model, poisson model, and negative binomial model were each specified with the cleaned model inputs. The negative binomial returned the best information criteria.

Model	LL	AIC	BIC
Linear	-629.1234	1326.2468	1444.3040
Poisson	-390.5976	847.1951	961.7801
NB	-358.5415	785.0829	903.1401

A summary of the model outputs:

Call:
glm.nb(formula = casualties ~ ., data = D.main, init.theta = 2.843998801, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3518  -0.9645  -0.1604   0.4071   2.7071  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -0.57227    0.53524  -1.069  0.28499    
resource_officer     0.56893    0.15737   3.615  0.00030 ***
two_shooters         0.38337    0.43444   0.882  0.37753    
wt.known            -0.36334    0.37096  -0.979  0.32736    
wt.handgun           0.50435    0.27736   1.818  0.06900 .  
wt.shotgun           0.48950    0.28514   1.717  0.08603 .  
wt.rifle             0.90682    0.28543   3.177  0.00149 ** 
ws.known             0.34667    0.27370   1.267  0.20529    
ws.illegal          -0.45373    0.54674  -0.830  0.40660    
ws.friend            0.21905    0.41421   0.529  0.59692    
ws.relative_ext      0.34126    0.53964   0.632  0.52714    
ws.relative_imm      0.58613    0.29489   1.988  0.04685 *  
ws.police           -0.22585    0.41142  -0.549  0.58303    
minority.proportion  0.55106    0.31299   1.761  0.07830 .  
public              -0.54770    0.49757  -1.101  0.27101    
reas.known           0.02566    0.52809   0.049  0.96125    
reas.accidental      0.68259    0.35277   1.935  0.05300 .  
reas.targeted        0.10908    0.29894   0.365  0.71520    
reas.indiscriminate  0.77559    0.31505   2.462  0.01382 *  
gender.known         1.26671    1.76662   0.717  0.47336    
gender.male         -1.48035    1.71844  -0.861  0.38899    
gender.female       -1.34100    1.75282  -0.765  0.44424    
age.adult            0.06551    0.28200   0.232  0.81631    
age.teen            -0.18750    0.29357  -0.639  0.52302    
age.child           -0.70879    0.58645  -1.209  0.22681    
lunch.known          0.77684    0.38564   2.014  0.04397 *  
lunch2550           -0.70990    0.22885  -3.102  0.00192 ** 
lunch5075           -0.67489    0.25244  -2.673  0.00751 ** 
lunch75.plus        -0.65428    0.28031  -2.334  0.01959 *  
shooter.deceased     0.14610    0.20016   0.730  0.46544    
shooter.race.known  -1.40270    1.25942  -1.114  0.26538    
shooter.white        1.68253    1.25697   1.339  0.18071    
shooter.minority     1.38222    1.26253   1.095  0.27360    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(2.844) family taken to be 1)

    Null deviance: 453.86  on 237  degrees of freedom
Residual deviance: 210.78  on 205  degrees of freedom
AIC: 785.08

Number of Fisher Scoring iterations: 1


              Theta:  2.844 
          Std. Err.:  0.627 

 2 x log-likelihood:  -717.083

Distribution of the dependent variable follows a Poisson shape with overdispersion:

The statistically significant inpupts at the 95% confidence level, which are the ones with a p-value less than 0.05, from the best model were then examined and interpreted. These included:

resouce_officer (positive effect on casualties -- this can have several reasons, and I'm not sure there is one definitive answer)
wt.handgun (positive effect on casualties compared to an unknown weapon type)
wt.shotgun (positive effect on casualties compared to an unknown weapon type)
wt.rifle (highest positive effect on casualties compared to an unknown weapon type)
reas.indiscriminate (positive effect on casualties to an unknown reason)
school lunch variables (as the proportion of students eligible for these reduced-priced lunch programs go up, casualties actually go down--probably due to the decreased notoriety one would get from targeting a low-income area vs. an affluent one)

This is not an all-inclusive list of the statistically significant variables, but these were the most interesting to interpret.

Some interesting non-statistically significant inputs were:

ws.illegal
public

After the Negative Binomial model was interpreted, a K-Means Cluster analysis was performed to further understand how the datapoints relate to one another. Using this method, school shootings were divided into three sections and the findings from the statistical models were generally confirmed.

Key Findings

Resource officers on campus lead to higher casualties, except in high-poverty schools, all esle constant.
More impoverished schools see fewer casualties generally.
Rifles are the weapons which cause the highest amounts of casualties.
The legaility/illegality of a weapon is inconclusive as to how it affects casualties.

Casualties decrease as poverty increases, and the fitted line representing schools with resource officers present is higher on the casualties axis

Rifles are indicative of higher casualties

Illegality of the weapon in causing casualties is inconclusive

Answering questions involving the above points was not my aim in doing this analysis, but these were the things I found most interesting at the conclusion.

Cluster analysis

A cluster analysis reinforces the implications of the negative binomial model. After trying several different cluster analysis options, I decided to reduce the dimensionality of the dataset using two principal components, capturing 20.6% of the variance in the dataset. Then, a k-means cluster analysis with three centers seemed to return an intuitive breakdown of the data.

With three centers, the clusters can be viewed in the following visualization:

The interpretation can be thought of in the following manner:

Cluster 1: High-casualty, low-poverty -- These are the points in the bottom-right -- highlighted by several notorious shootings, such as Marjory Stoneman Douglas and Columbine
Cluster 2: Low-casualty, high-poverty -- These are the points in the top-middle
Cluster 3: Low-casualty, low-povery -- These are the points in the bottom-left

Using these interpretations, we can think of the first principal component as the high-casualty, low-poverty component, and the second principal component as the high-poverty, low-casualty component. Indeed, looking at the rotation, that is what is suggested:

Variable	PC1 Rotation	PC2 Rotation
casualties	0.204	-0.209
lunch2550	0.123	-0.010
lunch5075	-0.028	0.265
lunch75.plus	-0.137	0.106

The lunch variables are binary - first one, lunch2550, is 1 if the school has between 25-50% of the student body eligible for a reduced-price lunch program, and 0 otherwise, and so forth through the three lunch-related variables - these were the proxies used to represent poverty levels in the analysis

Other breakdowns of the data seem to confirm these findings:

High casulaties cluster to the bottom-right of the graph

High poverty clusters to the middle / left of the graph

mikekeith52 / school-shootings-supervised-unsupervised-learning Goto Github PK

school-shootings-supervised-unsupervised-learning's Introduction

School Shootings Introduction - data, purpose

Publication

Caveats

Research Question

Process

Key Findings

Cluster analysis

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent