michaelkotrous / daedalus Goto Github PK
View Code? Open in Web Editor NEWConvert NTSB eADMS tables to research-friendly formats
License: MIT License
Convert NTSB eADMS tables to research-friendly formats
License: MIT License
Adding the NULL values corrected problems importing the data into R, but Stata treats it as a string instead of a missing value, which results in variables that need to be treated as integers being treated as strings.
Potential fixes:
aircra.sh
that allows user to specify if empty fields should be labeled NULL or left empty.README.md
for do file that converts NULL to missing and variables to the correct format for analysis.The shell script is currently stripping the NULL values generated by the csv export from MySQL. This is causing an error in R because it is detecting the end of a row at column 203 instead of 204 when the final value is NULL in the MySQL table.
Hi, thanks for creating this repository, I'm looking forward to sifting through this data.
Aircraft table is created, but getting an error while importing the csv file on the first datetime value it encounters in aircraft.csv.
It seems like the csv has an incorrect date format (12/16/1998 0:00:00 instead of 1998-12-16 00:00:00) although I tried hardcoding different date formats and still got the error.
MacOS, MySQL 8
I've assumed thus far that an event involving one aircraft would identify the aircraft with key equal to 1, two aircraft events would have aircraft keys 1 and 2, and so on. This is not always the case! I work a fix for my working dataset with R code, but a fix in the dataset build scripts in MySQL is needed.
My R fix is pasted below.
# occdata is a subset of the dataset produced by daedalus I'm currently working with
# accordingly, the next line is needed to recount the 'levels' of the event id variable present
occdata$ev_id <- factor(occdata$ev_id)
# create data frame of event ids by their aircraft keys
# note: 3 is the highest aircraft_key present in my dataset. Tweak this code to fit your slice of the dataset as necessary
aircraft_keys <- as.data.frame.matrix(table(occdata$ev_id,occdata$aircraft_key))
names(aircraft_keys)[1] <- "aircraft1"
names(aircraft_keys)[2] <- "aircraft2"
names(aircraft_keys)[3] <- "aircraft3"
aircraft_keys$ev_id <- row.names(aircraft_keys)
aircraft_keys <- aircraft_keys[c("ev_id", "aircraft1", "aircraft2", "aircraft3")]
# denote classes of errors in aircraft keys that will need to be fixed
aircraft_keys$error1 <- 0
aircraft_keys$error1[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 1 & aircraft_keys$aircraft3 == 0] <- 1
aircraft_keys$error2 <- 0
aircraft_keys$error2[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 1 & aircraft_keys$aircraft3 == 1] <- 1
aircraft_keys$error2[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 0 & aircraft_keys$aircraft3 == 1] <- 1
aircraft_keys$error3 <- 0
aircraft_keys$error3[aircraft_keys$aircraft1 == 1 & aircraft_keys$aircraft2 == 0 & aircraft_keys$aircraft3 == 1] <- 1
aircraft_key_errors <- subset(aircraft_keys, error1 == 1 | error2 == 1 | error3 == 1)
aircraft_key_errors$ev_id <- factor(aircraft_key_errors$ev_id)
error1_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error1 == 1]
error2_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error2 == 1]
error3_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error3 == 1]
for(id in error1_ids) {
occdata$aircraft_key[occdata$aircraft_key == 2 & occdata$ev_id == id] <- 1
}
for(id in error2_ids) {
occdata$aircraft_key[occdata$aircraft_key == 3 & occdata$ev_id == id] <- 1
}
for(id in error3_ids) {
occdata$aircraft_key[occdata$aircraft_key == 3 & occdata$ev_id == id] <- 2
}
According to the NTSB data dictionary, cause factors were moved from the seq_of_events
table to the findings
table beginning in 2008. This explains the discontinuity in occurrence codes, phase of flight codes, etc. from Jan. 1, 2008 to present (see chart below)
It would be nice to integrate the finding table into the final dataset. Careful consideration will have to be made as to how the differences in the table structures will present themselves to analysis across the entire time series and conclusions that can be drawn.
This appears to have no affects on MacOS, but it prevents R on Windows from importing the entire database. Refer to line 187 of aircraft-GAaccidents-final.csv
for first occurrence of this character, and to Wikipedia for explanation of meaning behind the character.
Not sure at which point in the conversion process this problem is being introduced. The character does not appear in the raw csv files included in directory ntsb_mdb_export
. Perhaps explicitly setting the encoding of the exported csv will strip this character from the dataset.
My custom mysql script defaults to requiring password for user, whereas mysql commands, like mysqldump, default to not requiring a password unless the -p
flag is specified. Modifying the script to default to no password and using the -p
flag to prompt for a password would make my script consistent with mysql commands, which will make the tool more user friendly.
Columns modifier_code
and person_code
need to concatenate multiple values when more than one is present for a given occurrence in the dataset. This is caused by not encoding the variables correctly in the aircraft
and occurrences
tables.
See this Stack Overflow answer for explanation of problem. I'll have to find another way in cra.sh script to prepend MySQL view export with header so that integers won't be wrapped in quotes and treated as strings in R and Stata.
Line number correct for export without header. pilot_id
is '20001213X28774-1-1' for reference. Field causing error is coded as "THE 180 CLUB" C/O JAMES DIERKE"
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.