michaelkotrous / daedalus


Convert NTSB eADMS tables to research-friendly formats

License: MIT License

Languages: Shell 3.99%, R 96.01%

ntsb general-aviation dataset-generation daedalus general-aviation-accidents aviation


daedalus's Issues

NULL does not import into Stata nicely

Writing NULL into empty fields fixed the import errors in R, but Stata reads NULL as a string rather than as a missing value. As a result, variables that should be integers are imported as strings.

Potential fixes:

- Add an option to aircra.sh that lets the user choose whether empty fields are written as NULL or left empty.
- Post a Stata snippet in README.md for a do-file that converts NULL to missing and casts variables to the correct types for analysis.
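A minimal sketch of the first option, assuming the export is a plain comma-separated file with no commas embedded inside quoted fields. The function name `null_to_empty` is hypothetical and is not part of aircra.sh; it only illustrates blanking fields whose entire value is NULL so that Stata can read them as missing.

```shell
#!/bin/sh
# Hypothetical helper: blank out fields that are exactly NULL.
# Assumption: fields contain no embedded commas, so splitting on "," is safe.
null_to_empty() {
    awk 'BEGIN { FS = OFS = "," }
         {
             for (i = 1; i <= NF; i++)
                 if ($i == "NULL") $i = ""   # leave every other value untouched
             print
         }'
}
```

Usage would look like `null_to_empty < input.csv > output.csv`, with both file names chosen by the user. Values that merely contain the text NULL are left alone, because the comparison is against the whole field.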

Empty values throw errors with import into R

The shell script currently strips the NULL values that the MySQL csv export generates. This breaks the import into R: when the final value of a row is NULL in the MySQL table, R detects the end of the row at column 203 instead of 204.
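One way to locate the affected rows before importing is to count fields per line. This is a rough diagnostic sketch, not part of the build scripts: `report_bad_rows` is a hypothetical name, and the naive comma split assumes no commas inside quoted fields, so rows with free-text columns may be reported spuriously.

```shell
#!/bin/sh
# Hypothetical diagnostic: list rows whose field count differs from the expected one.
# $1 = path to the csv, $2 = expected number of columns (204 in this issue).
# Output format is "line_number:field_count", one entry per mismatching row.
report_bad_rows() {
    awk -F',' -v want="$2" 'NF != want { printf "%d:%d\n", NR, NF }' "$1"
}
```

For the dataset in this issue the call would be along the lines of `report_bad_rows your-export.csv 204`, where the file name is whatever the export produced.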

ERROR 1292 (22007) at line 225: Truncated incorrect date value: '12/16/1998 0:00:00'

Hi, thanks for creating this repository, I'm looking forward to sifting through this data.

The aircraft table is created, but the import of aircraft.csv fails on the first datetime value it encounters.

The csv appears to use a date format that MySQL rejects (12/16/1998 0:00:00 instead of 1998-12-16 00:00:00). I tried hardcoding different date formats and still got the error.

Environment: macOS, MySQL 8
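A possible workaround is to rewrite the timestamps before the import. The sketch below assumes every date in the csv follows the month/day/year pattern shown in the error message; `fix_dates` is a hypothetical helper and has not been tested against the real aircraft.csv.

```shell
#!/bin/sh
# Hypothetical helper: rewrite "M/D/YYYY H:MM:SS" as "YYYY-MM-DD HH:MM:SS".
# Assumption: dates are month/day/year, as in '12/16/1998 0:00:00'.
fix_dates() {
    awk '{
        line = $0; out = ""
        # Walk the line, converting each timestamp that matches the pattern.
        while (match(line, "[0-9]+/[0-9]+/[0-9][0-9][0-9][0-9] [0-9]+:[0-9][0-9]:[0-9][0-9]")) {
            stamp = substr(line, RSTART, RLENGTH)
            split(stamp, p, "[/ :]")   # p = month, day, year, hour, minute, second
            out = out substr(line, 1, RSTART - 1) \
                  sprintf("%04d-%02d-%02d %02d:%02d:%02d", p[3], p[1], p[2], p[4], p[5], p[6])
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }'
}
```

Run as a filter, for example `fix_dates < aircraft.csv > aircraft-fixed.csv` (the output name is arbitrary). Text outside the matched timestamps passes through unchanged.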

Aircraft Keys do not reflect number of aircraft in event

I've assumed thus far that an event involving one aircraft would identify the aircraft with key equal to 1, that two-aircraft events would have aircraft keys 1 and 2, and so on. This is not always the case! I worked around it in my working dataset with R code, but a fix in the MySQL dataset build scripts is needed.

My R fix is pasted below.

# occdata is a subset of the dataset produced by daedalus I'm currently working with
# accordingly, the next line is needed to recount the 'levels' of the event id variable present
occdata$ev_id <- factor(occdata$ev_id)

# create data frame of event ids by their aircraft keys
# note: 3 is the highest aircraft_key present in my dataset. Tweak this code to fit your slice of the dataset as necessary
aircraft_keys <- as.data.frame.matrix(table(occdata$ev_id,occdata$aircraft_key))
names(aircraft_keys)[1] <- "aircraft1"
names(aircraft_keys)[2] <- "aircraft2"
names(aircraft_keys)[3] <- "aircraft3"
aircraft_keys$ev_id <- row.names(aircraft_keys)
aircraft_keys <- aircraft_keys[c("ev_id", "aircraft1", "aircraft2", "aircraft3")]

# denote classes of errors in aircraft keys that will need to be fixed
aircraft_keys$error1 <- 0
aircraft_keys$error1[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 1 & aircraft_keys$aircraft3 == 0] <- 1

aircraft_keys$error2 <- 0
aircraft_keys$error2[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 1 & aircraft_keys$aircraft3 == 1] <- 1
aircraft_keys$error2[aircraft_keys$aircraft1 == 0 & aircraft_keys$aircraft2 == 0 & aircraft_keys$aircraft3 == 1] <- 1

aircraft_keys$error3 <- 0
aircraft_keys$error3[aircraft_keys$aircraft1 == 1 & aircraft_keys$aircraft2 == 0 & aircraft_keys$aircraft3 == 1] <- 1

aircraft_key_errors <- subset(aircraft_keys, error1 == 1 | error2 == 1 | error3 == 1)
aircraft_key_errors$ev_id <- factor(aircraft_key_errors$ev_id)

error1_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error1 == 1]
error2_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error2 == 1]
error3_ids <- aircraft_key_errors$ev_id[aircraft_key_errors$error3 == 1]

for(id in error1_ids) {
    occdata$aircraft_key[occdata$aircraft_key == 2 & occdata$ev_id == id] <- 1
}

for(id in error2_ids) {
    occdata$aircraft_key[occdata$aircraft_key == 3 & occdata$ev_id == id] <- 1
}

for(id in error3_ids) {
    occdata$aircraft_key[occdata$aircraft_key == 3 & occdata$ev_id == id] <- 2
}

Occurrences & Sequence Events data recoded in 2008

According to the NTSB data dictionary, cause factors were moved from the seq_of_events table to the findings table beginning in 2008. This explains the discontinuity in occurrence codes, phase of flight codes, etc. from Jan. 1, 2008 to present (see chart below).

It would be nice to integrate the findings table into the final dataset. That will require careful thought about how the differences between the two table structures affect analysis across the entire time series, and about which conclusions can still be drawn.

Exported dataset csv includes ASCII char SUB

This appears to have no effect on macOS, but it prevents R on Windows from importing the entire dataset. Refer to line 187 of aircraft-GAaccidents-final.csv for the first occurrence of this character, and to Wikipedia for an explanation of its meaning.

I'm not sure at which point in the conversion process the character is introduced. It does not appear in the raw csv files included in the ntsb_mdb_export directory. Explicitly setting the encoding of the exported csv might strip it from the dataset.
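Until the source of the character is found, it can be detected and removed after the export. SUB is byte 0x1A (octal 032), which Windows has historically treated as an end-of-file marker in text mode. The sketch below uses two hypothetical helpers, `first_sub_line` and `strip_sub`, built only on standard `awk` and `tr`.

```shell
#!/bin/sh
# Hypothetical helper: print the number of the first line containing SUB (0x1A).
# Prints nothing when the file is clean.
first_sub_line() {
    LC_ALL=C awk 'index($0, "\032") { print NR; exit }' "$1"
}

# Hypothetical helper: delete every SUB byte from standard input.
strip_sub() {
    LC_ALL=C tr -d '\032'
}
```

A typical run would be `strip_sub < aircraft-GAaccidents-final.csv > cleaned.csv`, where `cleaned.csv` is an arbitrary output name. Note that this deletes the byte outright rather than replacing it with a placeholder.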

Consistent Use of `-p` flag

My custom mysql script requires a password for the user by default, whereas the mysql commands, such as mysqldump, send no password unless the -p flag is specified. Changing the script to default to no password, and to prompt for one only when -p is given, would make it consistent with the mysql commands and make the tool more user friendly.
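The proposed behavior could be sketched with `getopts` as follows. This is an illustration only: `build_mysql_args` and its `-u` option are hypothetical and do not reflect the script's current interface. The function prints the arguments that would be handed to the mysql client.

```shell
#!/bin/sh
# Hypothetical option parser: no password by default, prompt only when -p is given.
build_mysql_args() {
    user=""
    want_password=0
    OPTIND=1                       # reset so the function can be called repeatedly
    while getopts "u:p" opt; do
        case "$opt" in
            u) user="$OPTARG" ;;
            p) want_password=1 ;;
            *) return 2 ;;         # unknown option: getopts already printed an error
        esac
    done
    args=""
    [ -n "$user" ] && args="-u $user"
    # A bare -p makes the mysql client prompt for the password interactively.
    [ "$want_password" -eq 1 ] && args="${args:+$args }-p"
    printf '%s\n' "$args"
}
```

With this shape, `build_mysql_args -u alice` yields `-u alice`, and adding `-p` appends the flag so that mysql itself prompts for the password, matching how mysqldump behaves.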


Two variables not concatenating correctly

Included csv header in export script causes all values to be treated as strings

Line 15352 has extraneous double-quote causing import error
