stevenwingett / lifesciencestrainingdatasets

A collection of datasets and accompanying scripts for learning how to analyse data

License: GNU General Public License v3.0

R 0.40% CSS 0.03% HTML 99.53% JavaScript 0.05% Batchfile 0.01%

lifesciencestrainingdatasets's Issues

Sleep_Deprivation_CSD issues

Usual thing with dots in column names.

Docs say "Mean CSD" but data says "Mean CSD Threshold"

Not sure what the Range is supposed to be given that it's only one number.

Data_Record columns somewhat messed up

The logical columns aren't actually logical values but strings. All of the Normally distributed and Machine Learning values are NA
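If the string values are spelled "TRUE"/"FALSE", the columns could be converted with as.logical(). This is only a sketch: the data frame, its column names and its values below are made up for illustration and are not taken from Data_Record.

```r
# Illustrative stand-in for the metadata table; names and values are assumptions.
record <- data.frame(
  Dataset = c("A", "B", "C"),
  Paired  = c("TRUE", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

is.logical(record$Paired)   # FALSE: the column is character

# as.logical() understands "TRUE"/"FALSE" (also "T"/"F" and "true"/"false");
# other spellings such as "yes" become NA and would need explicit mapping.
record$Paired <- as.logical(record$Paired)

is.logical(record$Paired)   # TRUE
```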

Would be good to have some other fields too:

Contains NA
Contains Replicates
Has known problems

Column names still messed up

You still appear to be breaking the column names for column IDs which have spaces in them.

For example:

> colnames(Trainingdata::Biomass_of_Herbivorous_Fish)
[1] "Family"                                "Species"                              
[3] "Morpho..Functional.Group"              "Deep.Mean.Biomass"                    
[5] "Deep.Standard.Deviation.of.Biomass"    "Deep.Percentage"                      
[7] "Shallow.Mean.Biomass"                  "Shallow.Standard.Deviation.of.Biomass"
[9] "Shallow.Percentage"

When you load the files from the original tsv/csv files you need to add the option:

check.names=FALSE

..so that R doesn't try to get clever and 'fix' the column names.
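A small self-contained illustration of the difference, using an inline TSV with made-up values rather than the real file:

```r
# Inline TSV whose header contains spaces (the data values are invented).
tsv <- "Family\tDeep Mean Biomass\nScaridae\t1.5\n"

mangled <- read.delim(text = tsv)                       # default: check.names = TRUE
intact  <- read.delim(text = tsv, check.names = FALSE)  # header kept verbatim

colnames(mangled)  # "Family" "Deep.Mean.Biomass"
colnames(intact)   # "Family" "Deep Mean Biomass"
```

Columns whose names contain spaces then need backticks or [[ ]] to access, e.g. intact[["Deep Mean Biomass"]].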

Making Training Dataset Webpages

I've added to the "devel" branch instructions and a script to make it easier to add new datasets to the R package. Are there already instructions for making the web version of the datasets?

Please let me know and I'll incorporate all this into the devel branch and then merge to the master branch.

Also, are we supposed to use Tidyverse rather than Base R for the website interface scripts?

Documentation wrong for Cuttlefish_Buoyancy

Most of the columns are lower case in the data but upper case in the documentation.

Have a way to list or query the datasets

At the moment I couldn't see a way to list all of the datasets in the package, or get an overview for what they are. Would be nice if there was an annotation file for the datasets which you could then add a function to see. Eg:

list_training_data()

..which gives you back a data frame of all of the names, descriptions, sources and other annotation.
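One possible shape for such a function, as a rough sketch only. It is not existing package code, and it assumes the datasets are registered so that utils::data() can see them; sources and other annotation would still need a separate annotation file. It is demonstrated here on the built-in "datasets" package so that it runs anywhere.

```r
# Hypothetical helper: list the datasets a package exposes via data().
list_training_data <- function(package = "Trainingdata") {
  info <- utils::data(package = package)$results  # matrix with Item and Title columns
  data.frame(
    Name        = info[, "Item"],
    Description = info[, "Title"],
    stringsAsFactors = FALSE
  )
}

# Demonstration on base R's 'datasets' package.
overview <- list_training_data("datasets")
head(overview)
```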

No help file for Biomass_of_Herbivorous_Fish

> ?Biomass_of_Herbivorous_Fish
No documentation for ‘Biomass_of_Herbivorous_Fish’ in specified packages and libraries

Plant_CO2_Uptake issues

There seems to be an additional column (probably left over rownames?) at the start.

Plant is supposed to be a factor, but it's just text.

Same thing for Type and Treatment
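Converting the three columns before saving the data could look roughly like this. The small data frame is an illustrative stand-in with made-up rows, not the packaged dataset.

```r
# Stand-in for the loaded data, with the three columns read as plain text.
co2 <- data.frame(
  Plant     = c("Qn1", "Qn2", "Mc1"),
  Type      = c("Quebec", "Quebec", "Mississippi"),
  Treatment = c("nonchilled", "nonchilled", "chilled"),
  uptake    = c(16.0, 13.6, 10.5),
  stringsAsFactors = FALSE
)

# Convert only the columns that the documentation describes as factors.
factor_cols <- c("Plant", "Type", "Treatment")
co2[factor_cols] <- lapply(co2[factor_cols], factor)

sapply(co2, class)
```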

Add original format files to the repository

Rather than just having data in proprietary R formats it would be useful to have the original files as well. That way we can also include data which has unusual problems during import. We could also give people the option of getting the filename for the original file, or the pre-loaded data and both would be useful.

Errors in Training Data List

Spaces are still being converted to dots.

Brain bodyweight column names don't match the documentation

Body.weight.kg vs Body.weight etc.

Install failure of R package due to local paths

> install_github("StevenWingett/LifeSciencesTrainingDatasets/Trainingdata")
WARNING: Rtools is required to build R packages, but is not currently installed.

Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
Downloading GitHub repo StevenWingett/LifeSciencesTrainingDatasets@master
WARNING: Rtools is required to build R packages, but is not currently installed.

Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
√  checking for file 'C:\Users\andrewss\AppData\Local\Temp\RtmpYVV8Wb\remotes2710a79739f\StevenWingett-LifeSciencesTrainingDatasets-2b9d343\Trainingdata/DESCRIPTION' (792ms)
-  preparing 'Trainingdata':
√  checking DESCRIPTION meta-information ... 
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
     NB: this package now depends on R (>= 3.5.0)
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Biochemical_Oxygen_Demand.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Biomass_of_Herbivorous_Fish.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Bird_Sighting_Texas_2018.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Brain_Bodyweight.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Child_Variants.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Childrens_Indoor_Hobbies_During_Lockdown.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Childrens_Outdoor_Hobbies_During_Lockdown.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Cuttlefish_Buoyancy.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  
File(s) containing such objects:  'Trainingdata/data/Modern_Pollen_Plant_Diversity_Relationships.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: 'Trainingdata/data/Neutrophils.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Plant_CO2_Uptake.rds'  WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects:  'Trainingdata/data/Sleep_Deprivation_CSD.rds'
-  building 'Trainingdata_0.1.0.tar.gz'
   
Installing package into ‘C:/Users/andrewss/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package 'Trainingdata' ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Warning in gzfile(file, "rb") :
  cannot open compressed file 'D:/Documents/R/Trainingdata/data/Biochemical_Oxygen_Demand.rds', probable reason 'No such file or directory'
Error in gzfile(file, "rb") : cannot open the connection
Error: unable to load R code in package 'Trainingdata'
Execution halted
ERROR: lazy loading failed for package 'Trainingdata'
* removing 'C:/Users/andrewss/Documents/R/win-library/4.0/Trainingdata'
Error: Failed to install 'Trainingdata' from GitHub:
  (converted from warning) installation of package ‘C:/Users/andrewss/AppData/Local/Temp/RtmpYVV8Wb/file271048247ad8/Trainingdata_0.1.0.tar.gz’ had non-zero exit status
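The log shows the package code trying to open a file under 'D:/Documents/R/Trainingdata/data/', a path that only exists on the author's machine. One common way to avoid this is to resolve files relative to the installed package with system.file(). The helper below is a hypothetical sketch, not the package's actual loading code, and is demonstrated with a file that ships with base R's 'stats' package.

```r
# Hypothetical helper: find a file inside an installed package instead of
# relying on an absolute path from the developer's machine.
packaged_file <- function(..., package = "Trainingdata") {
  path <- system.file(..., package = package)  # returns "" when not found
  if (path == "") {
    stop("File not found in package '", package, "'")
  }
  path
}

# Demonstration with a file known to exist in every R installation.
desc <- packaged_file("DESCRIPTION", package = "stats")
file.exists(desc)
```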

Modern_Pollen_Plant_Diversity_Relationships column names

Usual problems with column names not matching the docs and gaining extra dots.

Add date parse code to examples for Bird_Sighting_Texas_2018

The date string in this data isn't directly readable by as.Date() so it would be good to add the appropriate format string example so that it can be understood by R.
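The kind of example meant here, as a sketch only: the value and the month/day/year layout below are assumptions for illustration, so the format string would need adjusting to whatever the dataset actually contains.

```r
# Made-up value in an assumed month/day/year layout.
raw <- "03/15/2018"

# as.Date(raw) alone fails because the string is not in a standard format;
# an explicit format string tells R how to read it.
parsed <- as.Date(raw, format = "%m/%d/%Y")

format(parsed)  # "2018-03-15"
```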

Hobbies data percent column meaning

From the documentation it sounds like the percent column in the hobby dataset should be the percentage of time which each hobby occupied, which would mean that it should add up to 100? However it doesn't:

> sum(Childrens_Indoor_Hobbies_During_Lockdown$Percent)
[1] 61.2
> sum(Childrens_Outdoor_Hobbies_During_Lockdown$Percent)
[1] 24

The values I get are somewhat different:

> Childrens_Indoor_Hobbies_During_Lockdown %>% mutate(mypc=100*Number/sum(Number))
         Indoor.Hobby Number Percent       mypc
1       Arts & Crafts    195    12.5 21.8120805
2   Puzzles and Games    162    11.3 18.1208054
3      Building stuff     11     0.8  1.2304251
4            Cleaning      4     0.3  0.4474273
5            Computer     35     2.4  3.9149888
6      Cooking/Baking     32     2.1  3.5794183
7  Designing/Creating      8     0.6  0.8948546
8      Helping Others      1     0.1  0.1118568
9    Imaginitive Play      9     0.6  1.0067114
10             Tablet     12     0.8  1.3422819
11           Learning     12     0.8  1.3422819
12               Lego     51     3.4  5.7046980
13      Making Videos      3     0.2  0.3355705
14              Music     28     2.0  3.1319911
15   Physical ability     37     2.6  4.1387025
16              Phone      2     0.2  0.2237136
17            Reading     67     4.5  7.4944072
18             Sewing      1     0.7  0.1118568
19           Sleeping      1     0.1  0.1118568
20         Television     47     3.2  5.2572707
21               Toys     18     1.2  2.0134228
22         Video Chat      7     0.5  0.7829978
23        Video Games    149    10.2 16.6666667
24            Writing      2     0.1  0.2237136

Remove files which shouldn't be under version control

Some of the files committed to GitHub (and therefore in the R package) should really be removed and just kept locally.

The main ones would be

.Rhistory
.Rproj

..but there may be more.

Error in Neutrophils

TGX.221 vs TGX-221

Errors in Sleep Deprivation Data

I think the annotation on this data is wrong, and there is a relevant column missing. It's wrong in the original paper too.

I'm pretty sure the "mean" value quoted should be the median, and that the "average" value in the paper, which is missing in your data, is actually the mean.

It's a shame that they didn't include the full dataset, because this is quite a nasty piece of data :-)

Errors in Cuttlefish docs

hatching_Date vs hatching_date

days_until_Hatching_trt vs days_until_hatching_trt

sampling_Date vs sampling_date

Odd column names in Biomass_of_Herbivorous_Fish

This probably isn't what was intended. We should have readable informative column names:

> colnames(Biomass_of_Herbivorous_Fish)
[1] "Family"                   "Species"                  "Morpho..Functional.Group" "D.mean"                  
[5] "D.s.d."                   "D."                       "S.mean"                   "S.s.d."                  
[9] "S."

Incorrect column annotation in Child_Variants

dbSNP should be described as a DB identifier for the dbSNP database

For REF 'reference' is spelled incorrectly

For REF and ALT it should be 'sequence' rather than 'base' since some are multi-base strings

ENST is the Ensembl Transcript ID

Column mismatches in Hobby datasets

The docs for both of the hobby datasets say "Hobby" but the actual data has "Indoor.Hobby" and "Outdoor.Hobby"

Wrong column in Sleep Deprivation

Range.CSD.Threshold is a character vector, but the docs say it's numeric

Improvements to Brain Bodyweight

In the brain bodyweight data there are a couple of things which could be better:

The column names are a bit weird, there are extra dots in them, eg Body.weight..kg., just Body.weight.kg would be nicer.
There is a version of this data with an additional column for the type of species (Domesticated, Wild, Extinct) which would be a useful addition.
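If the names are to stay dot-separated, the extra dots could be tidied after loading. A small sketch: only Body.weight..kg. comes from the data above, and the second name is an assumed example.

```r
# "Brain.weight..g." is a made-up second column name for illustration.
cols <- c("Body.weight..kg.", "Brain.weight..g.")

# Collapse runs of dots into one, then drop any trailing dot.
tidy <- sub("\\.$", "", gsub("\\.+", ".", cols))

tidy  # "Body.weight.kg" "Brain.weight.g"
```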

One of the dataset names has a newline in it:

Something like this wouldn't be too bad in real data, but we shouldn't have it in our metadata table:

> Training_Data_List$Data_set_Name[4]
[1] "Herbiverous  fish in reefs of\nArraial do Cabo"
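One way to normalise the name when building the metadata table is to collapse every run of whitespace, newlines included, into a single space. A sketch using the string shown above:

```r
name <- "Herbiverous  fish in reefs of\nArraial do Cabo"

# \\s+ matches spaces, tabs and newlines; trimws() removes any leading
# or trailing whitespace left over.
clean <- trimws(gsub("\\s+", " ", name))

clean  # "Herbiverous fish in reefs of Arraial do Cabo"
```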