lifesciencestrainingdatasets's People

Contributors

casey-brown, cgaud, s-andrews

lifesciencestrainingdatasets's Issues

Sleep_Deprivation_CSD issues

Usual thing with dots in column names.

Docs say "Mean CSD" but data says "Mean CSD Threshold"

Not sure what the Range is supposed to be given that it's only one number.

Data_Record columns somewhat messed up

The logical columns aren't actually logical values but strings. All of the Normally distributed and Machine Learning values are NA

Would be good to have some other fields too:

  • Contains NA
  • Contains Replicates
  • Has known problems
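
A minimal sketch of how the string columns could be converted, and how a "Contains NA" field might be derived. The column name `Paired`, the example rows and the helper `contains_na()` are hypothetical; the real Data_Record layout may differ.

```r
# Hypothetical stand-in for the metadata table; column names are assumptions
data_record <- data.frame(
  Data_set_Name = c("example_a", "example_b"),
  Paired = c("TRUE", "FALSE"),  # logical values stored as strings
  stringsAsFactors = FALSE
)

# Convert the string column into real logical values
data_record$Paired <- as.logical(data_record$Paired)

# One possible definition of a "Contains NA" field for a dataset
contains_na <- function(df) any(is.na(df))

# Made-up dataset with one missing value
example_data <- data.frame(x = c(1, NA), y = c("a", "b"))
example_flag <- contains_na(example_data)
```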

Column names still messed up

You still appear to be breaking the column names for column IDs which have spaces in them.

For example:

> colnames(Trainingdata::Biomass_of_Herbivorous_Fish)
[1] "Family"                                "Species"                              
[3] "Morpho..Functional.Group"              "Deep.Mean.Biomass"                    
[5] "Deep.Standard.Deviation.of.Biomass"    "Deep.Percentage"                      
[7] "Shallow.Mean.Biomass"                  "Shallow.Standard.Deviation.of.Biomass"
[9] "Shallow.Percentage"

When you load the files from the original tsv/csv files you need to add the option:

check.names=FALSE

..so that R doesn't try to get clever and 'fix' the column names.
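
As a small illustration of the difference, the sketch below reads the same tab-separated text twice. The inline file contents are made up for the example; only the `check.names` argument matters.

```r
# Made-up two-column file contents, supplied inline so the example is self-contained
tsv_text <- "Family\tDeep Mean Biomass\nAcanthuridae\t1.5\n"

# Default behaviour: R rewrites names into syntactically valid ones
mangled <- read.delim(text = tsv_text)

# With check.names = FALSE the header is kept exactly as written
preserved <- read.delim(text = tsv_text, check.names = FALSE)

colnames(mangled)
colnames(preserved)
```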

Making Training Dataset Webpages

I've added instructions and a script to the "devel" branch to make it easier to add new datasets to the R package. Are there already instructions for making the web version of the datasets?

Please let me know and I'll incorporate all this into the devel branch and then merge to the master branch.

Also, are we supposed to use Tidyverse rather than Base R for the website interface scripts?

Have a way to list or query the datasets

At the moment I can't see a way to list all of the datasets in the package, or to get an overview of what they are. It would be nice if there was an annotation file for the datasets, which you could then expose through a function. Eg:

list_training_data()

..which gives you back a data frame of all of the names, descriptions, sources and other annotation.
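
A rough sketch of what such a function could look like. The annotation table, its column names and the `pattern` argument are all assumptions made for illustration; in the package the table would presumably be loaded from an annotation file.

```r
# Hypothetical annotation table with placeholder text
training_data_annotation <- data.frame(
  Name = c("Plant_CO2_Uptake", "Brain_Bodyweight"),
  Description = c("placeholder description", "placeholder description"),
  Source = c("placeholder source", "placeholder source"),
  stringsAsFactors = FALSE
)

# Return the whole table, or only the rows whose Name matches a pattern
list_training_data <- function(pattern = NULL,
                               annotation = training_data_annotation) {
  if (is.null(pattern)) {
    return(annotation)
  }
  keep <- grepl(pattern, annotation$Name, ignore.case = TRUE)
  annotation[keep, , drop = FALSE]
}

all_sets <- list_training_data()
brain_sets <- list_training_data("brain")
```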

Plant_CO2_Uptake issues

There seems to be an additional column (probably left over rownames?) at the start.

Plant is supposed to be a factor, but it's just text.

Same thing for Type and Treatment
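
If the extra column really is a set of exported row names, one possible fix at import time is sketched below. The inline CSV text and its values are made up to mimic that situation; they are not taken from the actual file.

```r
# Made-up CSV text with a leading unnamed column of row names
csv_text <- paste(
  '"","Plant","Type","Treatment","conc","uptake"',
  '"1","Qn1","Quebec","nonchilled",95,16',
  '"2","Mc1","Mississippi","chilled",95,10.5',
  sep = "\n"
)

# Naive import: the row names show up as an extra first column, text stays text
as_exported <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# row.names = 1 absorbs the first column; stringsAsFactors = TRUE makes factors
cleaned <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

str(cleaned)
```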

Add original format files to the repository

Rather than just having data in proprietary R formats, it would be useful to have the original files as well. That way we can also include data which has unusual problems during import. We could also give people the option of getting either the filename of the original file or the pre-loaded data; both would be useful.

Install failure of R package due to local paths

> install_github("StevenWingett/LifeSciencesTrainingDatasets/Trainingdata")
WARNING: Rtools is required to build R packages, but is not currently installed.

Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
Downloading GitHub repo StevenWingett/LifeSciencesTrainingDatasets@master
WARNING: Rtools is required to build R packages, but is not currently installed.

Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
√  checking for file 'C:\Users\andrewss\AppData\Local\Temp\RtmpYVV8Wb\remotes2710a79739f\StevenWingett-LifeSciencesTrainingDatasets-2b9d343\Trainingdata/DESCRIPTION' (792ms)
-  preparing 'Trainingdata':
√  checking DESCRIPTION meta-information ... 
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
     NB: this package now depends on R (>= 3.5.0)
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Biochemical_Oxygen_Demand.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Biomass_of_Herbivorous_Fish.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Bird_Sighting_Texas_2018.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Brain_Bodyweight.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Child_Variants.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Childrens_Indoor_Hobbies_During_Lockdown.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Childrens_Outdoor_Hobbies_During_Lockdown.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Cuttlefish_Buoyancy.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Modern_Pollen_Plant_Diversity_Relationships.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Neutrophils.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Plant_CO2_Uptake.rds'
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects: 'Trainingdata/data/Sleep_Deprivation_CSD.rds'
-  building 'Trainingdata_0.1.0.tar.gz'
   
Installing package into ‘C:/Users/andrewss/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package 'Trainingdata' ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Warning in gzfile(file, "rb") :
  cannot open compressed file 'D:/Documents/R/Trainingdata/data/Biochemical_Oxygen_Demand.rds', probable reason 'No such file or directory'
Error in gzfile(file, "rb") : cannot open the connection
Error: unable to load R code in package 'Trainingdata'
Execution halted
ERROR: lazy loading failed for package 'Trainingdata'
* removing 'C:/Users/andrewss/Documents/R/win-library/4.0/Trainingdata'
Error: Failed to install 'Trainingdata' from GitHub:
  (converted from warning) installation of package ‘C:/Users/andrewss/AppData/Local/Temp/RtmpYVV8Wb/file271048247ad8/Trainingdata_0.1.0.tar.gz’ had non-zero exit status
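
The failing path in the log ('D:/Documents/R/Trainingdata/data/...') is an absolute location that does not exist on the installing machine. One way to avoid this, sketched below, is to resolve the file relative to the installed package at run time. The helper name `load_training_dataset()` and the `extdata` location are assumptions, not the package's actual layout.

```r
# Hypothetical loader: looks the file up inside the installed package
# instead of relying on a hard-coded absolute path
load_training_dataset <- function(name, package = "Trainingdata") {
  # Assumes the .rds files are installed under <package>/extdata
  path <- system.file("extdata", paste0(name, ".rds"), package = package)
  if (!nzchar(path)) {
    stop("Could not find '", name, ".rds' in package '", package, "'")
  }
  readRDS(path)
}
```

`system.file()` returns an empty string when the file or package is missing, so the helper fails with a clear message rather than a `gzfile` connection error.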

Hobbies data percent column meaning

From the documentation it sounds like the percent column in the hobby dataset should be the percentage of time which each hobby occupied, which would mean that it should add up to 100? However it doesn't:

> sum(Childrens_Indoor_Hobbies_During_Lockdown$Percent)
[1] 61.2
> sum(Childrens_Outdoor_Hobbies_During_Lockdown$Percent)
[1] 24

The values I get are somewhat different:

> Childrens_Indoor_Hobbies_During_Lockdown %>% mutate(mypc=100*Number/sum(Number))
         Indoor.Hobby Number Percent       mypc
1       Arts & Crafts    195    12.5 21.8120805
2   Puzzles and Games    162    11.3 18.1208054
3      Building stuff     11     0.8  1.2304251
4            Cleaning      4     0.3  0.4474273
5            Computer     35     2.4  3.9149888
6      Cooking/Baking     32     2.1  3.5794183
7  Designing/Creating      8     0.6  0.8948546
8      Helping Others      1     0.1  0.1118568
9    Imaginitive Play      9     0.6  1.0067114
10             Tablet     12     0.8  1.3422819
11           Learning     12     0.8  1.3422819
12               Lego     51     3.4  5.7046980
13      Making Videos      3     0.2  0.3355705
14              Music     28     2.0  3.1319911
15   Physical ability     37     2.6  4.1387025
16              Phone      2     0.2  0.2237136
17            Reading     67     4.5  7.4944072
18             Sewing      1     0.7  0.1118568
19           Sleeping      1     0.1  0.1118568
20         Television     47     3.2  5.2572707
21               Toys     18     1.2  2.0134228
22         Video Chat      7     0.5  0.7829978
23        Video Games    149    10.2 16.6666667
24            Writing      2     0.1  0.2237136
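
If Percent was calculated against some total other than sum(Number), for example the number of children surveyed (an assumption, not something the documentation states), the implied total can be recovered per row. The values below are copied from the first three rows of the table above.

```r
# Number and Percent from the first three rows of the indoor hobbies table
number  <- c(195, 162, 11)
percent <- c(12.5, 11.3, 0.8)

# If Percent = 100 * Number / total, then total = 100 * Number / Percent
implied_total <- 100 * number / percent
implied_total
```

If the implied totals are roughly constant across rows, Percent is probably a share of a fixed total rather than a share of the hobby counts.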

Errors in Sleep Deprivation Data

I think the annotation on this data is wrong, and there is a relevant column missing. It's wrong in the original paper too.

I'm pretty sure the "mean" value quoted should be the median, and that the "average" value on the paper, which is missing in your data is actually the mean.

It's a shame that they didn't include the full dataset, because this is quite a nasty piece of data :-)

Errors in Cuttlefish docs

hatching_Date vs hatching_date

days_until_Hatching_trt vs days_until_hatching_trt

sampling_Date vs sampling_date
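
Mismatches like these could be caught automatically by comparing the documented names with the real column names. In this sketch the left-hand spellings above are treated as the documented ones and the right-hand spellings as the data, which is an assumption; the helper name is hypothetical.

```r
# Assumed split: first spelling from the docs, second from the data
documented <- c("hatching_Date", "days_until_Hatching_trt", "sampling_Date")
actual     <- c("hatching_date", "days_until_hatching_trt", "sampling_date")

# Documented names with no exact match among the real column names
undocumented_mismatches <- function(documented, actual) {
  setdiff(documented, actual)
}

mismatches <- undocumented_mismatches(documented, actual)

# TRUE where a mismatch differs from a real column only by letter case
case_only <- tolower(mismatches) %in% tolower(actual)
```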

Odd column names in Biomass_of_Herbivorous_Fish

This probably isn't what was intended. We should have readable informative column names:

> colnames(Biomass_of_Herbivorous_Fish)
[1] "Family"                   "Species"                  "Morpho..Functional.Group" "D.mean"                  
[5] "D.s.d."                   "D."                       "S.mean"                   "S.s.d."                  
[9] "S." 
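
A possible renaming, sketched with a lookup table. The readable names are taken from the later column listing in the "Column names still messed up" issue, matched by position; that pairing is an assumption.

```r
old_names <- c("Family", "Species", "Morpho..Functional.Group",
               "D.mean", "D.s.d.", "D.", "S.mean", "S.s.d.", "S.")

# Abbreviated name -> readable name (pairing assumed from column order)
rename_map <- c(
  "D.mean" = "Deep Mean Biomass",
  "D.s.d." = "Deep Standard Deviation of Biomass",
  "D."     = "Deep Percentage",
  "S.mean" = "Shallow Mean Biomass",
  "S.s.d." = "Shallow Standard Deviation of Biomass",
  "S."     = "Shallow Percentage"
)

# Replace only the names that appear in the lookup table
to_rename <- old_names %in% names(rename_map)
new_names <- old_names
new_names[to_rename] <- unname(rename_map[old_names[to_rename]])
new_names
```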

Incorrect column annotation in Child_Variants

dbSNP should be described as a DB identifier for the dbSNP database

For REF 'reference' is spelled incorrectly

For REF and ALT it should be 'sequence' rather than 'base' since some are multi-base strings

ENST is the Ensembl Transcript ID

Improvements to Brain Bodyweight

In the brain bodyweight data there are a couple of things which could be better:

  1. The column names are a bit weird: there are extra dots in them, eg Body.weight..kg. Just Body.weight.kg would be nicer.

  2. There is a version of this data with an additional column for the type of species (Domesticated, Wild, Extinct) which would be a useful addition.
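
For the first point, a small sketch of tidying the dots after import. The helper name `tidy_dots()` is made up; only the Body.weight..kg. name comes from the data.

```r
# Collapse runs of dots and drop any trailing dot
tidy_dots <- function(x) {
  x <- gsub("\\.+", ".", x)  # ".." -> "."
  sub("\\.$", "", x)         # remove a dot at the end of the name
}

tidy_dots("Body.weight..kg.")
```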

One of the dataset names has a newline in it:

Something like this wouldn't be too bad in real data, but we shouldn't have it in our metadata table:

> Training_Data_List$Data_set_Name[4]
[1] "Herbiverous  fish in reefs of\nArraial do Cabo"
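
One way to clean this up in the metadata table is to collapse every run of whitespace, newlines included, into a single space. The sketch works on the string shown above; applying it to the whole `Data_set_Name` column would be the same call.

```r
dataset_name <- "Herbiverous  fish in reefs of\nArraial do Cabo"

# Replace any run of spaces, tabs or newlines with one space, then trim the ends
clean_name <- trimws(gsub("[[:space:]]+", " ", dataset_name))
clean_name
```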
