stevenwingett / lifesciencestrainingdatasets
A collection of datasets and accompanying scripts for learning how to analyse data
License: GNU General Public License v3.0
Usual thing with dots in column names.
Docs say "Mean CSD" but data says "Mean CSD Threshold"
Not sure what the Range is supposed to be given that it's only one number.
The logical columns aren't actually logical values but strings. All of the Normally distributed and Machine Learning values are NA
Would be good to have some other fields too:
You still appear to be breaking the column names for column IDs which have spaces in them.
For example:
> colnames(Trainingdata::Biomass_of_Herbivorous_Fish)
[1] "Family" "Species"
[3] "Morpho..Functional.Group" "Deep.Mean.Biomass"
[5] "Deep.Standard.Deviation.of.Biomass" "Deep.Percentage"
[7] "Shallow.Mean.Biomass" "Shallow.Standard.Deviation.of.Biomass"
[9] "Shallow.Percentage"
When you load the files from the original tsv/csv files you need to add the option:
check.names=FALSE
..so that R doesn't try to get clever and 'fix' the column names.
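A minimal sketch of the difference, using a small inline table rather than the package's real files. The header text and the values here are assumptions made up for illustration; only the option itself comes from the comment above.

```r
# Inline tab-separated text standing in for one of the original tsv files.
# Header names and numbers are invented for illustration only.
tsv_text <- "Family\tSpecies\tDeep Mean Biomass\nAcanthuridae\tExample species\t1.5\n"

# Default behaviour: read.delim() passes the header through make.names(),
# which replaces spaces and other non-syntactic characters with dots.
mangled <- read.delim(text = tsv_text)
colnames(mangled)

# With check.names = FALSE the header is kept exactly as written in the file.
intact <- read.delim(text = tsv_text, check.names = FALSE)
colnames(intact)
```

Columns whose names contain spaces then need backticks when accessed with `$`, for example intact$`Deep Mean Biomass`.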
I've added instructions and a script to the "devel" branch to make it easier to add new datasets to the R package. Are there already instructions for making the web version of the datasets?
Please let me know and I'll incorporate all this into the devel branch and then merge to the master branch.
Also, are we supposed to use Tidyverse rather than Base R for the website interface scripts?
Most of the columns are lower case in the data but upper case in the documentation.
At the moment I couldn't see a way to list all of the datasets in the package, or get an overview for what they are. Would be nice if there was an annotation file for the datasets which you could then add a function to see. Eg:
list_training_data()
..which gives you back a data frame of all of the names, descriptions, sources and other annotation.
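A rough sketch of what such a function could look like. Everything here is hypothetical: the annotation table, its column names and the placeholder text are assumptions, and in a real package the table would presumably be loaded from an annotation file shipped with the package rather than built inline.

```r
# Hypothetical annotation table; the descriptions and sources are placeholders.
dataset_annotation <- data.frame(
  Name = c("Brain_Bodyweight", "Plant_CO2_Uptake"),
  Description = c("placeholder description", "placeholder description"),
  Source = c("placeholder source", "placeholder source"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the annotation for every dataset in the package
# as a plain data frame, one row per dataset.
list_training_data <- function() {
  dataset_annotation
}

list_training_data()
```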
> ?Biomass_of_Herbivorous_Fish
No documentation for ‘Biomass_of_Herbivorous_Fish’ in specified packages and libraries
There seems to be an additional column (probably left over rownames?) at the start.
Plant is supposed to be a factor, but it's just text.
Same thing for Type and Treatment
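One way the conversion could be done before the data is saved into the package. This is a sketch on a small stand-in data frame, not the package's actual build script, and the values are illustrative only.

```r
# Small stand-in for the dataset; the values are illustrative only.
co2 <- data.frame(
  Plant = c("Qn1", "Qn1", "Mc1"),
  Type = c("Quebec", "Quebec", "Mississippi"),
  Treatment = c("nonchilled", "nonchilled", "chilled"),
  stringsAsFactors = FALSE
)

# Convert the three categorical columns from character to factor.
categorical <- c("Plant", "Type", "Treatment")
co2[categorical] <- lapply(co2[categorical], factor)

str(co2)
```

Factor levels default to alphabetical order, so an explicit levels argument would be needed if a particular order matters.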
Rather than just having data in proprietary R formats it would be useful to have the original files as well. That way we can also include data which has unusual problems during import. We could also give people the option of getting the filename for the original file, or the pre-loaded data and both would be useful.
Spaces are still being converted to dots.
Body.weight.kg vs Body.weight etc.
> install_github("StevenWingett/LifeSciencesTrainingDatasets/Trainingdata")
WARNING: Rtools is required to build R packages, but is not currently installed.
Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
Downloading GitHub repo StevenWingett/LifeSciencesTrainingDatasets@master
WARNING: Rtools is required to build R packages, but is not currently installed.
Please download and install Rtools custom from https://cran.r-project.org/bin/windows/Rtools/.
√ checking for file 'C:\Users\andrewss\AppData\Local\Temp\RtmpYVV8Wb\remotes2710a79739f\StevenWingett-LifeSciencesTrainingDatasets-2b9d343\Trainingdata/DESCRIPTION' (792ms)
- preparing 'Trainingdata':
√ checking DESCRIPTION meta-information ...
- checking for LF line-endings in source and make files and shell scripts
- checking for empty or unneeded directories
NB: this package now depends on R (>= 3.5.0)
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Biochemical_Oxygen_Demand.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Biomass_of_Herbivorous_Fish.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Bird_Sighting_Texas_2018.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Brain_Bodyweight.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Child_Variants.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Childrens_Indoor_Hobbies_During_Lockdown.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Childrens_Outdoor_Hobbies_During_Lockdown.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Cuttlefish_Buoyancy.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Modern_Pollen_Plant_Diversity_Relationships.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Neutrophils.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Plant_CO2_Uptake.rds'
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R. File(s) containing such objects: 'Trainingdata/data/Sleep_Deprivation_CSD.rds'
- building 'Trainingdata_0.1.0.tar.gz'
Installing package into ‘C:/Users/andrewss/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *source* package 'Trainingdata' ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
Warning in gzfile(file, "rb") :
cannot open compressed file 'D:/Documents/R/Trainingdata/data/Biochemical_Oxygen_Demand.rds', probable reason 'No such file or directory'
Error in gzfile(file, "rb") : cannot open the connection
Error: unable to load R code in package 'Trainingdata'
Execution halted
ERROR: lazy loading failed for package 'Trainingdata'
* removing 'C:/Users/andrewss/Documents/R/win-library/4.0/Trainingdata'
Error: Failed to install 'Trainingdata' from GitHub:
(converted from warning) installation of package ‘C:/Users/andrewss/AppData/Local/Temp/RtmpYVV8Wb/file271048247ad8/Trainingdata_0.1.0.tar.gz’ had non-zero exit status
Usual problems with column names not matching the docs and gaining extra dots.
The date string in this data isn't directly readable by as.Date(), so it would be good to add an example of the appropriate format string so that R can parse it.
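Something along these lines could go in the docs. The date strings and the format below are assumptions for illustration, since the dataset's actual format isn't shown here; the format string would need adjusting to match the real data.

```r
# Hypothetical date strings; the real format in the dataset may differ.
raw_dates <- c("03/14/2018", "12/01/2018")

# Without a format argument, as.Date() only tries "%Y-%m-%d" and "%Y/%m/%d",
# so month/day/year strings like these need an explicit format.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
parsed
```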
From the documentation it sounds like the percent column in the hobby dataset should be the percentage of time which each hobby occupied, which would mean that it should add up to 100? However it doesn't:
> sum(Childrens_Indoor_Hobbies_During_Lockdown$Percent)
[1] 61.2
> sum(Childrens_Outdoor_Hobbies_During_Lockdown$Percent)
[1] 24
The values I get are somewhat different:
> Childrens_Indoor_Hobbies_During_Lockdown %>% mutate(mypc=100*Number/sum(Number))
Indoor.Hobby Number Percent mypc
1 Arts & Crafts 195 12.5 21.8120805
2 Puzzles and Games 162 11.3 18.1208054
3 Building stuff 11 0.8 1.2304251
4 Cleaning 4 0.3 0.4474273
5 Computer 35 2.4 3.9149888
6 Cooking/Baking 32 2.1 3.5794183
7 Designing/Creating 8 0.6 0.8948546
8 Helping Others 1 0.1 0.1118568
9 Imaginitive Play 9 0.6 1.0067114
10 Tablet 12 0.8 1.3422819
11 Learning 12 0.8 1.3422819
12 Lego 51 3.4 5.7046980
13 Making Videos 3 0.2 0.3355705
14 Music 28 2.0 3.1319911
15 Physical ability 37 2.6 4.1387025
16 Phone 2 0.2 0.2237136
17 Reading 67 4.5 7.4944072
18 Sewing 1 0.7 0.1118568
19 Sleeping 1 0.1 0.1118568
20 Television 47 3.2 5.2572707
21 Toys 18 1.2 2.0134228
22 Video Chat 7 0.5 0.7829978
23 Video Games 149 10.2 16.6666667
24 Writing 2 0.1 0.2237136
Some of the files committed to GitHub (and therefore in the R package) should really be removed and just kept locally.
The main ones would be
.Rhistory
.Rproj
..but there may be more.
TGX.221 vs TGX-221
I think the annotation on this data is wrong, and there is a relevant column missing. It's wrong in the original paper too.
I'm pretty sure the "mean" value quoted should be the median, and that the "average" value in the paper, which is missing from your data, is actually the mean.
It's a shame that they didn't include the full dataset, because this is quite a nasty piece of data :-)
hatching_Date vs hatching_date
days_until_Hatching_trt vs days_until_hatching_trt
sampling_Date vs sampling_date
This probably isn't what was intended. We should have readable informative column names:
> colnames(Biomass_of_Herbivorous_Fish)
[1] "Family" "Species" "Morpho..Functional.Group" "D.mean"
[5] "D.s.d." "D." "S.mean" "S.s.d."
[9] "S."
dbSNP should be described as a DB identifier for the dbSNP database
For REF 'reference' is spelled incorrectly
For REF and ALT it should be 'sequence' rather than 'base' since some are multi-base strings
ENST is the Ensembl Transcript ID
The docs for both of the hobby datasets say "Hobby" but the actual data has "Indoor.Hobby" and "Outdoor.Hobby"
Range.CSD.Threshold is a character vector, but the docs say it's numeric
In the brain bodyweight data there are a couple of things which could be better:
The column names are a bit weird: there are extra dots in them, e.g. Body.weight..kg. where just Body.weight.kg would be nicer.
There is a version of this data with an additional column for the type of species (Domesticated, Wild, Extinct) which would be a useful addition.
Something like this wouldn't be too bad in real data, but we shouldn't have it in our metadata table:
> Training_Data_List$Data_set_Name[4]
[1] "Herbiverous fish in reefs of\nArraial do Cabo"
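A possible clean-up step for the metadata table, sketched on a one-row stand-in that reuses the value shown above. The real table has more columns and rows; only the column name and this one value are taken from the output.

```r
# Stand-in for the metadata table, reusing the value shown above.
Training_Data_List <- data.frame(
  Data_set_Name = "Herbiverous fish in reefs of\nArraial do Cabo",
  stringsAsFactors = FALSE
)

# Collapse any run of whitespace (including embedded newlines) to one space,
# then trim leading and trailing whitespace.
Training_Data_List$Data_set_Name <- trimws(
  gsub("\\s+", " ", Training_Data_List$Data_set_Name)
)

Training_Data_List$Data_set_Name
```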
In the documentation it says "Demand" but the data says "demand" (lower case). Would be good to use "Demand" since the other column is "Time"
It's supposed to be Percent, but the data uses X.
In the log transformed data the brain is labelled as body and vice versa.