bensutherland / simple_pop_stats
A short analysis of population statistics given specific inputs
Need script to drop samples based on a minimum genotype rate
I have some code that can be used as a foundation for this, but requires generalization. Posting here for tracking.
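A minimal sketch of what this could look like for a genind object, using adegenet's propTyped() to compute per-individual genotyping rates; the object name and threshold here are assumptions, not the eventual interface:

```r
library(adegenet)

# Hypothetical threshold; the real function would take this as an argument
min_geno_rate <- 0.7

# Per-individual proportion of loci successfully typed
geno_rate <- propTyped(obj, by = "ind")

# Report, then keep only individuals at or above the threshold
print(paste0("Dropping ", sum(geno_rate < min_geno_rate), " samples"))
obj <- obj[geno_rate >= min_geno_rate, ]
```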
From @bensutherland
"I noticed in the chinook microsat simulations something funky is going on with the '#collections' column. It used to be that this would show the number of collections in each grouping, but now it appears to show the sample size for the repunit?
Do you mind taking a look to see what's going on there? If we are going to show sample size for the repunit, that's all good, but let's change the colname to something such as n_per_repunit. I think the number of collections may have been more useful than the n_per_repunit though. "
When saving as a PDF from within a function, particularly for a ggplot object, it seems that this does not properly render and is saved as an empty plot.
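A likely cause: ggplot objects are only auto-printed at the top level, so inside a function the device closes before anything is drawn. A minimal sketch of the usual fix (an explicit print() before dev.off()); the function and argument names are illustrative:

```r
library(ggplot2)

save_plot_pdf <- function(p, fn){
  pdf(file = fn, width = 5, height = 4)
  print(p)    # without an explicit print(), the PDF is saved empty
  dev.off()
}
```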
As far as I can tell, there is no repunit path / filename variable yet used, and so this is hardcoded in several locations in the pipeline. I think this is because there exist non-constant repunit filenames (e.g., cm_repunits_full.txt vs. repunits_full.txt).
We'll therefore need to make all repunits filenames constant, and then create a variable (in select species?).
We identified an issue where JS was attempting to remove collections from the analysis using the drop_file option in drop_pops(). While the correct collections were printed to screen for "keeping", the final dataframe did not actually remove anything, retaining all collections.
This is likely either a) the dataframe is not actually being filtered (e.g., filter(df, parameters) rather than df <- filter(df, parameters)); or b) the filtered dataframe is being overwritten in subsequent steps (e.g., the original filter works, but subsequent steps ignore the filtered df).
Following up...
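The two suspected failure modes can be illustrated with dplyr (df and drop_list are hypothetical names standing in for the real objects):

```r
library(dplyr)

# (a) filtering without assignment: the result is printed and discarded, df unchanged
filter(df, !(collection %in% drop_list))

# the filter only sticks if the result is assigned back
df <- filter(df, !(collection %in% drop_list))

# (b) a later step that rebuilds df from the unfiltered source silently undoes the filter
```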
In microsats, often multiple grouping levels will be used on the exact same baseline (i.e., different roll-ups). How should we deal with this in the baseline benchmarks? For example, results of simulations.
It is not clear what is causing this, but when I source the initiator script for simple_pop_stats (i.e., 01_scripts/simple_pop_stats_start.R), then run my analysis, the function simple_pop_stats_pyes/01_scripts/utilities/pca_from_genind.r does not properly use the variable retain_pca_obj = TRUE. More specifically, the second and third uses of this variable in if statements within the function do not take effect.
The only workaround I've found is manually re-sourcing the function itself a second time after the initiator script; it then handles the variable above correctly. I have no idea why this occurs, but I have seen it multiple times. It makes me concerned that the standard way of sourcing functions is somehow not providing the full and correct function.
Initial run, using the initiator script only:
> pca_from_genind(data = obj
+ , PCs_ret = 4
+ , plot_eigen = TRUE
+ , plot_allele_loadings = TRUE
+ , retain_pca_obj = TRUE
+ , colour_file = "00_archive/formatted_cols.csv"
+ )
[1] "Converting genind to genlight"
Starting gi2gl
Starting gl.compliance.check
Processing genlight object with SNP data
Checking coding of SNPs
SNP data scored NA, 0, 1 or 2 confirmed
Checking locus metrics and flags
Recalculating locus metrics
Checking for monomorphic loci
No monomorphic loci detected
Checking for loci with all missing data
No loci with all missing data detected
Checking whether individual names are unique.
Checking for individual metrics
Warning: Creating a slot for individual metrics
Checking for population assignments
Population assignments confirmed
Spelling of coordinates checked and changed if necessary to
lat/lon
Completed: gl.compliance.check
Completed: gi2gl
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD
2
Note that nothing is saved out.
Then I manually sourced the pca_from_genind function and re-ran:
pca_from_genind(data = obj
, PCs_ret = 4
, plot_eigen = TRUE
, plot_allele_loadings = TRUE
, retain_pca_obj = TRUE
, colour_file = "00_archive/formatted_cols.csv"
)
This time the output includes "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro" in the readout, and the plot objects appear in the global environment:
[1] "Converting genind to genlight"
Starting gi2gl
Starting gl.compliance.check
Processing genlight object with SNP data
Checking coding of SNPs
SNP data scored NA, 0, 1 or 2 confirmed
Checking locus metrics and flags
Recalculating locus metrics
Checking for monomorphic loci
No monomorphic loci detected
Checking for loci with all missing data
No loci with all missing data detected
Checking whether individual names are unique.
Checking for individual metrics
Warning: Creating a slot for individual metrics
Checking for population assignments
Population assignments confirmed
Spelling of coordinates checked and changed if necessary to
lat/lon
Completed: gl.compliance.check
Completed: gi2gl
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"
[1] "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro"
[1] "Saving a PCA plot as 'pc3_v_pc4.plot' into the enviro"
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD
2
The only difference between these two runs is the manual source of the function.
We currently write to the PBT folder with microsat benchmarking, which could become an issue in particular for species that have both microsatellite and SNP baselines in play. This is not quite as simple as using a global variable, since Rmarkdown does not work well with global variables. As per:
"For better or worse, this omission is intentional. Relying on objects created outside the document makes your document less reproducible--that is, if your document needs data in the global environment, you can't just give someone (or yourself in two years) the document and data files and let them recreate it themselves.
For this reason, and in order to perform the render in the background, RStudio actually creates a separate R session to render the document. That background R session cannot see any of the environments in the interactive R session you see in RStudio."
Workaround is setting the variable manually, then sourcing the start script:
on_network <- TRUE
source("01_scripts/simple_pop_stats_start.R")
This should prompt the user to re-build the colour file, not just crash without warning. This happens, for example, if one were to spell "blue" as "bleu".
Encountered the following warning when running Rmarkdown with the benchmark protocol:
In has_crop_tools() :
Tool(s) not installed or not in PATH: ghostcript
-> As a result, figure cropping will be disabled.
Not sure what is wrong here - I suspect it has to do with the way we installed Ghostscript through the Software Center. This: https://stackoverflow.com/questions/66305776/got-knit-issue-with-r as well as some others imply it is an environment variable path that needs to be set.
To follow up...
There is a sorting error that is introduced when running annotate_rubias() specifically with datatype=="SNP" and custom_format==TRUE. At line 52-60:
# Attach the pops to the individuals
pop_code <- merge(x = pop_code, y = sample_to_pop_interp.df, by.x = "indiv.vec", by.y = "indiv", all.x = T, sort = F)
# Attach the repunits to the pops
pop_code <- merge(x = pop_code, y = sc.df, by.x = "pop", by.y = "collection", all.x = T, sort = F)
head(pop_code)
colnames(pop_code) <- c("collection", "ind", "repunit")
pop_code <- pop_code[, c("ind", "collection", "repunit")]
...the merging of pop_code and sc.df shuffles the individual rows, but the row order is later assumed to have remained constant, at L129-130:
# Add these together followed by genetic data
all_data.df <- cbind(sample_type.vec, collection.vec, repunit.vec, indiv.vec, two_allele_data)
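One possible fix, assuming indiv.vec still holds the original order: merge() does not preserve input row order (even with sort = FALSE the resulting order is unspecified), so re-align the merged rows explicitly before the cbind():

```r
# Re-order merged rows back to the original individual order
pop_code <- pop_code[match(indiv.vec, pop_code$ind), ]

# Guard against silent misalignment downstream
stopifnot(identical(pop_code$ind, indiv.vec))
```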
We've encountered a few errors recently around header consistency, especially capitalization, in some manually created files when working with microsats. This is especially true of the repunits_full.txt file now required for some summary steps. Namely, some fields are fully capitalized (CU and CU_NAME), some are partially capitalized (Display_Order), and some are lowercase (repunit). This is leading to headaches, as these files are often manually entered.
Goal - to try and either 1) use a script to export a template with the correct headers as a starting point; 2) create a repunits file from the rubias baseline, where all you would have to do is fill in the Display_Order column; or 3) allow the script to be more flexible about header capitalization. I think option 2 is probably best, but any of the above will work. To do!
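A sketch of option 2, deriving a template from the rubias baseline so that only Display_Order (and the CU fields) need manual entry; rubias_base.df is a hypothetical name for the loaded baseline, and the headers follow the conventions noted above:

```r
# Build a repunits template with the expected headers, one row per repunit
repunit_template.df <- data.frame(CU = NA
                                  , CU_NAME = NA
                                  , Display_Order = NA
                                  , repunit = unique(rubias_base.df$repunit)
                                  )

write.table(x = repunit_template.df, file = "repunits_full.txt"
            , sep = "\t", row.names = FALSE, quote = FALSE)
```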
Can the benchmark PDF name be sourced from the Rmd folder variable? Instead of baseline_benchmark.PDF, it could be the specific benchmark?
JS encountered an issue where genepop_to_rubias_microsat()
was spitting out an error:
Error: At least one of your input file stock names was not found in your stock code file, stopping...
This was confusing, as it was clear that they were all there.
Ran it line by line, and it became obvious in annotate_rubias()
that it was a mismatch in df lengths. The new dataframe was longer, not shorter! After further study, it was because each collection name was assigned to two stock codes. Thus, when the merge occurred, the data was being duplicated across two lines, one for each stock code.
I can fix the error in terms of better reporting the cause of the failure. But the underlying issue (multiple stock codes for a single pop in uSats) will need more thought...
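For the reporting side, a sketch of a pre-merge check that names the offending collections rather than failing downstream; sc.df with a collection column follows the naming used in the annotate_rubias() snippet above:

```r
# Flag any collection assigned to more than one stock code before merging
dup_collections <- unique(sc.df$collection[duplicated(sc.df$collection)])

if (length(dup_collections) > 0) {
  stop("Collections with multiple stock codes: "
       , paste(dup_collections, collapse = ", "))
}
```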
The function genepop_to_rubias_microsat() has an option for data type as reference or mixture, but I am not certain that the script can handle the input mix file. In any case it is not tested and is not documented, so some work is needed here to improve this function.
Need to fix this script to just generate default colours if no colour dataframe is provided. Currently it just crashes with an error.
script: pca_from_genind()
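A sketch of the intended fallback, assuming the colour_file argument and adegenet accessors; one default colour is generated per population when no colour dataframe is supplied:

```r
if (is.null(colour_file)) {
  print("No colour file provided; generating default colours")
  # One default colour per population, named by population
  pop_colours <- setNames(object = rainbow(n = nPop(obj)), nm = popNames(obj))
}
```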
Upon updating R to 4.0.3, I noticed that an error was produced when trying to run the benchmark r-markdown, about missing dependencies "pdfcrop" and "ghostscript". This seems to be a result of pdfcrop being added to tinytex, but it wasn't straightforward to fix:
install_tinytex()
tinytex::reinstall_tinytex()
tinytex::tlmgr_install("pdfcrop")
I thought it was isolated to me, so I didn't think too much of it, but we saw this error again this morning on another computer, so it is likely to keep popping up.
The above is what I believe is the minimum required, but at one point tinytex::tlmgr_install asked me to install/update some other items. It wasn't clear whether all were needed, but I installed them anyway, and would recommend doing so as well.
Automation needed to copy the PDF from simple_pop_stats/01_scripts/ to the target folder for ease of repeat building of the benchmark doc.
The output is more specific to 100% simulations, whereas if we want to also use this function for RFS, then we need to keep all stocks in the output, not just those with assigned proportions in the ppn input file. Otherwise allocation disappears in the results output. From discussion with Eric, this is probably specific to the creation of the summary file.
Is it possible to change from centre justified to left justified for the two notes.txt and changes.txt-sourced tables in the benchmark rendered document?
The landscape rotation of the simulation results may work better if in portrait for ease of reading. If we can only retain one metric for simulations and eliminate any other unnecessary columns, it may fit better.
Issue with 100% simulations, which seems to be due to record names such as allele 1 ("ots_epic4_158_1") and allele 2 ("ots_epic4_158_1_1"). This potentially needs to be corrected in terms of marker names, ensuring that marker names do not have _1 at the end.
Not clear how to resolve this in the meantime, other than deleting the offending record.
full_sim(rubias_base.FN = "S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt"
+ , num_sim_ind = 200, sim_reps = 100
+ )
Parsed with column specification:
cols(
.default = col_double(),
sample_type = col_character(),
collection = col_character(),
repunit = col_character(),
indiv = col_character(),
ots_epic4_158_1 = col_logical(),
ots_epic4_158_1_1 = col_logical()
)
See spec(...) for full column specifications.
Warning: 3 parsing failures.
row col expected actual file
1234 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1243 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1269 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
[1] "Generating counts per collection and per repunit"
[1] "Performing 100% simulation"
[1] "Running simulation and assessing the simulation assignments"
[1] "This will include a total of ***53*** scenarios"
Error in input. At diploid loci, either both or neither gene copies must be missing. Offending locus = ots_epic4_158_1
Note! This might indicate that the gen_start_col is incorrect.
Error in get_ploidy_from_frame(tmp) :
Bailing out due to single gene copies being missing data at non-haploid loci.
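A possible interim check, assuming the rubias input is already loaded as a dataframe (rubias.df here is a hypothetical name) and the allele-2 suffix convention seen above: flag markers whose own names end in "_1", since their allele columns then become ambiguous, so the offending records can be renamed or removed before simulation:

```r
# Derive the non-genetic columns, then look at allele column names
allele_cols <- setdiff(colnames(rubias.df)
                       , c("sample_type", "collection", "repunit", "indiv"))

# The allele-2 column is the marker name plus "_1"; a column ending in "_1_1"
# therefore indicates a marker whose own name ends in "_1"
suspect_cols    <- grep("_1_1$", allele_cols, value = TRUE)
suspect_markers <- sub("_1$", "", suspect_cols)
print(suspect_markers)   # e.g., "ots_epic4_158_1"
```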
Need to either (a) re-direct to the working folder, likely via an rmarkdown::render() wrapper script; or (b) create an automated method that collects the output and puts it in the target folder after the knitr is finished (e.g., a complete_benchmark() function could do this using paths in the environment).
This could also include the name to be used for the benchmark, which will also be used for the folder.
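A sketch of option (a); the Rmd path and target_folder value are assumptions for illustration, and rmarkdown::render() supports output_dir directly:

```r
library(rmarkdown)

target_folder <- "03_results/benchmark/"   # hypothetical target location

render(input = "01_scripts/baseline_benchmark.Rmd"   # assumed Rmd path
       , output_file = "baseline_benchmark.pdf"
       , output_dir = target_folder
       )
```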
The function drop_loci(df = obj, drop_monomorphic = TRUE)
is not dropping all monomorphic loci. An example is if data is read into R as a genind file, then filtered dropping all samples that contain the alternate allele. The locus will then be monomorphic, but may still have a second column for the second allele.
The issue is here:
https://github.com/bensutherland/simple_pop_stats/blob/master/01_scripts/utilities/drop_loci.r
which(nAll(df)==1)
nAll() has an option that counts only alleles actually observed in the data, which may resolve this.
Adding this as an issue for documentation.
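If the option in question is adegenet's onlyObserved argument to nAll() (an assumption worth verifying against the installed version), the fix may be as small as:

```r
# Count only alleles actually observed, so a locus whose alternate-allele column
# remains after filtering out carriers is still treated as monomorphic
drop <- which(nAll(df, onlyObserved = TRUE) == 1)
```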
I suspect that the comparison of physical and genetic distance is duplicating comparisons, e.g., pop1 vs. pop2 AND pop2 vs. pop1 both being retained.
This would be a problem for the model (identical data points counted twice), and for the plot.
There should be a total number of comparisons that are shown as a QC step when this is done so the user gets informed on what contrasts are being made.
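A sketch of the de-duplication and QC report, with dist.df (columns pop1, pop2) as a hypothetical layout for the pairwise table:

```r
# Build an order-independent pair ID so pop1-vs-pop2 and pop2-vs-pop1 collapse
pair_id <- apply(X = dist.df[, c("pop1", "pop2")], MARGIN = 1
                 , FUN = function(x) paste(sort(x), collapse = "__"))

# Drop duplicated pairs and self-comparisons, then report as a QC step
dist.df <- dist.df[!duplicated(pair_id) & dist.df$pop1 != dist.df$pop2, ]
print(paste0("Retaining ", nrow(dist.df), " unique pairwise comparisons"))
```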
Using the baseline benchmarking knit command, I receive the following error, which I think occurs because I don't have a latex rendering program on my Windows machine.
output file: baseline_benchmark.knit.md
"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS baseline_benchmark.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output baseline_benchmark.tex --self-contained --highlight-style tango --latex-engine pdflatex --variable graphics --include-in-header "C:\Users\SUTHER~1\AppData\Local\Temp\2\RtmpINJRG4\rmarkdown-str3c4868ed51a5.html" --variable "geometry:margin=1in"
Error: LaTeX failed to compile baseline_benchmark.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips.
In addition: Warning message:
In system2(..., stdout = if (use_file_stdout()) f1 else FALSE, stderr = f2) :
'"pdflatex"' not found
Execution halted
No TeX installation detected (TeX is required to create PDF output). You should install a recommended TeX distribution for your platform:
Windows: MiKTeX (Complete) - http://miktex.org/2.9/setup
(NOTE: Be sure to download the Complete rather than Basic installation)
Mac OS X: TexLive 2013 (Full) - http://tug.org/mactex/
(NOTE: Download with Safari rather than Chrome _strongly_ recommended)
Linux: Use system package manager