bensutherland / simple_pop_stats
A short analysis of population statistics given specific inputs
Need script to drop samples based on a minimum genotype rate
I have some code that can be used as a foundation for this, but requires generalization. Posting here for tracking.
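A minimal sketch of what this could look like for a genind object, using adegenet's propTyped() to compute per-individual genotyping rates; the object name and threshold here are assumptions, not the eventual interface:

```r
library(adegenet)

# Hypothetical threshold; the real function would take this as an argument
min_geno_rate <- 0.7

# Per-individual proportion of loci successfully typed
geno_rate <- propTyped(obj, by = "ind")

# Report, then keep only individuals at or above the threshold
print(paste0("Dropping ", sum(geno_rate < min_geno_rate), " samples"))
obj <- obj[geno_rate >= min_geno_rate, ]
```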
From @bensutherland
"I noticed in the chinook microsat simulations something funky is going on with the '#collections' column. It used to be that this would show the number of collections in each grouping, but now it appears to show the sample size for the repunit?
Do you mind taking a look to see what's going on there? If we are going to show sample size for the repunit, that's all good, but let's change the colname to something such as n_per_repunit. I think the number of collections may have been more useful than the n_per_repunit though. "
When saving as a PDF from within a function, particularly for a ggplot object, it seems that this does not properly render and is saved as an empty plot.
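A likely cause: ggplot objects are only auto-printed at the top level, so inside a function the device closes before anything is drawn. A minimal sketch of the usual fix (an explicit print() before dev.off()); the function and argument names are illustrative:

```r
library(ggplot2)

save_plot_pdf <- function(p, fn){
  pdf(file = fn, width = 5, height = 4)
  print(p)    # without an explicit print(), the PDF is saved empty
  dev.off()
}
```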
As far as I can tell, there is no repunit path / filename variable yet used, and so this is hardcoded in several locations in the pipeline. I think this is because there exist non-constant repunit filenames (e.g., cm_repunits_full.txt vs. repunits_full.txt).
We'll therefore need to make all repunits filenames constant, and then create a variable (in select species?).
We identified an issue where JS was attempting to remove collections from the analysis using the drop_file option in drop_pops(). While the correct collections were printed to screen for "keeping", the final dataframe did not actually remove anything, retaining all collections.
This is likely either a) the dataframe is not actually being filtered (e.g., filter(df, parameters) rather than df <- filter(df, parameters)); or b) the filtered dataframe is being overwritten in subsequent steps (e.g., the original filter works, but subsequent steps ignore the filtered df).
Following up...
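The two suspected failure modes can be illustrated with dplyr (df and drop_list are hypothetical names standing in for the real objects):

```r
library(dplyr)

# (a) filtering without assignment: the result is printed and discarded, df unchanged
filter(df, !(collection %in% drop_list))

# the filter only sticks if the result is assigned back
df <- filter(df, !(collection %in% drop_list))

# (b) a later step that rebuilds df from the unfiltered source silently undoes the filter
```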
In microsats, often multiple grouping levels will be used on the exact same baseline (i.e., different roll-ups). How should we deal with this in the baseline benchmarks? For example, results of simulations.
It is not clear what is causing this, but when I source the initiator script for simple_pop_stats (i.e., 01_scripts/simple_pop_stats_start.R), then run my analysis, the function simple_pop_stats_pyes/01_scripts/utilities/pca_from_genind.r does not properly use the variable retain_pca_obj = TRUE. More specifically, the second and third uses of this variable in if statements within the function do not take effect.
The only workaround I've found is manually re-sourcing the function itself a second time after the initiator script; it then handles the variable above correctly. I have no idea why this occurs, but I have seen it multiple times. It makes me concerned that the standard way of sourcing functions is somehow not providing the full and correct function.
Initial run, using the initiator script only:
> pca_from_genind(data = obj
+ , PCs_ret = 4
+ , plot_eigen = TRUE
+ , plot_allele_loadings = TRUE
+ , retain_pca_obj = TRUE
+ , colour_file = "00_archive/formatted_cols.csv"
+ )
[1] "Converting genind to genlight"
Starting gi2gl
Starting gl.compliance.check
Processing genlight object with SNP data
Checking coding of SNPs
SNP data scored NA, 0, 1 or 2 confirmed
Checking locus metrics and flags
Recalculating locus metrics
Checking for monomorphic loci
No monomorphic loci detected
Checking for loci with all missing data
No loci with all missing data detected
Checking whether individual names are unique.
Checking for individual metrics
Warning: Creating a slot for individual metrics
Checking for population assignments
Population assignments confirmed
Spelling of coordinates checked and changed if necessary to
lat/lon
Completed: gl.compliance.check
Completed: gi2gl
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD
2
Note that nothing is saved out.
Then I manually sourced the pca_from_genind function and re-ran:
pca_from_genind(data = obj
, PCs_ret = 4
, plot_eigen = TRUE
, plot_allele_loadings = TRUE
, retain_pca_obj = TRUE
, colour_file = "00_archive/formatted_cols.csv"
)
This time the output includes "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro" in the readout, and the plot objects appear in the global environment:
[1] "Converting genind to genlight"
Starting gi2gl
Starting gl.compliance.check
Processing genlight object with SNP data
Checking coding of SNPs
SNP data scored NA, 0, 1 or 2 confirmed
Checking locus metrics and flags
Recalculating locus metrics
Checking for monomorphic loci
No monomorphic loci detected
Checking for loci with all missing data
No loci with all missing data detected
Checking whether individual names are unique.
Checking for individual metrics
Warning: Creating a slot for individual metrics
Checking for population assignments
Population assignments confirmed
Spelling of coordinates checked and changed if necessary to
lat/lon
Completed: gl.compliance.check
Completed: gi2gl
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"
[1] "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro"
[1] "Saving a PCA plot as 'pc3_v_pc4.plot' into the enviro"
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD
2
The only difference between these two runs is the manual source of the function.
We currently write to the PBT folder with microsat benchmarking, which could become an issue in particular for species that have both microsatellite and SNP baselines in play. This is not quite as simple as using a global variable, since Rmarkdown does not work well with global variables. As per:
"For better or worse, this omission is intentional. Relying on objects created outside the document makes your document less reproducible--that is, if your document needs data in the global environment, you can't just give someone (or yourself in two years) the document and data files and let them recreate it themselves.
For this reason, and in order to perform the render in the background, RStudio actually creates a separate R session to render the document. That background R session cannot see any of the environments in the interactive R session you see in RStudio."
Workaround is setting the variable manually, then sourcing the start script:
on_network <- TRUE
source("01_scripts/simple_pop_stats_start.R")
This should prompt the user to re-build the colour file, not just crash without warning. This happens, for example, if one were to spell "blue" as "bleu".
Encountered the following warning when running Rmarkdown with the benchmark protocol:
In has_crop_tools() :
Tool(s) not installed or not in PATH: ghostcript
-> As a result, figure cropping will be disabled.
Not sure what is wrong here - I suspect it has to do with the way we installed Ghostscript through the Software Center. This: https://stackoverflow.com/questions/66305776/got-knit-issue-with-r as well as some others imply it is an environment variable path that needs to be set.
To follow up...
There is a sorting error that is introduced when running annotate_rubias() specifically with datatype=="SNP" and custom_format==TRUE. At line 52-60:
# Attach the pops to the individuals
pop_code <- merge(x = pop_code, y = sample_to_pop_interp.df, by.x = "indiv.vec", by.y = "indiv", all.x = T, sort = F)
# Attach the repunits to the pops
pop_code <- merge(x = pop_code, y = sc.df, by.x = "pop", by.y = "collection", all.x = T, sort = F)
head(pop_code)
colnames(pop_code) <- c("collection", "ind", "repunit")
pop_code <- pop_code[, c("ind", "collection", "repunit")]
...the merging of pop_code and sc.df shuffles the individual rows, but the row order is later assumed to have remained constant, at L129-130:
# Add these together followed by genetic data
all_data.df <- cbind(sample_type.vec, collection.vec, repunit.vec, indiv.vec, two_allele_data)
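One possible fix, assuming indiv.vec still holds the original order: merge() does not preserve input row order (even with sort = FALSE the resulting order is unspecified), so re-align the merged rows explicitly before the cbind():

```r
# Re-order merged rows back to the original individual order
pop_code <- pop_code[match(indiv.vec, pop_code$ind), ]

# Guard against silent misalignment downstream
stopifnot(identical(pop_code$ind, indiv.vec))
```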
We've encountered a few errors recently around header consistency, especially capitalization, in some manually created files when working with microsats. This is especially true of the repunits_full.txt file now required for some summary steps. Namely, some fields are fully capitalized (CU and CU_NAME), some are partially capitalized (Display_Order), and some are lowercase (repunit). This is leading to headaches, as these files are often manually entered.
Goal - to try and either 1) use a script to export a template with the correct headers as a starting point; 2) create a repunits file from the rubias baseline, where all you would have to do is fill in the Display_Order column; or 3) allow the script to be more flexible about header capitalization. I think option 2 is probably best, but any of the above will work. To do!
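A sketch of option 2, deriving a template from the rubias baseline so that only Display_Order (and the CU fields) need manual entry; rubias_base.df is a hypothetical name for the loaded baseline, and the headers follow the conventions noted above:

```r
# Build a repunits template with the expected headers, one row per repunit
repunit_template.df <- data.frame(CU = NA
                                  , CU_NAME = NA
                                  , Display_Order = NA
                                  , repunit = unique(rubias_base.df$repunit)
                                  )

write.table(x = repunit_template.df, file = "repunits_full.txt"
            , sep = "\t", row.names = FALSE, quote = FALSE)
```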
Can the benchmark PDF name be sourced from the Rmd folder variable? Instead of baseline_benchmark.PDF, it could be the specific benchmark?
JS encountered an issue where genepop_to_rubias_microsat()
was spitting out an error:
Error: At least one of your input file stock names was not found in your stock code file, stopping...
This was confusing, as it was clear that they were all there.
Ran it line by line, and it became obvious in annotate_rubias()
that it was a mismatch in df lengths. The new dataframe was longer, not shorter! After further study, it was because each collection name was assigned to two stock codes. Thus, when the merge occurred, the data was being duplicated across two lines, one for each stock code.
I can fix the error in terms of better reporting the cause of the failure. But the underlying issue (multiple stock codes for a single pop in uSats) will need more thought...
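For the reporting side, a sketch of a pre-merge check that names the offending collections rather than failing downstream; sc.df with a collection column follows the naming used in the annotate_rubias() snippet above:

```r
# Flag any collection assigned to more than one stock code before merging
dup_collections <- unique(sc.df$collection[duplicated(sc.df$collection)])

if (length(dup_collections) > 0) {
  stop("Collections with multiple stock codes: "
       , paste(dup_collections, collapse = ", "))
}
```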
The function genepop_to_rubias_microsat() has an option for data type as reference or mixture, but I am not certain that the script can handle the input mix file. In any case it is not tested and is not documented, so some work is needed here to improve this function.
Need to fix this script to just generate default colours if no colour dataframe is provided. Currently it just crashes with an error.
script: pca_from_genind()
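A sketch of the intended fallback, assuming the colour_file argument and adegenet accessors; one default colour is generated per population when no colour dataframe is supplied:

```r
if (is.null(colour_file)) {
  print("No colour file provided; generating default colours")
  # One default colour per population, named by population
  pop_colours <- setNames(object = rainbow(n = nPop(obj)), nm = popNames(obj))
}
```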
Upon updating R to 4.0.3, I noticed that an error was produced when trying to run the benchmark r-markdown, about missing dependencies "pdfcrop" and "ghostscript". This seems to be a result of pdfcrop being added to tinytex, but it wasn't straightforward to fix:
install_tinytex()
tinytex::reinstall_tinytex()
tinytex::tlmgr_install("pdfcrop")
I thought it was isolated to me, so I didn't think too much of it, but we saw this error again this morning on another computer, so it is likely to keep popping up.
The above is what I believe is the minimum required, but at one point tinytex::tlmgr_install asked me to install/update some other items. It wasn't clear whether all were needed, but I installed them anyway, and would recommend doing so as well.
Automation needed to copy the PDF from simple_pop_stats/01_scripts/ to the target folder for ease of repeat building of the benchmark doc.
The output is more specific to 100% simulations, whereas if we want to also use this function for RFS, then we need to keep all stocks in the output, not just those with assigned proportions in the ppn input file. Otherwise allocation disappears in the results output. From discussion with Eric, this is probably specific to the creation of the summary file.
Is it possible to change from centre justified to left justified for the two notes.txt and changes.txt-sourced tables in the benchmark rendered document?
The landscape rotation of the simulation results may work better if in portrait for ease of reading. If we can only retain one metric for simulations and eliminate any other unnecessary columns, it may fit better.
Issue with 100% simulations, which seems to be due to record names such as allele 1 ("ots_epic4_158_1") and allele 2 ("ots_epic4_158_1_1"). This potentially needs to be corrected in terms of marker names, ensuring that marker names do not have _1 at the end.
Not clear how to resolve this in the meantime, other than deleting the offending record.
full_sim(rubias_base.FN = "S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt"
+ , num_sim_ind = 200, sim_reps = 100
+ )
Parsed with column specification:
cols(
.default = col_double(),
sample_type = col_character(),
collection = col_character(),
repunit = col_character(),
indiv = col_character(),
ots_epic4_158_1 = col_logical(),
ots_epic4_158_1_1 = col_logical()
)
See spec(...) for full column specifications.
Warning: 3 parsing failures.
row col expected actual file
1234 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1243 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1269 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE 2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
[1] "Generating counts per collection and per repunit"
[1] "Performing 100% simulation"
[1] "Running simulation and assessing the simulation assignments"
[1] "This will include a total of ***53*** scenarios"
Error in input. At diploid loci, either both or neither gene copies must be missing. Offending locus = ots_epic4_158_1
Note! This might indicate that the gen_start_col is incorrect.
Error in get_ploidy_from_frame(tmp) :
Bailing out due to single gene copies being missing data at non-haploid loci.
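A possible interim check, assuming the rubias input is already loaded as a dataframe (rubias.df here is a hypothetical name) and the allele-2 suffix convention seen above: flag markers whose own names end in "_1", since their allele columns then become ambiguous, so the offending records can be renamed or removed before simulation:

```r
# Derive the non-genetic columns, then look at allele column names
allele_cols <- setdiff(colnames(rubias.df)
                       , c("sample_type", "collection", "repunit", "indiv"))

# The allele-2 column is the marker name plus "_1"; a column ending in "_1_1"
# therefore indicates a marker whose own name ends in "_1"
suspect_cols    <- grep("_1_1$", allele_cols, value = TRUE)
suspect_markers <- sub("_1$", "", suspect_cols)
print(suspect_markers)   # e.g., "ots_epic4_158_1"
```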
Need to either (a) re-direct to the working folder, likely via an rmarkdown::render() wrapper script; or (b) create an automated method that collects the output and puts it in the target folder after the knitr is finished (e.g., a complete_benchmark() function could do this using paths in the environment).
This could also include the name to be used for the benchmark, which will also be used for the folder.
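A sketch of option (a); the Rmd path and target_folder value are assumptions for illustration, and rmarkdown::render() supports output_dir directly:

```r
library(rmarkdown)

target_folder <- "03_results/benchmark/"   # hypothetical target location

render(input = "01_scripts/baseline_benchmark.Rmd"   # assumed Rmd path
       , output_file = "baseline_benchmark.pdf"
       , output_dir = target_folder
       )
```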
The function drop_loci(df = obj, drop_monomorphic = TRUE)
is not dropping all monomorphic loci. An example is if data is read into R as a genind file, then filtered dropping all samples that contain the alternate allele. The locus will then be monomorphic, but may still have a second column for the second allele.
The issue is here:
https://github.com/bensutherland/simple_pop_stats/blob/master/01_scripts/utilities/drop_loci.r
which(nAll(df)==1)
nAll() has an option that counts only alleles actually observed in the data, which may resolve this.
Adding this as an issue for documentation.
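If the option in question is adegenet's onlyObserved argument to nAll() (an assumption worth verifying against the installed version), the fix may be as small as:

```r
# Count only alleles actually observed, so a locus whose alternate-allele column
# remains after filtering out carriers is still treated as monomorphic
drop <- which(nAll(df, onlyObserved = TRUE) == 1)
```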
I suspect that the comparison of physical and genetic distance is duplicating comparisons, e.g., pop1 vs. pop2 AND pop2 vs. pop1 both being retained.
This would be a problem for the model (identical data points counted twice), and for the plot.
There should be a total number of comparisons that are shown as a QC step when this is done so the user gets informed on what contrasts are being made.
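A sketch of the de-duplication and QC report, with dist.df (columns pop1, pop2) as a hypothetical layout for the pairwise table:

```r
# Build an order-independent pair ID so pop1-vs-pop2 and pop2-vs-pop1 collapse
pair_id <- apply(X = dist.df[, c("pop1", "pop2")], MARGIN = 1
                 , FUN = function(x) paste(sort(x), collapse = "__"))

# Drop duplicated pairs and self-comparisons, then report as a QC step
dist.df <- dist.df[!duplicated(pair_id) & dist.df$pop1 != dist.df$pop2, ]
print(paste0("Retaining ", nrow(dist.df), " unique pairwise comparisons"))
```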
Using the baseline benchmarking knit command, I receive the following error, which I think occurs because I don't have a latex rendering program on my Windows machine.
output file: baseline_benchmark.knit.md
"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS baseline_benchmark.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output baseline_benchmark.tex --self-contained --highlight-style tango --latex-engine pdflatex --variable graphics --include-in-header "C:\Users\SUTHER~1\AppData\Local\Temp\2\RtmpINJRG4\rmarkdown-str3c4868ed51a5.html" --variable "geometry:margin=1in"
Error: LaTeX failed to compile baseline_benchmark.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips.
In addition: Warning message:
In system2(..., stdout = if (use_file_stdout()) f1 else FALSE, stderr = f2) :
'"pdflatex"' not found
Execution halted
No TeX installation detected (TeX is required to create PDF output). You should install a recommended TeX distribution for your platform:
Windows: MiKTeX (Complete) - http://miktex.org/2.9/setup
(NOTE: Be sure to download the Complete rather than Basic installation)
Mac OS X: TexLive 2013 (Full) - http://tug.org/mactex/
(NOTE: Download with Safari rather than Chrome _strongly_ recommended)
Linux: Use system package manager