
simple_pop_stats's People

Contributors

bensutherland, cgwallace, erondeau, janinesupernault, rycroftc


Forkers

rycroftc

simple_pop_stats's Issues

full_sim.r no longer reporting #collections, rather only what was supposed to be n_per_repunit

From @bensutherland

"I noticed in the chinook microsat simulations something funky is going on with the '#collections' column. It used to be that this would show the number of collections in each grouping, but now it appears to show the sample size for the repunit?
Do you mind taking a look to see what's going on there? If we are going to show sample size for the repunit, that's all good, but let's change the colname to something such as n_per_repunit. I think the number of collections may have been more useful than the n_per_repunit though. "

Repunit filenames need to be constant for all species and added as variable to launch script

As far as I can tell, there is no repunit path/filename variable yet used, and so this is hardcoded in several locations in the pipeline. I think this is because non-constant repunit filenames exist (e.g., cm_repunits_full.txt vs. repunits_full.txt).
We'll therefore need to make all repunit filenames constant, and then create a variable (in select species?).

drop_pops() not actually dropping pops

We identified an issue where JS was attempting to remove collections from the analysis using the drop_file option in drop_pops(). While the correct collections were printed to screen as being kept, the final dataframe retained all collections; nothing was actually removed.

This seems likely to be either: a) the filtered dataframe is not actually being assigned (e.g., filter(df, parameters) rather than df <- filter(df, parameters)); or b) the filtered dataframe is being overwritten in subsequent steps (e.g., the original filter works, but subsequent steps ignore the filtered df).
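Hypothesis (a) can be reproduced in a few lines; this is a generic sketch with made-up data, not the actual drop_pops() code:

```r
# Generic sketch of hypothesis (a): filtering without assignment is a no-op.
df <- data.frame(collection = c("popA", "popA", "popB", "popC")
                 , sample = 1:4
                 )
keep <- c("popA", "popC")

df[df$collection %in% keep, ]        # computes the kept rows, but df is unchanged
nrow(df)                             # still 4

df <- df[df$collection %in% keep, ]  # fix: assign the filtered result back
nrow(df)                             # now 3
```

If the assignment is present in drop_pops(), hypothesis (b) (a later step overwriting the filtered df) becomes the more likely culprit.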

Following up...

Function not operating correctly until sourced again manually

It is not clear what is causing this, but when I source the initiator script for simple_pop_stats (i.e., 01_scripts/simple_pop_stats_start.R), then run my analysis, the function 01_scripts/utilities/pca_from_genind.r does not properly use the variable retain_pca_obj = TRUE. More specifically, the second and third if statements within the function that reference this variable do not respect its value.
The only way around this that I've found is re-sourcing the function itself manually a second time after the initiator script. It then handles the variable correctly. I have no idea why this occurs, but I have seen it multiple times. It makes me concerned that the standard way of sourcing functions is somehow not providing the full and correct function.

Initial run, using the initiator script only:

> pca_from_genind(data = obj
+                 , PCs_ret = 4
+                 , plot_eigen = TRUE
+                 , plot_allele_loadings = TRUE
+                 , retain_pca_obj = TRUE
+                 , colour_file = "00_archive/formatted_cols.csv"
+                 )
[1] "Converting genind to genlight"
Starting gi2gl 
Starting gl.compliance.check 
  Processing genlight object with SNP data
  Checking coding of SNPs
    SNP data scored NA, 0, 1 or 2 confirmed
  Checking locus metrics and flags
  Recalculating locus metrics
  Checking for monomorphic loci
    No monomorphic loci detected
  Checking for loci with all missing data
    No loci with all missing data detected
  Checking whether individual names are unique.
  Checking for individual metrics
  Warning: Creating a slot for individual metrics
  Checking for population assignments
    Population assignments confirmed
  Spelling of coordinates checked and changed if necessary to 
            lat/lon
Completed: gl.compliance.check 
Completed: gi2gl 
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"                           
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD 
        2 

Note that no PCA plot objects are saved out.

Then I manually sourced the pca_from_genind function and re-ran:

pca_from_genind(data = obj
                 , PCs_ret = 4
                 , plot_eigen = TRUE
                 , plot_allele_loadings = TRUE
                 , retain_pca_obj = TRUE
                 , colour_file = "00_archive/formatted_cols.csv"
                 )

This time the readout includes "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro", and the plot objects appear in the global environment:

[1] "Converting genind to genlight"
Starting gi2gl 
Starting gl.compliance.check 
  Processing genlight object with SNP data
  Checking coding of SNPs
    SNP data scored NA, 0, 1 or 2 confirmed
  Checking locus metrics and flags
  Recalculating locus metrics
  Checking for monomorphic loci
    No monomorphic loci detected
  Checking for loci with all missing data
    No loci with all missing data detected
  Checking whether individual names are unique.
  Checking for individual metrics
  Warning: Creating a slot for individual metrics
  Checking for population assignments
    Population assignments confirmed
  Spelling of coordinates checked and changed if necessary to 
            lat/lon
Completed: gl.compliance.check 
Completed: gi2gl 
[1] "Executing PCA, retaining 4 PCs"
[1] "Keeping pca.obj in global enviro"
[1] "Writing out per sample PC loading values"
[1] "Using custom colours file from 00_archive/formatted_cols.csv"                           
[1] "Saving a PCA plot as 'pc1_v_pc2.plot' into the enviro"
[1] "Saving a PCA plot as 'pc3_v_pc4.plot' into the enviro"
[1] "Plotting eigenvalues"
[1] "Plotting allele loadings"
RStudioGD 
        2

The only difference between these two runs is the manual source of the function.

Benchmark folder structure is not yet constructed to handle both microsatellite and SNP data

We currently write to the PBT folder with microsat benchmarking, which could become an issue in particular for species that have both microsatellite and SNP baselines in play. This is not as simple to fix as using a global variable, since Rmarkdown does not work well with global variables. As per:

"For better or worse, this omission is intentional. Relying on objects created outside the document makes your document less reproducible--that is, if your document needs data in the global environment, you can't just give someone (or yourself in two years) the document and data files and let them recreate it themselves.
For this reason, and in order to perform the render in the background, RStudio actually creates a separate R session to render the document. That background R session cannot see any of the environments in the interactive R session you see in RStudio."
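One workaround consistent with the quote above is to pass values into the document explicitly via params at render time, rather than relying on the global environment. A hedged sketch; the param name, value, and Rmd filename are assumptions for illustration:

```r
# In the Rmd YAML header (assumed param name):
# params:
#   datatype: "SNP"

# In a wrapper script: pass the value explicitly at render time, so the
# separate background R session does not need the interactive global
# environment. The filename here is an assumption.
rmarkdown::render(input = "baseline_benchmark.Rmd"
                  , params = list(datatype = "microsat")
                  )
# Inside the Rmd, refer to params$datatype instead of a global variable.
```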

Ghostscript not in path

Encountered the following warning when running Rmarkdown with the benchmark protocol:

In has_crop_tools() :
Tool(s) not installed or not in PATH: ghostcript
-> As a result, figure cropping will be disabled.

Not sure what is wrong here - I suspect it has to do with the way we installed Ghostscript through the Software Center. This: https://stackoverflow.com/questions/66305776/got-knit-issue-with-r as well as some others imply there is an environment variable (or PATH entry) that needs to be set.

To follow up...
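One session-level workaround to try: R's graphics tools look up the ghostscript binary via the R_GSCMD environment variable (checked by tools::find_gs_cmd()). The install path below is an assumption; adjust it to wherever the Software Center placed gswin64c.exe:

```r
# Point R at the ghostscript executable for this session.
# The path below is an assumption about the install location.
Sys.setenv(R_GSCMD = "C:/Program Files/gs/gs9.53.3/bin/gswin64c.exe")
tools::find_gs_cmd()

# pdfcrop searches PATH directly, so the bin directory may also need
# prepending to PATH for figure cropping to work:
Sys.setenv(PATH = paste("C:/Program Files/gs/gs9.53.3/bin"
                        , Sys.getenv("PATH"), sep = ";"))
```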

Sorting error in annotate_rubias

There is a sorting error introduced when running annotate_rubias() specifically with datatype == "SNP" and custom_format == TRUE. At lines 52-60:

      # Attach the pops to the individuals
      pop_code <- merge(x = pop_code, y = sample_to_pop_interp.df, by.x = "indiv.vec", by.y = "indiv", all.x = T, sort = F)
      
      # Attach the repunits to the pops
      pop_code <- merge(x = pop_code, y = sc.df, by.x = "pop", by.y = "collection", all.x = T, sort = F)
      
      head(pop_code)
      colnames(pop_code) <- c("collection", "ind", "repunit")
      pop_code <- pop_code[, c("ind", "collection", "repunit")]

...the merging of pop_code and sc.df shuffles the individual rows, and later the rows are assumed to have remained in their original order, at lines 129-130:

  # Add these together followed by genetic data
  all_data.df <- cbind(sample_type.vec, collection.vec, repunit.vec, indiv.vec, two_allele_data)
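A generic sketch of the hazard and one possible fix (toy data, not the real baseline): merge() does not guarantee input row order even with sort = FALSE, so any later positional cbind() is unsafe unless the order is explicitly restored.

```r
# Toy data illustrating the hazard described above.
pop_code <- data.frame(indiv = c("i1", "i2", "i3", "i4")
                       , pop = c("B", "A", "B", "A")
                       )
sc.df <- data.frame(collection = c("A", "B"), repunit = c("r1", "r2"))

# One fix: record the original row order before merging, restore it after
pop_code$orig_order <- seq_len(nrow(pop_code))
pop_code <- merge(x = pop_code, y = sc.df, by.x = "pop", by.y = "collection"
                  , all.x = TRUE, sort = FALSE)
pop_code <- pop_code[order(pop_code$orig_order), ]
pop_code$orig_order <- NULL
pop_code$indiv  # back to "i1" "i2" "i3" "i4", now safe for a positional cbind()
```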

Capitalization of headers - more flexibility?

We've encountered a few errors recently with header consistency, especially capitalization, in some manually created files when doing microsats. This is especially true of the repunits_full.txt file now required for some summary functions. Namely, some fields are fully capitalized (CU, CU_NAME), some are partially capitalized (Display_Order), and some are lowercase (repunit). This is leading to headaches, as these files are often manually entered.

Goal - either 1) use a script to export a template with the correct headers as a starting point; 2) create a repunits file from the rubias baseline, where all you would have to do is fill in the Display_Order column; or 3) allow the script to be more flexible with header capitalization. I think option 2 is probably best, but any of the above will work. To do!
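Option 3 could be sketched generically as a case-insensitive header normalization on read. The canonical names are taken from the issue above; the example headers are made up:

```r
# Canonical repunits headers (from the issue above)
canonical <- c("repunit", "CU", "CU_NAME", "Display_Order")

# A manually entered file with inconsistent capitalization (made-up example)
repunits.df <- data.frame(REPUNIT = "x", cu = 1, Cu_Name = "y", display_order = 1)

# Match headers case-insensitively and rewrite them in canonical form
idx <- match(tolower(names(repunits.df)), tolower(canonical))
names(repunits.df)[!is.na(idx)] <- canonical[idx[!is.na(idx)]]
names(repunits.df)  # "repunit" "CU" "CU_NAME" "Display_Order"
```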

Benchmark PDF name

Can the benchmark PDF name be sourced from the Rmd folder variable? Instead of baseline_benchmark.PDF, it could be the specific benchmark?

genepop_to_rubias() not describing properly when collections are duplicated.

JS encountered an issue where genepop_to_rubias_microsat() was spitting out an error:

Error: At least one of your input file stock names was not found in your stock code file, stopping...

This was confusing, as it was clear that they were all there.

Running it line by line, it became obvious in annotate_rubias() that there was a mismatch in df lengths. The new dataframe was longer, not shorter! After further study, this was because each collection name was assigned to two stock codes. Thus, when the merge occurred, the data were duplicated across two lines, one for each stock code.

I can fix the error in terms of better reporting the cause of the failure. But the underlying issue (multiple stock codes for a single pop in uSats) will need more thought...
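For the improved reporting, a generic sketch of checking the many-to-one assumption before merging; collection names and codes here are made up for illustration:

```r
# Each collection should map to exactly one stock code; a collection mapped
# to two codes duplicates its rows on merge. Example data are made up.
stock_codes.df <- data.frame(collection = c("Collection_A", "Collection_A", "Collection_B")
                             , code = c(101, 102, 103)
                             )

dups <- unique(stock_codes.df$collection[duplicated(stock_codes.df$collection)])
dups  # "Collection_A": report these instead of the misleading 'not found' error
```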

Need method to convert geneclass/genepop mix file to rubias input

The function genepop_to_rubias_microsat() has an option for data type as reference or mixture, but I am not certain that the script can handle the input mix file. In any case it is not tested and is not documented, so some work is needed here to improve this function.

Additional packages pdfcrop and ghostscript in baseline benchmark markdown

Upon updating R to 4.0.3, I noticed that an error was produced when trying to run the benchmark R Markdown, about missing dependencies "pdfcrop" and "ghostscript". This seems to be a result of pdfcrop being added to tinytex, but it wasn't straightforward to fix:

  1. I downloaded and installed a program called ghostscript: https://www.ghostscript.com/download/gsdnld.html; but it is also in the software catalogue for download. (search for "ghost" brings it up)
  2. install_tinytex()
  3. tinytex::reinstall_tinytex()
  4. tinytex::tlmgr_install("pdfcrop")

I thought it was isolated to me, so I didn't think too much of it, but we saw this error again this morning on another computer. So this is likely going to keep popping up.

The above is what I believe is the minimum required, but at one point tinytex::tlmgr_install() asked me to install/update some other items. It wasn't clear whether all were needed, but I did anyway, and would recommend doing so as well.

Error at diploid loci in 100% simulations

Issue with 100% simulations, which seems to be due to record names such as allele 1 ("ots_epic4_158_1") and allele 2 ("ots_epic4_158_1_1"). This potentially needs to be corrected in terms of marker names, ensuring that marker names do not have _1 at the end.
Not clear how to resolve this in the meantime, other than deleting the offending record.

full_sim(rubias_base.FN = "S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt"
+         , num_sim_ind = 200, sim_reps = 100
+         )
Parsed with column specification:
cols(
  .default = col_double(),
  sample_type = col_character(),
  collection = col_character(),
  repunit = col_character(),
  indiv = col_character(),
  ots_epic4_158_1 = col_logical(),
  ots_epic4_158_1_1 = col_logical()
)
See spec(...) for full column specifications.
Warning: 3 parsing failures.
 row               col           expected actual                                                                                                                  file
1234 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1243 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1269 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'

[1] "Generating counts per collection and per repunit"
[1] "Performing 100% simulation"
[1] "Running simulation and assessing the simulation assignments"
[1] "This will include a total of ***53*** scenarios"
Error in input.  At diploid loci, either both or neither gene copies must be missing. Offending locus = ots_epic4_158_1
	Note! This might indicate that the gen_start_col is incorrect.
 Error in get_ploidy_from_frame(tmp) : 
  Bailing out due to single gene copies being missing data at non-haploid loci. 
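A possible pre-flight check for this: flag marker base names that themselves end in "_1", since they collide with the "_1" suffix used for the second gene copy column. The first and third locus names below are made up for illustration:

```r
# Flag marker base names ending in "_1", which collide with the "_1"
# second-allele column suffix. Only ots_epic4_158_1 is from the real data.
loci <- c("ots_epic4_101", "ots_epic4_158_1", "ots_myod_15")
bad <- grep("_1$", loci, value = TRUE)
bad  # "ots_epic4_158_1" - rename this marker before writing the baseline
```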

Benchmark PDF is sent to github folder not target

Need to either (a) re-direct to the working folder, likely via an rmarkdown::render() wrapper script; or (b) create an automated method that collects the output and puts it in the target folder after knitr finishes (e.g., a complete_benchmark() function could do this using paths in the environment).
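A hedged sketch of option (a): a wrapper around rmarkdown::render() that sends the output to the target folder. The Rmd filename and the result.path variable are assumptions for illustration:

```r
# Hypothetical wrapper: render the benchmark Rmd into the target folder
# rather than the repo folder. 'result.path' is assumed to hold the
# target directory in the environment; the Rmd filename is an assumption.
render_benchmark <- function(rmd = "baseline_benchmark.Rmd"
                             , out_dir = result.path
                             ){
  rmarkdown::render(input = rmd
                    , output_dir = out_dir
                    )
}
```

The output_file argument of rmarkdown::render() could similarly address the benchmark PDF naming issue above, by building the filename from the species or benchmark name.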

drop_loci is not removing all monomorphic loci

The function drop_loci(df = obj, drop_monomorphic = TRUE) is not dropping all monomorphic loci. An example: if data are read into R as a genind object, then filtered by dropping all samples that carry the alternate allele, the locus will then be monomorphic but may still have a second column for the second allele.
The issue is here:
https://github.com/bensutherland/simple_pop_stats/blob/master/01_scripts/utilities/drop_loci.r
which(nAll(df)==1)

There is an option to this formula that requires that the alternate allele is actually observed.
Adding this as an issue for documentation.
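A hedged sketch of that fix, assuming adegenet is loaded and df is a genind object: recent adegenet versions give nAll() an onlyObserved argument, so counting only observed alleles should also catch loci left with a phantom second-allele column after sample filtering. The function name is hypothetical:

```r
# Hypothetical helper, assuming adegenet is loaded and 'df' is a genind.
# nAll(df, onlyObserved = TRUE) counts only alleles actually seen in the
# data, catching loci that became monomorphic after sample filtering.
drop_truly_monomorphic <- function(df){
  keep <- which(nAll(df, onlyObserved = TRUE) > 1)
  df[loc = keep]   # genind subsetting by locus
}
```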

Need to only retain one data point per pairwise contrast in compare_physical_genetic dist

I suspect that the comparison of physical and genetic distance is duplicating contrasts, e.g., pop1 vs. pop2 AND pop2 vs. pop1 are both retained.
This would be a problem for the model (overloading it with identical data points) and for the plot.
The total number of comparisons should be shown as a QC step so the user is informed of which contrasts are being made.
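A generic sketch of the de-duplication plus the QC count, using toy distances: taking the lower triangle of the pairwise matrix keeps exactly one row per unordered pair.

```r
# Toy symmetric pairwise distance matrix (made-up values)
phys.mat <- matrix(c( 0, 10, 20,
                     10,  0,  5,
                     20,  5,  0), nrow = 3,
                   dimnames = list(c("pop1", "pop2", "pop3"),
                                   c("pop1", "pop2", "pop3")))

# Keep one data point per unordered pair: the lower triangle only
keep <- which(lower.tri(phys.mat), arr.ind = TRUE)
contrasts.df <- data.frame(pop_a = rownames(phys.mat)[keep[, "row"]]
                           , pop_b = colnames(phys.mat)[keep[, "col"]]
                           , dist = phys.mat[keep]
                           )

# QC step: report the number of contrasts, which should equal choose(n, 2)
message(nrow(contrasts.df), " pairwise contrasts retained")  # 3 for 3 pops
```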

LaTeX installation instructions needed on README

Using the baseline benchmarking knit command, I receive the following error, which I think occurs because I don't have a latex rendering program on my Windows machine.

output file: baseline_benchmark.knit.md

"C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS baseline_benchmark.utf8.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output baseline_benchmark.tex --self-contained --highlight-style tango --latex-engine pdflatex --variable graphics --include-in-header "C:\Users\SUTHER~1\AppData\Local\Temp\2\RtmpINJRG4\rmarkdown-str3c4868ed51a5.html" --variable "geometry:margin=1in" 
Error: LaTeX failed to compile baseline_benchmark.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips.
In addition: Warning message:
In system2(..., stdout = if (use_file_stdout()) f1 else FALSE, stderr = f2) :
  '"pdflatex"' not found
Execution halted

No TeX installation detected (TeX is required to create PDF output). You should install a recommended TeX distribution for your platform:

  Windows: MiKTeX (Complete) - http://miktex.org/2.9/setup
  (NOTE: Be sure to download the Complete rather than Basic installation)

  Mac OS X: TexLive 2013 (Full) - http://tug.org/mactex/
  (NOTE: Download with Safari rather than Chrome _strongly_ recommended)

  Linux: Use system package manager
