Hi Thierry, A few small bugs I've noticed with filter_rad() in inter

Do and Waples developed NeEstimator (<a href="https://onlinelibrary.wiley.com/doi/abs/

<input type="checkbox" id="" disabled=""

filter_rad issues,about thierrygosselin/radiator

Comments (72)

kevinmneal commented on June 6, 2024 1

Do and Waples developed NeEstimator (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12157), which includes LDNe, molecular coancestry, and heterozygosity excess methods. Will give ldne in strataG a try - I'd been looking for an Ne implementation in R, thanks for pointing it out, and it looks like they implement some bias corrections for using thousands of SNPs that I don't think NeEstimator has even implemented yet: https://www.nature.com/articles/hdy201660

So far, clouds of closekin are almost always between 0.25 and 0.75

😅 I've since removed that top outlier, at least... my original relatedness filtering was based off of KING (implemented as --relatedness2 in vcftools) (https://academic.oup.com/bioinformatics/article/26/22/2867/228512) I think most of that cloud from 0.25-0.5 is from some artificial pools that are derived from a very small sample, though, so that's probably demonstrative of a founder effect.

Of course, whether it's appropriate to purge close relatives is still up for debate: https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14022

from radiator.

thierrygosselin commented on June 6, 2024 1

Will update radiator very soon with these changes:
Will now import and parse the VCF metadata CATG

Screenshot of the doc for the metadata

Screenshot during function execution to know what to expect

The CATG columns are automatically matched with REF and ALT columns to generate ALLELE_REF_DEPTH and ALLELE_ALT_DEPTH columns... this is what you need for radiator::detect_allele_problems
Documentation for the markers metadata parsing is also updated...

Screenshot of the doc for VCF import and parsing of markers metadata

from radiator.

thierrygosselin commented on June 6, 2024 1

I removed the top line (but it might have been something else, the top line pas just markers, not accounting for the first 6 columns...).

genomic_converter: should work too

from radiator.

thierrygosselin commented on June 6, 2024 1

Working non stop for 2weeks
Workshop is next week, got to have radiator ready ;)
https://thierrygosselin.github.io/genomics-workshops/

from radiator.

chasesmith15 commented on June 6, 2024 1

8 cores on my PC

from radiator.

thierrygosselin commented on June 6, 2024 1

Hi Kevin, I'm closing the issue, because I think everything was covered. If I missed something, let me know by reopening it.

from radiator.

kevinmneal commented on June 6, 2024 1

Hi Thierry,

Hadn't used filter_rad/genomic_converter in a while but reran filter_rad today with the latest version of radiator and still get this error that I brought up a month or so ago: `. The genomic_converter step of filter_rad Any ideas?

Transferring data to genomic converter...
Synchronizing data and strata...
    Number of strata: 1
    Number of individuals: 153

Writing tidy data set:
sphasouth153spatial.nohybrids.radiator.50pctmsng.oneSNPmac3.rad
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 10
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 10
Error in .f(.x[[i]], ...) : object 'GT' not found

Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):

##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
1	locus_100029__100029__50	NA	C	T	.	PASS	NS=115	GT	
1	locus_100077__100077__49	NA	A	G	.	PASS	NS=174	GT	
1	locus_10009__10009__11	NA	C	T	.	PASS	NS=178	GT
1	locus_100137__100137__36	NA	G	T	.	PASS	NS=163	GT	
1	locus_100170__100170__22	NA	T	A	.	PASS	NS=102	GT	
1	locus_100203__100203__29	NA	C	T	.	PASS	NS=108	GT

This is what the input vcf looks like:

##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
locus_100006	5	.	G	A	13	PASS	NS=120;DP=2002	GT:DP:CATG	
locus_100014	8	.	G	T	13	PASS	NS=251;DP=8028	GT:DP:CATG	
locus_100023	12	.	C	A	13	PASS	NS=95;DP=2406	GT:DP:CATG	
locus_100029	50	.	C	T	13	PASS	NS=173;DP=3987	GT:DP:CATG	
locus_100038	19	.	T	C	13	PASS	NS=261;DP=9508	GT:DP:CATG	
locus_100042	50	.	C	A	13	PASS	NS=91;DP=887	GT:DP:CATG

from radiator.

thierrygosselin commented on June 6, 2024

Hi Kevin, thanks a lot for the bug report. I don't have a PC nearby so it's very useful, I'm giving a workshop in 2 weeks and 19 out of the 25 participants have PCs...

I'll break the reply on topics...

1. Font Helvetica: just realized that my favorite font is not pre-installed on PCs. Snif. Will find a replacement found in all 3 OS...

so this one is done. Completely removed Helvetica from radiator

from radiator.

thierrygosselin commented on June 6, 2024

2. outliers stats
Figure that says it all.
The outliers are values outside the 1.5 x IQR range.
Sometimes useful for filtering, sometimes not.
They are there to help guide your decision.
Careful in Mac/maf, coverage, snp number per locus and position on the read.
It's obviously normal to have lots of dotted lines (outliers inside a Boxplot showing MAC)...
This will be explained further in the vignette coming max next week.
In radiator the dotted lines in the Boxplots are actually the outliers. It's faster computationally to do it this way.

from radiator.

thierrygosselin commented on June 6, 2024

3. filter_coverage

a plot for min mean would be useful
It's on my todo list to have the helper plot, it's there, just not outputting it.
But it's less important or rational, because it all falls down to your comfort with low coverage...
Personally I don't go below 7, because of null allele, heterozygosity, etc.

3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.

I'll check that at the same time. But it wouldn't be the first time a pipeline says it's filtering x and it doesn't... do you know if it's <= or <, because for your dataset that would be a big deal (see also my observation/comment of Minot Allele Count below...

You could try this with your dataset and the latest release:

data <- radiator::tidy_vcf(data = "OCall208_75pctmsng.recode.vcf",
                           strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
                           filter.common.markers = FALSE,
                           verbose = TRUE)

cov.markers <- data %>%
  dplyr::group_by(MARKERS) %>%
  dplyr::summarise(
    MEAN = mean(READ_DEPTH, na.rm = TRUE),
    MEDIAN = median(READ_DEPTH, na.rm = TRUE)
    )

dplyr::filter(cov.markers, MEAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEAN < 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN < 6) %>% nrow(.)

You should get these numbers for the unfiltered dataset:

Now both helper plot (low and high are printed)
Using these thresholds:

Filter mean coverage thresholds: 7 / 100
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 5600 / 5600 / 6965
    Blacklisted: 0 / 0 / 1773 / 1773 / 2248
    After: 208 / 26 / 3827 / 3827 / 4717

There's definitely things to test with that filter and downstream analysis.
I know some people who wouldn't do closekin with markers below 20 of coverage ... but that's up for debate...

from radiator.

thierrygosselin commented on June 6, 2024

4. In filter_genotyping, is the threshold applied per-strata or only on the total?
It's overall.
The filter used to be able to have a pop threshold (some call that a joker).
This is actually generating bias and exacerbating pattern of missingness.

The reason I show the missing per strata, the mean value and overall is that it helps to see if the populations are "equal". You can already predict if a pattern of missingness is present from this filter.

Remember, their is nothing missing unless you compare it with something...

The course of action when some populations drags down the polymorphism discovery is to remove them, do the downstream analysis. Re-introduce the population and see if it changes something.

The impact depends on the analysis:

for close kin analysis it's noticeable
for population structure: depends on the number of markers and degree of differentiation. The weaker the more important.
pop. assignment: important
if you see a trend between bubble size (missing) and heterozygosity... it's usually because something went wrong in the wet lab...

Update: so that filter no longer crash RStudio on my end when the number of samples for a pop is below the number of CPU used...

from radiator.

thierrygosselin commented on June 6, 2024

5. outputting the full function call with args entered during the interactive session would help with reproducibility

You should see it if verbose = TRUE for main function. After, for the modules inside it's printed in the file with _args_ in it.

Additionally, the values for threshold, along CHROM/LOCUS/SNP kept and blacklisted for each filtering steps are written in the filters_parameters file in the 01_radiator folder

from radiator.

thierrygosselin commented on June 6, 2024

6. asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time

Most are fast enough: I usually test with 2M markers

As for HWE it's the only filter I'm not confortable 100% in removing markers. You really have to know what you're doing. Consequently, you should get the message Do you want to continue with the filtering ? after the plots are produced...

More on that filter below...

The only other now I saw useful to do it is the pruning of markers based on long ld... it also ask if you want to proceed or not. Because it's usually fast with a reference genome but can be long for de novo data.

from radiator.

thierrygosselin commented on June 6, 2024

7. Long LD filtering appears to work but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering?

It should work. No need to collapse CHROM. Likely a bug. with your VCF.
I'm testing it with and without a genome, so I'll run it again to see if I broke something.
WITH MISSING is currently only available WITH a reference genome. It's a computational issue for now.

The problem you're seeing is it with the VCF you sent me yesterday ? If so I'll do the test with it.

update so do you really have a reference genome or an approach using a catalog of individuals as pseudo-reference ? It looks like the info in CHROM are more like locus than scaffolds. You have 6000 of them almost as much as the markers.

Ok so I checked ipyrad doc and the options are described here

options:

Change the CHROM info: I'll make available a series of codes in the function: filter_ld examples that will show how to easily change the CHROM column in your GDS object and/or tidy data. So you could skip the long LD step in filter_rad... and do it separately in filter_ld.
Tell me if there is a better way to code the CHROM right from the start...

Code to change the chromosome information:

#Change the CHROM column-------------------------------------------------------
gds <- read_rad("full path to gds or .rad.gds file")
radiator::summary_gds(gds)

markers.meta <- radiator::extract_markers_metadata(gds)

# Keep a back up
markers.meta.bk <- markers.meta

# change the CHROM column
markers.meta %<>%
  dplyr::mutate(CHROM = "chrom_1")

# Update the GDS markers metadata (this is the radiator part)
update_radiator_gds(
  gds = gds,
  node.name = "markers.meta",
  value = markers.meta,
  replace = TRUE
)

# Update the chromosome info in GDS (GDS, SeqArray part)
update_radiator_gds(
  gds = gds,
  radiator.gds = FALSE,
  node.name = "chromosome",
  value = markers.meta$CHROM,
  replace = TRUE
)
#Ready for LD without the code running by scaffolds...

from radiator.

thierrygosselin commented on June 6, 2024

8. Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.

Yes, a pain.
I have a fix for this that tracks if it's been done. It's always a problem with low coverage data + the removal of samples. It generates more monomorphic markers and requires recalibrating REF/ALT...

my plan in the next couple of weeks is to have a C++ code do it faster.

from radiator.

thierrygosselin commented on June 6, 2024

9.in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.

Sadly, no easy way.
I have a solution in the vignette that shows how to start with the GDS file, removes some filters, etc.

The other solution might be for you to just use the filter modules you want and use magrittr %>% in-between.

Todo list: use shiny and let use pick filters on the left side and click and enter thresholds as the modular pipeline goes...

from radiator.

thierrygosselin commented on June 6, 2024

10. when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.

Bug
I'll check the combination mentioned above. It's not for n < 10 (that's tested) not tested is when you enter to the same number of pops.

Usually with asterisk, if entering e.g. ***** returns no filtering, the interactive version will prompt to enter a new value... that might trigger an ever ending loop. I'll check.

from radiator.

thierrygosselin commented on June 6, 2024

11.detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:

Good catch.

from radiator.

thierrygosselin commented on June 6, 2024

12. May need to remove strata=NULL from filter_hwe in filter_rad():

gds <- filter_hwe(data = gds, interactive.filter = interactive.filter, 
    filter.hwe = filter.hwe, strata=NULL, hw.pop.threshold = hw.pop.threshold, 
    midp.threshold = midp.threshold, parallel.core = parallel.core, 
    parameters = filters.parameters, path.folder = wf, verbose = verbose, 
    internal = FALSE)

That one is actually necessary with other type of datasets.
I don't think it's the reason you experience problem with this function.

from radiator.

kevinmneal commented on June 6, 2024

1- Arial is probably the safest bet, a near-copy of Helvetica
3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.
6- I think it's going through the HWE calculations before asking if you want to filter
7- Yes, with that VCF. The issue might be that I'm using filter.common.markers=FALSE (at least for some analyses, I'd like to keep as many SNPs per population as possible, even if not common, mostly for doing effective population size calculations). I think when I did have it =TRUE that it was filtering properly.
10- Yeah, it became a never-ending loop regardless of what I entered. Sorry I don't have the parameters that triggered it, but if it comes up again I'll report back
13 (below)- Maybe that was being caused by my removing strata=NULL from filter_hwe, hmm. EDIT: Although it got an error after the detect_mixed_genomes step too. EDIT2: Yup, tried after putting strata=NULL back, still get the error.

Thanks!

from radiator.

thierrygosselin commented on June 6, 2024

13. Any filter that removes samples/individuals causes failure in the next filtering step:

Bug. I don't see this with my test dataset.
Will check with your dataset

Awesome

from radiator.

kevinmneal commented on June 6, 2024

Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?

from radiator.

thierrygosselin commented on June 6, 2024

Will have a new commit with fixes by the end of the day
When you run this

test1 <- radiator::read_vcf(
     data = "OCall208_75pctmsng.recode.vcf",
     strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
     verbose = TRUE)

On my Mac I'm getting this at the end:

Computation time, overall: 11 sec
########################### completed read_vcf ###########################
Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs,  :
  1 function calls resulted in an error

It's likely related to some SeqArray function I'm using and some don't like when the number of CPU is lower than the number of pop, so I have a fix for it, but are you getting a similar message (it won't be
mclapply, but something else)

from radiator.

thierrygosselin commented on June 6, 2024

Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?

genome fasta was used for alignment right ?
I'm not sure I understand how this is related to your point 7 for the common markers ?

from radiator.

kevinmneal commented on June 6, 2024

test1 <- radiator::read_vcf(
data = "OCall208_75pctmsng.recode.vcf",
strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
verbose = TRUE)

For me on Windows, the same input returns the error

Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx,  : 
  One of the nodes produced an error: Can not open file 'C:\Users\kevin\Documents\Rtest\read_vcf_20190320@1141\01_import_gds\[email protected]'. The process cannot access the file because it is being used by another process.

I did use a genome fasta for alignment. I was wondering if incorporating that fasta into radiator operations added more information than is found in the VCF, but probably not, so ignore that.

Also, I did run more tests on long LD filtering with filter.common.markers either TRUE or FALSE, using long.ld.missing TRUE or FALSE. No combination seems to be filtering, so I'm really not sure what to do now. I think I might have gotten it to successfully filter once, but I don't remember the parameters there again (I really should be better at logging, but I've been running this dozens of times.)

from radiator.

kevinmneal commented on June 6, 2024

More info, if it helps with the long LD filtering:

When I run using long.ld.missing=TRUE, no SNPs are filtered, and this plot is the output:

When I run using long.ld.missing=FALSE, no SNPs are filtered (regardless of threshold), and this plot is the output:

from radiator.

kevinmneal commented on June 6, 2024

One more thing, an error I get after filtering when filter_rad passes results to genomic_converter; looks like it's an issue with the pcadapt output. Outputs up to that point are still being written, but none of the ones listed after in my filter_rad input.

Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
File written: blacklist.id.tsv

Generating statistics after filtering
calculating individual stats...
[==================================================] 100%, completed in 0s
Extracting DP information...
File written: individuals qc info and stats summary
File written: individuals qc plot
calculating markers stats...
Extracting DP information...
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s

Transferring data to genomic converter...
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 15
Synchronizing data and strata...
    Number of strata: 4
    Number of individuals: 208

Writing tidy data set:
OCall208.radiator.75pctmsng.oneSNPmac.rad
    * Number of sample pop, np = 4
    * Number of markers, nl = 6155
    * The highest number used to label an allele, nu = 4
    * The alleles are encoded with one digit number
Generating BayeScan file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1557

::radiatorfilter_common_markers function call arguments:
    data = tbl_df
    filter.common.markers = TRUE
    fig = FALSE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_common_markers:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_common_markers:
    parameters = NULL
    path.folder = NULL


Scanning for common markers...

Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1557

::radiatorfilter_monomorphic function call arguments:
    data = tbl_df
    filter.monomorphic = TRUE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_monomorphic:
    parameters = NULL
    path.folder = NULL


Scanning for monomorphic markers...

Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
Calibrating REF/ALT alleles...
    generating REF/ALT dictionary
    integrating genotypes codings...
writing BayeScan file with:
          Number of populations: 4
    Number of individuals: 208
    Number of biallelic markers: 6012
Writting populations dictionary
Writting markers dictionary
Generating pcadapt file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1604

::radiatorfilter_common_markers function call arguments:
    data = tbl_df
    filter.common.markers = TRUE
    fig = FALSE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_common_markers:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_common_markers:
    parameters = NULL
    path.folder = NULL


Scanning for common markers...

Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1604

::radiatorfilter_monomorphic function call arguments:
    data = tbl_df
    filter.monomorphic = TRUE
    parallel.core = 3
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
    internal = TRUE

Default "..." arguments assigned in ::radiatorfilter_monomorphic:
    parameters = NULL
    path.folder = NULL


Scanning for monomorphic markers...

Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
################################################################################
############################## radiator::filter_ld #############################
################################################################################
Execution date@time: 20190320@1604

filter_ld function call arguments:
    interactive.filter = TRUE
    data = tbl_df
    filter.short.ld = NULL
    filter.long.ld = NULL
    parallel.core = 3
    filename = NULL
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in filter_ld:
    ld.method = r2
    long.ld.missing = FALSE

Default "..." arguments assigned in filter_ld:
    internal = FALSE
    ld.figures = TRUE
    parameters = NULL
    path.folder = NULL
    subsample.markers.stats = NULL


File written: [email protected]

Interactive mode: on

Step 1. Short distance LD threshold selection
Step 2. Filtering markers based on short distance LD
Step 3. Long distance LD pruning selection
Step 4. Threshold selection
Step 5. Filtering markers based on long distance LD


Filters parameters file generated: [email protected]
Filters parameters file: initiated
Minimizing short distance LD...

There is no variation in the number of SNP/locus across the data

Error: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
Call `rlang::last_error()` to see a backtrace
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Computation time, overall: 1 sec
############################# completed filter_ld ##############################

Computation time, overall: 1519 sec
############################# completed filter_rad #############################
> rlang::last_error()
<error>
message: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
class:   `rlang_error`
backtrace:
 1. radiator::filter_rad(...)
 2. radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
 3. radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
 4. radiator::filter_ld(...)
 5. radiator::radiator_parameters(...)
 6. tibble::tibble(...)
 7. tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
 8. tibble:::check_valid_cols(x)
Call `rlang::last_trace()` to see the full backtrace
> rlang::last_trace()
    x
 1. \-radiator::filter_rad(...)
 2.   \-radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
 3.     \-radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
 4.       \-radiator::filter_ld(...)
 5.         \-radiator::radiator_parameters(...)
 6.           \-tibble::tibble(...)
 7.             \-tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
 8.               \-tibble:::check_valid_cols(x)

Input was

OCall208.filtered <- radiator::filter_rad(data = OCall208.tidy, 
                                          strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata_K3_ARTIF.txt",
                                          verbose = TRUE,
                                          filename="OCall208.radiator.75pctmsng.oneSNPmac",
                                          parallel.core = 1,
                                          fig.upsetr=TRUE,
                                          filter.common.markers=FALSE,
                                          output=c("vcf", "genepop", "tidy", "plink", "genind", "genlight", "structure", "hierfstat", "bayescan", "betadiv", "pcadapt", "hzar", "related", "maverick"),
                                          path.folder="G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphaOCclust85/OCall208.radiator.filtered")

from radiator.

thierrygosselin commented on June 6, 2024

pcadapt : the function will only keep common markers and discard monomorphic markers automatically. Will check that one first.
maverick: Will no longer work. I have to check what changed with the rmaverick.

from radiator.

thierrygosselin commented on June 6, 2024

fig.upsetr: forget about it. I now detect automatically if the number of samples will break UpSetR

from radiator.

thierrygosselin commented on June 6, 2024

Will have a solution strategy soon for your uneven sample size dataset...
But you're not really going to do Ne on the populations with less than 15 samples ?

What's the specie?
Also, just for my brain and correlations: Sequencing technology used and type of RAD wet lab technique ?

from radiator.

thierrygosselin commented on June 6, 2024

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 4675 / 4675 / 6327
    After: 208 / 26 / 1985 / 1985 / 2392

You really don't want that one...

from radiator.

kevinmneal commented on June 6, 2024

I've been getting Ne values out of NeEstimator with pretty narrow parametric confidence intervals on ~10 samples (jackknife CI is bigger though...), but I have been worried about bias which was a big reason for why I wanted to use radiator :) I have a few sites where I have 20-30 individuals and will compare results between using all individuals in a pond vs. using 10, but since they're tadpoles usually sampled in one trip, I worry about very high relatedness (which is why most ponds have ~10 now, after filtering for relatedness). I should try getting Ne using Colony but that program is not the most user-friendly... I'm also calculating Ne for genetic clusters in addition to pond-level.

Species is the western spadefoot toad, Spea hammondii. Sequenced with Illumina HiSeq 4000, wet lab technique was 3RAD (ddRAD but with a third enzyme to cut primer dimers) https://www.biorxiv.org/content/10.1101/205799v3

I've been trying out using common markers on a different strata, on K=3 clusters + a category for artificial ponds (so 4 strata) that I got from FastStructure. When I do that, the filtering is more reasonable:

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 4 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 148 / 148 / 193
    After: 208 / 4 / 6512 / 6512 / 8526

from radiator.

thierrygosselin commented on June 6, 2024

Have you tried ldne ?
Eric Archer package strataG as a nice ldNe interface... ??strataG::ldNe my conversion function for gtypes won't work, he updated the internal structure and I'm currently updating it. But you could easily do it from a genind object with his conversion function.

from radiator.

thierrygosselin commented on June 6, 2024

Colony: I understand, and it's not the fastest.
My output used to work because I ran a lot of dataset on colony to get closekin

However, you can get a pretty good estimate of how close your samples are with the detect_duplicate_genomes (it's running automatically in filter_rad). So far, clouds of closekin are almost always between 0.25 and 0.75

from radiator.

thierrygosselin commented on June 6, 2024

filter for common markers
I wouldn't use it for now with low numbers like that, you'll likely throw out good markers.
Skip that one.

Toad: very cool

from radiator.

thierrygosselin commented on June 6, 2024

I love that function... it's basically an individual tree, but Manhattan style, which is way easier to highlight outlier... and so far you would believe how much are out there in published dataset...

With some of my own tuna datasets, removing duplicates or very close kin revealed another K (group) in DAPC and stockR... it was the filter with the biggest impact

from radiator.

thierrygosselin commented on June 6, 2024

FYI Eric Archer and Robin have work closely on some projects, so I would trust the implementation

KING: be careful with that one, I compared results with colony and KING and some of KING results made no sense.

It's also implemented in SNPRelate (that uses GDS that radiator supports).

from radiator.

thierrygosselin commented on June 6, 2024

As for purging those close relatives, I think it really depends on the number of close kin, the dataset, the number of markers and differentiation... it's been a while I read that paper and remember having talked about it to both authors...

You probably read this one:
Sampling related individuals within ponds biases estimates of population structure in a pond‐breeding amphibian by O'Connell et al.

from radiator.

thierrygosselin commented on June 6, 2024

lots of close relative in there... the relationship will be interesting to highlight...With colony you could also have a good idea on mating success and polygynandry

.

from radiator.

thierrygosselin commented on June 6, 2024

About minor allele count/frequency...
I don't know what you used in ipyrad or in radiator but the very minimum...is probably 3...

Filter mac threshold: 3
Number of individuals / strata / chrom / locus / SNP:
    Before: 208 / 26 / 6660 / 6660 / 8719
    Blacklisted: 0 / 0 / 1060 / 1060 / 1754
    After: 208 / 26 / 5600 / 5600 / 6965

And if everything before that was unfiltered, like in this example above, you really drop a lot of markers...

A MAC = 3 is like saying:
Overall samples, you need:
3 heterozygote individuals or
1 heterozygote individual + 1 homozygote for the ALT allele
to keep a marker
The minimum number of samples is 2

A MAC = 2 can potentially mean just 1 sample ... so you need good coverage and confidence.
If you use 2, check the coverage for those alternate alleles

from radiator.

kevinmneal commented on June 6, 2024

Yeah, the manhattan cloud is awesome. That stockR figure is really interesting. In my case, removing close relatives collapsed clusters--it was forcing a bunch of artificial pools derived from a single site (geographically and phylogenetically within another cluster) into their own cluster at K=2. Removing close relatives brought the artificial pools back in with the expected cluster. My guess is due to founder effect making the pools look far more differentiated than reality since they aren't very distant phylogenetically; could have been an issue of uneven sampling though. That cloud from 0.25-0.5 is almost all from these same artificial pools.

Re: MAC:
I've been using MAC=2 (the vcf I sent you used mac=2 in vcftools), though for the same reason you explain, I have been considering MAC=3. My clustering results don't really change whether using 2 or 3, though. Vcftools also has an option of filtering both singletons and private doubletons, which I also toyed with, but decided a single homozygous individual with the ALT allele still represents real genetic diversity in the population (assuming good coverage, yes) - I'll check coverage on those, but might just do mac=3 like you suggest for simplicity

from radiator.

thierrygosselin commented on June 6, 2024

I have a function that highlight coverage for lower MAC... will find it. It was quick. If there's good coverage keep them. But testing with or without and looking at downstream impact is the best and fastest options.

detect_mixed_genomes is probably my favorite. I'm working with Eric Anderson on heterozygote miscall rate... and this one is good to highlight wet-lab problems. But also find cryptic species and all sorts of coverage related problems

from radiator.

thierrygosselin commented on June 6, 2024

Having lots of homozygotes generates something similar to closekin... it generates false structure.
Having lots of heterozygotes is the opposite... because having 2 copies of each alleles makes you close to everybody.. you can see it easily in output of the function mentioned above

from radiator.

kevinmneal commented on June 6, 2024

There's a recent paper that tests maf/mac thresholds and their effect on structure and comes to the same conclusion, that rare alleles in general mess up structure: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12995 so I'll do mac=3...

That function would be great to use if you find it, thanks!

from radiator.

thierrygosselin commented on June 6, 2024

I did the move of switching my MAF filter to MAC filter just a bit before reading that paper when it was underpress or biorxiv, don't remember, but I'm glad I did. It's really a nice read.

from radiator.

thierrygosselin commented on June 6, 2024

Question for coverage for low Mac...
with ipyrad: do you have coverage for REF/ALT allele or just the read depth of the genotypes ?

from radiator.

thierrygosselin commented on June 6, 2024

Because I have a function, not tested with your dataset, and it's been a while, work with tidy data, will update to work with GDS as well, but I think it requires there coverage for both alleles (which makes sense)...

??radiator::detect_allele_problems

from radiator.

thierrygosselin commented on June 6, 2024

Individual's heterozygosity

on the unfiltered data
that tell almost the complete story :)

individual.heterozygosity.manhattan.plot.pdf

from radiator.

kevinmneal commented on June 6, 2024

The ipyrad vcf contains read depth for each base pair in the CATG format (this is custom to ipyrad, I believe). I wanted to code up a way to calculate REF/ALT depths from those based on the REF/ALT base pair columns, but figured it would take me too long/got distracted/not sure I'd actually know how to. A quick perusal of the vcf does show some genotypes where there's e.g. 50 reads for A and only 1 for C; not exactly sure how ipyrad/downstream software are calling that as a homozygote vs heterozygote.

I'll try out that function, thanks Thierry.

That het manhattan plot is a beauty and really does summarize a lot. All those IMXX sites are the artificial ones, very low heterozygosity. SHOE is artificial too, but very close to some natural ponds and likely some gene flow with those. Most of the other sites with low heterozygosity (CCSP, LAGUN, MORO, BBEND) are themselves likely a result of a relatively recent natural founder event, geographically isolated from the others. The sites with high heterozygosity are in more suitable, connected habitat.

from radiator.

thierrygosselin commented on June 6, 2024

What you don't want to see is bubble size correlated with the patterns of heterozygosity
Here it's all good, I'll send by email a couple of bad ones, you'll understand

from radiator.

thierrygosselin commented on June 6, 2024

CATG oh ok that's easy then...
give me 30 min I'll draft something

Ok so I'll now parse the CATG column into 4.
What's nice is that you can re-calibrate the alleles REF/ALT based on the actual count of alleles, not just individuals. Which is way better and less biased for low coverage data.

from radiator.

kevinmneal commented on June 6, 2024

Thanks 👍

I feel bad making all these requests--I hope they've been useful and not tedious--but one other request if it's easy to implement: faststructure format output. Not sure why it's necessary, but for faststructure they ignore the first 6 columns, so would need to add 5 blank columns before the first marker column. Otherwise it's just a standard structure file with two rows per individual. (Alternatively, faststructure accepts a .bed file)

from https://rajanil.github.io/fastStructure/

While the original Structure program allowed for a more flexible input format, fastStructure expects a more specific Structure-like input format. Specifically, rows in the data file correspond to samples, with two rows per sample (note that only diploids are handled by this software), and columns correspond to SNPs. The first 6 columns of the file will be ignored; these typically would include IDs, metadata, etc. This software only handles bi-allelic loci. The two alleles at each locus can be encoded as desired; however, missing data should be encoded as `-9'.

An example of a faststructure file I made by converting a vcf to faststructure in PGDSpider (faststructure just ignores the numbers there in the extra columns, I believe):

OCcoastsub48_85clust_75pctmsng_oneSNPmac2_20190205.recode.str.txt

Immensely grateful for all the fixes and help you've given me so far, thanks Thierry

from radiator.

thierrygosselin commented on June 6, 2024

Look at this link for faststructure ;)
Search my name on the page... was using it a bit a while ago...
I should have an output for it somewhere or it was importing one already outputted by radiator.
Will check

from radiator.

kevinmneal commented on June 6, 2024

Accidentally clicked close, whoops. Reopened.

Fantastic, looking forward to the update!

from radiator.

thierrygosselin commented on June 6, 2024

write_faststructure: available as standalone. Let me know if it works as expected, will include it as an option in genomic_converter after

from radiator.

kevinmneal commented on June 6, 2024

I'll test it out now and report back

from radiator.

kevinmneal commented on June 6, 2024

When I run faststructure with a file written by write_faststructure, I get this error:

Traceback (most recent call last):
  File "/mnt/Data5/kevinRAD/RAD_2018-02-01/ipyrad_min100/highqual_min220_outfiles/faststructure/fastStructure/structure.py", line 172, in <module>
    G = parse_str.load(params['inputfile'])
  File "parse_str.pyx", line 10, in parse_str.load
    L = loci.shape[1]
IndexError: tuple index out of range

Also, I have to change the file suffix to ".str" or else faststructure won't read it.

The faststructure output:
OCreduced148.radiator.75pctmsng.oneSNPmac3.faststructure.txt

from radiator.

kevinmneal commented on June 6, 2024

Actually, I checked my other faststructure files; they don't have the locus name header. I just removed the locus name header/the entire first line, and it's running now

from radiator.

kevinmneal commented on June 6, 2024

Hi Thierry

Trying out the latest version (only just updated an hour ago; from the version you last posted 6 days ago that added faststructure). I'm getting errors with genomic_converter for genepop, hierfstat, and structure: Error in .f(.x[[i]], ...) : object 'GT' not found. My hunch is it is because my strata again has populations with only 1 individual (I'd removed them in the previous dataset, but haven't done so for the dataset I'm using here, I'll test it out adding the single individuals to a larger population) UPDATE: still get the errors.

Also a different error when writing betadiv, and plink :

Generating betadiv object
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 9
Error in .f(.x[[i]], ...) : object 'POP_ID' not found

Full function call and return below. Note this is a slightly different dataset than what I've already sent you (different individuals over a wider geographic range, but sequencing and assembly were the same)

> genomic_converter(data = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad", 
+                                                                   strata = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt",
+                                                                   verbose = TRUE,
+                                                                   filename="sphasouth178spatial.radiator.50pctmsng.oneSNPmac3",
+                                                                   parallel.core = 1,
+                                                                   fig.upsetr=TRUE,
+                                                                   filter.common.markers=FALSE,
+                                                                   output=c("vcf", "fineradstructure", "tidy", "plink", "ldna", "stockr", "genind", "genlight", "structure", "bayescan", "betadiv", "related"))
################################################################################
########################## radiator::genomic_converter #########################
################################################################################
Execution date@time: 20190329@1757

genomic_converter function call arguments:
    data = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad
    strata = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt
    output = vcf, fineradstructure, tidy, plink, ldna, stockr, genind, genlight, structure, bayescan, betadiv, related
    filename = sphasouth178spatial.radiator.50pctmsng.oneSNPmac3
    parallel.core = 1
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in genomic_converter:
    filter.common.markers = FALSE

Default "..." arguments assigned in genomic_converter:
    blacklist.genotypes = NULL
    filter.monomorphic = TRUE
    internal = FALSE
    keep.allele.names = FALSE
    parameters = NULL
    path.folder = NULL
    vcf.metadata = TRUE
    vcf.stats = TRUE
    whitelist.markers = NULL

Unknowned arguments identified inside "...": 
    fig.upsetr

Read documentation, for latest changes, and modify your codes!



Folder created: 21_radiator_genomic_converter_20190329@1757
File written: [email protected]
Filters parameters file generated: [email protected]

Importing data

Synchronizing data and strata...
    Number of strata: 31
    Number of individuals: 178

Writing tidy data set:
sphasouth178spatial.radiator.50pctmsng.oneSNPmac3.rad

Preparing data for output

Data is bi-allelic
Generating structure file
Error in .f(.x[[i]], ...) : object 'GT' not found

from radiator.

kevinmneal commented on June 6, 2024

Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):

##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
1	locus_100029__100029__50	NA	C	T	.	PASS	NS=115	GT	
1	locus_100077__100077__49	NA	A	G	.	PASS	NS=174	GT	
1	locus_10009__10009__11	NA	C	T	.	PASS	NS=178	GT
1	locus_100137__100137__36	NA	G	T	.	PASS	NS=163	GT	
1	locus_100170__100170__22	NA	T	A	.	PASS	NS=102	GT	
1	locus_100203__100203__29	NA	C	T	.	PASS	NS=108	GT

This is what the input vcf looks like:

##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
locus_100006	5	.	G	A	13	PASS	NS=120;DP=2002	GT:DP:CATG	
locus_100014	8	.	G	T	13	PASS	NS=251;DP=8028	GT:DP:CATG	
locus_100023	12	.	C	A	13	PASS	NS=95;DP=2406	GT:DP:CATG	
locus_100029	50	.	C	T	13	PASS	NS=173;DP=3987	GT:DP:CATG	
locus_100038	19	.	T	C	13	PASS	NS=261;DP=9508	GT:DP:CATG	
locus_100042	50	.	C	A	13	PASS	NS=91;DP=887	GT:DP:CATG

from radiator.

thierrygosselin commented on June 6, 2024

I’ll have a look at it in 1 hour

from radiator.

kevinmneal commented on June 6, 2024

It's the weekend, no rush! (unless you want to)

Thanks

from radiator.

chasesmith15 commented on June 6, 2024

Hi Thierry,
Any updates on the next version of radiator coming out? Running into some of the same problems stated above using filter_rad on DArT data:

Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1458\01_radiator\[email protected]'. The process cannot access the file because it is being used by another process.

Also:
In grid.Call(C_textBounds, as.graphicsAnnot(x$label), ... : font family not found in Windows font database

Thanks!

Chase

from radiator.

thierrygosselin commented on June 6, 2024

I’m sick and won’t be like to work or push something before Monday next week.

Some of you’re problem seems to be related to parallel processing. Try setting the argument to use 1 core. But it can not access the file which seems to another problem:

make sure you have read/write permission
why is the file ending with ‘gds.rd’ ? It should be automatically set to ‘gds.rad’, weird..

from radiator.

thierrygosselin commented on June 6, 2024

The other one about don’t family is new to me. I removed all mention of font family so that windows machine wouldn’t have to struggle with helvetica font so it should pick whatever you have as default.

Have you done ggplot fig before ? Got similar errors ?

from radiator.

chasesmith15 commented on June 6, 2024

Hey Thierry,

Thanks for the quick response! Hope you get feeling better soon.

-Setting to 1 core seemed to fix the errors. Even with the font family. Never had the error with ggplot either so strange

-Files are saving as .rad not .rd something must have gotten deleted when I was copying over

I'll let you know if I run in to anything else.

from radiator.

thierrygosselin commented on June 6, 2024

How many cores do you have on your PC?

from radiator.

thierrygosselin commented on June 6, 2024

For some functions, SeqArray package à some problems dealing with parallel on PC (because of the lack of forking, he uses something else). So for those I’m detecting the OS and setting it automatically.

If you encounter another problem let me know exactly when it happens in the pipeline

from radiator.

thierrygosselin commented on June 6, 2024

8 cores shouldn’t be any problem

from radiator.

chasesmith15 commented on June 6, 2024

If I specify more than 1 core I get an error during the first common marker filter:
Scanning for common markers... Generating UpSet plot to visualize markers in common Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1527\01_radiator\[email protected]'. The process cannot access the file because it is being used by another process

If I specify 1 core I get an error at the MAC filter step:
Error in SeqArray::seqGetData(gds, "annotation/format/AD") : The GDS node "annotation/format/AD/data" does not exist.

from radiator.

thierrygosselin commented on June 6, 2024

New push on GitHub, Try again and let me know if it works or not

from radiator.

chasesmith15 commented on June 6, 2024

That worked! Thanks for all the help.

from radiator.

filter_rad issues about radiator HOT 72 CLOSED

Comments (72)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent