Comments (72)
Do and Waples developed NeEstimator (https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12157), which includes LDNe, molecular coancestry, and heterozygosity excess methods. Will give ldne in strataG a try - I'd been looking for an Ne implementation in R, thanks for pointing it out, and it looks like they implement some bias corrections for using thousands of SNPs that I don't think NeEstimator has even implemented yet: https://www.nature.com/articles/hdy201660
So far, clouds of closekin are almost always between 0.25 and 0.75
đ I've since removed that top outlier, at least... my original relatedness filtering was based off of KING (implemented as --relatedness2 in vcftools) (https://academic.oup.com/bioinformatics/article/26/22/2867/228512) I think most of that cloud from 0.25-0.5 is from some artificial pools that are derived from a very small sample, though, so that's probably demonstrative of a founder effect.
Of course, whether it's appropriate to purge close relatives is still up for debate: https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14022
from radiator.
-
Will update radiator very soon with these changes:
-
Will now import and parse the VCF metadata CATG
Screenshot of the doc for the metadata
Screenshot during function execution to know what to expect
-
The CATG columns are automatically matched with
REF
andALT
columns to generateALLELE_REF_DEPTH
andALLELE_ALT_DEPTH
columns... this is what you need forradiator::detect_allele_problems
-
Documentation for the markers metadata parsing is also updated...
Screenshot of the doc for VCF import and parsing of markers metadata
from radiator.
I removed the top line (but it might have been something else, the top line pas just markers, not accounting for the first 6 columns...).
genomic_converter
: should work too
from radiator.
Working non stop for 2weeks
Workshop is next week, got to have radiator ready ;)
https://thierrygosselin.github.io/genomics-workshops/
from radiator.
8 cores on my PC
from radiator.
Hi Kevin, I'm closing the issue, because I think everything was covered. If I missed something, let me know by reopening it.
from radiator.
Hi Thierry,
Hadn't used filter_rad/genomic_converter in a while but reran filter_rad today with the latest version of radiator and still get this error that I brought up a month or so ago: `. The genomic_converter step of filter_rad Any ideas?
Transferring data to genomic converter...
Synchronizing data and strata...
Number of strata: 1
Number of individuals: 153
Writing tidy data set:
sphasouth153spatial.nohybrids.radiator.50pctmsng.oneSNPmac3.rad
Calibrating REF/ALT alleles...
number of REF/ALT switch = 10
Calibrating REF/ALT alleles...
number of REF/ALT switch = 10
Error in .f(.x[[i]], ...) : object 'GT' not found
Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):
##fileformat=VCFv4.3 ##fileDate=20190329 ##source=radiator_v.1.0.0 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 locus_100029__100029__50 NA C T . PASS NS=115 GT 1 locus_100077__100077__49 NA A G . PASS NS=174 GT 1 locus_10009__10009__11 NA C T . PASS NS=178 GT 1 locus_100137__100137__36 NA G T . PASS NS=163 GT 1 locus_100170__100170__22 NA T A . PASS NS=102 GT 1 locus_100203__100203__29 NA C T . PASS NS=108 GT
This is what the input vcf looks like:
##fileformat=VCFv4.0 ##fileDate=2019/01/13 ##source=ipyrad_v.0.7.28 ##reference=Spea_genomeassembly_fromEvan.fasta ##phasing=unphased ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT locus_100006 5 . G A 13 PASS NS=120;DP=2002 GT:DP:CATG locus_100014 8 . G T 13 PASS NS=251;DP=8028 GT:DP:CATG locus_100023 12 . C A 13 PASS NS=95;DP=2406 GT:DP:CATG locus_100029 50 . C T 13 PASS NS=173;DP=3987 GT:DP:CATG locus_100038 19 . T C 13 PASS NS=261;DP=9508 GT:DP:CATG locus_100042 50 . C A 13 PASS NS=91;DP=887 GT:DP:CATG
from radiator.
Hi Kevin, thanks a lot for the bug report. I don't have a PC nearby so it's very useful, I'm giving a workshop in 2 weeks and 19 out of the 25 participants have PCs...
I'll break the reply on topics...
1. Font Helvetica: just realized that my favorite font is not pre-installed on PCs. Snif. Will find a replacement found in all 3 OS...
- so this one is done. Completely removed Helvetica from radiator
from radiator.
- 2. outliers stats
Figure that says it all.
The outliers are values outside the 1.5 x IQR range.
Sometimes useful for filtering, sometimes not.
They are there to help guide your decision.
Careful in Mac/maf, coverage, snp number per locus and position on the read.
It's obviously normal to have lots of dotted lines (outliers inside a Boxplot showing MAC)...
This will be explained further in the vignette coming max next week.
In radiator the dotted lines in the Boxplots are actually the outliers. It's faster computationally to do it this way.
from radiator.
- 3. filter_coverage
a plot for min mean would be useful
It's on my todo list to have the helper plot, it's there, just not outputting it.
But it's less important or rational, because it all falls down to your comfort with low coverage...
Personally I don't go below 7, because of null allele, heterozygosity, etc.
3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.
I'll check that at the same time. But it wouldn't be the first time a pipeline says it's filtering x and it doesn't... do you know if it's <=
or <
, because for your dataset that would be a big deal (see also my observation/comment of Minot Allele Count below...
You could try this with your dataset and the latest release:
data <- radiator::tidy_vcf(data = "OCall208_75pctmsng.recode.vcf",
strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
filter.common.markers = FALSE,
verbose = TRUE)
cov.markers <- data %>%
dplyr::group_by(MARKERS) %>%
dplyr::summarise(
MEAN = mean(READ_DEPTH, na.rm = TRUE),
MEDIAN = median(READ_DEPTH, na.rm = TRUE)
)
dplyr::filter(cov.markers, MEAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEAN < 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN <= 6) %>% nrow(.)
dplyr::filter(cov.markers, MEDIAN < 6) %>% nrow(.)
You should get these numbers for the unfiltered dataset:
2829
2826
3990
3680
Now both helper plot (low and high are printed)
Using these thresholds:
Filter mean coverage thresholds: 7 / 100
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 5600 / 5600 / 6965
Blacklisted: 0 / 0 / 1773 / 1773 / 2248
After: 208 / 26 / 3827 / 3827 / 4717
There's definitely things to test with that filter and downstream analysis.
I know some people who wouldn't do closekin with markers below 20 of coverage ... but that's up for debate...
from radiator.
- 4. In filter_genotyping, is the threshold applied per-strata or only on the total?
It's overall.
The filter used to be able to have a pop threshold (some call that a joker).
This is actually generating bias and exacerbating pattern of missingness.
The reason I show the missing per strata, the mean value and overall is that it helps to see if the populations are "equal". You can already predict if a pattern of missingness is present from this filter.
Remember, their is nothing missing unless you compare it with something...
The course of action when some populations drags down the polymorphism discovery is to remove them, do the downstream analysis. Re-introduce the population and see if it changes something.
The impact depends on the analysis:
- for close kin analysis it's noticeable
- for population structure: depends on the number of markers and degree of differentiation. The weaker the more important.
- pop. assignment: important
- if you see a trend between bubble size (missing) and heterozygosity... it's usually because something went wrong in the wet lab...
Update: so that filter no longer crash RStudio on my end when the number of samples for a pop is below the number of CPU used...
from radiator.
- 5. outputting the full function call with args entered during the interactive session would help with reproducibility
You should see it if verbose = TRUE
for main function. After, for the modules inside it's printed in the file with _args_
in it.
Additionally, the values for threshold, along CHROM/LOCUS/SNP kept and blacklisted for each filtering steps are written in the filters_parameters
file in the 01_radiator
folder
from radiator.
6. asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time
Most are fast enough: I usually test with 2M markers
As for HWE
it's the only filter I'm not confortable 100% in removing markers. You really have to know what you're doing. Consequently, you should get the message Do you want to continue with the filtering ?
after the plots are produced...
More on that filter below...
The only other now I saw useful to do it is the pruning of markers based on long ld... it also ask if you want to proceed or not. Because it's usually fast with a reference genome but can be long for de novo data.
from radiator.
- 7. Long LD filtering appears to work but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering?
It should work. No need to collapse CHROM. Likely a bug. with your VCF.
I'm testing it with and without a genome, so I'll run it again to see if I broke something.
WITH MISSING is currently only available WITH a reference genome. It's a computational issue for now.
The problem you're seeing is it with the VCF you sent me yesterday ? If so I'll do the test with it.
update so do you really have a reference genome or an approach using a catalog of individuals as pseudo-reference ? It looks like the info in CHROM
are more like locus than scaffolds. You have 6000 of them almost as much as the markers.
Ok so I checked ipyrad doc and the options are described here
options:
- Change the CHROM info: I'll make available a series of codes in the function:
filter_ld
examples that will show how to easily change theCHROM
column in your GDS object and/or tidy data. So you could skip the long LD step infilter_rad
... and do it separately infilter_ld
. - Tell me if there is a better way to code the
CHROM
right from the start...
Code to change the chromosome information:
#Change the CHROM column-------------------------------------------------------
gds <- read_rad("full path to gds or .rad.gds file")
radiator::summary_gds(gds)
markers.meta <- radiator::extract_markers_metadata(gds)
# Keep a back up
markers.meta.bk <- markers.meta
# change the CHROM column
markers.meta %<>%
dplyr::mutate(CHROM = "chrom_1")
# Update the GDS markers metadata (this is the radiator part)
update_radiator_gds(
gds = gds,
node.name = "markers.meta",
value = markers.meta,
replace = TRUE
)
# Update the chromosome info in GDS (GDS, SeqArray part)
update_radiator_gds(
gds = gds,
radiator.gds = FALSE,
node.name = "chromosome",
value = markers.meta$CHROM,
replace = TRUE
)
#Ready for LD without the code running by scaffolds...
from radiator.
8. Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.
Yes, a pain.
I have a fix for this that tracks if it's been done. It's always a problem with low coverage data + the removal of samples. It generates more monomorphic markers and requires recalibrating REF/ALT...
my plan in the next couple of weeks is to have a C++ code do it faster.
from radiator.
9.in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.
Sadly, no easy way.
I have a solution in the vignette that shows how to start with the GDS file, removes some filters, etc.
The other solution might be for you to just use the filter modules you want and use magrittr %>%
in-between.
Todo list: use shiny and let use pick filters on the left side and click and enter thresholds as the modular pipeline goes...
from radiator.
10. when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.
Bug
I'll check the combination mentioned above. It's not for n < 10 (that's tested) not tested is when you enter to the same number of pops.
Usually with asterisk, if entering e.g. *****
returns no filtering, the interactive version will prompt to enter a new value... that might trigger an ever ending loop. I'll check.
from radiator.
- 11.detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:
Good catch.
from radiator.
- 12. May need to remove strata=NULL from filter_hwe in filter_rad():
gds <- filter_hwe(data = gds, interactive.filter = interactive.filter,
filter.hwe = filter.hwe, strata=NULL, hw.pop.threshold = hw.pop.threshold,
midp.threshold = midp.threshold, parallel.core = parallel.core,
parameters = filters.parameters, path.folder = wf, verbose = verbose,
internal = FALSE)
That one is actually necessary with other type of datasets.
I don't think it's the reason you experience problem with this function.
from radiator.
1- Arial is probably the safest bet, a near-copy of Helvetica
3- I was curious because, in ipyrad I set minimum coverage to 6, but I seem to lose a lot of loci if I set the min here to 6. I've just been setting it to 1 for now.
6- I think it's going through the HWE calculations before asking if you want to filter
7- Yes, with that VCF. The issue might be that I'm using filter.common.markers=FALSE
(at least for some analyses, I'd like to keep as many SNPs per population as possible, even if not common, mostly for doing effective population size calculations). I think when I did have it =TRUE that it was filtering properly.
10- Yeah, it became a never-ending loop regardless of what I entered. Sorry I don't have the parameters that triggered it, but if it comes up again I'll report back
13 (below)- Maybe that was being caused by my removing strata=NULL
from filter_hwe
, hmm. EDIT: Although it got an error after the detect_mixed_genomes
step too. EDIT2: Yup, tried after putting strata=NULL back, still get the error.
Thanks!
from radiator.
- 13. Any filter that removes samples/individuals causes failure in the next filtering step:
Bug. I don't see this with my test dataset.
Will check with your dataset
Awesome
from radiator.
Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?
from radiator.
Will have a new commit with fixes by the end of the day
When you run this
test1 <- radiator::read_vcf(
data = "OCall208_75pctmsng.recode.vcf",
strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
verbose = TRUE)
On my Mac I'm getting this at the end:
Computation time, overall: 11 sec
########################### completed read_vcf ###########################
Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs, :
1 function calls resulted in an error
It's likely related to some SeqArray function I'm using and some don't like when the number of CPU is lower than the number of pop, so I have a fix for it, but are you getting a similar message (it won't be
mclapply
, but something else)
from radiator.
Also 7- is there a way to have radiator utilize the reference genome fasta for this? Does it require it/is it better with it?
genome fasta was used for alignment right ?
I'm not sure I understand how this is related to your point 7 for the common markers ?
from radiator.
test1 <- radiator::read_vcf(
data = "OCall208_75pctmsng.recode.vcf",
strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata.txt",
verbose = TRUE)
For me on Windows, the same input returns the error
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, :
One of the nodes produced an error: Can not open file 'C:\Users\kevin\Documents\Rtest\read_vcf_20190320@1141\01_import_gds\[email protected]'. The process cannot access the file because it is being used by another process.
I did use a genome fasta for alignment. I was wondering if incorporating that fasta into radiator operations added more information than is found in the VCF, but probably not, so ignore that.
Also, I did run more tests on long LD filtering with filter.common.markers either TRUE or FALSE, using long.ld.missing TRUE or FALSE. No combination seems to be filtering, so I'm really not sure what to do now. I think I might have gotten it to successfully filter once, but I don't remember the parameters there again (I really should be better at logging, but I've been running this dozens of times.)
from radiator.
More info, if it helps with the long LD filtering:
When I run using long.ld.missing=TRUE, no SNPs are filtered, and this plot is the output:
When I run using long.ld.missing=FALSE, no SNPs are filtered (regardless of threshold), and this plot is the output:
from radiator.
One more thing, an error I get after filtering when filter_rad passes results to genomic_converter; looks like it's an issue with the pcadapt output. Outputs up to that point are still being written, but none of the ones listed after in my filter_rad input.
Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
File written: blacklist.id.tsv
Generating statistics after filtering
calculating individual stats...
[==================================================] 100%, completed in 0s
Extracting DP information...
File written: individuals qc info and stats summary
File written: individuals qc plot
calculating markers stats...
Extracting DP information...
[==================================================] 100%, completed in 0s
[==================================================] 100%, completed in 0s
Transferring data to genomic converter...
Calibrating REF/ALT alleles...
number of REF/ALT switch = 15
Synchronizing data and strata...
Number of strata: 4
Number of individuals: 208
Writing tidy data set:
OCall208.radiator.75pctmsng.oneSNPmac.rad
* Number of sample pop, np = 4
* Number of markers, nl = 6155
* The highest number used to label an allele, nu = 4
* The alleles are encoded with one digit number
Generating BayeScan file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1557
::radiatorfilter_common_markers function call arguments:
data = tbl_df
filter.common.markers = TRUE
fig = FALSE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in ::radiatorfilter_common_markers:
internal = TRUE
Default "..." arguments assigned in ::radiatorfilter_common_markers:
parameters = NULL
path.folder = NULL
Scanning for common markers...
Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1557
::radiatorfilter_monomorphic function call arguments:
data = tbl_df
filter.monomorphic = TRUE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
internal = TRUE
Default "..." arguments assigned in ::radiatorfilter_monomorphic:
parameters = NULL
path.folder = NULL
Scanning for monomorphic markers...
Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
Calibrating REF/ALT alleles...
generating REF/ALT dictionary
integrating genotypes codings...
writing BayeScan file with:
Number of populations: 4
Number of individuals: 208
Number of biallelic markers: 6012
Writting populations dictionary
Writting markers dictionary
Generating pcadapt file...
################################################################################
######################## radiator::filter_common_markers #######################
################################################################################
Execution date@time: 20190320@1604
::radiatorfilter_common_markers function call arguments:
data = tbl_df
filter.common.markers = TRUE
fig = FALSE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in ::radiatorfilter_common_markers:
internal = TRUE
Default "..." arguments assigned in ::radiatorfilter_common_markers:
parameters = NULL
path.folder = NULL
Scanning for common markers...
Computation time, overall: 1 sec
####################### filter_common_markers completed ########################
################################################################################
########################### radiator::filter_monomorphic #######################
################################################################################
Execution date@time: 20190320@1604
::radiatorfilter_monomorphic function call arguments:
data = tbl_df
filter.monomorphic = TRUE
parallel.core = 3
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in ::radiatorfilter_monomorphic:
internal = TRUE
Default "..." arguments assigned in ::radiatorfilter_monomorphic:
parameters = NULL
path.folder = NULL
Scanning for monomorphic markers...
Computation time, overall: 1 sec
######################## filter_monomorphic completed ##########################
################################################################################
############################## radiator::filter_ld #############################
################################################################################
Execution date@time: 20190320@1604
filter_ld function call arguments:
interactive.filter = TRUE
data = tbl_df
filter.short.ld = NULL
filter.long.ld = NULL
parallel.core = 3
filename = NULL
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in filter_ld:
ld.method = r2
long.ld.missing = FALSE
Default "..." arguments assigned in filter_ld:
internal = FALSE
ld.figures = TRUE
parameters = NULL
path.folder = NULL
subsample.markers.stats = NULL
File written: [email protected]
Interactive mode: on
Step 1. Short distance LD threshold selection
Step 2. Filtering markers based on short distance LD
Step 3. Long distance LD pruning selection
Step 4. Threshold selection
Step 5. Filtering markers based on long distance LD
Filters parameters file generated: [email protected]
Filters parameters file: initiated
Minimizing short distance LD...
There is no variation in the number of SNP/locus across the data
Error: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
Call `rlang::last_error()` to see a backtrace
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Computation time, overall: 1 sec
############################# completed filter_ld ##############################
Computation time, overall: 1519 sec
############################# completed filter_rad #############################
> rlang::last_error()
<error>
message: All columns in a tibble must be 1d or 2d objects:
* Column `VALUES` is NULL
class: `rlang_error`
backtrace:
1. radiator::filter_rad(...)
2. radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
3. radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
4. radiator::filter_ld(...)
5. radiator::radiator_parameters(...)
6. tibble::tibble(...)
7. tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
8. tibble:::check_valid_cols(x)
Call `rlang::last_trace()` to see the full backtrace
> rlang::last_trace()
x
1. \-radiator::filter_rad(...)
2. \-radiator::genomic_converter(...) C:\Users\kevin\AppData\Local\Temp\Rtmpo9w9tT\file1e54a562a37:236:2
3. \-radiator::write_pcadapt(data = input, filename = filename, parallel.core = parallel.core)
4. \-radiator::filter_ld(...)
5. \-radiator::radiator_parameters(...)
6. \-tibble::tibble(...)
7. \-tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
8. \-tibble:::check_valid_cols(x)
Input was
OCall208.filtered <- radiator::filter_rad(data = OCall208.tidy,
strata = "OCall208_85clust_75pctmsng_oneSNPmac2_20190205_radiator_strata_K3_ARTIF.txt",
verbose = TRUE,
filename="OCall208.radiator.75pctmsng.oneSNPmac",
parallel.core = 1,
fig.upsetr=TRUE,
filter.common.markers=FALSE,
output=c("vcf", "genepop", "tidy", "plink", "genind", "genlight", "structure", "hierfstat", "bayescan", "betadiv", "pcadapt", "hzar", "related", "maverick"),
path.folder="G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphaOCclust85/OCall208.radiator.filtered")
from radiator.
pcadapt
: the function will only keep common markers and discard monomorphic markers automatically. Will check that one first.
maverick
: Will no longer work. I have to check what changed with the rmaverick
.
from radiator.
fig.upsetr
: forget about it. I now detect automatically if the number of samples will break UpSetR
from radiator.
Will have a solution strategy soon for your uneven sample size dataset...
But you're not really going to do Ne on the populations with less than 15 samples ?
What's the specie?
Also, just for my brain and correlations: Sequencing technology used and type of RAD wet lab technique ?
from radiator.
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 4675 / 4675 / 6327
After: 208 / 26 / 1985 / 1985 / 2392
You really don't want that one...
from radiator.
I've been getting Ne values out of NeEstimator with pretty narrow parametric confidence intervals on ~10 samples (jackknife CI is bigger though...), but I have been worried about bias which was a big reason for why I wanted to use radiator :) I have a few sites where I have 20-30 individuals and will compare results between using all individuals in a pond vs. using 10, but since they're tadpoles usually sampled in one trip, I worry about very high relatedness (which is why most ponds have ~10 now, after filtering for relatedness). I should try getting Ne using Colony but that program is not the most user-friendly... I'm also calculating Ne for genetic clusters in addition to pond-level.
Species is the western spadefoot toad, Spea hammondii. Sequenced with Illumina HiSeq 4000, wet lab technique was 3RAD (ddRAD but with a third enzyme to cut primer dimers) https://www.biorxiv.org/content/10.1101/205799v3
I've been trying out using common markers on a different strata, on K=3 clusters + a category for artificial ponds (so 4 strata) that I got from FastStructure. When I do that, the filtering is more reasonable:
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 4 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 148 / 148 / 193
After: 208 / 4 / 6512 / 6512 / 8526
from radiator.
Have you tried ldne ?
Eric Archer package strataG as a nice ldNe interface... ??strataG::ldNe
my conversion function for gtypes
won't work, he updated the internal structure and I'm currently updating it. But you could easily do it from a genind object with his conversion function.
from radiator.
Colony: I understand, and it's not the fastest.
My output used to work because I ran a lot of dataset on colony to get closekin
However, you can get a pretty good estimate of how close your samples are with the detect_duplicate_genomes
(it's running automatically in filter_rad
). So far, clouds of closekin are almost always between 0.25 and 0.75
from radiator.
filter for common markers
I wouldn't use it for now with low numbers like that, you'll likely throw out good markers.
Skip that one.
Toad: very cool
from radiator.
I love that function... it's basically an individual tree, but Manhattan style, which is way easier to highlight outlier... and so far you would believe how much are out there in published dataset...
With some of my own tuna datasets, removing duplicates or very close kin revealed another K (group) in DAPC and stockR... it was the filter with the biggest impact
from radiator.
FYI Eric Archer and Robin have work closely on some projects, so I would trust the implementation
KING: be careful with that one, I compared results with colony and KING and some of KING results made no sense.
It's also implemented in SNPRelate (that uses GDS that radiator supports).
from radiator.
As for purging those close relatives, I think it really depends on the number of close kin, the dataset, the number of markers and differentiation... it's been a while I read that paper and remember having talked about it to both authors...
You probably read this one:
Sampling related individuals within ponds biases estimates of population structure in a pondâbreeding amphibian by O'Connell et al.
from radiator.
lots of close relative in there... the relationship will be interesting to highlight...With colony you could also have a good idea on mating success and polygynandry
.
from radiator.
About minor allele count/frequency...
I don't know what you used in ipyrad or in radiator but the very minimum...is probably 3...
Filter mac threshold: 3
Number of individuals / strata / chrom / locus / SNP:
Before: 208 / 26 / 6660 / 6660 / 8719
Blacklisted: 0 / 0 / 1060 / 1060 / 1754
After: 208 / 26 / 5600 / 5600 / 6965
And if everything before that was unfiltered, like in this example above, you really drop a lot of markers...
A MAC = 3
is like saying:
Overall samples, you need:
3 heterozygote individuals or
1 heterozygote individual + 1 homozygote for the ALT allele
to keep a marker
The minimum number of samples is 2
A MAC = 2
can potentially mean just 1 sample ... so you need good coverage and confidence.
If you use 2, check the coverage for those alternate alleles
from radiator.
Yeah, the manhattan cloud is awesome. That stockR figure is really interesting. In my case, removing close relatives collapsed clusters--it was forcing a bunch of artificial pools derived from a single site (geographically and phylogenetically within another cluster) into their own cluster at K=2. Removing close relatives brought the artificial pools back in with the expected cluster. My guess is due to founder effect making the pools look far more differentiated than reality since they aren't very distant phylogenetically; could have been an issue of uneven sampling though. That cloud from 0.25-0.5 is almost all from these same artificial pools.
Re: MAC:
I've been using MAC=2 (the vcf I sent you used mac=2 in vcftools), though for the same reason you explain, I have been considering MAC=3. My clustering results don't really change whether using 2 or 3, though. Vcftools also has an option of filtering both singletons and private doubletons, which I also toyed with, but decided a single homozygous individual with the ALT allele still represents real genetic diversity in the population (assuming good coverage, yes) - I'll check coverage on those, but might just do mac=3 like you suggest for simplicity
from radiator.
I have a function that highlight coverage for lower MAC... will find it. It was quick. If there's good coverage keep them. But testing with or without and looking at downstream impact is the best and fastest options.
detect_mixed_genomes
is probably my favorite. I'm working with Eric Anderson on heterozygote miscall rate... and this one is good to highlight wet-lab problems. But also find cryptic species and all sorts of coverage related problems
from radiator.
Having lots of homozygotes generates something similar to closekin... it generates false structure.
Having lots of heterozygotes is the opposite... because having 2 copies of each alleles makes you close to everybody.. you can see it easily in output of the function mentioned above
from radiator.
There's a recent paper that tests maf/mac thresholds and their effect on structure and comes to the same conclusion, that rare alleles in general mess up structure: https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12995 so I'll do mac=3...
That function would be great to use if you find it, thanks!
from radiator.
I did the move of switching my MAF filter to MAC filter just a bit before reading that paper when it was underpress or biorxiv, don't remember, but I'm glad I did. It's really a nice read.
from radiator.
Question for coverage for low Mac...
with ipyrad: do you have coverage for REF/ALT allele or just the read depth of the genotypes ?
from radiator.
Because I have a function, not tested with your dataset, and it's been a while, work with tidy data, will update to work with GDS as well, but I think it requires there coverage for both alleles (which makes sense)...
??radiator::detect_allele_problems
from radiator.
Individual's heterozygosity
- on the unfiltered data
- that tell almost the complete story :)
individual.heterozygosity.manhattan.plot.pdf
from radiator.
The ipyrad vcf contains read depth for each base pair in the CATG format (this is custom to ipyrad, I believe). I wanted to code up a way to calculate REF/ALT depths from those based on the REF/ALT base pair columns, but figured it would take me too long/got distracted/not sure I'd actually know how to. A quick perusal of the vcf does show some genotypes where there's e.g. 50 reads for A and only 1 for C; not exactly sure how ipyrad/downstream software are calling that as a homozygote vs heterozygote.
I'll try out that function, thanks Thierry.
That het manhattan plot is a beauty and really does summarize a lot. All those IMXX sites are the artificial ones, very low heterozygosity. SHOE is artificial too, but very close to some natural ponds and likely some gene flow with those. Most of the other sites with low heterozygosity (CCSP, LAGUN, MORO, BBEND) are themselves likely a result of a relatively recent natural founder event, geographically isolated from the others. The sites with high heterozygosity are in more suitable, connected habitat.
from radiator.
What you don't want to see is bubble size correlated with the patterns of heterozygosity
Here it's all good, I'll send by email a couple of bad ones, you'll understand
from radiator.
CATG oh ok that's easy then...
give me 30 min I'll draft something
Ok so I'll now parse the CATG column into 4.
What's nice is that you can re-calibrate the alleles REF/ALT based on the actual count of alleles, not just individuals. Which is way better and less biased for low coverage data.
from radiator.
Thanks đ
I feel bad making all these requests--I hope they've been useful and not tedious--but one other request if it's easy to implement: faststructure format output. Not sure why it's necessary, but for faststructure they ignore the first 6 columns, so would need to add 5 blank columns before the first marker column. Otherwise it's just a standard structure file with two rows per individual. (Alternatively, faststructure accepts a .bed file)
from https://rajanil.github.io/fastStructure/
While the original Structure program allowed for a more flexible input format, fastStructure expects a more specific Structure-like input format. Specifically, rows in the data file correspond to samples, with two rows per sample (note that only diploids are handled by this software), and columns correspond to SNPs. The first 6 columns of the file will be ignored; these typically would include IDs, metadata, etc. This software only handles bi-allelic loci. The two alleles at each locus can be encoded as desired; however, missing data should be encoded as `-9'.
An example of a faststructure file I made by converting a vcf to faststructure in PGDSpider (faststructure just ignores the numbers there in the extra columns, I believe):
OCcoastsub48_85clust_75pctmsng_oneSNPmac2_20190205.recode.str.txt
Immensely grateful for all the fixes and help you've given me so far, thanks Thierry
from radiator.
Look at this link for faststructure ;)
Search my name on the page... was using it a bit a while ago...
I should have an output for it somewhere or it was importing one already outputted by radiator.
Will check
from radiator.
Accidentally clicked close, whoops. Reopened.
Fantastic, looking forward to the update!
from radiator.
write_faststructure
: available as standalone. Let me know if it works as expected, will include it as an option in genomic_converter
after
from radiator.
I'll test it out now and report back
from radiator.
When I run faststructure with a file written by write_faststructure, I get this error:
Traceback (most recent call last):
File "/mnt/Data5/kevinRAD/RAD_2018-02-01/ipyrad_min100/highqual_min220_outfiles/faststructure/fastStructure/structure.py", line 172, in <module>
G = parse_str.load(params['inputfile'])
File "parse_str.pyx", line 10, in parse_str.load
L = loci.shape[1]
IndexError: tuple index out of range
Also, I have to change the file suffix to ".str" or else faststructure won't read it.
The faststructure output:
OCreduced148.radiator.75pctmsng.oneSNPmac3.faststructure.txt
from radiator.
Actually, I checked my other faststructure files; they don't have the locus name header. I just removed the locus name header/the entire first line, and it's running now
from radiator.
Hi Thierry
Trying out the latest version (only just updated an hour ago; from the version you last posted 6 days ago that added faststructure). I'm getting errors with genomic_converter for genepop, hierfstat, and structure: Error in .f(.x[[i]], ...) : object 'GT' not found
. My hunch is it is because my strata again has populations with only 1 individual (I'd removed them in the previous dataset, but haven't done so for the dataset I'm using here, I'll test it out adding the single individuals to a larger population) UPDATE: still get the errors.
Also a different error when writing betadiv, and plink :
Generating betadiv object
Calibrating REF/ALT alleles...
number of REF/ALT switch = 9
Error in .f(.x[[i]], ...) : object 'POP_ID' not found
Full function call and return below. Note this is a slightly different dataset than what I've already sent you (different individuals over a wider geographic range, but sequencing and assembly were the same)
> genomic_converter(data = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad",
+ strata = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt",
+ verbose = TRUE,
+ filename="sphasouth178spatial.radiator.50pctmsng.oneSNPmac3",
+ parallel.core = 1,
+ fig.upsetr=TRUE,
+ filter.common.markers=FALSE,
+ output=c("vcf", "fineradstructure", "tidy", "plink", "ldna", "stockr", "genind", "genlight", "structure", "bayescan", "betadiv", "related"))
################################################################################
########################## radiator::genomic_converter #########################
################################################################################
Execution date@time: 20190329@1757
genomic_converter function call arguments:
data = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad
strata = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt
output = vcf, fineradstructure, tidy, plink, ldna, stockr, genind, genlight, structure, bayescan, betadiv, related
filename = sphasouth178spatial.radiator.50pctmsng.oneSNPmac3
parallel.core = 1
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in genomic_converter:
filter.common.markers = FALSE
Default "..." arguments assigned in genomic_converter:
blacklist.genotypes = NULL
filter.monomorphic = TRUE
internal = FALSE
keep.allele.names = FALSE
parameters = NULL
path.folder = NULL
vcf.metadata = TRUE
vcf.stats = TRUE
whitelist.markers = NULL
Unknowned arguments identified inside "...":
fig.upsetr
Read documentation, for latest changes, and modify your codes!
Folder created: 21_radiator_genomic_converter_20190329@1757
File written: [email protected]
Filters parameters file generated: [email protected]
Importing data
Synchronizing data and strata...
Number of strata: 31
Number of individuals: 178
Writing tidy data set:
sphasouth178spatial.radiator.50pctmsng.oneSNPmac3.rad
Preparing data for output
Data is bi-allelic
Generating structure file
Error in .f(.x[[i]], ...) : object 'GT' not found
from radiator.
Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):
##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 locus_100029__100029__50 NA C T . PASS NS=115 GT
1 locus_100077__100077__49 NA A G . PASS NS=174 GT
1 locus_10009__10009__11 NA C T . PASS NS=178 GT
1 locus_100137__100137__36 NA G T . PASS NS=163 GT
1 locus_100170__100170__22 NA T A . PASS NS=102 GT
1 locus_100203__100203__29 NA C T . PASS NS=108 GT
This is what the input vcf looks like:
##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
locus_100006 5 . G A 13 PASS NS=120;DP=2002 GT:DP:CATG
locus_100014 8 . G T 13 PASS NS=251;DP=8028 GT:DP:CATG
locus_100023 12 . C A 13 PASS NS=95;DP=2406 GT:DP:CATG
locus_100029 50 . C T 13 PASS NS=173;DP=3987 GT:DP:CATG
locus_100038 19 . T C 13 PASS NS=261;DP=9508 GT:DP:CATG
locus_100042 50 . C A 13 PASS NS=91;DP=887 GT:DP:CATG
from radiator.
Iâll have a look at it in 1 hour
from radiator.
It's the weekend, no rush! (unless you want to)
Thanks
from radiator.
Hi Thierry,
Any updates on the next version of radiator coming out? Running into some of the same problems stated above using filter_rad on DArT data:
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1458\01_radiator\[email protected]'. The process cannot access the file because it is being used by another process.
Also:
In grid.Call(C_textBounds, as.graphicsAnnot(x$label), ... : font family not found in Windows font database
Thanks!
Chase
from radiator.
Iâm sick and wonât be like to work or push something before Monday next week.
Some of youâre problem seems to be related to parallel processing. Try setting the argument to use 1 core. But it can not access the file which seems to another problem:
- make sure you have read/write permission
- why is the file ending with âgds.rdâ ? It should be automatically set to âgds.radâ, weird..
from radiator.
The other one about donât family is new to me. I removed all mention of font family so that windows machine wouldnât have to struggle with helvetica font so it should pick whatever you have as default.
Have you done ggplot fig before ? Got similar errors ?
from radiator.
Hey Thierry,
Thanks for the quick response! Hope you get feeling better soon.
-Setting to 1 core seemed to fix the errors. Even with the font family. Never had the error with ggplot either so strange
-Files are saving as .rad not .rd something must have gotten deleted when I was copying over
I'll let you know if I run in to anything else.
from radiator.
How many cores do you have on your PC?
from radiator.
For some functions, SeqArray package Ă some problems dealing with parallel on PC (because of the lack of forking, he uses something else). So for those Iâm detecting the OS and setting it automatically.
If you encounter another problem let me know exactly when it happens in the pipeline
from radiator.
8 cores shouldnât be any problem
from radiator.
If I specify more than 1 core I get an error during the first common marker filter:
Scanning for common markers... Generating UpSet plot to visualize markers in common Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1527\01_radiator\[email protected]'. The process cannot access the file because it is being used by another process
If I specify 1 core I get an error at the MAC filter step:
Error in SeqArray::seqGetData(gds, "annotation/format/AD") : The GDS node "annotation/format/AD/data" does not exist.
from radiator.
New push on GitHub, Try again and let me know if it works or not
from radiator.
That worked! Thanks for all the help.
from radiator.
Related Issues (20)
- SeqArray newest version: unused argument (.progress = TRUE) HOT 6
- error in genomic_converter() HOT 2
- Stacks vcf to rubias HOT 4
- genomic_converter Error HOT 4
- filter_rad error: Column `READ_DEPTH` is a `SeqVarDataList` object. HOT 1
- filter_rad issue error/dplyr:::mutate_error ! & DynamicClusterCall()
- error in reading VCF file HOT 1
- The GDS node "$ref" does not exist. HOT 5
- Genomic converter error 'no more individuals in your data' HOT 9
- Error in file(con, "r") : cannot open the connection HOT 1
- Error when using genomic_converter genlight --> pcadapt HOT 3
- Potential help error with DArT HOT 3
- The GDS node "$ref" does not exist. HOT 4
- genomic_converter errors HOT 5
- Error in `dplyr::mutate()` - No Variants Selected HOT 5
- genomic converter() Error:>! `everything()` must be used within a *selecting* function. HOT 1
- Error Converting VCF to Genepop HOT 12
- read_vcf file access error with parallel.core = 1L argument HOT 3
- ERROR with detect_duplicate_genome + tidy format obtained from genomic_converter HOT 3
- Error with genomic_converter HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
đ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. đđđ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google â€ïž Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from radiator.