Comments (8)
malva-geno produced valid VCF output when I used your example/chr20.vcf
. But it produced empty VCF when used with fingerprint_snps.vcf
- a subset of the ExAC VCF downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/subsets/. It includes an INFO/AF
tag, so I'm not sure what caused this issue. Worth debugging. I'll leave this issue open if you want to investigate. In the meantime, I will use a subset of SNPs from the 1000genome VCF as you do.
from malva.
My subset of SNPs from the 1000genome VCF also produced an empty output VCF. This might have something to do with the way I subset SNPs for genotyping.
Here is how to reproduce:
# Fetch the 1000genomes phase3 VCF, and its tabix index:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz{.tbi,}
# Shortlist SNPs appropriate for genotyping (commonly heterozygous within subpopulations):
bcftools view --include 'EAS_AF>0.4 & EAS_AF<0.6 & EUR_AF>0.4 & EUR_AF<0.6 & AFR_AF>0.4 & AFR_AF<0.6 & AMR_AF>0.4 & AMR_AF<0.6 & SAS_AF>0.4 & SAS_AF<0.6' --exclude-types indels,mnps,ref,bnd,other ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | bcftools annotate --remove ^INF/AF,INF/EAS_AF,INF/EUR_AF,INF/AFR_AF,INF/AMR_AF,INF/SAS_AF --output fingerprint_snps.vcf
# Limit the VCF to chr20 for use with the malva test data:
grep -E "^#|^20" fingerprint_snps.vcf > fingerprint_snps_chr20.vcf
# Download malva test dataset and build k-mer reference:
wget https://github.com/AlgoLab/malva/raw/master/example/data.tar.gz
tar -zxf data.tar.gz
mkdir kmc_tmp
kmc -m4 -k43 -fm chr20.sample.fa kmc.out kmc_tmp
# Run malva-geno to genotype the fingerprint SNPs shortlisted earlier:
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf
The fingerprint_snps_chr20.genotyped.vcf
above will contain only VCF header lines, and no variants listed. When I allow the input VCF to contain rare alleles, I seem to get variants in my output.
from malva.
Hi,
thanks for opening the issue. We'll look into it and we'll let you know asap.
Luca
from malva.
@ldenti I was able to get it working by adding fake FORMAT and sample columns with GT
populated as 0|0
for all variants. The sample ID in my input VCF does not seem to matter since your output VCF will replace it with "DONOR".
perl -i -a -F'\t' -pe 'chomp; $s="\tFORMAT\tSAMPLE_ID" if(/^#CHROM/); $s="\tGT\t0|0" unless(/^#/); $_.="$s\n"' fingerprint_snps_chr20.vcf
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf
from malva.
I see. Yes, the problem is that there must be the sample columns in the input VCF. Indeed, malva needs these columns to characterize each variant by computing its signatures.
Therefore if no sample columns are present in the input VCF or if you use only 0|0 as dummy column, then malva would not work properly and the output VCF would be meaningless.
So, if you want to use the VCF provided by the 1000genomesproject, then you have to download the ALL.chr*.
files and concatenate them with bcftools
(see this script).
from malva.
From your paper: "For each variant allele, assign a signature in the form of a set of k-mers". You can do this with the ALT
in the VCF, and there is no need for GT
. You could technically use GT
to build a genotype likelihood model i.e. whether each variant is more commonly heterozygous/homozygous across individuals, but I don't see that described in the paper. Can you roughly describe how malva-geno uses GT
from the input VCF?
I really appreciate your insight. We intend to use malva to genotype the >350k pairs of FASTQs archived at MSKCC - and use that to identify new samples. If successful, it would become a critical QC step in our infrastructure. But most of our data is targeted sequencing or RNA, so we would benefit from using SNPs from ExAC instead of 1000genomes.
from malva.
Yes, it is right: we assign a signature, that is a set of kmers, to each variant.
But actually we assign multiple signatures to the same variant. For example, let us assume to have two single-allelic SNPs that occur within k/2 bases. Then, to characterize each allele of each variant, we should use 2 different signature: one taking into account the ref allele of the close variant and the other taking into account the alternate allele. This is just an example but in same cases, there could be more than 2 variants that are close and they could also be multi-allelic variants. To avoid to make all the possible combinations (that can be exponential in k), we base our computation only on those combinations that actually occur in the population used to produce the VCF.
I hope this is clear, if not, let me know and I'll try to explain it in a clearer way. In any case, you can find a more detailed explanation in the supplemental material: see definition 1 and definition 2.
Maybe if the GT fields are not available we can just make a single signature without taking in consideration close variants. It should be easy to do, but I think that it could lower the global accuracy of the tool. If your files do not have GT fields, I could try to implement and test that.
from malva.
Very clear. Thanks Luca. I'll close this ticket, because my original problem is resolved.
from malva.
Related Issues (8)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from malva.