Code Monkey home page Code Monkey logo

Comments (8)

ckandoth avatar ckandoth commented on June 26, 2024

malva-geno produced valid VCF output when I used your example/chr20.vcf. But it produced empty VCF when used with fingerprint_snps.vcf - a subset of the ExAC VCF downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/subsets/. It includes an INFO/AF tag, so I'm not sure what caused this issue. Worth debugging. I'll leave this issue open if you want to investigate. In the meantime, I will use a subset of SNPs from the 1000genome VCF as you do.

from malva.

ckandoth avatar ckandoth commented on June 26, 2024

My subset of SNPs from the 1000genome VCF also produced an empty output VCF. This might have something to do with the way I subset SNPs for genotyping.

Here is how to reproduce:

# Fetch the 1000genomes phase3 VCF, and its tabix index:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz{.tbi,}

# Shortlist SNPs appropriate for genotyping (commonly heterozygous within subpopulations):
bcftools view --include 'EAS_AF>0.4 & EAS_AF<0.6 & EUR_AF>0.4 & EUR_AF<0.6 & AFR_AF>0.4 & AFR_AF<0.6 & AMR_AF>0.4 & AMR_AF<0.6 & SAS_AF>0.4 & SAS_AF<0.6' --exclude-types indels,mnps,ref,bnd,other ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | bcftools annotate --remove ^INF/AF,INF/EAS_AF,INF/EUR_AF,INF/AFR_AF,INF/AMR_AF,INF/SAS_AF --output fingerprint_snps.vcf

# Limit the VCF to chr20 for use with the malva test data:
grep -E "^#|^20" fingerprint_snps.vcf > fingerprint_snps_chr20.vcf

# Download malva test dataset and build k-mer reference:
wget https://github.com/AlgoLab/malva/raw/master/example/data.tar.gz
tar -zxf data.tar.gz
mkdir kmc_tmp
kmc -m4 -k43 -fm chr20.sample.fa kmc.out kmc_tmp

# Run malva-geno to genotype the fingerprint SNPs shortlisted earlier:
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf

The fingerprint_snps_chr20.genotyped.vcf above will contain only VCF header lines, and no variants listed. When I allow the input VCF to contain rare alleles, I seem to get variants in my output.

from malva.

ldenti avatar ldenti commented on June 26, 2024

Hi,
thanks for opening the issue. We'll look into it and we'll let you know asap.

Luca

from malva.

ckandoth avatar ckandoth commented on June 26, 2024

@ldenti I was able to get it working by adding fake FORMAT and sample columns with GT populated as 0|0 for all variants. The sample ID in my input VCF does not seem to matter since your output VCF will replace it with "DONOR".

perl -i -a -F'\t' -pe 'chomp; $s="\tFORMAT\tSAMPLE_ID" if(/^#CHROM/); $s="\tGT\t0|0" unless(/^#/); $_.="$s\n"' fingerprint_snps_chr20.vcf
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf

from malva.

ldenti avatar ldenti commented on June 26, 2024

I see. Yes, the problem is that there must be the sample columns in the input VCF. Indeed, malva needs these columns to characterize each variant by computing its signatures.

Therefore if no sample columns are present in the input VCF or if you use only 0|0 as dummy column, then malva would not work properly and the output VCF would be meaningless.

So, if you want to use the VCF provided by the 1000genomesproject, then you have to download the ALL.chr*. files and concatenate them with bcftools (see this script).

from malva.

ckandoth avatar ckandoth commented on June 26, 2024

From your paper: "For each variant allele, assign a signature in the form of a set of k-mers". You can do this with the ALT in the VCF, and there is no need for GT. You could technically use GT to build a genotype likelihood model i.e. whether each variant is more commonly heterozygous/homozygous across individuals, but I don't see that described in the paper. Can you roughly describe how malva-geno uses GT from the input VCF?

I really appreciate your insight. We intend to use malva to genotype the >350k pairs of FASTQs archived at MSKCC - and use that to identify new samples. If successful, it would become a critical QC step in our infrastructure. But most of our data is targeted sequencing or RNA, so we would benefit from using SNPs from ExAC instead of 1000genomes.

from malva.

ldenti avatar ldenti commented on June 26, 2024

Yes, it is right: we assign a signature, that is a set of kmers, to each variant.

But actually we assign multiple signatures to the same variant. For example, let us assume to have two single-allelic SNPs that occur within k/2 bases. Then, to characterize each allele of each variant, we should use 2 different signature: one taking into account the ref allele of the close variant and the other taking into account the alternate allele. This is just an example but in same cases, there could be more than 2 variants that are close and they could also be multi-allelic variants. To avoid to make all the possible combinations (that can be exponential in k), we base our computation only on those combinations that actually occur in the population used to produce the VCF.

I hope this is clear, if not, let me know and I'll try to explain it in a clearer way. In any case, you can find a more detailed explanation in the supplemental material: see definition 1 and definition 2.

Maybe if the GT fields are not available we can just make a single signature without taking in consideration close variants. It should be easy to do, but I think that it could lower the global accuracy of the tool. If your files do not have GT fields, I could try to implement and test that.

from malva.

ckandoth avatar ckandoth commented on June 26, 2024

Very clear. Thanks Luca. I'll close this ticket, because my original problem is resolved.

from malva.

Related Issues (8)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.