algolab / malva
genotyping by Mapping-free ALternate-allele detection of known VAriants
Home Page: https://algolab.github.io/malva/
License: GNU General Public License v3.0
Hello,
Can the test reads be long reads? In theory I don't see why not, but do you think the results would be as robust? I am just worried the Kmer counting does something wild.
Thanks
Alex
I'm trying to install MALVA on a Linux server (Ubuntu 18.04.5 LTS, GNU/Linux 4.15.0-112-generic x86_64) by following the instructions in the README in the root directory, and I get the following error:
./malva-geno: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory
No changes were made to the build instructions, and there were no reported errors during compilation of sdsl-lite, kmc, and htslib, or MALVA itself.
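An error like this usually means the binary was linked correctly but the dynamic loader cannot find the library at run time. As a minimal sketch for narrowing it down: the helper name `missing_libs` is made up for this example, and the `htslib` directory in the usage comments is an assumption about where the library was built, not something confirmed by the report.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print the names of the shared libraries
# that the dynamic loader could not resolve.
# `missing_libs` is a hypothetical helper name used only in this sketch.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage (the htslib path is an assumption; adjust it to your checkout):
#   ldd ./malva-geno | missing_libs
#   export LD_LIBRARY_PATH="$PWD/htslib:$LD_LIBRARY_PATH"
#   ldd ./malva-geno | missing_libs    # prints nothing once all libraries resolve
```

If the library is listed as missing, pointing `LD_LIBRARY_PATH` at the directory that contains `libhts.so.3` is one common way to let the loader find it without reinstalling anything.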
Nice work with malva. I tried this on several R1 FASTQs of paired-end 150x2 whole-genome sequencing datasets, and ended up with empty VCFs in all cases. Can you help debug? Here are the commands used:
mkdir sample_name
kmc -t16 -m64 -k43 -f /path/to/fastqs/fastq_name_S4_R1_001.fastq.gz sample_name/sample_name.res sample_name
malva-geno -k35 -r43 -b16 /path/to/reference/b37.fasta fingerprint_snps.vcf sample_name/sample_name.res > sample_name/sample_name.fingerprint.vcf
Logs from kmc reported:
Stats:
No. of k-mers below min. threshold : 956248069
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1888206968
No. of unique counted k-mers : 931958899
Total no. of k-mers : 12689712777
Total no. of reads : 116433462
Total no. of super-k-mers : 792031281
Logs from malva-geno reported:
[malva-geno/Reference parsing] Time elapsed 0s
[malva-geno/Reference processed] Time elapsed 6s
[malva-geno/VCF parsing (Bloom Filter construction)] Time elapsed 6s
[malva-geno/Processed 5000 variants] Time elapsed 18s
[malva-geno/Processed 6246 variants] Time elapsed 18s
[malva-geno/BF creation complete] Time elapsed 27s
[malva-geno/Reference BF construction] Time elapsed 27s
[malva-geno/Reference BF creation complete] Time elapsed 80s
[malva-geno/KMC output processing] Time elapsed 94s
malva-geno crashes when the VCF has "chr" in a contig name. Commenting out the section of main.cpp that strips "chr" from the id in the reference map lets malva run fine. I'd recommend placing the burden on the users to match VCF and reference naming conventions.
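For users who want to check naming conventions before running anything, a rough sketch of such a check follows. The function names are invented for this example, and it assumes an uncompressed VCF and a plain-text FASTA (a `.vcf.gz` would need to be decompressed first).

```shell
#!/bin/sh
# Print contig names from a FASTA: the first word of each header line.
fasta_contigs() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort -u
}

# Print contig names used by the records of an uncompressed VCF.
vcf_contigs() {
    grep -v '^#' "$1" | cut -f1 | LC_ALL=C sort -u
}

# Print contigs that appear in the VCF but not in the FASTA.
# Arguments: reference FASTA, then VCF.
contigs_missing_from_fasta() {
    fa_list=$(mktemp) && vcf_list=$(mktemp) || return 1
    fasta_contigs "$1" > "$fa_list"
    vcf_contigs "$2" > "$vcf_list"
    # comm -13 keeps only the lines unique to the second file.
    LC_ALL=C comm -13 "$fa_list" "$vcf_list"
    rm -f "$fa_list" "$vcf_list"
}

# Usage:
#   contigs_missing_from_fasta reference.fa variants.vcf
```

Any name it prints (for example `chr1` when the reference uses `1`) points to a mismatch between the two files.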
Hello, I have installed MALVA following the conda approach but it returns
error while loading shared libraries: libkmc.so: cannot open shared object file: No such file or directory
Is there an issue with the bioconda recipe? thanks
First, thanks for a very nice tool!
I have a couple of questions:
Is it possible to speed up Malva using more threads? I know that KMC easily can use more threads, meaning that the first step of Malva (getting kmers from the reads) can be parallelized, but can malva-geno (computing the signatures and performing the genotyping) also be parallelized in any way?
Assume I want to genotype many samples. I guess that one could potentially save a lot of time by not computing the variant signatures and reference kmers for each sample (since these should be the same, given the input variants). It seems now that malva-geno does this every time a new sample is genotyped. Is it possible to re-use this data so that genotyping of new samples would become faster?
Thanks!
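Independently of whether malva-geno itself can be parallelized, one generic workaround is to run several samples at the same time as separate processes. The sketch below does only that: it does not reuse signatures or reference k-mers between samples. The function name and argument order are invented for this example, the flags are copied from a command quoted in another issue on this page, and it assumes KMC output already exists as `<sample>/<sample>.res.*`.

```shell
#!/bin/sh
# Genotype several samples concurrently, one malva-geno process per sample.
# Arguments: reference FASTA, VCF, number of parallel jobs, then sample names.
# Every process builds its own data structures, so memory use grows with
# the number of jobs.
genotype_samples() {
    ref=$1; vcf=$2; jobs=$3; shift 3
    printf '%s\n' "$@" |
        xargs -P "$jobs" -I{} sh -c \
            'malva-geno -k35 -r43 -b16 "$1" "$2" "$3/$3.res" > "$3/$3.vcf"' \
            sh "$ref" "$vcf" {}
}

# Usage:
#   genotype_samples b37.fasta variants.vcf 4 sampleA sampleB sampleC
```

`xargs -P` is available in GNU and BSD xargs but is not part of POSIX, so this may need adapting on other systems.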
I installed Malva using conda into a Docker container and am running it with Nextflow.
RUN conda update conda \
&& conda install -c bioconda malva=1.1.1
The command to run MALVA is:
#!/bin/bash -ue
MALVA \
GRCh38_full_analysis_set_plus_decoy_hla.fa \
ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
GRC194242437_trimmed.fq.gz \
> GRC194242437.vcf
The known variants are the 1000 Genomes 2019 high-depth GRCh38 variants.
I am using the 1000 Genomes version of GRCh38.
The FASTQ file is a concatenation of paired-end reads (i.e. all read 1, then all read 2; not interleaved). It contains only 1x worth of reads.
The MALVA script outputs:
[MALVA] Running KMC
[MALVA] Running malva-geno
[malva-geno/Reference parsing] Time elapsed 0s
[malva-geno/Reference processed] Time elapsed 10s
[malva-geno/VCF parsing (Bloom Filter construction)] Time elapsed 10s
/home/maxh/conda/bin/MALVA: line 93: 13284 Killed ${EXECUTABLE} -k ${k} -r ${refk} -e ${erate} -f ${freq} -s ${samples} -c ${maxcov} -b ${bfsize} ${reference} ${vcf_file} ${kmc_out_prefix}
[MALVA] Cleaning up
Thus KMC and malva-geno both run, but the run time of malva-geno is suspiciously short. I don't quite understand why bash prints the line from MALVA that it does: the command shown is not on line 93 but on line 97, which is the line that runs malva-geno. In fact, it should never get to line 93, as this is only reached if kmc_pre and kmc_suf are not present.
What could be the cause? Resources? I limit MALVA to 8GB currently, is that too low?
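A "Killed" message from bash means the process received SIGKILL, which is what the kernel's out-of-memory killer or a container memory limit typically sends. A small sketch for confirming this from the exit status (the wrapper name `run_and_check_kill` is invented for this example):

```shell
#!/bin/sh
# Run a command and report when it was terminated by SIGKILL.
# Exit status 137 = 128 + 9 (SIGKILL).
run_and_check_kill() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 137 ]; then
        echo "terminated by SIGKILL (status 137): possibly out of memory" >&2
    fi
    return "$status"
}

# Usage (arguments elided; pass the same ones the MALVA script uses):
#   run_and_check_kill malva-geno ... > sample.vcf
```

A status of 137 does not prove that memory was the cause, but combined with an 8GB limit it makes that the first thing to rule out, for example by raising the limit or checking the kernel log on the host.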
Hi there,
We'd like to try using MALVA on our own low-coverage WGS data (~1x). We've noticed that the MALVA release we're using (version 1.3.1; build h3889886_0) is only genotyping sites where a sample has >=2 coverage. Is there a way to change this default behaviour so that sites with lower coverage are genotyped as well? There's nothing obvious in the provided flags, but maybe it's possible to modify the original code.
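One possible explanation, offered as an assumption rather than a confirmed diagnosis: KMC by default discards k-mers that occur fewer than 2 times (its `-ci` option defaults to 2), so k-mers seen once never reach the genotyping step. The sketch below runs KMC directly with `-ci1`. The function name is invented for this example, `-k43` is copied from a command quoted in another issue on this page, and whether the MALVA wrapper exposes this setting or malva-geno applies further thresholds of its own is not established here.

```shell
#!/bin/sh
# Count k-mers while keeping those that occur only once.
# -ci1 lowers KMC's minimum-count cutoff from its default of 2 to 1.
# Arguments: reads file, KMC output prefix, temporary directory.
kmc_keep_singletons() {
    reads=$1; out_prefix=$2; tmp_dir=$3
    mkdir -p "$tmp_dir"
    kmc -k43 -ci1 "$reads" "$out_prefix" "$tmp_dir"
}

# Usage:
#   kmc_keep_singletons sample.fq.gz sample.res kmc_tmp
```

Keeping singleton k-mers also admits many more k-mers that come from sequencing errors, so the resulting genotypes should be treated with more caution.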