Code Monkey home page Code Monkey logo

mlehaplo-1's Introduction

MLEHaplo

Maximum Likelihood Estimation for Viral Populations

VIPRA and MLEHaplo README file

Pre-requisites:

  1. multi-dsk: k-mer counting software ( Extension of dsk [http://minia.genouest.org/dsk/] version 1.5655)
  • Counts k-mers for multiple values of k simultaneously. The software doesn't combines the counts of forward and reverse complement k-mers, as is performed in traditional k-mer counting softwares
  1. perl with modules Bio::Perl, Getopt::Long, Graph.

Preliminaries

Step 1: Generate k-mer counts file

Command: multi-k fasta/fastq_file list_of_kvalues -d diskspace_limit -m memory_limit

Example:

./multi-dsk/multi-dsk Example/paired-reads.fasta  Example/listofkmers.txt  -m 8192 -d 10000
  • fasta/fastq file: is the file name of the fasta/fastq file. If there are a list of files, use a text file containing locations of all the files, one file per line. For example, if there are two files containing paired reads, file1.fastq and file2.fastq, create a file list_of_files.txt containing following:
file1.fastq
file2.fastq
  • list_of_kvalues : contains a list of k values for which k-mer counting has to be performed. For example if k-mer counting is desired for k-values 55,45,35,25. Create a file list_of_kmers.txt containing these numbers in decreasing order:
55
45
35
25
  • diskspace_limit : defines the limit of temporary disk space (in MB) used while performing k-mer counting

  • memory_limit : defines the memory limit (in MB) for storage of temporary hashes while performing k-mer counting

  • Output of multi-dsk is a collection of files with extension prefix.solid_kmers_binary."kvalue" in compressed format, which contains counts of k-mers present in the fasta/fastq file.

Step 2: De-compress the output of multi-dsk

Command: parse_results prefix.solid_kmers_binary.kvalue_file > prefix_file.kvalue

Example:

./multi-dsk/parse_results Example/paired-reads.solid_kmers_binary.60 > paired-reads.60
  • The file fasta/fastq_file.kvalue now contains the k-mer counts for the fasta/fastq file in the format "k-mer count" per line.

Step 3: Generate the De Bruijn graph

Generating the graph needs two files and a parameter. This will combine paired files into a single file.

  1. fasta_file containing all the reads.
  2. kmer_file generated above.
  3. threshold value for ignoring erroneous k-mers.

Command: perl construct_graph.pl fasta_file kmer_file threshold graph_file "s"

Example:

perl construct_graph.pl  Example/paired-reads.fasta paired-reads.60 0 paired-reads.60.graph "s"
  • "s" parameter tells the perl script that there is a single fasta file of reads
  • Output is the graph_file containing pairs of k-mers that form edges in the De Bruijn graph.

TODO: Add "From (k+1)-mer counts file"

Step 4: Create the paired set file

Create the paired set using the paired reads. It takes as input the two paired files, file1.fasta and file2.fasta, the k-mer counts file file1.kvalue, and a threshold for ignoring erroneous k-mers

Command: perl construct_paired_without_bloom.pl -file1 file1.fasta -file2 file2.fasta -paired -kmerfile file1.kvalue -thresh number -wr output_paired_set_file

Example:

perl construct_paired_without_bloom.pl -fasta Example/paired-reads.fasta -kmerfile paired-reads.60 -thresh 0 -wr paired-reads.60.pk.txt
  • Choice of threshold : Dependent on sequencing coverage. Lower threshold includes more erroneous k-mers in the graph, while higher threshold decreases the number of true k-mers and size of the graph.

  • Output is a file that contains pairs of k-mers on a line and the number of times such pair is observed: kmer1 kmer2 count

VIPRA

Step A: Running VIPRA

Running the VIPRA algorithm takes inputs generated above and a parameter for the average insert size, threshold parameter and a value for M (factor) which decides the number of paths to generate per vertex

Command: perl dg_cover.pl -graph graph_file_from_step3 -kmer kmer_file_from_step2 -paired paired_set_from_step4 -fact M -thresh threshold_value -IS insert_size > vipra_output_file

Example:

perl dg_cover.pl -graph paired-reads.60.graph -kmer paired-reads.60 -paired paired-reads.60.pk.txt -fact 15 -thresh 0 -IS 400 > paired-reads.60.fact15.txt
  • graph_file_from_step3 - output_file from Step 3: Generate the De Bruijn graph

  • kmer_file_from_step2 - output_file from Step 2: De-compress the output of multi-dsk

  • paired_set_from_step4 - output_file from Step 4: Create the paired set file

  • vipra_output_file: Contains the paths generated from the graph with high paired end supports.

  • prefix.comp.txt : Contains a sets of paired vertices in the condensed graph that are compatible with each other based on the paired set.

  • prefix.cond.graph : Contains the condensed version of De Bruijn graph, with non-branching paths condensed to a single vertex.

  • Temporary output files:

    1. prefix.bubble.txt: Contains a sets of paired bubbles in the condensed graph that are compatible with each other based on the paired set.
    2. prefix.depth : Contains the depth first search traversal of the graph.
    3. prefix.nodedepth : Temporary file for debugging of code.

Step B: Generate fasta file

Extracting fasta file from outputfile

Command: perl process_dg.pl vipra_output_file > paths_fasta_file

Example:

perl process_dg.pl paired-reads.60.fact15.txt > paired-reads.60.fact15.fasta
  • Output: paths_fasta_file. The paths generated by VIPRA

Step C: Generate paths file for maximum likelihood estimation

Extracting just the paths in terms of nodes in the graph

Command: perl get_paths_dgcover.pl -f vipra_output_file -w paths_write_file

Example:

perl get_paths_dgcover.pl -f paired-reads.60.fact15.txt -w paired-reads.60.fact15.paths.txt
  • Output: paths_write_file

MLEHaplo

Running MLEHaplo takes as input intermediate files generated by VIPRA Step A: Running VIPRA and paths_write_file generated in Step C: Generate paths file for maximum likelihood estimation

Command: perl likelihood_singles_wrapper_parallel.pl -condgraph prefix.cond.graph -compset prefix.comp.txt -pathsfile paths_write_file -back -slow -gl approximate_genome_size > MLE_textfile_output

Non Parallel Version Command: perl likelihood_singles_wrapper.pl -condgraph prefix.cond.graph -compset prefix.comp.txt -pathsfile paths_write_file -back -gl approximate_genome_size -slow > MLE_textfile_output

Example

perl likelihood_singles_wrapper.pl -condgraph paired-reads.60.cond.graph -compset paired-reads.60.comp.txt -pathsfile paired-reads.60.fact15.paths.txt -back -gl 1200 -slow  > paired-reads.60.smxlik.txt

Required input files & Parameters

  1. prefix.cond.graph: Condensed Graph file generated by VIPRA .
  2. prefix.comp.txt: Compatible Set generated by VIPRA.
  3. paths_write_file: Paths file from ViPRA Step C.
  4. approximate_genome_size: Approximate genome Length.

Final viral population generation using MLE_textfile_output

Command: perl extract_MLE.pl -f paths_fasta_file -l MLE_textfile_output > MLE_Haplo_fasta_OUTPUT

Example

perl extract_MLE.pl -f paired-reads.60.fact15.fasta -l paired-reads.60.smxlik.txt > paired-reads.60.MLE.fasta

mlehaplo-1's People

Contributors

louisranjard avatar raunaq-m avatar stevenhwu avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.