License: GNU General Public License v3.0

Package name: graph2pro-var
This package contains a wrapper script (graph2pro-var.sh) and component packages for two algorithms,
graph2pro and var2pep, which identify proteins/peptides from metaproteomic mass spectral data with matching
metagenomic/metatranscriptomic data (i.e., a meta-proteogenomic approach).
graph2pro identifies proteins/peptides based on the assembly graph.
var2pep uses the sequencing reads that cannot be assembled to further improve peptide identification.

Released: Nov 21, 2018
Developers: Sujun Li ([email protected]), Yuzhen Ye ([email protected]) and Haixu Tang ([email protected])

This work was supported by NIH grant 1R01AI108888 to YY and HT

graph2pro-var is free software under the terms of the GNU General Public License as published by
the Free Software Foundation.

>> Package contents
graph2pro-var.sh -- wrapper script for the package that can be called directly
Graph2Pro -- programs for assembly graph based protein/peptide identification
MSGF+ -- MS search engine
FragGeneScan, bowtie2, RAPSearch2, cd-hit and pyscript -- folders contain programs/scripts for other purposes 
Tests folder, which contains a small dataset for testing the pipeline. 
   SML.par -- the parameter file
   other files, including the fastg and spectral files

>> Installation
   On a Linux machine, the included tools will probably work as shipped.
   If they do not, you may need to recompile them.
   We provide a script for recompiling the tools:
      $./install

>> Try a testing example (under Tests folder)
   Check readme file under the Tests folder for usage; all necessary files are provided in the same folder.
   Command to test the provided testing example:
      $cd Tests
      $nohup sh ../graph2pro-var.sh SML.par > SML.log &
   
   Notes:
       1) Included in this folder you can see SML.par, the parameter file that will be used by the wrapper script. 
       2) The first four parameters (id, kmer, fastg, ms) are mandatory; reads, thread, memory, and fdr are optional.
          By default fdr is set to 0.01, thread is set to 8, and memory is set to 32g 
         (if you have extremely large spectral files, you may consider increasing the memory)
       3) The kmer parameter specifies the kmer size; it must be the same as the kmer size 
          used for the assembly graph generation (see "How to prepare assembly graph?" below)
       4) Note the toy dataset is extremely small so only very few peptides will be identified.
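   The mandatory-parameter rule in note 2 can be checked before starting a long run.
   The following is a minimal sketch, assuming the parameter file holds one key=value
   entry per line (a layout inferred from the "reads=" and "cascaded=yes" entries that
   this README mentions). check_par is a hypothetical helper, not part of the package.

```shell
# check_par FILE: report any mandatory key (id, kmer, fastg, ms) missing from FILE.
# Assumption: each parameter is written as key=value at the start of a line.
check_par() {
    par_file=$1
    missing=""
    for key in id kmer fastg ms; do
        # A key counts as present if some line starts with "key=".
        if ! grep -q "^${key}=" "$par_file"; then
            missing="$missing $key"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing mandatory parameters:$missing"
        return 1
    fi
    echo "ok"
}
```

   For example, "check_par SML.par" prints "ok" when all four mandatory entries are present.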
   Required and optional input files (specified in the parameter file):  
       1) fastg file, the assembly graph of metagenome (and/or metatranscriptome) (required)
       2) spectral file (required) 
       3) reads files (optional)
       graph2pro uses the first two inputs, and var2pep needs the additional reads files. 
       If no reads files are provided, the pipeline stops after graph2pro is completed. 
   Outputs: 
       1) If this test example runs successfully, the pipeline reports identified peptides and other result files,
          all having the specified id (in this case SML, as specified in the parameter file) as the prefix in their names.
       2) *.final-peptide.txt
          this file lists the identified peptides, # of supporting spectra, and by which program (graph2pro or var2pep).
       3) Other intermediate files you may find useful: 
          *.graph2pro.fasta -- the Graph2Pro database
          *.var2pro.fasta -- the Var2Pep database
          *.mzid files -- MSGF+ search results
          *.0.01.tsv -- details of peptide identification, with FDR control (set to 1% in this case) by different approaches;
              *.fgs.tsv.0.01.tsv -- results from using contigs only
              *.graph2pro.tsv.0.01.tsv -- results from graph2pro
              *.var2pep.tsv.0.01.tsv -- results from var2pep
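   After a run, it can help to confirm which of the result files listed above were produced.
   The following is a minimal sketch, assuming the files sit in the current folder and are named
   with the id as prefix followed by the suffixes listed above. check_outputs is a hypothetical
   helper, not part of the package.

```shell
# check_outputs ID [FDR]: report which expected result files exist in the current folder.
# Assumption: result files are named ID<anything>.<suffix>, with the suffixes from the list above.
check_outputs() {
    id=$1
    fdr=${2:-0.01}   # default FDR, as stated in the notes
    for suffix in final-peptide.txt graph2pro.fasta \
                  "fgs.tsv.$fdr.tsv" "graph2pro.tsv.$fdr.tsv" "var2pep.tsv.$fdr.tsv"; do
        state=missing
        # If nothing matches, the glob stays literal and the -e test fails.
        for f in "$id"*."$suffix"; do
            if [ -e "$f" ]; then
                state=found
            fi
        done
        echo "$suffix: $state"
    done
}
```

   Running "check_outputs SML" in the Tests folder prints one line per expected file. The var2pep
   results are expected to be missing when no reads files were provided.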

>> Try graph2pro & var2pep for your own datasets
   Because the graph2pro-var.sh pipeline creates large intermediate result files in the **current** folder,
   we recommend that you create a dedicated working folder for each of your jobs, and work under that folder.

   ** Prepare the parameter file for running the pipeline for your job.
      See SML.par under the Tests folder for an example.
      Copy and paste this file to your working folder, and revise the parameter file accordingly.
      Please note the graph2pro-var.sh pipeline gets the input file names (fastg, MS data, reads files) from
      the parameter file. Make sure that you provide the paths for the input files in the parameter file, if
      the input files are not in the current folder.

   ** Call graph2pro-var.sh (referring to the script by its full path) as follows:
         $path-to-the-wrapper-script/graph2pro-var.sh parameter-file 
         OR
         $nohup sh path-to-the-wrapper-script/graph2pro-var.sh parameter-file > parameter-file.log &
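   The preparation steps above can be sketched as a small helper. This is only an illustration:
   prepare_job is a hypothetical function, and the folder and file names in the usage note are
   placeholders.

```shell
# prepare_job JOB_DIR PAR_TEMPLATE: create a dedicated working folder and copy a
# parameter file into it. Prints the path of the copy, which still has to be edited.
prepare_job() {
    job_dir=$1
    par_template=$2
    mkdir -p "$job_dir" || return 1
    cp "$par_template" "$job_dir/" || return 1
    echo "$job_dir/$(basename "$par_template")"
}
```

   For example, "prepare_job myjob path-to-the-package/Tests/SML.par" creates myjob/SML.par.
   After revising that file, change into myjob and call the wrapper script as shown above.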

>> How to prepare assembly graph?
   An important input to the pipeline is the assembly graph (in fastg format). 
   We recommend using the MEGAHIT assembler to prepare the assembly graph (i.e., the fastg file) as follows:
   a) First run megahit with an option like --k-list 21,29,39,59,79,99 (note the final k-mer size, 99)
   b) Then use megahit_toolkit to prepare the fastg file:
        megahit_toolkit contig2fastg 99 intermediate_contigs/k99.contigs.fa > k99.contigs.fastg
   c) If you use metaSPAdes, please use the customized parameters and follow the instructions of that software to get the fastg file
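   The k-mer size passed to contig2fastg has to match the final value of --k-list (and the kmer
   entry in the parameter file). The following is a minimal sketch that derives the matching command
   from the k-list instead of typing the number twice. last_kmer and fastg_command are hypothetical
   helpers that only print text; they do not run MEGAHIT.

```shell
# last_kmer LIST: print the final k from a comma-separated --k-list value.
last_kmer() {
    # Strip everything up to and including the last comma.
    echo "${1##*,}"
}

# fastg_command LIST: print the megahit_toolkit command that matches the final k.
fastg_command() {
    k=$(last_kmer "$1")
    echo "megahit_toolkit contig2fastg $k intermediate_contigs/k$k.contigs.fa > k$k.contigs.fastg"
}
```

   With the k-list from step a), "last_kmer 21,29,39,59,79,99" prints 99, which is also the value
   to use for kmer in the parameter file.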
 
#New update, March 2019
>> Use graph2pep only
   To be compatible with fastg files from MEGAHIT or metaSPAdes, we have re-coded the program graph2pep.
   If you only want to use graph2pep to produce a database from a fastg file, use the following example:
   	./Graph2Pro/DBGraph2Pro -s assembly_graph.fastg -S -k 49 -o test.graph2pep.fasta
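   A quick sanity check on the resulting database is to count its entries. The following is a
   minimal sketch, assuming the output is a standard FASTA file in which every entry starts with
   a ">" header line. count_fasta_entries is a hypothetical helper, not part of the package.

```shell
# count_fasta_entries FILE: print the number of FASTA header lines (lines starting with ">").
count_fasta_entries() {
    grep -c '^>' "$1"
}
```

   For example, "count_fasta_entries test.graph2pep.fasta" prints the number of sequences written
   by the command above. A count of 0 suggests the fastg file was not parsed as expected.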

>> Use graph2pro only
   If you want to run only graph2pro, without var2pep, on your dataset, remove the "reads=" parameter
   from the parameter file and call the same wrapper script, graph2pro-var.sh. In this case, only the fastg file (assembly graph) and
   the spectral data are needed as inputs; the reads files are not.
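   Removing the reads entry can be done without editing the original parameter file. The following
   is a minimal sketch, assuming the entry is a single line starting with "reads=". strip_reads is
   a hypothetical helper, not part of the package.

```shell
# strip_reads FILE: print FILE without any line that starts with "reads=".
strip_reads() {
    grep -v '^reads=' "$1"
}
```

   For example, "strip_reads SML.par > SML.graph2pro-only.par" writes a copy for a graph2pro-only
   run (the output file name is a placeholder).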

>> Other option: cascaded search
   If you want to try cascaded search (only unidentified spectra from the Graph2Pro step go to the Var2Pep based search),
   simply add a line to the end of your parameter file:
   cascaded=yes
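   The line can also be appended from the command line. The following is a minimal sketch that
   avoids adding the entry twice and makes sure it lands on its own line. enable_cascaded is a
   hypothetical helper, not part of the package.

```shell
# enable_cascaded FILE: append "cascaded=yes" to FILE unless it is already there.
enable_cascaded() {
    par_file=$1
    # Do nothing if the option is already set.
    if grep -q '^cascaded=yes$' "$par_file"; then
        return 0
    fi
    # If the last byte is not a newline, add one so the new entry gets its own line.
    if [ -n "$(tail -c 1 "$par_file")" ]; then
        echo >> "$par_file"
    fi
    echo "cascaded=yes" >> "$par_file"
}
```

   Calling "enable_cascaded SML.par" more than once leaves a single cascaded=yes line in the file.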

>> Submit the pipeline to queue using qsub
   Peptide identification from spectral data is very time consuming. If you have many datasets, you may want to use a computer cluster.
   Note: if your computer cluster uses qsub, make sure you specify -v par=par-file-name and -v pgmdir=program-folder to pass
   this information along to the queue.
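   The two variables can be combined into one submission command. The following is a minimal sketch
   that only prints the command, assuming a qsub that accepts a comma-separated variable list after -v.
   This README does not say which script to submit, so the third argument is a placeholder, and
   qsub_command is a hypothetical helper.

```shell
# qsub_command PAR_FILE PGMDIR JOB_SCRIPT: print a qsub command line that passes
# par and pgmdir to the job. JOB_SCRIPT is a placeholder; adjust it to your setup.
qsub_command() {
    par_file=$1
    pgmdir=$2
    job_script=$3
    echo "qsub -v par=$par_file,pgmdir=$pgmdir $job_script"
}
```

   Review the printed line against your cluster's qsub documentation before running it, since
   resource options such as memory and run time usually have to be added.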

>> Summarize the results (see example under Tests folder)
   $path-to-the-wrapper-script/summarize.sh parameter-file(the same used by graph2pro-var.sh)
   This script reports the number of spectra and unique peptides identified by the different approaches 
    (contig only, graph2pro, graph2pro&var2pep).

graph2pro-var's Issues

Edge not found error

** I can upload the fastg, fastq, mgf and par file to a datashare, can't share on github.

We are running on a Linux 5.3.18-150300.59.60-default x86_64 server, and we are submitting jobs using slurm 21.08.8-2 with the following parameters:

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=48:00:00
#SBATCH --mem=500

We are submitting the jobs with the following command: srun ~/software/graph2pro-var/graph2pro-var.sh ~/file.par
GCC version is 10.3.0 (GCC).

When running the graph2pro-var.sh script, we encounter the following error in the "Graph2Pep" step "read edges from file ...":

edgeseqfile /fs/pool/pool-mann-projects/Peter_Florian_Max_Capscan/fastg/k39_all/k39_all/1249_k39.fastg
514112 edges input
Edge not found: NODE_101623_length_2622_cov_11.6744_ID_203245

This stops the Graph2Pep step from completing, so '1743.1_k39.graph2pep.fasta' is never created. That results in another error message, and the rest of the pipeline is left without input and completes incorrectly.

Traceback (most recent call last):
File "/fs/gpfs41/lv07/fileset03/home/b_mann/maedler/Software/graph2pro-var/pyscript/createFixedReverseKR.py", line 26, in
inf = open(argv[1], "r")
FileNotFoundError: [Errno 2] No such file or directory: '1743.1_k39.graph2pep.fasta'

Error when running graph2pro-var wrapper script

Hi Sujun,

I tried to run the entire wrapper script on your test data, but got the following error message:

../graph2pro-var.sh: 29: ../graph2pro-var.sh: Syntax error: "(" unexpected

I'm running Ubuntu 14.04 with a BASH shell.
Could you please advise on that?

Sincerely,
Goor

Segmentation Fault error in DBGraphPep2Pro

Hi,

I am experiencing a segmentation fault in the DBGraphPep2Pro step.

graph2pro-var_metaSPAdes.sh: line 172: 109636 Segmentation fault (core dumped) $dbpep2pro -s $gnm -f -p $exp_d.step1.output -o $exp_d.step1.fasta.2 -k $kmer -d 5

Can you please look into the issue. Also, let me know if you would need more information.

Thanks,
Praveen
