col-iu / graph2pro-var
License: GNU General Public License v3.0
Package name: graph2pro-var

This package contains a wrapper script (graph2pro-var.sh) and component packages for two algorithms, graph2pro and var2pep, for protein/peptide identification from metaproteomic mass spectral data with matching metagenomic/transcriptomic data (i.e., a meta-proteogenomic approach). The graph2pro approach is based on the assembly graph for protein/peptide identification. The var2pep approach aims to use the sequencing reads that cannot be assembled to further improve peptide identification.

Released: Nov 21, 2018
Developers: Sujun Li ([email protected]), Yuzhen Ye ([email protected]) and Haixu Tang ([email protected])
This work was supported by NIH grant 1R01AI108888 to YY and HT.

graph2pro-var is free software under the terms of the GNU General Public License as published by the Free Software Foundation.

>> Package contents
  graph2pro-var.sh -- wrapper script for the package that can be called directly
  Graph2Pro -- programs for assembly graph based protein/peptide identification
  MSGF+ -- MS search engine
  FragGeneScan, bowtie2, RAPSearch2, cd-hit and pyscript -- folders containing programs/scripts for other purposes
  Tests -- folder containing a small dataset for testing the pipeline:
    SML.par -- the parameter file
    other files, including the fastg and spectral files

>> Installation
If you run the pipeline on a Linux machine, it will probably work as shipped. If not, you may need to recompile the included tools. We provide a script for recompiling them:
  $./install

>> Try a testing example (under Tests folder)
Check the readme file under the Tests folder for usage; all necessary files are provided in the same folder.
Command to test the provided testing example:
  $cd Tests
  $nohup sh ../graph2pro-var.sh SML.par > SML.log &

Notes:
1) Included in this folder you can see SML.par, the parameter file that will be used by the wrapper script.
2) The first four parameters (id, kmer, fastg, ms) are mandatory; reads, thread, memory, and fdr are optional.
   By default fdr is set to 0.01, thread is set to 8, and memory is set to 32g (if you have extremely large spectral files, consider increasing the memory).
3) The kmer parameter specifies the k-mer size; it must be the same as the k-mer size used for the assembly graph generation (see "How to prepare assembly graph?" below).
4) The toy dataset is extremely small, so only very few peptides will be identified.

Required and optional input files (specified in the parameter file):
1) fastg file, the assembly graph of the metagenome (and/or metatranscriptome) (required)
2) spectral file (required)
3) reads files (optional)
graph2pro uses the first two inputs, and var2pep needs the additional reads files. If no reads files are provided, the pipeline stops after graph2pro is completed.

Outputs:
1) If this test example runs successfully, the pipeline reports identified peptides and other result files, all having the specified id (in this case SML, as specified in the parameter file) as the prefix in their names.
2) *.final-peptide.txt -- this file lists the identified peptides, the number of supporting spectra, and the program that identified each peptide (graph2pro or var2pep).
3) Other intermediate files you may find useful:
  *.graph2pro.fasta -- the Graph2Pro database
  *.var2pro.fasta -- the Var2Pep database
  *.mzid files -- MSGF+ search results
  *.0.01.tsv -- details of peptide identification, with FDR control (set to 1% in this case) by different approaches:
    *.fgs.tsv.0.01.tsv -- results from using contigs only
    *.graph2pro.tsv.0.01.tsv -- results from graph2pro
    *.var2pep.tsv.0.01.tsv -- results from var2pep

>> Try graph2pro & var2pep for your own datasets
As the graph2pro-var.sh pipeline creates large intermediate result files in the **current** folder, we recommend that you create a dedicated working folder for each of your jobs, and work under that folder.

** Prepare the parameter file for running the pipeline for your job. See SML.par under the Tests folder for an example.
Copy this file to your working folder, and revise it accordingly. Note that the graph2pro-var.sh pipeline gets the input file names (fastg, MS data, reads files) from the parameter file. If the input files are not in the current folder, make sure that you provide their paths in the parameter file.

** Call graph2pro-var.sh (referring to the script with full path information) as follows:
  $path-to-the-wrapper-script/graph2pro-var.sh parameter-file
OR
  $nohup sh path-to-the-wrapper-script/graph2pro-var.sh parameter-file > parameter-file.log &

>> How to prepare assembly graph?
An important input to the pipeline is the assembly graph (in fastg format). We recommend that you use the MEGAHIT assembler to prepare the assembly graph (i.e., the fastg file) as follows:
a) First run megahit with an option like --k-list 21,29,39,59,79,99 (notice the ending k-mer size 99)
b) Then use megahit_toolkit to prepare the fastg:
  megahit_toolkit contig2fastg 99 intermediate_contigs/k99.contigs.fa > k99.contigs.fastg
c) If you use metaSPAdes, please use the customized parameters and follow the instructions of the software to get the fastg

#new update in March, 2019
>> Use graph2pep only
To be compatible with fastg files from MEGAHIT or metaSPAdes, we have re-coded the program graph2pep. If you only want to use graph2pep to produce a database from a fastg file, use the following example:
  ./Graph2Pro/DBGraph2Pro -s assembly_graph.fastg -S -k 49 -o test.graph2pep.fasta

>> Use graph2pro only
If you want to run only graph2pro, without var2pep, for your dataset, remove the "reads=" parameter from the parameter file and call the same wrapper script graph2pro-var.sh. In this case, only the fastg file (assembly graph) and the spectral data, but not the reads files, are needed as inputs.
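
As a rough illustration of preparing a parameter file with or without the reads entry, the Python sketch below renders one from a dictionary. It assumes the file holds one key=value pair per line, which the "reads=" wording suggests but this README does not spell out, so check SML.par for the actual layout. The function name render_par and the file names in the usage lines are hypothetical.

```python
# Sketch only: assumes a key=value-per-line parameter file (not confirmed
# by the README; compare with Tests/SML.par before relying on it).

MANDATORY = ("id", "kmer", "fastg", "ms")          # named as mandatory in the README
OPTIONAL = ("reads", "thread", "memory", "fdr")    # named as optional in the README


def render_par(params):
    """Return parameter-file text for the given settings.

    Raises ValueError if a mandatory key is missing or an unknown key is given.
    Optional keys that are absent are simply left out, so omitting "reads"
    yields a file for a graph2pro-only run.
    """
    missing = [key for key in MANDATORY if key not in params]
    if missing:
        raise ValueError("missing mandatory parameters: " + ", ".join(missing))
    unknown = [key for key in params if key not in MANDATORY + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))
    lines = [f"{key}={params[key]}" for key in MANDATORY + OPTIONAL if key in params]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; kmer must match the k-mer size of the fastg.
    settings = {"id": "myjob", "kmer": 99, "fastg": "k99.contigs.fastg", "ms": "myjob.mgf"}
    print(render_par(settings), end="")
```

Leaving "reads" out of the dictionary corresponds to the graph2pro-only case described above; adding it back would let the wrapper continue to var2pep.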
>> Other option: cascaded search
If you want to try cascaded search (only unidentified spectra from the Graph2Pro step go to the Var2Pep based search), simply add a line to the end of your parameter file:
  cascaded=yes

>> Submit the pipeline to queue using qsub
Peptide identification from spectral data is very time consuming. If you have many datasets, you may want to use a computer cluster.
Note: if your computer cluster uses qsub, make sure you specify -v par=par-file-name and -v pgmdir=program-folder to pass this information along to the queue.

>> Summarize the results (see example under Tests folder)
  $path-to-the-wrapper-script/summarize.sh parameter-file
(parameter-file is the same one used by graph2pro-var.sh)
This script reports the number of spectra and unique peptides identified by the different approaches (contig only, graph2pro, graph2pro&var2pep).
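
To make the qsub note above more concrete, here is a small Python sketch that assembles a qsub argument list carrying the par and pgmdir variables. The job script name is hypothetical, since this README does not name the script to submit. The sketch also passes both variables in a single comma-separated -v list, which is a choice made here and not something the README prescribes.

```python
# Sketch only: builds (and optionally runs) a qsub command that passes the
# two variables mentioned in the README. The job script is a placeholder.
import subprocess


def qsub_command(par_file, program_dir, job_script):
    """Return the argv list for submitting job_script with par and pgmdir set."""
    variables = f"par={par_file},pgmdir={program_dir}"
    return ["qsub", "-v", variables, job_script]


def submit(par_file, program_dir, job_script):
    """Submit the job; requires qsub to be available on this machine."""
    return subprocess.run(qsub_command(par_file, program_dir, job_script), check=True)


if __name__ == "__main__":
    # Hypothetical paths, printed without submitting anything.
    print(" ".join(qsub_command("SML.par", "/path/to/graph2pro-var", "my_job_script.sh")))
```

Because the variable list is comma-separated, a path that itself contains a comma would need different handling.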
** I can upload the fastg, fastq, mgf and par files to a data share; I can't share them on GitHub.
We are running on a Linux 5.3.18-150300.59.60-default x86_64 server, and we are submitting jobs using slurm 21.08.8-2 with the following parameters:
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=48:00:00
#SBATCH --mem=500
We are submitting the jobs with the following command: srun ~/software/graph2pro-var/graph2pro-var.sh ~/file.par
GCC version is 10.3.0 (GCC).
When running the graph2pro-var.sh script, we encounter the following error while it is running "Graph2Pep", in the step "read edges from file ...":
edgeseqfile /fs/pool/pool-mann-projects/Peter_Florian_Max_Capscan/fastg/k39_all/k39_all/1249_k39.fastg
514112 edges input
Edge not found: NODE_101623_length_2622_cov_11.6744_ID_203245
This stops the Graph2Pep step from completing, so '1743.1_k39.graph2pep.fasta' is never created. That causes a second error (below), which leaves the rest of the pipeline without input, so the run finishes with incorrect results.
Traceback (most recent call last):
  File "/fs/gpfs41/lv07/fileset03/home/b_mann/maedler/Software/graph2pro-var/pyscript/createFixedReverseKR.py", line 26, in <module>
    inf = open(argv[1], "r")
FileNotFoundError: [Errno 2] No such file or directory: '1743.1_k39.graph2pep.fasta'
output.log
error.log
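
One way to narrow down an "Edge not found" message like the one above is to check whether every edge named in a header's adjacency list also has its own record in the fastg file. The Python sketch below does that under stated assumptions: headers follow the usual FASTG shape (>name:neighbor1,neighbor2; with a trailing ' marking a reverse complement), and an edge and its reverse complement count as the same record. It does not reproduce Graph2Pep's own parsing, and the function names are hypothetical.

```python
# Sketch only: lists edges that are referenced as neighbours in FASTG headers
# but never defined by a header of their own.
import sys


def parse_header(line):
    """Split a FASTG header line into (edge_name, [neighbour_names])."""
    body = line.strip().lstrip(">").rstrip(";")
    name, _, rest = body.partition(":")
    neighbours = [item for item in rest.split(",") if item]
    # Drop the trailing ' so forward and reverse-complement names compare equal.
    return name.rstrip("'"), [item.rstrip("'") for item in neighbours]


def find_missing_edges(lines):
    """Return a sorted list of neighbour names that have no record of their own."""
    defined = set()
    referenced = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        name, neighbours = parse_header(line)
        defined.add(name)
        referenced.update(neighbours)
    return sorted(referenced - defined)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for edge in find_missing_edges(handle):
            print("referenced but not defined:", edge)
```

If the edge reported by the pipeline shows up in this list, the fastg file itself is likely incomplete or was filtered after assembly; if not, the mismatch is more likely in how the names are parsed.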
Hi Sujun,
I tried to run the entire wrapper script on your test data, but got the following error message:
../graph2pro-var.sh: 29: ../graph2pro-var.sh: Syntax error: "(" unexpected
I'm running Ubuntu 14.04 with a BASH shell.
Could you please advise on that?
Sincerely,
Goor
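
The message format above (script name, line number, then Syntax error: "(" unexpected) is what dash prints, and on Ubuntu /bin/sh is dash rather than bash, so the sh in the README's test command does not use bash even when the login shell is bash. One thing to try, offered as an untested suggestion and not a confirmed fix, is to invoke the wrapper with bash explicitly. The Python sketch below shows that idea; the helper names are hypothetical.

```python
# Sketch only: check what /bin/sh resolves to and launch the wrapper with bash.
import os
import subprocess


def resolve_sh(path="/bin/sh"):
    """Return the real file behind /bin/sh (often a path ending in dash on Ubuntu)."""
    return os.path.realpath(path)


def wrapper_command(wrapper, par_file, interpreter="bash"):
    """Return the argv list that runs the wrapper under the given interpreter."""
    return [interpreter, wrapper, par_file]


def run_wrapper(wrapper, par_file, workdir, log_name):
    """Run the wrapper inside workdir, sending stdout and stderr to a log file.

    Relative paths in wrapper and par_file are interpreted relative to workdir.
    """
    with open(os.path.join(workdir, log_name), "w") as log:
        result = subprocess.run(
            wrapper_command(wrapper, par_file),
            cwd=workdir,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode


if __name__ == "__main__":
    print("/bin/sh resolves to:", resolve_sh())
    print(" ".join(wrapper_command("../graph2pro-var.sh", "SML.par")))
```

Calling run_wrapper("../graph2pro-var.sh", "SML.par", "Tests", "SML.log") would mirror the README's test command with bash in place of sh.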
Hi,
I am getting a segmentation fault in the DBGraphPep2Pro step:
graph2pro-var_metaSPAdes.sh: line 172: 109636 Segmentation fault (core dumped) $dbpep2pro -s $gnm -f -p $exp_d.step1.output -o $exp_d.step1.fasta.2 -k $kmer -d 5
Can you please look into the issue? Also, let me know if you need more information.
Thanks,
Praveen