
starfish's Introduction


starfish is a modular toolkit for giant mobile element annotation. Built primarily for annotating Starship elements in fungal genomes, it can be easily adapted to find any large mobile element (≥6 kb) that shares the same basic architecture as a fungal Starship or a bacterial integrative and conjugative element: a "captain" gene with zero or more "cargo" genes downstream of its 3' end. It is particularly well suited for annotating low-copy-number elements in a content-independent manner.

Overview

The starfish workflow is organized into three main modules: Gene Finder, Element Finder, and Region Finder. Each has dedicated commands that are typically run sequentially. Auxiliary commands that provide additional utilities and generate visualizations are also available through the command line. Several useful stand-alone scripts are located in the /scripts directory.

Documentation

Head to our GitHub Wiki for useful resources, including installation instructions, a manual with important details and considerations for each command, and a step-by-step tutorial. If you run into difficulties, please open an issue on GitHub.

Citations and dependencies

Many starfish commands have dependencies that are stand-alone programs in their own right. If you use starfish in your research, please contact us as it has not yet been published.

Please cite the starfish manuscript in addition to any dependencies you may have used (see the table below for a guide). For example:

We used starfish v1.0.0 (Gluck-Thaler and Vogan 2023) in conjunction with metaeuk (Karin et al. 2020), mummer4 (Marcais et al. 2018), and blastn (Camacho et al. 2009) to annotate and visualize giant mobile elements.

| Command | Dependency | Citation |
|---|---|---|
| annotate, augment | metaeuk, hmmer, bedtools | Karin et al. 2020, Eddy 2011, Quinlan and Hall 2010 |
| insert, extend | blastn, mummer4 | Camacho et al. 2009, Marcais et al. 2018 |
| flank | cnef | Ayad et al. 2018 |
| sim | sourmash | Pierce et al. 2019 |
| group | mcl | Enright et al. 2002 |
| *-viz | circos, gggenomes, mummer4, mafft, minimap2 | Krzywinski et al. 2009, Hackl and Ankenbrand 2022, Marcais et al. 2018, Katoh and Standley, Li 2018 |

License and acknowledgements

Please cite our work if you use starfish in your research:

Gluck-Thaler, E., & Vogan, A. A. (2023). Systematic identification of cargo-carrying genetic elements reveals new dimensions of eukaryotic diversity. bioRxiv, 2023-10.

starfish is an open source tool available under the GNU Affero General Public License version 3.0 or greater. This work was supported by funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement (grant number 890630).


starfish's Issues

Typo in step-by-step tutorial

I believe that the annotate command in the step-by-step tutorial has a typo:

starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g macpha6.gff3 -p ../database/YRsuperfams.p1-512.hmm -P ../database/YRsuperfamRefs.faa -i tyr -o geneFinder/

Where the -g parameter should be ome2gff.txt instead of macpha6.gff3.

Thanks for the fantastic programme!
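
For context, here is a minimal sketch of how a mapping file such as ome2gff.txt could be generated. It assumes, since the issue does not say so, that -g expects a two-column, tab-separated file of genome code and GFF path, as the file name suggests. The function name make_ome2path and the directory name in the usage comment are hypothetical and not part of starfish.

```sh
# Print "<genome code><TAB><absolute path>" for every file in a directory
# that ends with the given suffix. The genome code is assumed to be the
# file name with the suffix removed.
make_ome2path() {
    dir=$1
    suffix=$2
    for f in "$dir"/*"$suffix"; do
        [ -e "$f" ] || continue    # nothing matched: the glob stays literal
        code=$(basename "$f" "$suffix")
        abs_dir=$(cd "$(dirname "$f")" && pwd)
        printf '%s\t%s\n' "$code" "$abs_dir/$(basename "$f")"
    done
}

# Example (hypothetical directory layout):
#   make_ome2path gff3 .gff3 > ome2gff.txt
```

With a file like this in place, the -g argument would point at the mapping file rather than at a single GFF, which is the correction proposed above.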

typo in install instructions

From: https://github.com/egluckthaler/starfish/wiki/Installation

when replacing the cnef.cc file, your instructions say:
cp ../scripts/cneff.cc .

but it should say:
cp ../scripts/cnef.cc .

I guess this is a bit of a dangerous error, and might pop up due to other install errors. Since the original file and the new file have the same names, it is hard to distinguish between them.

I guess an option would be to rename your cnef.cc to cnef_starfish.cc

and then add a sed command for the Makefile? Something like this?
sed -i "s/cnef.cc/cnef_starfish.cc/" Makefile
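
A rough sketch of the rename-and-verify idea, written as a small shell function. It assumes the Makefile names cnef.cc explicitly; the function name install_renamed_source is hypothetical, and cnef_starfish.cc is only the name proposed in this issue. Unlike the one-liner above, it escapes the dot in the file name and avoids sed -i, whose syntax differs between GNU and BSD sed.

```sh
# Copy a replacement source file under a distinct name, confirm the copy is
# byte-identical to the source, then point the Makefile at the new name.
install_renamed_source() {
    src=$1        # e.g. ../scripts/cnef.cc
    new_name=$2   # e.g. cnef_starfish.cc (name suggested in this issue)
    makefile=$3   # e.g. Makefile
    old_name=$(basename "$src")

    cp "$src" "$new_name" || return 1
    cmp -s "$src" "$new_name" || { echo "copy of $src differs" >&2; return 1; }

    # Escape dots so that "cnef.cc" matches only the literal file name.
    pattern=$(printf '%s' "$old_name" | sed 's/\./\\./g')
    sed "s/$pattern/$new_name/g" "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile"
}

# Example, run from the cnef source directory:
#   install_renamed_source ../scripts/cnef.cc cnef_starfish.cc Makefile
```

Because the original cnef.cc is left untouched, the two files can no longer be confused, and a mistyped source path makes cp fail instead of silently compiling the unpatched file.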

Trying to run on my own samples

Hello Emile and group! Excited to try your program on my Histoplasma datasets but I have hit a couple of snags, some I've solved but currently stuck on this one.

I have been following your tutorial for your test dataset and the portion I have been stuck on is your eggnog annotation sorting.
My genomes have been annotated using Funannotate, but I also attempted an eggnog-mapper run and my output does not look like yours.

Your step that cuts your emapper into a text file for later steps is giving me an issue:

cut -f1,10 ann/*emapper.annotations | grep -v '#' | perl -pe 's/^([^\s]+?)\t([^\|]+).+$/\1\t\2/' > ann/macph6.gene2og.txt

The -f1,10 cut expects the "bestOG|evalue|score" column, from which it extracts the first part ("bestOG") to create the text file. Is there maybe a flag I am missing to get this specific emapper output?

Funannotate:
GeneID TranscriptID Feature Contig Start Stop Strand Name Product Alias/Synonyms EC_number BUSCO PFAM InterPro EggNog COG GO Terms Secreted Membrane Protease CAZyme Notes gDNA mRNA CDS-transcript Translation
19VMG-15_000001 19VMG-15_000001-T1 mRNA scaffold_1 63793 65897 + hypothetical protein 2.1.1.310 EOG091P0ENB PF01189;PF17125 IPR001678 SAM-dependent methyltransferase RsmB-F/NOP2-type domain;IPR011023 Nop2p;IPR018314 Bacterial Fmu (Sun)/eukaryotic nucleolar NOL1/Nop2p, conserved site;IPR023267 RNA (C5-cytosine) methyltransferase;IPR023273 RNA (C5-cytosine) methyltransferase, NOP2;IPR029063 S-adenosyl-L-methionine-dependent methyltransferase superfamily;IPR031341 Ribosomal RNA small subunit methyltransferase F, N-terminal ENOG503NUZ7 L:(L) Replication, recombination and repair GO_component: GO:0005730 - nucleolus [Evidence IEA];GO_function: GO:0008757 - S-adenosylmethionine-dependent methyltransferase activity [Evidence IEA];GO_function: GO:0008168 - methyltransferase activity [Evidence IEA];GO_function: GO:0003723 - RNA binding [Evidence IEA];GO_function: GO:0009383 - rRNA (cytosine-C5-)-methyltransferase activity [Evidence IEA];GO_process: GO:0006396 - RNA processing [Evidence IEA];GO_process: GO:0001510 - RNA methylation [Evidence IEA];GO_process: GO:0070475 - rRNA base methylation [Evidence IEA];GO_process: GO:0000470 - maturation of LSU-rRNA [Evidence IEA]

My eggnog-mapper:
#query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs
01_16_1 5037.XP_001538943.1 0 3312 COG1020@1|root,KOG1178@2759|Eukaryota,38V1T@33154|Opisthokonta,3Q08R@4751|Fungi,3R2AF@4890|Ascomycota,20DTP@147545|Eurotiomycetes,3AZS6@33183|Onygenales 4751|Fungi Q Condensation domain - - - ko:K22152 - - - ko00000 - - - AMP-binding,Condensation,PP-binding

Your eggnog-mapper:
#query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score predicted_gene_name GO_terms KEGG_pathways Annotation_tax_scope OGs bestOG|evalue|score COG cat eggNOG annot
mp040_13792 441959.XP_002479979.1 5.00E-71 232.2 FG02084.1 GO:0005575,GO:0005623,GO:0005886,GO:0006810,GO:0008150,GO:0015886,GO:0016020,GO:0016021,GO:0031224,GO:0044425,GO:0044464,GO:0044699,GO:0044765,GO:0051179,GO:0051181,GO:0051234,GO:0071702,GO:0071705,GO:0071944,GO:1901678 fuNOG[21] 03RFZ@ascNOG,0IWNB@euNOG,0M5FN@euroNOG,0MQ5B@eurotNOG,0PN5J@fuNOG,11Q0B@NOG,13QP8@opiNOG 0PN5J|7.70053432588e-86|288.860290527 S RTA1 domain protein

Any guidance would be appreciated!

-Tania
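
One possible way to build a gene-to-orthogroup table from the newer eggnog-mapper layout shown above, offered as a sketch rather than the tutorial's method. The newer header has no "bestOG|evalue|score" column, so this example looks up the #query and eggNOG_OGs columns by name and keeps the orthogroup tagged with a chosen taxonomic level. The function name emapper_to_gene2og is hypothetical, and picking one level (here "Fungi") as a stand-in for bestOG is an assumption that the maintainers may or may not agree with.

```sh
# Print "<gene><TAB><orthogroup>" for each annotated gene, keeping the
# orthogroup whose taxonomic level name equals the second argument.
emapper_to_gene2og() {
    annotations=$1   # tab-separated emapper annotations file
    level=$2         # taxonomic level name, e.g. Fungi (an assumed choice)
    awk -F'\t' -v level="$level" '
        /^#query/ {                          # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        /^#/ || !("eggNOG_OGs" in col) { next }   # skip comments and other layouts
        {
            n = split($(col["eggNOG_OGs"]), ogs, ",")
            for (j = 1; j <= n; j++) {
                # each entry looks like 3Q08R@4751|Fungi
                split(ogs[j], parts, /[@|]/)
                if (parts[3] == level) {
                    print $(col["#query"]) "\t" parts[1]
                    break
                }
            }
        }
    ' "$annotations"
}

# Example (hypothetical file names):
#   emapper_to_gene2og ann/sample.emapper.annotations Fungi > ann/sample.gene2og.txt
```

Genes with no orthogroup at the requested level are left out of the output, so it may be worth comparing the line count with the number of annotated genes.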

sh: 1: Syntax error: "(" unexpected

Hey, I don't know what is wrong...

(starfish) marcin@marcin-:~/starfish/test$ starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g ome2gff.txt -p ../db/YRsuperfams.p1-512.hmm -P ../db/YRsuperfamRefs.faa -i tyr -o geneFinder/
[Thu Apr 25 21:33:47 2024] executing command: starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g ome2gff.txt -p ../db/YRsuperfams.p1-512.hmm -P ../db/YRsuperfamRefs.faa -i tyr -o geneFinder/
Key parameters:
metaeuk        -v 3 -s 7.5 -e 0.0001 --max-intron 2000 --max-seqs 300 --metaeuk-eval 0.001 --exhaustive-search 1 --metaeuk-tcov 0.25 --allow-deletion 1 --protein 1 --disk-space-limit 100G --compressed 1
hmmsearch      --max -E 0.001

[Thu Apr 25 21:33:47 2024] geneFinder//macpha6_tyr.fas exists, skipping metaeuk annotation
[Thu Apr 25 21:33:47 2024] geneFinder//macpha6_tyr.hmmout exists, skipping HMM annotation of metaeuk results
[Thu Apr 25 21:33:47 2024] filtering metaeuk annotations based on hmmsearch results..
[Thu Apr 25 21:33:47 2024] lifting over names of overlapping feature from GFFs in ome2gff.txt to metaeuk annotations..
[Thu Apr 25 21:33:47 2024] checking formatting of GFFs in ome2gff.txt..
sh: 1: Syntax error: "(" unexpected

cheers
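
The thread does not establish a cause for this error. One thing that may be worth ruling out, offered purely as an assumption, is a path in ome2gff.txt that contains a character such as "(" or a space, which an unquoted command passed to sh would trip over. The sketch below lists such paths; the helper name list_risky_paths is hypothetical, and it assumes the mapping file is tab-separated with the path in the second column.

```sh
# Print "<line number>: <path>" for every line of a two-column,
# tab-separated mapping file whose path contains anything other than
# letters, digits, "/", ".", "_" or "-".
list_risky_paths() {
    awk -F'\t' '$2 ~ "[^A-Za-z0-9/._-]" { print NR ": " $2 }' "$1"
}

# Example:
#   list_risky_paths ome2gff.txt
#   list_risky_paths ome2assembly.txt
```

If nothing is reported, another possibility to check (again an assumption) is whether /bin/sh on the system is a shell other than bash, since some bash-only syntax also produces this message under sh.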
