
fdog's Introduction

fDOG - Feature-aware Directed OrtholoG search


How to install

fDOG is distributed as a Python package called fdog. It is compatible with Python ≥ 3.7.

Install the fDOG package

You can install fdog using pip:

python3 -m pip install fdog

or, if you do not have admin rights and do not use an environment manager such as Anaconda, add the --user option:

python3 -m pip install --user fdog

and then add the following line to the end of your ~/.bashrc or ~/.bash_profile file. Restart the terminal (or run source ~/.bashrc) to apply the change:

export PATH=$HOME/.local/bin:$PATH

Setup fDOG

After installing fdog, you need to set it up to download its dependencies and pre-calculated data.

NOTE: If you have not installed greedyFAS, it will be installed automatically during the fDOG setup. However, you then need to run setupFAS after the fDOG setup has finished and before actually using fDOG!

You can set up fDOG by running this command:

fdog.setup -d /output/path/for/fdog/data

The pre-calculated data set of fdog will be saved in /output/path/for/fdog/data. Once the setup has finished successfully, you can start using fdog. Please check first whether you still need to run setupFAS.

You will get a warning if any of the dependencies is not ready to use. Please resolve those issues and rerun fdog.setup.

To debug the setup, please create a log file by running the setup as, e.g., fdog.setup | tee log.txt, and send us that log file so that we can troubleshoot the issues. Most problems can be solved by simply re-running the setup.

Usage

If everything is set up correctly, fdog runs with the provided sample input file infile.fa:

fdog.run --seqFile infile.fa --jobName test --refspec HUMAN@9606@3

The output files with the prefix test will be saved in your current working directory. For an overview of all available options, run

fdog.run -h

Please see our wiki to learn about the input and output files of fdog.

fDOG data set

Within the data package we provide a set of 78 reference taxa. They can be automatically downloaded during the setup. This data comes "ready to use" with the fdog framework. Species data must be present in the three directories listed below:

  • searchTaxa_dir (Contains sub-directories for proteome fasta files for each species)
  • coreTaxa_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
  • annotation_dir (Contains feature annotation files for each proteome)

For each species/taxon there is a sub-directory named according to the schema [Species acronym]@[NCBI ID]@[Proteome version], for example HUMAN@9606@3.
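The naming schema can be checked mechanically. The following is a minimal sketch, assuming that the three fields are separated by `@`, that the NCBI ID is purely numeric, and that the other two fields contain neither `@` nor whitespace. The names `TaxonName` and `parse_taxon_name` are hypothetical and not part of fdog.

```python
import re
from typing import NamedTuple


class TaxonName(NamedTuple):
    acronym: str
    ncbi_id: int
    version: str


# Assumed pattern: <acronym>@<numeric NCBI ID>@<version>
_TAXON_RE = re.compile(r"^([^@\s]+)@(\d+)@([^@\s]+)$")


def parse_taxon_name(name: str) -> TaxonName:
    """Split a taxon directory name into its three fields or raise ValueError."""
    match = _TAXON_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid taxon directory name: {name!r}")
    acronym, ncbi_id, version = match.groups()
    return TaxonName(acronym, int(ncbi_id), version)


print(parse_taxon_name("HUMAN@9606@3"))
```

Such a check can be run over the sub-directory names before starting a search, so that a misnamed directory is reported early.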

fdog is not limited to those 78 taxa. If needed, you can manually add further gene sets (in multi-FASTA format) using the provided functions.

Adding a new gene set into fDOG

For adding one gene set, please use the fdog.addTaxon function:

fdog.addTaxon -f newTaxon.fa -i tax_id [-o /output/directory] [-n abbr_tax_name] [-c] [-v protein_version] [-a]

Here, the first two arguments are required: newTaxon.fa is the gene set to be added, and tax_id is its NCBI taxonomy ID. /output/directory is where the sub-directories (searchTaxa_dir, coreTaxa_dir and annotation_dir) can be found; if not given, the new taxon is added to the directory of the pre-calculated data. The remaining arguments are optional: -n specifies your own taxon name (if not given, an abbreviated name is suggested based on the NCBI taxon name of the input tax_id), -c creates the BLAST DB (only needed if you want to include the new taxon in the list of taxa for compiling the core set), -v sets the genome/proteome version (default: the current date), and -a turns off the annotation step (not recommended).
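To illustrate how the required and optional arguments combine, the sketch below assembles an fdog.addTaxon command line in Python. The helper name `build_add_taxon_cmd` and its keyword names are hypothetical; only the flags themselves are taken from the usage line above.

```python
import shlex
from typing import List, Optional


def build_add_taxon_cmd(fasta: str, tax_id: int,
                        out_dir: Optional[str] = None,
                        name: Optional[str] = None,
                        make_blast_db: bool = False,
                        version: Optional[str] = None,
                        skip_annotation: bool = False) -> List[str]:
    """Return the argument list for fdog.addTaxon (flags as in the usage line)."""
    cmd = ["fdog.addTaxon", "-f", fasta, "-i", str(tax_id)]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if name is not None:
        cmd += ["-n", name]
    if make_blast_db:
        cmd.append("-c")
    if version is not None:
        cmd += ["-v", version]
    if skip_annotation:
        cmd.append("-a")
    return cmd


cmd = build_add_taxon_cmd("newTaxon.fa", 9606, make_blast_db=True)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` without going through a shell.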

Adding a list of gene sets into fDOG

For adding more than one gene set, please use the fdog.addTaxa script:

fdog.addTaxa -i /path/to/newtaxa/fasta -m mapping_file [-o /output/directory] [-c]

Here, /path/to/newtaxa/fasta is a folder containing the FASTA files of all new taxa, and mapping_file is a tab-delimited text file that assigns a taxonomy ID to each FASTA file:

#filename	tax_id	abbr_tax_name	version
filename1.fa	12345678
filename2.faa	9606
filename3.fasta	4932	my_fungi
...

The header line (starting with #) is mandatory. The values of the last two columns (abbreviated taxon name and genome version) are optional. However, if you want to specify a version for a genome, you must also define the abbreviated taxon name, so that the genome version is always in the 4th column of the mapping file.
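The mapping-file rules can be checked before calling fdog.addTaxa. The following is a minimal sketch, assuming a tab-delimited file with two to four columns per row and a numeric taxonomy ID. The function name `read_mapping` is hypothetical, and fdog's own parser may behave differently.

```python
import io
from typing import Dict, List, Optional, TextIO


def read_mapping(handle: TextIO) -> List[Dict[str, Optional[str]]]:
    """Parse a mapping file and enforce the rules described above."""
    lines = handle.read().splitlines()
    if not lines or not lines[0].startswith("#"):
        raise ValueError("mapping file must begin with a header line starting with '#'")
    rows = []
    for line_no, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split("\t")]
        if not 2 <= len(fields) <= 4:
            raise ValueError(f"line {line_no}: expected 2 to 4 tab-separated columns")
        fields += [""] * (4 - len(fields))  # pad the optional columns
        filename, tax_id, name, version = fields
        if not filename or not tax_id.isdigit():
            raise ValueError(f"line {line_no}: need a file name and a numeric tax_id")
        if version and not name:
            raise ValueError(f"line {line_no}: a version requires an abbr_tax_name")
        rows.append({"filename": filename, "tax_id": tax_id,
                     "abbr_tax_name": name or None, "version": version or None})
    return rows


example = ("#filename\ttax_id\tabbr_tax_name\tversion\n"
           "filename1.fa\t12345678\n"
           "filename3.fasta\t4932\tmy_fungi\n")
for row in read_mapping(io.StringIO(example)):
    print(row)
```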

NOTE: After adding new taxa into fdog, you should check for the validity of the new data before running fdog.
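One simple validity check is to confirm that a newly added taxon is present in all three data directories. The sketch below makes two assumptions that are not confirmed here: the annotation file is named after the taxon (any extension), and coreTaxa_dir only needs an entry when the BLAST DB was requested with -c. The function name `missing_taxon_data` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_taxon_data(data_dir: str, taxon: str, need_blast_db: bool = True) -> List[str]:
    """Return a list of problems found for one taxon; an empty list means none."""
    root = Path(data_dir)
    problems = []
    required = ["searchTaxa_dir"] + (["coreTaxa_dir"] if need_blast_db else [])
    for sub in required:
        if not (root / sub / taxon).is_dir():
            problems.append(f"{sub}/{taxon} is missing")
    anno_dir = root / "annotation_dir"
    # Assumption: the annotation file is called <taxon>.<some extension>.
    has_anno = anno_dir.is_dir() and any(p.stem == taxon for p in anno_dir.iterdir())
    if not has_anno:
        problems.append(f"annotation_dir has no file for {taxon}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Only the proteome directory exists, so two problems are reported.
        (Path(tmp) / "searchTaxa_dir" / "HUMAN@9606@3").mkdir(parents=True)
        for problem in missing_taxon_data(tmp, "HUMAN@9606@3"):
            print(problem)
```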

Bugs

Bug reports, comments, and suggestions are highly appreciated. Please open an issue on GitHub or get in touch via email.

How to cite

Ebersberger, I., Strauss, S. & von Haeseler, A. HaMStR: Profile hidden markov model based search for orthologs in ESTs. BMC Evol Biol 9, 157 (2009), doi:10.1186/1471-2148-9-157

Contact

For further support or bug reports please contact: [email protected]

fdog's People

Contributors

mueli94, trvinh

fdog's Issues

select Candidate Orthologous Genes (OGs) from transcript of transcriptomes

Dear Professor:
I found through a literature review that HaMStR can identify orthologous clusters from genomes or transcriptomes (http://www.deep-phylogeny.org/hamstr/) when run with strict parameters (-hmmset = magnoliophyta_hmmer3, -representative, -strict, -eval_limit = 0.00001, and -rbh). Most of those papers used the Perl script (perl hamstrsearch_local-hmmer3.v8.pl). I did not find it in the fDOG package, but I found hamstr.pl at https://github.com/BIONF/HaMStR/blob/master/h1s/bin/hamstr.pl. So I would like to take this opportunity to ask a few questions. Can fDOG do the same work as in this paper (https://doi.org/10.1016/j.xplc.2023.100595), that is, select candidate orthologous genes (OGs) from the transcripts of transcriptomes, and which script does it?

Fail installation (setup stage)

Hello, thank you for the software! However, the installation keeps failing; it always gets stuck at the setup stage. The full installation output is below:

(base) ada@node_b:~/0hiuyan/tools/DB_File-1.827$ fdog.setup -o /home/ada/0hiuyan/tools/fdog --conda | tee log.txt
Current OS system: Linux
Data output path: /home/ada/0hiuyan/tools/fdog
-------------------------------------
Checking .bash_profile/.bashrc, grep, sed/gsed and wget availability...
done!
-------------------------------------
Installing dependencies...
done!
-------------------------------------
Preparing folders...
done!
-------------------------------------
Downloading and installing annotation tools/databases:
--2022-07-15 22:04:26--  ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
           => ‘taxdump.tar.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 165.112.9.230, 130.14.250.10, 2607:f220:41f:250::229, ...
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|165.112.9.230|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/taxonomy ... done.
==> SIZE taxdump.tar.gz ... 58541292
==> PASV ... done.    ==> RETR taxdump.tar.gz ... done.
Length: 58541292 (56M) (unauthoritative)

taxdump.tar.gz           100%[==================================>]  55.83M  2.36MB/s    in 24s

2022-07-15 22:04:55 (2.29 MB/s) - ‘taxdump.tar.gz’ saved [58541292]

citations.dmp
delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
gc.prt
readme.txt
Taxonomy database indexing. It can take a while, please wait...

------------- EXCEPTION -------------
MSG: Failed to load module Bio::DB::Taxonomy::flatfile. Can't load '/home/ada/anaconda3/lib/perl5/5.32/core_perl/auto/DB_File/DB_File.so' for module DB_File: /home/ada/anaconda3/lib/perl5/5.32/core_perl/auto/DB_File/DB_File.so: undefined symbol: db_create at /home/ada/anaconda3/lib/perl5/core_perl/XSLoader.pm line 93.
 at /home/ada/anaconda3/lib/perl5/5.32/core_perl/DB_File.pm line 258.
Compilation failed in require at /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy/flatfile.pm line 88.
BEGIN failed--compilation aborted at /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy/flatfile.pm line 88.
Compilation failed in require at /home/ada/anaconda3/lib/perl5/site_perl/Bio/Root/Root.pm line 520.

STACK Bio::Root::Root::_load_module /home/ada/anaconda3/lib/perl5/site_perl/Bio/Root/Root.pm:522
STACK (eval) /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:295
STACK Bio::DB::Taxonomy::_load_tax_module /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:295
STACK Bio::DB::Taxonomy::new /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:119
STACK toplevel /home/ada/.local/lib/python3.9/site-packages/fdog/setup/indexTaxonomy.pl:6
-------------------------------------

Bio::DB::Taxonomy: flatfile cannot be found
Exception
------------- EXCEPTION -------------
MSG: Failed to load module Bio::DB::Taxonomy::flatfile. Can't load '/home/ada/anaconda3/lib/perl5/5.32/core_perl/auto/DB_File/DB_File.so' for module DB_File: /home/ada/anaconda3/lib/perl5/5.32/core_perl/auto/DB_File/DB_File.so: undefined symbol: db_create at /home/ada/anaconda3/lib/perl5/core_perl/XSLoader.pm line 93.
 at /home/ada/anaconda3/lib/perl5/5.32/core_perl/DB_File.pm line 258.
Compilation failed in require at /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy/flatfile.pm line 88.
BEGIN failed--compilation aborted at /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy/flatfile.pm line 88.
Compilation failed in require at /home/ada/anaconda3/lib/perl5/site_perl/Bio/Root/Root.pm line 520.

STACK Bio::Root::Root::_load_module /home/ada/anaconda3/lib/perl5/site_perl/Bio/Root/Root.pm:522
STACK (eval) /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:295
STACK Bio::DB::Taxonomy::_load_tax_module /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:295
STACK Bio::DB::Taxonomy::new /home/ada/anaconda3/lib/perl5/site_perl/Bio/DB/Taxonomy.pm:119
STACK toplevel /home/ada/.local/lib/python3.9/site-packages/fdog/setup/indexTaxonomy.pl:6
-------------------------------------


For more information about the Bio::DB::Taxonomy system please see
the Bio::DB::Taxonomy docs.  This includes ways of checking for
formats at compile time, not run time.
Can't call method "get_taxon" on an undefined value at /home/ada/.local/lib/python3.9/site-packages/fdog/setup/indexTaxonomy.pl line 12.
Error while indexing NCBI taxonomy database! Please check /home/ada/.local/lib/python3.9/site-packages/fdog/taxonomy/ folder and run this setup again!

There are only three files in the /home/ada/.local/lib/python3.9/site-packages/fdog/taxonomy/ folder:
names.dmp nodes.dmp taxdump.tar.gz

Could you please suggest how to solve this error?
Thank you very much!

Isoforms

Task: Extend fDOG for searching orthologs with all isoforms.

Issue(s):

  • How to distinguish between different isoforms and co-orthologs?

Possible solution(s):

  • Use gff file

Additional:

  • Since gff file is required, synteny info can also be acquired

Convert nine variable assignments to the usage of combined operators

👀 Some source code analysis tools can help to find opportunities for improving software components.
💭 I propose to increase the usage of combined operators accordingly.

diff --git a/fdog/bin/hamstr.pl b/fdog/bin/hamstr.pl
index 3feb01e..c56ecdb 100755
--- a/fdog/bin/hamstr.pl
+++ b/fdog/bin/hamstr.pl
@@ -1150,7 +1150,7 @@ sub checkInput {
 		elsif (-e $blastpathtmp . '_prot' . $blastdbend){
 			## the check for the file naming '_prot' is only to maintain backward compatibility
 			$blastapp = '_prot';
-			$blastpathtmp = $blastpathtmp . $blastapp;
+			$blastpathtmp .= $blastapp;
 			push @log, "\tcheck for $blastpathtmp succeeded";
 			printOUT("succeeded\n");
 		}
diff --git a/fdog/bin/oneSeq.pl b/fdog/bin/oneSeq.pl
index a99e1e6..6a8a61b 100755
--- a/fdog/bin/oneSeq.pl
+++ b/fdog/bin/oneSeq.pl
@@ -586,7 +586,7 @@ if (!$coreex) {
 		print "Added TAXON: $addedTaxon\t$addedTaxonName\n";
 		#if a new core ortholog was found
 		if($addedTaxon ne "") {
-			$hamstrSpecies = $hamstrSpecies . "," . $addedTaxon;
+			$hamstrSpecies .= "," . $addedTaxon;
 
 			clearTmpFiles();
 
@@ -865,7 +865,7 @@ sub getAlnScores{
 		unless ($silent) {
 			print "Cumulative alignmentscore is: $score\n";
 		}
-		$scores{$key} = $scores{$key} / $maxAlnScore;
+		$scores{$key} /= $maxAlnScore;
 		$score = $scores{$key};
 		unless ($silent) {
 			print "Normalised alignmentscore is: $score\n";
@@ -1151,7 +1151,7 @@ sub checkOptions {
 		}
 		my $output = '';
 		for (my $i = 0; $i < @refTaxonlist; $i++) {
-			$output = $output . "[$i]" . "\t" . $refTaxonlist[$i] . "\n";
+			$output .= "[$i]\t" . $refTaxonlist[$i] . "\n";
 		}
 		### for debug?
 		# for (keys %taxa){
@@ -1279,7 +1279,7 @@ sub checkOptions {
 		}
 		print "Your sequence was named: " . $seqName . "\n\n";
 	}
-	$outputPath = $outputPath . "/$seqName";
+	$outputPath .= "/$seqName";
 	if (! -d "$outputPath"){
 		mkdir "$outputPath", 0777  or die "could not create the output directory $outputPath";
 	}
@@ -1526,7 +1526,7 @@ sub fetchSequence {
 		my $line = $_;
 		chomp($line);
 		unless($line =~ /^\>.*/) {
-			$seq = $seq . $line;
+			$seq .= $line;
 		}
 	}
 	close INPUT;
@@ -1719,7 +1719,7 @@ sub cumulativeAlnScore{
 			if($line[0] && ($line[0] eq $shortedId)){
 				if(exists $cumscores{$key}) {
 					$gotScore = 1;
-					$cumscores{$key} = $cumscores{$key} + $line[2];
+					$cumscores{$key} += $line[2];
 				}else{
 					$gotScore = 1;
 					$cumscores{$key} = $line[2];
@@ -1938,7 +1938,7 @@ sub runHamstr {
 	if (! -e $taxaDir) {
 		## backward compatibility. I used to name the dirs with the ending .dir
 		if (-e "$taxaDir.dir"){
-			$taxaDir = $taxaDir . '.dir';
+			$taxaDir .= '.dir';
 		}
 	}
 	$taxaDir =~ s/\s*//g;
diff --git a/fdog/bin/translate.pl b/fdog/bin/translate.pl
index 68ee444..2fe04dd 100755
--- a/fdog/bin/translate.pl
+++ b/fdog/bin/translate.pl
@@ -167,7 +167,7 @@ sub checkIds {
 		$id =~ s/|.*//;
 	    }
 	    elsif ($check == 2) {
-		$id = $id . '_' . $seq_object[$i]->desc;
+		$id .= '_' . $seq_object[$i]->desc;
 		$id =~ s/(.{0,$limit}).*/$1/;
 	    }
 	    if (defined $counter->{$id}) {

central version

  • core_orthologs in output folder (not created in setup and read from pathconfig)
  • files to specify list of core and search taxa

about the seed sequence

Hi, I want to confirm: does the single FASTA input file (seed sequence) have to contain just one sequence?
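Whether fdog accepts more than one sequence per seed file is not answered in this thread. As a quick way to see how many records a FASTA file actually holds, here is a minimal sketch that counts header lines. It assumes every record starts with a `>` line; the function name and the toy sequences are made up for illustration.

```python
import io
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Toy inputs; with a real file, pass the open file handle instead.
single = io.StringIO(">seed1\nMVAVFNKELLSCYL\nITLKLKQTVEAG\n")
multi = io.StringIO(">seq1\nMSPCGG\n>seq2\nVICLGI\n")
print(count_fasta_records(single), count_fasta_records(multi))
```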

Avoid editing `~/.bashrc` without user permission

The setup script currently edits the user’s ~/.bashrc in several places, for example:

fDOG/fdog/setup/setup.sh

Lines 138 to 140 in 7a987e3

if [ -z "$($grepprog PATH=$CURRENT/bin/aligner/bin ~/$bashFile)" ]; then
    echo "export PATH=$CURRENT/bin/aligner/bin:\$PATH" >> ~/$bashFile
fi

This should be avoided. It would be better if the script only printed instructions on how to do it.
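The proposed behaviour can be sketched as follows: instead of appending to the user's shell configuration, the setup only reports the line the user may want to add. This is an illustration in Python rather than a patch to setup.sh; the function name `path_hint` and the example directory are hypothetical.

```python
import os
from typing import Optional


def path_hint(bin_dir: str, current_path: Optional[str] = None) -> Optional[str]:
    """Return the export line to show the user, or None if bin_dir is already on PATH."""
    if current_path is None:
        current_path = os.environ.get("PATH", "")
    if bin_dir in current_path.split(os.pathsep):
        return None
    return f"export PATH={bin_dir}:$PATH"


hint = path_hint("/hypothetical/fdog/bin/aligner/bin")
if hint is not None:
    # Nothing is written to ~/.bashrc; the user decides what to do.
    print("Please add this line to your shell configuration:")
    print("  " + hint)
```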

Adding taxa does not seem to work with fdog.addTaxa

Hello, I am trying to add some genomes from NCBI, for which I have downloaded the amino acid FASTA files into a directory. I would like to use "fdog.addTaxa -i Fasta_files -m Mapping_file.txt -c", but this way it does not read the "_protein.faa" files. I then tried "fdog.addTaxa -i Fasta_files/* -m Mapping_file.txt -c". With this it reads all FASTA files but gives an "unrecognized arguments" error.

Could you please tell me what I am doing wrong? When I tried fdog.addTaxon -f XX_protein.fa -i 9606 -c, it ran and iterated through all sequences in the FASTA file, but also with an error, so I am not sure whether I can use the data created with fdog.addTaxon.

Any help will be highly appreciated.

FileNotFoundError: [Errno 2] No such file or directory: '~/python3.7/lib/python3.7/site-packages/fdog/bin/pathconfig.txt'

Hello,
I got this error when setting up fDOG:

Traceback (most recent call last):
  File "/public/home/HuCH/jxhuang/biosoft/python3.7/bin/fdog.setup", line 8, in <module>
    sys.exit(main())
  File "/public/home/HuCH/jxhuang/biosoft/python3.7/lib/python3.7/site-packages/fdog/setupfDog.py", line 256, in main
    with open(pathconfig_file, 'w') as cf:
FileNotFoundError: [Errno 2] No such file or directory: '/public/home/HuCH/jxhuang/biosoft/python3.7/lib/python3.7/site-packages/fdog/bin/pathconfig.txt'

I also cannot find the "bin" directory in the path "/public/home/HuCH/jxhuang/biosoft/python3.7/lib/python3.7/site-packages/fdog/" (a screenshot of the directory listing was attached).

How to solve this problem?
Thanks.

ERROR: Cannot find seed sequence in genome of reference species for Dracaena_cambodiana_DN10000_c0_g1_i1.p1!

I encountered the following error while running fDOG:
ERROR: Cannot find seed sequence in genome of reference species for Dracaena_cambodiana_DN10000_c0_g1_i1.p1!
This is my command: fdog.run --seqFile Dracaena_cambodiana.pep --jobName test --refspec ARATH@3702@3. Nothing appears in the window when I run this command until, finally, the above error occurs.
This is my seqFile:

Dracaena_cambodiana_DN10000_c0_g1_i1.p1
MVAVFNKELLSCYLITLKLKQTVEAGLAKSQPNSPTPKPPLRLTQGQPPQPPVPRADSEWVISIREKLDEARQEQAACPWAKLSIYRVPKSLREGDSKAYVPQVVSLGPYHRGKHHLRGMDHHKWRALHKVLRRTGHDVELYLDSIKMLEERARACYEGSLSLTSNEFVETMVLDGIFILELFRGAAGEGFKQLGYSHNDPVFAMRATMHSIQRDMIMLENQIPLFVLDRLLALQICKPEQSGLVATLAVQFFDPLIPTDEPLRKIDRNKLQSTSPSLKATAAVFDP
Dracaena_cambodiana_DN10004_c0_g1_i1.p1
VKEEKLSKDSNSKNGAKSGGELKRASKRPDLKAVSKMIENHQVLYPEKMIGHLPGIDVGDQFLSRAEMCVLGLHNHWLNGIDYMGQSFAKLERYKNYTFPLAICIVLSGMYEDDLDNSEDIVYTGQGGHDLLGSKHQVSDQKMERGNLALKNNCDLGVPVRVVRGHELKSSYSGKVYTYDGLYRVIKTWAEKGVSGFTVYKYKLRRVEGQPVLRTNQVQFTRAT
Dracaena_cambodiana_DN10014_c0_g1_i1.p1
VRLLLLVGPAQVQIQQLLLEKLPEHFDAVSPSRNLNDDVARLIVNQFRWLDFLVDSRGFAEKLMEVLSIAPLLLKKEIIGSLPEIIGDQNSSMVIASLERLLQEDSEVIVPVLDSFSNLNLDDQLQEQAVTIALSCIRTVDAEHMPHLLRFLLLSATSVNVGRIISQIREQLKFVGVSDPCVAKQKMLKGKYLADSTEASILDALRSSLRFKNTTCEAIFKELKSLDHPRDHKVIDVWFLMLIYTNGGSLQKNAEKIL
Dracaena_cambodiana_DN10015_c0_g1_i1.p1
MTPNANTLPPPQTPTPNSLIMAGKSPKPSILTDGLLFFGGAAVALLLFLTLYSFLSPLAPSSPSLPFRHSLPLLANPHSPRHHHNSCDSNLHIDPSSPNLYDDPSVSYTLSYSGSLITDWDRKRTNWLHSHPTFAAKRNRILMVSGSQPGPCRNPIGDHLLLRFYKNKADYCRLHDIDLFYNTALLQPDMFSFWAKIPVVRAAMVAHPEAEWIWWVDSDAAITDMEFELPLDRYEGYNLVLHGWRDLVYKAKSWTSLNAGVFLMRNCQWSLDFIDTWSRMGPQSPEYERWGRIQRAIFKDKLYAASDDQTGIAYLLLEEKRKWADKIYLESAFYFEGYWVEIVNKLERIEAEYLEMDRRVKGLRRRRAEKVAVAYGRMREELLEKQGVRRGMEGWRRPFITHFTGCQPCSGDHNKEYSGENCFGGMQRALNFADNQVLRSYGFRHERLVNSTGVRPLPFDFPAAG*
Dracaena_cambodiana_DN10025_c0_g1_i1.p1
MSPCGGVGHEPAVLQLQKWGYLQFQLEPSEFRLASISPTRDLLLLLSYQCEALLLPLLLGKFQSENICEPNSSEQVITCIPDDVDSAQCSKNDEESVKGAALPILESSSASKSYPVISGVKSLAWGHS
Dracaena_cambodiana_DN10026_c0_g1_i1.p1
VICLGIFFRSWPSPRPSFIRDISQQNSSESYGLVHVCALGGEFLCRLAVFAILERLGSQILYSMFASFCLMAAIFLRRNIVETKGKTLQEIEISLRQPE*
Dracaena_cambodiana_DN10027_c0_g1_i1.p4
MSPTDRGGPTSSAVGSLHLLPSELLHSILLRLPLPDLLRLRSLSRLLLSSISAPEFRRTYQSTSPWLFLFQKRPPRSSLLRGFDPRSARWFTFPSLSALIAAPALPPGDDLYLLA
Dracaena_cambodiana_DN10028_c0_g1_i1.p1
MGFTAAELERLISSNARILIVAPAIPRLEFWQAFIGGDNRKDLVSVLTRNRGLITHDIVSGIAPKMLLLKEHGLSQRDIVGLVKRGHGFITRSSKTIEAVLNSAKELGLDTKSPMFGHTLSSLVSFSSDSFKAKMEIFRGFGWSEEELLAAFKKAPSFLHLSEENIREKMEFLVGRAGCKQSYIALNPLLLTFSLEKRLRPRHYVMEVLKSKGIMGRARFSKIMCLTEKKFVESVILPCKEQVPNIHELYIA
Dracaena_cambodiana_DN10030_c0_g1_i1.p1
LFRNPLSLSLLRKFFPKQISNLMGYLSCRAESSISTCRSISAISPSQSPTSTTNTKKKHLKPYKSLQNQENDDDDDDKSPCRRIEHFTYAELESATNNFSNSSLLGRGS

4180 low-copy lineologous candidate nuclear genes from 9 representative angiosperms

Hello, I read an article on angiosperm phylogeny and came across your software. The article used a data set of 4180 core low-copy nuclear genes compiled from 9 angiosperms. Are these 4180 low-copy genes included in the pre-downloaded data, or do I need to extract them from the 9 angiosperms myself using the software?

Convert six assignment statements to augmented source code

👀 Some source code analysis tools can help to find opportunities for improving software components.
💭 I propose to increase the usage of augmented assignment statements accordingly.

diff --git a/fdog/libs/addtaxon.py b/fdog/libs/addtaxon.py
index 8e38620..b25b3b3 100644
--- a/fdog/libs/addtaxon.py
+++ b/fdog/libs/addtaxon.py
@@ -91,7 +91,7 @@ def create_genome(args):
             ### check if id longer than 20 character
             if len(id) > 20:
                 long_id = 1
-                mod_id_index = mod_id_index + 1
+                mod_id_index += 1
                 id = '%s_%s' % (spec_name.split('@')[1], mod_id_index)
                 if not ori_id in id_dict:
                     id_dict[ori_id] = id
@@ -157,7 +157,7 @@ def create_annoFile(outPath, genome_file, cpus, force):
     """ Create annotation json for a given genome_file """
     annoCmd = 'fas.doAnno -i %s -o %s --cpus %s' % (genome_file, outPath+'/annotation_dir', cpus)
     if force:
-        annoCmd = annoCmd + " --force"
+        annoCmd += " --force"
     try:
         subprocess.call([annoCmd], shell = True)
     except:
diff --git a/fdog/libs/alignment.py b/fdog/libs/alignment.py
index 507eaa5..7c1699c 100644
--- a/fdog/libs/alignment.py
+++ b/fdog/libs/alignment.py
@@ -123,5 +123,5 @@ def calc_aln_score(fa1, fa2, aln_strategy = 'local', debugCore = False):
             if gene_id in aln_score:
                 if re.search('\(\s+\d+\)', l):
                     l = re.sub(r'\(\s+','(', l)
-                aln_score[gene_id] = aln_score[gene_id] + int(l.split()[2])
+                aln_score[gene_id] += int(l.split()[2])
     return(aln_score)
diff --git a/fdog/libs/blast.py b/fdog/libs/blast.py
index ff41608..3c8dc5d 100644
--- a/fdog/libs/blast.py
+++ b/fdog/libs/blast.py
@@ -75,7 +75,7 @@ def make_blastdb(args):
     (specName, specFile, outPath, silent) = args
     blastCmd = 'makeblastdb -dbtype prot -in %s -out %s/coreTaxa_dir/%s/%s' % (specFile, outPath, specName, specName)
     if silent == True:
-        blastCmd = blastCmd + '> /dev/null 2>&1'
+        blastCmd += '> /dev/null 2>&1'
     try:
         subprocess.call([blastCmd], shell = True)
     except:
diff --git a/fdog/libs/zzz.py b/fdog/libs/zzz.py
index 219ac76..8bd4f22 100644
--- a/fdog/libs/zzz.py
+++ b/fdog/libs/zzz.py
@@ -87,10 +87,10 @@ def count_line(file, pattern, contain):
         for line in f:
             if contain:
                 if pattern in line:
-                    nline = nline + 1
+                    nline += 1
             else:
                 if not pattern in line:
-                    nline = nline + 1
+                    nline += 1
     return(nline)
 
 

Cannot install FASTA36

Hello,

I'm having problems installing fDOG. See the error message below:

$ fdog.setup -d $HOME/bin/fdog/
*** Creating local NCBI taxonomy database...
*** Installing dependencies...
=> Dependencies in /home/uni08/bheimbu/.local/lib/python3.9/site-packages/fdog/data/dependencies.txt already installed!
=> FASTA36 (https://github.com/wrpearson/fasta36)
Downloading https://github.com/wrpearson/fasta36/archive/refs/tags/v36.3.8h_04-May-2020.tar.gz
...-126156800%, 1 MB, 7856 KB/s, 0 seconds passed ... done!
Compiling fasta36. Please wait...
gcc -g -O -msse2  -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit  -DM10_CONS  -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP  -D_LARGEFILE64_SOURCE  -DBIG_LIB64 -DCOMP_THR -DCOMP_MLIB -c comp_lib9.c -o comp_mthr9.o
In file included from comp_lib9.c:43:0:
/opt/sw/rev/21.12/haswell/oneapi-2022.0.0/intel-20.0.4-bn2ws2/include/limits.h:37:54: error: missing binary operator before token "("
     defined(__has_include_next) && __has_include_next(<limits.h>)
                                                      ^
comp_lib9.c: In function 'main':
comp_lib9.c:913:17: error: 'FLT_MAX' undeclared (first use in this function)
     zbestcut = -FLT_MAX;
                 ^
comp_lib9.c:913:17: note: each undeclared identifier is reported only once for each function it appears in
comp_lib9.c: In function 'init_beststats':
comp_lib9.c:3046:25: error: 'FLT_MIN' undeclared (first use in this function)
   (*best)[0].rst.escore=FLT_MIN; /* for E()-values, lower is best */
                         ^
comp_lib9.c:3047:21: error: 'FLT_MAX' undeclared (first use in this function)
   (*best)[0].zscore=FLT_MAX; /* for Z-scores, bigger is best */
                     ^
make: *** [comp_mthr9.o] Error 1
ERROR: Cannot install FASTA36!

Before that, I ran:

$ fas.setup -t ./ --check
Annotation tools can be found at /usr/users/bheimbu/annotation_tools. FAS is ready to run!
You should test fas.doAnno with this command:
==> fas.doAnno -i test_annofas.fa -o testFas_output <==

Any help is highly appreciated,

Bastian

fdogs.run error messages

Hello!
I tested fdogs.run with some sequences, and it seems to me that some error messages are not displayed quite correctly.
Below are the errors I encountered, to make it easier for you to track them down!

  1. When the output directory does not exist:
    fdogs.run --cpu 12 --input input_dir/ --jobName job1 --refspec ANOGA@7165@3 --outpath ./job1_S1 --fasoff
Traceback (most recent call last):
  File "/home/sh/.local/bin/fdogs.run", line 8, in <module>
    sys.exit(main())
  File "/home/sh/.local/lib/python3.7/site-packages/fdog/runMulti.py", line 415, in main
    multiLog = open(outpath + '/' + jobName + '_log.txt', "w")
FileNotFoundError: [Errno 2] No such file or directory: '/data/fs/hpc/cluws/sh-worms/test/job1_S1/job1_log.txt'
  2. When the FAS program is not found:
    fdogs.run --cpu 12 --input input_dir/ --jobName job1 --refspec ANOGA@7165@3 --outpath ./2
PID 189193
Starting compiling core orthologs...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 20.88it/s]
==> Core compiling finished in 0.363 sec
Searching orthologs for...
seq1.fa
test_ref.fa
job1_S1_DN10018_c0_g1_i21.fa
seq2.fa.fa
==> Ortholog search finished in 4.044 sec
Joining single outputs...
Traceback (most recent call last):
  File "/home/sh/.local/bin/fdogs.run", line 8, in <module>
    sys.exit(main())
  File "/home/sh/.local/lib/python3.7/site-packages/fdog/runMulti.py", line 444, in main
    finalFa = joinOutputs(outpath, jobName, seeds, keep)
  File "/home/sh/.local/lib/python3.7/site-packages/fdog/runMulti.py", line 158, in joinOutputs
    os.remove(outpath + '/' + seqName + '.fa')
FileNotFoundError: [Errno 2] No such file or directory: '/data/fs/hpc/cluws/sh-worms/test/2/seq1.fa'

The beginning of this log looks a little odd: no search actually takes place and no files are created, yet the output reads as if all of this had been done.

  3. When fdogs.run cannot find orthologs for at least one sequence from the input (or if --refspec is chosen incorrectly):
    fdogs.run --cpu 12 --input input_dir/ --jobName job1 --refspec ARATH@3702@3 --outpath ./3 --fasoff
PID 154327
Starting compiling core orthologs...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:49<00:00, 27.30s/it]
==> Core compiling finished in 109.272 sec
Searching orthologs for...
seq1.fa
test_ref.fa
job1_S1_DN10018_c0_g1_i21.fa
seq2.fa.fa
==> Ortholog search finished in 35.291 sec
Joining single outputs...
Traceback (most recent call last):
  File "/home/sh/.local/bin/fdogs.run", line 8, in <module>
    sys.exit(main())
  File "/home/sh/.local/lib/python3.7/site-packages/fdog/runMulti.py", line 444, in main
    finalFa = joinOutputs(outpath, jobName, seeds, keep)
  File "/home/sh/.local/lib/python3.7/site-packages/fdog/runMulti.py", line 158, in joinOutputs
    os.remove(outpath + '/' + seqName + '.fa')
FileNotFoundError: [Errno 2] No such file or directory: '/data/fs/hpc/cluws/sh-worms/test/3/test_ref.fa'

In this case the program simply crashes, and the *extended.fa file is empty. When the run takes 5 minutes (processing a couple of sequences), this is not as unpleasant as when the program has worked for 2 days on a large number of sequences...
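Both kinds of traceback above come from file operations on paths that do not exist. A defensive pattern for such cases is sketched below. It is not the actual code of runMulti.py, and the helper names `open_job_log` and `remove_if_present` are hypothetical.

```python
import os
import tempfile


def open_job_log(outpath, job_name):
    """Create outpath if needed, then open <job_name>_log.txt for writing."""
    os.makedirs(outpath, exist_ok=True)
    return open(os.path.join(outpath, job_name + "_log.txt"), "w")


def remove_if_present(path):
    """Delete path; return False instead of raising when it does not exist."""
    try:
        os.remove(path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outpath = os.path.join(tmp, "job1_S1")  # does not exist yet
        with open_job_log(outpath, "job1") as log:
            log.write("started\n")
        # Report seeds without an output file instead of crashing on them.
        missing = [name for name in ("seq1", "test_ref")
                   if not remove_if_present(os.path.join(outpath, name + ".fa"))]
        print("no output found for:", missing)
```

With this pattern a seed without orthologs would be reported at the end of the run, and the results for the remaining seeds could still be joined.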

I hope my message will be useful to you!

About fDOG running issues

I had the following problem running fDOG:
(hamstr) sxq020@masterv2:~/fDOG/output39_1$ fdog.run --seqFile /media/ym/desk16/sxq020/fDOG/output39_1/431143.fa --jobName A_test --refspec ORYSA@4530@230330 --corepath /media/ym/desk16/sxq020/fDOG/output39_1/A39_1_39525/coreTaxa_dir --searchpath /media/ym/desk16/sxq020/fDOG/output39_1/A39_1_39525/searchTaxa_dir --annopath /media/ym/desk16/sxq020/fDOG/output39_1/A39_1_39525/annotation_dir

Identified seed ID: 4530_1
Compiling core set for A_test
Traceback (most recent call last):
  File "/media/ym/desk16/sxq020/.local/bin/fdog.run", line 8, in <module>
    sys.exit(main())
  File "/media/ym/desk16/sxq020/.local/lib/python3.8/site-packages/fdog/runSingle.py", line 225, in main
    core_runtime = core_fn.run_compile_core([seqFile, seqName, refspec, seed_id, reuseCore,
  File "/media/ym/desk16/sxq020/.local/lib/python3.8/site-packages/fdog/libs/corecompile.py", line 417, in run_compile_core
    compile_core([seqFile, seqName, refspec, seed_id, coreArgs, pathArgs,
  File "/media/ym/desk16/sxq020/.local/lib/python3.8/site-packages/fdog/libs/corecompile.py", line 178, in compile_core
    tree = ncbi.get_topology(tax_ids.keys(), intermediate_nodes = True)
  File "/media/ym/desk16/sxq020/.local/lib/python3.8/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 442, in get_topology
    lineage = id2lineage[sp]
KeyError: 39525383

Since the species I added has no taxonomy ID in NCBI, I assigned it the number 39525383.
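The KeyError is raised when the lineage of the self-assigned ID is looked up, which suggests that the ID is missing from the local NCBI taxonomy. A pre-check along the following lines would report such IDs before the tree is built. This is a minimal sketch in which a plain dictionary stands in for the taxonomy's ID-to-lineage table; the function name `split_by_lineage` and the toy lineages are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_lineage(tax_ids: Iterable[int],
                     id2lineage: Dict[int, List[int]]) -> Tuple[List[int], List[int]]:
    """Separate tax IDs that have a lineage entry from those that do not."""
    known, unknown = [], []
    for tax_id in tax_ids:
        (known if tax_id in id2lineage else unknown).append(tax_id)
    return known, unknown


# Toy lineage table; real lineages are much longer.
id2lineage = {9606: [1, 9606], 4530: [1, 4530]}
known, unknown = split_by_lineage([9606, 4530, 39525383], id2lineage)
if unknown:
    print("tax IDs without an NCBI lineage:", unknown)
```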

question for understanding the input files for HaMStR, --seqFile and --refspec

Hello, I understand that you have developed HaMStR, which can identify orthologous genes in any protein-coding sequences.

Firstly, I have downloaded hundreds of high-quality genomes and identified orthologous genes among them. Additionally, I have some relatively low-quality transcriptomes. My aim is to utilize HaMStR to extract low-copy genes from both the transcriptomes and genomes. Despite reading your articles and the fdog.run -h usage information, I still have a few questions:

  1. If I already have identified orthologous genes from hundreds of high-quality genomes, should these identified orthologous genes be used as --seqFile infile.fa, and should the transcriptome protein database be used as --refspec?

  2. Does infile.fa support multiple sequences?

  3. In the help document, the description for --seqFile is "Input file containing the seed sequence (protein only) in fasta format." Is the "seed sequence" here obtained after processing the orthologous genes, or is it just the sequences of each OG in different species? How should we understand the "seed sequence"?
