sionbayliss / pirate

A toolbox for pangenome analysis and threshold evaluation.

License: GNU General Public License v3.0

Perl 96.38% R 3.62%

pirate's Issues

Question: Which type of homology is used with a mix of feature types?

Hi,

When running PIRATE with several features (e.g. CDS,rRNA,tRNA), some of which require nucleotide homology to be used (i.e. rRNA,tRNA), does PIRATE automatically resort to using nucleotide sequence homology for all features? Or just the ones requiring it?

I.e. would CDS be analyzed using nucleotide or a.a. homology in this case? If nucleotide (which I suspect from an example on the main page), would it be difficult to implement an option to use a.a. for CDS but nucleotide for non-coding features?

Any response is greatly appreciated!!
Thank you,
Conrad

PIRATE create unexpected output when using -f "tRNA,rRNA"

Hi,
I ran the following two commands; both created the same number of output files.
PIRATE -i ./311gff/ -o ./311Agro_panOut_RNAcalled -f "tRNA,rRNA" -a -r -k "-f 6" -t 40
PIRATE -i ./311gff/ -o ./311Agro_panOut -a -r -k "-f 6" -t 40
The only difference is that the second command omits -f "tRNA,rRNA". Both commands create the same number of output files, and neither prints an error message.
However, the contents of the output files from the first command look odd, and it ran in less than 10 minutes, while the second command ran for around 9 hours.
Here is one comparison of PIRATE.pangenome_summary.txt from both commands:
First command:

215 gene families in 311 genomes.

19 contain greater than one allele at the thresholds analysed.

0 contain fission/fusion events.

0 contain duplication/loss.

%isolates #clusters >1 allele fission/fusion multicopy
0-10% 160 4 0 0
10-25% 5 1 0 0
25-50% 6 0 0 0
50-75% 5 0 0 0
75-90% 1 0 0 0
90-95% 1 1 0 0
95-100% 37 13 0 0

Second command output:

52233 gene families in 311 genomes.

3985 contain greater than one allele at the thresholds analysed.

2341 contain fission/fusion events.

886 contain duplication/loss.

%isolates #clusters >1 allele fission/fusion multicopy
0-10% 42290 1859 650 167
10-25% 3061 790 294 144
25-50% 3207 610 408 163
50-75% 1101 293 286 124
75-90% 503 162 158 100
90-95% 104 25 39 23
95-100% 1967 246 506 165

Can anyone help explain what may go wrong when adding -f "tRNA,rRNA"? Although it doesn't produce an error message, it doesn't make sense that it finishes in less than 10 minutes while the second command ran for 9 hours.

Thanks.
Limin

gene_families and pangenome gff file do not match

Hi,

I am looking at my gene_families.tsv file and I have ~11,000 entries in there but when I look at the annotation file (pangenome_alignment.gff) there are only ~7500 and I can't seem to understand why there are a lot of gene families missing?

Thanks in advance,
Marcela

PIRATE -c doesn't work (but --check does)

Conflict with another -c ?

Pan genome from genomes of contigs

I am trying to construct a pan genome using genomes that consist only of contigs. However, pirate gives me the following error

- running mcl on pan_sequences at 50    
 - 0 clusters at 50 % - completed in 0 secs
 - running mcl on pan_sequences at 60    
 - ERROR: pangenome_construction.pl failed - error logged at /path/to/output/folder/fail_test.txt

fail_text.txt:

BLAST options error: File /path/to/output/folder/pangenome_iterations/pan_sequences.representative.fasta is empty
 - ERROR: no clusters in /path/to/output/folder/pangenome_iterations/pan_sequences.mcl_50.clusters

When I add a complete genome to the pool of genomes from which the pan genome is constructed, everything works fine. Is this a feature or a bug, and do you have an explanation for why this arises?

Extract shared genes of a set of genomes compare with the others

I have run PIRATE and the execution was really fast and went fine. My question now is whether, using some of the scripts, I can extract the genes shared by a set of genomes that are not present in the rest of the genomes. Is that possible? Thank you!
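One way to approach this outside PIRATE is to filter the gene family table directly. The sketch below is not one of PIRATE's scripts: `shared_only` is a hypothetical helper, and it assumes a tab-separated table with a header row, one column per genome, an empty cell wherever a family is absent from a genome, and the family identifier in the first column. The index of the first genome column is passed in by hand and should be checked against the header of your own file.

```shell
# shared_only TABLE FIRST_GENOME_COL GENOME[,GENOME...]
# Prints the first column of every row that is present in ALL listed genomes
# and absent from EVERY other genome column.
# Assumptions (not taken from PIRATE documentation): tab-separated input,
# header row holding genome names, empty cell means "absent".
shared_only() {
    awk -F'\t' -v first="$2" -v targets="$3" '
        NR == 1 {
            # Remember the table width and mark which columns are targets.
            ncol = NF
            n = split(targets, names, ",")
            for (i = 1; i <= n; i++) want[names[i]] = 1
            for (c = first; c <= ncol; c++) is_target[c] = (($c) in want) ? 1 : 0
            next
        }
        {
            # Keep the row only if presence matches the target set exactly.
            keep = 1
            for (c = first; c <= ncol; c++) {
                present = ($c != "") ? 1 : 0
                if (present != is_target[c]) { keep = 0; break }
            }
            if (keep) print $1
        }
    ' "$1"
}

# Hypothetical usage; the column index 20 and genome names are placeholders:
#   shared_only PIRATE.gene_families.tsv 20 genomeA,genomeB
```

Looking columns up by header name keeps the filter independent of genome order; only the position of the first genome column has to be supplied.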

bioconda recipe in progress

See bioconda/bioconda-recipes#17798

pangenome_alignment.fasta file explanation

Hello,

I just wanted to clarify what the pangenome alignment file contains. From my understanding, it contains all genes in the pangenome, concatenated together, with a sequence for each sample in the PIRATE run. Genes are in the same order as in the ordered tsv file. Is this correct?

If so, are the genes concatenated simply end-to-end, or is some spacer used? And how are missing genes handled? "N"s or "-"? I understand that dosage > 1 genes are replaced by "?", but does this refer to all copies of the gene, or subsequent (after the first, however that is defined)?

Thank you!
Conrad

--help should not return error

$ PIRATE --help
$ echo $?
1

should be 0 as I asked for help and got it. there was no error.
same for --version in #11

PIRATE_to_roary.pl: command not found

I am trying to convert the output files, but it is not working. Any help?

Cannot install

Hi,

I am trying to install the package but apparently Conda cannot find it. Would you please have a look at this issue and let me know how to install the pirate package? I am eager to try it out.

Cheers,
Pablo

conda installation issue

Hello,
I am trying to install pirate via conda. I have activated all the mentioned conda channels. But it provides the following error or issue:


> `root@honey-pc:/home/furqan# conda install pirate
> Collecting package metadata (current_repodata.json): done
> Solving environment: failed with initial frozen solve. Retrying with flexible solve.
> Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
> Collecting package metadata (repodata.json): done
> Solving environment: failed with initial frozen solve. Retrying with flexible solve.
> Solving environment: - 
> Found conflicts! Looking for incompatible packages.
> This can take several minutes.  Press CTRL-C to abort.
> failed                                                                          
> 
> UnsatisfiableError: The following specifications were found to be incompatible with each other:
> 
> Output in format: Requested package -> Available versions
> 
> Package libstdcxx-ng conflicts for:
> python=3.8 -> libstdcxx-ng[version='>=7.3.0|>=7.5.0']
> python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0'] -> libstdcxx-ng[version='>=4.9|>=7.2.0']
> 
> Package zlib conflicts for:
> pirate -> blast[version='>=2.2.31'] -> zlib[version='1.2.11.*|>=1.2.11,<1.3.0a0|1.2.8.*']
> python=3.8 -> zlib[version='>=1.2.11,<1.3.0a0']
> 
> Package libgcc-ng conflicts for:
> python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0'] -> libgcc-ng[version='>=4.9|>=7.2.0']
> python=3.8 -> libgcc-ng[version='>=7.3.0|>=7.5.0']`

Maybe it's a small issue, but I couldn't find any solution to this.

Regards
Furqan

PIRATE --check wants to write to its own folder?

We installed PIRATE and I want a user to be able to run the check, but it fails.
It seems to want to write to its own folder?
Can you use $outdir = File::Temp->tempdir(CLEANUP=>1) instead?

PIRATE --check

Running PIRATE on test files:
sh: /home/software/PIRATE/test/PIRATE/PIRATE.log: Permission denied
 - ERROR: PIRATE did not run correctly:
 - ERROR: PIRATE was not able to classify paralogs
 - WARNING: PIRATE could not make R plots (are dependencies installed)

 - tests completed
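The same idea as the File::Temp suggestion can be sketched in shell: create a private scratch directory, point the run at it, and clean up afterwards. `run_in_tmpdir` is a hypothetical helper, not part of PIRATE, and it assumes `mktemp -d` is available on the system.

```shell
# run_in_tmpdir CMD [ARGS...]
# Creates a private scratch directory, runs CMD with that directory appended
# as its last argument, removes the directory, and returns the status of CMD.
run_in_tmpdir() {
    _tmp=$(mktemp -d) || return 1
    _status=0
    "$@" "$_tmp" || _status=$?
    rm -rf "$_tmp"
    return "$_status"
}

# Hypothetical usage; -i and -o are the input/output flags seen in the
# commands quoted in these issues, and ./test_gff3 is a placeholder:
#   run_in_tmpdir PIRATE -i ./test_gff3 -o
```

Because the scratch directory lives under the user's temporary area, the check no longer needs write access to the installation folder.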

Use FindBin to locate true location of script

FindBin is the standard way to find where you are and resolve all symlinks. It's a core module.
I think there can be issues with dirname($0) with symlinks.

use FindBin;
use Cwd 'abs_path';
my $script_path = abs_path($FindBin::RealBin);  # from a script in scripts/

my $script_path = abs_path("$FindBin::RealBin/../scripts");  # from PIRATE

use File::Basename;
my $exe_name = basename("$FindBin::RealScript");  #

grep dirname -r . | grep script_path

./bin/PIRATE:my $script_path = abs_path(dirname($0));
./scripts/classify_paralogs.pl:my $script_path = abs_path(dirname($0));
./scripts/split_paralogs_runner.pl:my $script_path = abs_path(dirname($0));
./scripts/run_PIRATE.pl:my $script_path = abs_path(dirname($0));
./scripts/align_feature_sequences.pl:my $script_path = abs_path(dirname($0));
./scripts/create_pangenome_alignment.pl:my $script_path = abs_path(dirname($0));
./scripts/link_clusters_runner.pl:my $script_path = abs_path(dirname($0));
./scripts/pangenome_construction.pl:my $script_path = abs_path(dirname($0));
./tools/treeWAS/pangenome_variants_to_treeWAS.pl:my $script_path = abs_path(dirname($0));
./tools/subsetting/select_representative:#my $script_path = abs_path(dirname($0));

Add --version

eg.

% PIRATE --version
PIRATE 0.4.9

This should match the release version in #10

Representative pan genes FASTA ?

Sion,

I was hoping to get a FASTA file out with one representative sequence per cluster, but I can't seem to see one in the output or the Wiki output files.

i.e. if there were 3000 clusters, a FASTA file with the "best/longest" rep for each cluster

Ideally a .ffn (DNA) and .faa (AA) version.

Also, a pan, and a core (only clusters with all genomes in it) version

Have I missed something?

select_representative does not have .pl ending - odd one out

select_representative => select_representative.pl ?

In fact, I don't think it is even used anywhere?
git rm scripts/select_representative

Replace mkdir with File::Path make_path

https://perldoc.perl.org/File/Path.html#SYNOPSIS

Doing backquotes mkdir is not that safe.

Still getting fasttree error

 - WARNING: PIRATE could not make binary tree (is fastree installed?)
 - WARNING: PIRATE could not make R plots (are dependencies installed?)

 - tests completed

 $ PIRATE --version
PIRATE 1.0.2

which FastTree
/home/linuxbrew/.linuxbrew/bin/FastTree

From #8

../scripts/run_PIRATE.pl:$ft = "FastTree" if `command -v FastTree;`;
../scripts/run_PIRATE.pl:$ft = "fasttree" if `command -v fasttree;`;

I think these both always return true.

I am preparing a PR.

Annotations did not match contig names

Hello,

I apologize if this is the wrong place to raise this issue but I'm hoping you'll be able to provide some insight. I have several fasta files that I annotated using Prokka. I then took the gff3 files that were created and tried to run PIRATE. I received the error that too few (0) of my files had passed QC. I opened the Parse_GFF_log.txt file and saw the message "Annotations did not match contig names". Currently the contig sequences at the end of the gff file are named numerically ie 0,1,2 etc. What annotations does it want the contig names to match? How do I go about resolving this issue?
Thank you in advance.

Missing --version flag

should print to stdout one line:

PIRATE 1.0.4

and return 0

3 scripts are not executable

-rw-r--r--. 1 tseemann domain^users  2927 Sep  4 11:45 align_nucleotide_sequences.pl
-rw-r--r--. 1 tseemann domain^users 36031 Sep  4 11:45 pangenome_graph.pl
-rw-r--r--. 1 tseemann domain^users  5277 Sep  4 11:45 split_paralogs_runner.pl

Added to Brew pkg manager

brewsci/homebrew-bio#761

brew install brewsci/bio/pirate

CD-HIT called wrongly in PangenomeConstruction.pl and PangenomeConstruction.bkup2.pl

The CD-HIT executable is called as cdhit instead of cd-hit at

PangenomeConstruction.pl:103: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5 >> $cdhit_log;

PangenomeConstruction.bkup.pl:99: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5;

PangenomeConstruction.bkup2.pl:102: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5;

add a table of new ID and previous ID after modifying gffs

Hi,

Not sure if it was addressed somewhere else or if I missed a command line argument

The normalisation of the GFFs changes the IDs from the input GFFs, which makes it difficult to combine data computed from the original GFFs with the output of PIRATE. When running PIRATE with CDS only, the tRNA and mRNA features are removed, which shifts the IDs. I was very confused when manually checking PIRATE results because the names/lengths of sequences in PIRATE.gene_families didn't always match (IDs at the top of the input GFF files may not be modified, because tRNA and rRNA are only found later in the file).

Can I suggest either always using the original IDs in the output file or providing a table with the new and old IDs, so that one can correctly merge datasets?

Typo

 - WARNING: PIRATE could not make binary tree (is fastree installed?)

fasttree?

Adding more genomes to a PIRATE run

Hi,
I have been using PIRATE for bacterial genomes and was wondering if there is a way to conduct an analysis by adding more genomes to an old run.
I used it on 130 genomes in the first run and now have to add a few more genomes for the same analysis. From my understanding, it seems like I have to rerun the whole thing with all the genomes, but is there a way around it?
PS: Really appreciate the PIRATE humor. :)

Bug with checking dependencies

$diamond_bin = "diamond blastp" if `command -v diamond makedb;`;

command -v diamond makedb
/home/linuxbrew/.linuxbrew/bin/diamond
/home/linuxbrew/.linuxbrew/bin/makedb  # this is some other GNU tool

might have to do

diamond help | grep makedb
makedb  Build DIAMOND database from a FASTA file
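The check suggested above can be sketched as a small shell function: first confirm the executable exists, then ask its own help output whether the subcommand is listed. `has_subcommand` is a hypothetical name, and the sketch assumes the tool accepts a `help` argument in the way the `diamond help` call above does.

```shell
# has_subcommand TOOL SUBCOMMAND
# Succeeds only if TOOL is on PATH and its "help" output mentions SUBCOMMAND
# as a whole word. Unlike "command -v TOOL SUBCOMMAND", this never looks for
# an unrelated executable that happens to be called SUBCOMMAND.
has_subcommand() {
    command -v "$1" >/dev/null 2>&1 || return 1
    "$1" help 2>&1 | grep -q -w -- "$2"
}

# Usage, assuming diamond is installed:
#   has_subcommand diamond makedb && echo "diamond makedb available"
```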

memory as a command-line argument?

PIRATE seemed to run well on a smaller dataset, but on a large dataset CD-HIT runs out of memory. The limit appears to be hardcoded as -M 2450. Can you make this a command-line argument?

Thanks.

List non-core Perl module dependencies

Some modules are core, but it depends on the Perl version. Let's say 5.26:
https://perldoc.perl.org/5.26.0/index-modules-T.html

These are the modules you are using:

grep -h '^use ' -r . | cut -d ' ' -f2 | sed 's/;//' | sort | uniq -c | sort -nr
     43 warnings
     43 strict
     29 Getopt::Long
     28 Pod::Usage
     20 File::Basename
     20 Cwd
      3 Text::Wrap
      3 List::Util
      2 Bio::AlignIO
      1 IPC::Open2
      1 File::Temp
      1 Bio::SeqIO
      1 Bio::Seq
      1 Bio::Perl

The good news is, I think Bioperl is the only non-core Perl module!

[Feature] Make output file with original gene designations from GFF files

I see the genes have been renumbered by PIRATE. I use Prokka to number the genes and use a 10-step (_00010, _00020 etc), but PIRATE renumbers these and adds the full name to it.

Example:
genome Ec2456_phyloC, gene designations Ec2456_00010 etc

Renamed to:
genome Ec2456_phyloC, gene designations Ec2456_phyloC_00001 etc

Is it possible to instruct PIRATE not to renumber or rename, and work with the original codes from the GFF files?

Min length of gene families

Noticed that the min_length for the members of gene families is 123 bp, although some genes in the input were as small as 75 bp. Is there a parameter that can be altered that allows these small genes to be included in families?

- creating co-ordinate files Failed

Hi,

I set up running PIRATE on a collection of bacterial genomes, using the following command,

PIRATE -i ./gff/ -o ./panOut -a -r -f "rRNA, tRNA" -k "-f 6" -t 40

However, it failed within a few seconds with the error message,
"- ERROR: feature co-ordinate extraction failed"

Can anyone please help explain what may have gone wrong and how I should fix this problem?

Thanks a lot.

need suggestions on my PIRATE command.

Hello
I ran PIRATE on around 300 bacterial genomes, using the following command:

PIRATE -i ./311gff/ -o ./311Agro_panOut -a -r -k "-f 6" -t 40

That is, I used the default identity thresholds.
ANI values among the ~300 strains range from 77.556 to 100%.
There is one gene, tolB, whose copies are highly identical (> 98%); however, in the PIRATE_gene_presence_absence.ordered.tsv table these genes were classified into three gene families:
e.g. tolB in gene family g000845, where the cutoff PIRATE used is 50%;
tolB in g032043_000006 gene family, the cutoff is 98%;
tolB in g044923_000006 gene family, the cutoff is 98%.

However, the interesting part is that g032043_000006 is found in only one species (containing 29 strains), and the g044923_000006 gene family is also found in only one specific species (only three strains), even though g000845, g032043 and g044923 are highly similar (> 98.8%).
It feels like the default settings somehow put genes from the same species into one gene family, at least for tolB in these two species. I also checked tolB in other species, which gave the 50% cutoff gene family g000845.
How should I interpret why the three gene families are not clustered as one gene family, and why one was even clustered with other genes at the 50% cutoff? I am really confused about how to interpret this data.
I am wondering if you can give some suggestions on how to modify my PIRATE command.
I would like to try different settings and compare them, based on your experience.
I also sent you an email but never got a reply, so I am posting here to ask for help again.
I am sorry if my question is too basic.
Really appreciate your help.

Best,
Limin

questions on output files

Hi,

After PIRATE finishes running, there is an output file called core_alignment.fasta. I want to know how many genes are included in this alignment, and which genes they are. How or where can I get this information? I want to know the individual genes that make up core_alignment.fasta.
Also, I want to get an amino acid alignment of the core genes. Is it possible to have PIRATE create it for me, or do I have to translate it from core_alignment.fasta?

Thanks a lot.
Best,
LC

problem with fasttree and --pan-off

Hi,

I've been trying out PIRATE on some bacterial genomes, and I had a couple questions

I have FastTree binary in my path, and when I run check_dependencies.pl under scripts it all seems fine. I checked closed issues on this, and I think you did some changes to fix this before but in run_PIRATE.pl

PIRATE/scripts/run_PIRATE.pl

Line 453 in 8e857f1

if (`command -v fasttree;`){

In my installation, command -v fasttree; returns false but command -v FastTree; is true so PIRATE couldn't generate a tree

Also, I couldn't figure out in what context '--pan-off' can be useful exactly? Do you have any examples for when I might want to try this?

Thanks!

PS. really like the random pirate jokes at the end! Glad to see I'm not the only one who prefers some humor to exit flag 0

use parallel instead of xargs ?

./scripts/run_PIRATE.pl:        `cat $pirate_dir/genome_list.txt | xargs -I {} cat $pirate_dir/genome_multifastas/{}.fasta > $pirate_dir/pan_sequences.fasta`;

I think this command just cats all the files in the list together into a single fasta?

Having issues in conda with -I {} .
Can we use parallel -j1 somehow instead?

UPDATE: not sure if this is the real problem
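If the `xargs -I {}` form is the culprit, the same concatenation can be written as a plain read loop with no placeholder substitution at all. This is a sketch only: `concat_genomes` is a hypothetical helper, and the `NAME.fasta` layout is inferred from the command quoted above.

```shell
# concat_genomes LIST DIR OUT
# Appends DIR/<name>.fasta to OUT for every non-empty line <name> in LIST,
# in list order. Stops with a non-zero status if a listed file is missing.
concat_genomes() {
    _list=$1 _dir=$2 _out=$3
    : > "$_out" || return 1          # start from an empty output file
    while IFS= read -r _genome; do
        [ -n "$_genome" ] || continue
        cat "$_dir/$_genome.fasta" >> "$_out" || return 1
    done < "$_list"
}

# Usage mirroring the quoted command (paths are those named in the issue):
#   concat_genomes "$pirate_dir/genome_list.txt" \
#       "$pirate_dir/genome_multifastas" "$pirate_dir/pan_sequences.fasta"
```

Unlike the pipeline, the loop reports a missing input file through its exit status instead of silently producing a shorter output.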

most output files missing

Hi Sion,

I have tried running PIRATE and it seems to have worked (in the log file there are no obvious errors) but then most of the output files you describe do not appear. Any idea why?

Cheers.

` - WARNING: R not found in system path, cannot use -r command.

PIRATE input options:

Input Directory = /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE
Output directory = /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE
PIRATE will run using 20 cores
4713 files in input directory.
PIRATE will be run on 50,60,70,80,90,95,98 amino acid % identity thresholds.
PIRATE will be run on features annotated as CDS

Standardising and checking input files:

4713 gff files passed QC and will be analysed by PIRATE - completed in: 149s

creating co-ordinate files - completed in: 50s
creating genome loci list: - completed in: 162s

Extracting pangenome sequences:

completed in: 751s

Constructing pangenome sequences:

Options:

Creating pangenome on amino acid % identity using BLAST.
Input directory: /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE
Output directory: /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations
Number of input files: 1
Threshold(s): 50 60 70 80 90 95 98
MCL inflation value: 1.5
Homology test cutoff: 1E-6
Loci file contains 22822198 loci from 4713 genomes.
Extracting core loci during cdhit clustering

Opening pan_sequences
/gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pan_sequences.fasta contains 22659607 sequences.
Passing 22659608 loci to cd-hit at 100%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.100 -aS 0.9 -c 1 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99.5 -aS 0.9 -c 0.995 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99 -aS 0.9 -c 0.99 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98.5 -aS 0.9 -c 0.985 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98 -aS 0.9 -c 0.98 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
completed in 16985 secs
0 core loci (0%)
22659608 non-core loci (100%)
433444 representative loci passed to blast.
running all-vs-all BLASTP on pan_sequences
completed in 33254 secs
running mcl on pan_sequences at 50
50379 clusters at 50 % - completed in 712 secs
running mcl on pan_sequences at 60
61760 clusters at 60 % - completed in 2838 secs
running mcl on pan_sequences at 70
74797 clusters at 70 % - completed in 2545 secs
running mcl on pan_sequences at 80
93099 clusters at 80 % - completed in 2300 secs
running mcl on pan_sequences at 90
133221 clusters at 90 % - completed in 2005 secs
running mcl on pan_sequences at 95
194809 clusters at 95 % - completed in 1744 secs
running mcl on pan_sequences at 98
368701 clusters at 98 % - completed in 1659 secs
reinflating clusters for pan_sequences
Finished
completed in: 64346s

Parsing pangenome files:

Processing 50% - 10689 paralogous gene clusters.
Processing 60% - 11560 paralogous gene clusters.
Processing 70% - 12329 paralogous gene clusters.
Processing 80% - 13154 paralogous gene clusters.
Processing 90% - 14112 paralogous gene clusters.
Processing 95% - 14136 paralogous gene clusters.
Processing 98% - 12281 paralogous gene clusters.

10689 paralog containing gene clusters detected.
4713 genomes processed.

completed in: 13809s

Classifing paralogous clusters:

19508966 loci contained in 10689 clusters containing paralogs
(base) [lipworth@rescomp2 PIRATE]$ cat pangenome_log.txt

Options:

Creating pangenome on amino acid % identity using BLAST.
Input directory: /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE
Output directory: /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations
Number of input files: 1
Threshold(s): 50 60 70 80 90 95 98
MCL inflation value: 1.5
Homology test cutoff: 1E-6
Loci file contains 22822198 loci from 4713 genomes.
Extracting core loci during cdhit clustering

Opening pan_sequences
/gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pan_sequences.fasta contains 22659607 sequences.
Passing 22659608 loci to cd-hit at 100%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.100 -aS 0.9 -c 1 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99.5 -aS 0.9 -c 0.995 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99 -aS 0.9 -c 0.99 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98.5 -aS 0.9 -c 0.985 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98 -aS 0.9 -c 0.98 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
completed in 16985 secs
0 core loci (0%)
22659608 non-core loci (100%)
433444 representative loci passed to blast.
running all-vs-all BLASTP on pan_sequences
completed in 33254 secs
running mcl on pan_sequences at 50
50379 clusters at 50 % - completed in 712 secs
running mcl on pan_sequences at 60
61760 clusters at 60 % - completed in 2838 secs
running mcl on pan_sequences at 70
74797 clusters at 70 % - completed in 2545 secs
running mcl on pan_sequences at 80
93099 clusters at 80 % - completed in 2300 secs
running mcl on pan_sequences at 90
133221 clusters at 90 % - completed in 2005 secs
running mcl on pan_sequences at 95
194809 clusters at 95 % - completed in 1744 secs
running mcl on pan_sequences at 98
368701 clusters at 98 % - completed in 1659 secs
reinflating clusters for pan_sequences
Finished
`

The only files this gives me is
PIRATE.log
./co-ords/
genome_list.txt
loci_list.tab
pan_sequences.fasta
pangenome_log.txt
paralog_working
cluster_alleles.tab
genome2loci.tab
./genome_multifastas/
./modified_gffs/
./pangenome_iterations/
paralog_clusters.tab

Tag a release?

Any chance you can tag a release?

https://github.com/SionBayliss/PIRATE/releases

fasttree vs FastTree ?

./PIRATE  --check
 - WARNING: fasttree not found in system path, a binary presence-absence tree will not be created.

The standard exe name is FastTree - any chance you can support both?

I did cd PIRATE/bin && ln -s $(which FastTree) fasttree for now

More genomes to an existing PIRATE run

I was wondering if you can add more genomes later to a previous analysis?

Use with eukaryotic/intron containing models

I realize there is currently a one-CDS-per-locus model and this is very much a bacteria-focused project. But are there reasons it cannot work with other systems?
I have done my own workaround using mRNA features and nucleotide comparisons, which does achieve results, but I wonder if you are open to code that deals with multi-CDS features for a single mRNA feature, spliced together to make the feature sequence that is compared?

possible bug with unlinking files

        # remove previous test files.
        unlink "$output_dir/modified_gffs/HO_5096_0412_test.gff" if "$output_dir/modified_gffs/HO_5096_0412_test.gff";

did you mean if -f "$output/..." or -e ?
because the string will always be true, you want to test the file?
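Shell has the same pitfall: `[ "$path" ]` is true for any non-empty string, whereas `[ -e "$path" ]` actually tests the file system. A minimal shell illustration of the existence test being asked for; `remove_if_exists` is a hypothetical helper, not PIRATE code:

```shell
# remove_if_exists PATH
# Deletes PATH only when something actually exists there; a missing file is
# not treated as an error.
remove_if_exists() {
    if [ -e "$1" ]; then
        rm -- "$1"
    fi
}

# Usage with the file named in the snippet above:
#   remove_if_exists "$output_dir/modified_gffs/HO_5096_0412_test.gff"
```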

PIRATE to Roary

Hi,

I am trying to convert the PIRATE output to Roary. However, I got "input directory not found", even though it is there.

hsb18158@sipbsmicro ~/PIRATE % bin/PIRATE PIRATE_to_roary.pl -i /home/hsb18158/data/PIRATE_output/PIRATE.gene_families.tsv -o /home/hsb18158/data/PIRATE_to_roary/gene_families.csv
ERROR: input directory not
hsb18158@sipbsmicro ~/PIRATE % bin/PIRATE convert_to_roary.pl -i /home/hsb18158/data/PIRATE_output/PIRATE.gene_families.tsv -o /home/hsb18158/data/PIRATE_to_roary/gene_families.csv
ERROR: input directory not found.

Any suggestion?

Thank you,
Parra

Problem with creating core alignments

At the end of the analysis using "PIRATE -i ./test_gff3 -f "tRNA,rRNA,CDS" -s "95,96,97,98" -k "--cd-low 98 -e 1E-12" -a", it says below
Can't use an undefined value as an ARRAY reference at /home/XXXXX/.conda/envs/conda2env/scripts/create_pangenome_alignment.pl line 384, line 9549.

0 clusters to be printed to output
100 % clusters added to output
ERROR: creating core concatenate failed
completed in: 0s
Could you help address this issue? Thanks

fasttree vs FastTree still not working

$ command -v fasttree

$ command -v FastTree
/home/linuxbrew/.linuxbrew/bin/FastTree

./PIRATE  --check

Running PIRATE on test files:

 - PIRATE completed with no errors

 - WARNING: PIRATE could not make binary tree (is fastree installed?)
 - WARNING: PIRATE could not make R plots (are dependencies installed?)

 - tests completed

Not sure what is happening here:

../scripts/check_dependencies.pl:# fasttree
../scripts/check_dependencies.pl:$ft = 1 if `command -v fasttree;`;
../scripts/check_dependencies.pl:$ft = 1 if `command -v FastTree;`;
../scripts/check_dependencies.pl:print " - WARNING: fasttree not found in system path, a binary presence-absence tree will not be created.\n" if $ft == 0;
../scripts/run_PIRATE.pl:# set fasttree executable
../scripts/run_PIRATE.pl:$ft = "FastTree" if `command -v FastTree;`;
../scripts/run_PIRATE.pl:$ft = "fasttree" if `command -v fasttree;`;
../scripts/run_PIRATE.pl:# create binary fasta file for fasttree
../scripts/run_PIRATE.pl:if (`command -v fasttree;`){
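In the lines quoted above, the final check tests only the lowercase name, so an installation that provides only `FastTree` would presumably skip the tree step. A minimal shell sketch of resolving the executable name once and reusing the result everywhere; `first_available` is a hypothetical helper, not part of PIRATE:

```shell
# first_available NAME [NAME...]
# Prints the first NAME that resolves to a command on PATH and returns 0;
# returns 1 (printing nothing) if none of them is available.
first_available() {
    for _name in "$@"; do
        if command -v "$_name" >/dev/null 2>&1; then
            printf '%s\n' "$_name"
            return 0
        fi
    done
    return 1
}

# Usage: pick whichever spelling is installed, then test the variable rather
# than repeating "command -v" with a single hard-coded name.
#   ft=$(first_available FastTree fasttree) || echo "FastTree not found" >&2
```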

Faulty #! lines in scripts/*.pl

cd PIRATE/scripts

grep -h '^#!' *pl  | sort | uniq -c | sort -n

      1 #!/usr/bin/env perl
      1 #!usr/bin/env perl
      1 #!/usr/bin/perl -w
      2 #!/usr/bin/perl
      2 #!/usr/perl
     19 #!/usr/bin/env perl

They should all be #!/usr/bin/env perl

I assume they worked because you are calling them with perl /path/scripts/foo.pl but it would be good to fix these.
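To see which files need fixing, the first line of each script can be compared against the intended interpreter line. A small sketch, with `bad_shebangs` as a hypothetical helper name:

```shell
# bad_shebangs DIR
# Lists every DIR/*.pl whose first line is not exactly "#!/usr/bin/env perl",
# printing the path and the offending first line separated by a tab.
bad_shebangs() {
    for _f in "$1"/*.pl; do
        [ -f "$_f" ] || continue      # skip the literal pattern if no match
        _first=$(head -n 1 "$_f")
        [ "$_first" = '#!/usr/bin/env perl' ] || printf '%s\t%s\n' "$_f" "$_first"
    done
}

# Usage from the repository root:
#   bad_shebangs scripts
```

The comparison is exact, so a line with stray trailing characters is reported as well.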

rplots vs rplot

Hi Sion,

In run_PIRATE.pl rplot(s) is different:

In the variable and help message it is rplots, but getopt is looking for rplot - this means the long opt doesn't work.

Scoary crashes on PIRATE converted output file due to "--include_input_columns ALL" option

Looks like a really interesting tool. As I have used the Roary/Scoary couple so far, I like the conversion to a Roary output.

I have tested this, and when the "--include_input_columns ALL" option is selected with Scoary, which allows the gene number columns to be copied to the output, then Scoary crashes on the converted file. Somehow the converted file is not fully compatible?

When the option is left out, then Scoary works fine, but would prefer to have the gene numbers to be included in the output. Any thoughts on what the subtle difference in format of Roary and converted files may be?

Plasmid comparison

Hello 🙂

I would like to ask you some help/suggestions please.
I am trying to create an image (either with brig or cgview or circos) where three plasmids are compared. Each plasmid comes from a different strain and they have some differences. My idea was to create a sort of pan genome of the three plasmids as a reference, and then blast every plasmid against this pan genome, showing which genes are present in each plasmid. I hope this makes sense! I used PIRATE to perform the analyses, but I don't really know how to go on. I don't think that the pan_sequences.fasta file is what I need (since it contains the same gene multiple times), but the representative_sequences.ffn is also not ideal (as it only has one representative per gene cluster, so if a plasmid has duplications I won't be able to show them).
Would anyone please have any suggestions?

Thank you.

Laura