sionbayliss / pirate
A toolbox for pangenome analysis and threshold evaluation.
License: GNU General Public License v3.0
Hi,
When running PIRATE with several features (e.g. CDS,rRNA,tRNA), some of which require nucleotide homology to be used (i.e. rRNA,tRNA), does PIRATE automatically resort to using nucleotide sequence homology for all features? Or just the ones requiring it?
I.e. would CDS be analyzed using nucleotide or a.a. homology in this case? If nucleotide (which I suspect from an example on the main page), would it be difficult to implement an option to use a.a. for CDS but nucleotide for non-coding features?
Any response is greatly appreciated!!
Thank you,
Conrad
Hi,
I ran the following two commands; both created the same number of output files.
PIRATE -i ./311gff/ -o ./311Agro_panOut_RNAcalled -f "tRNA,rRNA" -a -r -k "-f 6" -t 40
PIRATE -i ./311gff/ -o ./311Agro_panOut -a -r -k "-f 6" -t 40
The only difference is that the second command omits -f "tRNA,rRNA". Both commands create the same number of output files, and neither gives an error message.
However, the output file contents from the first command look wrong, and it ran in less than 10 minutes, while the second command ran for around 9 hours.
Here is one comparison of PIRATE.pangenome_summary.txt from both commands:
First command:
%isolates  #clusters  >1 allele  fission/fusion  multicopy
0-10%      160        4          0               0
10-25%     5          1          0               0
25-50%     6          0          0               0
50-75%     5          0          0               0
75-90%     1          0          0               0
90-95%     1          1          0               0
95-100%    37         13         0               0
Second command output:
%isolates  #clusters  >1 allele  fission/fusion  multicopy
0-10%      42290      1859       650             167
10-25%     3061       790        294             144
25-50%     3207       610        408             163
50-75%     1101       293        286             124
75-90%     503        162        158             100
90-95%     104        25         39              23
95-100%    1967       246        506             165
Can anyone help explain what may go wrong when adding -f "tRNA,rRNA"? Although it produces no error message, it doesn't make sense that it finishes in less than 10 minutes while the second command runs for 9 hours.
Thanks.
Limin
Hi,
I am looking at my gene_families.tsv file and it has ~11,000 entries, but the annotation file (pangenome_alignment.gff) has only ~7,500. I can't work out why so many gene families are missing.
Thanks in advance,
Marcela
PIRATE -c doesn't work (but --check does)
Conflict with another -c ?
I am trying to construct a pan genome using genomes that consist only of contigs. However, PIRATE gives me the following error:
- running mcl on pan_sequences at 50
- 0 clusters at 50 % - completed in 0 secs
- running mcl on pan_sequences at 60
- ERROR: pangenome_construction.pl failed - error logged at /path/to/output/folder/fail_test.txt
fail_text.txt:
BLAST options error: File /path/to/output/folder/pangenome_iterations/pan_sequences.representative.fasta is empty
- ERROR: no clusters in /path/to/output/folder/pangenome_iterations/pan_sequences.mcl_50.clusters
When I add a complete genome to the pool of genomes on which the pan genome is constructed, everything works fine. Is this a feature or a bug, and do you have an explanation for why this arises?
I have run PIRATE and the execution was really fast and fine. My question now is whether, using some of the scripts, I can extract genes that are shared between a set of genomes but are not present in the rest of the genomes. Is that possible? Thank you!
Hello,
I just wanted to clarify what the pangenome alignment file contains. From my understanding, it contains all genes in the pangenome, concatenated together, with a sequence for each sample in the PIRATE run. Genes are in the same order as in the ordered tsv file. Is this correct?
If so, are the genes concatenated simply end-to-end, or is some spacer used? And how are missing genes handled? "N"s or "-"? I understand that dosage > 1 genes are replaced by "?", but does this refer to all copies of the gene, or subsequent (after the first, however that is defined)?
Thank you!
Conrad
$ PIRATE --help
$ echo $?
1
Should be 0, as I asked for help and got it. There was no error.
Same for --version
in #11
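The convention requested above can be sketched in shell. This is only an illustration of the exit-status behaviour, not PIRATE's actual option parser; the function name, the tool name and the version string are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical option handler: informational flags succeed, usage errors fail.
handle_info_flags() {
    case "$1" in
        -h|--help)
            echo "usage: mytool [options]"    # help was requested: print to stdout
            return 0                           # and report success
            ;;
        --version)
            echo "mytool 0.0.0"                # placeholder version string
            return 0
            ;;
        *)
            echo "unknown option: $1" >&2      # genuine usage errors go to stderr
            return 1
            ;;
    esac
}

handle_info_flags --help
echo "exit status: $?"
```

With this pattern a packaging test such as `mytool --version && echo ok` passes, while a mistyped option still fails loudly.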
I am trying to convert the output files but it is not working. Any help?
Hi,
I am trying to install the package but apparently Conda cannot find it. Would you please have a look at this issue and let me know how to install the pirate package? I am eager to try it out.
Cheers,
Pablo
Hello,
I am trying to install pirate via conda. I have activated all the mentioned conda channels. But it provides the following error or issue:
> `root@honey-pc:/home/furqan# conda install pirate
> Collecting package metadata (current_repodata.json): done
> Solving environment: failed with initial frozen solve. Retrying with flexible solve.
> Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
> Collecting package metadata (repodata.json): done
> Solving environment: failed with initial frozen solve. Retrying with flexible solve.
> Solving environment: -
> Found conflicts! Looking for incompatible packages.
> This can take several minutes. Press CTRL-C to abort.
> failed
>
> UnsatisfiableError: The following specifications were found to be incompatible with each other:
>
> Output in format: Requested package -> Available versions
>
> Package libstdcxx-ng conflicts for:
> python=3.8 -> libstdcxx-ng[version='>=7.3.0|>=7.5.0']
> python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0'] -> libstdcxx-ng[version='>=4.9|>=7.2.0']
>
> Package zlib conflicts for:
> pirate -> blast[version='>=2.2.31'] -> zlib[version='1.2.11.*|>=1.2.11,<1.3.0a0|1.2.8.*']
> python=3.8 -> zlib[version='>=1.2.11,<1.3.0a0']
>
> Package libgcc-ng conflicts for:
> python=3.8 -> libffi[version='>=3.2.1,<3.3.0a0'] -> libgcc-ng[version='>=4.9|>=7.2.0']
> python=3.8 -> libgcc-ng[version='>=7.3.0|>=7.5.0']`
Maybe it's a small issue, but I couldn't find any solution to this.
Regards
Furqan
We installed PIRATE and I want a user to be able to check it, but it fails.
It seems it wants to write to its own folder?
Can you use $outdir = File::Temp::tempdir(CLEANUP => 1) instead?
PIRATE --check
Running PIRATE on test files:
sh: /home/software/PIRATE/test/PIRATE/PIRATE.log: Permission denied
- ERROR: PIRATE did not run correctly:
- ERROR: PIRATE was not able to classify paralogs
- WARNING: PIRATE could not make R plots (are dependencies installed)
- tests completed
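The idea of running the self-test in a private scratch directory, rather than inside the (possibly read-only) install tree, could look roughly like this in shell. `run_selftest` is a hypothetical name and the body only stands in for the real test run.

```shell
#!/usr/bin/env bash
# Sketch: write test output to a user-writable temporary directory.
run_selftest() {
    local workdir status=0
    workdir=$(mktemp -d) || return 1           # private scratch directory
    # Stand-in for the real test: write a log file into the scratch directory.
    if ! echo "test run started" > "$workdir/PIRATE.log"; then
        status=1
    fi
    [ -s "$workdir/PIRATE.log" ] || status=1   # the log must exist and be non-empty
    rm -rf "$workdir"                          # clean up, similar to CLEANUP => 1
    return "$status"
}

if run_selftest; then
    echo "self-test directory was writable"
fi
```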
FindBin is the standard way to find where you are and resolve all symlinks. It's a core module.
I think there can be issues with dirname($0) with symlinks.
use FindBin;
use Cwd 'abs_path';
my $script_path = abs_path($FindBin::RealBin); # from a script in scripts/
my $script_path = abs_path("$FindBin::RealBin/../scripts"); # from PIRATE
use File::Basename;
my $exe_name = basename("$FindBin::RealScript"); #
grep dirname -r . | grep script_path
./bin/PIRATE:my $script_path = abs_path(dirname($0));
./scripts/classify_paralogs.pl:my $script_path = abs_path(dirname($0));
./scripts/split_paralogs_runner.pl:my $script_path = abs_path(dirname($0));
./scripts/run_PIRATE.pl:my $script_path = abs_path(dirname($0));
./scripts/align_feature_sequences.pl:my $script_path = abs_path(dirname($0));
./scripts/create_pangenome_alignment.pl:my $script_path = abs_path(dirname($0));
./scripts/link_clusters_runner.pl:my $script_path = abs_path(dirname($0));
./scripts/pangenome_construction.pl:my $script_path = abs_path(dirname($0));
./tools/treeWAS/pangenome_variants_to_treeWAS.pl:my $script_path = abs_path(dirname($0));
./tools/subsetting/select_representative:#my $script_path = abs_path(dirname($0));
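The symlink concern can be reproduced outside Perl. The shell demonstration below builds a throwaway install tree with a symlinked launcher and compares the directory of the link with the directory of its target. It assumes a `readlink` that supports `-f` (GNU coreutils and recent macOS); all paths are made up for the demo.

```shell
#!/usr/bin/env bash
# Demonstration only: a fake install tree with a symlinked launcher.
demo=$(mktemp -d)
mkdir -p "$demo/install/bin" "$demo/install/scripts" "$demo/path"
printf '#!/bin/sh\n' > "$demo/install/bin/tool"
ln -s "$demo/install/bin/tool" "$demo/path/tool"

link="$demo/path/tool"
naive_dir=$(dirname "$link")                    # directory of the link itself
real_dir=$(dirname "$(readlink -f "$link")")    # directory of the link's target

echo "naive:    $naive_dir"
echo "resolved: $real_dir"
if [ -d "$real_dir/../scripts" ]; then
    echo "scripts/ found relative to the resolved location"
fi
if [ ! -d "$naive_dir/../scripts" ]; then
    echo "scripts/ NOT found relative to the symlink's directory"
fi

rm -rf "$demo"
```

Only the resolved directory has `scripts/` as a sibling, which is the case `$FindBin::RealBin` is meant to handle.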
Sion,
I was hoping to get a FASTA file out with one representative sequence per cluster, but I can't see one in the output or in the Wiki's list of output files.
i.e. if there were 3000 clusters, a FASTA file with the "best/longest" representative for each cluster.
Ideally a .ffn (DNA) and a .faa (AA) version.
Also, a pan version and a core version (only clusters with all genomes in them).
Have I missed something?
select_representative => select_representative.pl ?
In fact, I don't think it is even used anywhere?
git rm scripts/select_representative
https://perldoc.perl.org/File/Path.html#SYNOPSIS
Running mkdir in backquotes is not that safe.
- WARNING: PIRATE could not make binary tree (is fastree installed?)
- WARNING: PIRATE could not make R plots (are dependencies installed?)
- tests completed
$ PIRATE --version
PIRATE 1.0.2
which FastTree
/home/linuxbrew/.linuxbrew/bin/FastTree
From #8
../scripts/run_PIRATE.pl:$ft = "FastTree" if `command -v FastTree;`;
../scripts/run_PIRATE.pl:$ft = "fasttree" if `command -v fasttree;`;
I think these both always return true.
I am preparing a PR.
Hello,
I apologize if this is the wrong place to raise this issue but I'm hoping you'll be able to provide some insight. I have several fasta files that I annotated using Prokka. I then took the gff3 files that were created and tried to run PIRATE. I received the error that too few (0) of my files had passed QC. I opened the Parse_GFF_log.txt file and saw the message "Annotations did not match contig names". Currently the contig sequences at the end of the gff file are named numerically, i.e. 0, 1, 2, etc. What annotations does it want the contig names to match? How do I go about resolving this issue?
Thank you in advance.
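One way to investigate a message like this is to compare the sequence IDs in column 1 of the annotation lines with the FASTA headers in the embedded ##FASTA section, since the GFF3 format expects them to refer to the same sequences. The sketch below does that with awk. Whether this is exactly the check PIRATE performs is an assumption, and `unmatched_seqids` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print sequence IDs used by annotation lines that have no matching FASTA
# header in the ##FASTA section of a GFF3 file given as the first argument.
unmatched_seqids() {
    awk -F'\t' '
        /^##FASTA/       { in_fasta = 1; next }
        in_fasta && /^>/ { id = substr($0, 2); sub(/[ \t].*/, "", id); fasta[id] = 1; next }
        !in_fasta && !/^#/ && NF >= 9 { annot[$1] = 1 }
        END { for (id in annot) if (!(id in fasta)) print id }
    ' "$1" | sort
}

# Tiny made-up example: the annotation refers to contig_1 but the FASTA header is 1.
tmp=$(mktemp)
printf '##gff-version 3\ncontig_1\tProdigal\tCDS\t1\t9\t.\t+\t0\tID=gene_1\n##FASTA\n>1\nATGAAATAA\n' > "$tmp"
unmatched_seqids "$tmp"
rm -f "$tmp"
```

An empty result means every annotated sequence ID has a FASTA record with the same name; any IDs printed are candidates for renaming.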
should print to stdout one line:
PIRATE 1.0.4
and return 0
-rw-r--r--. 1 tseemann domain^users 2927 Sep 4 11:45 align_nucleotide_sequences.pl
-rw-r--r--. 1 tseemann domain^users 36031 Sep 4 11:45 pangenome_graph.pl
-rw-r--r--. 1 tseemann domain^users 5277 Sep 4 11:45 split_paralogs_runner.pl
brew install brewsci/bio/pirate
The CD-HIT executable is called as cdhit instead of cd-hit at
PangenomeConstruction.pl:103: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5 >> $cdhit_log;
PangenomeConstruction.bkup.pl:99: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5;
PangenomeConstruction.bkup2.pl:102: cdhit -i $working_dir/$sample.temp.fasta -o $working_dir/$sample.$i -c $curr_thresh -n 5;
Hi,
Not sure if it was addressed somewhere else or if I missed a command line argument.
The normalisation of the GFFs changes the IDs from the input GFFs, which makes it difficult to combine data computed from the original GFFs with the output of PIRATE. When running PIRATE with only CDS, the tRNA and mRNA features are removed, which shifts the IDs. I was very confused when manually checking PIRATE results because the names/lengths of sequences in PIRATE.gene_families didn't always match (IDs at the top of the input GFF files may not be modified, because tRNA and rRNA are only found later in the file).
Can I suggest either always using the original IDs in the output files or providing a table with the new and old IDs, so that we can correctly merge datasets?
- WARNING: PIRATE could not make binary tree (is fastree installed?)
fasttree?
Hi,
I have been using PIRATE for bacterial genomes and was wondering if there is a way to conduct an analysis by adding more genomes to an old run.
I used it on 130 genomes in the first run and now have to add a few more genomes for the same analysis. From my understanding, it seems like I have to rerun the whole thing with all the genomes, but is there a way around it?
PS: Really appreciate the PIRATE humor. :)
$diamond_bin = "diamond blastp" if `command -v diamond makedb;`;
command -v diamond makedb
/home/linuxbrew/.linuxbrew/bin/diamond
/home/linuxbrew/.linuxbrew/bin/makedb # this is some other GNU tool
might have to do
diamond help | grep makedb
makedb Build DIAMOND database from a FASTA file
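A check along those lines could ask diamond itself whether it lists the makedb subcommand, instead of looking for a separate `makedb` executable on the PATH. A minimal sketch, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Succeeds only if `diamond` is available and its help text mentions makedb.
diamond_has_makedb() {
    command -v diamond >/dev/null 2>&1 || return 1   # diamond must be findable
    diamond help 2>&1 | grep -q 'makedb'             # subcommand listed in help
}

if diamond_has_makedb; then
    echo "diamond with makedb found"
else
    echo "diamond not usable"
fi
```

This avoids the false positive from an unrelated `makedb` binary, because that name is never looked up on its own.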
PIRATE seemed to work well on a smaller dataset, but on a large dataset CD-HIT runs out of memory. It appears to be hardcoded as -M 2450. Can you make this a command-line argument?
Thanks.
Some modules are core, but it depends on the Perl version. Let's say 5.26:
https://perldoc.perl.org/5.26.0/index-modules-T.html
These are the modules you are using:
grep -h '^use ' -r . | cut -d ' ' -f2 | sed 's/;//' | sort | uniq -c | sort -nr
43 warnings
43 strict
29 Getopt::Long
28 Pod::Usage
20 File::Basename
20 Cwd
3 Text::Wrap
3 List::Util
2 Bio::AlignIO
1 IPC::Open2
1 File::Temp
1 Bio::SeqIO
1 Bio::Seq
1 Bio::Perl
The good news is, I think Bioperl is the only non-core Perl module!
I see the genes have been renumbered by PIRATE. I use Prokka to number the genes and use a 10-step (_00010, _00020 etc), but PIRATE renumbers these and adds the full name to it.
Example:
genome Ec2456_phyloC, gene designations Ec2456_00010 etc
Renamed to:
genome Ec2456_phyloC, gene designations Ec2456_phyloC_00001 etc
Is it possible to instruct PIRATE not to renumber or rename, and work with the original codes from the GFF files?
Noticed that the min_length for the members of gene families is 123 bp, although some genes in the input were as small as 75 bp. Is there a parameter that can be altered that allows these small genes to be included in families?
Hi,
I set up running PIRATE on a collection of bacterial genomes, using the following command,
PIRATE -i ./gff/ -o ./panOut -a -r -f "rRNA, tRNA" -k "-f 6" -t 40
However, it failed within a few seconds with the error message,
"- ERROR: feature co-ordinate extraction failed"
Can anyone please help explain what may have gone wrong and how I should fix this problem?
Thanks a lot.
Hello,
I ran PIRATE on around 300 bacterial genomes, using the following command:
PIRATE -i ./311gff/ -o ./311Agro_panOut -a -r -k "-f 6" -t 40
That is, I used the default identity thresholds.
ANI values across the ~300 strains range from 77.556 to 100%.
There is one gene, tolB, whose copies are highly identical (> 98%). However, in the PIRATE_gene_presence_absence.ordered.tsv table, these tolB genes were classified into three gene_families.
e.g. tolB gene_family g000845, the cutoff PIRATE used is 50%;
tolB in g032043_000006 gene family, the cutoff is 98%;
tolB in g044923_000006 gene family, the cutoff is 98%.
However, the interesting thing is that g032043_000006 is found in only one species (containing 29 strains), and the g044923_000006 gene family is also found in only one specific species (only three strains), even though g000845, g032043 and g044923 are highly similar (> 98.8%).
It feels like the default settings somehow put genes from the same species into one gene family, at least for tolB in these two species. I also checked tolB in other species, and there I got the 50% cutoff gene family g000845.
How should I interpret why the three gene families are not clustered as one gene family, and why one was even clustered with other genes at the 50% cutoff? I am really confused about how to interpret this data.
I am wondering if you can give some suggestions on how to modify my PIRATE command.
I would like to try different things to compare based on your experience.
I actually sent you an email but never got a reply, so I am posting here to ask for help again.
I am sorry if my question is too basic.
Really appreciate your help.
Best,
Limin
Hi,
After PIRATE finishes running, there is an output file called core_alignment.fasta. I want to know how many genes are included in this alignment, and which genes they are. How or where can I get this information? I want to know the individual genes that make up core_alignment.fasta.
Also, I want to get an amino acid alignment of the core genes. Is it possible to have PIRATE create it for me, or do I have to translate it from core_alignment.fasta?
Thanks a lot.
Best,
LC
Hi,
I've been trying out PIRATE on some bacterial genomes, and I had a couple questions
check_dependencies.pl
under scripts it all seems fine. I checked closed issues on this, and I think you did some changes to fix this before but in run_PIRATE.pl
Line 453 in 8e857f1
In my installation, command -v fasttree; returns false but command -v FastTree; is true, so PIRATE couldn't generate a tree.
Thanks!
PS. really like the random pirate jokes at the end! Glad to see I'm not the only one who prefers some humor to exit flag 0
./scripts/run_PIRATE.pl: `cat $pirate_dir/genome_list.txt | xargs -I {} cat $pirate_dir/genome_multifastas/{}.fasta > $pirate_dir/pan_sequences.fasta`;
I think this command just cats all the files in the list together into a single fasta?
Having issues in conda with -I {}.
Can we use parallel -j1 somehow instead?
UPDATE: not sure if this is the real problem
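If the `xargs -I {}` form does turn out to be the problem, the same concatenation can be written as a plain read loop with no extra dependency. This is a sketch of an alternative, not the code PIRATE uses; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Concatenate one FASTA per genome named in a list file, without xargs.
# Arguments: list file, directory of per-genome FASTA files, output file.
concat_genome_fastas() {
    local list=$1 fasta_dir=$2 out=$3 genome
    : > "$out"                                   # start from an empty output file
    while IFS= read -r genome || [ -n "$genome" ]; do
        [ -n "$genome" ] || continue             # skip blank lines
        cat "$fasta_dir/$genome.fasta" >> "$out" || return 1
    done < "$list"
}
```

Mirroring the original command, it would be called as `concat_genome_fastas "$pirate_dir/genome_list.txt" "$pirate_dir/genome_multifastas" "$pirate_dir/pan_sequences.fasta"`.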
Hi Sion,
I have tried running PIRATE and it seems to have worked (in the log file there are no obvious errors) but then most of the output files you describe do not appear. Any idea why?
Cheers.
` - WARNING: R not found in system path, cannot use -r command.
PIRATE input options:
Standardising and checking input files:
Extracting pangenome sequences:
Constructing pangenome sequences:
Options:
Opening pan_sequences
/gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pan_sequences.fasta contains 22659607 sequences.
Passing 22659608 loci to cd-hit at 100%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.100 -aS 0.9 -c 1 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99.5 -aS 0.9 -c 0.995 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99 -aS 0.9 -c 0.99 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98.5 -aS 0.9 -c 0.985 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98 -aS 0.9 -c 0.98 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
completed in 16985 secs
0 core loci (0%)
22659608 non-core loci (100%)
433444 representative loci passed to blast.
running all-vs-all BLASTP on pan_sequences
completed in 33254 secs
running mcl on pan_sequences at 50
50379 clusters at 50 % - completed in 712 secs
running mcl on pan_sequences at 60
61760 clusters at 60 % - completed in 2838 secs
running mcl on pan_sequences at 70
74797 clusters at 70 % - completed in 2545 secs
running mcl on pan_sequences at 80
93099 clusters at 80 % - completed in 2300 secs
running mcl on pan_sequences at 90
133221 clusters at 90 % - completed in 2005 secs
running mcl on pan_sequences at 95
194809 clusters at 95 % - completed in 1744 secs
running mcl on pan_sequences at 98
368701 clusters at 98 % - completed in 1659 secs
reinflating clusters for pan_sequences
Finished
completed in: 64346s
Parsing pangenome files:
Processing 50% - 10689 paralogous gene clusters.
Processing 60% - 11560 paralogous gene clusters.
Processing 70% - 12329 paralogous gene clusters.
Processing 80% - 13154 paralogous gene clusters.
Processing 90% - 14112 paralogous gene clusters.
Processing 95% - 14136 paralogous gene clusters.
Processing 98% - 12281 paralogous gene clusters.
10689 paralog containing gene clusters detected.
4713 genomes processed.
Classifing paralogous clusters:
Options:
Opening pan_sequences
/gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pan_sequences.fasta contains 22659607 sequences.
Passing 22659608 loci to cd-hit at 100%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.100 -aS 0.9 -c 1 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99.5 -aS 0.9 -c 0.995 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 99%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.99 -aS 0.9 -c 0.99 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98.5%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98.5 -aS 0.9 -c 0.985 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
Passing 22659608 loci to cd-hit at 98%
command: "cd-hit -i /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.temp.fasta -o /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.98 -aS 0.9 -c 0.98 -T 20 -g 1 -n 5 -M 40731 -d 256 >> /gpfs2/well/bag/users/lipworth/gram_neg/PIRATE/pangenome_iterations/pan_sequences.cdhit_log.txt"
completed in 16985 secs
0 core loci (0%)
22659608 non-core loci (100%)
433444 representative loci passed to blast.
running all-vs-all BLASTP on pan_sequences
completed in 33254 secs
running mcl on pan_sequences at 50
50379 clusters at 50 % - completed in 712 secs
running mcl on pan_sequences at 60
61760 clusters at 60 % - completed in 2838 secs
running mcl on pan_sequences at 70
74797 clusters at 70 % - completed in 2545 secs
running mcl on pan_sequences at 80
93099 clusters at 80 % - completed in 2300 secs
running mcl on pan_sequences at 90
133221 clusters at 90 % - completed in 2005 secs
running mcl on pan_sequences at 95
194809 clusters at 95 % - completed in 1744 secs
running mcl on pan_sequences at 98
368701 clusters at 98 % - completed in 1659 secs
reinflating clusters for pan_sequences
Finished
`
The only files this gives me are
PIRATE.log
./co-ords/
genome_list.txt
loci_list.tab
pan_sequences.fasta
pangenome_log.txt
paralog_working
cluster_alleles.tab
genome2loci.tab
./genome_multifastas/
./modified_gffs/
./pangenome_iterations/
paralog_clusters.tab
Any chance you can tag a release?
./PIRATE --check
- WARNING: fasttree not found in system path, a binary presence-absence tree will not be created.
The standard exe name is FastTree - any chance you can support both?
I did cd PIRATE/bin && ln -s $(which FastTree) fasttree for now
I was wondering if you can add more genomes later to a previous analysis?
I realize there is currently a one-CDS-per-locus model and this is very much a bacteria-focused project. But are there reasons it cannot work with other systems?
I have done my own workaround using the mRNA feature and nucleotide comparisons, which does achieve results, but I wonder if you are open to code that deals with multi-CDS features for a single mRNA feature, which could be spliced together to make the feature sequence that is compared?
# remove previous test files.
unlink "$output_dir/modified_gffs/HO_5096_0412_test.gff" if "$output_dir/modified_gffs/HO_5096_0412_test.gff";
Did you mean if -f "$output/..." or -e?
Because the string will always be true, and you want to test the file?
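The same distinction exists in shell, which makes it easy to see in isolation: testing a non-empty string is always true, whereas `-e` or `-f` actually asks the filesystem. The path below is made up and assumed not to exist.

```shell
#!/usr/bin/env bash
path="/no/such/dir/HO_5096_0412_test.gff"   # hypothetical, non-existent path

# A non-empty string is "true" whether or not the file exists.
if [ "$path" ]; then
    echo "string test: true"
fi

# -e checks for existence; -f additionally requires a regular file.
if [ -e "$path" ]; then
    echo "file test: exists"
else
    echo "file test: missing"
fi
```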
Hi,
I am trying to convert the PIRATE output to Roary format. However, I get "input directory not found", even though it is there.
hsb18158@sipbsmicro ~/PIRATE % bin/PIRATE PIRATE_to_roary.pl -i /home/hsb18158/data/PIRATE_output/PIRATE.gene_families.tsv -o /home/hsb18158/data/PIRATE_to_roary/gene_families.csv
ERROR: input directory not
hsb18158@sipbsmicro ~/PIRATE % bin/PIRATE convert_to_roary.pl -i /home/hsb18158/data/PIRATE_output/PIRATE.gene_families.tsv -o /home/hsb18158/data/PIRATE_to_roary/gene_families.csv
ERROR: input directory not found.
Any suggestion?
Thank you,
Parra
At the end of the analysis using "PIRATE -i ./test_gff3 -f "tRNA,rRNA,CDS" -s "95,96,97,98" -k "--cd-low 98 -e 1E-12" -a", it prints the following:
Can't use an undefined value as an ARRAY reference at /home/XXXXX/.conda/envs/conda2env/scripts/create_pangenome_alignment.pl line 384, line 9549.
0 clusters to be printed to output
100 % clusters added to output
ERROR: creating core concatenate failed
completed in: 0s
Could you help address this issue? Thanks
$ command -v fasttree
$ command -v FastTree
/home/linuxbrew/.linuxbrew/bin/FastTree
./PIRATE --check
Running PIRATE on test files:
- PIRATE completed with no errors
- WARNING: PIRATE could not make binary tree (is fastree installed?)
- WARNING: PIRATE could not make R plots (are dependencies installed?)
- tests completed
Not sure what is happening here:
../scripts/check_dependencies.pl:# fasttree
../scripts/check_dependencies.pl:$ft = 1 if `command -v fasttree;`;
../scripts/check_dependencies.pl:$ft = 1 if `command -v FastTree;`;
../scripts/check_dependencies.pl:print " - WARNING: fasttree not found in system path, a binary presence-absence tree will not be created.\n" if $ft == 0;
../scripts/run_PIRATE.pl:# set fasttree executable
../scripts/run_PIRATE.pl:$ft = "FastTree" if `command -v FastTree;`;
../scripts/run_PIRATE.pl:$ft = "fasttree" if `command -v fasttree;`;
../scripts/run_PIRATE.pl:# create binary fasta file for fasttree
../scripts/run_PIRATE.pl:if (`command -v fasttree;`){
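The grep output shows the executable name being checked in three places with different spellings. One way to keep this consistent is a single lookup that tries both names and returns whichever is found, so every later step uses the same value. A shell sketch of that idea, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Print the first FastTree executable name found on the PATH; fail if neither exists.
find_fasttree() {
    local name
    for name in FastTree fasttree; do
        if command -v "$name" >/dev/null 2>&1; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

if ft=$(find_fasttree); then
    echo "using $ft"
else
    echo "WARNING: FastTree not found in PATH" >&2
fi
```

The dependency check and the tree-building step would then both rely on the one result instead of repeating `command -v` with a hard-coded spelling.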
cd PIRATE/scripts
grep -h '^#!' *pl | sort | uniq -c | sort -n
1 #!/usr/bin/env perl
1 #!usr/bin/env perl
1 #!/usr/bin/perl -w
2 #!/usr/bin/perl
2 #!/usr/perl
19 #!/usr/bin/env perl
They should all be #!/usr/bin/env perl
I assume they worked because you are calling them with perl /path/scripts/foo.pl
but it would be good to fix these.
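To find the offending files rather than just count them, something like the following could be run from the repository root. It is a sketch with a hypothetical function name; it compares the first line exactly, so variants with trailing whitespace or Windows line endings are also reported.

```shell
#!/usr/bin/env bash
# List every *.pl file in a directory whose first line is not the portable shebang.
list_bad_shebangs() {
    local dir=$1 want='#!/usr/bin/env perl' f first
    for f in "$dir"/*.pl; do
        [ -f "$f" ] || continue                    # no matches, or not a regular file
        IFS= read -r first < "$f" || true          # first line (empty if file is empty)
        [ "$first" = "$want" ] || printf '%s: %s\n' "$f" "$first"
    done
}

if [ -d scripts ]; then
    list_bad_shebangs scripts
fi
```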
Hi Sion,
In run_PIRATE.pl rplot(s) is different:
In the variable and help message it is rplots, but getopt is looking for rplot - this means the long opt doesn't work.
Looks like a really interesting tool. As I have used the Roary/Scoary couple so far, I like the conversion to a Roary output.
I have tested this, and when the "--include_input_columns ALL" option is selected with Scoary, which allows the gene number columns to be copied to the output, then Scoary crashes on the converted file. Somehow the converted file is not fully compatible?
When the option is left out, then Scoary works fine, but would prefer to have the gene numbers to be included in the output. Any thoughts on what the subtle difference in format of Roary and converted files may be?
Hello
I would like to ask you some help/suggestions please.
I am trying to create an image (with BRIG, CGView or Circos) in which three plasmids are compared. Each plasmid comes from a different strain and they have some differences. My idea was to create a sort of pan genome of the three plasmids as a reference, and then blast every plasmid against this pan genome, showing which genes are present in each plasmid. I hope this makes sense! I used PIRATE to perform the analyses, but I don't really know how to proceed. I don't think that the pan_sequences.fasta file is what I need (since it contains the same gene multiple times), but representative_sequences.ffn is not ideal either (as it has only one representative per gene cluster, so if a plasmid has duplications I won't be able to show them).
Would anyone please have any suggestions?
Thank you.
Laura