alitvteam / alitv-perl-interface Goto Github PK

View Code? Open in Web Editor NEW

10.0 3.0 3.0 2.29 MB

License: MIT License

Perl 100.00%

alitv-perl-interface's Introduction

AliTV-Perl interface

Build status

Latest build:

DESCRIPTION

This repository contains two helper scripts for AliTV:

A perl script alitv.pl to generate input files for AliTV.
A perl script alitv-filter.pl to filter the json input files for AliTV.

⚠️ If your fasta file contains lowercase letters they will be treated as masked by lastz, see issue#152

INSTALLATION

External Requirements:

perl
cpanm
lastz

Install with:

git clone https://github.com/AliTVTeam/AliTV-perl-interface
cpanm --installdeps .

CITE

An article about AliTV has been published in PeerJ Computer Science Please cite this article if you use AliTV-perl-interface in your project. Additionally the software in any specific version can be cited via its zenodo doi:

AUTHOR

Frank Förster [email protected]
Markus J. Ankenbrand [email protected]

alitv-perl-interface's People

Contributors

Stargazers

Watchers

Forkers

zhaokai2014 wook2014 drsnjena

alitv-perl-interface's Issues

yml as input plus project name

Does not seem to work...

perl -I lib bin/alitv.pl data/chloroset/input.yml

***********************************************************************
*                                                                     *
*  AliTV perl interface                                               *
*                                                                     *
***********************************************************************

You are using version v0.1.12.
INFO - Created temporary folder at '/tmp/_Q2WDyftc2'
INFO - Starting alignment generation... (36 alignments required)
[...]

json/log/map files have the name autogen_XXXXX.*

But if proving a project name I am receiving a warning message:

perl -I lib bin/alitv.pl --project new_run data/chloroset/input.yml
YML file specified, therefore the project name was overwritten by 'input'!

***********************************************************************
*                                                                     *
*  AliTV perl interface                                               *
*                                                                     *
***********************************************************************

You are using version v0.1.12.
INFO - Created temporary folder at '/tmp/wDTDHkuKA3'
INFO - Starting alignment generation... (36 alignments required)

But json/log/map are now named as expected input.*

Remove pullapprove integration

Due to the new feature by Github, we do not require pullapprove anymore

Main script needs documentation

Poor performance on larger data sets

I started alitv.pl on the 6 reference E. coli genomes at NCBI. They are ~5 Mbp in size each.
As expected all lastz alignments (with default parameters) took about 1 h in total.
However after this step was finished for each of the alignments (105 in total) the following line was printed:

MAF input file detected... Therefore workaround for revcom issue activated

It took over 10 h to finish this MAF workaround for all 105 alignments.
Then the following message was printed:

Deleting temporary folder

However since this message more than 2 h have passed and still the json file is empty and the perl script runs with 100% CPU.

I expected the alignment generation to be the time limiting step but it seems that post processing takes more than 10fold that time.

Easy workflow

The easiest workflow should allow the following statement to generate AliTV JSON files:

# download a number of fasta and/or genebank files
./alitv.pl *

This should create a set of AliTV JSON input files and a yml file, for later modification and replay.

Move repo to AliTVTeam organization

Name can be AliTV-perl or AliTV-script or AliTV-generate or AliTV-prepare.

Ticks are not drawn correctly

Ticks are not drawn correctly as the json file contains tickDistance as a string rather than an integer.

Add option to just create yml

I often find myself using AliTV in quick mode (providing a list of fasta files to alitv.pl instead of a .yml file) just to let AliTV create the proper .yml file. Then cancel, adjust the .yml and run alitv.pl again with this modified .yml. While this works just fine it is a bit of a hack and might be worth considering as a feautre. The idea is a --create_yml or --yml_only parameter that if present stops AliTV after creating the .yml file.

Allow filter criteria for feature selection

Features need to be filtered out a huge file to avoid huge amounts of manual work. Filters are defined inside the yml file and are applied while importing. Therefore different feature classes can be imported from a single gff/gb/tabular file.

Error with huge sequence sets

Error in tempfile() using template tempXXXXX.maf: Could not create temp file temp50E5h.maf: Too many open files at [...]/AliTV-perl-interface/lib/AliTV/Alignment/lastz.pm line 70.

Unable to load alignment module

From a user (via mail):

i have installed all the dependency package for alitv as per the github
instruction. i want to create image for custom genomes. i have made yml
file as per the tutorials.

i am getting following error

FATAL - Unable to load alignment module 'AliTV::Alignment::lastz' at
/data/ngs/programs/AliTV-perl-interface/bin/../lib/AliTV/Script.pm line 123.

Error in 'Alignment' module with MAFFT alignment

Hi,

I read your recent publication and I'm really interested in trying out AliTV, but I can't get it to work with my data.
I got 8 viral genomes in FASTA format which were aligned by MAFFT.
I added the following lines to the yaml file:

---
alignment:
    program: importer
    parameter:
        - "../WSSV_genomes_MAFFT_alignment.fasta"

genomes:
  - name: AF332093.3
    sequence_files:
      - WSSV_genome_AF332093.3.fasta
  - name: AF369029.2
    sequence_files:
      - WSSV_genome_AF369029.2.fasta
  - name: AF440570.1
    sequence_files:
      - WSSV_genome_AF440570.1.fasta
  - name: JX515788.1
    sequence_files:
      - WSSV_genome_JX515788.1.fasta
  - name: KR083866.1
    sequence_files:
      - WSSV_genome_KR083866.1.fasta
  - name: KT995470.1
    sequence_files:
      - WSSV_genome_KT995470.1.fasta
  - name: KT995471.1
    sequence_files:
      - WSSV_genome_KT995471.1.fasta
  - name: KT995472.1
    sequence_files:
      - WSSV_genome_KT995472.1.fasta
  - name: KU216744.1
    sequence_files:
      - WSSV_genome_KU216744.1.fasta
output:
  conf: conf.json
  data: data.json
  filter: filter.json

And ran it with the following command:
perl -Ilib ~/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/alitv.pl --overwrite --project WSSV WSSV.yml
However, I get the following error:

Use of uninitialized value in subroutine entry at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/../lib/AliTV/Alignment.pm line 262, <GEN9> line 55170.
FATAL - unable to create features at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/../lib/AliTV/Alignment.pm line 305

When I tried to use lastz to align the sequences (giving the fasta files as input, without a yaml file), it produced a different error:
FATAL - Unable to load alignment module 'AliTV::Alignment::lastz' at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/alitv.pl line 117.

Thank you, Ido

Remove output section from default YAML file

The output section inside the default YAML file is neither documented nor used at the moment. Therefore, this section should be removed.

Add lastz binary

It is a little inconvenient that users installing AliTV as stated in the documentation with:

git clone https://github.com/AliTVTeam/AliTV-perl-interface
cpanm --installdeps .

still need to install lastz separately. It might not even be obvious that --installdeps does not include lastz.
The best solution would be to include a binary of lastz that is used if it is not found in the PATH.
This should not be a problem as lastz is under MIT license now as well https://github.com/lastz/lastz

If this is not possible it would be good to clarify the need to install lastz separately in the documentation (optionally with installation instructions).

Tests should be skipped if required modules are not installed/Cleaning dependencies

I am using some test modules like Test::Exception or Test::Warn. Those tests should be skipped it the modules are not installed and the modules should be removed from dependency list.

Add `--no-self` option

alitv.pl by default calculates all-vs-all genome alignments this includes a comparison of each genome against itself. This information is currently not visualized in the AliTV frontend. It might be useful to do this in the future but right now this behavior just increases calculation times and the size of the output file without any benefit. Therefore I'd suggest disabling self alignments by default and adding a parameter to re-enable them if wanted.

Links on reverse strand have wrong coordinates

It seems like links that are twisted have the wrong coordinates as if start and stop coordinates of those hits are relative to the reverse strand (so 0 coordinate at the end).
Links on the forward strand seem to be ok.

Look at the following image for clarification:

Look at the links from the turquise "ycf2" gene. It should match while only being twisted. But the link from top to bottom hits as far from the end as it should hit from the start (notice the 0 coordinate as the sequence is shifted). The same applies for the link from bottom to top. It should hit from 19k to 25k instead it hits on 30k to 24k on the 49k long sequence (49-19=30 and 49-25=24).

Support gb and gff input

Currently only tabular feature anntotation is supported.

It needs to be possible to use gb or gff files directly.

Support multiple alignments

Now, only pairwise alignments are supported.
Due to MAF supports multiple alignments, alitv.pl should also do...

Nevertheless, multiple alignments have been splitted into pairwise, due to the fundamental design of AliTV

Missing YAML in dependencies

During an installation in a basic Docker container, I found that YAML module should be added to the list of dependencies in Makefile.PL.

Caching for alignments

If the sequence set was not changed and the alignment program parameters are still the same a cached version of the alignments can be used.

Form of invertedRepeats should be determined to arrow instead of rect by default

The default config determines a rect for the form of inverted repeats. You can change the form to an arrow via the user interface. But it would be easier if the default conf determines always an arrow form to inverted repeats.

How can I load Seqs who do not begin from first base?

For example, I want to compare regions from human, mouse and rat, and the length of the regions I am interested in is only about 500kb or so, but none of these region are at the beginning of Chrs. Then how can I make json from such sequences?

Mapping file should be generated

After generation of unique sequence names, the original names of the contigs are still available inside the JSON file. Nevertheless, AliTV v0.4.0 does not seem to display this information. Even for the offset, the unique names (seq1..seqX) are indicated. This is hard to navigate. A possible solution is the creation of a mapping file containing the original and the new unique sequence names ordered by genome.

Gene features are not displayed correctly

Gene features are not displayed, but are present in JSON output.

But original sequence names are used instead of new unique sequence identifiers.

Add missing tests

I need to add missing test files/cases for the following items:

AliTV.pm

subroutine AliTV:: run
subroutine AliTV::get_json
subroutine AliTV::_import_links

Genome.pm

subroutine AliTV::Genome::get_features
subroutine AliTV::Genome::get_chromosomes
subroutine AliTV::Genome::_store_feature
subroutine AliTV::Genome::_orig_id_to_uniq_id
subroutine AliTV::Genome::_uniq_id_to_orig_id
subroutine AliTV::Genome::_get_uniq_seq_ids
subroutine AliTV::Genome::_get_orig_seq_ids
subroutine AliTV::Genome::_get_seq_ids

Links are not drawn when name of source and/or target chromosome do not match name of visible chromosome

The following JSON contains a multiple alignment, but links are not drawn. In the function filterVisibleLinks all links are removed. Because the names of the targetKaryo and accordingly sourceKaryo do not match the keys of the visible Chromosomes.

The given tree needs to be checked against the genome

Tree and Genome information need to cover the same taxa. Therefore, I need to check the tree information against the genome information to identify input data error

AliTV::Alignment needs to be tested with non-flush alignments

To avoid an uncoverable statement, I need to provide an alignment having different length of sequences.

Maybe that we be solved never

Tests with external dependencies should be skipped if external programs are missing

Currently the only supported alignment program is lastz. In case of missing lastz the corresponding tests should be skipped instead of failing the test

PeerJ-Doi-Image not working

Concerning the README.md. The image is not displayed. Instead the link is shown.

Dynamic conf values

The values for ticks in the conf section are still static

The should be dynamically calculated.

Wrong tree after generating JSON using AliTV.pl

After generating a new JSON file using the following commands
cd AliTV-perl-interface
perl -Ilib bin/alitv.pl data/chloroset/*.fasta > simple.json
perl -Ilib bin/alitv.pl data/chloroset/input.yml > simple.json

the tree is wrong

Error when fasta id contains pipe symbol

When one of the fasta files contains a pipe symbol | in the id as in >scaffold1|size16646 the following error occurs:

INFO - Wrote temporary YAML file 'test.yml'
INFO - Created temporary folder at '/tmp/A8sihAcMFs'
INFO - Starting alignment generation...
INFO - Finished alignment generation
INFO - Deleting temporary folder
FATAL - unable to create features
unable to create features at /home/bla/AliTV-perl-interface/lib/AliTV/Base.pm line 123

There is no json output in this case.
When replacing the pipe symbol with an underscore it runs without an error.

Import of existing alignments

Precomputed alignments need to be importable.

Genome order according to the given tree

Currently my brain is blocked... How to obtain the best genome order for a given tree?

Any ideas?

Long sequences need to be excluded from JSON file

If long sequences are imported, they will blow up the memory required by the script and will also blow up the JSON file created.

Therefore, starting at a total sequence length of 1 Mbp the sequence information should be excluded from writing to JSON.

Add default lastz option --allocate:traceback=800M

Aligning closely related bacterial genomes I observed frequent messages like this:

INFO - Finished 5. alignment (5 to go; 50.00 % done)
truncating alignment ending at (759335,706960);  anchor at (535615,483240)
truncation can be reduced by using --allocate:traceback to increase traceback memory
INFO - Finished 6. alignment (4 to go; 60.00 % done)
truncating alignment ending at (1048075,706960);  anchor at (824355,483240)
truncation can be reduced by using --allocate:traceback to increase traceback memory

I did at first not realize that this little message (that might well be hidden in a long stream of output) meant that completely homologous regions end up unaligned (because as the message states, the alignment is truncated...).

We could consider summarizing those messages in the AliTV output more prominently.

The message also already suggests the solution: increasing --allocate:traceback from the default of (80.0M) https://lastz.github.io/lastz/#option_alloc_traceback
For me setting it to 800M solved the problem (that was my first try so probably less would have sufficed as well).

As I think memory in this order of magnitude should not be a problem at all in current computers we can also consider passing this by default to lastz (when AliTV is called with fasta files).

Different rounding of identity in links and filters

This error is a little subtle and I stumbled across it by chance. When you run alitv.pl on the two sequences in this gist like this:

alitv.pl --project out seq0.fa seq1.fa

you end up with the json file also in the gist. When you visualize this file in the AliTV web interface you do not see any alignment despite the two sequences being almost identical.

Upon closer inspection I found the error: The filters.links.maxLinkIdentity is dynamically set to the link with highest identity found (with lots of precision), in this case: 99.8955395382848.
However the identity for the link is rounded to one decimal position so is 99.9.
As this is higher than the maxLinkIdentity the link is not shown. This is clearly a bug.

I can imagine multiple ways to address this issue:

do not set maxLinkIdentity dynamically, rather always set it to 100
round maxLinkIdentity and the separate links to the same number of decimal points

As a workaround if this case happens it helps to re-adjust the identity slider (set it a little lower, apply, set it to 100% and apply again). In this case with 99.89 it looks like it is fixed at 100 already...

Comparing trees inside AliTV::Tree test sets

Status messages should be printed

Re-add the `--self` option

In early versions of AliTV there was a --self option that allowed easy comparison of a genome against itself. It took advantage of the self option of lastz. A user reported to me that this option was useful and convenient. Therefore I suggest re-adding this feature.
What do you think @greatfireball ?

Add CITATION.cff

A dedicated CITATION.cff file that is both human and machine readable would be useful to unambiguously communicate how we want this software to be cited when used.
See: https://citation-file-format.github.io/
Check: http://citeas.org/
Can be similar to the one just added in the develop branch of the AliTV repository.

Remove reviewninja integration

Due to Githubs own review feature, we do not require reviewninja integration.

Running alitv.pl twice with overwrite results in an error

When I run alitv.pl test.yml --overwrite repeatedly it works twice and then fails. The reason is that the first time the test.map file does not exist so no need to overwrite at all. The second time a test.map file exists and is backuped to test.map.bak and then overwritten. In all consecutive runs both the test.map and test.map.bak files exist and AliTV refuses to overwrite test.map.bak.

This behavior is not intuitive. For me setting the --overwrite flag explicitly means "please overwrite all existing, conflicting files". Therefore I would not even expect a .bak file to be created. And if it is created I wont expect it to cause consecutive runs to fail. I'd vote for not creating a backup file at all and this problem should be solved.

Estimated tick distance not suitable

The tick distance is estimated by the median sequence length (implemented to solve #39). However this method is not suitable for datasets with some large sequences and lots of small ones. E.g. bacterial genomes with plasmids or genome assemblies. The estimated tick distance tends to be too small and lots of ticks with very short distance clatter the image and negatively impact performance.
There will be negative examples for any method but I feel that we can do better.

Suggestions:

Use the total genome length of the largest genome to estimate tick distance (this should be robust as this is also the way to calculate the total width)
Use the length of the longest sequence
Use a medium value of the n largest sequences (where n is the number of genomes)
Use the 90% quantile instead of the median

I think solution 1 is the best but solution 2 will do as well.

Support lastz Sequence Specifiers

It is possible to provide lastz sequence specifiers: http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#seq_spec
Those are added to the filename in square brackets. A real world use case is to specify unmask to allow for lowercase letters in the fasta files. Otherwise those are treated as masked and ignored in the seeding stage. So fasta files with only lowercase letters will produce no alignments at all.
Unfortunately, it is not possible to provide those specifiers as normal parameters. So the current yml format is not sufficient. Actually, there is currently no way to get unmask behavior so the input files need to be changed.

Add --help option to filter script

The filter script is nice and works as intended. However a --help option that lists the available filter capabilities and a basic usage example would help a lot.

Browser performance bad with lots of very small links

If there are a lot of very small links the browser performance gets bad (loading times of up to a minute, or hanging). See #101
Performance is drastically improved if this small links are filtered out - which is not difficult to do in perl.
There are two possible solutions here:

Add options to the yml for link filtering by length (and identity)
Automatically guess good cutoffs for link length (and identity)

Default lastz parameter --ambiguous=iupac

AliTV when started with just fasta files uses a couple of default parameters for lastz:

    - --format=maf
    - --noytrim
    - --gapped
    - --strand=both
    - --ambiguous=iupac

They all make sense in multiple scenarios and might be better suited then lastz defaults for many cases the last one in particular --ambiguous=iupac has an interesting side effect:
Aligning a scaffold (with huge N stretches >10kbp) to a reference sequence creates alignments spanning those gaps but the with very low identities of 20% or so. This is not the image you expect to see in this scenario. This can easily be fixed by the user when using the yml file and omitting that option. But as it is not the default behavior of lastz we should probably document this. (I actually did it with this issue but the README is probably a better place 😄)