Code Monkey home page Code Monkey logo

alitv-perl-interface's Introduction

AliTV-Perl interface

Build status

Latest build: Build Status Coverage Status

DESCRIPTION

This repository contains two helper scripts for AliTV:

⚠️ If your fasta file contains lowercase letters they will be treated as masked by lastz, see issue#152

INSTALLATION

External Requirements:

Install with:

git clone https://github.com/AliTVTeam/AliTV-perl-interface
cpanm --installdeps .

CITE

An article about AliTV has been published in PeerJ Computer Science DOI Please cite this article if you use AliTV-perl-interface in your project. Additionally the software in any specific version can be cited via its zenodo doi: DOI

AUTHOR

SEE ALSO

alitv-perl-interface's People

Contributors

greatfireball avatar iimog avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

alitv-perl-interface's Issues

yml as input plus project name

Does not seem to work...

perl -I lib bin/alitv.pl data/chloroset/input.yml

***********************************************************************
*                                                                     *
*  AliTV perl interface                                               *
*                                                                     *
***********************************************************************

You are using version v0.1.12.
INFO - Created temporary folder at '/tmp/_Q2WDyftc2'
INFO - Starting alignment generation... (36 alignments required)
[...]

json/log/map files have the name autogen_XXXXX.*

But if proving a project name I am receiving a warning message:

perl -I lib bin/alitv.pl --project new_run data/chloroset/input.yml
YML file specified, therefore the project name was overwritten by 'input'!

***********************************************************************
*                                                                     *
*  AliTV perl interface                                               *
*                                                                     *
***********************************************************************

You are using version v0.1.12.
INFO - Created temporary folder at '/tmp/wDTDHkuKA3'
INFO - Starting alignment generation... (36 alignments required)

But json/log/map are now named as expected input.*

Poor performance on larger data sets

I started alitv.pl on the 6 reference E. coli genomes at NCBI. They are ~5 Mbp in size each.
As expected all lastz alignments (with default parameters) took about 1 h in total.
However after this step was finished for each of the alignments (105 in total) the following line was printed:

MAF input file detected... Therefore workaround for revcom issue activated

It took over 10 h to finish this MAF workaround for all 105 alignments.
Then the following message was printed:

Deleting temporary folder

However since this message more than 2 h have passed and still the json file is empty and the perl script runs with 100% CPU.

I expected the alignment generation to be the time limiting step but it seems that post processing takes more than 10fold that time.

Easy workflow

The easiest workflow should allow the following statement to generate AliTV JSON files:

# download a number of fasta and/or genebank files
./alitv.pl *

This should create a set of AliTV JSON input files and a yml file, for later modification and replay.

Add option to just create yml

I often find myself using AliTV in quick mode (providing a list of fasta files to alitv.pl instead of a .yml file) just to let AliTV create the proper .yml file. Then cancel, adjust the .yml and run alitv.pl again with this modified .yml. While this works just fine it is a bit of a hack and might be worth considering as a feautre. The idea is a --create_yml or --yml_only parameter that if present stops AliTV after creating the .yml file.

Allow filter criteria for feature selection

Features need to be filtered out a huge file to avoid huge amounts of manual work. Filters are defined inside the yml file and are applied while importing. Therefore different feature classes can be imported from a single gff/gb/tabular file.

Error with huge sequence sets

Error in tempfile() using template tempXXXXX.maf: Could not create temp file temp50E5h.maf: Too many open files at [...]/AliTV-perl-interface/lib/AliTV/Alignment/lastz.pm line 70.

Unable to load alignment module

From a user (via mail):

i have installed all the dependency package for alitv as per the github
instruction. i want to create image for custom genomes. i have made yml
file as per the tutorials.

i am getting following error

FATAL - Unable to load alignment module 'AliTV::Alignment::lastz' at
/data/ngs/programs/AliTV-perl-interface/bin/../lib/AliTV/Script.pm line 123.

Error in 'Alignment' module with MAFFT alignment

Hi,

I read your recent publication and I'm really interested in trying out AliTV, but I can't get it to work with my data.
I got 8 viral genomes in FASTA format which were aligned by MAFFT.
I added the following lines to the yaml file:

---
alignment:
    program: importer
    parameter:
        - "../WSSV_genomes_MAFFT_alignment.fasta"

genomes:
  - name: AF332093.3
    sequence_files:
      - WSSV_genome_AF332093.3.fasta
  - name: AF369029.2
    sequence_files:
      - WSSV_genome_AF369029.2.fasta
  - name: AF440570.1
    sequence_files:
      - WSSV_genome_AF440570.1.fasta
  - name: JX515788.1
    sequence_files:
      - WSSV_genome_JX515788.1.fasta
  - name: KR083866.1
    sequence_files:
      - WSSV_genome_KR083866.1.fasta
  - name: KT995470.1
    sequence_files:
      - WSSV_genome_KT995470.1.fasta
  - name: KT995471.1
    sequence_files:
      - WSSV_genome_KT995471.1.fasta
  - name: KT995472.1
    sequence_files:
      - WSSV_genome_KT995472.1.fasta
  - name: KU216744.1
    sequence_files:
      - WSSV_genome_KU216744.1.fasta
output:
  conf: conf.json
  data: data.json
  filter: filter.json

And ran it with the following command:
perl -Ilib ~/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/alitv.pl --overwrite --project WSSV WSSV.yml
However, I get the following error:

Use of uninitialized value in subroutine entry at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/../lib/AliTV/Alignment.pm line 262, <GEN9> line 55170.
FATAL - unable to create features at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/../lib/AliTV/Alignment.pm line 305

When I tried to use lastz to align the sequences (giving the fasta files as input, without a yaml file), it produced a different error:
FATAL - Unable to load alignment module 'AliTV::Alignment::lastz' at /home/ibar/etc/tools/Annotation/AliTV-perl-interface-1.0.2/bin/alitv.pl line 117.

Thank you, Ido

Add lastz binary

It is a little inconvenient that users installing AliTV as stated in the documentation with:

git clone https://github.com/AliTVTeam/AliTV-perl-interface
cpanm --installdeps .

still need to install lastz separately. It might not even be obvious that --installdeps does not include lastz.
The best solution would be to include a binary of lastz that is used if it is not found in the PATH.
This should not be a problem as lastz is under MIT license now as well https://github.com/lastz/lastz

If this is not possible it would be good to clarify the need to install lastz separately in the documentation (optionally with installation instructions).

Add `--no-self` option

alitv.pl by default calculates all-vs-all genome alignments this includes a comparison of each genome against itself. This information is currently not visualized in the AliTV frontend. It might be useful to do this in the future but right now this behavior just increases calculation times and the size of the output file without any benefit. Therefore I'd suggest disabling self alignments by default and adding a parameter to re-enable them if wanted.

Links on reverse strand have wrong coordinates

It seems like links that are twisted have the wrong coordinates as if start and stop coordinates of those hits are relative to the reverse strand (so 0 coordinate at the end).
Links on the forward strand seem to be ok.

Look at the following image for clarification:
alitv
Look at the links from the turquise "ycf2" gene. It should match while only being twisted. But the link from top to bottom hits as far from the end as it should hit from the start (notice the 0 coordinate as the sequence is shifted). The same applies for the link from bottom to top. It should hit from 19k to 25k instead it hits on 30k to 24k on the 49k long sequence (49-19=30 and 49-25=24).

Support gb and gff input

Currently only tabular feature anntotation is supported.

It needs to be possible to use gb or gff files directly.

Support multiple alignments

Now, only pairwise alignments are supported.
Due to MAF supports multiple alignments, alitv.pl should also do...

Nevertheless, multiple alignments have been splitted into pairwise, due to the fundamental design of AliTV

Missing YAML in dependencies

During an installation in a basic Docker container, I found that YAML module should be added to the list of dependencies in Makefile.PL.

Caching for alignments

If the sequence set was not changed and the alignment program parameters are still the same a cached version of the alignments can be used.

How can I load Seqs who do not begin from first base?

For example, I want to compare regions from human, mouse and rat, and the length of the regions I am interested in is only about 500kb or so, but none of these region are at the beginning of Chrs. Then how can I make json from such sequences?

Mapping file should be generated

After generation of unique sequence names, the original names of the contigs are still available inside the JSON file. Nevertheless, AliTV v0.4.0 does not seem to display this information. Even for the offset, the unique names (seq1..seqX) are indicated. This is hard to navigate. A possible solution is the creation of a mapping file containing the original and the new unique sequence names ordered by genome.

Add missing tests

I need to add missing test files/cases for the following items:

AliTV.pm

  • subroutine AliTV:: run
  • subroutine AliTV::get_json
  • subroutine AliTV::_import_links

Genome.pm

  • subroutine AliTV::Genome::get_features
  • subroutine AliTV::Genome::get_chromosomes
  • subroutine AliTV::Genome::_store_feature
  • subroutine AliTV::Genome::_orig_id_to_uniq_id
  • subroutine AliTV::Genome::_uniq_id_to_orig_id
  • subroutine AliTV::Genome::_get_uniq_seq_ids
  • subroutine AliTV::Genome::_get_orig_seq_ids
  • subroutine AliTV::Genome::_get_seq_ids

Dynamic conf values

The values for ticks in the conf section are still static

The should be dynamically calculated.

Wrong tree after generating JSON using AliTV.pl

After generating a new JSON file using the following commands
cd AliTV-perl-interface
perl -Ilib bin/alitv.pl data/chloroset/*.fasta > simple.json
perl -Ilib bin/alitv.pl data/chloroset/input.yml > simple.json

wrongtree

the tree is wrong

Error when fasta id contains pipe symbol

When one of the fasta files contains a pipe symbol | in the id as in >scaffold1|size16646 the following error occurs:

INFO - Wrote temporary YAML file 'test.yml'
INFO - Created temporary folder at '/tmp/A8sihAcMFs'
INFO - Starting alignment generation...
INFO - Finished alignment generation
INFO - Deleting temporary folder
FATAL - unable to create features
unable to create features at /home/bla/AliTV-perl-interface/lib/AliTV/Base.pm line 123

There is no json output in this case.
When replacing the pipe symbol with an underscore it runs without an error.

Long sequences need to be excluded from JSON file

If long sequences are imported, they will blow up the memory required by the script and will also blow up the JSON file created.

Therefore, starting at a total sequence length of 1 Mbp the sequence information should be excluded from writing to JSON.

Add default lastz option --allocate:traceback=800M

Aligning closely related bacterial genomes I observed frequent messages like this:

INFO - Finished 5. alignment (5 to go; 50.00 % done)
truncating alignment ending at (759335,706960);  anchor at (535615,483240)
truncation can be reduced by using --allocate:traceback to increase traceback memory
INFO - Finished 6. alignment (4 to go; 60.00 % done)
truncating alignment ending at (1048075,706960);  anchor at (824355,483240)
truncation can be reduced by using --allocate:traceback to increase traceback memory

I did at first not realize that this little message (that might well be hidden in a long stream of output) meant that completely homologous regions end up unaligned (because as the message states, the alignment is truncated...).

We could consider summarizing those messages in the AliTV output more prominently.

The message also already suggests the solution: increasing --allocate:traceback from the default of (80.0M) https://lastz.github.io/lastz/#option_alloc_traceback
For me setting it to 800M solved the problem (that was my first try so probably less would have sufficed as well).

As I think memory in this order of magnitude should not be a problem at all in current computers we can also consider passing this by default to lastz (when AliTV is called with fasta files).

Different rounding of identity in links and filters

This error is a little subtle and I stumbled across it by chance. When you run alitv.pl on the two sequences in this gist like this:

alitv.pl --project out seq0.fa seq1.fa

you end up with the json file also in the gist. When you visualize this file in the AliTV web interface you do not see any alignment despite the two sequences being almost identical.

Upon closer inspection I found the error: The filters.links.maxLinkIdentity is dynamically set to the link with highest identity found (with lots of precision), in this case: 99.8955395382848.
However the identity for the link is rounded to one decimal position so is 99.9.
As this is higher than the maxLinkIdentity the link is not shown. This is clearly a bug.

I can imagine multiple ways to address this issue:

  1. do not set maxLinkIdentity dynamically, rather always set it to 100
  2. round maxLinkIdentity and the separate links to the same number of decimal points

As a workaround if this case happens it helps to re-adjust the identity slider (set it a little lower, apply, set it to 100% and apply again). In this case with 99.89 it looks like it is fixed at 100 already...

Re-add the `--self` option

In early versions of AliTV there was a --self option that allowed easy comparison of a genome against itself. It took advantage of the self option of lastz. A user reported to me that this option was useful and convenient. Therefore I suggest re-adding this feature.
What do you think @greatfireball ?

Running alitv.pl twice with overwrite results in an error

When I run alitv.pl test.yml --overwrite repeatedly it works twice and then fails. The reason is that the first time the test.map file does not exist so no need to overwrite at all. The second time a test.map file exists and is backuped to test.map.bak and then overwritten. In all consecutive runs both the test.map and test.map.bak files exist and AliTV refuses to overwrite test.map.bak.

This behavior is not intuitive. For me setting the --overwrite flag explicitly means "please overwrite all existing, conflicting files". Therefore I would not even expect a .bak file to be created. And if it is created I wont expect it to cause consecutive runs to fail. I'd vote for not creating a backup file at all and this problem should be solved.

Estimated tick distance not suitable

The tick distance is estimated by the median sequence length (implemented to solve #39). However this method is not suitable for datasets with some large sequences and lots of small ones. E.g. bacterial genomes with plasmids or genome assemblies. The estimated tick distance tends to be too small and lots of ticks with very short distance clatter the image and negatively impact performance.
There will be negative examples for any method but I feel that we can do better.

Suggestions:

  1. Use the total genome length of the largest genome to estimate tick distance (this should be robust as this is also the way to calculate the total width)
  2. Use the length of the longest sequence
  3. Use a medium value of the n largest sequences (where n is the number of genomes)
  4. Use the 90% quantile instead of the median

I think solution 1 is the best but solution 2 will do as well.

Support lastz Sequence Specifiers

It is possible to provide lastz sequence specifiers: http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#seq_spec
Those are added to the filename in square brackets. A real world use case is to specify unmask to allow for lowercase letters in the fasta files. Otherwise those are treated as masked and ignored in the seeding stage. So fasta files with only lowercase letters will produce no alignments at all.
Unfortunately, it is not possible to provide those specifiers as normal parameters. So the current yml format is not sufficient. Actually, there is currently no way to get unmask behavior so the input files need to be changed.

Add --help option to filter script

The filter script is nice and works as intended. However a --help option that lists the available filter capabilities and a basic usage example would help a lot.

Browser performance bad with lots of very small links

If there are a lot of very small links the browser performance gets bad (loading times of up to a minute, or hanging). See #101
Performance is drastically improved if this small links are filtered out - which is not difficult to do in perl.
There are two possible solutions here:

  1. Add options to the yml for link filtering by length (and identity)
  2. Automatically guess good cutoffs for link length (and identity)

Default lastz parameter --ambiguous=iupac

AliTV when started with just fasta files uses a couple of default parameters for lastz:

    - --format=maf
    - --noytrim
    - --gapped
    - --strand=both
    - --ambiguous=iupac

They all make sense in multiple scenarios and might be better suited then lastz defaults for many cases the last one in particular --ambiguous=iupac has an interesting side effect:
Aligning a scaffold (with huge N stretches >10kbp) to a reference sequence creates alignments spanning those gaps but the with very low identities of 20% or so. This is not the image you expect to see in this scenario. This can easily be fixed by the user when using the yml file and omitting that option. But as it is not the default behavior of lastz we should probably document this. (I actually did it with this issue but the README is probably a better place 😄)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.