bioperl / bioperl-live Goto Github PK

View Code? Open in Web Editor NEW

292.0 54.0 181.0 133 MB

Core BioPerl 1.x code

Home Page: http://bioperl.org

Perl 94.86% Gnuplot 0.06% Parrot 0.03% Forth 0.01% Roff 5.04% F* 0.01%

bioinformatics bioperl perl toolkit

bioperl-live's People

Stargazers

Watchers

Forkers

vinjana nmenda kblin yinjun111 viklund winni2k cybersiddhu chmille4 dbolser gjuggler chiragmatkar kortschak michal-josef-spacek rspengle robinlove emuro josephhughes kellert fschwach awitney bennyyu mirahan gig77 tristanlefebure paolopavan daisieh peterjc chui101 de-husk jhannah achatterjee magnome iluminati fjossandon schoukri wburke jeanfred msufian mattoates jestill nelly2013 andrewjpage frankmusacchia yschensandiego richysix sestaton jadiew nathanweeks ctskennerton stu2 tsibley buuntu llwestphal rochoa85 ersimpso rkhaja lufeng891109 elbowen puneet-shivanand b-rich christiam david-a-parry xuegengwang rtvt123 stuartskelton ryys1122 drgirasol cooldownzhng digger83 hobbit majensen kadenerlab gpp9837 nw328 sillitoe wangk4 shicheng-guo northeyt jogoodma lskatz jkeenan junningwang errcricket jiaolongsun yinxx lairdm roychaudhuri wenmore alenzhao wyim-pgl htafer jvolkening yixf-self onecityuni wangniuniuwnn zmughal-contrib thibauthourlier rapsoulhaonan qwwz chenzhiw

bioperl-live's Issues

Migrate HOWTOs and other docs from BioPerl wiki to bioperl.github.io

We should have these in version control; readthedocs.org can be used to display the docs (which I believe need to be markdown).

Possible Indexing/Permissions issue with CJFIELDS/Bio-ASN1-EntrezGene-1.70

https://metacpan.org/release/CJFIELDS/Bio-ASN1-EntrezGene-1.70

Says Bio::ASN1::EntrezGene is unauthorized

But PAUSE based indexes are fine. It might just be something to take up with metacpan.

 cpan -D Bio::ASN1::EntrezGene
CPAN: Storable loaded ok (v2.53_01)
Reading '/home/kent/.cpan/Metadata'
  Database was generated on Wed, 13 Apr 2016 07:53:40 GMT
Bio::ASN1::EntrezGene
-------------------------------------------------------------------------
    Parser for NCBI Entrez Gene (ASN.1-format)
    C/CJ/CJFIELDS/Bio-ASN1-EntrezGene-1.70.tar.gz
    (no installation file)
    Installed: not installed
    CPAN:      1.70  Not up to date
    Mingyi Liu (MINGYILIU)
    [email protected]

 cpan -D Bio::ASN1::EntrezGene::Indexer
CPAN: Storable loaded ok (v2.53_01)
Reading '/home/kent/.cpan/Metadata'
  Database was generated on Wed, 13 Apr 2016 07:53:40 GMT
Bio::ASN1::EntrezGene::Indexer
-------------------------------------------------------------------------
    (no description)
    C/CJ/CJFIELDS/Bio-ASN1-EntrezGene-1.70.tar.gz
    (no installation file)
    Installed: not installed
    CPAN:      1.70  Not up to date
    Mingyi Liu (MINGYILIU)
    [email protected]

Bio::SearchIO::hmmer3.pm writing unparsable lines to stdout

Hello,

Thanks for all the work.

I've just run into a minor annoyance: When Bio::SearchIO::hmmer3.pm finds a line that it cannot parse, it writes that line to stdout along with the message 'Missed this line:' Lines 622 and 768 from the source code are examples.

This is messing up my downstream parsing. Can this to redirected to stderr or somehow treated better?

The particular lines that is causing problems are from hmmscan 3.1b1 which looks like:

2fo7A00
[No individual domains that satisfy reporting thresholds (although complete target did)]

I can deal with it in my local copy, but this seems like something that should be handled differently.

Problem with newer versions of AlignIO to write alignment data

Dear bioperl folks,

I have been using bioperl for a while to develop POTION, an end-to-end pipeline to detect positive selection in coding sequence data. POTION uses Bioperl to do a lot of things regarding handling sequence/alignment data. Im particular, POTION users AlignIO to change the format of alignment data from phylip to fasta and vice-versa.

When testing POTION using a somewhat older version of BioPerl (1.006901) it works just fine. However, when I updated Bioperl to a newer version (1.006923) POTION stopped working and got stuck when using AlignIO to write alignment objects.

I wrote a very simple code (pasted below) that works just fine when using BioPerl (1.006901) (it finishes after less than a second and correctly writes the output file in fasta format), but it gets stuck when using newer versions of Bioperl (I tested 1.006923 and 1.006924). Please let me know if I can help you to fix this bug.

use strict;
use warnings;
use Bio::AlignIO;

my $infile = $ARGV[0];

chomp $infile;

my $in  = Bio::AlignIO->new(-file => "$infile" ,
                                             -format => 'phylip');


my $out = Bio::AlignIO->new(-file => ">$infile.fa",
                                              -format => 'fasta');

while (my $aln = $in->next_aln ) {
  $out->write_aln($aln);
}

BioPerl cannot exactly guess the type of sequence in a Fasta file (dna, peptide)

I had a problem with a fasta file to identify the type of sequences.
The following code:

!/usr/bin/perl

use strict;
use warnings;

SEQUENCE MANAGEMENT MODULES

use Bio::SeqIO;

my $fastaFilePath = $ARGV[0];
my $seqio = Bio::SeqIO->new( -file => $fastaFilePath, -format=>'fasta');
my $obj = $seqio->next_seq();
print $obj->alphabet;

Returns: proteins

The first lines of the file are the following (here I cannot upload the Fasta file):

CDNA3-4_11_2010_1_rep_c2
ACGCGGGGATGCACTTTCTCTTTTGTCTTCAGCCCGGTAGCGGTAAGTGATAGAGCCACT
AGCTGATTGTACTCAATAAAGCCTAGGCTTGAGAAGCTGGTTAATACCCCTTCTCCGAGT
TGATTTCAGTTAGTTTTATTATTATATATTATTATAATAATTACTGTTGTGTTTGTATAG
GATTATATATTGTTTTTACATTATATATTTCCTTCACGCGTTGCTTCGGCATCGCGTTTC
TCGTGGTCTCCGTCTTTCCTCTGTTATTATTATTATTATTATTAATATATTTTGTATTTC
TGGGTTTTTTGTACATATATTGTTATATATATTTGTTGTATTTGTGTTTTTCTGAGAACG
CGTGTCGCGTTGGCTTAGGCCACGCGCATATATTGCAGCCTCGGTCTTGCTTAGGCCAGA
CCCCGTGTATGTGTGTGGGTGTGTAGTCTGCTTGCAGTCTTACTCCCCACCGCGGTGCGT
TCTACCGCCACGTGGTAACTGTCTTCAGTTCCGTGTTTTCTTACACATTATATTTTGTGT
ACTGCTGCAGCGTCAGAAAAAAAAAAACTCCAGTACTCTGCGTTGATACCACTGCTTAAG
CAGTGGTATCAACGCAGAGTGTTCATTTGCATTTTTTTCTTTTTTTTTATTGTATATAAT
AGAAGCGTTAAACTTATTTTTAAACCTATATTTTCACTCTACTCCAATTATTAAATACTA
GTTTATACTGTTGTCACATTTAACCTGGGTCACACTTACCCCAAAATTAACATAGTTCAA
GTTTAATTTGACGTATACAAGATATGTAATTGTAATATTTCAGCACCCTTCCTAGTATTG
TTATATGTTGTGTTTCTATAACAGTAGGATTGTAATGTTTGGCACATAGGAGGGTAATTT
GAAGCATCTGTCTTATTTAAGACATATTTTTGTAGTTGTTGTTTTTTTTTAATAAAGAGG
TTGTTTAAGTGTGAAAAXXXXXXXXXXXXXXTCCAGTACTCTGCGTTGATACCACTGCT
CDNA3-4_11_2010_1_rep_c5
TGAGGCTGTGTTATGAGCACCTTGTAATTATTTAATCCATCATAAATTAATAATGCCATA
CACATTATATAGCATGGGTAAAATAAGACATTAGTGCCACTTGGCATATATACAAGCAGG
AGGTATAATGTTATAAATACCAACCAACAAAAAAATAATAAAAAATAATGATAATAATAA
AATAACAAGAGACCTGCTTCTCCCTTTGTAGTCTGTGATCCTCTAGCTCAAACATCAAGC
GGTTAAGAAGGATTCAGCTTGTGTGTGTGTGCAGGTCCAGGCAGTGCTGCCAACTCCACA
GAACTGCATTCCCGAAAAGGCACCAATACTGTCTAAGCAGTGGTATCAACGCAGAGTACT
GGAGTTTTTTCTTTTTCAGGTCAAAGAAGTCCCCTGGGTTGACATTGTCTATACCTTTTA
GGATTTTGAATGTTTGAATCAGATCGCCGCGTAGTCTTCTTTGTTCAAGACTGAATAGAT
TCAATTCTTTTAGCCTGTCTACATACGACACGCCTTTTAAACCCGGGATAATTCTGGTTG
CTCTTCTTTGCACTCTTTCTAGAGCAGCAATATCCTTTTTGTAACGAGGTGACCAGAACT
GAACACAATATTCTAGGTGAGGTCTTACTAATGCATTGTAGAGTTTTAACATTACTTCCC
TTGATTAAAATTCAACACTTCTCACAATATATCCAAGCATCTTGTTAGCCTTTTTTATAG
CTTCTCCACATTGTCTAGATGAAGACATTTCTGAGTCAACATAAACTCCTAGGTCTTTTT
CATAGTTCCCTTCTTCAATTTCACTATCTCCCATATGATATTTATAATGCACATTTTTAT
TGCCTGCATGCAATACTTTACACTTTTCTCTATTAAATGTCATTTGCCATGTGTCTGCCC
AGTTCTGAATGCTGTCTAGATCATTTTGAATGACCTTTGCTGCTGCAACAGTGTTTGCCA
CTCCTCCTATTTTTGTGTCATCTGCAAATTTAACAAGTTTGCTTACTATACCAGAATCCA
AATCATTAATGTAGATTAGGAATAGCAGAGGACCTAATACTGATCCCTGTGGTACTCCAC
TTGTTACCTCGCTCCATTTGGAGGTTTCTCCTCTAATCAGTACTTTCTGTTTTCTACATG
TTAACCACTCCCTAATCCATGTGCATGCATGTCCTTGAATCCCTACTGCGTTCAGTTTGA
GAATTCATCTTTTATGCCTGACTTTCTAGAAATCTAAATAAACCATGTCATATGCTTTGC
AAAAAA

I already identified the problem.. it is in the presence of the Xs. When I just substitute N to X then the script works well. But isn't it an error to display protein with these sequencese?

best regards

Francesco

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly

When I use Bio::SeqFeature::Tools::Unflattener to convert GenBank flat-feature-list to containment hierarchy,

$unflattener->unflatten_seq(-seq=>$seq, -use_magic=>1);

Everything is fine except these genes has pseudo tag.

Gene and CDS with pseudo tag have been convert to pseudogene and pseudoCDS, but they are flat. That is means the pseudoCDS is not the child of the pseudogene.
Gene and tRNA with pseudo tag have been convert to pseudogene and pseudotRNA, but they are not nested either. The converted pseudotRNA are not the child of the pseudogene.

Like to know if there is any other parameter, or any other method that I can use to convert both gene and pseudogene correctly.

sample file

http://www.ncbi.nlm.nih.gov/nuccore/FR823391
http://www.ncbi.nlm.nih.gov/nuccore/GL636509

sample codes

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::Seq::SeqBuilder;
use Bio::Species;
use Bio::Annotation::SimpleValue;
use Bio::SeqFeature::Tools::Unflattener;

my $seq_in = Bio::SeqIO->new( '-file' => "<$ARGV[0]", '-format' => 'Genbank' );
my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;

while ( my $bioperlSeq = $seq_in->next_seq() ) {
    if ( !( $bioperlSeq->molecule =~ /rna/i ) ) {
        $unflattener->error_threshold(1);
        $unflattener->report_problems(*STDERR);
        $unflattener->unflatten_seq(
            -seq       => $bioperlSeq,
            -use_magic => 1
        );

        my @topSeqFeatures = $bioperlSeq->get_SeqFeatures;

        $bioperlSeq->remove_SeqFeatures;

        foreach my $bioperlFeatureTree (@topSeqFeatures) {
            my $type = $bioperlFeatureTree->primary_tag();
            if ( ( $bioperlFeatureTree->has_tag("locus_tag") ) ) {
                my ($cID) = $bioperlFeatureTree->get_tag_values("locus_tag");
                print STDERR "\nprocessing $cID...\n";
            }

            if (   $type =~ /pseudogene/i
                || $type =~ /pseudotRNA/i
                || $type =~ /pseudoCDS/i )
            {
                foreach my $subPseuFeat ( $bioperlFeatureTree->get_SeqFeatures )
                {
                    my $subPseuType = $subPseuFeat->primary_tag();
                    print STDERR "  subPseuType = $subPseuType\n";
                    my @pseuLocs = $subPseuFeat->location;
                    foreach my $pseuLoc (@pseuLocs) {
                        my $pseuLocStart = $pseuLoc->start();
                        my $pseuLocEnd   = $pseuLoc->end();
                        print "    pseuLoc location : $pseuLocStart ... $pseuLocEnd\n";
                    }
                }
            }
            print STDERR "\$type = $type\n";
            foreach my $subFeat ( $bioperlFeatureTree->get_SeqFeatures ) {
                my $subType = $subFeat->primary_tag();
                print STDERR "  subType = $subType\n";
                foreach my $subSubFeat ( $subFeat->get_SeqFeatures ) {
                    my $subSubType = $subSubFeat->primary_tag();
                    print STDERR "      subSubType = $subSubType\n";
                    if ( $subSubType eq "CDS" || $subSubType eq "exon" ) {
                        my @CDSLocs = $subSubFeat->location;
                        foreach my $CDSLoc (@CDSLocs) {
                            my $subSubStart = $CDSLoc->start();
                            my $subSubEnd   = $CDSLoc->end();
                            print STDERR
"            location: $subSubStart .... $subSubEnd\n";
                        }
                    }
                }
            }
        }
    }
}

I am using bioperl_live/1.6.9,
$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.0069

Everything is fine when I use bioperl version 1.4. Not sure what is changed for Bio::SeqFeature::Tools::Unflattener between 1.4 and 1.6.9.

The different outputs with above code in different bioperl version 1.6.9 and 1.4.

####### BioPerl 1.6.9 #########
processing NCLIV_047911...
$type = pseudogene
processing NCLIV_047911...
$type = pseudoCDS

####### BioPerl 1.4 #########
processing NCLIV_047911...
$type = gene
subType = mRNA
subSubType = CDS
location: 2628722 .... 2629917
subSubType = exon
location: 2628722 .... 2629792
subSubType = exon
location: 2629795 .... 2629917

Thanks.

Legacy tickets

Old Redmine tickets have been migrated to GitHub in a separate repo for now:

https://github.com/bioperl/bioperl-live-redmine/issues

These may or may not be merged with this tracker; the original migration was a little problematic due to limitations of the Github REST API.

bp_seqcut.pl has ^M at the end of each line

If you encounter an error when running bp_seqcut such as

": No such file or directory"

it is due to "^M" at the end of each line. Perhaps it was prepared in a DOS environment and has been exported as it was. One can edit the perl source to remove it or convert it through dos2unix.

Rounding error causes obscure fail in MapTiling

The following code gives an exception (below) when run on blast output at http://pastebin.com/kCpnXyTi. Problem is that 1141 < 1141.00001...

use strict;
use warnings;
use Bio::SearchIO;
use Bio::Search::Tiling::MapTiling;
$SIG{__DIE__} = sub { print $_[0] };
my $searchio = Bio::SearchIO->new( -format => 'blast',
  -file   => shift );

my $gene;

while ( my $result = $searchio->next_result() ) {

    $result->query_name =~ /(\S+_\S+)_i/;
    my $g = $1;

    while( my $hit = $result->next_hit ) {

my $tiling = Bio::Search::Tiling::MapTiling->new($hit);
my $hsp    = $hit->next_hsp;

my $st    = $hsp->query->strand == "1" ? "p" : "m";
my $frame = $hsp->query->frame;
my $context = $st.$frame;

print $tiling->frac_aligned('query', 'exact', $context)."\n";
    }
}

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Start/stop out of range [824, 1141]
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.20.1/Bio/Root/Root.pm:449
STACK: Bio::Search::HSP::HSPI::matches_MT /usr/local/share/perl/5.20.1/Bio/Search/Tiling/MapTileUtils.pm:583
STACK: Bio::Search::Tiling::MapTiling::_calc_stats /usr/local/share/perl/5.20.1/Bio/Search/Tiling/MapTiling.pm:1163
STACK: Bio::Search::Tiling::MapTiling::length /usr/local/share/perl/5.20.1/Bio/Search/Tiling/MapTiling.pm:508
STACK: Bio::Search::Tiling::MapTiling::num_aligned /usr/local/share/perl/5.20.1/Bio/Search/Tiling/MapTiling.pm:694
STACK: Bio::Search::Tiling::MapTiling::frac_aligned /usr/local/share/perl/5.20.1/Bio/Search/Tiling/MapTiling.pm:668
STACK: blParse.pl:40
-----------------------------------------------------------

Invalid EMBL files generated in rare circumstances; line wrapping

If you craft a tag on a feature sneakily (or if you are unlucky)
Bio::SeqIO will create invalid EMBL, separating the "/" from the
qualifier name:

ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 4 BP.
XX
AC   unknown;
XX
XX
XX
FH   Key             Location/Qualifiers
FH
FT   CDS             1..4
FT                   /
FT                   note="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
FT                   X"
XX
SQ   Sequence 4 BP; 1 A; 1 C; 1 G; 1 T; 0 other;
     actg                                                                      4
//

In this example "/" and "note" are on separate lines, which is wrong; at
least BioPerl does not accept it itself.

Here is a script to create the above output (BioPerl 1.6.901 used):

#!/usr/bin/perl

use strict;
use warnings;

use Bio::Seq::RichSeq;
use Bio::SeqFeature::Generic;
use IO::String;
use Bio::SeqIO;

my $seq=Bio::Seq::RichSeq->new(-display_id=>'TEST', -seq=>'actg');
my $cds=Bio::SeqFeature::Generic->new(-primary_tag=>'CDS', -start=>1, -end=>4);
$cds->add_tag_value(note=>'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX X');
$seq->add_SeqFeature($cds);

my $string;
my $str=IO::String->new($string);
my $io=Bio::SeqIO->new(-fh=>$str, -format=>'embl');
$io->write_seq($seq);
print $string;

Changing the position of the space in the note makes a/the difference.

File open for 'command |' not working

In older versions of Bio/Root/IO.pm the following line was used to open files:

open ($fh,$file) || $self->throw("Could not open $file: $!");

In newer versions, the following line is used with $mode defaulting to "<".

open $fh, $mode, $file or $self->throw("Could not $action file '$file': $!");

So, for example, when we use:

$fastq_obj = Bio::SeqIO->new( -file => "gunzip -c myfile.gz |" , -format => 'Fastq');

This becomes:

open $fh, '<', 'gunzip -c myfile.gz |'

causing the following error:

Could not read file 'gunzip -c myfile.gz |': No such file or directory

Bio::SeqIO::embl parsing DT lines: regex fails to catch release update number.

In the package Bio::SeqIO::embl the two regex lines 305 and 311 may have ambiguous modifiers. The aim of these regex is to parse the DT lines from embl files, for example:
DT 29-SEP-2015 (Rel. 1, Created)
DT 29-SEP-2015 (Rel. 1, Last updated, Version 1)

the two regex use the xms modifiers. The modifier x implies the protection of whitespace characters in the regex. This is not the case in the two regex line 305 and 311:
if ($version =~ /(Rel. (\d+), Created)/xms ) => is always false.

Thus the modifier x should be removed (I am not convinced by the interest of modifiers ms too) or it can be kept by adding backslashes before all whitespaces.

memory leak

Hello all,

I've found a memory leak in BioPerl due to a circular reference, but can't fix it myself. Here's how to reproduce:

use Devel::Cycle;
use Bio::Species;
find_cycle(Bio::Species->new(-classification => ['A']));

Sincerely,
Dmitry

Issues with cutters from Bio::Restriction::Analysis

On my Mac with the latest version of Bioperl 1.6.923, Bio::Restriction::Analysis's cutters(2,50) will return unique cutters when it should return nothing. Following is my testing code.

use strict;
use warnings;
use Bio::Restriction::Enzyme;
use Bio::Restriction::Analysis;
use Bio::Seq;
use Data::Dumper;

my $seq = Bio::Seq->new(
    -display_id => 'foo',
    -seq => 'aaaaGGATCCaaa',
);

my $re = Bio::Restriction::Enzyme->new(
    -enzyme => 'BamHI',
    -seq => 'G^GATCC',
);

my $ra=Bio::Restriction::Analysis->new(
    -seq=>$seq,
    -enzymes=>$re,
);

my $bug = $ra->cutters(2,50);

print $_->name for $bug->each_enzyme;

### out-put

BamHI

Missing files in git?

Hi -

I am starting to look into speeding up EnsEMBL's Variant Effect Predictor and this led me to BioPerl, this git repository, and specifically Bio::DB::Fasta.

In trying to run:

perl Build.PL

and then:

perl ./Build test

files in the Bio::Root module are missing. But oddly they are there when I install via:

cpan Bio::Perl

And an older version of Bio::Root is distributed as part of the Variant Effect Predictor code version 78. So what's going on?

The merge request #92 (commit c7a88f9) tries to address this by using the files version 1.006924. This is the latest version, right?

If there are newer versions of the files in Bio::Root, of course, I'd like know where they are so I can use them in the upcoming changes I plan on making.

Build.pm absent

perl Build.PL
Can't locate Bio/Root/Build.pm in @inc (@inc contains: ./bin/bioperl-live etc)

Current v1.6.x branch build at Travis-CI fails because of Travis

I'm not sure how to fix this. The last build for v1.6.x branch on Travis failed specifically on the Perl 5.20 job because the command "sudo apt-get install libdb-dev graphviz libgd2-xpm-dev libxml2-dev 2>&1 | tail -n 4" from ".travis.yml" failed.

To compare, this is the equivalent Perl 5.20 master branch output that ends with success:

[0K$ sudo apt-get install libdb-dev graphviz libgd2-xpm-dev libxml2-dev 2>&1 | tail -n 4
Setting up libdb5.1-dev (5.1.25-11build1) ...
Setting up libdb-dev (5.1.4ubuntu1) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place
travis_time:end:078dbcfc:start=1450057084685507506,finish=1450057101523671501,duration=16838163995

While on the Perl 5.20 v1.6.x branch:

[0K$ sudo apt-get install libdb-dev graphviz libgd2-xpm-dev libxml2-dev 2>&1 | tail -n 4
Failed to fetch http://ppa.launchpad.net/ubuntugis/ppa/ubuntu/pool/main/libx/libxml2/libxml2-dev_2.8.0+dfsg1-5ubuntu2.2~precise1_amd64.deb  Unable to connect to ppa.launchpad.net:http:
Failed to fetch http://ppa.launchpad.net/ubuntugis/ppa/ubuntu/pool/main/libx/libxml2/libxml2_2.8.0+dfsg1-5ubuntu2.2~precise1_amd64.deb  Unable to connect to ppa.launchpad.net:http:
Fetched 3,047 kB in 10s (303 kB/s)
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
travis_time:end:01c3a570:start=1450057546701647944,finish=1450057560789409388,duration=14087761444

This is causing "t/LocalDB/Taxonomy/sqlite.t" to fail to load "Bio::DB::Taxonomy::sqlite" :(

Cheers,

parsing hmmer results: rare situation causes parsing failure

Hi there,

I've just started playing with parsing some hmmer output files (hmmsearch), and have run across a rare situation that causes the parser to choke. I think the issue arises where the subject sequence has a stop codon that gets wrapped to the first position in an alignment row. Stop codons in the middle of a row seem to do just fine. Hopefully it'll be easy to fix. I updated my Bioperl a few minutes ago using git.

I've attached a screenshot of the alignment, and will see if I can attach the full hmmer output file here.

I'm parsing with a very simple script for now that looks like this:

!/usr/bin/perl

use Bio::SearchIO;
my $hmmerIN = Bio::SearchIO->new(-file => "< ORF1align4.edit.fa.hmm.hmmsearchVSproblemSeq", -format => "hmmer");
while( my $result = $hmmerIN->next_result ) {
print $result->query_name(), " for HMM ", $result->hmm_name(), "\n";
}

and it crashes on this particular alignment with this: error
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE E-IWDYVKRPNLRLIGVPESDVENGTKLENTLQDIIQENFPNLARQANVQIQEIQRTPQRYSSRRATPRHIIVR/ at /home/jayoung/malik_lab_shared/perl/bioperl-live/Bio/SearchIO/hmmer3.pm line 721, line 39.

all the best,

Janet Young

Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fhcrc.org

$[some feature object] -> seq; throws a fatal error in the case of a gene that wraps around the ends of a linearized chromosome

In the linear representation of a circular chromosome, sometimes a gene wraps around the ends of the numbering and therefore is annotated with coordinates like
join(2006035..2007700,1..257)

I noticed this error when working with
accession number NC_008309.1

In this case, even a simple
$[feature] -> seq will give a fatal error

MSG: start [2006035] is greater than end [257].
If you want to truncated and reverse complement,
you must call trunc followed by revcom. Sorry.
STACK Bio::PrimarySeqI::trunc /usr/local/share/perl/5.14.2/Bio/PrimarySeqI.pm:456
STACK Bio::SeqFeature::Generic::seq /usr/local/share/perl/5.14.2/Bio/SeqFeature/Generic.pm:705

Additionally, attempting to skip this case by checking
(location -> start) > (location -> end)
doesn't work, because start and end are given as
1 and 2007700 respectively.

I was able to find a workaround in my specific case, but the fix is somewhat of an inelegant hack, and this probably should be fixed on the side of the module itself.

parsing segmented Genbank records

Hi there,

I'm parsing a whole bunch of Genbank records to get CDS sequences, and found one weird record that messes up my pipeline. The Genbank accession is S81162. It turns out it's a segmented record (the CDS joins four regions from four different Genbank entries). Reading the wiki, it seems like Bioperl should be able to recognize this, but I think maybe the code no longer parses that part of the Genbank record? Details are below.

I'd like to just check for segmented records and skip them so they don't throw my code and I can still parse all the other records in the same file (I don't need every single CDS - I think segmented records will be rare).

From the help given here: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Getting_the_Annotations it looks like I should be able to use code like this:

    my $anno_collection = $seq_obj->annotation;

and

    $anno_collection->get_all_annotation_keys;

to recognize that this record has a SEGMENT annotation. However, it actually looks like the SEGMENT annotation is ignored when the Genbank record is parsed. Shouldn't it have come through into $anno_collection? I updated my bioperl from github this morning.

I guess I can parse each record outside of bioperl before I go into Bioperl, but it'd be great if I can just use Bioperl to get at those SEGMENT annotations. Does that seem easy to implement (or re-implement - from the HOWTO page it looks like it was something that used to be possible)?

Some code that hopefully shows what I mean is below.

thanks very much,

Janet Young

Dr. Janet Young
Malik lab
http://research.fhcrc.org/malik/en.html
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.
tel: (206) 667 4512

email: jayoung ...at... fhcrc.org

#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;
use Bio::DB::EUtilities;

##### get the troublesome sequence from Genbank:
my $file = "S81162.gb";
if (!-e $file) {
    my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                           -db      => 'protein',
                                           -rettype => 'gb',
                                           -email   => '[email protected]',
                                           -id      => "S81162");
    $factory->get_Response(-file => $file);
}

#### parse it:
my $seqIN = Bio::SeqIO->new(-file => "$file", '-format' => 'Genbank');
while (my $seq = $seqIN ->next_seq) {
    ##### look at the annotations - SEGMENT is not captured
    my $anno_collection = $seq->annotation;
    for my $key ( $anno_collection->get_all_annotation_keys ) {
        my @annotations = $anno_collection->get_Annotations($key);
        for my $value ( @annotations ) {
            print "tagname : ", $value->tagname, "\n";
            print "  annotation value: ", $value->display_text, "\n\n";
        }
    }
}

Speeding up DB::Fasta::subseq

This follows Pull request #96.

The background is basically that in running Variant Effect Prediction, a significant portion of time is spent in the stripping carriage returns and line feeds in the subseq method via:

$data =~ s/\n//g;
$data =~ s/\r//g

Compiling the match portion in a regular expression outside of the function helps, but better is doing it in C.

This commit I hope will be acceptable, and if not should serve as a concrete discussion for further work.

In that commit, in the Inline-C-Fasts branch of my fork, if module Inline::C is around, that will compile use a C function that overwrites another function of the same name which is a pure Perl function.

Bio::Align::Utilities::aa_to_dna_aln does not fix display_id

Bio::Align::Utilities::aa_to_dna_aln converts an alignment from protein to DNA space. To do this, it uses an hash that maps the aligned proteins display_id to a Bio::PrimarySeq of the DNA sequence. However, the returned alignment retains the display_ids of the proteins instead of the DNA sequences.

The only reason I see to do this is that display_id attribute is not required to create a Bio::PrimarySeq but the same is true for the seq attribute.

This seems easy to fix but before I do so, I wonder if it's by design. Can someone comment on this? Other attributes could also be transferred from the DNA sequence. I found desc and display_id to be the most important for me since I then write the alignments back to fasta.

At the moment, I'm working around this via:

my $mrna_aln = Bio::Align::Utilities::aa_to_dna_aln($aa_aln, \%seqs);
for my $aln_seq ($mrna_aln->each_seq)
  {
    $mrna_aln->remove_seq($aln_seq);

    my $mrna = $seqs{$aln_seq->display_id};
    $aln_seq->display_id($mrna->display_id);
    $aln_seq->desc($mrna->desc);

    $mrna_aln->add_seq($aln_seq);
  }

but I reckon this would be a bit more efficient if this changed in bioperl.

t/Root/RootIO.t

t/Root/RootIO.t should be skipped when no network connections.

Just like other tests requiring network.

BEGIN {
    test_begin(
        -requires_networking => 1
    );
}

bioperl-guts-l notifications

Would like to get bioperl-guts-l notifications for github projects. Will look into this.

CONTIG lines in GenBank files

Per the mail list:

http://lists.open-bio.org/pipermail/bioperl-l/2014-May/071579.html

Example file:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Streptomyces_avermitilis_MA_4680_uid57739/NC_004719.gbk

Can't query website: 500 Can't connect to www.ncbi.nih.gov:80 (Bad hostname)

I get this error:

Can't query website: 500 Can't connect to www.ncbi.nih.gov:80 (Bad hostname)
..
STACK: Bio::DB:Taxonomy::entrez::get_taxonids /usr/share/perl5/Bio/DB/Taxonomy/entrez.pm:472

when trying to use the get_Taxonomy_node() method from Bio::DB::Taxonomy. When I modified the value of $EntrezLocation in entrez.pm file line 472 from 'http://www.ncbi.niv.gov/entrez/eutils/' to 'http://www.ncbi.nlm.niv.gov/entrez/eutils/' (added the nlm subdomain) it then seems to work fine.
I guess ncbi changed its url or something?

Why are Bio::DB::Taxonomy::flatfile indexes being deleted every time?

Such as here: https://github.com/bioperl/bioperl-live/blob/master/Bio/DB/Taxonomy/flatfile.pm#L531-L534

Re-computing them from the NCBI taxonomy dump takes quite a bit of time so it would be nice to be able to cache them.

Bio::Tools::CodonTable missing entries for translation tables 24 and 25

Prokka users have alerted me to the fact that there are two new translation tables which bioperl does not support (tables 24 and 25)

The tables are officially defined here: http://www.insdc.org/documents/feature_table.html#7.4.5

I had a look at the source, and my only concern in adding the new ones is there already is a "magic" table 24 called "Strict" which could break something.

If I attempt to use table 25 with translate(-codontableid=>25) I get the expected error "MSG: Not a valid codon table ID [25]".

The relevant Prokka issue is tseemann/prokka#42

Test issue

Testing whether this goes to bioperl-guts-l

RPSBLAST parsing error

We have just noticed a parsing error when dealing with RPSBLAST output. Previous versions of BioPerl could parse the same output with no trouble.

We narrowed down the cause of the problem to a single regex in Bio/SearchIO/blast.pm

But I am getting ahead of myself. The problem first: an error is thrown if the alignment contains something like the following:

Query  61   VGILADLQGPKIRLGR--------------------------------------------  76
              I  DL GPK+R G
Sbjct  199  CRIAMDLAGPKLRTGPIAPGPRVIKLRPTRDALGRVLTPARLWLTASESPPPSPPPGPVG  258

Query       ------------------------------------------------------------

Sbjct  259  LPVDPEWLARLEPGDELRFTDARGKKRKLTVTEVDDEGVLAEGSQTAYLANGTLLRLGRH  318

Query  77   ---------FTEGPVLLERGDTFTITVEDGA------EGDGHMCGTTYAGLADDVTPGER  121
                       E  + L+ GD   +T +D        +        T          GER
Sbjct  319  DSTRVGGLPPVEQKLRLKVGDRLVLTRDDAPGDPAQGDAPPARISCTLPEAFRAARVGER  378

As you can probably see, the second part of that output snippet contains a query line that is exclusively gaps (the dashes), and therefore contains no sequence coordinates, as would be logical in our opinion. Trying to parse BLAST results containing a line like that results in:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: no data for midline Query       ------------------------------------------------------------
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:368
STACK: Bio::SearchIO::blast::next_result /usr/local/share/perl5/Bio/SearchIO/blast.pm:1805

We dug a bit and found that the problem is caused by the following change in Bio/SearchIO/blast.pm (the file's header contains the following info at the top: # $Id: blast.pm 16123 2009-09-17 12:57:27Z cjfields $), as shown by part of the diff output comparing this file and an older one (without any such info at the top), from a version of BioPerl that did not throw the error:

1869c1790
<                 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {

---
>                 if (/^((Query|Sbjct):?\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/) {

As can be seen, the difference between the two versions is the ? quantifier after the pieces of code that match the sequence coordinates, i.e. (\-?\d+). Without the ?, we understand that the match to the coordinates part of the output became obligatory, that is, no match at all without the presence of that part in the line being examined.

Although uncommon, that piece of (RPS)?BLAST output seems to be perfectly legitimate, and it causes BioPerl to die. Thus, we believe this is a bug.

Please let me know if you need further information that would be of help here.

Problems with running bp_genbank2gff3.pl

Hi,

Here is the error I keep getting whenever I run this script:

Can't locate YAML.pm in @INC (@INC contains: blib/lib /Users/Areej/bioperl-live /sw/lib/perl5 /sw/lib/perl5/darwin /Library/Perl/5.16/darwin-thread-multi-2level /Library/Perl/5.16 /Network/Library/Perl/5.16/darwin-thread-multi-2level /Network/Library/Perl/5.16 /Library/Perl/Updates/5.16.2/darwin-thread-multi-2level /Library/Perl/Updates/5.16.2 /System/Library/Perl/5.16/darwin-thread-multi-2level /System/Library/Perl/5.16 /System/Library/Perl/Extras/5.16/darwin-thread-multi-2level /System/Library/Perl/Extras/5.16 .) at /usr/local/bin/bp_genbank2gff3.pl line 224. BEGIN failed--compilation aborted at /usr/local/bin/bp_genbank2gff3.pl line 224.

I tried installing YAML but it didn't help.

Thanks,
Areej

Bio::DB::EUtilities sometimes retrieves features (CDS, etc) and sometimes does not

Hi again,

I've found an odd issue with Bio::DB::EUtilities. I used it to download a bunch of annotated genomic sequences from Genbank. I realised that for some Genbank entries, it gives me the features I expect to see (CDS, mRNA, etc), but for others, no features are returned, even though I know there should be some.

If I get the sequences a different way, using Bio::DB::GenBank, I get the correct number of features.

The code below should demonstrate what I'm talking about - check out the output files for Genbank accession GG662498: GG662498.gb.DB.EUtilities (no features) and GG662498.gb.DB.GenBank (has features). In contrast, the output for accession GG662497 looks fine using either method. Odd.

thanks very much,

Janet Young

Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fhcrc.org

!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;
use Bio::DB::EUtilities;
use Bio::DB::GenBank;

my @ACCS = ("GG662497","GG662498");

get seqs using the Bio::DB::EUtilities method - code based on this: http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook. This gets features for GG662497 but not GG662498

foreach my $acc (@ACCS) {
my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
-db => 'nucleotide',
-rettype => 'gb',
-email => '[email protected]',
-id => $acc);
my $outfile = "$acc.gb.DB.EUtilities";
$factory->get_Response(-file => $outfile);
}

get seqs using the Bio::DB::GenBank method. This works, gets all features for both seqs

my $gb = Bio::DB::GenBank->new();
foreach my $acc (@ACCS) {
my $seq = $gb->get_Seq_by_id($acc);
my $seqOUT = Bio::SeqIO->new(-file=>"> $acc.gb.DB.GenBank", -format=>"genbank");
$seqOUT->write_seq($seq);
}

spliced_seq problem: parsing reverse strand CDS seqs wrongly

Hi there,

I've been parsing a bunch of Genbank files to get the CDS sequences out. Should be simple, right? I think my script is OK, and I think I'm using a current version of Bioperl, but for spliced CDSs on the reverse strand I get the wrong sequence. I'm not sure if this is a bug in Bioperl, or that Genbank sometimes stores things in a inconsistent ways. This seems like a pretty major issue.

The problem is with CDS that are specified in the Genbank entry like this (example from fourth CDS in Genbank AC004770):
CDS complement(join(100032..100206,100389..100477,
100743..100821,100896..101031,101437..101510,
101822..101939,102474..102585,102855..103055,
103442..103532,103887..103997,104187..104263,
106163..106348))
From the sequence I get out when I use spliced_seq on this, it looks like Bioperl is getting each of those exons using the coordinates, complementing each one and then joining them together. This gives me a sequence where all the exons are present with correct strand and sequence, but are joined together wrongly, with the last exon first, and preceding towards the first exon.

I want the true CDS with the exons in the right order - seems like the simplest way to do this would be to get each exon using the specified coords, join them together, and then reverse-complement the whole thing (i.e. should join exons then complement, rather than complement exons then join).

Hope this makes a little sense to you! Below is an example script, which pulls out BAC sequence AC004770 and its CDSs. A BLAST search shows that the fourth CDS (AC004770_CDS_4) is extracted wrongly. A BLAST search back to the original BAC sequence shows that AC004770_CDS_4 has joined the exons together in the reverse order.

I'm using a version of Bioperl I got from github on Friday Oct 24th.

thanks very much,

Janet Young

Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fhcrc.org

!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;
use Bio::DB::GenBank;

get an example seq, and write the whole thing to a fasta file:

my $gb = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('AC004770');
my $BACseq = Bio::SeqIO->new(-file=>"> BACseq.fa", -format=>'fasta');
$BACseq->write_seq($seq);

get the CDS seqs and write them to fasta file

my $CDSseqs = Bio::SeqIO->new(-file=>"> CDSseqs.fa", -format=>'fasta');
my $CDScounter = 1;
for my $feat_object ($seq->get_SeqFeatures) {
my $tag = $feat_object->primary_tag;
if ($feat_object->primary_tag eq "CDS") {
my $splicedSeq = $feat_object->spliced_seq;
my $CDSname = "AC004770_CDS_$CDScounter";
$splicedSeq->display_id($CDSname);
$CDScounter++;
$CDSseqs->write_seq($splicedSeq);
}
}

Using bioperl-live with travis-ci

Hi,

I'm trying to set up a .travis.yml for a repository that needs bioperl-live (and bioperl-run) but when I give 'BioPerl' as a dependency in this repo's Makefile.PL travis tries to get all the dependencies that CPAN's version of bioperl-live suggests, including DB_File - which fails. Is there some other way that this can be done?

Thanks,

Rutger

Bio::SearchIO::infernal hangs on results from newer Infernal versions

Parsing Infernal 1.1.1 output through Bio::SearchIO::infernal seems to hang on some kind of infinite loop when attempting to read it in. I've attached a short Infernal output file that can be used to reproduce the error.

I think this is the same issue as reported here:
http://bio.perl.org/pipermail/bioperl-l/2012-August/036710.html

Migrate issues from Redmine

Migrate tickets from OBF Redmine (now in read-only status) to Github Issues. We only care about open tickets, no others are needed.

Mismatched output in bp_taxid4species

The bp_taxid4species script produces incorrect output.

$ bp_taxid4species "Arabidopsis thaliana" "Homo sapiens"
"arabidopsis thaliana[All Names]", 9606
"homo sapiens[All Names]", 3702

However the taxon id of A. thaliana should be 3702 and humans should be 9606.

The bug stems from a false assumption about Entrez output. As shown below

$ base='http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
$ term='Arabidopsis+thaliana+OR+Glycine+max'
$ wget -qO /dev/stdout "${base}?db=taxonomy&term=${term}" | xmlstarlet fo
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>3847</Id>
    <Id>3702</Id>
  </IdList>
  <TranslationSet/>
  <TranslationStack>
    <TermSet>
      <Term>arabidopsis thaliana[All Names]</Term>
      <Field>All Names</Field>
      <Count>1</Count>
      <Explode>N</Explode>
    </TermSet>
    <TermSet>
      <Term>glycine max[All Names]</Term>
      <Field>All Names</Field>
      <Count>1</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>OR</OP>
  </TranslationStack>
  <QueryTranslation>arabidopsis thaliana[All Names] OR glycine max[All Names]</QueryTranslation>
</eSearchResult>

Here I am using the same entrez query used in the bp_taxid4species. The terms in and the terms in the are not in the same order. There is simply no way to map scientific names to taxon ids using this XML output.

One option would be to query entrez once for each species name. However this is slow.

A better option would be to extract the ids in IdList and then perform a second entrez query using the efetch or esummary utilities against the taxonomy database.

Bug in _codon_is method of Bio/Tools/CodonTable.pm file

The for loop doesn't check all the codons possible of " $self->unambiguous_codons($value)" list but only the first one due to the return statement.
Example: Using the codon "AYG" will give you the list (ACG ATG) but only ACG will be tested and the method will return 0.

Instead of:

sub codon_is {
my ($self, $value, $table, $key ) = @;

return 0 unless length $value == 3;

$value = lc $value;
$value =~ tr/u/t/;

my $id = $self->{'id'};
for my $c ( $self->unambiguous_codons($value) ) {
my $m = substr( $table->[$id], $CODONS->{$c}, 1 );
return 0 unless $m eq $key;
}
return 1;
}

We should write:

sub codon_is {
my ($self, $value, $table, $key ) = @;

return 0 unless length $value == 3;

$value = lc $value;
$value =~ tr/u/t/;

my $id = $self->{'id'};
for my $c ( $self->unambiguous_codons($value) ) {
my $m = substr( $table->[$id], $CODONS->{$c}, 1 );
if ($m eq $key) { return 1; }
}
return 0;
}

Update Travis-CI scripts

We need to update the current Travis-CI workflow to use the new containerized infrastructure:

https://docs.travis-ci.com/user/migrating-from-legacy

This will likely require some debugging, might be best done on a branch. Any takers?

Related to #134

IUPAC module generated regexs will not match Ns in the target

For example this returns False

-------GTGYCAGCMGCCGCGGTAA--------            #Probe
TAACTCCGNGCCAGCAGCCNCGGTAATACGGAGG            #Reference seq

Issues with bioperl versioning

Blocker on new 1.6/1.7 releases, see:

andk/pause#75

Example attachment

Testing simple text attached files
blorg.txt

Current bioperl-live won't commence Build

Just tried to build latest live 01ec10d and got the error below.

I have File::Copy 2.09 from Perl 5.14 (Ubuntu Server 12) but latest 2.29 is part of Perl 5.20 core which I can only force install.

% git clone https://github.com/bioperl/bioperl-live.git
% cd bioperl-live
% perl ./Build.PL
% ./Built test

'blib/script/bp_chaos_plot.pl' and 'blib/script/bp_chaos_plot.pl' are identical (not copied) at /usr/share/perl5/Bio/Root/Build.pm line 219
Use of uninitialized value $atime in utime at /usr/share/perl/5.14/File/Copy.pm line 393.
Use of uninitialized value $mtime in utime at /usr/share/perl/5.14/File/Copy.pm line 393.
Can't rename 'blib/script/bp_chaos_plot.pl' to 'blib/script/bp_chaos_plot.pl': No such file or directory at /usr/share/perl5/Bio/Root/Build.pm line 219.

Bio::AlignIO::stockholm creates improperly formatted mark-up lines

Within the Bio::AlignIO::stockholm module, when creating the mark-up lines within sub write_aln, in cases where the sequence accession exceeds the arbitrary size limit of 22 characters (the Stockholm specification states "No size limits on any field"), the 'feature' text becomes joined to the 'seqname' text. This causes hmmbuild to error out.

Line 600:

my $text = sprintf("%-4s%-22s%-3s%s\n",$seq_ann,
                   $aln->displayname($seq->get_nse), $tag, $acc);

If:

$seq_ann = "#=GS "
$tag = "AC"
$acc = "accession"
$aln->displayname($seq->get_nse) = "1234567890123456789012345"

then:

$text = "#=GS 1234567890k123456789012345AC accession"

where the proper output should be:

$text = "#=GS 1234567890k123456789012345 AC accession"

The simple fix is to enforce spaces within the sprintf command:
Line 600:

my $text = sprintf("%-4s % -22s %-3s %s\n",$seq_ann,
                   $aln->displayname($seq->get_nse), $tag, $acc);

Which would yield:

$text = "#=GS  1234567890k123456789012345 AC  accession"

This will parse even with the extra whitespace.

Cheers,
Bill

Bio::DB::Taxonomy::flatfile needs rewrite

We need to rethink the index structure here now that the number of taxonomy items is so large, DB_File hash interface is not working for so many entries. Can we explore either BerkeleyDB or perhaps a NOSQL option that can still run as a flatfile indexed file?

SimpleAlign.pm needs file input named a particular way?

Admittedly we're using galaxy to run a pipeline that draws on SimpleAlign.pm, but Galaxy is throwing this error:
Argument "2087/1" isn't numeric in numeric comparison (<=>) at /opt/galaxy/.perlbrew/libs/perl-5.18.1@galaxy/lib/perl5/Bio/SimpleAlign.pm line 777, line 132.

The paired input files are:
07-2087_R1.fastq.gz
07-2087_R2.fastq.gz

Is it the file naming scheme that is causing this error? Issue is that unless files are coming straight from a miseq, they're often renamed. Could the file naming check done (I believe it is the _alpha_startend {} function) be improved? Possibly by using a regular expression (multiline and ignore caps flags set) like the following to identify first and second pair file names if the strings were sorted in \n delimited list?:
^((.+)((1)|(f))(.*))\n((\2)(?(4)2)(?(5)r)(\6))$
This will match both [pref]1[suf] [pref]2[suf] and [pref]f[suf] [pref]r[suf] pairs with one regular expression.

e.g. test data:
a1
a2
a.f
a.r
af.1
af.2
ar
ar1
ar2

af1
dfsdf
ar1

a_1f2
a_1R2

a1f
a1r

Regards,
Damion

Bio::Align::IO::maf alignment coordinates on revcom are wrong

Hi,

using Bio::AlignIO::maf to parse maf generated by the latest LASTZ I get wrong coordinates for features on the negative strand. The MAF output of LASTZ reports is inline with the standard's definition at the UCSC site: https://genome.ucsc.edu/FAQ/FAQformat.html#format5, i.e:
start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is '-' then this is the start relative to the reverse-complemented source sequence

I don't know whether there are other MAF flavors that have been used to develop the module, if so an additional change like a switch to toggle parsing behavior would be needed.

Here is the patch that worked for me:

--- /home/lang/tmp/maf.pm       2016-05-03 12:07:21.243467000 +0200
+++ /usr/share/perl5/Bio/AlignIO/maf.pm 2016-05-03 14:09:40.355467000 +0200
@@ -172,8 +172,8 @@
     $strand = $strand eq '+' ? 1 : $strand eq '-' ? -1 : 0;
        my $seq = Bio::LocatableSeq->new('-seq'          => $text,
                                         '-display_id'   => $src,
-                                        '-start'        => $start,
-                                        '-end'          => $start + $size - 1,
+                                        '-start'        => $strand > 0 ? $start                : ($srcsize-($start+$size-2)),
+                                        '-end'          => $strand > 0 ? ($start + $size - 1)  : ($srcsize-($start-1)) ,
                                         '-strand'       => $strand,
                                         '-alphabet'     => $self->alphabet,
                                        );

Best,
Daniel

GenBank parsing CONTIG issues

See:

http://mailman.open-bio.org/pipermail/bioperl-l/2014-September/088945.html

Basically, this works with the August patch for GenBank parsing (so the bug isn't there) but some change since 1.6.922 has caused parsing to slow dramatically. We'll need to bisect this.

Add Bio::Root back

Consensus (well, handful of 👍 on mail list) want this back in. Not sure how easily this can be done; @bosborne mentioned he could try taking it on?

translate( -orf => 1) does not return first full ORF as advertised

Colin wrote:

I've got an older application that I'm migrating from BioPerl 1.5 to
1.6.923, and it didn't pass testing due to a change in the behavior of
Bio:Seq::translate between 1.5 and 1.6

Per the bioperl documentation Seq->translate(-orf=> 1) should return
the first orf in a sequence. It appears to be returning the orf
upstream of the first stop codon, so an internal out of frame orf is
found instead of a full length orf.

Here's my test code. It passes in 1.5, fails in 1.6. Is there a
switch or different parameter to get the old behavior? Is this a bug
or intentional change in translate?

use strict;
use lib '/home/maj/Code/BioPerl/bioperl-live';
use Bio::Seq;

my ($bs, $prot, $orf0, $orf1, $orf) ;
$bs = Bio::Seq->new( -seq           => "ATGAATGTAAATAA",
                        -display_id       => "TestSequence",
                        -alphabet         => 'dna' );

$orf0 = $bs->translate(-frame=>0);
$orf1 = $bs->translate(-frame=>1);

# should output MNVN orf that starts in frame 0, not M* orf that starts in frame 1
$orf = $bs->translate(-orf=>1, -start=> 'atg');
if ($orf0->seq eq $orf->seq) {
        print "PASS";
      }
else {
    print $orf0->seq . " != " . $orf->seq . "\n";
  }

I replicated this with bioperl-live version of Bio::PrimarySeqI. the pod states

if you want translate() to find the first initiation
codon and return the corresponding protein:

$protein_seq_obj = $cds_seq_obj->translate(-orf => 1);

but the orf-finding routine appears to stop at the first termination
codon (which is at bp 8 - TAA in the example) and kicks out the orf M*.

bioperl / bioperl-live Goto Github PK

bioperl-live's People

Stargazers

Watchers

Forkers

bioperl-live's Issues

!/usr/bin/perl

SEQUENCE MANAGEMENT MODULES

### out-put

!/usr/bin/perl

Janet Young

email: jayoung ...at... fhcrc.org

!/usr/bin/perl

get seqs using the Bio::DB::EUtilities method - code based on this: http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook. This gets features for GG662497 but not GG662498

get seqs using the Bio::DB::GenBank method. This works, gets all features for both seqs

!/usr/bin/perl

get an example seq, and write the whole thing to a fasta file:

get the CDS seqs and write them to fasta file

Recommend Projects

Recommend Topics

Recommend Org