
partitionfinder's People

Contributors

brettc, hdetering, marekborowiec, pbfrandsen, roblanf, wrightaprilm


partitionfinder's Issues

Order of clustering steps can differ on different machines, even in reruns

Running PF in rcluster mode can result in different orders of the clustering steps.

I started PF on one machine. Eventually it turned out that RAM was insufficient,
so to save time I copied the analysis folder to a machine with more RAM
and continued the analysis there.
When restarting PF on this data set, I expected it to reach the point
where it previously stopped without calling raxml again, since it can read all earlier results from its database. However, after a few hundred clustering steps it took a step it had not taken in the first run, so raxml was called in all subsequent clustering steps to evaluate a small number of subsets.

Potential cause: rounding errors, either between the two machines or introduced when writing to and reading from the database, could lead to this effect.

It's not critical, and it may be unavoidable.

tigger matrix

So I just realised something that could speed things up, but I don't think there's any point implementing it.

At the start of a kmeans tigger run, we calculate the pairwise compatibility matrix for the WHOLE ALIGNMENT. The site rates are then just column averages, right?

If that's right, then we never need to calculate anything with tigger again, because for any given subset, we should be able to extract the relevant bit of the matrix and just re-calculate the column averages.

At the moment we re-run tigger for every single new subset.
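If that holds, the shortcut can be sketched with numpy; the matrix and subset indices below are stand-ins, not PF's actual data structures:

```python
import numpy as np

# Stand-in for the whole-alignment pairwise compatibility matrix
# (hypothetical values; PF would compute this once with tigger).
full_matrix = np.arange(36, dtype=float).reshape(6, 6)

# Hypothetical subset: the site indices that belong to it.
subset_sites = np.array([0, 2, 5])

# Pull out the subset-vs-subset block and recompute column averages;
# no further tigger runs are needed for this subset.
sub = full_matrix[np.ix_(subset_sites, subset_sites)]
subset_rates = sub.mean(axis=0)
```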

Thoughts? Have I missed something?

I think this is only worth implementing if it turns out that entropies aren't so good after all. For that we need to do some benchmarking.

R

fix parser tests

A bunch of failing tests for the parser. I've looked, but can't figure it out (though I suspect it's simple).

@brettc, can you take a look?

R

tests/test_parser.py::test_one FAILED
tests/test_phyml.py::test_simple PASSED
tests/test_phyml.py::test_interleaved PASSED
tests/test_phyml.py::test_subset PASSED
tests/test_raxml.py::test_one PASSED
tests/test_raxml.py::test_parse_nucleotide FAILED
tests/test_raxml.py::test_parse_aminoacid FAILED
tests/test_submodels.py::test_consistency PASSED
tests/test_submodels.py::test_scheme_lengths PASSED
tests/test_subset.py::test_identity PASSED
tests/test_subset.py::test_overlap ERROR
tests/PF2/test_pf2.py::test_missing_sites_warning ERROR
tests/PF2/test_pf2.py::test_overlapping_blocks ERROR

generate likelihoods for "fabricated subsets" in the kmeans algorithm

The "fabricated subsets" feature requires that some sort of BIC score be assigned to subsets that we cannot analyze. To do this we must estimate the log likelihood for the subset as a whole. Since the definition of the fabricated subset is that raxml/phyml cannot analyze it, we don't have the subset log likelihood. In the first version of kmeans, we simply added up the site log likelihoods that we had conveniently generated for the clustering step, i.e.:

[screenshot: the subset log likelihood computed as the sum of the per-site log likelihoods]

This is no longer viable since we now use TIGER site rates rather than site likelihoods.
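For reference, the old PF1 approach can be sketched as follows; the per-site values and parameter counts are made up for illustration:

```python
import math

# Hypothetical per-site log likelihoods for a fabricated subset.
site_lnls = [-12.3, -8.7, -15.1]

# PF1's trick: the subset log likelihood is just their sum.
subset_lnl = sum(site_lnls)

def bic(lnl, n_params, n_sites):
    # Standard BIC formula, usable once the subset lnL is in hand.
    return -2.0 * lnl + n_params * math.log(n_sites)
```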

Shall we:

  1. Since we switched to TIGER rates I haven't yet seen a dataset that required fabricated subsets. Should we get rid of the fabricated subset function altogether and throw an error if the subsets get too small?
  2. Keep the BIC of the unsplit subset, and set the BIC of the problematic subset to a value that makes it, plus the new subset, one BIC point better than the unsplit subset, so that the algorithm keeps going?
  3. Other ideas?

fix rerun tests

It looks like PhyML now writes alignments differently. All that needs to be done here is that the base files for all of the rerun tests need to be updated.

Problem with installation

I am following the instructions for installing phyml on Biolinux, which uses Ubuntu 14.04.
However, when I type

make
make  all-recursive
make[1]: Entering directory `/usr/local/lib/partitionfinder-master/programs/phyml_source'
Making all in src
make[2]: Entering directory `/usr/local/lib/partitionfinder-master/programs/phyml_source/src'


:: Building [phytime]. Version 20150123 ::


gcc  -I. -I..     -ansi -pedantic -Wall -std=c99 -O3 -fomit-frame-pointer -funroll-loops -arch i386      -mmacosx-version-min=10.4 -MT main.o -MD -MP -MF .deps/main.Tpo -c -o main.o main.c
gcc: error: i386: No such file or directory
gcc: error: unrecognized command line option ‘-arch’
gcc: error: unrecognized command line option ‘-mmacosx-version-min=10.4’
make[2]: *** [main.o] Error 1
make[2]: Leaving directory `/usr/local/lib/partitionfinder-master/programs/phyml_source/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/local/lib/partitionfinder-master/programs/phyml_source'
make: *** [all] Error 2

Any idea what the reason for these errors is?
thanks

Add support for sequential analyses over a model list

One thing about raxml is that you can only use a single model of rate variation for each partition (parameters estimated independently per partition).

So one thing that would be useful in PF would be to be able to provide a list of models, e.g.

models = GTR, GTR+G, GTR+I+G;

and then rather than consider combinations of these models, just run three analyses - one with all models set to GTR, one with all set to GTR+G, and one with GTR+I+G.

We would then just pick the scheme with the lowest AICc, as usual, and output that along with the best models.

This is a low priority, as it is fairly cosmetic, but user-friendly nonetheless

Catch segmentation faults from RAxML / PhyML

The basic idea here is that sometimes other programs we rely on will die horribly, for reasons we don't understand and can't fix. Right now, PF just exits saying that RAxML (or whatever) didn't execute successfully. It would be helpful to provide more information. The tips below from a user are to help sort that out.

INFO | 2014-02-18 00:35:33,405 | raxml | Estimating LG+G branch
lengths on tree using RAxML
ERROR | 2014-02-18 00:35:46,995 | raxml | RAxML did not execute
successfully
ERROR | 2014-02-18 00:35:48,179 | raxml | RAxML output follows,
in case it's helpful for finding the problem
ERROR | 2014-02-18 00:35:48,179 | raxml |
ERROR | 2014-02-18 00:35:48,179 | raxml |
ERROR | 2014-02-18 00:35:48,878 | main | Failed to run. See
previous errors.

Email from the user:

we have been trying to get PartitionFinder 1.1.1 to run on a Linux
machine of Dani Bartel with 8 GB of RAM. the test alignment file is 12
MB in size and has about 338,000 amino acid positions in 1348 partitions
(I ran it on my machine in Aussie successfully, the one we both started
together).

We compiled standard RAxML version from Brett's github 1 using the
Makefile for GCC. RAxML crashed during LG+G branch length estimation
(BLTREE) with a segmentation fault. We found out by running RAxML
manually using this command (PartitionFinder runs it during its
analysis):

raxml -f -e -s DATEN.phy -t TREE.phy -m PROTGAMMALG -n BLTREE -w
START_TREE -e 1.0 -O

While a segmentation fault is never pretty and this is a RAxML issue
(has already been posted to its developers) it would be nice if
PartitionFinder would trap that segfault and provide a more useful error
message than in the attached file.

You could trap the segmentation fault signal (SIGSEGV) using the signal
module 2 and write a handler like described in 3 that at least tells
the user that something terribly wrong happened to the subprocess.
[this was written by Malte Petersen, who is currently helping me here in
the US to get everything running]

Would this be possible? I'm sure future users would greatly appreciate
this improvement!
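A SIGSEGV in a child process can't be trapped by a Python signal handler in the parent, but it can be recognised from the subprocess return code; a minimal sketch (the function name and error wording are illustrative):

```python
import signal
import subprocess

def run_external(cmd):
    # Run an external program (e.g. raxml/phyml). On POSIX, a negative
    # return code means the child was killed by that signal number.
    proc = subprocess.run(cmd)
    if proc.returncode == -signal.SIGSEGV:
        raise RuntimeError(
            "%s was killed by a segmentation fault (SIGSEGV). This is a "
            "crash in the external program, not in PartitionFinder." % cmd[0])
    return proc
```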

heuristic searches with 1 subset

currently, if you have e.g. search = 'greedy'; and a single subset, PF gets confused and quits without being informative.

This should be very very easy to fix. Just put a catch at the top of each heuristic search to make sure that we only do the search if there's >1 subset in the initial scheme. Otherwise, just analyse that one subset and send out the results.
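A guard of that shape might look like this (function names are illustrative, not PF's actual API):

```python
def heuristic_search(subsets, analyse, search):
    # Only run the heuristic when there is something to merge;
    # with a single subset, just analyse it and return the result.
    if len(subsets) <= 1:
        return [analyse(s) for s in subsets]
    return search(subsets, analyse)
```

With this in place, search = 'greedy' on a one-subset scheme degrades gracefully instead of quitting.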

R

output methods text and suggested citations

Following a conversation on twitter, we realised it might be helpful to output at the end of any given run the suggested citations and methods text. This could incorporate:

  1. Suggested citations
  2. Citations in bibtex format
  3. Suggested text

All of these would be determined by the details of the analysis someone ran. E.g. did it use PhyML, RAxML, or something else? Did it use algorithms (like the relaxed clustering or kmeans) described in other publications? Which version of PF did it use?

E.g. if someone ran relaxed clustering in PF2, that would use RAxML, the relaxed clustering algorithm, and PF2, so the text might read:

"To determine an appropriate partitioning scheme, we used the relaxed clustering algorithm [ref1] implemented in PartitionFinder 2.0 [ref2], which relies on RAxML [ref3]".

And then the refs in text and possibly bibtex format.

TIGER rates

So right now the TIGER K-means keeps recalculating TIGER rates.

I am not sure this is good for two reasons:

  1. It is slow to do it, particularly in the early stages of large datasets
  2. I wonder if it could fall into something of the same trap of overfitting in some cases

So, I'm wondering if we should instead stick with calculating TIGER rates once, at the top of the algorithm, and then just stick with them. Thoughts?

New model names break tests

The new model loading doesn't have names for a number of options contained in the tests (e.g. 'raxml'). This breaks a number of tests.

Allow for relocation of output folder

Currently, output is dumped into a subfolder below the configuration files. This doesn't allow for separation of the results from configuration. This would be useful for our own testing (so we don't clog up the development folders with output), but also (I imagine) in situations where large amounts of output would be better off in a separate place from the configuration. (Note: we need to make sure that the "restart" tests are working again before doing this).

Cache site rates?

Right now, the site rates aren't cached or saved. We should probably cache them in case the run is interrupted and the user has to restart. Note that this might not always allow the user to 'skip' to the exact place that the algorithm left off since the k-means algorithm isn't guaranteed to converge on the same solution, even given identical rates.
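A minimal caching sketch, assuming the rates serialise to JSON and the cache lives somewhere in the analysis folder (names are illustrative):

```python
import json
import os

def load_or_compute_rates(cache_path, compute):
    # Reload cached site rates if a previous run saved them;
    # otherwise compute, save, and return them.
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    rates = compute()
    with open(cache_path, "w") as fh:
        json.dump(rates, fh)
    return rates
```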

Consider packaging / distribution for 2.0 (maybe using a Conda recipe)

PF 2.0 moves away from the "batteries included" approach in PF 1.0, as we dependencies on various packages (numpy etc), and we might also require building a cython extension. Python packaging is notoriously crappy, so we could consider anaconda packaging as a supported solution. ie. Download anaconda; use "conda install partfinder2".

update RAxML to very latest version

In testing with the 1Kite crew, we have discovered a small bug in the RAxML LG4X model implementation. This is currently being fixed, so before release we have to know that it's fixed, and update our Windows and Mac executables for RAxML. We may need help compiling the windows version, but I have a few people I could ask.

re-instate .cfg tests

At some point, @brettc removed these tests for development. They need to get put back in.

There are two types of test currently missing:

  1. Comparison of old .cfg file and current one, if there are saved subsets. E.g. were subsets analysed with RAxML or PhyML, and did the data block definitions change;
  2. Basic .cfg checks, e.g. that data blocks do not contain the same site > once.

Rob

Add subset info to output

For certain analyses (anything with RAxML, but more specifically rcluster and scluster) we should output subset parameters in the best_schemes.txt file, e.g. including:

base frequencies
relative rate
model parameters
gamma rate parameter (if present, NA if not)
invariant sites parameter (if present, NA if not)

Feature request: mrBayes block generation?

Would it be possible for PartitionFinder to automatically create mrBayes blocks for partitioned analyses? I feel like this could help automate using PartitionFinder for mrBayes analyses, rather than having to manually construct the bayes block.
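A sketch of what generating such a block could look like. The input shape (an ordered mapping of data block names to site-range strings) and the function name are assumptions; the charset/partition syntax is standard MrBayes:

```python
def mrbayes_block(scheme):
    # scheme: ordered mapping of data block name -> Nexus site range string.
    lines = ["begin mrbayes;"]
    for name, sites in scheme.items():
        lines.append("  charset %s = %s;" % (name, sites))
    lines.append("  partition bysubset = %d: %s;"
                 % (len(scheme), ", ".join(scheme)))
    lines.append("  set partition = bysubset;")
    lines.append("end;")
    return "\n".join(lines)
```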

semi colons

We have a couple of issues that could be solved by making people use semi-colons at the end of lines in the .cfg file.

  1. This example of partitions:

part1 = 1-100\3
12S = 101-1000

at the moment, our parser thinks the "12" is a site in the 'part1' partition. Since 12S, 16S, 28S etc are common genes for phylogenetics, we'll see this problem a lot.

  2. This example of a user screw-up where you accidentally add something rubbish after defining an option

models = all k

If we made people use semi-colons, we could get around both of these problems, and have a much better crack at giving people useful parser error messages, which I suspect will be where most of our problems come from if and when people start using the program.
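A toy parser shows why the terminator helps: with a trailing ';' every definition has an unambiguous end, so a name like 12S on the next line can never be swallowed into the previous partition. This grammar is purely illustrative, not PF's actual parser:

```python
import re

def parse_blocks(text):
    # Each definition is 'name = value;' -- the semicolon, not the
    # newline, terminates the value, removing the ambiguity above.
    return {name: value.strip()
            for name, value in re.findall(r"(\w+)\s*=\s*([^;]+);", text)}
```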

What do you think? Happy to implement this.

R

models

Right now, we're not doing as well as we could with models.

At the moment, a subset just picks the best model based on all the models it has analysed. This is inconvenient if I want to run a full analysis (i.e. all models) and then see what the answer would have been with a more restricted model list, or vice versa.

It's not a problem right now, since users are banned from re-running an analysis with a different model list. However, it would be useful (and possibly straightforward) to fix.

R

Keep (some) phylofiles

It would be useful to keep some files, like the best "Subset Partitions", or even all of the tested partitions.

add support for amino acid alignments with iterative k-means

It might be a good idea to add support for amino acid alignments using iterative k-means. I think most of it works already; we just need to add support for the estimation of amino acid site rates. @cmayer pointed out that since there are 20 character states in amino acids (rather than 4), there will likely be a greater amount of conflict, leading to some very small rates. I'm not sure what effect this will have, and it will have to be tested to see if it works. I still think it would be worth adding, since we should be able to implement it with minimal extra effort.

PhyML output filename conventions on linux

Hi,

I ran into a problem running partition finder on linux. This applies to using it with phyml version 20141029, but maybe affects phyml on linux more generally. I'm running the test example "python PartitionFinderProtein.py examples/aminoacid".
On linux phyml output files don't have a .txt suffix - they just end .phy_phyml_tree (or stats). However, your file handling code in partfinder.phyml.py (functions make_tree_path and make_output_path) assumes that the .txt suffix is there. So this leads to an error when trying to read the BioNJ starting tree at the beginning of the analysis. The tree file gets written successfully, it just doesn't have the suffix, so PF can't find it, and it errors out with a standard python IOError.
If I alter the code in phyml.py (just delete ".txt" in the three places it occurs) then everything runs OK.

I don't know if you've run into this problem before, or if you already have another fix, but I thought I'd let you know.
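One tolerant fix (a sketch, not the shipped code) is to try both filename conventions rather than hard-coding the .txt suffix:

```python
import os

def find_phyml_tree(alignment_path):
    # Try the suffix PF expects first, then the suffix-less name
    # that some linux PhyML builds write.
    for suffix in ("_phyml_tree.txt", "_phyml_tree"):
        candidate = alignment_path + suffix
        if os.path.exists(candidate):
            return candidate
    raise IOError("no PhyML tree file found for %s" % alignment_path)
```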

fix full tests with results comparisons

The following tests fail, because the new version of phyml or raxml is finding better likelihoods so breaking our rerun tests (either that or updates to the rcluster algorithm are giving slight improvements). In each of these cases I've checked the output, and the differences are very small. Everything else looks good.

For the protein tests that are failing, they fail because I simplified the tests to make them run quicker, so the expected output has changed.

The fix is to update the results.bin files to runs from the current value. @brettc, can you do this for the following tests:

tests/full_analysis/test_full.py::test_dna[DNA2] FAILED
tests/full_analysis/test_full.py::test_dna[DNA7] FAILED
tests/full_analysis/test_full.py::test_dna[DNA8] FAILED
tests/full_analysis/test_full.py::test_prot[prot1] FAILED
tests/full_analysis/test_full.py::test_prot[prot6] FAILED
tests/full_analysis/test_full.py::test_prot[prot7] FAILED
tests/full_analysis/test_full.py::test_prot[prot8] FAILED

Memory usage

OK,

At the moment we go gung ho for all the processors we can find. This is almost always the best thing to do, but there are a couple of places where we should be more careful. Particularly with big datasets, as the partitions we analyse get bigger (say, in the greedy analysis), each processor needs more RAM to do the calculations.

I think the best thing here would be to use the processors more dynamically, and this doesn't need to be too difficult. I'm pretty sure that Stephane has a formula (something to do with the number of sites and the number of species) to calculate roughly how much memory any given phyml analysis will use. If we could also get an estimate of how much memory is available on the host machine, we could do a simple test before adding a task to the threadpool:

if available_memory > estimated_memory:
    add_next_task()
else:
    dont_add_task()

If we wanted to get clever, we could also just try running through all the waiting tasks to see if there's one that will fit in the available memory, but we'd have to remember to still do this in order of priority of model difficulty (which is currently coded to run the longer analyses first). This part will only really be relevant if we change the job scheduling to do more than one partition's analyses at a time, though.
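The "clever" variant could be sketched like this; the task representation and memory estimator are assumptions:

```python
def pick_next_task(waiting_tasks, available_memory, estimate_memory):
    # Walk the queue in its existing priority order and return the
    # first task whose memory estimate fits right now.
    for task in waiting_tasks:
        if estimate_memory(task) <= available_memory:
            return task
    return None  # nothing fits; wait for a running task to finish

# Usage with trivial stand-ins (memory proportional to site count):
picked = pick_next_task([{"sites": 5000}, {"sites": 100}],
                        available_memory=1000,
                        estimate_memory=lambda t: t["sites"])
```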

That is all. Just a thought.

R

Update RAxML version

We need to update the version of raxml we use to version 8.x

As far as I remember, there are some things which used to work in RAxML, which no longer work. I will attempt to work on this as soon as possible, so I can contact Alexis and resolve any outstanding issues with the RAxML code that we might need to be fixed.

Note that someone has been compiling windows versions of RAxML, so once we have a version of RAxML we can use, we might be able to ask that person to compile us a version too...

https://github.com/stamatak/standard-RAxML/tree/master/WindowsExecutables_v8.1.15

add a branchlengths=equal option

RAxML has only two models for branch lengths currently:

  1. All branch lengths equal
  2. Each partition has its own set of branch lengths.

The latter corresponds to 'unlinked' branch lengths in PF (I think it's enabled with the -M option).

The former is not implemented in PF, but could be as an 'equal' branch lengths option.

Note that right now, RAxML doesn't implement anything that's equivalent to what we call 'linked' branch lengths, where there's an underlying set of branch lengths but each partition gets its own rate multiplier. This is a shame, because it's my suspicion that this is by far the best of the three approaches (i.e. out of 'unlinked', 'linked', and 'equal').

IOError: [Errno 2] No such file or directory: './analysis/start_tree/filtered_source.phy_phyml_tree.txt'

I am on Linux. When I try to run the examples I get the following error:

$ python2 PartitionFinder.py -v --force-restart examples/nucleotide
...
Traceback (most recent call last):
  File "PartitionFinder.py", line 23, in <module>
    sys.exit(main.main("PartitionFinder", "DNA"))
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/main.py", line 333, in main
    options.processes)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/analysis.py", line 55, in __init__
    self.make_tree(cfg.user_tree_topology_path)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/analysis.py", line 149, in make_tree
    self.cfg.cmdline_extras)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/phyml.py", line 133, in make_branch_lengths
    dupfile(topology_path, tree_path)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/phyml.py", line 101, in dupfile
    shutil.copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: './analysis/start_tree/filtered_source.phy_phyml_tree.txt'
$ python2 PartitionFinderProtein.py -v --force-restart examples/aminoacid
...
Traceback (most recent call last):
  File "PartitionFinderProtein.py", line 23, in <module>
    sys.exit(main.main("PartitionFinderProtein", "protein"))
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/main.py", line 333, in main
    options.processes)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/analysis.py", line 55, in __init__
    self.make_tree(cfg.user_tree_topology_path)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/analysis.py", line 149, in make_tree
    self.cfg.cmdline_extras)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/phyml.py", line 133, in make_branch_lengths
    dupfile(topology_path, tree_path)
  File "/home/wookietreiber/src/idiv/partitionfinder/partfinder/phyml.py", line 101, in dupfile
    shutil.copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: './analysis/start_tree/filtered_source.phy_phyml_tree.txt'

I wonder why that is and whether this is a Linux-only problem. It must be, or else I suppose you would have figured it out already for the Mac and Windows versions. Maybe it is related to one of the external tools, i.e. phyml or raxml, since I installed those locally and symlinked them into the programs directory like this:

$ ls -go programs/phyml programs/raxml
lrwxrwxrwx 1 14 Oct 16 14:57 programs/phyml -> /usr/bin/phyml
lrwxrwxrwx 1 14 Oct 16 14:57 programs/raxml -> /usr/bin/raxml

phyml is version 20140926
raxml is version 8.1.1

clean up model list

It would be much cleaner to have a single list of models, parameter values, command line options for different programs.

Right now we duplicate this to some extent between raxmlmodels.py and phymlmodels.py

The benefits of a single list would be:

  1. Fewer bugs, easier to spot and fix errors
  2. Easy to query whether a given program supports a given model (right now this requires some annoying duplication in the code)
  3. Easier to modify as new models come online in PhyML and/or RAxML

Big greedy runs fail at ~50%

A very helpful user of the develop branch said this:

"That exception [to otherwise good performance] is a PF-develop run (search=greedy, MrBayes-specific models) I attempted for comparison with a PF-1.1.1 run of the same parameters. As I have mentioned previously, the PF-1.1.1 search=greedy runs were going very slowly, so I was hoping the PF-develop run would be fast or even finish before the previously-started PF-1.1.1 run.

In the end, the PF-1.1.1 run took >25 days (28/Dec – 25/Jan) to finish. The PF-develop run started very quickly and progressed to ~50% in 4-5 days, after which progress pretty much stopped. I let it run for a few more days and thought it had locked up so restarted. After the restart I let it run for another 5-6 days with little progress, after which I needed the computer for other analyses and killed the job. Interestingly, the computer had written a >40GB swap trying to deal with this analysis."

Need to figure this out and fix it. My suspicion is that the current method of loading up ALL the schemes at once is no good (this is the big change in the greedy algorithm from 1.1.1). But it could also be something to do with the databasing - the greedy algorithm uses a lot of subsets, and I wonder if DB.py is getting overloaded (if so, what to do?). Third option - it's because we abandoned the weakref dictionary, and we're just keeping too many subsets around.

Things to try:

  1. Revert to old greedy algorithm, and run big analyses. See if we still get this error (assuming we can replicate it first)
  2. If (1) fixes the error, we can stick with the greedy algorithm that yields schemes, but yield ~1000 schemes at a pop. That will keep most of the performance benefits and may work around the issue.
  3. Go back to flushing out useless subsets from memory somehow. Should be simple enough to do, and we could even move to the numpy solution that I use in the relaxed clustering algorithm. [NOTE TO SELF - the newest formulation of greedy is equivalent to relaxed clustering with the percentage set to 100, so TRY THAT FIRST].

Include a --seed option

K-means is stochastic. This is fine, but to make sure we can replicate things, we should include a commandline option to set the random number seed, e.g.

--seed

which feeds through to the k-means output.
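A minimal sketch of wiring the option through to a seeded RNG (the argparse usage is illustrative of the idea, not PF's actual CLI code):

```python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None,
                    help="random number seed for the k-means step")
args = parser.parse_args(["--seed", "42"])   # e.g. from the command line

# Feeding the seed into a dedicated RNG makes k-means runs replicable:
rng = random.Random(args.seed)
```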

A sub-issue is that we should record the seed (user specified or not) in the saved .cfg file, so that we can re-start and checkpoint easily (see issue 35 too: #35)

add section to manual on major differences with PF1

pretty self explanatory, but if anyone thinks of anything we should add (this list will also be useful for the paper), then stick it here.

  • model lists are a bit different
  • there are a lot more models implemented
  • kmeans algorithm is implemented
  • but it uses entropy, not TIGER rates, which is quicker (O(N) with sites, rather than O(N-squared)) and usually better
  • rcluster is now controlled with --rcluster-max not --rcluster-percent. This makes it O(N) with the number of data blocks, rather than O(N-squared). I need to do a few comparisons with the default settings to measure the difference this makes.
  • much less file writing (great for speed, disk space, and use on clusters with I/O limits)
  • best_schemes.txt contains a lot more information
  • we use Numpy, Scipy, and scikit-learn, so everything is quicker
  • installing the right version of Python (with all the dependencies) is now much easier with Anaconda point-and-click installers (especially for Windows users)
  • parallelisation is much more efficient (i.e. we analyse multiple models from multiple subsets at once, which means that we can make much better use of all the available processors)

long file names

Simon has found a problem,

If you have LOTS of subsets (he has 50), then our way of making up filenames gives you names that are too long, and nothing works.

However, I like our filenames in general. I wonder if we could add in a conditional (if len(name)>50) and in those situations switch to a different system. Not sure what the system would be, because we need to make sure it's replicable each and every time, so that we can reliably look up old subset results.
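One replicable fallback is to hash the long name: the digest is deterministic, so old subset results can still be looked up on reruns. A sketch (the threshold and digest choice are arbitrary):

```python
import hashlib

def subset_filename(name, max_len=50):
    # Keep readable names when they are short enough; otherwise fall
    # back to a fixed-length digest that is the same on every run.
    if len(name) <= max_len:
        return name
    return hashlib.md5(name.encode("utf-8")).hexdigest()
```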

Any ideas?

R

Abandon BIC and AIC, discuss

Hi All,

I have a proposal for PF2. Right now we calculate three metrics:

AIC, AICc, BIC

My proposal is that we ditch them all apart from the AICc. Here's why.

First, there is no reason to ever use the AIC over the AICc. The AICc corrects for small sample sizes, and converges to the AIC for larger sample sizes (see Burnham and Anderson's book). So providing both is pointless and perhaps even misleading.

Second, the BIC is not really an information criterion at all (it's only named as such). And more importantly, it makes some totally untenable assumptions for molecular phylogenetics. Most egregiously, it assumes that the TRUE model is in the set being considered. This is ridiculously far from the truth for phylogenetics. Perhaps because of this, the only studies that really show support for the accuracy of the BIC are simulation studies, where that one ridiculous rule is actually true - in the simulation studies that showed support for the BIC the true model IS in the set being considered.

Reducing everything to a single metric makes the program simpler, and would help to move the field along a bit too.

Any thoughts?

Rob

building phyml from source

There are a bunch of symlinks in partitionfinder/programs/phyml_source/, and ./configure doesn't run:

$: ./configure
configure: error: cannot find install-sh or install.sh in "." "./.." "./../.."

Am I missing something?
Thanks,
-Steve

Alignment parsing

Hey,

I have added a test which fails the alignment parser. Tried for a while to fix it, but ran out of energy after a couple of hours. This one's important to fix before release.

The issue is with interleaved phylip alignments. like this one: http://molecularevolution.org/resources/fileformats/phylip_dna

The deal is that there are multiple sequence 'blocks', each separated by an empty line. The names are contained only in the first block.

So the phylip parser, as it stands, does fine on the top block, but it fails when there are additional blocks of sequences.

Brett - reckon you can fix this?

I had a go by defining top_block, and then OneOrMore(extra_block). This seemed OK, but then I got stuck on zipping up the sequences so that each sequence from the extra blocks gets stuck onto the end of the corresponding sequence from the top block.
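The zipping step might look like this (the data shapes are assumptions about what the parser yields):

```python
def join_blocks(top_block, extra_blocks):
    # top_block: list of (name, sequence) pairs from the first block.
    # extra_blocks: lists of bare sequence chunks, in the same order
    # as the names in the top block.
    names = [name for name, _ in top_block]
    seqs = [seq for _, seq in top_block]
    for block in extra_blocks:
        for i, chunk in enumerate(block):
            seqs[i] += chunk
    return list(zip(names, seqs))
```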

R

checks on bad subsets are not working

Hey @brettc, can you take a look at this ASAP? This is a very important check that currently doesn't work.

Previously, we have checked that sites in data blocks are non-overlapping, and also spit a warning for any missing sites.

E.g. with the aminoacid example file:

This is fine
'''
[data_blocks]
COI = 1-407;
COII = 408-624;
EF1a = 625-949;
'''

This should spit a warning about missing sites 1-3 and 941-949
'''
[data_blocks]
COI = 4-407;
COII = 408-624;
EF1a = 625-940;
'''

And this should quit with an error about a site only being allowed to appear in a single data block:
'''
[data_blocks]
COI = 1-407;
COII = 408-624;
EF1a = 625-949;
OOPS = 1-949;
'''

Right now, the second and third options are not behaving properly. No warning, and no error. Can we re-instate these?
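Both checks are straightforward over explicit site sets; a sketch with assumed input shapes ((start, end) pairs, 1-based inclusive):

```python
def check_blocks(blocks, n_sites):
    # Returns (overlapping block names, sorted missing site numbers).
    seen, overlaps = set(), []
    for name, (start, end) in blocks.items():
        sites = set(range(start, end + 1))
        if sites & seen:
            overlaps.append(name)   # hard error: site in >1 data block
        seen |= sites
    missing = sorted(set(range(1, n_sites + 1)) - seen)  # warn only
    return overlaps, missing
```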

More advice on RAxML versions

A few users have been finding out that the latest version of RAxML doesn't work. This is most important for linux users, who have to compile their own version.

We need to make this more obvious. E.g. from a recent user:

"
Hope everything’s going well! Just writing to offer a suggestion. In your recent Google Group post for PartitionFinder, you state that:

"RAxML changes very frequently, and sometimes in ways that make it incompatible with PartitionFinder (e.g., as you discovered, the very latest RAxML doesn't work with PF). We appreciate that this can be a pain, and so we ship RAxML binaries that should work on most Mac and Windows machines. If the binaries don't work on any given machine, you can just go to our fork of RAxML here: https://github.com/brettc/standard-RAxML, and download and compile the source code for your machine. This fork of RAxML should always be compatible with PF."

Do you think that it’d be possible to stick this in the manual or emphasize this, with a modified URL, on the FAQ? Maybe it was ignorant of me not to go to the Google Group page immediately, but (since I do lots of programming) my first instinct was just to go grab older versions(s) of RAxML and compile them from source. Once I pulled from the version listed above, after exhausting the last few versions of RAxML, everything has been going smoothly.
"

Easy to fix:

  1. Change the FAQ on the website where I give these instructions
  2. Change the manual
  3. Anything else?

very small datablocks with RAxML

Sometimes, users have very small data blocks. They can end up with this error from RAxML:

Empirical base frequency for state number 1 is equal to zero in DNA data partition No Name Provided

However, although we do print out the output from RAxML, because of our threading etc. it looks to the user like something odd has gone on.

So, what we should do is catch this particular RAxML error and make PartitionFinder output a clear description of the problem.

Here's the thread on the google group:

https://groups.google.com/forum/?fromgroups#!topic/partitionfinder/KZU_lvQcekU

And here's my response to the user, which we could use as part of the description output by PF:

"What this means is that you have a single data block which has no A's C's or G's, but just T's (I think I got that right, in any case it has only a single base). Since it's not advisable to try and estimate the parameters for a GTR model from this kind of data, RAxML won't do it, and will exit. You could confirm this (if you wanted to!) by running the alignment b12c320a3c8cae07356dd884b5f54e3a.phy (in your subsets folder) in RAxML. You'll get the same error message.

The pragmatic solution here is to just merge that data block with another one (the most similar one you can think of a priori). Obviously this is not 100% ideal, but it's the only way to get RAxML to analyse this data, and so to get PartitionFinder to work on your data."


Check error handling in threading

April reported some problems with the error handling in threads. We need to ensure that errors generated in threads are properly handled and propagated to the caller without killing off all threads (this issue is probably related to how we handle crashes/errors in phyml / raxml).

(see #32 (comment))
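A minimal sketch (not PartitionFinder's actual code) of one common pattern: the worker catches its own exception, pushes it onto a queue, and the caller re-raises it in its own thread instead of silently losing it.

```python
import queue
import threading

def run_in_thread(func, *args):
    """Run func in a worker thread; re-raise any exception in the caller."""
    errors = queue.Queue()

    def worker():
        try:
            func(*args)
        except Exception as exc:  # capture, don't let the thread die silently
            errors.put(exc)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    if not errors.empty():
        raise errors.get()  # propagate to the caller's thread

def bad_task():
    raise ValueError("phyml crashed")
```

Calling `run_in_thread(bad_task)` then raises the ValueError in the caller, where it can be handled per-subset rather than taking down every thread.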

user schemes seem to be broken

on develop branch. @brettc, can you take a quick look?

I can't seem to specify a user scheme, e.g. like this for the example dataset:


## SCHEMES, search: all | greedy | rcluster | hcluster | user | kmeans ##
[schemes]
search = greedy;

s1 = (Gene1_pos1) (Gene1_pos2) (Gene1_pos3) (Gene2_pos1) (Gene2_pos2) (Gene2_pos3) (Gene3_pos1) (Gene3_pos2) (Gene3_pos3);
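For what it's worth, my reading of the comment line is that a user-defined scheme is only evaluated when the search option itself is set to user, so it may be worth also testing the snippet with that setting (assuming the option name from the comment line is right):

'''
[schemes]
search = user;

s1 = (Gene1_pos1) (Gene1_pos2) (Gene1_pos3) (Gene2_pos1) (Gene2_pos2) (Gene2_pos3) (Gene3_pos1) (Gene3_pos2) (Gene3_pos3);
'''

If it's broken under search = user as well, that would narrow the bug down to the scheme parser rather than the search dispatch.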

Use cython for rate estimation

Right now we call the fast_TIGER C++ program and parse the results. This is kind of clunky and the C++ code used to calculate the rates isn't very complicated. Perhaps we should use something like Cython to estimate TIGER rates, so that we can avoid calling an external program? @brettc might have some ideas on this. If we can maintain a similar runtime, it might be worth implementing for PF 2.0.
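As a starting point for porting, here is a rough, hedged sketch of TIGER-style rates in pure Python: each site's partition groups taxa by state, and a site's rate is its mean "partition agreement" with every other site. This is a simplification of what fast_TIGER actually computes, for illustration only; a Cython version would mainly need to type and tighten these loops.

```python
def site_partition(column):
    """Group taxon indices by the character state they show at this site."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())

def agreement(part_i, part_j):
    """Fraction of sets in part_j that are subsets of some set in part_i."""
    hits = sum(1 for sj in part_j if any(sj <= si for si in part_i))
    return hits / len(part_j)

def tiger_rates(columns):
    """columns: one string per alignment site; characters are taxon states."""
    parts = [site_partition(c) for c in columns]
    n = len(parts)
    rates = []
    for i in range(n):
        others = [agreement(parts[i], parts[j]) for j in range(n) if j != i]
        rates.append(sum(others) / len(others))
    return rates

# Two identical sites agree perfectly with each other, not with the third:
rates = tiger_rates(["AACC", "AACC", "ACAC"])
```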
