tskit-dev / tskit
Population-scale genomics
License: MIT License
In the docs (http://msprime.readthedocs.io/en/stable/api.html#simulation-model) it says "When running simulations we define the length in bases L of the sequence in question using the length parameter. This defines the coordinate space within which trees and mutations are defined. L is a continuous value, and coordinates can take any value from 0 to L"
It is not clear from this whether the exact positions 0 or L are allowable values. I take it from later comments that the allowable coordinates are [0, L), rather than e.g. [0, L]. So it would be better to say, e.g., "coordinates can take any continuous value from 0 up to (but not including) L".
Two things here. First, I want to check whether arbitrary bytes are allowed in the metadata, derived_state, and ancestral_state columns, for tskit to deal with them. (I believe they are now, but want to check everyone is on board with this.)
Second, I run into issues when trying to actually do this. We are storing in derived_state a sequence of ints that indicate a set of SLiM's mutations, just packed into the char *, as done here. Naively, I was hoping that I could table_collection_dump() these tables, then read them into Python and do things. But when I try to do many things with the tree sequence in Python, I get utf-8 errors, like:
>>> ts = msprime.load("test_output/test_output.treeseq")
>>> vv = ts.variants()
>>> v = next(vv)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/peter/.local/lib/python3.6/site-packages/msprime/trees.py", line 1995, in variants
for site_id, genotypes, alleles in iterator:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I have not yet tracked down where this assumption comes in, but I don't understand the big picture. Could you clarify, @jeromekelleher?
We have to do annoying things when dealing with branch lengths, as we assume the parent exists. I think we should define the branch length as zero when the argument is a root. Can you see any pitfalls @petrelharp?
There's no need for simplify to require that the input sample nodes (i.e., the samples we want in the output tree sequence) should be marked as samples in the input tree sequence. This is quite restrictive and annoying.
It'll be useful to be able to track the actual size in bytes of tables. Add an attribute to return the total amount of memory consumed by each table.
The current VCF conversion code has an algorithm for ensuring that we don't output duplicate sites. This is a bad idea. The simplest thing to do is to round all positions to the nearest integer. It's up to the user how to interpret these. If they need to control this, then they're free to update the positions in the site table beforehand.
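As a sketch of the suggested behaviour (plain numpy, no tskit API assumed), rounding the positions before export would look like:

```python
import numpy as np

# Round site positions to the nearest integer before VCF export; any
# duplicates that result are then the user's responsibility to resolve.
positions = np.array([0.1, 1.7, 3.2, 3.4])
rounded = np.round(positions).astype(int)
```

Whether to de-duplicate, jitter, or rescale positions beforehand is then a user-level decision made on the site table itself.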
Currently we provide a read-only interface to the contents of tables by passing back copies of the data in numpy arrays. This is quite tedious when trying to make changes to tables, and is also a major gotcha for users.
We can facilitate this by keeping a flag internally in the table struct. This will keep track of whether anyone external has a reference to internal memory. If so, then any attempts to grow the table (and therefore realloc memory, making external pointers invalid) will fail. This should be safe I think.
Update the tests in test_tables.py:TestSimplifyTables to use this interface.
@pkalbers is interested in saving information about those mutations that make it into the samples in a (forwards-simulated) tree sequence. These mutations are often likely to belong to a node that is "intermediate" (i.e. it is not a coalescence node - it has a single child and a single parent at the focal position). Normally these nodes are cut out during simplify(), even if their descendants are present in the samples. It would be useful to be able to mark those nodes during simplify(), such that they aren't cut out, unless they have no descendant samples any more. Also see a short discussion at MesserLab/SLiM#25 (comment) and @petrelharp 's reply to the point just below that point in the thread.
p.s. I have put this as a (first!) issue in ts-kit, as it really isn't relevant to msprime, which has no concept of a unary node with an associated mutation (mutations are always placed above binary nodes)
These should be seen as analogous to the start and stop arguments to Python's builtin range, and be specified in tree sequence coordinates. We should initially ensure that start < stop, but in future we could allow start > stop, and enable reverse iteration in this way.
It would be best to do this with a low-level sparse_tree_seek() method to position the tree at the correct location.
It's tedious and error-prone to call x_table.set_columns(c1=x_table.c1, ...) when you only want to change one column.
We can do things like node_table.flags = np.zeros_like(node_table.flags) to set the full flags column in one go using some property magic. Does this seem like a good idea @petrelharp, @molpopgen?
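A minimal sketch of how that property magic might look (toy class, not the real implementation):

```python
import numpy as np

class NodeTable:
    """Toy table with a settable ``flags`` column, sketching the proposal."""

    def __init__(self, num_rows):
        self._flags = np.zeros(num_rows, dtype=np.uint32)

    @property
    def flags(self):
        # Hand back a copy so callers can't mutate internal storage directly.
        return self._flags.copy()

    @flags.setter
    def flags(self, value):
        # Validate and copy the whole column in one assignment.
        value = np.asarray(value, dtype=np.uint32)
        if value.shape != self._flags.shape:
            raise ValueError("flags column must keep the same number of rows")
        self._flags = value.copy()
```

With this in place, node_table.flags = np.zeros_like(node_table.flags) replaces the whole column in a single assignment, with shape and dtype checks happening inside the setter.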
Given num_tracked_samples or num_samples at a mutated site, we need to be able to count up how many of these go with each allele. There's a rough possible start at this here.
This is the main (only?) place we'll use the "mutation parent" information.
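A rough pure-Python sketch of the counting step (function and argument names are hypothetical):

```python
from collections import Counter

def allele_counts(genotypes, alleles):
    # Count how many samples carry each allele at a site, given the
    # allele index observed for each sample.
    counts = Counter(genotypes)
    return {alleles[k]: n for k, n in counts.items()}
```

The efficient version would of course use the tree structure and num_tracked_samples rather than iterating over genotypes, but this pins down the expected output.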
It would be useful to have a method to change who is a sample, as for instance at tskit-dev/pyslim#23 .
Here is a prototype:
def mark_samples(ts, samples, keep=True):
    '''
    Returns the same tree sequence except with the nodes listed in `samples`
    marked as samples (keeping existing samples if `keep` is True).
    '''
    if (min(samples) < 0) or (max(samples) >= ts.num_nodes):
        raise ValueError("Illegal samples.")
    tables = ts.dump_tables()
    flags = tables.nodes.flags
    if not keep:
        # clear the sample bit on all nodes first
        flags = np.bitwise_and(flags, ~np.uint32(msprime.NODE_IS_SAMPLE))
    flags[samples] = np.bitwise_or(flags[samples], msprime.NODE_IS_SAMPLE).astype("uint32")
    tables.nodes.set_columns(flags=flags, population=tables.nodes.population,
                             individual=tables.nodes.individual, time=tables.nodes.time,
                             metadata=tables.nodes.metadata,
                             metadata_offset=tables.nodes.metadata_offset)
    ts = tables.tree_sequence()
    return pyslim.SlimTreeSequence(ts)
Currently sort_tables doesn't ensure mutations occur after their parents. It probably should, but this behaviour is documented, so it's not major.
This is a way that tables can be out of order that sort_tables does not fix, so we should probably address it, although we don't have a use case that produces tables out of order in this way.
The only way I can think of to do this is to add a field to the mutation struct used for sorting that is "number of hops to the root", computed by following mutation.parent indices until you hit -1.
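That sort key can be sketched in a few lines (assuming a parent array with -1 for mutations that have no parent):

```python
def mutation_depth(parent, j):
    # "Number of hops to the root" for mutation j: follow parent indices
    # until we hit the -1 sentinel.
    depth = 0
    while parent[j] != -1:
        j = parent[j]
        depth += 1
    return depth

parent = [-1, 0, 1, -1]
# A stable sort on depth guarantees every mutation comes after its parent.
order = sorted(range(len(parent)), key=lambda j: mutation_depth(parent, j))
```

In C this would be one extra field in the sorting struct, filled in with the same parent-chasing loop before the qsort comparison runs.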
#391 changed the TreeSequence.samples method to return a numpy array, but it did this by creating a new array from a list returned by the low-level API. Update the low-level API to do the actual population query, as well as natively returning a numpy array.
Currently to add rows correctly one needs to import _msprime.NODE_IS_SAMPLE, which is suboptimal. Proposal: default flags is 0 but can still be passed in, and gets OR'ed with _msprime.NODE_IS_SAMPLE if is_sample=True.
I am writing #213 assuming this exists; will need to modify if it does not.
I suppose this should also be an option to .set_columns(), working just the same.
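The proposed default could be as simple as this sketch (the NODE_IS_SAMPLE value here is assumed for illustration):

```python
NODE_IS_SAMPLE = 1  # assumed flag value, for illustration only

def resolve_flags(flags=0, is_sample=False):
    # Sketch of the proposed add_row() behaviour: flags defaults to 0
    # and gets OR'ed with NODE_IS_SAMPLE when is_sample=True.
    if is_sample:
        flags |= NODE_IS_SAMPLE
    return flags
```

Passing both flags and is_sample=True then composes naturally, e.g. resolve_flags(2, is_sample=True) keeps the caller's bits and adds the sample bit.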
It would be useful to have equality testing for trees on the same tree sequence. The semantics are already implemented in C, so we just need to hook into this code.
Many of the cases in the simplifier_check_input function are not tested here.
This would be a good starter issue for someone looking to get familiar with the C API.
It would be very handy to have an 'info' (or something similar) command in the CLI that writes out some basic information about a tree sequence: number of nodes, edges, sites, mutations, for a start.
We need an efficient way to count the number of mutations on a particular branch. This should be exposed to the API as a new method SparseTree.num_branch_mutations(u), or something similar. It may also be useful to provide an iterator over the actual mutations as SparseTree.branch_mutations(u).
See #113 for motivation.
cc @petrelharp
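A naive reference for what the counting method should return (not the efficient tree-based version):

```python
def num_branch_mutations(mutation_nodes, u):
    # Count the mutations on the branch above node u, where mutation_nodes
    # lists the node of each mutation present in the current tree.
    return sum(1 for node in mutation_nodes if node == u)
```

The real implementation would presumably index mutations by node once per tree rather than scanning, but any efficient version should agree with this.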
Here's some closely related use cases that we're seeing (e.g. #612). I'd like to collect them in one place so we can settle on a minimal number of useful methods and an API. So far:
(ancestry painting) Given a list of nonoverlapping sets of nodes, the ancestors
, and a collection of samples, return somehow an assignment for each sample of which set of nodes it inherits from on each part of the genome (if any).
(ancestry proportions) Instead of the whole ancestry painting, return summaries of it in windows: what proportion of the samples's genomes inherit from each set of ancestors, in each window.
(number of descendants) Given a list of samples and a list of nonoverlapping sets "ancestral" nodes, return for each set of ancestors what proportion of the samples' genomes that inherit from that set of nodes. (This should also be done in windows.)
The last two sound very similar, but one is indexed by ancestors; the other by descendants.
The newick format will output a single tree. There is a "standard" format for sets of trees, which is NEXUS. I have code for writing a tree sequence out to nexus, which is fairly simple, but requires (1) the ability to output newick trees which label tips from 1..N_samples (by default msprime has 0..(N_samples-1)); (2) a "TRANSLATE" header mapping the msprime labels to nexus labels; (3) a "name" for each tree, which should probably be the upper limit of the interval over which this tree applies.
Slightly surprisingly, I can't pass a numpy array of type np.int to ts.simplify() on my machine, as it defaults to np.int64, and is then downcast somewhere in simplify, causing the error:
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
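Until that's fixed, the workaround is an explicit cast on the caller's side:

```python
import numpy as np

samples = np.arange(10)  # int64 by default on most 64-bit platforms
# Explicitly downcast to the int32 node ID type before calling simplify().
samples32 = samples.astype(np.int32)
```

Alternatively, simplify() could accept any integer dtype and perform this cast internally, raising only if values overflow int32.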
It should now be possible to make a tree sequence based on a "borrowed" or "stolen" reference to a table_collection_t
. For a borrowed reference, we store a pointer to the supplied tables, and do not free these tables when the tree sequence object is destroyed. For a stolen reference, we store a pointer to the supplied table collection which we free when the tree sequence object is destroyed. These can be specified with (mutually exclusive) flags. The default behaviour should be the present case, where we take a copy of the argument table_collection_t.
This behaviour will be useful for simulations, where we really don't need to have two copies of the same tables. However, we'll need to be careful to ensure that the underlying tables don't get modified. Possibly need to add some locks to the tables to ensure this.
See also tree_sequence_load for current wasteful behaviour.
Migrations should be sorted by time by sort_tables. See also #389.
Currently to write out a table collection, you have to make it into a tree sequence first, which then just calls table_collection_dump() under the hood. Unless there's a good reason not to, it'd be convenient sometimes to dump the table collection directly.
It's useful to be able to 'squash' redundant edges (i.e., two edges (l, x, p, c) and (x, r, p, c) squash into (l, r, p, c)), and we should add a public function to the tables API for this. The C function already exists, so it's just a question of setting up the plumbing.
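A pure-Python reference for the squashing operation (assuming the edges are sorted so that squashable pairs are adjacent):

```python
def squash_edges(edges):
    # Merge adjacent edges (l, x, p, c) and (x, r, p, c) with the same
    # parent and child into a single edge (l, r, p, c).
    out = []
    for left, right, parent, child in edges:
        if out and out[-1][1] == left and out[-1][2:] == (parent, child):
            out[-1] = (out[-1][0], right, parent, child)
        else:
            out.append((left, right, parent, child))
    return out
```

The public API call would just be plumbing down to the existing C function; this only documents the expected semantics.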
This would be a good starter issue for someone looking to figure out the connections between the Python APIs and C code.
As @molpopgen suggested here, it would be good to truncate the metadata output when we call __str__ on NodeTable, SiteTable and MutationTable.
Other behaviour would be desirable if a decoder for the metadata were present.
Hello, thanks for the excellent software! This is a request for a feature rather than an `issue'.
It would be really useful to be able to track the local ancestry of each leaf along a chromosome, so that the simulated chromosomes can be 'painted' according to their population of origin. I.e. traverse back up the tree from a leaf until you get to the ancestor at the time of the relevant admixture, at every tree along the sequence.
There is currently no support for reading/writing migrations as text.
Should be a straightforward update.
x_table_set_columns should allow a NULL input when the number of rows is zero.
There are several options on the tree.draw method that the ascii backend ignores. These should be updated and fixed.
At a minimum, we should update it to support drawing mutations.
from @molpopgen over at #332.
One thing I realized that may be worth a mention in the manual: it seems like an ideal format for metadata encoded from a Python source is a pickled instance of collections.namedtuple. It satisfies what one wants from JSON (field names) but has the space efficiency of the regular tuple, is easily converted to pandas.DataFrame records, etc.
We should provide access to the indexes in TableCollection, i.e., to the arrays themselves and to the build_indexes and drop_indexes methods.
At the moment the msprime/tskit API defines SparseTree.length to be the genomic length covered by a tree. It can be quite confusing when branches on a tree can have "lengths" too. I've been trying to use the word "span" instead of "length" to refer to the genome span covered by an edge or a tree. Would it be sensible to add the attribute .span as a preferred alternative to .length on a SparseTree (and perhaps change the description in other places in the documentation where we refer to extent over a genomic region, such as the "span" (i.e. length) of an edge)?
Thoughts:
There's only quite such an easy correspondence between branch length and # of mutations if we assume all mutations are distinguishable. So to exactly match the branch length statistics, we'd ignore the fact that there may be double mutations. But to match what we see in the sequence (like we actually want to do): take divergence: a mutation that occurs on branches leading to both samples should be counted zero times, not twice. But if the mutations returned by branch_mutations know where else on the tree they occur, then this is doable with a bit more bookkeeping. For instance: first update the X's for each edge; then iterate over the mutations, evaluating (and caching) f(\sum_{branches} X) for each mutation to get the weight for that mutation. Or maybe just use the original algorithm but then go and look individually at all the recurrent mutations to adjust.
If this is easy, it would be useful to be able to iterate over e.g. nodes in a tree sequence in forward time as well as backwards time, like
for n in reversed(ts.nodes()):
I guess this might also be useful for ts.edges() and the like, but that might turn out to be difficult.
Currently, the row of the LD matrix corresponding to site A is filled out as follows (roughly, iiuc): for each other site B, we count the samples carrying the derived allele at both A and B by tracking those samples having the derived allele at A and using num_tracked_samples, with total counts from num_samples. To extend this to recurrent but still biallelic mutations, we need only add in the intermediate step of counting alleles, #298.
If we want to compute this for more than biallelic mutations, here's one way forward: note that if X and Y take values in {0, 1}, and if we define X' and Y' to each have the same distribution as X and Y respectively, but independent of each other and everything else, then
cov[X,Y] = (1/2) * ( P(X == Y) - P(X' == Y') )
and so the r2 measure of LD is
r2[X,Y] = ( P(X == Y) - P(X' == Y') )^2 / ( (1 - P(X == X')) * (1 - P(Y == Y')) )
This can be computed in general, and would involve tracking the sets of samples at A that have each possible allele, as in #297.
But, maybe we don't want to bother with multiallelic LD at the moment.
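A quick numerical check of the identity above on 0/1 data, comparing the match-probability form against the usual cov^2 / (var * var):

```python
import numpy as np

def r2_from_match_probs(x, y):
    # r^2 via the identity: uses only P(X == Y) and the "independent
    # copies" probabilities P(X' == Y'), P(X == X'), P(Y == Y').
    p_eq = np.mean(x == y)
    px, py = np.mean(x), np.mean(y)
    p_eq_ind = px * py + (1 - px) * (1 - py)   # P(X' == Y')
    p_xx = px ** 2 + (1 - px) ** 2             # P(X == X')
    p_yy = py ** 2 + (1 - py) ** 2             # P(Y == Y')
    return (p_eq - p_eq_ind) ** 2 / ((1 - p_xx) * (1 - p_yy))

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 1, 0, 0])
classic = np.cov(x, y, bias=True)[0, 1] ** 2 / (np.var(x) * np.var(y))
```

The two forms agree exactly on 0/1 data (it is an algebraic identity), which is what makes the match-probability version attractive: it generalizes to multiple alleles where covariance does not.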
It is a very common thing to need to (a) first go through the individuals in a tree sequence and find their population, and then (b) subsample according to population. Right now, we might get the list of individuals by population like so:
pop_inds = [[] for _ in range(ts.num_populations)]
for ind in ts.individuals():
    pops = [ts.node(n).population for n in ind.nodes]
    for pop in set(pops):
        pop_inds[pop].append(ind.id)
... and I've omitted the error checking ensuring no individual has nodes in more than one population. This would be easier if we could do
pop_inds = [pop.individuals for pop in ts.populations()]
This suggestion is somewhat fraught because of the possibility that an individual's nodes may have different populations (although if they do, it's probably an error). So, I propose throwing an error if any individual's nodes have more than one population associated. (This check could be an optional argument, but I can't think of when this wouldn't be an error.)
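A sketch of the proposed accessor with that error check, over a toy input (the real version would read from the node and individual tables):

```python
def individuals_by_population(individual_node_pops, num_populations):
    # individual_node_pops: list of (individual_id, [population of each node]).
    # Raise if an individual's nodes span more than one population, as
    # proposed above; otherwise bucket individuals by population.
    pop_inds = [[] for _ in range(num_populations)]
    for ind_id, pops in individual_node_pops:
        if len(set(pops)) > 1:
            raise ValueError(
                "Individual {} has nodes in more than one population".format(ind_id))
        if pops:
            pop_inds[pops[0]].append(ind_id)
    return pop_inds
```

An optional flag could relax the check, but as noted above it is hard to think of a case where mixed-population individuals are not an error.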
#391 changed the TreeSequence.samples method to return a numpy array. Change the .trees() method to use a numpy array directly without first building a Python list. Needs an update to the low-level C API.
Currently write_vcf has a ploidy argument, and will assume that adjacent samples belong to the same individual. This should be deprecated after we provide a make_individuals method, and write_vcf should have an argument that toggles whether to write individuals or genomes (maybe: output either monoploid by default, or phased or unphased).
It is present for SparseTree.samples(), but not ts.samples().
We do not currently document left_child, etc.
The metadata API allows you to store any byte string along with nodes, sites and mutations. This allows us to store pickled Python objects, for example, which will work fine. However, this isn't great for portability:
It would be better if we could use some third-party approach for encoding metadata so that we can automatically decode the stored information into a Python object using a given schema. I'm thinking of things like JSON-schema and protobuf. If the schema is also stored in the HDF5 file, then the schema goes along with the data, making it much more likely the data can be interpreted correctly later.
I've done a little initial experimentation with this in #330, which shows how we can use JSON-schema, where the user is responsible for managing the schemas and so on. Here is a sketch of how it might work in the future if we put in some infrastructure for automatically decoding metadata into Python objects via a few pluggable hooks:
# Your schema for custom metadata defined using json-schema
schema = """{
"title": "Example Metadata",
"type": "object",
"properties": {
"one": {"type": "string"},
"two": {"type": "string"}
},
"required": ["one", "two"]
}"""
# Load this schema into an object builder, so that we make metadata objects
# Using https://github.com/cwacek/python-jsonschema-objects
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
# Make a new metadata object, and assign it to a row in the nodes table.
metadata = ns.ExampleMetadata(one="node1", two="node2")
encoded = json.dumps(metadata.as_dict()).encode()
nodes.add_row(time=0.125, metadata=encoded)
# Load this table into a tree sequence, and store the schema so it can be retrieved later.
ts = msprime.load_tables(nodes=nodes, edges=edges, sequence_length=1)
ts.node_metadata_schema = schema
ts.dump("metadata-example.hdf5")
# Later, on another system, etc, we can load up the metadata and work with it.
ts = msprime.load("metadata-example.hdf5")
schema = json.loads(ts.node_metadata_schema)
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
def decoder(encoded):
    return ns.ExampleMetadata.from_json(encoded.decode())
# If a decoder function is set, accesses to ``node.metadata`` are intercepted and the decoder
# function is used to translate the raw bytes into a Python object.
ts.node_metadata_decoder = decoder
node = ts.node(0)
node.metadata.one # == "node1"
node.metadata.two # == "node2"

This is all a bit vague, but I thought I'd open an issue to start discussion on the topic. This is definitely not something we should tackle before 0.5.0 is released --- none of this is backwards incompatible with what we have now.
I would like to be able to produce fasta files directly from the binary VCF file using some nucleotide evolution model like HKY.
#391 changed the TreeSequence.samples method to return a numpy array, and we should aim to make all APIs that interact with sets of samples work efficiently with numpy arrays of node IDs. The tree stats API currently uses a lot of list operations, and we should update it to use numpy fully.
cc @petrelharp.
When changing the sort_tables and simplify_tables functions to be methods of the TableCollection object, I removed the complexity around dropping the GIL. This can be reinstated easily enough, but there are quite subtle issues around it so I didn't want to do it in a hurry.
Is this still important to you @molpopgen?
compare
display(HTML("".join([tree.draw() for tree in msprime.simulate( sample_size=7, Ne=1000, length=1e4, recombination_rate=2e-8, random_seed=seed).trees()])))
with
display(HTML("".join([tree.draw(node_labels={a:a for a in range(14)}) for tree in msprime.simulate( sample_size=7, Ne=1000, length=1e4, recombination_rate=2e-8, random_seed=seed).trees()])))
Simplify currently ignores migration records. We need to update simplify so that migrations are also processed. This should be simple to do; we just use the node ID map to translate the node IDs in the old table into new node IDs. If the old node ID maps to -1, we leave the migration out.
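The translation step can be sketched directly (migration row layout assumed here):

```python
def remap_migrations(migrations, node_map):
    # Translate migration node IDs through the simplify node map, dropping
    # rows whose node maps to -1 (i.e. was removed by simplify).
    # Each migration is (left, right, node, source, dest, time).
    out = []
    for left, right, node, source, dest, time in migrations:
        new_node = node_map[node]
        if new_node != -1:
            out.append((left, right, new_node, source, dest, time))
    return out
```

The only design question is whether dropping unmapped rows silently is always right, or whether a flag should make it an error.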
At the moment, for large trees with custom labels, labels from SparseTree.draw(format="svg") are bound to overlap. An option to rotate the labels by 90 degrees would be helpful, and probably very easy to code.