tskit-dev / tskit
Population-scale genomics
License: MIT License
In the docs (http://msprime.readthedocs.io/en/stable/api.html#simulation-model) it says "When running simulations we define the length in bases L of the sequence in question using the length parameter. This defines the coordinate space within which trees and mutations are defined. L is a continuous value, and coordinates can take any value from 0 to L"
It is not clear from this whether the exact positions 0 or L are allowable values. I take it from later comments that the allowable coordinates are [0, L), rather than e.g. [0, L]. So it would be better to say, e.g., "coordinates can take any continuous value from 0 up to (but not including) L".
Two things here. First, I want to check whether arbitrary bytes are allowed in the metadata, derived_state, and ancestral_state columns, for tskit to deal with them. (I believe they are now, but want to check everyone is on board with this.)
Second, I run into issues when trying to actually do this. We are storing in derived_state a sequence of ints that indicate a set of SLiM's mutations, just packed into the char *, as done here. Naively, I was hoping that I could table_collection_dump() these tables, then read them into Python and do things. But when I try to do many things with the tree sequence in Python, I get utf-8 errors, like:
>>> ts = msprime.load("test_output/test_output.treeseq")
>>> vv = ts.variants()
>>> v = next(vv)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/peter/.local/lib/python3.6/site-packages/msprime/trees.py", line 1995, in variants
for site_id, genotypes, alleles in iterator:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I have not yet tracked down where this assumption comes in, but I don't understand the big picture. Could you clarify, @jeromekelleher?
We have to do annoying things when dealing with branch lengths, as we assume the parent exists. I think we should define the branch length as zero when the argument is a root. Can you see any pitfalls @petrelharp?
There's no need for simplify to require that the input sample nodes (i.e., the samples we want in the output tree sequence) should be marked as samples in the input tree sequence. This is quite restrictive and annoying.
It'll be useful to be able to track the actual size in bytes of tables. Add an attribute to return the total amount of memory consumed by each table.
The current VCF conversion code has an algorithm for ensuring that we don't output duplicate sites. This is a bad idea. The simplest thing to do is to round all positions to the nearest integer. It's up to the user how to interpret these. If they need to control this, then they're free to update the positions in the site table beforehand.
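As a sketch of the suggested behaviour (plain numpy, no tskit API assumed), rounding the positions before export would look like:

```python
import numpy as np

# Round site positions to the nearest integer before VCF export; any
# duplicates that result are then the user's responsibility to resolve.
positions = np.array([0.1, 1.7, 3.2, 3.4])
rounded = np.round(positions).astype(int)
```

Whether to de-duplicate, jitter, or rescale positions beforehand is then a user-level decision made on the site table itself.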
Currently we provide a read-only interface to the contents of tables by passing back copies of the data in numpy arrays. This is quite tedious when trying to make changes to tables, and is also a major gotcha for users.
We can facilitate this by keeping a flag internally in the table struct. This will keep track of whether anyone external has a reference to internal memory. If so, then any attempts to grow the table (and therefore realloc memory, making external pointers invalid) will fail. This should be safe I think.
Update the tests in test_tables.py:TestSimplifyTables to use this interface.
@pkalbers is interested in saving information about those mutations that make it into the samples in a (forwards-simulated) tree sequence. These mutations are often likely to belong to a node that is "intermediate" (i.e. it is not a coalescence node - it has a single child and a single parent at the focal position). Normally these nodes are cut out during simplify(), even if their descendants are present in the samples. It would be useful to be able to mark those nodes during simplify(), such that they aren't cut out, unless they have no descendant samples any more. Also see a short discussion at MesserLab/SLiM#25 (comment) and @petrelharp 's reply to the point just below that point in the thread.
p.s. I have put this as a (first!) issue in ts-kit, as it really isn't relevant to msprime, which has no concept of a unary node with an associated mutation (mutations are always placed above binary nodes)
These should be seen as analogous to the start and stop arguments to Python's builtin range, and be specified in tree sequence coordinates. We should initially ensure that start < stop, but in future we could allow start > stop, and enable reverse iteration in this way.
It would be best to do this with a low-level sparse_tree_seek() method to position the tree at the correct location.
It's tedious and error-prone to call x_table.set_columns(c1=x_table.c1, ...) when you only want to change one column.
We can do things like node_table.flags = np.zeros_like(node_table.flags) to set the full flags column in one go using some property magic. Does this seem like a good idea @petrelharp, @molpopgen?
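A minimal sketch of how that property magic might look (toy class, not the real implementation):

```python
import numpy as np

class NodeTable:
    """Toy table with a settable ``flags`` column, sketching the proposal."""

    def __init__(self, num_rows):
        self._flags = np.zeros(num_rows, dtype=np.uint32)

    @property
    def flags(self):
        # Hand back a copy so callers can't mutate internal storage directly.
        return self._flags.copy()

    @flags.setter
    def flags(self, value):
        # Validate and copy the whole column in one assignment.
        value = np.asarray(value, dtype=np.uint32)
        if value.shape != self._flags.shape:
            raise ValueError("flags column must keep the same number of rows")
        self._flags = value.copy()
```

With this in place, node_table.flags = np.zeros_like(node_table.flags) replaces the whole column in a single assignment, with shape and dtype checks happening inside the setter.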
Given num_tracked_samples or num_samples at a mutated site, we need to be able to count up how many of these go with each allele. There's a rough possible start at this here.
This is the main (only?) place we'll use the "mutation parent" information.
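A rough pure-Python sketch of the counting step (function and argument names are hypothetical):

```python
from collections import Counter

def allele_counts(genotypes, alleles):
    # Count how many samples carry each allele at a site, given the
    # allele index observed for each sample.
    counts = Counter(genotypes)
    return {alleles[k]: n for k, n in counts.items()}
```

The efficient version would of course use the tree structure and num_tracked_samples rather than iterating over genotypes, but this pins down the expected output.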
It would be useful to have a method to change who is a sample, as for instance at tskit-dev/pyslim#23 .
Here is a prototype:
def mark_samples(ts, samples, keep=True):
    '''
    Returns the same tree sequence except with the nodes listed in `samples`
    marked as samples (keeping existing samples if `keep` is True).
    '''
    if (min(samples) < 0) or (max(samples) >= ts.num_nodes):
        raise ValueError("Illegal samples.")
    tables = ts.dump_tables()
    flags = tables.nodes.flags
    if not keep:
        # clear the sample bit on all nodes first
        flags = np.bitwise_and(flags, ~np.uint32(msprime.NODE_IS_SAMPLE))
    flags[samples] = np.bitwise_or(flags[samples], msprime.NODE_IS_SAMPLE).astype("uint32")
    tables.nodes.set_columns(flags=flags, population=tables.nodes.population,
                             individual=tables.nodes.individual, time=tables.nodes.time,
                             metadata=tables.nodes.metadata,
                             metadata_offset=tables.nodes.metadata_offset)
    ts = tables.tree_sequence()
    return pyslim.SlimTreeSequence(ts)
Currently sort_tables doesn't ensure mutations occur after their parents. It probably should, but this behaviour is documented, so it's not major.
This is a way that tables can be out of order that sort_tables does not fix, so we should probably address it, although we don't have a use case that produces tables out of order in this way.
The only way I can think of to do this is to add a field to the mutation struct used for sorting that is "number of hops to the root", computed by following mutation.parent indices until you hit -1.
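That sort key can be sketched in a few lines (assuming a parent array with -1 for mutations that have no parent):

```python
def mutation_depth(parent, j):
    # "Number of hops to the root" for mutation j: follow parent indices
    # until we hit the -1 sentinel.
    depth = 0
    while parent[j] != -1:
        j = parent[j]
        depth += 1
    return depth

parent = [-1, 0, 1, -1]
# A stable sort on depth guarantees every mutation comes after its parent.
order = sorted(range(len(parent)), key=lambda j: mutation_depth(parent, j))
```

In C this would be one extra field in the sorting struct, filled in with the same parent-chasing loop before the qsort comparison runs.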
#391 changed the TreeSequence.samples method to return a numpy array, but it did this by creating a new array from a list returned by the low-level API. Update the low-level API to do the actual population query, as well as natively returning a numpy array.
Currently to add rows correctly one needs to import _msprime.NODE_IS_SAMPLE, which is suboptimal. Proposal: default flags is 0 but can still be passed in, and gets OR'ed with _msprime.NODE_IS_SAMPLE if is_sample=True.
I am writing #213 assuming this exists; will need to modify if it does not.
I suppose this should also be an option to .set_columns(), working just the same.
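The proposed default could be as simple as this sketch (the NODE_IS_SAMPLE value here is assumed for illustration):

```python
NODE_IS_SAMPLE = 1  # assumed flag value, for illustration only

def resolve_flags(flags=0, is_sample=False):
    # Sketch of the proposed add_row() behaviour: flags defaults to 0
    # and gets OR'ed with NODE_IS_SAMPLE when is_sample=True.
    if is_sample:
        flags |= NODE_IS_SAMPLE
    return flags
```

Passing both flags and is_sample=True then composes naturally, e.g. resolve_flags(2, is_sample=True) keeps the caller's bits and adds the sample bit.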
It would be useful to have equality testing for trees on the same tree sequence. The semantics are already implemented in C, so we just need to hook into this code.
Many of the cases in the simplifier_check_input function are not tested here.
This would be a good starter issue for someone looking to get familiar with the C API.
It would be very handy to have an 'info' (or something similar) command in the CLI that writes out some basic information about a tree sequence: number of nodes, edges, sites, mutations, for a start.
We need an efficient way to count the number of mutations on a particular branch. This should be exposed to the API as a new method SparseTree.num_branch_mutations(u), or something similar. It may also be useful to provide an iterator over the actual mutations as SparseTree.branch_mutations(u).
See #113 for motivation.
cc @petrelharp
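A naive reference for what the counting method should return (not the efficient tree-based version):

```python
def num_branch_mutations(mutation_nodes, u):
    # Count the mutations on the branch above node u, where mutation_nodes
    # lists the node of each mutation present in the current tree.
    return sum(1 for node in mutation_nodes if node == u)
```

The real implementation would presumably index mutations by node once per tree rather than scanning, but any efficient version should agree with this.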
Here's some closely related use cases that we're seeing (e.g. #612). I'd like to collect them in one place so we can settle on a minimal number of useful methods and an API. So far:
(ancestry painting) Given a list of nonoverlapping sets of nodes, the ancestors
, and a collection of samples, return somehow an assignment for each sample of which set of nodes it inherits from on each part of the genome (if any).
(ancestry proportions) Instead of the whole ancestry painting, return summaries of it in windows: what proportion of the samples's genomes inherit from each set of ancestors, in each window.
(number of descendants) Given a list of samples and a list of nonoverlapping sets "ancestral" nodes, return for each set of ancestors what proportion of the samples' genomes that inherit from that set of nodes. (This should also be done in windows.)
The last two sound very similar, but one is indexed by ancestors; the other by descendants.
The newick format will output a single tree. There is a "standard" format for sets of trees, which is NEXUS. I have code for writing a tree sequence out to nexus, which is fairly simple, but requires (1) the ability to output newick trees which label tips from 1..N_samples (by default msprime has 0..(N_samples-1)); (2) a "TRANSLATE" header mapping the msprime labels to nexus labels; (3) a "name" for each tree, which should probably be the upper limit of the interval over which this tree applies.
Slightly surprisingly, I can't pass a numpy array of type np.int to ts.simplify() on my machine, as it defaults to np.int64, and is then downcast somewhere in simplify, causing the error:
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
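Until that's fixed, the workaround is an explicit cast on the caller's side:

```python
import numpy as np

samples = np.arange(10)  # int64 by default on most 64-bit platforms
# Explicitly downcast to the int32 node ID type before calling simplify().
samples32 = samples.astype(np.int32)
```

Alternatively, simplify() could accept any integer dtype and perform this cast internally, raising only if values overflow int32.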
It should now be possible to make a tree sequence based on a "borrowed" or "stolen" reference to a table_collection_t
. For a borrowed reference, we store a pointer to the supplied tables, and do not free these tables when the tree sequence object is destroyed. For a stolen reference, we store a pointer to the supplied table collection which we free when the tree sequence object is destroyed. These can be specified with (mutually exclusive) flags. The default behaviour should be the present case, where we take a copy of the argument table_collection_t.
This behaviour will be useful for simulations, where we really don't need to have two copies of the same tables. However, we'll need to be careful to ensure that the underlying tables don't get modified. Possibly need to add some locks to the tables to ensure this.
See also tree_sequence_load for current wasteful behaviour.
Migrations should be sorted by time by sort_tables. See also #389.
Currently to write out a table collection, you have to make it into a tree sequence first, which then just calls table_collection_dump() under the hood. Unless there's a good reason not to, it'd be convenient sometimes to dump the table collection directly.
It's useful to be able to 'squash' redundant edges (i.e., two edges (l, x, p, c) and (x, r, p, c) squash into (l, r, p, c)), and we should add a public function to the tables API for this. The C function already exists, so it's just a question of setting up the plumbing.
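A pure-Python reference for the squashing operation (assuming the edges are sorted so that squashable pairs are adjacent):

```python
def squash_edges(edges):
    # Merge adjacent edges (l, x, p, c) and (x, r, p, c) with the same
    # parent and child into a single edge (l, r, p, c).
    out = []
    for left, right, parent, child in edges:
        if out and out[-1][1] == left and out[-1][2:] == (parent, child):
            out[-1] = (out[-1][0], right, parent, child)
        else:
            out.append((left, right, parent, child))
    return out
```

The public API call would just be plumbing down to the existing C function; this only documents the expected semantics.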
This would be a good starter issue for someone looking to figure out the connections between the Python APIs and C code.
As @molpopgen suggested here, it would be good to truncate the metadata output when we call __str__ on NodeTable, SiteTable and MutationTable.
Other behaviour would be desirable if a decoder for the metadata were present.
Hello, thanks for the excellent software! This is a request for a feature rather than an `issue'.
It would be really useful to be able to track the local ancestry of each leaf along a chromosome, so that the simulated chromosomes can be 'painted' according to their population of origin. I.e. traverse back up the tree from a leaf until you get to the ancestor at the time of the relevant admixture, at every tree along the sequence.
There is currently no support for reading/writing migrations as text.
Should be a straightforward update.
x_table_set_columns should allow a NULL input when the number of rows is zero.
There are several options on the tree.draw method that the ascii backend ignores. These should be updated and fixed.
At a minimum, we should update it to support drawing mutations.
from @molpopgen over at #332.
One thing I realized that may be worth a mention in the manual: it seems like an ideal format for metadata encoded from a Python source is a pickled instance of collections.namedtuple. It satisfies what one wants from JSON (field names) but has the space efficiency of the regular tuple, is easily converted to pandas.DataFrame records, etc.
We should provide access to the indexes in TableCollection, i.e., to the arrays themselves and to the build_indexes and drop_indexes methods.
At the moment the msprime/tskit API defines SparseTree.length to be the genomic length covered by a tree. It can be quite confusing when branches on a tree can have "lengths" too. I've been trying to use the word "span" instead of "length" to refer to the genome span covered by an edge or a tree. Would it be sensible to add the attribute .span as a preferred alternative to .length on a SparseTree (and perhaps change the description in other places in the documentation where we refer to extent over a genomic region, such as the "span" (i.e. length) of an edge)?
Thoughts:
There's only quite such an easy correspondence between branch length and # of mutations if we assume all mutations are distinguishable. So to exactly match the branch length statistics, we'd ignore the fact that there may be double mutations. But to match what we see in the sequence (like we actually want to do): take divergence: a mutation that occurs on branches leading to both samples should be counted zero times, not twice. But if the mutations returned by branch_mutations know where else on the tree they occur, then this is doable with a bit more bookkeeping. For instance: first update the X's for each edge; then iterate over the mutations, evaluating (and caching) f(\sum_{branches} X) for each mutation to get the weight for that mutation. Or maybe just use the original algorithm but then go and look individually at all the recurrent mutations to adjust.
If this is easy, it would be useful to be able to iterate over e.g. nodes in a tree sequence in forward time as well as backwards time, like
for n in reversed(ts.nodes()):
I guess this might also be useful for ts.edges() and the like, but that might turn out to be difficult.
Currently, the row of the LD matrix corresponding to site A is filled out as follows (roughly, iiuc): for each other site B, we count the samples carrying the derived allele at both A and B by tracking those samples having the derived allele at A and using num_tracked_samples, with total counts from num_samples. To extend this to recurrent but still biallelic mutations, we need only add in the intermediate step of counting alleles, #298.
If we want to compute this for more than biallelic mutations, here's one way forward: note that if X and Y take values in {0, 1}, and if we define X' and Y' to each have the same distribution as X and Y respectively, but independent of each other and everything else, then
cov[X,Y] = (1/2) * ( P(X == Y) - P(X' == Y') )
and so the r2 measure of LD is
r2[X,Y] = ( P(X == Y) - P(X' == Y') )^2 / ( (1 - P(X == X')) * (1 - P(Y == Y')) )
This can be computed in general, and would involve tracking the sets of samples at A that have each possible allele, as in #297.
But, maybe we don't want to bother with multiallelic LD at the moment.
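A quick numerical check of the identity above on 0/1 data, comparing the match-probability form against the usual cov^2 / (var * var):

```python
import numpy as np

def r2_from_match_probs(x, y):
    # r^2 via the identity: uses only P(X == Y) and the "independent
    # copies" probabilities P(X' == Y'), P(X == X'), P(Y == Y').
    p_eq = np.mean(x == y)
    px, py = np.mean(x), np.mean(y)
    p_eq_ind = px * py + (1 - px) * (1 - py)   # P(X' == Y')
    p_xx = px ** 2 + (1 - px) ** 2             # P(X == X')
    p_yy = py ** 2 + (1 - py) ** 2             # P(Y == Y')
    return (p_eq - p_eq_ind) ** 2 / ((1 - p_xx) * (1 - p_yy))

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 1, 0, 0])
classic = np.cov(x, y, bias=True)[0, 1] ** 2 / (np.var(x) * np.var(y))
```

The two forms agree exactly on 0/1 data (it is an algebraic identity), which is what makes the match-probability version attractive: it generalizes to multiple alleles where covariance does not.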
It is a very common thing to need to (a) first go through the individuals in a tree sequence and find their population, and then (b) subsample according to population. Right now, we might get the list of individuals by population like so:
pop_inds = [[] for _ in range(ts.num_populations)]
for ind in ts.individuals():
    pops = [ts.node(n).population for n in ind.nodes]
    for pop in set(pops):
        pop_inds[pop].append(ind.id)
... and I've omitted the error checking ensuring no individual has nodes in more than one population. This would be easier if we could do
pop_inds = [pop.individuals for pop in ts.populations()]
This suggestion is somewhat fraught because of the possibility that an individual's nodes may have different populations (although if they do, it's probably an error). So, I propose throwing an error if any individual's nodes have more than one population associated. (This check could be an optional argument, but I can't think of when this wouldn't be an error.)
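A sketch of the proposed accessor with that error check, over a toy input (the real version would read from the node and individual tables):

```python
def individuals_by_population(individual_node_pops, num_populations):
    # individual_node_pops: list of (individual_id, [population of each node]).
    # Raise if an individual's nodes span more than one population, as
    # proposed above; otherwise bucket individuals by population.
    pop_inds = [[] for _ in range(num_populations)]
    for ind_id, pops in individual_node_pops:
        if len(set(pops)) > 1:
            raise ValueError(
                "Individual {} has nodes in more than one population".format(ind_id))
        if pops:
            pop_inds[pops[0]].append(ind_id)
    return pop_inds
```

An optional flag could relax the check, but as noted above it is hard to think of a case where mixed-population individuals are not an error.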
#391 changed the TreeSequence.samples method to return a numpy array. Change the .trees() method to use a numpy array directly without first building a Python list. Needs an update to the low-level C API.
Currently write_vcf has a ploidy argument, and will assume that adjacent samples belong to the same individual. This should be deprecated after we provide a make_individuals method, and write_vcf should have an argument that toggles whether to write individuals or genomes (maybe: output either monoploid by default, or phased or unphased).
It is present for SparseTree.samples(), but not ts.samples().
We do not currently document left_child, etc.
The metadata API allows you to store any byte string along with nodes, sites and mutations. This allows us to store pickled Python objects, for example, which will work fine. However, this isn't great for portability:
It would be better if we could use some third-party approach for encoding metadata so that we can automatically decode the stored information into a Python object using a given schema. I'm thinking of things like JSON-schema and protobuf. If the schema is also stored in the HDF5 file, then the schema goes along with the data, making it much more likely the data can be interpreted correctly later.
I've done a little initial experimentation with this in #330, which shows how we can use JSON-schema, where the user is responsible for managing the schemas and so on. Here is a sketch of how it might work in the future if we put in some infrastructure for automatically decoding metadata into Python objects via a few pluggable hooks:
# Your schema for custom metadata defined using json-schema
schema = """{
"title": "Example Metadata",
"type": "object",
"properties": {
"one": {"type": "string"},
"two": {"type": "string"}
},
"required": ["one", "two"]
}"""
# Load this schema into an object builder, so that we make metadata objects
# Using https://github.com/cwacek/python-jsonschema-objects
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
# Make a new metadata object, and assign it to a row in the nodes table.
metadata = ns.ExampleMetadata(one="node1", two="node2")
encoded = json.dumps(metadata.as_dict()).encode()
nodes.add_row(time=0.125, metadata=encoded)
# Load this table into a tree sequence, and store the schema so it can be retrieved later.
ts = msprime.load_tables(nodes=nodes, edges=edges, sequence_length=1)
ts.node_metadata_schema = schema
ts.dump("metadata-example.hdf5")
# Later, on another system, etc, we can load up the metadata and work with it.
ts = msprime.load("metadata-example.hdf5")
schema = json.loads(ts.node_metadata_schema)
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
def decoder(encoded):
    return ns.ExampleMetadata.from_json(encoded.decode())
# If a decoder function is set, accesses to ``node.metadata`` are intercepted and the decoder
# function is used to translate the raw bytes into a Python object.
ts.node_metadata_decoder = decoder
node = ts.node(0)
node.metadata.one # == "node1"
node.metadata.two # == "node2"

This is all a bit vague, but I thought I'd open an issue to start discussion on the topic. This is definitely not something we should tackle before 0.5.0 is released --- none of this is backwards incompatible with what we have now.
I would like to be able to produce fasta files directly from the binary VCF file using some nucleotide evolution model like HKY.
#391 changed the TreeSequence.samples method to return a numpy array, and we should aim to make all APIs that interact with sets of samples work efficiently with numpy arrays of node IDs. The tree stats API currently uses a lot of list operations, and we should update it to use numpy fully.
cc @petrelharp.
When changing the sort_tables and simplify_tables functions to be methods of the TableCollection object, I removed the complexity around dropping the GIL. This can be reinstated easily enough, but there are quite subtle issues around it so I didn't want to do it in a hurry.
Is this still important to you @molpopgen?
compare
display(HTML("".join([tree.draw() for tree in msprime.simulate( sample_size=7, Ne=1000, length=1e4, recombination_rate=2e-8, random_seed=seed).trees()])))
with
display(HTML("".join([tree.draw(node_labels={a:a for a in range(14)}) for tree in msprime.simulate( sample_size=7, Ne=1000, length=1e4, recombination_rate=2e-8, random_seed=seed).trees()])))
Simplify currently ignores migration records. We need to update simplify so that migrations are also processed. This should be simple to do; we just use the node ID map to translate the node IDs in the old table into new node IDs. If the old node ID maps to -1, we leave the migration out.
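The translation step can be sketched directly (migration row layout assumed here):

```python
def remap_migrations(migrations, node_map):
    # Translate migration node IDs through the simplify node map, dropping
    # rows whose node maps to -1 (i.e. was removed by simplify).
    # Each migration is (left, right, node, source, dest, time).
    out = []
    for left, right, node, source, dest, time in migrations:
        new_node = node_map[node]
        if new_node != -1:
            out.append((left, right, new_node, source, dest, time))
    return out
```

The only design question is whether dropping unmapped rows silently is always right, or whether a flag should make it an error.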
At the moment, for large trees with custom labels, labels from SparseTree.draw(format="svg") are bound to overlap. An option to rotate the labels by 90 degrees would be helpful, and probably very easy to code.