Python 0.3.0
Major feature release
This release adds metadata schemas, set-like operations, mutation times, SVG drawing improvements and many other features. It also comes with wheels for Windows, macOS and Linux.
❤️ Many thanks go to the tskit community and contributors for their awesome work on this release. ❤️
Breaking changes
-
The default display order for tree visualisations has been changed to minlex (see below) to stabilise the node ordering and to make trees more readily comparable. The old behaviour is still available with order="tree".
-
File system operations such as dump/load now raise an appropriate OSError instead of tskit.FileFormatError. Loading from an empty file now raises an EOFError.
-
Bad tree topologies are detected earlier, so that it is no longer possible to create a TreeSequence object which contains a parent with contradictory children on an interval. Previously an error was thrown when some operation building the trees was attempted (@jeromekelleher, #709).
-
The TableCollection object no longer implements the iterator protocol. Previously list(tables) returned a sequence of (table_name, table_instance) tuples. This has been replaced with the more intuitive and future-proof TableCollection.name_map and TreeSequence.tables_dict attributes, which perform the same function (@jeromekelleher, #500, #694).
-
The arguments to TreeSequence.genotype_matrix, TreeSequence.haplotypes and TreeSequence.variants must now be keyword arguments, not positional. This is to support the change from impute_missing_data to isolated_as_missing in the arguments to these methods (@benjeffery, #716, #794).
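The idea behind the new minlex default ordering can be illustrated without tskit. The following is a simplified sketch, not tskit's implementation: it assumes a tree stored as a hypothetical dict mapping each node to a list of its children, and visits the children of every node in order of the smallest leaf ID found beneath them. The function names and the example trees are made up for illustration.

```python
def min_leaf(children, u):
    """Return the smallest leaf ID in the subtree rooted at u."""
    kids = children.get(u, [])
    if not kids:
        return u  # u is itself a leaf
    return min(min_leaf(children, v) for v in kids)


def minlex_postorder(children, root):
    """Postorder traversal that visits children by their smallest leaf ID."""
    order = []
    kids = sorted(children.get(root, []), key=lambda c: min_leaf(children, c))
    for v in kids:
        order.extend(minlex_postorder(children, v))
    order.append(root)
    return order


# Two hypothetical encodings of the same topology; only the stored
# order of the children differs.
tree_a = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
tree_b = {6: [5, 4], 4: [1, 0], 5: [3, 2]}

# Both encodings give the same traversal: [0, 1, 4, 2, 3, 5, 6]
assert minlex_postorder(tree_a, 6) == minlex_postorder(tree_b, 6)
print(minlex_postorder(tree_b, 6))
```

Because the traversal no longer depends on how the children happen to be stored, two drawings of the same topology come out with the same layout, which is what makes the trees easier to compare.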
New features
-
New methods to perform set operations on TableCollections and TreeSequences. TableCollection.subset subsets and reorders table collections by nodes (@mufernando, @petrelharp, #663, #690). TableCollection.union forms the node-wise union of two table collections (@mufernando, @petrelharp, #381, #623).
-
Mutations now have an optional double-precision floating-point time column. If not specified, this defaults to a particular NaN value (tskit.UNKNOWN_TIME) indicating that the time is unknown. For a tree sequence to be considered valid it must meet new criteria for mutation times; see mutation requirements. Also added the function TableCollection.compute_mutation_times. Table sorting orders mutations by non-increasing time per site, which is also a requirement for a valid tree sequence (@benjeffery, #672).
-
Add support for trees with internal samples for the Kendall-Colijn tree distance metric. (@daniel-goldstein, #610)
-
Add background shading to SVG tree sequences to reflect tree position along the sequence (@hyanwong, #563).
-
Tables with a metadata column now have a metadata_schema that is used to validate and encode metadata that is passed to add_row, and to decode metadata on calls to table[j] and e.g. tree_sequence.node(j). See metadata (@benjeffery, #491, #542, #543, #601).
-
The tree-sequence now has top-level metadata with a schema (@benjeffery, #666, #644, #642).
-
Add classes to SVG drawings to allow easy adjustment and styling, and document the new tskit.Tree.draw_svg() and tskit.TreeSequence.draw_svg() methods. This also fixes #467 for duplicate SVG entity ids in Jupyter notebooks (@hyanwong, #555).
-
Add a to_nexus function that outputs a tree sequence in Nexus format (@saunack, #550).
-
Add an extension of the Kendall-Colijn tree distance metric to tree sequences, computed by TreeSequence.kc_distance (@daniel-goldstein, #548).
-
Add an optional node traversal order in tskit.Tree that uses the minimum lexicographic order of leaf nodes visited. This ordering ("minlex_postorder") adds more determinism because it constrains the order in which children of a node are visited (@brianzhang01, #411).
-
Add an order argument to the tree visualisation functions which supports two node orderings: "tree" (the previous default) and "minlex", which stabilises the node ordering (making it easier to compare trees). The default node ordering is changed to "minlex" (@brianzhang01, @jeromekelleher, #389, #566).
-
Add _repr_html_ to tables, so that Jupyter notebooks render them as HTML tables (@benjeffery, #514).
-
Remove support for kc_distance on trees with unary nodes (@daniel-goldstein, #508).
-
Improve Kendall-Colijn tree distance algorithm to operate in O(n^2) time instead of O(n^2 * log(n)) where n is the number of samples (@daniel-goldstein, #490).
-
Add a metadata column to the migrations table. Works similarly to existing metadata columns on other tables (@benjeffery, #505).
-
Add a metadata column to the edges table. Works similarly to existing metadata columns on other tables (@benjeffery, #496).
-
Allow sites with missing data to be output by the haplotypes method, by default replacing them with the "-" character. Errors are no longer raised for missing data with isolated_as_missing=True; the error types raised for bad alleles (e.g. multiletter or non-ASCII) have also changed from _tskit.LibraryError to TypeError, or ValueError if the missing data character clashes (@hyanwong, #426).
-
Access the number of children of a node in a tree directly using tree.num_children(u) (@hyanwong, #436).
-
User-specified allele mapping for genotypes in variants and genotype_matrix (@jeromekelleher, #430).
-
New root_threshold option for the Tree class, which allows us to efficiently iterate over 'real' roots when we have missing data (@jeromekelleher, #462).
-
Add a tree.as_dict_of_dicts() function to enable use with networkx. See the tutorial (@winni2k, #457).
-
Add a tree_sequence.to_macs() function to convert a tree sequence to MACS format (@winni2k, #727).
-
Add a keep_input_roots option to simplify which, if enabled, adds edges from the MRCAs of samples in the simplified tree sequence back to the roots in the input tree sequence (@jeromekelleher, #775, #782).
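The note above that unknown mutation times are marked by a particular NaN value has a practical consequence: NaN never compares equal to anything, including itself, so such a sentinel has to be recognised by its bit pattern rather than with ==. The sketch below illustrates the idea using only the standard library. The payload constant and the helper name are hypothetical stand-ins chosen for illustration, and no claim is made that they match tskit's internals; in real code use tskit.UNKNOWN_TIME and whatever helpers the library documents.

```python
import math
import struct

# Hypothetical sentinel: a quiet NaN with a non-zero payload bit.
# This bit pattern is an assumption made for the sketch.
SENTINEL_BITS = 0x7FF8000000000001
UNKNOWN = struct.unpack("<d", struct.pack("<Q", SENTINEL_BITS))[0]


def is_unknown(t):
    """Return True only if t has exactly the sentinel's bit pattern."""
    return struct.pack("<d", t) == struct.pack("<Q", SENTINEL_BITS)


# The sentinel is a NaN, so equality tests cannot be used to detect it.
assert math.isnan(UNKNOWN)
assert UNKNOWN != UNKNOWN

# Comparing bit patterns distinguishes it from an ordinary NaN and from
# ordinary times.
assert is_unknown(UNKNOWN)
assert not is_unknown(math.nan)
assert not is_unknown(1.5)
```

Keeping the sentinel distinct from an ordinary NaN means that a NaN produced by a failed calculation is not silently treated as "time unknown".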
Bugfixes
- #453 - Fix LibraryError when tree.newick() is called with large node time values (@jeromekelleher, #637).
Deprecated
- The sample_counts feature has been deprecated and is now ignored. Sample counts are now always computed.
- For TreeSequence.genotype_matrix, TreeSequence.haplotypes and TreeSequence.variants the impute_missing_data argument is deprecated and replaced with isolated_as_missing. Note that to get the same behaviour impute_missing_data=True should be replaced with isolated_as_missing=False (@benjeffery, #716, #794).
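Since the two flags have opposite meanings, it is easy to get the translation backwards when updating calling code. The following sketch shows one way a wrapper in user code might map the deprecated argument onto the new one. The helper resolve_isolated_as_missing is hypothetical and is not part of tskit, and the default of True used when neither argument is given is an assumption of this sketch.

```python
import warnings


def resolve_isolated_as_missing(*, isolated_as_missing=None, impute_missing_data=None):
    """Translate the deprecated flag into the new one (hypothetical helper).

    Both parameters are keyword-only, mirroring the requirement that these
    arguments are no longer accepted positionally.
    """
    if impute_missing_data is not None:
        if isolated_as_missing is not None:
            raise ValueError("Specify only one of the two arguments")
        warnings.warn(
            "impute_missing_data is deprecated; use isolated_as_missing",
            FutureWarning,
        )
        # The flags are logical opposites: imputing means isolated samples
        # are NOT treated as missing.
        return not impute_missing_data
    # Assumed default for this sketch: isolated samples count as missing.
    return True if isolated_as_missing is None else isolated_as_missing


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # impute_missing_data=True corresponds to isolated_as_missing=False
    assert resolve_isolated_as_missing(impute_missing_data=True) is False
```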