Background: Reconstructing complete ancestral genomes (at least in terms of their gene inventory and arrangement) is attracting much interest due to the rapidly increasing availability of whole genome sequences. While modest successes have been reported for mammalian and even vertebrate genomes, more divergent groups continue to pose a stiff challenge, mostly because current models of genomic evolution support too many choices. Results: We describe a novel type of genomic signature based on rearrangements that characterizes evolutionary changes that must be common to all minimal rearrangement scenarios; by focusing on global patterns of rearrangements, such signatures bypass individual variations and sharply restrict the search space. We present the results of extensive simulation studies demonstrating that these signatures can be used to reconstruct accurate ancestral genomes and phylogenies even for widely divergent collections. Conclusion: Focusing on genome triples rather than genome pairs unleashes the full power of evolutionary analysis. Our genomic signature captures shared evolutionary events and thus can form the basis of a robust analysis and reconstruction of evolutionary history.

from The Seventh Asia Pacific Bioinformatics Conference (APBC 2009) Beijing, China. 13-16 January 2009

The study of evolution is a study of patterns of change, but
also of conservation, the latter being typically easier to
detect and characterize. Moreover, elements conserved
across many species were probably present in their last
common ancestor and preserved through selection
pressures, so that these conserved elements probably play a
major role in the fitness of the organisms. Biologists have
long studied patterns of conservation in DNA sequences:
first pairwise sequence similarity in large databases (as in
the widely used FASTA [

The focus to date in evolutionary genomics has been on DNA sequence evolution, in part because of the nature of the available data (collections of gene sequences form the overwhelming majority of biomolecular data) and in part because of their relative simplicity. The assumed model of evolution has been a simple process of point mutation and gap-forming indels. However, other processes affect the evolution of a genome, including large-scale events that rearrange genes along the chromosomes, introduce new genes, or remove existing ones. Rearrangements, in particular, interfere with our ability to align sequences: for instance, a single inversion (in which a segment of genes is reversed in place) can make two sequences unalignable under the mutation and indels model.

As more and more genomes are fully sequenced, interest
in reconstructing complete ancestral genomes has grown;
Pevzner's group, for instance, has published extensively
on the topic in the context of vertebrate genomes (see,
e.g., [

As is usual in the study of rearrangements, we represent a chromosome of $n$ genes by a signed permutation on the elements $\{1, 2, \ldots, n\}$. Given a signed permutation $\pi$, an inversion $r(i, j)$ is a permutation that, when applied to $\pi$, reverses the order and the sign of the segment of $\pi$ that begins at the $i$th gene and ends at the $j$th one. Thus, if we write the identity permutation as $1, \ldots, i-1, i, i+1, \ldots, j-1, j, j+1, \ldots, n$, then $r(i, j)$ becomes $1, \ldots, i-1, -j, -(j-1), \ldots, -(i+1), -i, j+1, \ldots, n$. We write $r\pi$ for the application of inversion $r$ to permutation $\pi$. For signed permutations $\pi$ and $\pi'$, the (inversion) edit distance $d(\pi, \pi')$ is the minimum number of inversions needed to transform $\pi$ into $\pi'$. We say that a sequence of permutations $\pi_0, \pi_1, \ldots, \pi_d$ forms an edit path if for all $\pi_i$, $0 \le i < d$, we have $d(\pi_i, \pi_{i+1}) = 1$; each inversion applied along this path is then deemed an edit inversion. Taking each $\pi_i$ to be a vertex and linking two vertices with an edge whenever the corresponding permutations occur consecutively on an edit path creates an edit path graph. The relation "is on the edit path from" thus induces a partial order, the edit partial order, or EPO. We denote the EPO between $\pi_0$ and $\pi_d$ as $\mathrm{EPO}_{\pi_0}(\pi_d)$ or $\mathrm{EPO}_{\pi_d}(\pi_0)$. So if we have $\pi_3$ = 2 -1 -3 and $\pi_0$ = 1 2 3, then an edit path between them might visit permutations $\pi_2$ = -2 -1 -3 and $\pi_1$ = -2 -1 3 before reaching $\pi_0$. Figure 1 shows the EPOs for 2 -1 -3 and -2 3 1.

Figure 1: The union of the inversion lattices for $\pi_A$, $\pi_B$, and $\pi_L$ = 1 2 3. The signature graph is highlighted in bold.
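The definitions above can be tried out on very small permutations. The following is a minimal sketch, not the authors' implementation: it represents a signed permutation as a Python tuple, applies an inversion given by 1-based positions, and computes the edit distance by breadth-first search. The function names are illustrative, and the search is exponential in the number of genes, so it is only meant for toy sizes (it is not the linear-time distance computation mentioned later in the text).

```python
from collections import deque
from itertools import combinations_with_replacement


def apply_inversion(perm, i, j):
    """Apply r(i, j): reverse positions i..j (1-based, inclusive) and flip signs."""
    segment = tuple(-gene for gene in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


def neighbors(perm):
    """Yield every permutation that is exactly one inversion away from perm."""
    n = len(perm)
    for i, j in combinations_with_replacement(range(1, n + 1), 2):
        yield apply_inversion(perm, i, j)


def inversion_distance(source, target):
    """Brute-force edit distance by breadth-first search (tiny n only)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for nxt in neighbors(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    raise ValueError("target is not a signed permutation of the same genes")


# The edit path from the text: 2 -1 -3 -> -2 -1 -3 -> -2 -1 3 -> 1 2 3.
pi3 = (2, -1, -3)
pi2 = apply_inversion(pi3, 1, 1)
pi1 = apply_inversion(pi2, 3, 3)
pi0 = apply_inversion(pi1, 1, 2)
```

Each step on this path applies a single inversion, and the breadth-first search confirms that no shorter sequence of inversions connects the two endpoints.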

We are interested in the intersection of EPOs, which will yield the desired inversion signatures. For a set of $k + 1$ permutations, one of which is the reference permutation called the locus, an inversion signature is the permutation corresponding to a vertex in the intersection of the $k$ EPOs from each of the other $k$ permutations to the locus.

Definition 1 The set of all inversion signatures for permutations $\pi_1, \ldots, \pi_k$ with locus $\pi_L$ is

$$S_{\pi_L}(\pi_1, \ldots, \pi_k) = V\left(\mathrm{EPO}_{\pi_L}(\pi_1) \cap \mathrm{EPO}_{\pi_L}(\pi_2) \cap \cdots \cap \mathrm{EPO}_{\pi_L}(\pi_k)\right),$$

where $V(G)$ denotes the set of vertices of graph $G$. Whenever the context is unambiguous, we shall simply write $S_{\pi_L}$ for $S_{\pi_L}(\pi_1, \ldots, \pi_k)$. Similarly, the signature graph on $\pi_1, \ldots, \pi_k$ with respect to $\pi_L$ is the graph $\mathrm{EPO}_{\pi_L}(\pi_1) \cap \mathrm{EPO}_{\pi_L}(\pi_2) \cap \cdots \cap \mathrm{EPO}_{\pi_L}(\pi_k)$. An inversion signature $\pi_s \in S_{\pi_L}$ is thus a permutation that embodies some of the commonality between the $k$ other permutations with respect to $\pi_L$, in the sense that they all possess an edit path to $\pi_L$ that passes through $\pi_s$. A maximum signature is a signature in $S_{\pi_L}$ that is as far away from $\pi_L$ (and thus as close to the $k$ other permutations) as possible.

Definition 2 The set of all maximum signatures is

$$S^{*}_{\pi_L} = \{\pi_s \in S_{\pi_L} \mid \text{for all } \pi'_s \in S_{\pi_L},\ d(\pi_L, \pi_s) \ge d(\pi_L, \pi'_s)\}.$$

A maximum inversion signature is thus a permutation that represents the "maximum commonality" between the $k$ permutations: it is as close to these $k$ permutations as possible while still being part of all edit paths to $\pi_L$. From a biological perspective, this edit path from $\pi_L$ to the signature can be thought of as the evolution that happened before speciation, or the pattern of change that the $k$ sequences have in common.

As with the special case for Steiner points called the
median [

In Figure 1 we have $\pi_A$ = 2 -1 -3, $\pi_B$ = -2 3 1, and $\pi_L$ = 1 2 3 (the identity permutation of length 3). The signature graph is outlined in bold. The signatures in this case are -2 -1 -3, -2 -1 3, 1 2 -3, and the trivial signature $\pi_L$ = 1 2 3. The only maximum signature, which is also the only maximal signature, is -2 -1 -3.
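The Figure 1 example is small enough to check exhaustively. The sketch below is an illustration under a stated assumption rather than the authors' code: it treats a permutation as a vertex of the EPO between the locus and a genome exactly when it lies on some shortest inversion path between them, and then intersects these vertex sets as in Definition 1. All function names are hypothetical, and the brute-force search only works for a handful of genes.

```python
from collections import deque
from itertools import combinations_with_replacement


def apply_inversion(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip their signs."""
    segment = tuple(-gene for gene in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


def distances_from(source):
    """Inversion distance from source to every signed permutation (BFS)."""
    n = len(source)
    moves = list(combinations_with_replacement(range(1, n + 1), 2))
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for i, j in moves:
            nxt = apply_inversion(current, i, j)
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def signatures(locus, genomes):
    """Permutations lying on a shortest path from the locus to every genome."""
    from_locus = distances_from(locus)
    result = set(from_locus)
    for genome in genomes:
        from_genome = distances_from(genome)
        total = from_locus[genome]
        # Assumption: EPO vertices are exactly the shortest-path vertices.
        result = {s for s in result if from_locus[s] + from_genome[s] == total}
    return result, from_locus


def maximum_signatures(locus, genomes):
    """Signatures that are as far from the locus as possible (Definition 2)."""
    sigs, from_locus = signatures(locus, genomes)
    farthest = max(from_locus[s] for s in sigs)
    return {s for s in sigs if from_locus[s] == farthest}


locus = (1, 2, 3)
pi_a, pi_b = (2, -1, -3), (-2, 3, 1)
all_sigs, _ = signatures(locus, [pi_a, pi_b])
max_sigs = maximum_signatures(locus, [pi_a, pi_b])
```

On this input the search returns the four signatures listed in the text, with -2 -1 -3 as the only maximum signature.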

We begin with an investigation of rearrangement-based genomic signatures as defined above, then give procedures for signature-based phylogenetic and ancestral reconstruction.

Definition 1 can be restated inductively in terms of edit paths that move from the locus $\pi_L$ towards the other permutations $\pi_1, \ldots, \pi_k$. We say that some permutation $\pi$ has a common edit inversion $r$ with respect to $\pi_1, \ldots, \pi_k$ if we observe $d(\pi_L, \pi_i) - d(\pi_L, r\pi_i) = 1$ for $1 \le i \le k$.

Definition 3 The locus $\pi_L$ is an inversion signature for permutations $\pi_1, \ldots, \pi_k$. If permutation $\pi$ is an inversion signature and $r$ is a common edit inversion with respect to $\pi_1, \ldots, \pi_k$, then $r\pi$ is also an inversion signature.
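The inductive definition suggests a simple enumeration that grows signatures outward from the locus. The following sketch is one possible reading, not the authors' procedure: it interprets a common edit inversion at a signature as an inversion that brings that signature one step closer to every one of the other permutations, which keeps the walk on shortest paths leaving the locus. Distances come from a brute-force breadth-first search, so the code is limited to toy sizes; all names are illustrative.

```python
from collections import deque
from itertools import combinations_with_replacement


def apply_inversion(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip their signs."""
    segment = tuple(-gene for gene in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


def all_inversions(n):
    """All position pairs (i, j) with 1 <= i <= j <= n."""
    return list(combinations_with_replacement(range(1, n + 1), 2))


def distances_from(source):
    """Inversion distance from source to every signed permutation (BFS)."""
    moves = all_inversions(len(source))
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for i, j in moves:
            nxt = apply_inversion(current, i, j)
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def enumerate_signatures(locus, genomes):
    """Grow the signature set from the locus via common edit inversions.

    Returns (all signatures found, those with no common edit inversion left).
    """
    to_genome = [distances_from(g) for g in genomes]
    moves = all_inversions(len(locus))
    found, maximal = {locus}, set()
    stack = [locus]
    while stack:
        current = stack.pop()
        steps = []
        for i, j in moves:
            nxt = apply_inversion(current, i, j)
            # Assumed reading: the inversion moves one step closer to all genomes.
            if all(d[nxt] == d[current] - 1 for d in to_genome):
                steps.append(nxt)
        if not steps:
            maximal.add(current)
        for nxt in steps:
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found, maximal


found, maximal = enumerate_signatures((1, 2, 3), [(2, -1, -3), (-2, 3, 1)])
```

For the three-gene example of Figure 1, this walk visits the same four signatures as the definition by intersection and stops at -2 -1 -3, the single maximal signature.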

Thus, starting at the locus (which is the smallest possible
signature), one can enumerate all signatures by repeatedly
applying every possible common edit inversion to the
current collection of signatures; maximal signatures are those
signatures for which no common edit inversion exists and
maximum signatures are the largest of these maximal
signatures (i.e., the farthest away from the locus). Common
edit inversions form the basis for the MGR algorithm of
Bourque and Pevzner [

We say that a set of edit inversions on a permutation $\pi$ commutes iff applying every inversion in the set always yields the same permutation, regardless of the order in which the inversions are applied. (Trivially, inversions that operate on disjoint intervals commute.)

Definition 4 A set of $n$ inversions on $\pi$ with respect to $\pi'$ is noninterfering if and only if it commutes and applying these inversions in any order reduces by $n$ the inversion distance between $\pi$ and $\pi'$.
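Both properties can be checked directly on small cases. The sketch below is illustrative only and makes two assumptions: an inversion is identified by its pair of 1-based positions, and distances are obtained by brute-force breadth-first search rather than by an efficient algorithm. It tests commutation by trying every ordering of the set, then checks that the common result is closer to the target by exactly the size of the set. The function names are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement, permutations


def apply_inversion(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) and flip their signs."""
    segment = tuple(-gene for gene in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


def apply_all(perm, inversions):
    """Apply a sequence of (i, j) inversions from left to right."""
    for i, j in inversions:
        perm = apply_inversion(perm, i, j)
    return perm


def distances_from(source):
    """Inversion distance from source to every signed permutation (BFS)."""
    n = len(source)
    moves = list(combinations_with_replacement(range(1, n + 1), 2))
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for i, j in moves:
            nxt = apply_inversion(current, i, j)
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def commutes(perm, inversions):
    """True if every ordering of the inversions yields the same permutation."""
    results = {apply_all(perm, order) for order in permutations(inversions)}
    return len(results) == 1


def is_noninterfering(perm, target, inversions):
    """Commuting set whose application reduces d(perm, target) by its size."""
    if not commutes(perm, inversions):
        return False
    to_target = distances_from(target)
    result = apply_all(perm, inversions)
    return to_target[result] == to_target[perm] - len(inversions)


identity = (1, 2, 3)
disjoint_ok = is_noninterfering(identity, (-1, 2, -3), [(1, 1), (3, 3)])
overlap_commutes = commutes(identity, [(1, 2), (2, 3)])
nested_commutes = commutes(identity, [(1, 3), (2, 2)])
```

In this toy case the two disjoint single-gene inversions are noninterfering, the two overlapping inversions do not commute, and the nested pair does commute.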

Commuting and noninterfering inversions offer a way to
reduce the search space in computing a median or in
examining all sorting paths: for a set of size n, it is enough
to look at a single ordering of its inversions rather than at
all n! possible orderings [

The concept of noninterfering inversions extends naturally to our framework with a defined ancestor.

Definition 5 A set of inversions $R$ is mutually noninterfering for $\pi_A$ and $\pi_B$ with locus $\pi_L$ if it is noninterfering for $\pi_L$ with respect to $\pi_A$ and also for $\pi_L$ with respect to $\pi_B$.

Such mutually noninterfering sets form the basis for another greedy algorithm: we repeatedly find and apply to $\pi_L$ sets of mutually noninterfering inversions until there are none left. Mutually noninterfering sets can be found very quickly, so a greedy algorithm based on this approach runs very fast. We use this particular greedy heuristic in our experiments.

Since signatures are just nodes along evolutionary paths, they can be used as internal nodes in a process of phylogenetic reconstruction. We begin with a naïve algorithm to illustrate the basic approach.

The idea is to overlay the EPOs from each of the leaves $\pi_1, \ldots, \pi_k$ to the locus $\pi_L$ and construct a tree representative of the resulting structure. Consider the set of these EPOs, $O = \{\mathrm{EPO}_{\pi_L}(\pi_i) \mid 1 \le i \le k\}$; our algorithm constructs a tree from the current version of $O$, iteratively choosing a node from pairwise intersections of graphs in $O$ and updating $O$ to reflect this choice. Specifically, at iteration $i$,

1. select from $O$ a vertex $\pi_s$ that maximizes $d(\pi_L, \pi_s)$;
2. if the vertex selected in the previous step belongs to the intersection of $P_A, P_B \in O$, then create a node in the tree to be the parent of the subtrees represented by $P_A$ and $P_B$;
3. in $O$ replace $\mathrm{EPO}_{\pi_L}(\pi_A)$ and $\mathrm{EPO}_{\pi_L}(\pi_B)$ with their intersection.

This algorithm yields a tree without internal node labels, because EPOs are not closed under intersection, so that a node in the tree may represent two graphs from O that no longer have a least upper bound.

Our second algorithm overcomes this problem; in addition, it yields implicit edit paths from the leaves to the root that join at the internal nodes. In this improved version, we maintain the invariant that elements of O are always EPOs. Thus only the third step of the iteration is affected, and replaced by the following:

in $O$ replace $\mathrm{EPO}_{\pi_L}(\pi_A)$ and $\mathrm{EPO}_{\pi_L}(\pi_B)$ with $\mathrm{EPO}_{\pi_L}(\pi_s)$.

Step 1 in each iteration is obviously the computationally intensive one; our implementation for this step uses the MGR heuristic.

Figure 3: The distances around a signature $\pi_S$.

We develop an upper bound based on pairwise distances to help us evaluate our greedy signature methods in the experimental phase. Denote by $A$, resp. $B$, the inversion distance between the locus and $\pi_A$, resp. $\pi_B$, and by $D$ the inversion distance between $\pi_A$ and $\pi_B$. (Inversion distances can be computed in linear time [

$\frac{D + A - B}{2}$, with the symmetric version for $b$. Without loss of generality, assume $A \ge B$; then we get

$$d(\pi_L, \pi_S) = c \le A - \left\lceil \frac{D + A - B}{2} \right\rceil,$$

the desired upper bound.
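The bound is easy to evaluate from the three pairwise distances. The sketch below rests on an assumed reading of the derivation, since part of it is missing from the text: with $c$, $a$, and $b$ the distances from a signature to the locus, to $\pi_A$, and to $\pi_B$, it takes $c + a = A$ and $c + b = B$ (the signature lies on edit paths from the locus) together with the triangle inequality $a + b \ge D$. These give $a \ge (D + A - B)/2$, and rounding up is safe because $a$ is an integer. The function name is hypothetical.

```python
def signature_upper_bound(A, B, D):
    """Upper bound on c = d(pi_L, pi_S) from the pairwise distances.

    Assumes c + a = A, c + b = B and a + b >= D, so a >= (D + A - B) / 2.
    """
    a_min = (D + A - B + 1) // 2  # integer ceiling of (D + A - B) / 2
    return A - a_min


# Figure 1 example: A = B = 3 and D = d(2 -1 -3, -2 3 1) = 2.
bound_fig1 = signature_upper_bound(3, 3, 2)
```

For the Figure 1 example the bound equals 2, which is exactly the distance from the locus to the maximum signature -2 -1 -3, so the bound is tight there. Swapping the roles of $A$ and $B$ gives the same value, which is why the assumption on their order loses no generality.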

We demonstrate the use of pairwise inversion signatures for ancestral reconstruction and for phylogenetic reconstruction through extensive simulations. We first show that, under certain reasonable conditions, maximum signatures coincide with ancestral genomes most of the time, then proceed to show that, under more stringent conditions, maximum signatures always coincide with ancestral genomes. Since no polynomial-time algorithm for computing maximum signatures is known at present, we show that our heuristics perform well, both in terms of accuracy and running time, even when applied to larger genomes (up to the size of a small prokaryotic genome). Finally, we show that the signature method used for phylogenetic reconstruction produces trees comparable in quality to neighbor-joining while providing ancestral reconstructions along the way.

Our experiments for ancestral reconstruction simply use triplets of genomes generated from an ancestral genome by generating three evolutionary paths, using randomly chosen inversions. The locations of these inversions are distributed uniformly at random, but their lengths are distributed according to one of two possible distributions: uniform and normal. The lengths of the edges from the ancestor to the three leaves are chosen in both a balanced manner and several skewed manners. All of our experiments used 1,000 repetitions unless stated otherwise and the results presented show averages over these 1,000 tests. We present most of our results in the form of tables. Tables 1 through 6 group columns by the percentage of the length of the longest simulated path P in the triplet. For instance, column two of Table 1 shows the percentage of true ancestors that are within 0.15 × |P| inversions of a maximum signature (in this case, no more than one inversion away because |P| is no greater than 8 for any row of column two). The rows in these cases are labeled by the edge length as a percentage of the genome size. The first set of tables applies to triplets where all edges have the same length (that is, the same number of random inversions). Table 1, for normally distributed inversion lengths, shows that the simulated ancestor is a maximum signature most of the time, even when the evolutionary rates are extremely high. When the rates are already high (10% of the genome size), 97% of the true ancestral genomes are maximum signatures. The table also shows that (the last two rows aside) the true ancestor is within 2 inversions of a maximum signature more than 90% of the time. Table 2 shows similar, but slightly weaker, results for uniformly distributed inversion lengths.
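The triplet simulation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the generator used for the reported experiments: the ancestor is taken to be the identity permutation, inversion start positions are uniform over the positions that fit the chosen length, and the mean and standard deviation of the normal length distribution (`mean`, `sd`) are hypothetical parameters that the text does not specify for triplets. All function names are illustrative.

```python
import random


def random_inversion(perm, rng, length_dist="uniform", mean=10.0, sd=5.0):
    """Apply one random inversion; `mean` and `sd` are assumed parameters."""
    n = len(perm)
    if length_dist == "uniform":
        length = rng.randint(1, n)
    else:
        # Normal lengths, rounded and clamped to a valid segment length.
        length = min(n, max(1, round(rng.gauss(mean, sd))))
    start = rng.randint(0, n - length)  # uniform over admissible positions
    segment = tuple(-g for g in reversed(perm[start:start + length]))
    return perm[:start] + segment + perm[start + length:]


def evolve(perm, steps, rng, **kwargs):
    """Apply `steps` random inversions in sequence."""
    for _ in range(steps):
        perm = random_inversion(perm, rng, **kwargs)
    return perm


def simulate_triplet(n, edge_lengths, seed=0, **kwargs):
    """Return (ancestor, three genomes), one genome per edge length."""
    rng = random.Random(seed)
    ancestor = tuple(range(1, n + 1))
    leaves = [evolve(ancestor, steps, rng, **kwargs) for steps in edge_lengths]
    return ancestor, leaves


# Balanced triplet: genome size 50, edges of 5 inversions (10% of the size).
ancestor, leaves = simulate_triplet(50, (5, 5, 5), seed=1, length_dist="normal")
```

Any one of the three generated genomes can then play the role of the locus, with the other two as the genomes whose signature is sought; skewed triplets are obtained by passing unequal edge lengths.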

The next set of tables examines the influence of the size of the genome. Table 3 shows that the accuracy scales well. In addition, we tested genomes of size 100; the results are shown in Table 4.

The exhaustive algorithm rapidly reaches its limits: for genomes of size 100 with edge lengths of 10, computations already take on the order of hours instead of minutes. Table 4 shows favorable results for exhaustive computation of maximum signatures on such genomes. We now proceed to compare these results with those of our new maximal signature algorithms. Under most circumstances, the true ancestor is found by such maximal signature computations. Table 5 shows that the Bourque-like approach and the approach based on noninterfering inversions fare well with respect to the exhaustive search, the latter dropping off first. Table 6 shows results for the two greedy methods on genomes of size 100. For reasonable rates of evolution (10% or less per edge), we again see that the true ancestor is found most of the time. Finally, we tested on genomes of more realistic sizes, but of a size usually considered forbidding for ancestral inference: up to 1,000 genes. With 50 random events per edge the Bourque-like computations take just under 30 minutes, while for 80 random events they take under 2 hours. The accuracy remains very high: in 99% of the 380 trials with 50 random events per edge, the signature returned is within 5 inversions of the true ancestor, while in 66% of these trials, the signature returned is in fact the true ancestor. The approach based on noninterfering inversions is by far the fastest, taking under half a minute for each of these trials, even with 80 random events per edge. Using 50 random inversions per edge, we found that 97% of the 1000 trials gave an ancestor within 5 inversions of the true ancestor, while 57% gave the true ancestor. With 80 events per edge, 91% gave an ancestor within 8 inversions of the true ancestor, while 15% gave the true ancestor.

The largest genomes we tested had size 2000 (corresponding to small bacterial genomes, for instance) with 100 operations per edge, and size 5000 (corresponding to the genomes of free-living bacteria such as E. coli) with 250 operations per edge. All trials gave a signature within 10 inversions of the true ancestor, while 90% gave one within 4 inversions, all running in under 2 minutes per trial for size 2000 and 4 minutes per trial for size 5000. These speeds are enormously higher than those of methods such as MGR or median-based reconstructions, yet the accuracy is also much higher. Thus, by focusing on the characteristic (shared) patterns of inversions, we are able to win on two fronts at once, mostly because we avoid the confusion and long explorations associated with multiple reconverging paths.

The true ancestor will not always be equidistant from the leaves and the locus. While large amounts of skew can sometimes move an ancestor farther from a maximum signature, the true ancestor usually remains very close to a maximum signature. We call the number of random inversions from the locus to the true ancestor c and the number of random inversions from the true ancestor to each of the leaves a and b. We fix a to be 10% of the total length and vary c and b from values equal to a up to 2.5 times a. Table 7 shows that, for genomes of size 50, the true ancestor is a maximum signature in most cases and that almost as often it is a maximal signature found by the Bourque-like greedy method. Our maximum signature method appears slightly more robust to skew on one of the child branches than to skew on the branch to the locus.

We simulated evolution over 300 trees to test our signature-based tree reconstruction method. We found that our method (using the Bourque-like signatures for efficiency) reconstructs the true topology most of the time and that any error remains very small. The trees were constructed using the birth-death model and the mean of the normally distributed edge lengths was varied from 5 to 9 operations with a standard deviation varying from 2 to 3. The mean of the normally distributed inversion lengths was varied from 8 to 30 with a standard deviation varying from 5 to 10. The generated trees have from 5 to 24 taxa and are distributed as shown in Figure 4.

Two methods were used for choosing a locus. The first method used the true root of the tree given by the simulation (an ideal method not available in practice, of course), while the second method used a random leaf as the locus. With the true root as the locus, we found that 94% of the trees were reconstructed perfectly, while 16 of the 17 remaining trees had a Robinson-Foulds error of 2, giving an average RF error of 0.15. With a random leaf as the locus, we found that 85% of the trees were reconstructed perfectly, while 28 of the 45 remaining trees had an RF error of 2 and 11 of the last 27 had an RF error of 4, giving an average RF error of 0.5.

Using the true root as the locus demonstrates that the pairwise signatures contain a great deal of information about the phylogeny. Using a random leaf as the locus demonstrates that such information remains recoverable even when the choice of locus is arbitrary (and usually far from ideal), justifying our initial claim that comparing two genomes with respect to a third tremendously enriches what can be had from a direct pairwise comparison. (As an example, trees that were not properly reconstructed by the neighbor-joining method, which uses strictly pairwise comparisons, were commonly reconstructed correctly by our signature-based method.) Our tests for phylogenetic reconstruction are obviously of limited scope, meant to exemplify the usefulness of the method rather than provide a full evaluation; and the method itself is subject to many obvious improvements (better ways to choose a locus, using k-way signatures rather than pairwise ones to support a top-down reconstruction method, etc.).

Figure 4: The size of the generated trees.

Finally, we present experimental results suggesting that our upper bound is on average very tight and then use the bound to show that the greedy signatures, used for ancestral reconstruction of genomes too large for the exhaustive computation, are indeed close to a maximum signature. Since the computed ancestor is bracketed within this bound, our results imply that the maximum signature is very close to the true ancestor with high probability. The upper bound was computed for each trial in Table 1. For each of the sets of 1000 trials, the average difference between the upper bound and the maximum signature was 0.029, 0.073, 0.176, 0.27, and 0.327 for trials with 10, 15, 20, 25, and 29 percent respectively. For the length-dependent data from Table 3, the average difference stays between 0.021 and 0.082. Table 8 indicates similar performance for experiments run on skewed triplets. The tests from Table 6 give average differences from 0.024 up to 1.375 for the Bourque-like method and differences from 0.048 up to 2.228 for the noninterfering inversions method. Only one of the tests from genomes of size 1000 did not match the upper bound for the greedy method.

In any study of evolutionary changes, the challenge is to distinguish global patterns from a background of many local changes or, to put it another way, to find commonalities among many equally plausible evolutionary paths that lead to the same modern organism. We have proposed an approach to this problem that focuses on intermediate states along such paths in the setting of a speciation event and seeks to return the last (most recent) states from which both species of organisms could still have been derived. This approach offers multiple benefits: the focus on intermediate states translates readily into one on ancestral reconstruction; the study of paths going through a fork (the speciation event) stresses the role of evolutionary history rather than just final states; and the search for the most recent states that are part of the fork naturally separates common evolutionary changes (prior to the fork) from individual variations (subsequent to the fork). Although finding such signatures appears hard, we gave an efficient heuristic that does very well through an extensive range of simulations. Our signatures are based on inversions, since inversions are the best studied of the various genomic rearrangements to date, but the concept readily extends to any other rearrangement operation or family of such operations.

The authors declare that they have no competing interests.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1