Showing posts with label Directed graph. Show all posts
Showing posts with label Directed graph. Show all posts

Wednesday, November 7, 2012

Explanation of the many names for types of phylogenetic networks


Two types of phylogenetic network are commonly recognized, although there can be gradations between the two extremes. These go by many different names, which inevitably leads to some confusion on the part of users.

Some of the names are listed here, along with an explanation of what the terminology is intended to convey. The terms are arranged in pairs, indicating the two different types of network. The "network" part of the name is assumed in each case unless indicated otherwise.

      Type 1       Type 2
  1. Affinity  Genealogical
  2. Data-display Reticulogeny
  3. Implicit  Explicit
  4. Directed  Undirected
  5. Rooted  Unrooted
  6. Splits graph Augmented tree, Reconciliation, Recombination,
                  Hybridization
1.  This reflects the biologists' perspective, describing the different purposes for which networks have been used. Affinity networks display overall similarity relationships among the organisms, whereas genealogical networks display only historical relationships of ancestry.

2.  This reflects the assumptions used for the data analysis. Data-display networks are interpreted solely as visualizations of the patterns of variation in the data, while the reticulogenies are based on some inferences about those data patterns (such as their possible cause). Some network types, such as Reduced Median Networks and Median-Joining Networks, are based on algorithms that make partial inferences from the data. Data-display networks have mainly been used as affinity networks and reticulogenies as genealogical networks.

3.  This reflects the computational perspective, describing the goal of the algorithm used to analyze the data. Explicit networks are intended to provide a phylogeny in the traditional sense used for phylogenetic trees, displaying both vertical and horizontal patterns of descent with modification. Implicit networks provide information that can be used to explore phylogenetic patterns in a dataset without any direct interpretation as necessarily showing a phylogeny. Implicit networks have mainly been used as data-display networks and explicit networks as reticulogenies.

4.  This reflects the mathematical interpretation of networks as line graphs. In a directed graph the edges have a direction, usually indicated by an arrow, in which case the edges are more correctly referred to as arcs. Undirected graphs do not have directed edges.

5.  This reflects the tree-thinking view of phylogenetic networks, in which directed graphs are called rooted trees and undirected graphs are called unrooted trees. Rooted networks are usually treated as explicit networks and are thus used as genealogical networks, although there is no reason why they could not be used simply as a convenient form of data display.

6.  This reflects the modelling approach to network analysis based on mathematical structures. Splits graphs model phylogenetic patterns as bipartitions of the data, and build the network from those partitions (the result will be a tree if there are no incompatible bipartitions). Augmented trees are essentially trees with a few added reticulation edges / arcs, while reconciliation networks are based on reconciling the differences between trees. Recombination networks are based on analyzing data patterns in terms of a simple model of genetic cross-over, while hybridization networks model the data in terms of patterns in conflicting trees.

So, there are reasons why so many different terms have appeared in the literature. Unfortunately, they are not always used consistently with the meaning that was originally intended.

Wednesday, September 26, 2012

Networks and most recent common ancestors


The interpretation of an evolutionary network is confounded by the fact that descendants of reticulation nodes have complex ancestry. Therefore, the concept of a Most Recent Common Ancestor (MRCA) is not as straightforward as it is for a tree, as there may be multiple paths from any one descendant back to its ancestors. This creates several possible interpretations of what we might mean by a MRCA.

Figure 1 illustrates the calculation of the MRCA in a tree of five taxa (A-E), showing the MRCA of taxa C and D. We simply trace each of the descendant taxa backward along the branches towards the root, and the ancestral node where all of these traces first intersect is the MRCA of those taxa.

Figure 1.

Figure 2 illustrates a more complex history, involving two hybridization events. The incoming branches to the reticulation nodes have arrows, for emphasis. The figure also recognizes several possible interpretations of the MRCA of taxa C and D (see Huson and Rupp 2008; Fischer and Huson 2010).

A conservative definition of the MRCA (or a stable MRCA) is the intersection of all paths from the descendants to the root, so that any reticulation pushes the MRCA back towards the root. In this example it pushes the MRCA all the way to the root. Alternatively, we could define the Lowest Common Ancestor (or the minimal common ancestor) as the shared ancestor that is furthest from the root along any path. That is, the LCA is not an ancestor of any other common ancestor of the taxa concerned.

Figure 2.

In the mathematical terminology of lattices, which can have an algebraic or order theoretic definition, the Conservative MRCA is called the Least Lower Bound (LLB) and the LCA is called the Greatest Lower Bound (GLB).

We could also have a biological compromise between these two mathematical concepts and recognize a Fuzzy MRCA, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants (Fischer and Huson 2010). In this example, the Fuzzy MRCA represents 75% of the genome of taxon C and 100% of the genome of taxon D. (The Conservative MRCA represents 100% for both taxa, by definition; and in this example the LCA represents 50% of the genome of taxon C and 100% of the genome of taxon D.)

Figure 3.

However, neither the Fuzzy MRCA nor the LCA is necessarily unique, although the Conservative MRCA will always be unique. Figure 3 shows an example where there are two independent LCAs of taxa C and D. Neither of these LCAs is an ancestor of the other, as required by the definition, and so they are both equal candidates as LCA. Each one represents 50% of the genome for both taxa C and D.

In terms of a lattice, Figure 2 is called a lower semi-lattice (or meet semi-lattice), because every pair of nodes has only one GLB, whereas Figure 3 is not a semi-lattice, because at least one node pair has more than one GLB.

This leads to the biological question of how we are best to interpret the MRCA in situations such as that represented by Figure 3. This is a question that does not yet seem to have been addressed by biologists. Figure 3 does not represent an impossible evolutionary history, although it may be an unusual one because one lineage hybridizes with another lineage twice, presumably at different times.

The lack of a unique LCA is clearly problematic, as it almost defeats the purpose of the concept of a MRCA. It would certainly make life easier if we could restrict evolutionary networks to the class of lower semi-lattices.

An alternative is to restrict the MRCA concept to the Conservative MRCA. However, it is easy to imagine situations where this pushes the MRCA so far towards the root of the network as to be uninformative, especially in cases involving horizontal gene transfer, which can occur between widely separated evolutionary groups. If we insist that a eukaryote MRCA represent 100% of the genome, and we include non-nuclear genomes in the calculation, then the Conservative MRCA creates an extreme theoretical problem.

A Fuzzy MRCA may be the best compromise between these two extremes, although there are obvious practical issues for obtaining agreement on how much of the genome history is to be discounted from the MRCA.

References

Fischer J., Huson D.H. (2010) New common ancestor problems in trees and directed acyclic graphs. Information Processing Letters 110: 331–335.

Huson D.H., Rupp R. (2008) Summarizing multiple gene trees using cluster networks. Lecture Notes in Bioinformatics 5251: 296–305.

Wednesday, August 8, 2012

Direction is important when showing history


In an earlier post I presented an evolutionary network showing the history of the various software and hardware components of the revolutionary Xerox 8010 "Star" computer. This network is shown below.


This network summarizes how various systems related to the Star have influenced one another over the years. Time progresses downwards (as indicated), double arrows indicate direct successors (i.e. follow-on versions), while single arrows indicate "influence" on the subsequent ideas. It is thus interpretable as a network showing hybridization and possibly horizontal transfer. It is thus a valid evolutionary diagram, and it neatly and succinctly displays the historical patterns of "descent with modification".

Unfortunately, a recent attempt to update this diagram is not successful at all.

Click to enlarge.

Note that this version loses most of the important components of an evolutionary diagram:
(i) there is no indication of the time direction
(ii) the edges are not directed
(iii) the bold lines are not explained as being direct descent.
Furthermore, in the accompanying text the (internal) nodes are referred to as "leaves". The authors have thus turned a directed graph into an undirected one, and have thereby created something with only ambiguous interpretation. This network no longer makes evolutionary sense, although it intends to display evolutionary information.

The authors are apparently well-intentioned, and they are trying to use an appropriate form of information display, but they have failed to appreciate the importance of using a directed versus an undirected graph when displaying historical relationships.

Wednesday, June 20, 2012

Rooted networks for exploratory data analysis


Leo van Iersel has recently been trying to convince me that rooted networks might also be useful as exploratory data analysis (EDA), in addition to the unrooted networks that I have championed in print (Morrison 2010) and in this blog. I have tried to find a dataset that will support his case, and the one discussed here is the best that I have been able to find.

In infection biology we are interested in the transmission of pathogens from one host to another, possibly in geographically distant locations. It is usually assumed that pathogens (viruses, bacteria, protists, microfungi, helminths) with the same genotype found in different locations represent transmission from a single source location. Conversely, a mixture of genotypes at a single location is assumed to represent multiple sources of infection, possibly at different times. This type of analysis is a combination of population genetics and phylogenetics.

Such transmission studies can produce quite complex results, even to the extent of having different pathogen genotypes simultaneously in the same host. Data analysis is usually based on either a rooted tree or an unrooted haplotype network, but it can also conveniently be studied using a rooted reticulation network. I will illustrate the latter with a simple example.

Click to enlarge

The figure shows a rooted network for 1,544 aligned nucleotides from 72 samples of the nematode Dictyocaulus viviparus, which is the parasitic lungworm of domestic cattle. The data are concatenated mitochondrial protein (2 genes), rRNA and tRNA gene sequences, from Höglund et al. (2006). The analysis shows the inferred historical relationships among 64 farm samples from Sweden (8 worms from each of Farms 29, 34, 36, 38, 49, 65, 68 and 76) and 8 samples from a isolate that had been maintained in the laboratory (L, used as the outgroup to root the network).

The data have been analyzed using the reticulation network method of Huson et al. (2007), based on splits generated by the Median network. Since the character data are essentially binary (with two exceptions), this produces exactly the same result as for a recombination network.

In the network, most of the samples from within each farm seem to be closely related in a simple divergent fashion through time, as would also be conveniently displayed by a standard tree-based analysis. There are apparently two major clades of genotypes, with 6-7 subclades. We can conclude from the tree-like relationships that four farms show evidence of only a single source of infection (Farms 34, 36, 38 and 76 each have a single genotype), while two farms appear to have at least two genotypes and thus probably two sources of infection (Farms 49 and 68).

However, two of the farms show more complex patterns than these, which would not be revealed by a simple tree analysis. These two farms have groups of samples that descend from reticulation nodes (indicated by the arrows), thus suggesting the pooling of two distinct sources of genetic material. Note that there is no suggestion that these reticulations represent either recombination or hybridization, given that the data are from mitochondrial genes. This analysis is best treated as exploratory (EDA), highlighting genotypic complexity that warrants further biological investigation, rather then providing an explicit hypothesis of evolutionary history.

Farm 29 is shown as having one unique genotype (5 individuals) plus another genotype (3 individuals) that has elements possibly related to both of the major clades of genotypes. Perhaps these latter 3 individuals represent an earlier infection, given their apparent association with the basal branches of the two clades.

Farm 65 appears to be even more noteworthy. There are 3 individuals that are apparently related to those on Farm 36, plus 3 individuals of somewhat uncertain relationship. Then there are 2 individuals with elements possibly related to the genotypes on Farms 76 and 49. This is clearly a very interesting farm, from the point of view of lungworm infection and transmission, with at least three possible infection sources. This is important information that needs to be taken into account for possible management strategies.

This use of a rooted network analysis for exploratory data analysis seems not to have been considered before. However, it seems to me that it adds considerably to the practical information that can be gleaned from a study of the transmission of pathogens.

References

Höglund J., Morrison D.A., Mattsson J.G., Engström A. (2006) Population genetics of the bovine/cattle lungworm (Dictyocaulus viviparus) based on mtDNA and AFLP marker techniques. Parasitology 133: 89-99.

Huson D.H., Klöpper T.H. (2007) Beyond galled trees — decomposition and computation of galled networks. Lecture Notes in Bioinformatics 4453: 211-225.

Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044-1057.

Wednesday, April 18, 2012

An explanation of graph types


Biologists sometimes are not clear about the distinction between directed and undirected graphs in relation to whether they are cyclic or acyclic. So, to help clarify matters, I have included a figure here that places examples of the various graphs into their respective categories.

There are four combinations of characteristics, shown in the figure as a 2x2 table.

Click to enlarge

In all of the graphs there are four (unlabelled) leaves, but the number of internal nodes and edges varies depending on whether there are cycles (4 nodes, 4 edges) or not (2 nodes, 1 edge; or 4 nodes, 4 edges).

The important point for biologists to note is that any evolutionary diagram must involve a directed acyclic graph (DAG). An undirected graph cannot represent history, because the direction of that history is not shown (and history is defined in terms of a past relative to the present). A directed cyclic graph cannot represent a realistic history, because at one of the nodes in the cycle an inferred ancestor in also its own descendant (or one of the inferred descendants is also its own ancestor).

Note that an undirected cyclic graph can be turned into either a directed cyclic graph or a directed acyclic graph. In a phylogenetic analysis, the goal is to produce a directed acyclic graph.

The main practical distinction between a "data-display network" and an "evolutionary network" is that the former is usually undirected and the latter always directed. The usual conceptual difference between a phylogenetic tree and an equivalent phylogenetic network (= evolutionary network) is that the latter has a reticulation node while the former does not.

There seems to be no consistency in the literature about what to call a cycle in the various graphs. I have made two suggestions here (loop and circuit). But, what should one call the reticulated part of a DAG?