Monday, April 16, 2018

Networks in the news, at last

Phylogenetic networks do not always fare very well in the traditional media. The general public has enough troubles dealing with a phylogenetic tree, let alone networks. For example, many people still consider that Darwin claimed that monkeys are our ancestors (a chain-based relationship) rather than our cousins (a tree-based relationship) — who knows what they must think about humans inter-breeding with Neandertals (a network-based relationship).

Nevertheless, a few news reports about a recent network-based paper have suggested that the situation might be improving.

The paper in question is:
Úlfur Árnason, Fritjof Lammers, Vikas Kumar, Maria A. Nilsson, Axel Janke. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Science Advances 4: eaap9873.
This paper details extensive genomic admixture among six species of Baleen whales. The phylogenetic scenarios involving gene flow cannot be represented by a tree, of course, so the authors include the following set of networks (along with a Median network).

News reports have appeared in at least two places, reporting on this paper, that discuss the difference between networks and "Darwinian trees", and do quite a good job of it.

For example, this quotation is from the New York Times ("Baleen Whales intermingled as they evolved, and share DNA with distant cousins"):
The relationships are so complicated, however, that the senior researcher Axel Janke said "family tree" is too simple a metaphor. Instead, the species, all part of a group called rorquals, have evolved more into a network, sharing large segments of DNA with even distant cousins. Scientists expressed surprise that there had been so much intermingling of baleen whales, given the variety of sizes and shapes.
This quotation is from Popular Science ("A new study on whales suggests Darwin didn't quite get it right"):
Evolutionary network analysis takes the tree metaphor and turns it into a complex web, which acknowledges the different kinds of familial connections shown by whole-genome sequencing. Comparing the whole genomes of rorquals shows that genetics is much more fluid than the Darwinian “tree” model, Janke says.
"Gene flow and hybridization is more common than biologists usually think," Janke says. Analysis of the rorquals’ genes shows that they've interbred in different ways at various times in their evolutionary history. This doesn't make much sense if you rely only on Darwin's model, where branches of the family tree never touch again after they separate.
I think that these give us all a reason for optimism.

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies

(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.


A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider to publish such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), nontree-like signals, and astonishingly low mDVs, for a morphological matrix. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPT) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0-1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)

As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.

The most tree-like matrix (mDV = 0.184, which is a value typically found for molecular matrices allowing for inference of unambiguous trees) was the one of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined —using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (third-lowest of all of the datasets), but an RI of 0.9 (highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent to the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours solely added for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a boostrap support of > 50% was twice as large in a distance-based framework than using parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-Joining algorithm; right, using a branch-and-bound search under parsimony as optimality criterion (one tree saved per replicate). Edge-lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates not shown; trivial splits are collapsed. Same colour scheme than above for orientation.

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence, the low mDV) but the branches are not twigs but well developed trunks. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.

Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it would represent a morphotype much more similar to the orange cluster as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as first diverging "sister" with a very short to zero-long terminal branch, but in the network it is placed intermediate between the poorly differentiated but morphologically inhomogenous oranges and the strongly derived reds — being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Or just a parallel derivation in a member of the B. wyoming species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).

Interpretation, what does low CI but high RI stand for?

The distinction between the Consistency Index and the Retention index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in a character that are implied to be unique on any given tree (ie. one change for each character state): m / s, where m = the minimum possible number if character-state changes on the tree, and s = the observed number if character-state changes on the tree. The sum of these values across all characters is the ensemble consistency index for the dataset (CI).

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on the tree. Once again, the sum of these values across all characters is the ensemble retention index for the dataset (RI).

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states that per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and inference methods that are able to capture more than a trivial sequence of dichotomies.

Monday, April 2, 2018

Things you can learn in a blink about your data

As phylogeneticists, we commonly have to deal with data that we don't initially understand. In this post, I'll use a recently published 8-gene dataset on lizards to show how much can be learned prior to any deeper analysis, just from producing a few Neighbour-nets.

The data

Solovyeva et al. (Cenozoic aridization in Central Eurasia shaped diversification of toad-headed agamas, PeerJ, 2018) sampled species of toad-headed agamas (lizards) across their natural range (north-western China to the western side of the Caspian Sea), to study their genetic differentiation in time and space. To do so they used two datasets. The mitochondrial data covers four gene regions: coxI, cytB, nad2, and nad4, and are complemented by four nuclear gene regions: AKAP9, NKTR, BDNF, RAG1.

This caught my eye, because the authors' preferred trees have a bunch of low branch-support values, so that this would be a good opportunity to advocate some Consensus networks. They also report only values above a certain threshold, as apparently recommended by several reviewers. My reviewers not rarely recommended the same, but I always ignored this — I believe we should give the value, because it makes a difference if its just below the threshold (e.g. bootstrap support, BS, of 49), or non-existent (BS < 5). The authors also note that their mitochondrial and nuclear genealogies are not fully congruent. In short, the signal from their matrix is probably not trivial, but could be interesting.

In contrast to many other journals, PeerJ has a strict open-data policy. Solovyeva et al. provide each gene as FASTA-formatted alignment as Supporting Information. So let's have some quick-and-dirty Neighbour-nets.

Using Neighbour-nets to decide on an analysis strategy

A comprehensive outgroup sampling can avoid outgroup-rooting artefacts, but adding very distant outgroups comes at a price. We need to invest much more computational effort, because the inference programmes not only try to optimize our focus group, but the entire taxon set. Another principal question is: what can an outgroup taxon provide as information for rooting an ingroup, while being completely different? Furthermore, when we do an ML (or Bayesian) analysis, e.g. with RAxML, we leave it to the program to optimize a substitution model (even when we predefine a model, its parameters will usually be optimized by the inference software on the fly). By adding distant outgroups, we optimize a model for them plus our focus group — by not using any outgroup, we optimize a model suiting just the situation in our focus group.

Fig. 1 shows the neighbour-net (uncorrected, codon-naive p-distances) for the first of the mitochondrial genes, coxI (the others are similar), which and tells us a lot about the data to be used for the tree inferences.

Fig. 1 Neighbour-net based on mitochondrial (coxI) uncorrected p-distances. The diffuse, non-treelike signal expressed in the A and B fans will be a hard nut for the tree inference, and will have little influence on questions dealing with the focal genus.
We can see that outgroup diversity is much higher than for the focus group, and that most outgroup taxa are very distinct from the ingroup. Looking at the closest outgroups (Stellagama, Agama, Laudakia, Paralaudakia, Xenagama, Pseudotrapelus), we see that finding an unambiguous sister taxon to the focal genus will be difficult. And we can realize that including more-distant taxa just gives the algorithm much more work (note the A and B bushes), but hardly will have any benefit for rooting the ingroup.

We also can see that the 3rd codon position is probably saturated to some degree, and that we will be dealing with a high level of stochasticity (randomly distributed mutation patterns) here — all terminal edges are long to very long. Since the same thing holds for the other three mitochondrial regions, it would not be a bad idea to do an additional inference including only the 1st and 2nd codon positions, in case all taxa should be included.

Using Neighbour-nets to understand the basic signal properties of your data

Fig. 2 shows the Neighbour-net (again, uncorrected p-distances) for one of the nuclear genes, AKAP9. The outgroup sample is somewhat different, but we can immediately see that this gene has more potency to infer unambiguous phylogenetic relationships among the sampled taxa — the graph has distinctly tree-like portions. We also see that saturation of 3rd codon position is much less of an issue here, compared to the cox1 gene (Fig. 1) — the terminal edges are comparatively short, with respect to the central edge bundles. [Nonetheless, it is never wrong to analyze coding gene data partitioned: 1st and 2nd codon positions vs. 3rd codon position.]

Fig. 2 Neighbour-net based on the nuclear (AKAP9) genetic distances. Note the much more treelike structure of the graph, the generally shorter terminal edges, and last-but-not-least the notable difference between ingroup (focal genus) and outgroup taxa.
For the general differentiation patterns, compare the minute extent of the focal group, green background in Fig. 2 vs. the prominent bush in Fig. 1. It is clear that including distant outgroups will not have any benefit. We may even consider reducing the outgroup sample (if one has to include an outgroup at all) to the two genetically closest genera Stellagama and Paralaudakia.

Similarly structured graphs are found for the other three nuclear genes.

Producing some quick Neighbour-nets doesn't hurt

Sometimes reviewers will pick on them — "distance-based phenetic method" is something I used to get a lot. In this case, you can still produce them just to get some basic impressions on your data set. This will help you to understand the results of your tree inferences, including why some of your branches have ambiguous support.

It comes as little surprise that the taxa one can identify, in these networks, as likely sister genera of the focal genus, come up as sister taxa in the explicit phylogenetic analyses done by Soloveya et al. — e.g., their fig. 2 showing the combined mitochondrial tree, and their fig. 3, showing the combined nuclear tree.

Soloveya et al. (2018) performed some incongruence tests (AU-topology test) using single-gene inferences (going further than many other studies), but did not dig deeper. One of the authors answered my question about potential signal issues that may cause topological incongruence between ML and Bayesian trees, as well as ambiguous support, but he considers this to be a solely a problem with methods — different algorithms prefer different phylogenies. Having looked at the basic differentiation pattern in the gene regions using Neighbour-nets, it may be more than just an issue with methods — ML and Bayesian analysis should always support the same splits when using the same or similar substitution models.

Like many other studies, the authors also use the data for Bayesian dating and dating-dependent biogeographic analysis. Lacking any ingroup fossils, the authors could only constrain nodes within the outgroup subtree, which are nodes far from those that they discuss and estimate. I have my doubts that we can put much faith in the uncorrelated clock process to handle such extreme differences between focus group (ingroup) and (constrained) outgroup-taxon lineages as seen in Fig. 2. Estimates for rate shifts between outgroup and ingroup usually render ingroup age estimates to be too young, compared to age estimates obtained with ingroup fossils. This is something that can be directly deduced from a graph like the one in Fig. 2.

Data and networks can be found at figshare

The original paper provides a comprehensive supplement with a lot of interesting information, but the FASTA-files, each comprising a single gene region and a few editing issues, are not yet ready to use. Hence, I transformed them into NEXUS-files, and generated a combined data matrix. The files and the Neighbour-nets for each gene region (and a full single-gene maximum likelihood analysis) can be found on figshare.