Monday, April 30, 2018

Stratification: how linguists traditionally identify borrowings


In my previous blog post, I illustrated how important it is to take the systemic aspects of sound change into account when comparing languages. What surfaces as a surprisingly regular process is in fact a process during which the sound system of a language changes. Since the words in a given language are derived from the sound system, a change in the system will necessarily change all words in which the respective sound occurs.

On one hand, this makes it much more difficult for linguists to identify homologous words across languages. On the other hand, however, it enables us to identify borrowings, by searching for exceptions to regular sound correspondences. I will be discussing the latter here.

Sound changes and borrowing

In order to illustrate how this can be done in practice, consider our 15 examples of cognates between German and English, the first ten of which are shown in the following table:

No.  German    English
1    Dach      thatch
2    Daumen    thumb
3    Degen     thane
4    Ding      thing
5    drei      three
6    Durst     thirst
7    denken    think
8    Dieb      thief
9    dreschen  thresh
10   Drossel   throat

When comparing these words quickly, it is easy to see that in all cases where German has a d as the initial sound, English has a th. This sound correspondence, as we call it in historical linguistics, reflects a very typical systematic similarity between English and German. We can identify it for all related words in the two languages that go back to a form with initial Proto-Germanic θ-, which regularly changed to d in German; this sound change is well accounted for in Indo-European linguistics.

Not all homologous words between English and German, however, show this correspondence, as we can easily see from the five examples provided in the next table:

No.  German    English
11   Dill      dill
12   dumm      dumb
13   Damm      dam
14   Dunst     dunst
15   Dollar    dollar

It is easy to see that these words don't fit our expected pattern (d matching th as the first consonant). It is also clear from the overall similarity of the words that it is rather unlikely that they trace back to different words, and thus turn out to be not cognate at all. One of the simplest possible explanations for the divergence from our initial d in German corresponding to θ in English, which now surfaces as d = d, is borrowing, be it from German to English, from English to German, or from some third language.
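The search for exceptions described above can be sketched in a few lines of Python. This is only a toy illustration: it reads the initial sound off the spelling, which is a crude assumption standing in for a proper phonetic transcription, and the function names are made up for this example.

```python
# Word pairs from the two tables above: (number, German, English).
PAIRS = [
    (1, "Dach", "thatch"), (2, "Daumen", "thumb"), (3, "Degen", "thane"),
    (4, "Ding", "thing"), (5, "drei", "three"), (6, "Durst", "thirst"),
    (7, "denken", "think"), (8, "Dieb", "thief"), (9, "dreschen", "thresh"),
    (10, "Drossel", "throat"), (11, "Dill", "dill"), (12, "dumm", "dumb"),
    (13, "Damm", "dam"), (14, "Dunst", "dunst"), (15, "Dollar", "dollar"),
]


def initial(word):
    """Return a crude initial-sound label taken from the spelling (assumption)."""
    lowered = word.lower()
    return "th" if lowered.startswith("th") else lowered[0]


def find_exceptions(pairs, german="d", english="th"):
    """List pairs with the expected German initial but a different English one."""
    return [
        (no, de, en)
        for no, de, en in pairs
        if initial(de) == german and initial(en) != english
    ]


EXCEPTIONS = find_exceptions(PAIRS)
for no, de, en in EXCEPTIONS:
    print(f"{no}: {de} ~ {en} (expected th-, found {initial(en)}-)")
```

Run on the 15 examples, the sketch flags exactly the five words of the second table; deciding why they are exceptional still requires the philological work described in the text.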

Among the five examples, the final one, Dollar, is the easiest to explain, as we are dealing with a recent borrowing of the name of the U.S. currency. English dollar itself has another cognate with German, namely German Taler, the name of a currency from ancient times (see here for the full etymology, based on Pfeifer 1993).

The other four terms in the table may seem less straightforward to explain as borrowings, as they are by no means of recent origin; but we can confirm their exceptional status by contrasting them with older Middle High German readings (11th-14th century), which are listed in the following table for all 15 of our examples:

No.  German    English  Middle High German
1    Dach      thatch   dah
2    Daumen    thumb    dūm
3    Degen     thane    degan
4    Ding      thing    ding
5    drei      three    drī
6    Durst     thirst   durst
7    denken    think    denken
8    Dieb      thief    diob
9    dreschen  thresh   dreskan
10   Drossel   throat   drozze
11   Dill      dill     tilli
12   dumm      dumb     tumb
13   Damm      dam      tam
14   Dunst     dunst    tunst
15   Dollar    dollar

As can be easily seen from this table, examples 11-14 all have a t as the initial consonant in Middle High German, and not d, as in the other cases. The change from original Proto-Germanic d to t in German is a well-attested sound change, for which we have many examples in the form of sound correspondences (cf. day vs. Tag, do vs. tun, etc.). We can therefore conclude that the Middle High German readings like tilli vs. English dill reflect the readings we would expect if all words had changed according to the rules. Since no regular change from t in Middle High German to d in Standard High German can be attested, it is furthermore safe to assume that the words have been modified under the influence of contact with other Germanic language varieties.

Here, English is not the most obvious candidate for contact; and the influence is rather due to contact with neighboring language varieties in the North-West of Germany, such as Frisian or Dutch. Similar to English, they have retained the original d (cf. Dutch dille vs. English dill). If speakers of High German varieties borrowed the term from speakers of Low German varieties, they would re-introduce the original d into their language, as we can see in our examples 11-14.
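The layering argument can be made explicit with a small Python sketch. It compares the initial letter of each modern German word with its older reading from the table above. The three labels are my own shorthand for this illustration, and comparing spellings instead of sounds is again a simplifying assumption.

```python
# (number, modern German, older reading as listed in the table; None if absent)
FORMS = [
    (1, "Dach", "dah"), (2, "Daumen", "dūm"), (3, "Degen", "degan"),
    (4, "Ding", "ding"), (5, "drei", "drī"), (6, "Durst", "durst"),
    (7, "denken", "denken"), (8, "Dieb", "diob"), (9, "dreschen", "dreskan"),
    (10, "Drossel", "drozze"), (11, "Dill", "tilli"), (12, "dumm", "tumb"),
    (13, "Damm", "tam"), (14, "Dunst", "tunst"), (15, "Dollar", None),
]


def layer(modern, older):
    """Assign a word to a layer by comparing its initial with the older reading."""
    if older is None:
        return "no older reading"   # e.g. a recent borrowing such as Dollar
    if modern[0].lower() == older[0].lower():
        return "regular"            # initial unchanged: consistent with inheritance
    return "irregular"              # t > d is not a regular change: borrowing candidate


LAYERS = {no: layer(modern, older) for no, modern, older in FORMS}
for no, modern, older in FORMS:
    print(no, modern, older or "-", LAYERS[no])
```

The words labelled "irregular" are examples 11-14, i.e. the ones for which contact with neighboring varieties is the most plausible explanation.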

Why some of these borrowings took place and some did not is hard to say. That people in the North-West, living on the coast, know more about the building of dams, for example, is probably a good explanation why High German borrowed the term: obviously, the High German speakers did not use the word tam all that frequently, but instead heard the word dam often in conversations with neighboring varieties closer to the coast. For the other words, however, it is difficult to tell what the reason for the success of the alternative forms was.

Conclusions

Despite its important role for historical language comparison, the kind of analysis described here, by which linguists infer exceptional patterns in order to identify borrowings, is not well documented, either in handbooks of historical linguistics or in the journal literature. Following Lee and Sagart (2008), it is probably best called stratification analysis, since linguists try to identify the layers of contact and inheritance which surface in the form of sound correspondences. If these layers are correctly identified, linguists can often not only determine the direction in which a borrowing occurred, but also the relative time window in which this borrowing must have happened. This is the reason why linguists can often give very detailed word histories, which show where a word was first borrowed and how it then traveled through linguistic landscapes.

As with so many methods in historical language comparison, it is difficult to identify a straightforward counterpart of this technique in biology. What probably comes closest is the usage of GC content as a proxy for the inference of directed networks of lateral gene transfer (as described in, for example, Popa et al. 2011). In contrast to lateral gene transfer in biology, however, our linguistic word histories are often much more detailed, especially in those cases where we have well-documented languages.

For the future, I hope that increased efforts to formalize the process of cognate identification, cognate annotation, and phonetic alignments in computer-assisted frameworks for historical language comparison may help to improve the way we infer borrowings in linguistics. There are so many open questions about lateral word transfer in historical linguistics that we cannot answer by sifting manually through datasets. We will need all the support we can get from automatic and semi-automatic approaches, if we want to shed some light on the many mysterious non-vertical aspects of language evolution.

References

Lee, Y.-J. and L. Sagart (2008) No limits to borrowing: The case of Bai and Chinese. Diachronica 25.3: 357-385.

Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.

Popa, O., E. Hazkani-Covo, G. Landan, W. Martin, and T. Dagan (2011) Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21.4: 599-609.

Monday, April 23, 2018

A (wal)nut to crack – what a network tells you that no tree can


In this post, I will show a network that I generated some time ago as an illustration of a point: morphological data should be used to infer networks rather than trees, especially when the goal is to place some fossils in a modern-day phylogenetic framework.

In 2007, Manos et al. (Systematic Biology 56:412–430) published an interesting phylogenetic study that provided a phylogenetic framework to place some enigmatic fossils of the Juglandaceae, the walnut family. Following my preferred procedure (presumably without realizing it), they recruited a palaeobotanical expert to erect a morphological partition.

Given the high quality of the matrix, this is an ideal example to demonstrate the utility of networks in (palaeo)phylogenetic research and to discuss the question of potential ancestor-descendant relationships, and their poor representation in trees (especially cladograms). Phylogenetic relationships within modern Juglandaceae are relatively well resolved. Rhoiptelea, a relict genus found in the mountains of northern Vietnam and south-western China, is sister to the remainder of the family; it is now subfamily Rhoipteleoideae, but was traditionally its own family. Rhoiptelea is a living fossil: flowers with fitting in-situ pollen and seeds have been found in the Late Cretaceous (Heřmanová et al. 2011, IJPS 172: 285–293; cryptically named Budvaricarpus serialis, the "Serial Budvarseed", because one is not allowed to use a modern-day genus for naming an 85–90 million-year-old angiosperm, even when it looks the same). The remainder of the Juglandaceae falls into two main clades, recognized as subfamilies:
  1. the Juglandoideae, i.e. the walnuts (Juglans) and their closest relatives: the (eastern) North American-East Asian disjunct genus Carya, the Eurasian relict genus Pterocarya (mainly Transcaucasia, East Asia), and the monotypic genera Cyclocarya and Platycarya.
  2. the Engelhardioideae, a group of tropical-subtropical, mostly relict genera: Alfaroa + Oreomunnea in the equatorial regions of the New World; and the South-East Asian-Malesian genus Engelhardia and the probably monotypic Alfaropsis, widespread in China (sometimes still included in Engelhardia, e.g. in the current Flora of China, despite unambiguous molecular and morphological evidence).
Juglandaceae produce (winged) seeds and pollen that are relatively easy to identify. They are well-known and very common companions of palaeontologists during much of the Cenozoic, especially the (today geographically very restricted) Engelhardioideae. But in addition to the modern genera, the family includes some very interesting, unique fossils — the idea is to place these in a phylogenetic framework.

Results of the study of Manos et al. (2007).
Arrows indicate the position of the fossils. a) A majority rule consensus cladogram using a cut-off of 50 based on the morphological partition; b) the total evidence counterpart.

As can be seen from the above trees (taken from the paper), morphology reflects some of the molecular phylogenetic relationships — the Juglandoideae are supported as a clade, as are most genera (except for Engelhardia and Oreomunnea). Two fossils, Pal(a)eoplatycarya and Platycarya americana were resolved as sister taxa to their modern counterpart, Platycarya strobilacea; and the two enigmatic fossils Polyptera (the "many-winged one") and Cruciptera (the "cross-winged one") could be associated with the Juglandoideae. The total evidence approach indicated that Cruciptera is part of the "crown-group" Juglandoideae, in contrast to Polyptera, that appears at a more "basal" (root-proximal) position in this subclade. A sixth fossil, Pal(a)eooreomunnea could not be resolved with certainty (placed as sister to all Juglandoideae in the total evidence tree). As the name indicates, literally the "Ancient Oreomunnea", we would have expected it to group with the Engelhardioideae, which form a clade in the total evidence tree.

This is okay so far as it goes but, beyond potential sister relationships, these cladograms show very little. When I place a fossil such as Cruciptera in the phylogeny, I would like to know whether it is more closely related to Juglans, Pterocarya or Cyclocarya. Is it an early sister lineage of all of these, or even a precursor? Cladograms cannot answer such questions.

The persistent issue of pseudo-clades

It has been pointed out in earlier posts that clades/grades are not necessarily synonyms of Hennig's concepts of monophyly and paraphyly, mainly because of convergent evolution creating data splits that are incongruent with the true tree. Parsimony-based analyses are especially vulnerable, because each change represents a step to be optimized.

One alternative method to place fossils in a (molecular-based) phylogenetic framework is the evolutionary placement algorithm (EPA; Berger & Stamatakis 2010, AICCSA conference paper). This changes to a probabilistic framework, and queries each fossil alone using its morphological partition but using the molecular-based tree as framework.

Summarized result of the evolutionary placement algorithm as implemented in RAxML.
Each number represents the probability of attaching the fossil to the corresponding branch, using maximum likelihood as optimality criterion.

This gives the above tree as the result for the Walnut data set. Palaeooreomunnea is now unambiguously linked to one of the two included species of Oreomunnea, O. mexicana. Cruciptera is associated (again unambiguously) with Cyclocarya. Furthermore, not only are Palaeoplatycarya and the extinct North American Platycarya relatives of the modern-day Platycarya, but so is Polyptera. The latter, according to the original analysis, is the first-branching member of the remainder of the Juglandoideae, i.e. all genera except Platycarya.

And the network shows us why

The most important problem with morphological data sets is that their signals are complex, and usually not very tree-like. Hence, whenever we optimize fossils along a tree (either by directly analyzing the morphological data or by some form of total evidence approach), the analysis has to fit in this odd little OTU at all cost, even when it means collapsing an entire clade. Simultaneous optimisation of two or more fossils triggers further branching artifacts, and may decrease branch support, because we have no molecular data compensating for eventual branch attraction conflicting with the actual phylogeny.

Let's take Polyptera as an example. If we de-root the trees, the original total evidence placement and the ML-EPA are not that different from each other: Polyptera is just moved one node. An easily inferred Neighbour-net, which is not 1-dimensional like a phylogenetic tree but 2-dimensional, shows the reason why (and does so using only the morphological data partition).

The neighbour-net based on the Manos et al.'s morpho-data partition.
Numbers at branches represent nonparametric bootstrap support (Least-squares and Maximum parsimony criteria) and Bayesian posterior probabilities.

  • We can see that Polyptera has a unique morphology (it shows the longest terminal edge of all fossils), making it equally similar to Platycarya and the remaining Juglandoideae: Juglans, Pterocarya, Cyclocarya, and Carya (Annamocarya is a not-widely-accepted Chinese genus, genetically indistinct from other East Asian Carya). This explains its instability in tree-based reconstructions. Assuming that Rhoiptelea points to the actual root, one could use the relatively high branch support values as an argument to say that Polyptera evolved after Platycarya split from the remainder of the Juglandoideae. But the network shows that the signal is not that straightforward, and Polyptera may just be a third lineage within the Juglandoideae (note the short orange edge bundle in contrast to the large red and green ones). A crucial question to check, also regarding the ML-EPA result, is whether the orange-edge clade (including Polyptera) is supported by uniquely shared characters and is not just a tree-branching artifact caused by the distinctness of the Platycarya group. Being substantially distinct (genetically and morphologically) from the remainder of the Juglandoideae, the members of this group must be placed as sister taxa. Being a fossil, Polyptera is not that distinct and is hence placed in the Juglandoideae core clade. Distance-based and parsimony methods are more vulnerable to long-branch attraction (or short-branch culling) than is ML; and Bayesian analysis optimizes towards the tree that best accommodates all signals in the data (compatible or not).
  • Cruciptera is more similar to Cyclocarya and Pterocarya than to Juglans, and represents a more primitive (ancestral) form. Based on the position of Cyclocarya and Pterocarya, we can directly conclude that they are morphologically less derived than Juglans, their sister taxon. Hence, one should be careful about interpreting Cruciptera as a precursor of, e.g., Pterocarya; one would have to go back into the matrix and assess which characters differentiate within this part of the graph, in order to decide whether the similarity between them is a genuine representation of shared (common) origin, and not just due to symplesiomorphies.
  • The fossil counterparts of modern-day Platycarya span a quite prominent box-like structure in the network, but the blue edge has little support from tree-based analyses. A simple explanation would be that these two are more ancient members of the Platycarya lineage, and are less derived than their modern counterpart and the other Juglandoideae.
  • Palaeooreomunnea is placed as one would expect for an ancestral form of the Engelhardioideae. It is clearly closer to the New World pair Alfaroa and Oreomunnea than to the Old World Alfaropsis and Engelhardia.
Data & software for EPA

The data matrix that I used for the ML-EPA, the Neighbour-net and the competing branch support analyses can be found in the supplementary information of the original paper.

EPA is implemented in RAxML since Version 7 and usually used to place environmental short sequence reads (Berger et al. 2011, Syst. Biol. 60:291–302). For a published application of EPA to place fossils, see e.g. Bomfleur et al. 2015, BMC Evol. Biol. 15:126.

Monday, April 16, 2018

Networks in the news, at last


Phylogenetic networks do not always fare very well in the traditional media. The general public has enough trouble dealing with a phylogenetic tree, let alone networks. For example, many people still consider that Darwin claimed that monkeys are our ancestors (a chain-based relationship) rather than our cousins (a tree-based relationship); who knows what they must think about humans inter-breeding with Neandertals (a network-based relationship).

Nevertheless, a few news reports about a recent network-based paper have suggested that the situation might be improving.


The paper in question is:
Úlfur Árnason, Fritjof Lammers, Vikas Kumar, Maria A. Nilsson, Axel Janke. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Science Advances 4: eaap9873.
This paper details extensive genomic admixture among six species of baleen whales. The phylogenetic scenarios involving gene flow cannot be represented by a tree, of course, so the authors include the following set of networks (along with a Median network).


News reports have appeared in at least two places, reporting on this paper, that discuss the difference between networks and "Darwinian trees", and do quite a good job of it.

For example, this quotation is from the New York Times ("Baleen Whales intermingled as they evolved, and share DNA with distant cousins"):
The relationships are so complicated, however, that the senior researcher Axel Janke said "family tree" is too simple a metaphor. Instead, the species, all part of a group called rorquals, have evolved more into a network, sharing large segments of DNA with even distant cousins. Scientists expressed surprise that there had been so much intermingling of baleen whales, given the variety of sizes and shapes.
This quotation is from Popular Science ("A new study on whales suggests Darwin didn't quite get it right"):
Evolutionary network analysis takes the tree metaphor and turns it into a complex web, which acknowledges the different kinds of familial connections shown by whole-genome sequencing. Comparing the whole genomes of rorquals shows that genetics is much more fluid than the Darwinian “tree” model, Janke says.
"Gene flow and hybridization is more common than biologists usually think," Janke says. Analysis of the rorquals’ genes shows that they've interbred in different ways at various times in their evolutionary history. This doesn't make much sense if you rely only on Darwin's model, where branches of the family tree never touch again after they separate.
I think that these give us all a reason for optimism.

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies


(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.
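For readers who have not met Delta Values before, here is a minimal Python sketch of how such a value can be computed from a pairwise distance matrix. It assumes that the matrix Delta Value is the mean of the quartet-wise delta scores used in delta plots, where each quartet contributes (largest sum minus middle sum) divided by (largest sum minus smallest sum) of its three pairings. The function name and the two toy matrices are hypothetical.

```python
from itertools import combinations


def delta_value(dist):
    """Mean quartet delta score of a symmetric distance matrix (list of lists)."""
    n = len(dist)
    if n < 4:
        raise ValueError("at least four taxa are needed to form a quartet")
    deltas = []
    for x, y, u, v in combinations(range(n), 4):
        # The three ways of pairing up four taxa, sorted by summed distance.
        sums = sorted((
            dist[x][y] + dist[u][v],
            dist[x][u] + dist[y][v],
            dist[x][v] + dist[y][u],
        ))
        spread = sums[2] - sums[0]
        # On a perfect tree the two largest sums are equal, so delta is 0.
        deltas.append(0.0 if spread == 0 else (sums[2] - sums[1]) / spread)
    return sum(deltas) / len(deltas)


# Toy distances for four taxa: additive on the tree ((A,B),(C,D)) ...
TREE_LIKE = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
# ... and a conflicting version in which A is also close to C.
BOXY = [[0, 2, 2, 3], [2, 0, 3, 3], [2, 3, 0, 2], [3, 3, 2, 0]]

print(delta_value(TREE_LIKE), delta_value(BOXY))
```

The first toy matrix fits a tree perfectly and scores 0; the second contains conflicting signal and scores 0.5. Real matrices fall somewhere in between, with lower values indicating more tree-like signal.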

Background

A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider publishing such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), indicating non-tree-like signals, as well as matrices with mDVs that are astonishingly low for a morphological matrix. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPT) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0-1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)


As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.

The most tree-like matrix (mDV = 0.184, which is a value typically found for molecular matrices allowing for inference of unambiguous trees) was the one of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined: using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (third-lowest of all of the datasets), but an RI of 0.9 (highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent to the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours solely added for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a bootstrap support of > 50% was twice as large in the distance-based framework as under parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-Joining algorithm; right, using a branch-and-bound search under parsimony as optimality criterion (one tree saved per replicate). Edge-lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates not shown; trivial splits are collapsed. Same colour scheme as above for orientation.
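The idea behind such a bootstrap consensus network can be sketched in Python as follows: count how often each split (bipartition) occurs among the replicate trees, and keep every split that reaches the cut-off, whether or not it is compatible with the others. The representation of a tree as a set of splits, the function names and the ten toy replicates are all assumptions made for this illustration; real analyses would read the splits from tree files.

```python
from collections import Counter


def split_frequencies(trees):
    """Return the fraction of replicate trees in which each split occurs."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}


def consensus_splits(trees, threshold=0.2):
    """Keep all splits found in at least `threshold` of the replicates."""
    freqs = split_frequencies(trees)
    return {split: f for split, f in freqs.items() if f >= threshold}


# Toy replicates for five taxa A-E. Each non-trivial split is written as the
# frozenset of taxa on the side that does not contain the reference taxon E.
AB, AC, BC, AD, ABC = (frozenset(s) for s in ("AB", "AC", "BC", "AD", "ABC"))
REPLICATES = [{AB, ABC}] * 6 + [{AC, ABC}] * 3 + [{BC, AD}]

KEPT = consensus_splits(REPLICATES)
for split, freq in sorted(KEPT.items(), key=lambda item: -item[1]):
    print("".join(sorted(split)), freq)
```

In the toy data the splits AB and AC both pass the 20% cut-off although they contradict each other, which is what produces a box in the network; a consensus tree would have to drop one of them.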

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence, the low mDV) but the branches are not twigs but well developed trunks. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.


Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina, which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it would represent a morphotype much more similar to the orange cluster, as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as first diverging "sister" with a very short to zero-long terminal branch, but in the network it is placed intermediate between the poorly differentiated but morphologically inhomogeneous oranges and the strongly derived reds, being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Or it may just reflect a parallel derivation in a member of the B. wyomingensis species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).

Interpretation, what does low CI but high RI stand for?

The distinction between the Consistency Index and the Retention Index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in a character that are implied to be unique on any given tree (i.e. one change for each character state): m / s, where m = the minimum possible number of character-state changes on the tree, and s = the observed number of character-state changes on the tree. The ensemble consistency index for the dataset (CI) is obtained by summing m and s across all characters before taking the ratio.

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on the tree. Once again, the ensemble retention index for the dataset (RI) is obtained by summing g, s and m across all characters before taking the ratio.
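A small worked example in Python may help to see how a low CI and a high RI can coexist. The sketch assumes the usual ensemble definitions, in which m, s and g are summed across characters before the ratios are taken; the function names and the two toy characters are hypothetical.

```python
def consistency_index(m, s):
    """Per-character consistency index: minimum / observed number of changes."""
    return m / s


def retention_index(m, s, g):
    """Per-character retention index; undefined when g equals m."""
    if g == m:
        return float("nan")  # parsimony-uninformative character
    return (g - s) / (g - m)


def ensemble_indices(characters):
    """Ensemble CI and RI from a list of (m, s, g) triples, one per character."""
    total_m = sum(m for m, _, _ in characters)
    total_s = sum(s for _, s, _ in characters)
    total_g = sum(g for _, _, g in characters)
    ci = total_m / total_s
    ri = (total_g - total_s) / (total_g - total_m)
    return ci, ri


# Toy binary characters: each needs at least one change (m = 1), shows some
# homoplasy on the tree (s > m), but could have needed far more changes (g).
CHARACTERS = [(1, 3, 20), (1, 2, 10)]

print(consistency_index(1, 3), retention_index(1, 3, 20))
print(ensemble_indices(CHARACTERS))
```

The first toy character changes three times instead of once, so its consistency index is only about 0.33; but because it could have required up to 20 changes, its retention index is still about 0.89. The ensemble values for the two characters are 0.4 and about 0.89.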

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and inference methods that are able to capture more than a trivial sequence of dichotomies.

[Update 10/9/2018] Related data files can now be found in my Collection of morphological matrices (some including extinct taxa) and related phylogenetic inferences (Version 2) on figshare

Monday, April 2, 2018

Things you can learn in a blink about your data


As phylogeneticists, we commonly have to deal with data that we don't initially understand. In this post, I'll use a recently published 8-gene dataset on lizards to show how much can be learned prior to any deeper analysis, just from producing a few Neighbour-nets.

The data

Solovyeva et al. (Cenozoic aridization in Central Eurasia shaped diversification of toad-headed agamas, PeerJ, 2018) sampled species of toad-headed agamas (lizards) across their natural range (north-western China to the western side of the Caspian Sea), to study their genetic differentiation in time and space. To do so they used two datasets. The mitochondrial data covers four gene regions: coxI, cytB, nad2, and nad4, and are complemented by four nuclear gene regions: AKAP9, NKTR, BDNF, RAG1.

This caught my eye, because the authors' preferred trees have a bunch of low branch-support values, so that this would be a good opportunity to advocate some Consensus networks. They also report only values above a certain threshold, as apparently recommended by several reviewers. My reviewers have quite often recommended the same, but I have always ignored this; I believe we should give the value, because it makes a difference whether it is just below the threshold (e.g. bootstrap support, BS, of 49) or non-existent (BS < 5). The authors also note that their mitochondrial and nuclear genealogies are not fully congruent. In short, the signal from their matrix is probably not trivial, but could be interesting.

In contrast to many other journals, PeerJ has a strict open-data policy. Solovyeva et al. provide each gene as FASTA-formatted alignment as Supporting Information. So let's have some quick-and-dirty Neighbour-nets.

Using Neighbour-nets to decide on an analysis strategy

A comprehensive outgroup sampling can avoid outgroup-rooting artefacts, but adding very distant outgroups comes at a price. We need to invest much more computational effort, because the inference programmes not only try to optimize our focus group, but the entire taxon set. Another principal question is: what can an outgroup taxon provide as information for rooting an ingroup, while being completely different? Furthermore, when we do an ML (or Bayesian) analysis, e.g. with RAxML, we leave it to the program to optimize a substitution model (even when we predefine a model, its parameters will usually be optimized by the inference software on the fly). By adding distant outgroups, we optimize a model for them plus our focus group — by not using any outgroup, we optimize a model suiting just the situation in our focus group.

Fig. 1 shows the neighbour-net (uncorrected, codon-naive p-distances) for the first of the mitochondrial genes, coxI (the others are similar), which tells us a lot about the data to be used for the tree inferences.

Fig. 1 Neighbour-net based on mitochondrial (coxI) uncorrected p-distances. The diffuse, non-treelike signal expressed in the A and B fans will be a hard nut for the tree inference, and will have little influence on questions dealing with the focal genus.
We can see that outgroup diversity is much higher than ingroup diversity, and that most outgroup taxa are very distinct from the ingroup. Looking at the closest outgroups (Stellagama, Agama, Laudakia, Paralaudakia, Xenagama, Pseudotrapelus), we see that finding an unambiguous sister taxon to the focal genus will be difficult. We can also see that including more-distant taxa just gives the algorithm much more work (note the A and B bushes) but will hardly be of any benefit for rooting the ingroup.
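For readers who have not met them before, uncorrected p-distances are simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch, not the procedure used for Fig. 1: the function names are made up for illustration, gaps and missing data are assumed to be coded as "-", "?" or "N" and are skipped pairwise, and any other ambiguity code is naively counted as an ordinary mismatch.

```python
from itertools import combinations

# Assumption: these characters mark gaps or missing data in the alignment.
GAP_CHARS = set("-?N")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites where both sequences have data."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # pairwise deletion: skip sites lacking data in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        return float("nan")  # no overlapping data, distance is undefined
    return differing / compared


def distance_matrix(alignment):
    """Build a symmetric p-distance matrix from a {taxon: sequence} dict."""
    names = list(alignment)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        dist[a][b] = d
        dist[b][a] = d
    return names, dist
```

A matrix like this is the kind of input a Neighbour-net starts from. Programs differ in how they treat gaps and ambiguity codes, so values computed this way need not match those behind the figures exactly.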

We can also see that the 3rd codon position is probably saturated to some degree, and that we will be dealing with a high level of stochasticity (randomly distributed mutation patterns) here: all terminal edges are long to very long. Since the same holds for the other three mitochondrial regions, it would not be a bad idea to run an additional inference including only the 1st and 2nd codon positions, if all taxa are to be included.
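Dropping the 3rd codon position from an alignment is easy to script. The sketch below assumes that the alignment is in frame, i.e. that its first column is a 1st codon position (or that the offset is given via the hypothetical frame_start parameter) and that no frame-shifting indels are present. The function name is illustrative only.

```python
def codon_positions(seq, positions=(1, 2), frame_start=0):
    """Return only the requested codon positions (1-based) of an in-frame sequence.

    frame_start is the 0-based index of the first column that is a 1st codon
    position; columns before it are discarded.
    """
    wanted = {p - 1 for p in positions}
    return "".join(
        base for i, base in enumerate(seq[frame_start:]) if i % 3 in wanted
    )


def strip_third_positions(alignment, frame_start=0):
    """Apply codon_positions to every sequence in a {taxon: sequence} dict."""
    return {
        name: codon_positions(seq, (1, 2), frame_start)
        for name, seq in alignment.items()
    }
```

The same function can also pull out the 3rd positions on their own, by passing positions=(3,), for example to compare the distances obtained from each partition.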

Using Neighbour-nets to understand the basic signal properties of your data

Fig. 2 shows the Neighbour-net (again, uncorrected p-distances) for one of the nuclear genes, AKAP9. The outgroup sample is somewhat different, but we can immediately see that this gene has more potential to resolve unambiguous phylogenetic relationships among the sampled taxa: the graph has distinctly tree-like portions. We also see that saturation of the 3rd codon position is much less of an issue here than for the coxI gene (Fig. 1), as the terminal edges are comparatively short with respect to the central edge bundles. [Nonetheless, it is never wrong to analyze coding gene data partitioned: 1st and 2nd codon positions vs. 3rd codon position.]


Fig. 2 Neighbour-net based on the nuclear (AKAP9) genetic distances. Note the much more treelike structure of the graph, the generally shorter terminal edges, and last-but-not-least the notable difference between ingroup (focal genus) and outgroup taxa.
For the general differentiation patterns, compare the minute extent of the focal group (green background in Fig. 2) with the prominent bush in Fig. 1. It is clear that including distant outgroups will not be of any benefit. We may even consider reducing the outgroup sample (if one has to include an outgroup at all) to the two genetically closest genera, Stellagama and Paralaudakia.

Similarly structured graphs are found for the other three nuclear genes.


Producing some quick Neighbour-nets doesn't hurt

Sometimes reviewers will pick on them; "distance-based phenetic method" is something I used to get a lot. Even then, you can still produce them just to get some basic impressions of your data set. This will help you to understand the results of your tree inferences, including why some of your branches have ambiguous support.

It comes as little surprise that the taxa one can identify in these networks as likely sister genera of the focal genus come up as sister taxa in the explicit phylogenetic analyses done by Solovyeva et al. (e.g. their fig. 2, showing the combined mitochondrial tree, and their fig. 3, showing the combined nuclear tree).

Solovyeva et al. (2018) performed some incongruence tests (AU-topology test) using single-gene inferences (going further than many other studies), but did not dig deeper. One of the authors answered my question about potential signal issues that may cause topological incongruence between ML and Bayesian trees, as well as ambiguous support, but he considers this to be solely a problem with methods: different algorithms prefer different phylogenies. Having looked at the basic differentiation pattern in the gene regions using Neighbour-nets, I think it may be more than just an issue with methods, since ML and Bayesian analysis should always support the same splits when using the same or similar substitution models.

Like many other studies, the authors also use the data for Bayesian dating and dating-dependent biogeographic analysis. Lacking any ingroup fossils, the authors could only constrain nodes within the outgroup subtree, which are nodes far from those that they discuss and estimate. I have my doubts that we can put much faith in the uncorrelated clock process to handle such extreme differences between focus group (ingroup) and (constrained) outgroup-taxon lineages as seen in Fig. 2. Estimates for rate shifts between outgroup and ingroup usually render ingroup age estimates to be too young, compared to age estimates obtained with ingroup fossils. This is something that can be directly deduced from a graph like the one in Fig. 2.

Data and networks can be found at figshare

The original paper provides a comprehensive supplement with a lot of interesting information, but the FASTA files, each comprising a single gene region, have a few editing issues and are not yet ready to use. Hence, I transformed them into NEXUS files and generated a combined data matrix. The files and the Neighbour-nets for each gene region (and a full single-gene maximum likelihood analysis) can be found on figshare.
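For those who want to do a similar conversion themselves, here is a minimal sketch of turning an aligned FASTA file into a simple NEXUS DATA block. This is not the procedure I used for the figshare files, just an illustration. It assumes DNA data, sequences of equal length, and taxon names (taken as the first word of each FASTA header) that need no quoting in NEXUS. The function names are made up.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {taxon: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumption: first word is the taxon name
            if name in records:
                raise ValueError(f"duplicate taxon name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_nexus(alignment):
    """Write a {taxon: sequence} dict as a non-interleaved NEXUS DATA block."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; this is not an alignment")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(alignment)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for name, seq in alignment.items():
        lines.append(f"    {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"
```

The length check is the useful part: editing issues such as truncated or unaligned sequences show up immediately as an error rather than as a silently malformed matrix.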