Showing posts with label Consensus network. Show all posts

Monday, November 18, 2019

Why the emperor has no clothes on – conflict or not?


In the final part of this series dissecting angiosperm gene trees (see: Why the emperor has no clothes on – part 1 and part 2), we will enter muddy ground. Using our example data set, we will try to make a call on whether or not there has been any (detectable) major reticulation in the deep branches of the angiosperm tree.

What triggers conflicting gene histories

Before we look at the data, it may be a good idea to set the scene using simple theoretical examples of what we may look at.


Our two genes, represented by circle and pentagon (could be multigene regions or entire genomes), both follow the same evolutionary history (the gray background tree). In the left lineage, we have a bit of incomplete lineage sorting, because the ancestor was polymorphic for the circles. In the right lineage, we have different fixation rates: the circles evolve faster than the pentagons. With molecular data we usually don't have the ancestors, which would make any inference straightforward; we only have the tips.


Because of incomplete lineage sorting and different fixation rates in the left and right lineages, the circle gene tree gets the phylogeny pretty wrong. The pentagon gene tree comes closer to the reality – we only infer two sister clades where there is a grade. (With real-world data, the branch support values could give one a clue that three of the inferred blue clades have a higher quality than the fourth supporting a pseudo-monophylum.) The circle and pentagon trees are largely incongruent despite sharing the same history; and we may infer a pseudo-hybrid (the first diverging lineage within the right clade).

Combining these data may allow us to infer a tree that fits the real tree much better. In the left clade the trivial pentagon signal can out-compete the misleading circle signal, and avoid the misplacement of the first diverging lineage of the right clade. In the right clade, the circle signal can help to correct for the pseudo-clade.

Now we can add a late reticulation, and re-infer the gene trees.


Because of the reticulation (the circles are biparentally inherited, the pentagons maternally), the gene trees are more congruent than in the example above (circle and pentagon get it a bit wrong in the left clade), except for the hybrid and its pseudo-hybrid parent. The gene conflict in placing the lineage cross (part of the left clade in the circle-based tree, part of the right clade in the pentagon tree) well reflects its hybrid origin.

Different histories of nuclear genes vs. plastid / mitochondrial genes?

The easiest way to catch reticulation is to compare trees based on plastid / mitochondrial data (maternally inherited) vs. nuclear data (biparentally inherited). If reticulation happened in the past, we can expect the maternal and biparental genealogies to diverge from each other (see part 2).

Strict Consensus network of the plastid (data from 3 protein-coding genes +1 partly coding gene region), mitochondrial (3 protein-coding genes) and nuclear trees (2 nrDNAs). The bold lines represent generally accepted phylogenetic splits (APG IV tree, see also Stevens' comprehensive Angiosperm Phylogeny Website).

This network is much more box-like compared to what one would have expected based on the combined tree that can be inferred from the data (Part 1). But are we looking at largely decoupled histories?

This mess is hardly surprising. The combined tree is constrained by the plastid tree, specifically by the signal from the matK gene (Part 1), while the remaining plastid genes (from a different part of the plastome) fall into line. The mitochondrial tree combines genes that on their own inform poorly resolved trees riddled with branching artifacts (Part 2). The nuclear tree, on the other hand, combines the most and least divergent nuclear genes widely known. Because of this, they show topological conflict between each other.

18S-25S rDNA tanglegram. The branch numbers show each gene's bootstrap support (BS) deviating from the combined BS support for the respective branch (indicated by line thickness): green, increased BS support when combining both genes, red, decreased BS support.

However, they are part of the same multi-copy coding unit (the 35S nuclear rDNA) that has very particular evolutionary constraints, such as structural constraints, affected by completeness of concerted evolution and intra-genomic recombination. Polyploid grasses, for example, can have up to three different collections of 35S rDNA, reflecting four different evolutionary origins, being part of the A, B, C or D genomes. You end up with what is called a multi-labelled tree: the A, B, C and D-genome variants of the same taxon pop up (consistently) in different parts of the tree, and you can have recombinants. If we look into the 18S vs. 25S data, however, we find no consistent sequence patterns supporting the topological conflicts between the two trees, or examples for recombination.

As in our theoretical example, each of the trees has certain strengths, and its own set of weaknesses, some of which can be overcome when combining the data (eg. branches with increased combined support in the 18S-25S tanglegram).

Bootstrap (BS) Consensus networks for the combined cp (upper left), mt (upper right), nc (lower left) and full data (lower right). Branches without numbers: BS = 100. Splits conflicting with those present based on the full data highlighted by red font (all with BS < 100).

In contrast to the boxy network appearance and the substantial conflict between the single gene trees (Part 2), most of the relationships (eg. the major clade roots but also many intra-clade relationships) receive high or unambiguous support in all three trees*. Aside from the disparate signals, the data seem to converge on a coalescent. If the genomes had different histories, they wouldn't converge so easily. Also, we would expect to see more consistent conflict between the "genome" trees than between the single-gene trees of the same genome, since the nuclear rDNA is biparentally inherited while the plastid and mitochondrial DNAs are passed on via the mothers only. Many of the angiosperms in our data reproduce sexually.

So far, no conclusive evidence for reticulation

Mere gene-tree incongruence is a poor basis for concluding that gene histories are decoupled. We need to dig for sequence-based evidence for reticulation and recombination. For instance, we might find a clearly derived sequence pattern exclusive to the right clade in a member of the left clade.

The importance of rare genomic changes when interpreting conflicting gene trees. The left and right clades obtained a unique and conserved gene or sequence feature before they diversified. The hybrid is the only taxon showing both.
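The kind of search described here can be sketched in a few lines of Python. The helper below is hypothetical (function name, taxon labels and sequences are all invented for illustration): it scans an alignment for columns where a state is fixed in, and exclusive to, one clade, and reports those columns where a member of the other clade carries that state. A single hit proves nothing, since it may be plain homoplasy; the argument in the text rests on clearly derived, consistent patterns.

```python
def foreign_diagnostic_sites(alignment, own_clade, other_clade, candidate):
    """Return alignment columns where `candidate` shows the state that is
    otherwise fixed in, and exclusive to, `other_clade`.

    `alignment` maps taxon name -> aligned sequence (equal lengths).
    """
    hits = []
    for pos in range(len(alignment[candidate])):
        other_states = {alignment[t][pos] for t in other_clade}
        own_states = {alignment[t][pos] for t in own_clade if t != candidate}
        if len(other_states) != 1:
            continue  # not conserved within the other clade
        (state,) = other_states
        if state in own_states or state in "-?N":
            continue  # not exclusive, or only gaps / missing data
        if alignment[candidate][pos] == state:
            hits.append(pos)
    return hits


# Invented toy alignment: Lx sits in the left clade but carries
# two states that are otherwise diagnostic for the right clade.
alignment = {
    "L1": "ACGTAC",
    "L2": "ACGTAC",
    "Lx": "ACTTAG",
    "R1": "ACTTGG",
    "R2": "ACTTGG",
}
hits = foreign_diagnostic_sites(
    alignment, own_clade=["L1", "L2", "Lx"], other_clade=["R1", "R2"], candidate="Lx"
)
print(hits)
```

Positions are zero-based column indices. In real data one would look for runs of such sites, or length-polymorphic motifs, rather than isolated substitutions.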

This is where the Walker et al. (2019) and Sullivan et al. (2017) studies seem to fall short — they don't give any example, gene, gene region, or recognizable lineage-diagnostic sequence pattern that could be used as direct evidence for decoupled gene histories and/or reticulation.

For my data set, I cannot pinpoint such evidence either. All high(er)-supported conflict seems to be related to lineage sorting and data/signal issues, the inability of certain gene regions to resolve relationships in parts of the angiosperm tree, or falling prey to (more local than global) long-branch attraction. When looking at the sequences, there's no reason to question, for example, the assumed monophyly of the main lineages and orders, in spite of the topological conflict we face when analyzing these data. If there was reticulation between the ancestors of angiosperm lineages, or later on between the already formed lineages, it left no obvious imprint in the data.

Thus, after having investigated aspects of the seeming conflict by going back to the data (checking highly divergent and conserved sequence patterns, tabulating the partly competing BS support of the single genes, and minus-one gene analyses), I did not hesitate to combine these data and use a Bayesian total-evidence dating procedure. (We never published the results because mid-Cretaceous angiosperm fossils have much too derived morphologies for total evidence dating; when left unconstrained, MrBayes optimized towards an angiosperm root age of 4.5 Ba, which was the in-built maximum).

A total-evidence Bayes tree based on the full data set. Stars indicate the position of fossil taxa (mid-Cretaceous). Note their relatively long terminal branches, a situation total-evidence dating cannot handle. The matrix can be found at figshare: A basic total evidence matrix for basal angiosperms – combining Soltis et al (2011) with Doyle & Endress (2010).

An example for actual reticulation resulting in gene tree conflict

Working at the coal-face of evolution, I have encountered examples of apparently real reticulation (when analysing biparentally inherited nuclear data). The most compelling was probably the ancient relictual genotypes and pseudogenes that point towards ancient reticulation in the widely known plane trees, Platanus. Platanus subgenus Platanus (which includes all but one species, P. kerrii, a relict of a distant lineage growing in tropical-hot subtropical lowland forests of North Vietnam) falls into two main lineages characterized by unique sets of genotypes, the ANA clade (Atlantic-facing North and Mesoamerica) and the PNA-E clade (NW. Mexico, California and Mediterranean).

Haplo/-genotypic composition of Platanus (Grimm & Denk, Taxon, 2010, ES2 [PDF]). Platanus kerrii represents the sole surviving relative within the Platanaceae (genetically very distinct), an old lineage of angiosperm trees (going back deep into the Cretaceous). Their next kin today are, according to angiosperm molecular trees, the enigmatic Proteaceae, a Gondwanan relict (represented in our angiosperm data by Petrophile). For an even more comprehensive genotypic study that also covers plastid markers check out De Castro et al., Ann. Bot., 2013 [open access]).

Individuals in the contact zone between species of the two main lineages (including hybrids) can be heterozygotic / polymorphic for at least one of the sequenced nuclear regions, so that identification of recent hybrids is straightforward. Beyond this, genetically inconspicuous members of the ANA clade may show ITS pseudogenes from the PNA-E clade (stippled line in the figures above and below). Furthermore, two of the ANA clade species show (predominantly) a PNA-E LEAFY genotype: P. palmeri (pa) and P. rzedowskii (rz), which grow closest to the populations of the PNA-E clade. However, this is not the genotype found in the close-by American PNA-E species (ra, ge), but one whose sequence is phylogenetically closer to that of the Mediterranean species, P. orientalis (or), on the other side of the globe.

Overlay of the LEAFY, 5S-IGS and ITS histories in Platanus. This doodle is based on tree- and network-inferences coupled with PCR-RFLP-based genotyping and in-depth analysis of mutation patterns in length-polymorphic sequence regions (Grimm & Denk 2010, ES1). P. x hispanica is the well-known ornamental alley/park tree, the 'London plane', a cultivated historical hybrid (mid 18th century) of the most hardy North American plane, P. occidentalis, and the frost-vulnerable Mediterranean plane, P. orientalis. In the Mediterranean, due to frequent backcrossing, one can find morphologically mixed individuals showing only the P. orientalis genotypes or homogeneous (American or European) type individuals showing occidentalis and orientalis genotypes (see eg. Pilotti et al., Euphytica, 2009).

Further reading

An animal example of seemingly incongruent single-gene trees that may well be the product of a largely shared evolutionary history is the autosomal intron data compiled for bears by Kutschera et al. (2014. Bears in a forest of gene trees: Phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Mol. Biol. Evol. 31:2004–2017). Rather than a "forest of trees", each gene tree is poorly resolved but, when combined, allows inferring a phylogeny that matches quite well the paternal genealogy based on Y-chromosome data, both in strong conflict with the maternal genealogy inferred from mitochondriomes (see Part 2).

In Supplement File S6 [PDF] of Grímsson et al. (2018, Grana 57:16–116), I outline how ambiguous signal from combined gene regions relates to the poor support of critical branches in the Loranthaceae tree; see also the related posts: Using consensus networks to understand poor roots and Trivial but illogical – reconstructing the biogeographic history of the Loranthaceae (again). Some gene-tree conflicts are possibly linked to different histories (nuclear vs. chloroplast data), while others are a mix of insufficient signal and missing data (between chloroplast genes).

In a previous post (All solved a decade ago: the asterisk branch in the Fagales phylogeny), I give another example using an old Fagales matrix, which resulted in a tree that, even today, is the gold standard of Fagales phylogeny. The matrix combines a highly conserved nuclear gene (18S) conflicting with the plastid genes and complemented by an entirely uninformative mitochondrial gene (matR) to provide a "tree based on all three genomes". Also in this case the three-genome tree is essentially the matK tree.



* That doesn't mean that all highly supported, unconflicted relationships must be true. Note that just by combining a few genes, we obtain a near-unambiguous support for the split between Mesangiosperms and the ANA-grade + gymnosperms, one of the splits defining the root and "basal" part of the angiosperm tree. The outgroup-inferred root is well fixed, even when using nuclear data, despite the fact that the 18S signal (the one showing the least ingroup-outgroup genetic distance) doesn't support such a root; the 25S does (see part 2), being more divergent and prone to ingroup-outgroup long-branch attraction (LBA). That we have LBA issues with the data is obvious from a tiny detail: Ginkgo is supported with BS > 70 as sister of Podocarpus, which is wrong based on all we know about gymnosperms (see also Earle's gymnosperm database and literature cited therein). The likely correct split, Ginkgo as sister to Cycas, is present in the nc tree, but represents a much less supported alternative (BS <= 25). It is also obvious when one looks at the alignment(s): Cycas and Ginkgo share some potential genetic 'synapomorphies' in the low-divergent, generally conserved regions (eg. 18S, stem-regions of 25S), but there are essentially none for Ginkgo + Podocarpus.

Monday, November 4, 2019

Why the emperor has no clothes on – a thicket of trees


A critical question in phylogenetics, and this applies to both the detection and inference of reticulation, is: How much trust do we put in the inferred tree? A phylogenetic tree is just the simplest of all possible phylogenetic networks. Let's assume that there was some phylogenetic reticulation in the past (lineage mixing and crossing), then, in the best-case scenario, our inferred tree shows one of the intertwining pathways but misses the tangles, the crossroads.

An example of simple reticulate evolution: pink is the product of very recent lineage crossing between an early diverged (and otherwise lost) member of the blue lineage and the more recent, hence genetically more coherent, red lineage. Bold lines show the tree we would likely infer in such a situation.

In the worst case, summarizing data with substantially different signals will give us branching artifacts such as:
  • terminal branches that are too long,
  • too long internal branches with conspicuously low support (ie. BS << 100, PP < 1.0),
  • artificial branches representing the least-conflicting solution for the conflicting data,
  • low branch support in general.
See eg. the bear data we used as a real-world example for our Intertwining trees and networks paper (Schliep et al. 2017, open access).

Three possible trees for bears, (a), Y-chromosome, paternal, and (c), nuclear-encoded autosomal introns, biparentally inherited, are congruent but disagree with the maternal genealogy (b), based on the mitochondrial genes. When fusing all three data sets, we get a (poorly) supported sister relationship for Sloth and Sun bears (red clade), not supported by any of the three fused data sets – a branching artifact.

Topological incongruence between gene trees and parental genealogies (as above) is commonly taken as evidence for reticulation. If one gene provides high support for taxon A as sister to B, and another gene has high support for B as sister to C, then B is likely the product of reticulation (eg. hybridization).
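The A-B versus B-C reasoning can be made concrete in terms of splits (bipartitions of the taxon set). A minimal Python sketch, with invented taxon labels and a hypothetical function name: two splits can sit on the same tree only if at least one of the four pairwise intersections of their sides is empty.

```python
def compatible(split_a, split_b, taxa):
    """Return True if two splits can occur on the same tree.

    Each split is given as one of its two sides (a set of taxa);
    the other side is the complement within `taxa`.
    """
    taxa = frozenset(taxa)
    a1, b1 = frozenset(split_a), frozenset(split_b)
    a2, b2 = taxa - a1, taxa - b1
    # Compatible if at least one of the four side-by-side
    # intersections is empty.
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


taxa = {"A", "B", "C", "D", "E"}
gene1 = {"A", "B"}        # gene 1: A sister to B
gene2 = {"B", "C"}        # gene 2: B sister to C
nested = {"A", "B", "C"}  # a clade that contains A and B

print(compatible(gene1, gene2, taxa))   # the two sister pairs
print(compatible(gene1, nested, taxa))  # a nested pair of splits
```

Incompatibility only tells us that two gene trees disagree about B; it says nothing about why they disagree.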

One simple possibility to put together a phylogenetic network is to summarize all of the trees in the form of a Consensus network, as shown next. (Technically this is a splits graph; it becomes a phylogenetic network as soon as we determine a root, which, here, would be at the edge leading to the Giant Panda.)

A strict Consensus network of the paternal, biparental, and maternal bear genealogies.
The numbers show the non-parametric bootstrap support for each (competing) split.

In this case, low support for a branch in a combined tree (the values on top) can result from strong conflict. For instance, the brownish splits, which are poorly supported using the combined data (BS = 21, 29), receive near unambiguous support from the mitochondrial genes, but are largely or entirely rejected by the Y-chromosome and nc-intron data. In the combined tree, this deep conflict is resolved by introducing the artificial red clade, with similarly low support: the signal in the data is ambiguous and they support splits between equally possible alternatives.
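To separate "low support because of conflict" from "low support because of weak signal", one can tabulate the support each data set gives to every poorly supported split of the combined tree. The sketch below is only an illustration: the split labels, all BS values and both thresholds (`low`, `high`) are invented assumptions, not numbers read from the figure.

```python
def classify_low_support(combined_bs, single_bs, low=50, high=90):
    """Label poorly supported combined splits by their likely cause.

    combined_bs: {split label: BS from the combined data}
    single_bs:   {data set: {split label: BS from that data set}}
    A split below `low` in the combined analysis is called "conflict"
    if at least one data set supports it with BS >= `high`,
    otherwise "weak signal". Both thresholds are arbitrary choices.
    """
    labels = {}
    for split, bs in combined_bs.items():
        if bs >= low:
            continue  # well enough supported, nothing to explain
        best = max(per_set.get(split, 0) for per_set in single_bs.values())
        labels[split] = "conflict" if best >= high else "weak signal"
    return labels


# Invented values for illustration only.
combined_bs = {"split_x": 21, "split_y": 29, "split_z": 35, "split_w": 98}
single_bs = {
    "mt": {"split_x": 99, "split_y": 100, "split_z": 30, "split_w": 95},
    "Y": {"split_x": 0, "split_y": 2, "split_z": 41, "split_w": 90},
    "nc": {"split_x": 5, "split_y": 0, "split_z": 28, "split_w": 88},
}
labels = classify_low_support(combined_bs, single_bs)
print(labels)
```

A split flagged as "conflict" still needs a look at the competing alternatives in the other data sets, which is exactly what the consensus network displays.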

We know lineage crossing took place in bears (the mitochondrial and Y-chromosome trees are very much in conflict). However, does the above mean that the earliest bear-ish creatures hybridized, too? Note that the conflict is associated with a short-branched part of the graph, where apparently little evolution happened. Fast ancient radiations usually come with incomplete lineage sorting and diffuse signals. The only data set producing longer roots, but with notably lower support, is that of the biparentally inherited introns.

We are closing in on our own tail and have to ask again: Is this low support in the autosomal intron tree due to internal conflict, (sets of) introns preferring different topologies, supporting an ancient mixing hypothesis, or just reflecting lack of resolution? Check out the original paper by Kutschera et al. (2014, Mol. Biol. Evol. 31: 2004–2017), and make up your own mind.

On to the angiosperms

In my last post, I exemplified what Walker et al. (PeerJ, 2019) found in their angiosperm study: when we look at a plastome tree we are not looking at a summary of all gene trees but instead at a topology forced by very few of the genes in the chloroplast genome, such as the matK. We also have seen that one misplaced sequence (outgroup Podocarpus-matK) doesn't affect the combined analysis at all; it didn't even reduce the ingroup-outgroup split support. Also, I noted that the low-supported part of the combined tree goes hand in hand with lack of decisive signal from the matK.

It's time to take a look at what the other genes in this example data set come up with.

The eight gene trees. Terminal subtrees collapsed. Scales fit to size, scale bar = 0.1 expected substitutions per site. Upper left, matK tree which is very similar to the combined tree using all gene regions (cp = chloroplast, mt = mitochondrial, nc = nuclear genes). Note the low performance of the mt genes.

One thing is obvious: for most genes (except the nuclear-encoded rRNA genes) including the outgroup taxa adds little ingroup information of use; they are just too distant from any of the ingroup taxa. Outgroup rooting is tricky for angiosperms. Outgroup taxa will always be attracted to the ingroup taxon that is the least similar to any other part of the ingroup: Amborella in this case.

Generally, all of the ANA-grade water plants are genetically distinct and topologically isolated; any outgroup-inferred root must be placed in this part of the tree (all other living seed plants are very distant relatives of angiosperms looking back at, at least, ~250 million years of independent evolution, see eg. Age of Angiosperms... and What is an angiosperm pt. 2). The relatively conserved plastid rbcL and mitochondrial matR prefer an Amborella-Nymphaeales clade as sister to all other angiosperms, while the more divergent atpB (plastid), nad5 (mitochondrial) and 25S (nuclear) prefer the Amborella-root; this is a direct indication of ingroup-outgroup long-branch attraction. Any other placement of the outgroup subtree within the ingroup would necessarily decrease the likelihood of the tree (but note the position of the root in the 18S tree, lower left, the tree based on the most-conserved, evolution-constrained gene in our sample; see also All solved a decade ago..., fig. 4A).

We can look at these trees with the strict consensus network, using uninformed edge lengths, that is, the network counterpart to the strict consensus cladograms still common in plant phylogenetic literature.



This is a nice piece of computer-art, but is scientifically quite useless (the boxiness and general graph structure is, however, reminiscent of strict consensus networks of most-parsimonious tree samples inferred for extinct animals, one example, and plants).

We can add some discriminatory information by counting how often each split occurs in the set of gene trees.

Same set of trees, different way of summarizing it. Note how the main clades emerge: one or two genes may have misplaced one or another OTU but the others get it right.
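Counting splits is simple enough to sketch in a few lines of Python. The toy example below assumes trees written as nested tuples and uses made-up OTUs and gene names; it illustrates the idea only and is not the software behind the figure. It also assumes that all trees share the same leaf set.

```python
from collections import Counter


def leaf_set(node):
    """Collect the leaf names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Return the non-trivial splits of a tree, each as one frozenset."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)  # reference OTU used to orient every split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_taxa - side
        # Skip trivial splits (one leaf, or nothing, versus the rest).
        if len(side) > 1 and len(other) > 1:
            # Store the side without the anchor, so that both
            # orientations of a split are counted as one.
            found.add(other if anchor in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


# Hypothetical gene trees over five OTUs.
gene_trees = {
    "gene1": ((("A", "B"), "C"), ("D", "E")),
    "gene2": ((("A", "B"), "C"), ("D", "E")),
    "gene3": ((("A", "C"), "B"), ("D", "E")),
}

split_counts = Counter()
for tree in gene_trees.values():
    split_counts.update(splits(tree))

for split, n in split_counts.most_common():
    print(sorted(split), n)
```

Splits found in all trees make up the tree-like backbone; splits found in only one or two trees are the ones that add boxes to the network.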

Alternatively, we can average the actual tree branch lengths to inform the edge length of the consensus network.

The light green, sand-colored, light brown and dark olive (clockwise) splits are likely branching artifacts. The light blue split is the one that supports the ANA-grade when the (combined) tree is rooted with the very distant outgroups.

A pretty little thicket of trees. Some agreement is found towards the leaves, but even here we have conflict among the gene trees. In some trees, there are long branches grouping non-related OTUs, obvious tree inference artifacts. The general rule is that the deeper we go (ie. the farther back in time), the messier it gets. Adding to this is that, irrespective of which gene is used, some OTUs are much closer to the hypothetical common ancestor (of Mesangiosperms, ie. all but ANA grade) than others – in the eudicots, the least-evolved taxa are Platanus (very old tree genus) and Euptelea (the basalmost Ranunculales); in the Magnoliales, the only angiosperm clade that lacks synapomorphies, it's Magnolia and Liriodendron (again, very old and primitive tree genera). Darwin's Abominable Mystery, the sudden appearance and quick dominance of angiosperms, resulted in an abominable chaos of gene trees and signals. How can they possibly converge to a single tree with amply high support along most branches?

The combined tree from the first post.
When compared to the bears, the answer may well be: because there has been very little to no reticulation between these lineages. Our thicket may be not a forest of trees but just a poorly trimmed, wildly overgrown bush. The genes share the same history, but when being analyzed one-by-one, each of their trees gets some aspects right, and some others (severely) wrong. Misplacing one OTU (e.g. the light green, dark olive, sand-colored and dark yellow splits in the averaged Consensus network) may have further topological effects; it didn't matter for the matK gene, because we misplaced only one very alien OTU in a data set that otherwise is hardly affected by adding or removing OTUs.

I argue here that, if there had been substantial reticulation messing up the signal of contemporary lineages and reflecting decoupled histories (like in the case of bears), we would expect at least some (artificial) branching patterns with low support in the combined tree, as well. This would also be the case looking at the gene-tree consensus networks, not only in the deepest parts but also closer to the leaves.

We will explore this alternative hypothesis in the next (and final) post of this mini-series.

Monday, October 21, 2019

Why the emperor has no clothes on – the mighty matK


In a recent paper published in PeerJ, Walker et al. (2019) take a close look at the complete plastome data of angiosperms. Although they don't find anything fundamentally new — well, at least not for those of us who have looked at the oligogene datasets we worked with — it's nice to see that somebody has been willing to do it in a very comprehensive way, and thereby published what some of us have long known:
  • A combined tree is not the sum of the genes that have been combined;
  • Single-gene trees can tell you very different stories.
Even if the overall branch support is pretty high, we always should be aware of internal data conflict.

When looked at closely, the emperor, in this case the Angiosperm Phylogeny Group (APG) complete plastome tree, may not be entirely naked, but is clothed in very few of the many garments at his disposal. Effectively the branches in the plastome reference tree draw their support from very few of the 79 genes/gene regions in the plastome. As Walker et al. note:
"Of the most commonly used markers, matK, greatly outperforms rbcL; however, the rarely used gene rpoC2 is the top-performing gene in every analysis. We find that rpoC2 reconstructs angiosperm phylogeny as well as the entire concatenated set of protein-coding chloroplast genes."

Fig. 1 from Walker et al. showing the (lack of) individual gene support for the angiosperm reference phylogeny.

However, there is one aspect of the paper that calls for a network-based blog post:
"Following the typical assumptions of chloroplast inheritance [i.e. that the entire plastome shares a common history being passed on solely by the mother in angiosperms], we would expect all genes in the plastomes to share the same evolutionary history. We would also expect all plastid genes to show similar patterns of conflict when compared to non-plastid inferred phylogenies ... Our results, however, discussed below, frequently conflict with these common assumptions about chloroplast inheritance and evolutionary history."
Getting incongruent branches in the single-gene trees, including a few highly supported ones, is taken as evidence for different histories potentially mixed within the plastome. Walker et al. give references for (potential) recombination and reticulation in plastomes.

I asked a question about whether this logic isn't a bit naive about tree inference. In their response, they pointed to the paper by Sullivan et al. (Mol. Biol. Evol. 2017): these authors tested for recombination in Picea (spruce) plastomes, then split the complete plastomes into three structural units, and found two embedded conflicting phylogenies, as shown in the next figure.

Fig. 4 from Sullivan et al. (2017). F1 and F2 are structural regions comprising most of the large single-copy unit, the F3 the two (duplicate) inverted-repeat regions and the small single-copy unit of the Picea plastomes.

This seems to be a compelling case (but note the BS < 100 for conflicting critical branches). It is also quite possible, since gymnosperm plastomes, in contrast to angiosperms, may be paternally or biparentally inherited. But, is it a valid assumption that each single-gene tree (or, in Sullivan et al.'s case, trees based on multigene regions) reflects the true tree of that gene or gene complex? That is, even if I assume that all of the genes in my matrix share the same history, must they support the same inferred tree?

Since I have worked on a lot of taxonomic groups, and often with other people's (plastid) data (during my entire career, I remained faithful to the nuclear-encoded ribosomal DNA spacers), my spontaneous answer would be: Absolutely not! Topological conflict may hint towards decoupled gene histories; it is a necessary criterion but not a sufficient criterion.

There are quite a lot of evolutionary scenarios that will lead to data inevitably supporting wrong branches, or false positives (see also Walker et al.'s discussion). Even if evolution is a strictly dichotomous process (which it clearly isn't):
  • low divergence may result in primitive (underived) sequences ('genetic symplesiomorphies') being shared by distant taxa
  • high divergence may result in saturation, which ultimately triggers branching artifacts
  • long isolation coupled with small active population sizes, repeated bottleneck / massive extinction events and/or lack of radiations will lead to sequences that are different from anything else in our data (in angiosperms, this phenomenon has a name: Ceratophyllum).
In fact, the very argument for angiosperm molecular phylogeneticists to move away from using single-gene phylogenies was that these first single-gene trees had branches that made little sense, especially when based on plastid data.

Single-gene trees will get things wrong. The more signal we add, usually by adding additional gene regions, the more we will reduce these errors (this is the best-case scenario, but see Delsuc et al., Nature Rev. Genet. 2005). Thus, if some gene-trees conflict more with the combined tree than do others, it can be for two possible reasons:
  1. The conflicting genes had indeed different evolutionary histories. However, this would have to involve intra-plastome recombination and heteroplasmy, which so far have been very rarely documented in angiosperms.
  2. All genes had the same evolutionary history, but some of the data get more aspects of this true tree right than do others (and, of course, some are wrong that others get right).

And the matK said: "I'm your lord, follow my lead"

Walker et al. (all their scripts and results files can be found on github) find that it's only a few of the genes that essentially make up the combined tree. One of them is an old reliable pal of angiosperm phylogeneticists, the chloroplast matK gene. The literature is full of "multigene" trees that are effectively matK gene-trees using enlarged matrices. The matK determines a topology, and by adding genes that cannot compete with it (being too conserved, too variable or just inconsistently different), we reinforce this topology. Only branches unresolved by matK will be further optimized using the added data.

Let's look at an example.

For the purpose of this post (and the follow-up), I'll use an old angiosperm matrix I have in stock (I know the quirks of this matrix). For analysis, I eliminated all of the OTUs with missing gene partitions, mainly to make sure that all of the trees and bootstrap (BS) pseudoreplicate trees have the same set of leaves, so I can summarize the tree samples using consensus networks.
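Such a pruning step can be sketched as follows. This is a hypothetical helper with invented sequences, assuming each gene partition is held as a dictionary of aligned sequences; how the matrix was actually filtered is not described beyond the sentence above.

```python
def complete_taxa(partitions, missing_chars="-?Nn"):
    """Return the OTUs that have real sequence data in every partition.

    `partitions` maps gene name -> {OTU: aligned sequence}.
    An OTU counts as missing for a gene if it is absent, or if its
    sequence consists only of gap / missing-data symbols.
    """
    keep = None
    for sequences in partitions.values():
        present = {
            otu
            for otu, seq in sequences.items()
            if seq.strip(missing_chars) != ""
        }
        keep = present if keep is None else keep & present
    return keep or set()


# Invented toy matrix: Platanus lacks matK data and has no 18S entry.
partitions = {
    "matK": {"Amborella": "ATGC", "Nymphaea": "ATGA", "Platanus": "----"},
    "rbcL": {"Amborella": "GGCA", "Nymphaea": "GGCT", "Platanus": "GGCT"},
    "18S": {"Amborella": "TTAC", "Nymphaea": "TTAC"},
}

kept = complete_taxa(partitions)
pruned = {
    gene: {otu: seq for otu, seq in seqs.items() if otu in kept}
    for gene, seqs in partitions.items()
}
print(sorted(kept))
```

With identical leaf sets in every partition, the single-gene trees and their bootstrap replicates can be compared split by split.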

Here's my combined tree, unpartitioned.

Gray – current APG IV classification, "gold tree" (primary relationships within Mesangiospermae still a matter of debate)
And here is the fully partitioned one (over-parametrized; with each gene/codon position treated as data partition).

Essentially the same tree (some branches elongated, others shortened), eudicot clade and the Ceratophyllum-monocot clade swapped positions. Both trees have the same scale.

Even though my matrix includes only relatively few genes (just 21,550 sites), the tree gets the main aspects of the APG IV standard tree. The support for most of the branches is nearly unambiguous (irrespective of data partition), with the exception of some deep-down relationships within the Mesangiospermae (a long-standing issue, called the "dirty dozen"). The fact that the unpartitioned and partitioned analyses agree for the most part indicates that the signal in my matrix has no model-related issues (at least, none we could fix by using "better" models).

And the matK tree mirrors the fully partitioned tree, as shown here.

A tanglegram of the matK and the combined trees. Shown is the matK BS support for shared and conflicting edges. Orange asterisks, the monocot subtrees have the same structure but when using only matK, the conifer outgroup Podocarpus is nested deep within.

The similarity is indeed striking, in particular since the gene sample in the matrix comprises data from:
  • two of the nuclear-encoded ribosomal RNA genes (18S, 25S; biparentally inherited) that did follow partly different evolutionary trajectories, as e.g. well-studied in the case of Fagales (being a derived eudicot, not included in my matrix)
  • six chloroplast genes/gene regions (maternally inherited including the classics rbcL and matK but also the rpoC2, the most informative gene identified by Walker et al.)
  • three mitochondrial genes (also maternally inherited, but most mutations are, amino-acid-wise, synonymous, being concentrated at the third codon position).
The main things that matK gets wrong* in contrast to the combined tree are deep divergences represented by (very) short branches, in the part of the graph following the (very rapid) split of the mesangiosperm common ancestor (known as "Darwin's abominable mystery").

Also, it nests Podocarpus, the conifer in the outgroup, with unambiguous support in the monocots — which clearly is wrong, a false positive. Looking into the alignment, we can see that the reason for this is a mix of moderate-LBA (long-branch attraction) with missing-data-culling. To minimize LBA artifacts in the matrix originally used, I blanked out parts of the matK in the outgroup (which included a more derived conifer, Pinus, but also the extremely divergent gnetophytes); parts that were not straightforwardly alignable with the angiosperm matK.

The best way to illustrate internal signal conflict is, however, to directly show the BS Consensus network, not mapping support on two alternative topologies as seen in the tanglegram.

BS Consensus network based on 150 matK BS pseudoreplicates (numbers of necessary BS replicates determined by Pattengale et al.'s extended majority rule bootstop criterion implemented in RAxML)

When looking at BS << 100 and the boxes of competing splits in BS-support networks, it is important to keep in mind that low support can have two reasons:
  • Lack of decisive signal, because the BS pseudoreplicates will have (semi-)random or biased branching patterns; in the tree this surfaces usually as low (when random) to moderately high (when biased) support associated with (very) short branches.
  • Conflicting signals, ie. signals incompatible with a single tree; depending which site is eliminated or duplicated during resampling, the BS pseudoreplicate will show one or another topology; strong, deep conflict can surface in a tree by low support associated with (normally) long internal branches but also relatively high support for one alternative topology, the other only manifesting in very long terminal branches.
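
The two cases in the list above can be told apart by asking which splits compete with the branch in question, and how strongly. Here is a minimal Python sketch under stated assumptions: each split is stored once, as the set of taxa on one side; the support table is toy data (not taken from the matK analysis); and the function names and the noise level of 10 are hypothetical choices made for illustration only.

```python
def compatible(split_a, split_b, taxa):
    """Two splits fit on one tree if at least one of the four
    pairwise intersections of their sides is empty."""
    sides_a = (split_a, taxa - split_a)
    sides_b = (split_b, taxa - split_b)
    return any(not (x & y) for x in sides_a for y in sides_b)


def competitors(target, support, taxa):
    """Splits incompatible with `target`, strongest first."""
    rivals = [(s, bs) for s, bs in support.items()
              if not compatible(target, s, taxa)]
    return sorted(rivals, key=lambda pair: pair[1], reverse=True)


def diagnose(target, support, taxa, noise_level=10):
    """Crude call (hypothetical threshold): 'conflict' if the strongest
    rival split exceeds the noise level, else 'lack of signal'."""
    rivals = competitors(target, support, taxa)
    if rivals and rivals[0][1] > noise_level:
        return "conflict"
    return "lack of signal"


# Toy BS support tables for five taxa A-E (made-up numbers).
taxa = frozenset("ABCDE")
ab, ac, de = frozenset("AB"), frozenset("AC"), frozenset("DE")
conflicted = {ab: 51, ac: 49, de: 100}   # one strong alternative
indifferent = {ab: 51, ac: 6, de: 100}   # alternatives are just noise

print(diagnose(ab, conflicted, taxa))
print(diagnose(ab, indifferent, taxa))
```

A BS Consensus network shows the same information graphically: the competing splits returned by `competitors` are the ones that form the boxes.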
Regarding Walker et al.'s results, we now need to ask:
  1. Are the non-conflicted branches in the combined tree (major clades equal to the gold tree) the result of shared history of all of the included genes, or just that of the matK?
  2. Is the conflict with the combined tree and locally ambiguous signals due to a different history of the matK, located in the large single-copy unit, and the other genes, or just matK's inability to get certain things right?
In this case, all relatively high-supported conflicting matK splits are associated either with: (i) very short internal branches in the tree, the non-discriminative product of a fast ancient radiation, or (ii) are the result of an obvious data/branching artifact, ie. the misplaced Podocarpus.

So far, nothing challenges the assumption that the combined genes followed the same history. Whether the other genes reveal something else, we'll see in my next post.



* or right: APG IV treats Ceratophyllum as the "probable sister of the eudicots" (see also Stevens' Angiosperm Phylogeny Website).

Monday, September 2, 2019

Losing information in phylogenetic consensus


Any summary loses information, by definition. That is, a summary is used to extract the "main" information from a larger set of information. Exactly how "main" is defined and detected varies from case to case, and some summary methods work better for certain purposes than for others.

A thought experiment that I used to play with my experimental-design students was to imagine that they were all given the same scientific publication, and were asked to provide an abstract of it. Our obvious expectation is that there would be a lot of similarity among those abstracts, which would represent the "important points" from the original — that is, those points of most interest to the majority of the students. However, there would also be differences among the abstracts, as each student would find different points that they think should also be included in the summary. In one sense, the worst abstract would be the one that has the least in common with the other abstracts, since it would be summarizing things that are of less general interest.

The same concept applies to mathematical summaries (aka "averages"), such as the mean, median and mode, which reduce the central location of a dataset to a single number. It also applies to summaries of the variation in a dataset, such as the variance and inter-quartile range. (Note that a confidence interval or standard error is an indication of the precision of the estimate of the central location, not a summary of the dataset variation — this is a point that seems to confuse many people.)
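
To make the distinction in the parenthetical note concrete, here is a small Python sketch. It uses made-up numbers and, as a simplification, the population standard deviation. Repeating the same values four times leaves the spread of the data unchanged, but the standard error of the mean shrinks.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize(values):
    """Return (central location, dataset variation, precision of the mean)."""
    spread = pstdev(values)                  # variation in the dataset
    std_error = spread / sqrt(len(values))   # precision of the estimated mean
    return mean(values), spread, std_error


small = [2, 4, 4, 4, 5, 5, 7, 9]   # toy data
large = small * 4                  # same values, four times as many

print(summarize(small))
print(summarize(large))
```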

So, it is easy to summarize data and thereby lose important information. For example, if my dataset has two exactly opposing time patterns, then the data average will appear to remain constant through time. I might thus conclude from the average that "nothing is happening" through time when, in fact, two things are happening. I will never find out about my mistake by simply looking at the data summary — I also need to look at the original data patterns.
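
A toy illustration of this point in Python (the two series are invented): one pattern rises, the other falls, and their average is flat.

```python
rising = [1, 2, 3, 4, 5]    # first pattern through time
falling = [5, 4, 3, 2, 1]   # exactly opposing pattern

# Averaging the two series at each time point hides both trends.
average = [(r + f) / 2 for r, f in zip(rising, falling)]
print(average)
```

Every entry of `average` is the same, so the summary alone suggests "nothing is happening".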


So, what has this got to do with phylogenetics? Well, a phylogenetic tree is a summary of a dataset, and that summary is, by definition, missing some of the patterns in the data. These patterns might be of interest to me, if I knew about them.

Even worse, phylogenetic data analyses often produce multiple phylogenetic trees, all of which are mathematically equal as summaries of the data. What are we then to do?

One thing that people often do is to compute a Consensus Tree (eg. the majority consensus), which is a summary of the summaries — that is, it is a tree that summarizes the other trees. It would hardly be surprising if that consensus tree is an inadequate summary of the original data. In spite of this, how often do you see published papers that contain any evaluation of their consensus tree as a summary of the original data?

This issue has recently been addressed in a paper uploaded to the BioRxiv:
Anti-consensus: detecting trees that have an evolutionary signal that is lost in consensus
Daniel H. Huson, Benjamin Albrecht, Sascha Patz, Mike Steel
Not unexpectedly, given the background of the authors, they explore this issue in the context of phylogenetic networks. As they note:
A consensus tree, such as the majority consensus, is based on the set of all splits that are present in more than 50% of the input trees. A consensus network is obtained by lowering the threshold and considering all splits that are contained in 10% of the trees, say, and then computing the corresponding splits network. By construction and in practice, a consensus network usually shows the majority tree, extended by a number of rectangles that represent local rearrangements around internal nodes of the consensus tree. This may lead to the false conclusion that the input trees do not differ in a significant way because "even a phylogenetic network" does not display any large discrepancies.
That is, sometimes authors do attempt to evaluate their consensus tree, by looking at a network. However, even the network may turn out to be inadequate, because a phylogenetic tree is a much more complex summary than is a simple mathematical average. This is sad, of course.
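
The thresholds mentioned in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the split-counting step, not of how any particular program computes or draws the splits network. The ten input trees are toy data, each represented by its set of non-trivial splits, and the function names are hypothetical.

```python
from collections import Counter


def split_frequencies(trees):
    """Fraction of input trees that contain each split."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}


def select_splits(freqs, threshold):
    """Keep the splits found in more than `threshold` of the trees."""
    return {split for split, f in freqs.items() if f > threshold}


# Toy sample: ten trees on taxa A-E, each given as a set of splits
# (one side of each split, stored as a frozenset).
ab, ac, bc, de = (frozenset(s) for s in ("AB", "AC", "BC", "DE"))
trees = [{ab, de}] * 6 + [{ac, de}] * 2 + [{bc, de}] * 2

freqs = split_frequencies(trees)
majority = select_splits(freqs, 0.5)   # splits of the majority consensus tree
network = select_splits(freqs, 0.1)    # splits of the consensus network

print(sorted("".join(sorted(s)) for s in majority))
print(sorted("".join(sorted(s)) for s in network))
```

In this toy sample the network adds AC and BC to the majority splits; both are incompatible with AB, which is the kind of local rearrangement that shows up as a rectangle.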

So, the new suggestion by the authors is:
To harness the full potential of a phylogenetic network, we introduce the new concept of an anti-consensus network that aims at representing the largest interesting discrepancies found in a set of trees.
This should reveal multiple large patterns, if they exist in the original dataset. Phylogenetic analyses keep moving forward, fortunately.

Monday, January 14, 2019

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the
GTR + Gamma model with the morphological character set mapped onto the tree.
Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.


We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly deviating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.



The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another (quite typical) pattern: large stretches with very little variation, hence devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace, resulting in PP < 1.0).


For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hiccup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Limited to a single taxon, it has no topological effect (we need at least four to make a call); it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions, eg. in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix's characters, are covered for only four of the taxa (the sister taxon used to root, Vampyroteuthis, and three of the core Octopodia); hence, they can only support a single split within this group and be used to test for its alternatives.



What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS  support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.")

Even though I ran an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartitioned BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuss about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal, the data is indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative for lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support: the shorter the branch, the lower the support. There are a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with comparatively low support (ML-BS = 63; it collapses when using the authors' partitioning scheme that treats each gene region as an individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are, including the 12S rDNA-based Octopodida core clade and, connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partition for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed; alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poorly supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. A nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneledone and Bentheledone. The remaining branches have ambiguous support.


Here, the partitioning scheme is a game changer. Unpartitioned, the favored alternative is an Adelieledone-Pareledone-Megaleledone (APM) grade "basal" to Graneledone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favors an APM clade sister to the latter two (quite a difference, since we often equate clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know whether the remaining 70%, 50%, or 30% of bootstrap replicates show random alternatives or the same one(s). When a tree has ambiguously supported branches, BS Consensus networks should be obligatory.

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relatively high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which is a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy-to-generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although, in my experience, they always are).

Monday, December 10, 2018

Please stop using cladograms!


I really like the journal PeerJ, not only because it is open access and publishes the peer review process, but also because it's one of the few that adhere to strict policies when it comes to data documentation. In my last (on my own) 2-piece post (part 1, part 2), I showed what networks could have offered for historical and more recent studies in Cladistics, the journal of the Willi Hennig Society. In this one, I'll illustrate why paleontology in general needs to stop using cladograms.

An example

In a recent article, Atterholt et al. (PeerJ 6: e5910, 2018) describe and discuss "the most complete enantiornithine from North America and a phylogenetic analysis of the Avisauridae". I'm not a paleozoologist and "stuff of legend", but their first 17 figures seem to make a good point about the beauty of the fossil and its relevance; and it is interesting to read about it. This makes me envy paleozoologists a bit — the reason I exchanged chemistry for paleontology was my childhood love for the thunder lizards; I specialized in zoology not botany for graduate biology courses, and I fell in love with social insects, especially bees; but then more general circumstances pushed me into plant phylogenetics.

The result of Atterholt et al.'s phylogenetic analysis is presented in their figure 18, as shown here.

Figure 18 of Atterholt et al. (2018): "A cladogram depicting the hypothetical phylogenetic position of Mirarce eatoni." [the beautiful fossil is highlighted in bold font]
This looks very familiar — graphs like this can be seen in many paleontological studies, not only those in Cladistics. However, this is a phylogeneticist's "nightmare" (but a cladist's "dream").

First, phylogenetic trees, especially those that were weighted post-analysis several times to get a more or less resolved tree, should be depicted as phylograms — trees with branch lengths. Phylogenetic hypotheses are not only about clades, and what is sister to what, but about the amount of (inferred) evolutionary change between the hypothetical ancestors, the internal nodes, and their descendants, the labelled tips. For example, we may want to know how long is the root of the clade (Avisauridae, Avisaurus s.l.) comprising the focus taxon compared to the lengths of the terminal branches within the clade. Prominent roots and short terminals are a good sign for monophyly (inclusive common origin), or at least a fossil well placed, whereas short roots and long terminals are not.

The above tree as phylogram (using PAUP*'s AccTran optimization). The beauty of cladistic classification is that the new specimen could have just been described as another species of Avisaurus (but read the author's discussion).

In this example, we seem to be on the safe side, although one may question the general taxonomic concept for extinct birds. Are the differences enough to erect a new genus for every specimen? This is hard to decide based on this matrix.

Second, a tree without branch support is just a naked line graph, telling us nothing about the quality (strengths and weaknesses) of the backing data. Neontologists are not allowed to publish naked trees. In molecular phylogenetics, we are not uncommonly asked by reviewers to drop all branches (internodes) below an arbitrary threshold: a bootstrap (BS) support value < 70 and posterior probability (PP) < 0.95. In paleontology, it has become widely accepted to not show support values at all. The reason is simple: the branch support is always low, because of data gaps and homoplasy. This is a problem the authors are well aware of:
The modified matrix consists of 43 taxa (26 enantiornithines, 10 ornithuromorphs) scored across 252 morphological characters [the provided matrix lists 253], which we analyzed using TNT (Goloboff, Farris & Nixon, 2008a). Early avian evolution is extremely homoplastic (O’Connor, Chiappe & Bell, 2011; Xu, 2018) thus we utilized implied weighting (without implied weights Pygostylia was resolved as a polytomy due to the placement of Mystiornis) (Goloboff et al., 2008b); we explored k values from one to 25 (see Supplemental Information) and found that the tree stabilized at k values higher than 12. In the presented analysis we conducted a heuristic search using tree-bisection reconnection retaining the single shortest tree from every 1,000 replications with a k-value of 13. This produced six most parsimonious trees with a score of 25.1. These trees differed only in the relative placement of five enantiornithines closely related to the Avisauridae, forming a polytomy with this clade in the strict consensus tree (Consistency Index = 0.453; Retention Index = 0.650; Fig. 18).
I've seen much worse CI and RI values in the paleophylogenetic literature (some of them are plotted in this post). For a phylogenetic inference, homoplasy equals internally incompatible signals: many characters show different, partly or fully conflicting, taxon bipartitions; or, in other words, they prefer different trees. The signal in the matrix is thus not tree-like; it doesn't fit a single tree. That's why we have to choose one using TNT's iterated reweighting procedures. (Note: an alternative "phenetic" Neighbor-joining tree has a computation time < 1s, and produces the same tree for the Ornithuromorpha and the root-proximal, 'basal' part of the tree, except that Jeholornis is moved two nodes up; but it shuffles a lot in the Longirostravis–Avisauridae clade.)

Another point is that the more homoplasy we have, the higher must have been the rate of change (here: visible anatomical mutation). The higher the rate of change, the higher the statistical inconsistency of parsimony.

In short, paleontologists (Atterholt et al. just follow the standard in paleophylogenetic publications) use data with tree-unlike signal to infer trees (see also David's last post on illogicality in phylogenetics) under a possibly invalid optimality criterion, which are then used to downweight characters (eliminate noise due to homoplasy) to infer less noisy, "better" trees.

The basic signal

We can't change the data, but we can explore and show its signal. And the basic signal from the unfiltered matrix is best visualized using a Neighbor-net splits graph.

Neighbor-net based on mean pairwise taxon distances. Thick edges correspond to branches in the published tree.

Some differentiation patterns that explain the clades in the tree can be traced, but it becomes difficult in the group that is of most interest: the (inferred) clade(s) comprising the newly described fossil. In the Neighbor-net this is placed close to another member of the Avisauridae, but not all. The matrix is not optimal for the task at hand.

The data properties

The matrix is a multistate matrix with up to six states in the definition line (although only five are used, as state "5" is not present). The taxa have variable gappiness (i.e. the proportion of completely undetermined cells), between 2% (extant birds: Anas and Gallus) and 94% (Intiornis, an Avisauridae); the median is 56%, and the average close to it (54%). The "hypothetically" placed fossil Mirarce eatoni (in the matrix it is under its old designation: "Kaiparowits") lacks a bit more of the scored characters (61%). That may strike one as a lot, but note that the matrix has 253 characters! However, we may well ask: if I want to place a fossil for which I can score 99 characters, why bother to include another ~150 that tell me nothing about its affinity? (Note: paleobotanists struggle hard even to get such numbers, we usually have at best 50 characters.)
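
The gappiness figures are easy to compute for any matrix. Here is a minimal Python sketch, assuming the matrix is held as one string of character states per taxon with "?" marking undetermined cells. The taxon labels and scores below are invented placeholders, not rows from the actual matrix.

```python
from statistics import mean, median

MISSING = "?"


def gappiness(row):
    """Proportion of completely undetermined cells in one taxon's row."""
    return row.count(MISSING) / len(row)


# Made-up ten-character matrix (placeholder taxa, not real scores).
matrix = {
    "well_covered": "0110210101",
    "half_covered": "01?02???1?",
    "fragment":     "?????1????",
}

per_taxon = {taxon: gappiness(row) for taxon, row in matrix.items()}
print(per_taxon)
print("median:", median(per_taxon.values()))
print("mean:", mean(per_taxon.values()))
```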

Its closest putative relatives, the Avisaurus s.l., lack 90% of the characters; leaving us with max. 25 characters supporting the relevant clade (assuming that the 10% are all found in Mirarce as well). Coverage is not much better in the next-closest relatives (phylogenetically speaking).

Data coverage in the phylogenetic neighborhood of Mirarce eatoni

The missing data percentage may have misled the Neighbor-net a bit, because we will have fed it with unrepresentative or highly ambiguous pairwise distances. In the network, the focus fossil comes close to Neuquenornis, the only other Avisauridae with some data coverage. Looking at the heat map below, we see that missing data is indeed a problem in this matrix: we have zero distances between several pairs that show different distances to the better-covered taxa.

The distance matrix drawn as a heat map: green = similar, red = dissimilar (values range between 0 and 0.8). Red arrows: taxa with too many (and ambiguous) zero pairwise distances.
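
How such misleading zero distances arise can be shown with a toy example. The Python sketch below assumes that the mean pairwise distance is the proportion of differing states among the characters scored in both taxa (one common way to handle missing data, and an assumption here). All rows are invented.

```python
def pairwise_distance(row_a, row_b, missing="?"):
    """Mean character difference over the characters scored in both taxa.

    Returns (distance, number of comparable characters); the distance is
    None when the two taxa share no scored character at all.
    """
    shared = [(a, b) for a, b in zip(row_a, row_b)
              if a != missing and b != missing]
    if not shared:
        return None, 0
    differences = sum(a != b for a, b in shared)
    return differences / len(shared), len(shared)


# Made-up rows: one well-covered taxon and two fragments.
well = "0110210101"
frag_x = "1??0??????"
frag_y = "?1?0??????"

print(pairwise_distance(frag_x, frag_y))   # the two fragments
print(pairwise_distance(frag_x, well))     # fragment x vs. the well-covered taxon
print(pairwise_distance(frag_y, well))     # fragment y vs. the well-covered taxon
```

The two fragments come out as identical, but that zero rests on a single shared character, and their distances to the well-covered taxon differ. Reporting the number of comparable characters next to each distance flags such cases.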

The closest relative of Mirarce is, indeed, Avisaurus/Gettya gloriae, but the latter has zero distances to various other poorly covered taxa from the phylogenetic neighborhood, in contrast to the much better-covered Mirarce. Neighbor-nets are very good at getting the obvious out of a morphological matrix, but they don't perform miracles. However, why should we include poorly known taxa at all during phylogenetic inference? Wouldn't it be better to infer a backbone tree (or network showing the alternative hypotheses) based on a less gappy matrix, and then find the optimal position of the poorly known taxa within that tree (network)?

Estimating the actual character support

Some characters cover just 10–20% of the taxa, whereas others are scored for most of them — more than half of the characters are missing for more than half of the taxa. Using TNT's iterative weight-to-fit option means that we infer a tree, ideally one fitting the well-covered data (taxon- and character-wise), and then downweight all conflicting characters elsewhere to fit this tree. We then end up with a tree where we have no idea about actual character support. Since the matrix is a Swiss cheese, we only can re-affirm the first-inferred tree.
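
For readers unfamiliar with implied weighting, here is a minimal Python sketch of the weighting function as it is commonly described: a character that needs h extra steps on a tree gets the fit k / (k + h), and the tree score is the summed distortion h / (k + h). The step counts are made-up, and this illustrates only the formula, not TNT's search procedure.

```python
def implied_weight(extra_steps, k):
    """Fit of one character: 1.0 without homoplasy, lower with more extra steps."""
    return k / (k + extra_steps)


def total_distortion(extra_steps_per_character, k):
    """Tree score under implied weighting (lower is better)."""
    return sum(h / (k + h) for h in extra_steps_per_character)


# Made-up extra-step counts for five characters on some tree.
extra_steps = [0, 0, 1, 3, 13]

for k in (1, 13, 25):
    weights = [round(implied_weight(h, k), 2) for h in extra_steps]
    print(k, weights, round(total_distortion(extra_steps, k), 2))
```

The smaller k is, the more strongly homoplastic characters are downweighted, which is why the choice of k matters for the resulting tree.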

Let's check the raw character support, using non-parametric bootstrapping and maximum likelihood as the optimality criterion (corrected for ascertainment bias, as implemented in RAxML).

ML-BS Consensus Network (using Lewis' 2-parameter Mk+G model). Edge lengths are proportional to the BS support values of taxon bipartitions (= phylogenetic splits, internodes, branches in phylogenetic trees). Only splits are shown that occurred in at least 10% of 900 BS pseudoreplicates (number of necessary BS replicates determined by the Extended Majority Rule Bootstrap criterion), trivial splits collapsed. Thick edges correspond with branches in Atterholt et al.'s iterative parsimony tree; coloring as before.

The ML bootstrap Consensus Network bears not a few similarities to the distance-based Neighbor-net. The characters do not support the Avisauridae subtree, as depicted in the published TNT tree, but there are faint signals associating some of them to each other, despite the missing data. Keep in mind: a BS support of 20 for one alternative and < 10 for all others means (ideally) one fifth of the characters support the split, and the rest have no (coherent) information. Some sister pairs have quite high support (for this kind of data set), and Gettya gloriae is resolved as sister of Mirarce (unambiguously, with a BS support = 67). But, the matrix hardly has the capacity to resolve deeper relationships within the group of interest, the Enantiornithes — the polytomy with the next relatives seen in the tree and the corresponding clade dissolve. This confirms what we saw in the Neighbor-Net (despite missing data distortion).

The matrix and the tree show something that could have been deduced directly from the distance matrix: the poorly known Gettya (Avisaurus) gloriae is (literally) the closest relative of the enigmatic new genus and species Mirarce (morphological distance of 0.08, compared to 0.1–0.64 for all other taxa). But is this overall similarity enough to conclude that Avisaurus, Gettya and Mirarce form a monophyletic group within the Avisauridae?
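For readers who want to see how such a distance behaves with missing data, here is a small Python sketch. It assumes a simple mean character difference computed only over the characters scored in both taxa; whether this matches the exact distance used for the values above is an assumption, and the state strings are invented.

```python
def mean_character_distance(a, b, missing="?-"):
    """Proportion of differing states among characters scored in both taxa.

    Returns None when the two taxa share no scored character.
    """
    shared = [
        (x, y) for x, y in zip(a, b)
        if x not in missing and y not in missing
    ]
    if not shared:
        return None  # no overlap: the distance is undefined
    return sum(x != y for x, y in shared) / len(shared)


# Invented rows: four shared characters, one of them differing
d = mean_character_distance("0101?1", "01?011")
print(d)

# Two taxa without any character scored in both
no_overlap = mean_character_distance("0?", "?1")
print(no_overlap)
```

Note that a small distance based on a handful of shared characters is much less informative than the same value based on a well-covered pair, which is one reason to treat distances involving poorly known taxa with care.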

What the authors (and all paleontologists doing phylogenetics) should have done

(I would have skipped all trees, naturally, but peer reviewers and most readers probably need to see them.)

  • Trimmed the matrix to include only those characters preserved in the fossil of interest, in order to minimize missing data artefacts during inference.
  • Shown the Neighbor-net to visualize the primary signal situation, including and excluding poorly covered taxa. From the Neighbor-net it is already obvious that the fossil is a member of the Enantiornithes, so any subsequent optimization / inference could have focussed on this group alone.
  • Then inferred a backbone tree excluding poorly covered taxa, and shown the resulting phylogram. In case one needs to test the Enantiornithes root (the Neighbor-net gives us two alternatives for the Enantiornithes root: Pengornis + Eopengornis or Protopteryx + Iberomesornis), there is no point in including the poorly covered Enantiornithes or the worst-covered taxa outside this clade.
  • Then optimized the position of the poorly covered taxa in the backbone tree. I recommend using RAxML's evolutionary placement algorithm (EPA) for this, but you can also do this in a parsimony framework if you wish. (EPA can also be used to test outgroup roots: here, one would search for the branch at which all non-Enantiornithes fit best.)
  • Shown the resulting phylogram including all taxa: that is, read the topology into the analysis, and then re-optimize the branch lengths.
  • Shown a Support Consensus Network to illustrate the support for the branches in the preferred tree and their competing alternatives. (There may be one or more, as there are many options to estimate branch support.) How sure can we be about relationships within the Avisauridae and their relationships to other Enantiornithes?
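
The trimming suggested in the first bullet needs no special software. A minimal Python sketch, assuming the matrix is a dictionary of equal-length state strings and that `?` and `-` mark unscored cells; the taxon names, the states, and the function name `trim_to_focal` are hypothetical:

```python
def trim_to_focal(matrix, focal, missing="?-"):
    """Keep only the characters that are scored in the focal taxon.

    Returns the trimmed matrix and the (0-based) indices of kept characters.
    """
    keep = [i for i, state in enumerate(matrix[focal]) if state not in missing]
    trimmed = {
        taxon: "".join(row[i] for i in keep)
        for taxon, row in matrix.items()
    }
    return trimmed, keep


# Invented toy matrix: the fossil preserves characters 0, 2 and 4 only
matrix = {
    "Fossil": "0?1?1",
    "TaxonA": "01101",
    "TaxonB": "1?0?0",
}
trimmed, kept = trim_to_focal(matrix, "Fossil")
print(kept)
print(trimmed)
```

Keeping the list of retained indices makes it easy to report which characters of the original matrix entered the analysis.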



Postscriptum. For those who are curious about what the ML tree would look like, here it is:


I have no idea about birds, but from a methodological point of view this is an equally valid (if not more valid, because unforced) hypothesis for the data set. It also demonstrates the data set's limitations: note the relatively long branches with very low support making up the backbone of the Enantiornithes clade. This is typical for matrices lacking coherent discriminatory signal and/or struggling with internal conflict.