Monday, January 14, 2019

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the
GTR + Gamma model with the morphological character set mapped onto the tree.
Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.

We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly derivating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.

The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another pattern (and quite typical). Large stretches with very little variation, hence, devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace resulting in PP < 1.0).

For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hickup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Limited to a single taxon, it has no topological effect (we at least need four to make a call), it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions, eg. in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix' characters, are covered for four of the taxa (the sister taxon used to root, Vampyrotheutis, and three of the core Octopodia, hence, can only support a single split within this group and be used to test for its alternatives.

What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS  support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.")

Even though I run an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartioned-BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuzz about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal, the data is indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative for lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support; the shorter the branch, the lower the support. With a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with compartively low support (ML-BS = 63; collapses when using the authors' partitioning scheme that treats each gene region as individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are. Including the 12S rDNA-based Octopodida core clade, and connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partion for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed, alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poor supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. An nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneldone and Bentheledone. The remaining branches have ambiguous support.

Here, the partitioning scheme is a game changer. Unpartioned, the favored alternative is a Adelieledone-Pareledone-Megaeledone (APM) grade "basal" to Graneldone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favors an APM clade sister to the latter two (quite a difference, since we often equal clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know, if the remaining 70%, 50%, or 30% of bootstrap replicates show random or the same alternative(s). When a tree has ambiguously support branches, BS Consensus networks should be obligatory.

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relative high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy to generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although they, to my experience, always are).