The Genealogical World of Phylogenetic Networks: More on networks for placing fossils, such as Eocene lantern fruits

Monday, September 3, 2018

More on networks for placing fossils, such as Eocene lantern fruits

A colleague pointed me to a paper published last year in Science about a spectacular fossil find: an Eocene Physalis fruit with a preserved lampion. In a recent post, I advocated Neighbor-nets as nice and quick tools to place fossils phylogenetically. In this post, I will exemplify this once more, and argue why a network would have been more informative than the graphs the authors showed.

The study and the data

In their 2017 paper, Wilf et al. (Science 355: 71–75) describe a new fossil find, which, by itself, rejects the often-too-young molecular dating estimates for Solanaceae, the potato-tomato family, the "Nightshades". The Nightshades include many well-known plants in addition to potato and tomato (the latter is phylogenetically a subclade of the potatoes): for example, the tobacco genus (Nicotiana) and the genus Physalis, which includes several species commercialized as fruits (e.g. P. peruviana, also known as Cape gooseberry or goldenberry) and as ornamental plants (e.g. P. alkekengi, the Chinese Lantern).

Just by looking at the pictures showing the fossil (Wilf et al.'s text-Fig. 1), anyone who has ever eaten a physalis would agree that it was produced by a member of the genus. However, science is not usually about common sense, but about formal reconstructions. Thus, the authors placed their fossil using a total evidence tree approach: they scored 13 morphological traits as binary or ternary characters, concatenated these data with a molecular data set, and inferred trees under maximum parsimony (their text-Fig. 2, below) and maximum likelihood (the tree can be found in the supporting information).

Wilf et al.'s total evidence tree (their Fig. 2), showing (quoted from the legend) "Phylogenetic relationships of Physalis infinemundi sp. nov. and selected Solanaceae species. Strict consensus of 2835 most parsimonious trees of 3510 steps (CI = 0.438, RI = 0.726)."

Based on the graph, one can confirm that the fossil (arrow; pictured, too) is part of the core Physalis, but its position within this core clade is unresolved. The Decay index shown indicates that moving the entire branch would require just one more step. This is not overly reassuring given the total length of the tree (3510 steps) and the size of the underlying data (the matrix used has 7070 characters!).

The molecular data were selected from an earlier study (Särkinen et al., BMC Evol. Biol., 2013), but the total evidence matrix is not provided (see this post on why we want to publish our phylogenetic data). But at least the "...morphological matrix developed in this paper is tabulated in the supplementary materials."

This file includes two sheets: the first shows the "raw scores", including four continuous characters, and the second shows the "character scoring" used for the analysis, where the continuous characters were scored (binned) as ternary and binary characters. The information provided is partly wrong, probably as a result of copy-and-paste errors (this is another reason why it should be obligatory for phylogenetic studies to provide the data as an aligned-FASTA or NEXUS file). A corrected version of the "character scores" sheet based on the "raw scores" sheet is included in the figshare submission for this post.
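The kind of cross-check described above (re-deriving the discrete scores from the raw values and flagging disagreements) can be sketched in a few lines. This is only an illustration: the cutoffs, taxon names and values are hypothetical and are not taken from Wilf et al.'s file, and the function names are my own.

```python
from bisect import bisect_right

def bin_character(value, cutoffs):
    """Turn a continuous measurement into a discrete state.

    One cutoff gives a binary character (states 0/1), two cutoffs a
    ternary one (0/1/2). A missing measurement (None) becomes '?'.
    """
    if value is None:
        return "?"
    return str(bisect_right(cutoffs, value))

def find_mismatches(raw, scored, cutoffs):
    """Return (taxon, recomputed, tabulated) wherever the two sheets disagree."""
    mismatches = []
    for taxon, value in raw.items():
        expected = bin_character(value, cutoffs)
        tabulated = scored.get(taxon, "?")
        if tabulated != expected:
            mismatches.append((taxon, expected, tabulated))
    return mismatches

# Hypothetical example: one continuous character, binned as ternary.
cutoffs = [10.0, 20.0]                       # assumed bin boundaries
raw = {"sp_1": 8.0, "sp_2": 14.5, "sp_3": 31.0, "sp_4": None}
scored = {"sp_1": "0", "sp_2": "2", "sp_3": "2", "sp_4": "?"}

mismatches = find_mismatches(raw, scored, cutoffs)
print(mismatches)
```

In this toy case only sp_2 is flagged, because its raw value falls into the middle bin while the tabulated score says otherwise. That is the pattern a copy-and-paste slip would produce.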

By just filtering this matrix for same-as-in-the-fossil characters, we can identify two extant species that are identical to the fossil in all scored characters: Physalis acutifolia and P. lanceolata. Both are part of the Physalis core clade in Wilf et al.'s total evidence tree, but their position is as unresolved as that of the fossil.
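Such a filter is simple to write down. The sketch below counts, for each taxon, the characters that differ from the fossil among those scored in both, and then keeps the taxa with zero differences. The matrix is a made-up toy example (it does not reproduce the real scores), and treating "?" as "not scored" is an assumption.

```python
MISSING = "?"

def char_difference(a, b):
    """Count characters scored in both taxa whose states differ."""
    return sum(1 for x, y in zip(a, b) if MISSING not in (x, y) and x != y)

def rank_by_difference(matrix, query):
    """Sort all other taxa by their absolute character difference to `query`."""
    ref = matrix[query]
    diffs = {t: char_difference(ref, s) for t, s in matrix.items() if t != query}
    return sorted(diffs.items(), key=lambda kv: (kv[1], kv[0]))

# Toy matrix with invented states; one string of character states per taxon.
toy = {
    "fossil":  "01?10",
    "taxon_A": "01010",
    "taxon_B": "01110",
    "taxon_C": "11000",
    "taxon_D": "00?11",
}

ranking = rank_by_difference(toy, "fossil")
identical = [taxon for taxon, diff in ranking if diff == 0]
print(ranking)
print(identical)
```

Note that taxon_A and taxon_B differ from each other in a character the fossil lacks, yet both come out as "identical" to the fossil. Identity in the scored characters therefore does not make the extant species identical to one another.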

Enlarged part of the above figure, showing the absolute character difference (0 to 5 out of 13 covered characters) between the fossil and other members of the Physalis core clade.

The reason for this becomes clear in the total-evidence maximum-likelihood tree. Here, the fossil is resolved as the sister of P. lanceolata (maximum likelihood bootstrap support: ML-BS < 70; the actual value would have been nice), to which it is identical, both being deeply nested in the Physalis core clade. However, the other morphologically identical species, P. acutifolia, is placed in the first diverging subclade of the core clade (ML-BS < 70, along with most of the backbone of this clade). The "low" support may have two possible reasons:

the fossil, with 99.8% missing data, acts as a 'rogue' taxon; or
the genetic data provide little discriminating signal, or ambiguous signals.
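The 99.8% figure follows directly from the numbers given above, assuming the fossil is scored for all 13 morphological characters and for none of the remaining characters in the 7070-character matrix:

```python
total_characters = 7070   # size of the total evidence matrix, as stated above
scored_in_fossil = 13     # assumption: all 13 morphological characters, nothing else

missing_fraction = 1 - scored_in_fossil / total_characters
print(f"{missing_fraction:.1%}")  # prints 99.8%
```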

Solanaceae genera can be tricky, and the gene sample lacks highly divergent sequence regions. Since the molecular data are not documented, I can't assess how significant this separation is, but it appears to be supported by at least some mutations: the tree-wise distance is about 0.04 expected substitutions, and the two species that are morphologically indistinct (regarding the scored characters) are genetically distinct (to some degree).

Extract from Wilf et al.'s Fig. S1, showing the Physalinae subtree with the core Physalis clade and the deeply nested fossil P. infinemundi (in bold font). Support is only shown for branches with ML-BS ≥ 70.

Trees may fail to show the obvious, but networks won't

Just by using a Neighbour-net to visualize the signal in the morphological partition, we can directly argue that the fossil is likely to be part of the core Physalis. Being of Eocene age, it thus rejects the much-too-young age estimates in, e.g., the dated tree by Särkinen et al. (the reference for the molecular data used by Wilf et al.).

Neighbour-net splits graph based on the morphological data partition included in Wilf et al.'s "supermatrix".

In contrast to the little information that comes along with the tree shown above (soft-ish polytomy, weak Decay index, potentially decreased ML-BS support), the splits graph highlights the ambiguity (incompatibility) of the morphological signal. The graph shows little tree-likeness, and members of the same (sub)tribe show little coherence (C = Capsiceae, H = Hyoscyameae, J = Juanulloeae, S = Solaneae, W = Withaninae; all represented by de-facto molecular clades with ML-BS ≥ 77 in Wilf et al.'s supplement Fig. S1). There is one notable exception: members of the core Physalis (red dots) are sufficiently distinct from anything else, forming a highly supported clade (ML-BS = 98 in Wilf et al.'s Fig. S1).
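What "incompatibility" means here can be illustrated with the classic pairwise test for binary characters: two characters cannot both fit the same tree without homoplasy if all four state combinations (00, 01, 10, 11) occur among the taxa. The sketch below is not how Neighbor-net works internally, and it ignores the ternary characters. It only shows, on invented toy columns, how conflicting characters can be counted.

```python
from itertools import combinations

def compatible(col_a, col_b, missing="?"):
    """Pairwise compatibility of two binary characters.

    Taxa with a missing state in either character are skipped. The pair is
    incompatible only if all four state combinations are observed.
    """
    seen = {(a, b) for a, b in zip(col_a, col_b) if missing not in (a, b)}
    return len(seen) < 4

def incompatible_pairs(columns):
    """Indices of all character pairs that conflict with each other."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(columns), 2)
        if not compatible(a, b)
    ]

# Toy data: each string is one binary character scored across five taxa.
columns = ["00110", "00111", "01010"]
conflicts = incompatible_pairs(columns)
print(conflicts)
```

In a splits graph, such conflicting signals show up as box-like structures instead of single tree-like edges. The more pairs conflict, the more web-like the graph.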

The network also shows that the fossil is identical to both P. acutifolia and P. lanceolata.

Neighbour-net after reducing the taxon set to the phylogenetic neighbourhood of the fossil specimen. Filled fields indicate sister/sibling species supported by ML-BS ≥ 80 in Wilf et al.'s "total evidence" ML tree.

By focusing on the phylogenetic neighborhood of the fossil, we end up with a spider-web-like graph. This means that the morphological partition has little consistent signal for recognizing potential relatives: the same features are likely to have evolved in parallel (all members of this neighborhood are likely to share a common origin). After all, 50 million years (and more) is a long time for a lineage to end up with a similar fruit (see also the maximum-parsimony character reconstructions on the parsimony strict-consensus tree provided in the supplement to Wilf et al.'s study).

Data and graphs

The Splits-NEXUS files for the Neighbor-nets and NEXUS-versions of Wilf et al.'s Data S1, as well as additional graphics (network with labeled bubbles) can be found on figshare.

4 comments:

Unknown, September 3, 2018 at 8:33 PM
I suspect that close examination of the Physalis cladistic character-set would find that many characters are also used to define the taxon in the classification; i.e. there is "Classification bias" (Bitner & Cohen, 2015).

Das Grimm, September 4, 2018 at 9:10 AM
This is always a possibility. Most morphological datasets used to place a fossil "phylogenetically" are heavily informed by the initial taxonomic assessment of the researchers.
To my knowledge, not a single fossil has ever been placed naively, i.e. by just recording its features and then adding it to a broadly sampled matrix to identify its systematic affinity. Often, phylogenetics in this context serves as a (more or less biased) proof of the obvious.

But classification bias or not, as the network shows, the recorded characters allow differentiating between the target clade and the rest. Only by investigating the signal in the data can we give credit to the placement in the "total evidence" trees.

This is the main lesson to learn: any morphological character matrix has (often non-avoidable) biases, provides non-treelike signal, hence, its signal needs to be explored (with networks) rather than be used to infer simple trees.

Unknown, September 7, 2018 at 11:56 PM
"classification bias or not, as the network shows" seems to me to propagate a logical error. The characters used in the classification surrender a degree of freedom for every use. Thus, they may not be used again to construct the network. This error (if I have correctly understood this foundational principle of mathematical logic) contributes to the long-term misunderstanding and wrongful presentation of cladistics, where classification bias, and its associated logical circularity, have never been recognised to exist and will probably continue to be ignored despite their recent clear presentations.