Monday, September 3, 2018

More on networks for placing fossils, such as Eocene lantern fruits

A colleague pointed me to a paper published last year in Science about a spectacular fossil find: an Eocene Physalis fruit with a preserved lampion. In a recent post, I advocated Neighbor-nets as nice and quick tools to place fossils phylogenetically. In this post, I will exemplify this once more, and argue why this would have been even more informative than what the authors showed as graphs.

The study and the data

In their 2017 paper, Wilf et al. (Science 355: 71–75) describe a new fossil find, which, by itself, rejects the often-too-young molecular dating estimates for Solanaceae, the potato-tomato family, the "Nightshades". The Nightshades include many well-known plants. In addition to potato and tomato (the latter is phylogenetically a subclade of the potatoes), we have e.g. the tobacco genus (Nicotiana), and also the genus Physalis, which includes several species commercialized as fruits (e.g. P. peruviana, also known as Cape gooseberry or goldenberry) and ornamental plants (e.g. P. alkekengi, the Chinese Lantern).

Just by looking at the pictures showing the fossil (Wilf et al.'s text-Fig. 1), anyone who has ever eaten a physalis would agree that it was produced by a member of the genus. However, science is not usually about common sense, but about formal reconstructions. Thus, the authors placed their fossil using a total evidence tree approach: they scored 13 morphological traits as binary or ternary characters, concatenated these data with a molecular data set, and inferred trees under maximum parsimony (their text-Fig. 2, below) and maximum likelihood (the tree can be found in the supporting information).

Wilf et al.'s total evidence tree showing the (quoted from the legend)
"Phylogenetic relationships of Physalis infinemundi sp. nov. and selected Solanaceae species" (their Fig. 2). "Strict consensus of 2835 most parsimonious trees of 3510 steps (CI = 0.438, RI = 0.726)."

Based on the graph, one can confirm that the fossil (arrow; pictured, too) is part of the core Physalis, but its position within this core clade is unresolved. The Decay index shown indicates that moving the entire branch would require just one step more. This is not overly reassuring given the total length of the tree (3510 steps) and the underlying data (the matrix used has 7070 characters!).

The molecular data were selected from an earlier study (Särkinen et al., BMC Evol. Biol., 2013), but the total evidence matrix is not provided (see this post on why we want to publish our phylogenetic data). But at least the "...morphological matrix developed in this paper is tabulated in the supplementary materials."

This file includes two sheets: the first shows the "raw scores", including four continuous characters, and the second shows the "character scoring" used for the analysis, where the continuous characters were scored (binned) as ternary and binary characters. The information provided is partly wrong, likely as the result of copy-and-paste errors (this is another reason why it should be obligatory for phylogenetic studies to provide the data as aligned-FASTA or NEXUS file). A corrected version of the "character scores" sheet based on the "raw scores" sheet is included in the figshare submission for this post.

By just filtering this matrix for same-as-in-the-fossil characters, we can identify two extant species that are identical to the fossil in all scored characters: Physalis acutifolia and P. lanceolata. Both are part of the Physalis core clade in Wilf et al.'s total evidence tree, but their position is as unresolved as that of the fossil.

Enlarged part of the above figure, showing the absolute character difference (0 to 5 out of 13 covered characters) between the fossil and other members of the Physalis core clade.
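
The filtering step can be sketched in a few lines of Python. This is only a minimal sketch, assuming the character scoring is held as a dictionary of taxon names and state strings, with "?" for missing scores; the taxon names and scores in `toy_matrix` are hypothetical and are not the values from Wilf et al.'s Data S1.

```python
# Hypothetical toy scores, NOT the values from Wilf et al.'s Data S1.
# Each string holds one state per character; "?" marks a missing score.
toy_matrix = {
    "fossil":    "0110?1",
    "species_A": "011011",
    "species_B": "010011",
    "species_C": "1100?0",
}


def character_differences(matrix, reference, missing="?"):
    """Count, per taxon, the characters that differ from the reference taxon.

    Characters scored as missing in either taxon are skipped. Each value is
    a tuple (number of differences, number of characters compared).
    """
    ref = matrix[reference]
    result = {}
    for taxon, states in matrix.items():
        if taxon == reference:
            continue
        if len(states) != len(ref):
            raise ValueError(f"{taxon}: expected {len(ref)} characters")
        compared = [(a, b) for a, b in zip(ref, states)
                    if a != missing and b != missing]
        differing = sum(a != b for a, b in compared)
        result[taxon] = (differing, len(compared))
    return result


diffs = character_differences(toy_matrix, "fossil")
# Taxa that cannot be told apart from the fossil on the covered characters
identical = sorted(taxon for taxon, (d, _) in diffs.items() if d == 0)
```

Reporting the number of compared characters next to the difference count matters for a fossil, because zero differences over a handful of covered characters is much weaker evidence than zero differences over a complete row.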

The reason for this becomes clear in the total-evidence maximum-likelihood tree. Here, the fossil is resolved as the sister of P. lanceolata (maximum likelihood bootstrap support: ML-BS < 70, the actual value would have been nice), to which it is identical, both being deeply nested in the Physalis core clade. However, the other identical species (morphologically), P. acutifolia, is placed in the first diverging subclade of the core clade (ML-BS < 70, along with most of the backbone of this clade). The "low" support may have two possible reasons:
  • the fossil, with 99.8% missing data, acts as a 'rogue' taxon; or
  • the genetic data provide little discriminating signal, or ambiguous signals.
Solanaceae genera can be tricky, and the gene sample lacks highly divergent sequence regions. Since the molecular data are not documented, I can't assess how significant this separation is, but it appears to be supported by at least some mutations: the tree-wise distance is about 0.04 expected substitutions; and the two morphologically indistinct (regarding the scored characters) species are genetically distinct (to some degree).

Extract from Wilf et al.'s Fig. S1, showing the Physalinae subtree with the core Physalis clade and the deeply nested fossil P. infinemundi (in bold font). Support is only shown for branches with a ML-BS support ≥70.

Trees may fail to show the obvious, but networks won't

Just by using the Neighbour-net to visualize the signal in the morphological partition, we can directly argue that the fossil is likely to be part of the core Physalis. Thus, being of Eocene age, the fossil rejects the much-too-young age estimates in e.g. the dated tree by Särkinen et al. (the reference for the molecular data used by Wilf et al.).

Neighbour-net splits graph based on the morphological data partition included in Wilf et al.'s "supermatrix".
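
A Neighbour-net is computed from a pairwise distance matrix. The sketch below shows one plausible way to derive such a matrix from scored characters: the mean character difference, using only characters scored in both taxa. That this particular distance underlies the graph above is an assumption, and the taxa and scores in `toy_scores` are hypothetical. The resulting matrix would still have to be exported to a program that implements Neighbour-net.

```python
# Hypothetical toy scores; "?" marks a missing score.
toy_scores = {
    "A": "0011",
    "B": "0111",
    "C": "1?00",
}


def mean_character_distance(a, b, missing="?"):
    """Proportion of differing characters among those scored in both taxa."""
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        return None  # no shared characters: the distance is undefined
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(scores):
    """Return the sorted taxon labels and the full symmetric distance matrix."""
    taxa = sorted(scores)
    rows = [[mean_character_distance(scores[s], scores[t]) for t in taxa]
            for s in taxa]
    return taxa, rows


taxa, dist = distance_matrix(toy_scores)
```

Pairwise deletion of missing characters keeps a poorly covered taxon such as a fossil in the matrix, at the price that its distances rest on fewer characters than those between fully scored taxa.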

In contrast to the little information that comes along with the tree shown above (soft-ish polytomy, weak Decay index, potentially decreased ML-BS support), the splits graph highlights the ambiguity (incompatibility) of the morphological signal. The graph shows little tree-likeness, and members of the same (sub)tribe show little coherence (C = Capsiceae, H = Hyoscyameae, J = Juanulloeae, S = Solaneae; W = Withaninae; all represented by de-facto molecular clades with ML-BS ≥ 77 in Wilf et al.'s supplement Fig. S1). There is one notable exception: members of the core Physalis (red dots) are sufficiently distinct from anything else, forming a highly supported clade (ML-BS = 98 in Wilf et al.'s Fig. S1).

The network also shows that the fossil is identical to both P. acutifolia and P. lanceolata.

Neighbour-net after reducing the taxon set to the phylogenetic neighbourhood of the fossil specimen. Filled fields indicate sister/sibling species supported by a ML-BS >= 80 in Wilf et al.'s "total evidence" ML tree.

By focusing on the phylogenetic neighborhood of the fossil, we end up with a spider-web-like graph. This means that the morphological partition has little consistent signal for recognizing potential relatives: the same features are likely to have evolved in parallel (all members of this neighborhood are likely to share a common origin). After all, 50 million years (and more) is a long time for a lineage to end up with a similar fruit (see also the maximum-parsimony character reconstructions on the parsimony strict-consensus tree provided in the supplement to Wilf et al.'s study).

Data and graphs

The Splits-NEXUS files for the Neighbor-nets and NEXUS-versions of Wilf et al.'s Data S1, as well as additional graphics (network with labeled bubbles) can be found on figshare.


  1. I suspect that close examination of the Physalis cladistic character-set would find that many characters are also used to define the taxon in the classification; i.e. there is "Classification bias" (Bitner & Cohen, 2015).

  2. This is always a possibility. Most morphological datasets used to place a fossil "phylogenetically" are heavily informed by the initial taxonomic assessment of the researchers.
    To my knowledge, not a single fossil has ever been placed naively, i.e. by just recording its features and then adding it to a much-including matrix to identify its systematic affinity. Often, phylogenetics in this context serves as a (more or less biased) proof of the obvious.

    But classification bias or not, as the network shows, the recorded characters allow differentiating between the target clade and the rest. Only by investigating the signal in the data can we give credit to the placement in the "total evidence" trees.

    This is the main lesson to learn: any morphological character matrix has (often non-avoidable) biases, provides non-treelike signal, hence, its signal needs to be explored (with networks) rather than be used to infer simple trees.

  3. "classification bias or not, as the network shows" seems to me to propagate a logical error. The characters used in the classification surrender a degree of freedom for every use. Thus, they may not be used again to construct the network. This error (if I have correctly understood this foundational principle of mathematical logic) contributes to the long-term misunderstanding and wrongful presentation of cladistics, where classification bias, and its associated logical circularity, have never been recognised to exist and will probably continue to be ignored despite their recent clear presentations.

    1. One needs to distinguish here between inference and classification, a conceptual framework that can be based or not on an inference (see e.g. two of our October 2017 posts: Let's distinguish between Hennig and Cladistics and Clades, cladograms, cladistics, and why networks are inevitable).

      The originally published tree(s) as well as the network try to capture signals in the used matrix, illustrate the scored characters' capacity to infer a useful tree (which they don't) or network, showing the obvious (which they do).

      Whether the matrix is biased or not is not a problem that any mere inference can solve (although inferences may point to it, example below). Hence, I consider it irrelevant for my post.

      In general, I totally agree: the logical circularity of the cladistic interpretation of inferences is a problem. The most revealing case I came across was the (too) nice Cycadales matrix by Stevenson (1990, Mem. NY Bot. Garden, 57:8–55; I used it as the basis for my Diploma thesis [in German]). Essentially, it scores the author's opinion of what the tree should look like and hence resulted in a single most-parsimonious tree, which is then used to put up a cladistic classification. [My guess is that the author put up a Hennigian classification by ad-hoc identifying potential synapomorphies, and then produced a matrix to get a parsimony tree matching it.]

      An inference-based indication that this matrix may be biased is its treelikeness: the matrix' Delta Value is 0.214, which is conspicuously low for a (plant) morpho-matrix (usually 0.35–0.45; see also this post). The CI and RI of the MPT are also conspicuously high (0.71 and 0.78), as is the overall branch support from bootstrapping or Bayesian posterior probabilities. Here, too, the network is quite revealing (I placed the according files together with some overview tables on figshare).
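
For readers unfamiliar with the measure, here is a minimal Python sketch of a quartet-based delta score, assuming that the Delta Value quoted above is the mean of the quartet scores used in delta plots (Holland et al. 2002). For each set of four taxa, the three possible pairings of distances are summed; the score is the gap between the two largest sums divided by the gap between the largest and the smallest. A perfectly tree-like (additive) quartet scores 0, and values approach 1 as the signal becomes less tree-like. The two toy distance matrices are hypothetical.

```python
from itertools import combinations


def mean_delta(dist):
    """Mean quartet delta score of a symmetric distance matrix (list of lists)."""
    deltas = []
    for x, y, u, v in combinations(range(len(dist)), 4):
        # The three ways to split four taxa into two pairs
        sums = sorted(
            (dist[x][y] + dist[u][v],
             dist[x][u] + dist[y][v],
             dist[x][v] + dist[y][u]),
            reverse=True,
        )
        spread = sums[0] - sums[2]
        # All three sums equal (star-like quartet): score it as 0
        deltas.append(0.0 if spread == 0 else (sums[0] - sums[1]) / spread)
    if not deltas:
        raise ValueError("need at least four taxa")
    return sum(deltas) / len(deltas)


# Hypothetical additive distances from the tree ((a,b),(c,d)), all branches = 1
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Same taxa, but a-d and b-c shortened: the quartet is no longer additive
conflicting = [
    [0,   2,   3,   2.5],
    [2,   0,   2.5, 3],
    [3,   2.5, 0,   2],
    [2.5, 3,   2,   0],
]

delta_tree = mean_delta(tree_like)
delta_conflict = mean_delta(conflicting)
```

The number of quartets grows with the fourth power of the number of taxa, so this brute-force loop is only meant for matrices of the size discussed here.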

      Unfortunately, already the first genetic data refused to fall in line (I haven't checked recent literature, but I think they still struggle to reconcile).

      This example is quite unsettling for cladistic classification based on morphologies as increasingly done in palaeontology, because the matrix does contain some likely synapomorphies, in the sense of uniquely derived traits shared by all members of well-supported clades in later molecular trees. However, the ones that supported the deeper branches in the original single most-parsimonious tree are not.
      For entirely extinct groups, we don't have genetic data, which makes it even more important to fully explore the signal in the assembled matrix instead of forcing it into a tree (e.g. by iterative post-inference down-weighting of characters), especially given the inevitable (to a usually unknown degree) classification bias.