Monday, December 31, 2018

Patterns, processes, abduction, and consilience


In a recent blog post, David emphasized how important it is to distinguish patterns from processes in evolutionary biology, with phylogenetic analysis concentrating on the description of patterns (and not on the direct investigation of processes). David's major point is that we need to be careful not to forget about the logical limitations of our approaches:
In the world of logic, propositions cannot be converted; and yet converting propositions is exactly what is done by all descriptive data analyses.
As David correctly points out, in phylogenetic analysis, we tend to observe a pattern (some similarity between different species or languages, for example), and use this pattern to conclude that a specific process has happened (eg. the languages are so similar that we think they are identical).

Given that this problem is also important in historical linguistics, I want to share some thoughts from a linguistic perspective. Most of these were elaborated much earlier, in my PhD dissertation; and if you have read the original chapter (List 2014: 51-57), what I write below may seem repetitive. I have also alluded to these ideas in a couple of previous posts: What we know, what we know we can know, and what we know we cannot know; and Killer arguments and the nature of proof in historical sciences.

However, it is worthwhile to elaborate on these thoughts here, as David's comments are extremely interesting for historical sciences in general, and I think they deserve a more proper discussion across different disciplines.

Ontological fact and epistemological reality

The basic pattern/process problem may be even more complex than it is in evolutionary biology. In quite a few branches of science, most prominently in the historical and social sciences, even the object of investigation is not directly accessible to the researcher. All researchers can do is to try to infer the research object with the help of tests. In historiography we infer the res gestae by comparing direct and indirect (usually written) sources (Schmitter 1982: 55f). In psychology, attributes of people, such as "intelligence" cannot (yet) be directly measured but have to be inferred by measuring how they are "reflected in test performance" (Cronbach and Meehl 1955: 178).

The same holds for ancestors in historical linguistics and evolutionary biology. All we can do in order to examine whether some languages or species share a specific kind of ancestry is compare them systematically, trying to identify patterns that provide evidence for close relationship. Given that we lack direct evidence of their existence, the ancestral languages or species we infer through comparison cannot be treated as ontological facts but only as epistemological realities (Kormišin 1988: 92). We address what psychologists call the construct, that is, the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981/1998), not the "real" object.

Abduction as our sole mode of logical reasoning

In historical linguistics, we can address our research objects only via constructs, and so we have to rely on abduction as our sole mode of logical reasoning (Anttila 1972: 196f). The term abduction was originally coined by Charles Sanders Peirce (1839-1914) and refers, as opposed to induction and deduction, to a "mode of reasoning [...] in which rather than progressing 'logically' [...], one infers an antecedent condition by heuristic guessing from a present case" (Lass 1997: 334). In Peirce's words:
Accepting the conclusion that an explanation is needed when facts contrary to what we should expect emerge, it follows that the explanation must be such a proposition as would lead to the prediction of the observed facts, either as necessary consequences or at least as very probable under the circumstances. A hypothesis then, has to be adopted, which is likely in itself, and renders the facts likely. This step of adopting a hypothesis as being suggested by the facts, is what I call abduction. I reckon it as a form of inference, however problematical the hypothesis may be held. (Peirce 1931/1958: 7.202)
Due to the specific aspects of knowledge we are given in the historical sciences, abduction is the only mode of reasoning that we can employ. According to Peirce (ibid.: 2.623), all three modes of reasoning, induction, deduction, and abduction, "involve the triad of 'rule', 'case' and 'result', but inference moves in different directions" (Lass 1997: 334). While induction infers a rule from a situation in which one is given case (initial situation) and result, deduction infers a result from a situation in which one is given case and a rule. Abduction, however, starts from a result (or a pattern in David's words) and a rule from which we try to infer a case.
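The three directions of inference can be sketched in a few lines of Python. This is only a toy illustration: the rule table, the labels in it, and the function names are invented for the example and are not taken from Peirce or Lass.

```python
# Toy rule table (invented for illustration): case (initial situation) -> result (pattern).
RULES = {
    "common ancestor": "regular sound correspondences",
    "intensive contact": "regular sound correspondences",
    "no shared history": "no regular sound correspondences",
}


def deduce(case, rules):
    """Deduction: given a case and the rules, derive the result."""
    return rules[case]


def induce(observations):
    """Induction: given observed (case, result) pairs, derive a rule table."""
    return dict(observations)


def abduce(result, rules):
    """Abduction: given a result and the rules, list every case that could explain it."""
    return [case for case, outcome in rules.items() if outcome == result]


print(deduce("common ancestor", RULES))
print(induce([("common ancestor", "regular sound correspondences")]))
print(abduce("regular sound correspondences", RULES))
```

In this toy setting, abduction returns more than one candidate case for the same result, which is the situation in which a pattern is compatible with several processes.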

As an example, consider the problem of language evolution. Given two languages with no written records of their previous history, we may observe as a pattern (or result) that they show striking systematic regularities in terms of sound correspondences. Given that we know that, as a rule, languages change their sound systems slowly over time, we can conclude that the initial situation, the case, was that the two languages were once a single language. There is no way we can employ any other mode of reasoning here, as long as we start from individual languages (or species) whose past we want to understand and describe.

We can think of situations in which we try to induce rules in historical linguistics, for example, when dealing with the development from Latin into its descendant languages, where we could ask about the individual processes of sound change (or sound change rules) by which the former was transformed into the latter. We can also think of situations in which we try to deduce results from rules and initial situations, for example when trying to predict unobserved cognate words in languages that have not yet been completely documented by fieldwork (Bodt et al. 2018), by applying rules of sound change (or sound correspondences) to aligned cognate sets (List, forthcoming). But the bulk of our work in historical linguistics (and also in evolutionary biology) works only via abduction: given a result (a pattern or observation in the present), we use our knowledge of rules and processes to infer an ancestral state.


Problems of reasoning based on abduction

According to Schurz (2008), different patterns of abduction can be distinguished, depending on: (1) "the kind of hypothesis which is abduced", (2) "the kind of evidence which the abduction intends to explain", and (3) "the beliefs or cognitive mechanisms which drive the abduction" (ibid.: 205). The kind of abduction that is commonly used in historical linguistics and evolutionary biology belongs to the family of factual abductions, that is, abductions in which "both the evidence to be explained and the abduced hypothesis are singular facts" (ibid.: 206). Since we mainly deal with unobservable facts (ie. constructs), we can further characterize it as historical-fact abduction (ibid.: 209).

The problem of historical-fact abduction is not necessarily that what we are trying to "observe" lies in the past but, more importantly, that (due to the logic underlying abduction as a mode of reasoning) we usually have to infer both the rules and the initial situation from the patterns we observe. Given (as David emphasized) that a pattern can result from different processes, our inference of a specific, individual historical fact requires that we decide on a specific, individual process at the same time. Given that we have to infer both the process and the initial state at the same time, it is not surprising that our inferences about the past are often so vague, and may change so quickly, especially in a situation where we can't just travel back in time to see whether we were right.

In contrast to David, who suggested that we cannot directly investigate processes in the evolutionary sciences, however, I think that we can still investigate them indirectly, be it with the help of experiments, with simulations, or in those cases where we are lucky enough to find history documented in sources. These cases where we can study processes, however, are (and here I agree completely with David) not what we normally do in our research. What we usually do is investigate patterns and try to infer both the process and the original state by which the patterns can be explained.

Cumulative evidence

The problem of abduction in general (or historical-fact abduction in particular) is to make sure that we protect ourselves from giving in to wild speculation. That we are not necessarily good at doing so is reflected in the numerous debates in historical linguistics and evolutionary biology, where scholars at times invoke completely contrary scenarios to explain the past based on identical patterns. In addition, in historical linguistics, people often do not even agree regarding the patterns they believe can be observed in the data.

Earlier, in my dissertation (List 2014), I identified two aspects that I deem important in order to minimize the speculative aspect of our research, claiming that historical-fact abduction should be based on: (1) unique hypotheses, and (2) cumulative evidence. That we need unique hypotheses may seem self-evident at first sight, since it seems to be silly to claim that a certain pattern could be explained by a range of processes. Looking back at this point now, however, I tend to see this less strictly. In fact, I think that I would even prefer it if scholars would list all potential (individual) processes that may seem likely to have yielded a pattern, instead of focusing only on one possibility (and disregarding alternative solutions). Since we are not doctors who need to heal our patients as quickly as possible, we can afford a certain amount of doubt in our research.

Regarding the second point, what I had in mind earlier was that it is best if we have multiple results or different patterns (observed for the same species or languages under investigation) that can all be explained by the same hypothesis. In order to justify the claim that one specific hypothesis explains the evidence better than any alternative hypotheses, we can profit from combining multiple pieces of evidence that might "[fall] short of proof [when taking] each item separately" but become convincing when "all the items [are] combined" (Sturtevant 1920: 11).

Being forced to rely on multiple pieces of evidence (that only when taken together allow one to draw a rather convincing picture of the past) is not a problem unique to historical linguistics and evolutionary biology; it is also a problem of historiography, and even of crime investigations, as was pointed out by Georg von der Gabelentz (1840-1893, cf. Gabelentz 1891: 154), and in later work on semiotics (cf. the papers in Eco and Sebeok 1983). The fact that theories in historical linguistics are built about cases (events, unique objects), as opposed to theories about general laws, may also be the reason for the philological "style" prevalent in historical linguistic studies. I also believe that it is due to the complex nature of the inference process that a systematization of our methods has never been carried out efficiently.

While, for example, we can claim (at least to some degree) that the identification of cognate words in historical linguistics can be systematized (and even to some extent automatized, List et al. 2017), we are at a loss when it comes to systematizing the methods that we use to determine whether words have been borrowed or not. Instead of using one single method, we use a whole range of indicators, and only take borrowings for granted if at least a few of them point in the same direction (List 2018).

Consilience and conclusion

During a talk by James McInerney, held in 2015 in Paris (presenting an overview of his research as reflected in part in McInerney et al. 2014), I realized that the question of "cumulative evidence", which I had thought was discussed only in linguistic circles, belongs to a larger complex of discussions about consilience. This stands in contrast to the Popperian tradition, which claims that knowledge in science can only advance via falsification and the identification of general laws, rather than singular facts (Popper 1945: Chapter 25:II). We find the view that we need to employ cumulative evidence when trying to infer individual facts clearly stated in the work of William Whewell (1794-1866), who originally introduced the term consilience:
The Consilience of Inductions [ie. abductions] takes place when an Induction obtained from one class of facts, coincides with an Induction, obtained from another different class. This Consilience is a test of the truth of the Theory in which it occurs. (Whewell 1847: 469)
As far as I understand from James McInerney's talk, the idea of consilience has long been disregarded in the historical sciences but is now gaining popularity (also thanks to the influential book by Wilson 1998). Although at first I felt delighted when I realized that I was not alone with the problem that I had called "cumulative evidence", based on the old book by Sturtevant (1920), I have to admit that I still do not really know what to do with this information, as it is extremely hard to operationalize the concept of consilience. When confronted with numerous different pieces of evidence, how can we identify the hypothesis that explains them all? How can we compare two opposing hypotheses that each convincingly explain some but not all the data? How can we arrive at an objective weighting of our evidence, based on its importance?

What is clear to me is that a "probabilistic evaluation of causes and elimination of implausible causes plays a central role in factual abductions" (Schurz 2008: 207), since it reduces the search space when seeking an explanation for a given phenomenon (ibid.: 210f). But it is not clear how to arrive at such an evaluation when dealing with patterns in practice. For the time being, thinking about and discussing consilience seems interesting; but until we find ways to operationalize it, it will just remain a nice idea without any concrete value for our scientific endeavors. I dearly hope that this won't be the case.

References
Anttila, R. (1972) An introduction to historical and comparative linguistics. Macmillan: New York.

Bodt, T., N. Hill, and J.-M. List (2018) Prediction experiment for missing words in Kho-Bwa language data. Open Science Framework Preregistrations .evcbp., 7 pp. [Preprint, under review, not peer-reviewed]

Cronbach, L. and P. Meehl (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

Eco, U. and T. Sebeok (1983) The sign of three. Indiana University Press: Bloomington.

Gabelentz, H. (1891) Die Sprachwissenschaft. Ihre Aufgaben, Methoden und bisherigen Ergebnisse. T.O. Weigel: Leipzig.

Kormišin, I. (1988) Prajazyk. Bližnjaja i dal’njaja rekonstrukcija [The proto-language. Narrow and distant reconstruction]. In: Gadžieva, N. (ed.) Sravnitel’no-istoričeskoe izučenie jazykov raznych semej. Teorija lingvističeskoj rekonstrukcii [Theory of linguistic reconstruction]. 3. Nauka: Moscow, pp. 90-105.

Lass, R. (1997) Historical linguistics and language change. Cambridge University Press: Cambridge.

List, J.-M. (2014) Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.

List, J.-M., S. Greenhill, and R. Gray (2017) The potential of automatic word comparison for historical linguistics. PLOS One 12: 1-18.

List, J.-M. (2018) Automatic methods for the investigation of language contact situations. [Preprint, under review, not peer-reviewed]. URL: http://lingulist.de/documents/papers/list-2018-automatic-methods-for-the-investigation-of-language-contact.pdf

List, J.-M. (forthcoming) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45: 1-24.

McInerney, J., M. O’Connell, and D. Pisani (2014) The hybrid nature of the Eukaryota and a consilient view of life on Earth. Nature Reviews Microbiology 12: 449-455.

Peirce, C. (1931/1958) The collected papers of Charles Sanders Peirce. Harvard University Press: Cambridge, Mass.

Popper, K. (1945) The open society and its enemies. Routledge: London.

Schmitter, P. (1982) Untersuchungen zur Historiographie der Linguistik. Struktur — Methodik — theoretische Fundierung. Gunter Narr: Tübingen.

Schurz, G. (2008) Patterns of abduction. Synthese 164: 201-234.

Statt, D. (1998) Concise dictionary of psychology. Routledge: London and New York.

Sturtevant, E. (1920) The pronunciation of Greek and Latin. University of Chicago Press: Chicago.

Whewell, W. (1847) The philosophy of the inductive sciences, founded upon their history. John W. Parker: London.

Wilson, E. (1998) Consilience: the unity of knowledge. Vintage Books: New York.

Monday, December 24, 2018

A jolly, holly network ... of Christmas carols

Today is Christmas Eve. What could be more befitting for our merry blog than to show a network of Christmas carols?

The perfect result would, of course, be a snowflake-like network. Ideally, approaching what is called a "stellar dendrite snowflake".

Stellar dendrites. (Images from a post introducing a snowflake book: The Snowflake.)

The data

I browsed the internet for lyrics of Christmas carols, and then scored their content in the form of a binary matrix.

The "taxon set" includes 45 traditional and (more) modern carols, some of them listed here, along with some others I remembered and sought out (eg. here). A comprehensive list of traditional carols can be found here, but using this would have made the matrix much too large for a post on Christmas Eve. (If you are reading this before Christmas, you might be spending too much time on science.) A rule of thumb is that a matrix should always have at least as many (completely defined) characters as taxa.

The 45 (hohoho!) "characters" include:
  • length (short = 0, long = 1), and tone (merry = 0, darkish = 1);
  • topics it is about / relates to / mentions, e.g. the birth scene, love, and yuletide (the latter included because as a naturalized Swede, I love the jultiden, fancy julkaffe, and much enjoyed most of my julbord);
  • major Christmas figures: Jesus, angels, drummers, elves, Jack Frost, the Grinch, milking maids, monsters, Santa Claus, shepherds, snowmen, the Wise Men from the Orient;
  • mentioned animals, such as reindeer, and plants, including the Christmas tree (traditionally a Tannenbaum, a fir tree), and (very important for Anglo-Saxons who don't kiss each other whenever they meet, like we do in France) the mistletoe;
  • last but not least, Christmas related objects: non-living things such as bells, Christmas food, harps, sleighs, snow, stars, and presents.
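For readers who want to try this at home, a binary matrix of this kind can be turned into pairwise distances, which is the usual input for a Neighbor-net. The sketch below is only an assumption about how one might do this: the three carols, their scores, and the choice of a simple Hamming distance are invented for illustration and are not the data or the settings used for the network in this post.

```python
from itertools import combinations

# Hypothetical toy matrix: rows are carols, columns are binary characters
# (e.g. length, tone, mentions snow, mentions bells, mentions angels).
matrix = {
    "Carol A": [0, 0, 1, 1, 0],
    "Carol B": [0, 1, 1, 1, 0],
    "Carol C": [1, 1, 0, 0, 1],
}


def hamming(a, b):
    """Proportion of characters in which two equally long rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of characters")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# One distance per unordered pair of carols.
distances = {
    (s, t): hamming(matrix[s], matrix[t]) for s, t in combinations(matrix, 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

The resulting distance table can then be exported to whatever network software one prefers.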

The network

The result is not a perfect stellar dendrite, but it is close enough.

A Neighbor-net of Christmas carols. Stippled terminal edges are reduced by factor 2.

It has quite a nice circular sorting of the carols, each related in some way to the ones next to it. The only oddly placed one is "Twelve Days of Christmas", which is a very peculiar one (and my English favorite), along with the rather content-free "We Wish You a Merry Christmas".

Finally, as a Christmas treat, the "great voices of the British public" singing (and reflecting on) my favorite carol: a Creature Comforts Christmas special.


A merry Christmas to everyone!

And please try out some networks during the coming year.

Monday, December 17, 2018

Using phylogenetic networks to prove new results about trees


By Steven Kelk and Simone Linz.

Many readers of this blog will be aware that phylogenetic tree space is often traversed using topological modification moves such as SPR (Subtree Prune and Regraft) and TBR (Tree Bisection and Reconnection). In a nutshell these moves allow us to step from one phylogenetic tree to another, with a view to finding “good” phylogenetic trees. A natural question which arises is this: what is the minimum number of SPR or TBR moves required to turn one tree into another? Being able to compute these values – known as the SPR distance and the TBR distance respectively – gives us some feeling about how long it will take, deterministically or stochastically, to move from one part of tree space to another. Unfortunately, it is NP-hard (i.e. formally intractable) to compute SPR or TBR distances.

Game over? No, because NP-hardness is never a reason to give up! In 2001 Allen and Steel [1] showed the following. Suppose you have two trees, T1 and T2, both on n taxa, and they have TBR distance k. After applying common subtree and common chain reduction rules, you obtain two trees – also with TBR distance k – which have at most 28k taxa. The striking thing here is that n vanishes from the analysis. Hence, if k is small, then so too are the reduced trees, and computing the TBR distance of these two reduced trees becomes less time-consuming. This process is known as kernelization.

This is where the networks come in. In a recent pre-print [2] we have shown that the situation is actually even better than Allen and Steel calculated: the reduced trees will have at most 15k-9 taxa (and in fact, for the subtree and chain reduction rules, this is the best you can do). Perhaps somewhat counter-intuitively, the argument leverages the phylogenetic network literature. While it is quite common to use distances between trees to help construct networks, it is less common to use networks as an analytical instrument to prove new results about trees. We thus consider our new result as a somewhat striking example of the relevance of phylogenetic networks.

The high-level idea is as follows. A recent publication [3] proved that if two trees T1 and T2 have TBR distance k, then a simplest unrooted phylogenetic network that embeds both T1 and T2 will have reticulation number k. The reticulation number of an unrooted phylogenetic network is basically equal to the number of edges you have to delete to turn it into an unrooted phylogenetic tree. Due to this equivalence the problem of computing the TBR distance of two trees can be transformed into that of constructing an unrooted phylogenetic network. See the figure below. The blue tree and the green tree (which have TBR distance 2) can be obtained from the unrooted phylogenetic network on the left (which has reticulation number 2), by cutting at the blue and green breakpoints in the network (respectively).


Why is this important? We know that, after collapsing common pendant subtrees, unrooted phylogenetic networks can be obtained by “decorating” a given backbone topology with taxa. This backbone topology (known as a generator) has roughly 3k edges where chains of taxa can be added to the network, where k is the TBR distance. The critical fact is that, if you add more than 9 taxa to one of these edges, then – however you extract the two embedded trees from the network – you will obtain two trees with a common chain (of length 4 or more, which is the threshold at which common chains are reduced). So, under the assumption that all common chains have been reduced, you can add at most 9 taxa to each edge.

The figure above allows us to obtain some intuition about this. In the network, there are two breakpoints interrupting the sequence of taxa {1,2,3,4,5,6,7,8,9}, one for the green tree and one for the blue tree. The interaction of these two breakpoints induces three (not necessarily maximal) chains that are common to both trees: on taxa sets {1,2,3,4}, {5,6,7} and {8,9} respectively. Under the common chain reduction rule, chain {1,2,3,4} would have already been collapsed (since the reduction rule collapses common chains of length 4 or more) – so in fact a situation in which two breakpoints are placed along the sequence {1,2,3,4,5,6,7,8,9}, as shown in the figure, cannot happen if we assume that the reduction rules had already been applied to exhaustion. You could try to fix this by shifting the blue and green breakpoint one place anticlockwise, to give common chains {1,2,3}, {4,5,6} and {7,8,9}. Sometimes such shifts will work, sometimes they will not. However, if there had been 10 taxa here, rather than 9, you can never avoid creating a common chain of size 4 or more, no matter where you place the two breakpoints. This is simply because, however you partition the set {1,2,3,4,5,6,7,8,9,10} into three contiguous intervals, at least one of the intervals will have size 4 (or more). This common chain should then, by assumption, already have been collapsed.
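The counting argument above can be checked by brute force. The sketch below assumes that two breakpoints split a sequence of taxa into three contiguous (possibly empty) intervals and uses the chain-length threshold of 4 mentioned in the text; the function names are made up for this example.

```python
def chain_sizes(n_taxa):
    """Yield the sizes of the three contiguous intervals for every
    placement of two breakpoints along a sequence of n_taxa taxa."""
    for i in range(n_taxa + 1):
        for j in range(i, n_taxa + 1):
            yield (i, j - i, n_taxa - j)


def can_avoid_long_chain(n_taxa, threshold=4):
    """True if some placement keeps every interval shorter than threshold."""
    return any(max(sizes) < threshold for sizes in chain_sizes(n_taxa))


print(can_avoid_long_chain(9))   # some placement works, e.g. sizes (3, 3, 3)
print(can_avoid_long_chain(10))  # no placement works
```

The placement shown in the figure corresponds to the sizes (4, 3, 2), which is one of the placements for 9 taxa that does create a chain of length 4.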

This limit of 9 taxa already puts an upper bound of (roughly) 3k * 9 = 27k on the number of taxa in the reduced trees. To get to the improved bound of 15k-9, we observe that there is a limit to the number of edges that can carry 9 taxa. (Specifically: only those edges that carry two breakpoints can carry 9 taxa and the number of those edges is limited to k). The remaining edges can be decorated with at most 6, or 3, taxa, and after a bit of counting magic we obtain our result.
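To get a feeling for the difference between the bounds quoted in this post, they can be tabulated for a few small values of k. The function names are invented here, 27k is only the rough figure from the paragraph above, and the range of k is an arbitrary choice for illustration.

```python
def allen_steel_bound(k):
    """Kernel size bound of Allen and Steel [1]: 28k taxa."""
    return 28 * k


def rough_network_bound(k):
    """Rough bound from roughly 3k edges with at most 9 taxa each: 27k."""
    return 3 * k * 9


def tight_bound(k):
    """Improved bound from [2]: 15k - 9 taxa."""
    return 15 * k - 9


for k in range(2, 6):
    print(k, allen_steel_bound(k), rough_network_bound(k), tight_bound(k))
```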

Interestingly, the phylogenetic network perspective does not only help us to obtain this improved upper bound, it also plays a crucial role in helping us to prove that you can’t, in the worst case, obtain a bound better than 15k-9 (even if, as well as collapsing common subtrees and chains, you also try to decompose the trees around common splits).

Looking forward, it is natural to ask: is this a one-off success? Or might it be possible to use a similar “backwards network perspective” in other unexpected places to help improve best-known results about trees?

REFERENCES

[1] Allen, B. L., & Steel, M. (2001). Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics, 5(1), 1-15.

[2] Kelk, S., & Linz, S. (2018). A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees. arXiv preprint arXiv:1811.06892.

[3] Van Iersel, L., Kelk, S., Stamoulis, G., Stougie, L., & Boes, O. (2018). On unrooted and root-uncertain variants of several well-known phylogenetic network problems. Algorithmica, 80(11), 2993-3022.

Monday, December 10, 2018

Please stop using cladograms!


I really like the journal PeerJ, not only because it is open access and publishes the peer review process, but also because it's one of the few that adhere to strict policies when it comes to data documentation. In my last (on my own) 2-piece post (part 1, part 2), I showed what networks could have offered for historical and more recent studies in Cladistics, the journal of the Willi Hennig Society. In this one, I'll illustrate why paleontology in general needs to stop using cladograms.

An example

In a recent article, Atterholt et al. (PeerJ 6: e5910, 2018) describe and discuss "the most complete enantiornithine from North America and a phylogenetic analysis of the Avisauridae". I'm not a paleozoologist and "stuff of legend", but their first 17 figures seem to make a good point about the beauty of the fossil and its relevance; and it is interesting to read about it. This makes me envy paleozoologists a bit — the reason I exchanged chemistry for paleontology was my childhood love for the thunder lizards; I specialized in zoology not botany for graduate biology courses, and I fell in love with social insects, especially bees; but then more general circumstances pushed me into plant phylogenetics.

The result of Atterholt et al.'s phylogenetic analysis is presented in their figure 18, as shown here.

Figure 18 of Atterholt et al. (2018): "A cladogram depicting the hypothetical phylogenetic position of Mirarce eatoni." [the beautiful fossil is highlighted in bold font]
This looks very familiar — graphs like this can be seen in many paleontological studies, not only those in Cladistics. However, this is a phylogeneticist's "nightmare" (but a cladist's "dream").

First, phylogenetic trees, especially those that were weighted post-analysis several times to get a more or less resolved tree, should be depicted as phylograms, that is, trees with branch lengths. Phylogenetic hypotheses are not only about clades, and what is sister to what, but about the amount of (inferred) evolutionary change between the hypothetical ancestors, the internal nodes, and their descendants, the labelled tips. For example, we may want to know how long the root of the clade (Avisauridae, Avisaurus s.l.) comprising the focus taxon is compared to the lengths of the terminal branches within the clade. Prominent roots and short terminals are a good sign for monophyly (inclusive common origin), or at least a fossil well placed, whereas short roots and long terminals are not.

The above tree as phylogram (using PAUP*'s AccTran optimization). The beauty of cladistic classification is that the new specimen could have just been described as another species of Avisaurus (but read the author's discussion).

In this example, we seem to be on the safe side, although one may question the general taxonomic concept for extinct birds. Are the differences enough to erect a new genus for every specimen? This is hard to decide based on this matrix.

Second, a tree without branch support is just a naked line graph, telling us nothing about the quality (strengths and weaknesses) of the backing data. Neontologists are not allowed to publish naked trees. In molecular phylogenetics, we are not uncommonly asked by reviewers to drop all branches (internodes) below an arbitrary threshold: a bootstrap (BS) support value < 70 and posterior probability (PP) < 0.95. In paleontology, it has become widely accepted to not show support values at all. The reason is simple: the branch support is always low, because of data gaps and homoplasy. This is a problem the authors are well aware of:
The modified matrix consists of 43 taxa (26 enantiornithines, 10 ornithuromorphs) scored across 252 morphological characters [the provided matrix lists 253], which we analyzed using TNT (Goloboff, Farris & Nixon, 2008a). Early avian evolution is extremely homoplastic (O’Connor, Chiappe & Bell, 2011; Xu, 2018) thus we utilized implied weighting (without implied weights Pygostylia was resolved as a polytomy due to the placement of Mystiornis) (Goloboff et al., 2008b); we explored k values from one to 25 (see Supplemental Information) and found that the tree stabilized at k values higher than 12. In the presented analysis we conducted a heuristic search using tree-bisection reconnection retaining the single shortest tree from every 1,000 replications with a k-value of 13. This produced six most parsimonious trees with a score of 25.1. These trees differed only in the relative placement of five enantiornithines closely related to the Avisauridae, forming a polytomy with this clade in the strict consensus tree (Consistency Index = 0.453; Retention Index = 0.650; Fig. 18).
I've seen much worse CI and RI values in the paleophylogenetic literature (some of them are plotted in this post). For a phylogenetic inference, homoplasy equals internally incompatible signals: many characters show different, partly or fully conflicting, taxon bipartitions; or, in other words, they prefer different trees. The signal in the matrix is thus not tree-like; it doesn't fit a single tree. That's why we have to choose one using TNT's iterated reweighting procedures. (Note: an alternative "phenetic" Neighbor-joining tree has a computation time < 1s, and produces the same tree for the Ornithuromorpha and the root-proximal, 'basal' part of the tree, except that Jeholornis is moved two nodes up; but it shuffles a lot in the Longirostravis–Avisauridae clade.)

Another point is that the more homoplasy we have, the higher the rate of change (here: visible anatomical mutation) must have been. The higher the rate of change, the higher the statistical inconsistency of parsimony.

In short, paleontologists (Atterholt et al. just follow the standard in paleophylogenetic publications) use data with tree-unlike signal to infer trees (see also David's last post on illogicality in phylogenetics) under a possibly invalid optimality criterion, which are then used to downweight characters (eliminate noise due to homoplasy) to infer less noisy, "better" trees.

The basic signal

We can't change the data, but we can explore and show its signal. And the basic signal from the unfiltered matrix is best visualized using a Neighbor-net splits graph.

Neighbor-net based on mean pairwise taxon distances. Thick edges correspond to branches in the published tree.

Some differentiation patterns that explain the clades in the tree can be traced, but it becomes difficult in the group that is of most interest: the (inferred) clade(s) comprising the newly described fossil. In the Neighbor-net, the new fossil is placed close to one other member of the Avisauridae, but not to all of them. The matrix is not optimal for the task at hand.

The data properties

The matrix is a multistate matrix with up to six states in the definition line (although only five are used, as state "5" is not present). The taxa have variable gappiness (i.e. the proportion of completely undetermined cells), between 2% (extant birds: Anas and Gallus) and 94% (Intiornis, a member of the Avisauridae); the median is 56%, and the average is close to it (54%). The "hypothetically" placed fossil Mirarce eatoni (in the matrix it is under its old designation: "Kaiparowits") lacks a bit more of the scored characters (61%). That may strike one as a lot, but note that the matrix has 253 characters! However, we may well ask: if I want to place a fossil for which I can score 99 characters, why bother to include another ~150 that tell me nothing about its affinity? (Note: paleobotanists struggle hard even to get such numbers; we usually have at best 50 characters.)
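For readers who want to check such numbers on their own matrix, the following is a small sketch of how gappiness can be computed. The toy matrix is invented, and the assumption that "?" marks an undetermined cell may not hold for every matrix file.

```python
from statistics import mean, median

MISSING = "?"  # assumption: symbol used for completely undetermined cells


def gappiness(row: str) -> float:
    """Proportion of cells in one taxon's row that are undetermined."""
    return sum(cell == MISSING for cell in row) / len(row)


# Toy matrix (invented): one string of character states per taxon.
toy_matrix = {
    "TaxonA": "0110210?",
    "TaxonB": "0?1?????",
    "TaxonC": "????1???",
}

per_taxon = {taxon: gappiness(row) for taxon, row in toy_matrix.items()}
print(per_taxon)
print(median(per_taxon.values()), mean(per_taxon.values()))

# The arithmetic used in the text: 61% missing out of 253 characters.
scored = round(253 * (1 - 0.61))
print(scored)
```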

Its closest putative relatives, the Avisaurus s.l., lack 90% of the characters; leaving us with max. 25 characters supporting the relevant clade (assuming that the 10% are all found in Mirarce as well). Coverage is not much better in the next-closest relatives (phylogenetically speaking).

Data coverage in the phylogenetic neighborhood of Mirarce eatoni

The missing data percentage may have misled the Neighbor-net a bit, because we will have fed it with unrepresentative or highly ambiguous pairwise distances. In the network, the focus fossil comes close to Neuquenornis, the only other Avisauridae with some data coverage. Looking at the heat map below, we see that missing data is indeed a problem in this matrix: we have zero distances between several pairs that show different distances to the better-covered taxa.

The distance matrix drawn as a heat map: green = similar, red = dissimilar (values range between 0 and 0.8). Red arrows: taxa with too many (and ambiguous) zero pairwise distances.

The closest relative of Mirarce is, indeed, Avisaurus/Gettya gloriae, but the latter has zero distances to various other poorly covered taxa from the phylogenetic neighborhood, in contrast to the much better-covered Mirarce. Neighbor-nets are very good at getting the obvious out of a morphological matrix, but they don't perform miracles. However, why should we include poorly known taxa at all during phylogenetic inference? Wouldn't it be better to infer a backbone tree (or network showing the alternative hypotheses) based on a less gappy matrix, and then find the optimal position of the poorly known taxa within that tree (network)?
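How such ambiguous zero distances arise can be illustrated with a short sketch. It assumes a simple mean pairwise distance that only compares characters scored in both taxa; the rows are invented and the function name is hypothetical.

```python
MISSING = "?"


def mean_pairwise_distance(row_a: str, row_b: str):
    """Share of differing states among characters scored in both taxa.

    Returns (distance, number_of_compared_characters); the distance is
    None when the two taxa share no scored character at all.
    """
    compared = [(a, b) for a, b in zip(row_a, row_b)
                if a != MISSING and b != MISSING]
    if not compared:
        return None, 0
    differences = sum(a != b for a, b in compared)
    return differences / len(compared), len(compared)


# Invented rows: one well-covered taxon and two poorly covered ones.
well_covered = "0101101001"
poor_1 = "01????????"
poor_2 = "0???0?????"

print(mean_pairwise_distance(poor_1, poor_2))        # rests on 1 character
print(mean_pairwise_distance(well_covered, poor_1))  # rests on 2 characters
print(mean_pairwise_distance(well_covered, poor_2))  # rests on 2 characters
```

The two poorly covered taxa end up with a zero distance to each other, although they differ in their distance to the well-covered taxon. Reporting the number of compared characters alongside each distance makes this kind of artefact easy to spot.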

Estimating the actual character support

Some characters cover just 10–20% of the taxa, whereas others are scored for most of them; more than half of the characters are missing for more than half of the taxa. Using TNT's iterative weight-to-fit option means that we infer a tree, ideally one fitting the well-covered data (taxon- and character-wise), and then downweight all conflicting characters elsewhere to fit this tree. We then end up with a tree where we have no idea about actual character support. Since the matrix is a Swiss cheese, we can only re-affirm the first-inferred tree.

Let's check the raw character support, using non-parametric bootstrapping and maximum likelihood as the optimality criterion (corrected for ascertainment bias, as implemented in RAxML).

ML-BS Consensus Network (using Lewis' 2-parameter Mk+G model). Edge lengths are proportional to the BS support values of taxon bipartitions (= phylogenetic splits, internodes, branches in phylogenetic trees). Only splits are shown that occurred in at least 10% of 900 BS pseudoreplicates (number of necessary BS replicates determined by the Extended Majority Rule Bootstrap criterion), trivial splits collapsed. Thick edges correspond with branches in Atterholt et al.'s iterative parsimony tree; coloring as before.

The ML bootstrap Consensus Network bears not a few similarities to the distance-based Neighbor-net. The characters do not support the Avisauridae subtree, as depicted in the published TNT tree, but there are faint signals associating some of them to each other, despite the missing data. Keep in mind: a BS support of 20 for one alternative and < 10 for all others means (ideally) one fifth of the characters support the split, and the rest have no (coherent) information. Some sister pairs have quite high support (for this kind of data set), and Gettya gloriae is resolved as sister of Mirarce (unambiguously, with a BS support = 67). But, the matrix hardly has the capacity to resolve deeper relationships within the group of interest, the Enantiornithes — the polytomy with the next relatives seen in the tree and the corresponding clade dissolve. This confirms what we saw in the Neighbor-Net (despite missing data distortion).
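The bookkeeping behind such a support consensus network is simple and can be sketched as follows. The replicate trees are invented, each tree is reduced to its set of non-trivial splits, and each split is represented by one (here: the smaller) side of the bipartition; this is an illustration of the principle, not of any particular program's implementation.

```python
from collections import Counter


def split_frequencies(replicate_splits):
    """Fraction of bootstrap replicates in which each split occurs."""
    counts = Counter(split for splits in replicate_splits for split in splits)
    n = len(replicate_splits)
    return {split: count / n for split, count in counts.items()}


# Invented splits for five taxa A-E.
AB = frozenset("AB")
AC = frozenset("AC")
DE = frozenset("DE")

# Ten invented bootstrap replicates.
replicates = [{AB, DE}] * 6 + [{AC, DE}] * 3 + [{AB}] * 1

support = split_frequencies(replicates)
# Keep only splits found in at least 10% of the replicates.
shown = {split: freq for split, freq in support.items() if freq >= 0.10}

for split, freq in sorted(shown.items(), key=lambda item: -item[1]):
    print(sorted(split), freq)
```

Both AB and AC pass the threshold although they cannot occur in the same tree, which is why the result has to be drawn as a network rather than as a single consensus tree.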

The matrix and the tree show something that could have been deduced directly from the distance matrix: the poorly known Gettya (Avisaurus) gloriae is (literally) the closest relative of the enigmatic new genus / species Mirarce (morphological distance of 0.08 compared to 0.1–0.64 for all other taxa). But is this overall similarity enough to conclude Avisaurus, Gettya and Mirarce are a monophyletic group within the Avisauridae?

What the authors (and all paleontologists doing phylogenetics) should have done

(I would have skipped all trees, naturally, but peer reviewers and most readers probably need to see them.)

  • Trimmed the matrix to include only those characters preserved in the fossil of interest, in order to minimize missing data artefacts during inference.
  • Shown the Neighbor-net to visualize the primary signal situation, including and excluding poorly covered taxa. From the Neighbor-net it is already obvious that the fossil is an Enantiornithes, so any subsequent optimization / inference could have focussed on this group alone.
  • Then inferred a backbone tree excluding poorly covered taxa, and shown the resulting phylogram. In case one needs to test the Enantiornithes root (the Neighbor-net gives us two alternatives for the Enantiornithes root: Pengornis + Eopengornis or Protopteryx + Iberomesornis), there is no point in including the poorly covered Enantiornithes or the worst-covered taxa outside this clade.
  • Then optimized the position of the poorly covered taxa in the backbone tree. I recommend using RAxML's evolutionary placement algorithm (EPA) for this, but you can also do this in a parsimony framework if you wish. (EPA can also be used to test outgroup roots: here, one would search the branch at which all non-Enantiornithes fit best.)
  • Shown the resulting phylogram including all taxa — that is, read in the topology to the analysis, and then re-optimize branch lengths.
  • Shown a Support Consensus Network to illustrate the support for the branches in the preferred tree and their competing alternatives. (There may be one or more, as there are many options to estimate branch support.) How sure can we be about relationships within the Avisauridae and their relationships to other Enantiornithes?
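The first steps of this recipe are pure data handling and can be sketched without any phylogenetic software. The matrix, the taxon names and the 50% threshold below are invented for illustration; the tree inference and placement steps themselves are left to the programs named above.

```python
MISSING = "?"


def trim_to_focus(matrix: dict, focus: str) -> dict:
    """Keep only the characters that are scored in the focus taxon."""
    keep = [i for i, state in enumerate(matrix[focus]) if state != MISSING]
    return {taxon: "".join(row[i] for i in keep)
            for taxon, row in matrix.items()}


def well_covered(matrix: dict, max_missing: float) -> list:
    """Taxa whose share of missing cells does not exceed max_missing."""
    return [taxon for taxon, row in matrix.items()
            if row.count(MISSING) / len(row) <= max_missing]


# Invented toy matrix.
toy = {
    "Focus":     "01??1?0?",
    "BackboneA": "01011100",
    "BackboneB": "1?001011",
    "PoorC":     "??????1?",
}

trimmed = trim_to_focus(toy, "Focus")
backbone = well_covered(trimmed, max_missing=0.5)
to_place = [taxon for taxon in trimmed if taxon not in backbone]

print(trimmed)
print("backbone taxa:", backbone)
print("taxa to place afterwards:", to_place)
```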



Postscriptum. For those who are curious about what the ML tree would look like, here it is:


I have no idea about birds, but from a methodological point of view this is an equally (if not more, because unforced) valid hypothesis for the data set. It also demonstrates the limitations of the data: note the relatively long branches with very low support making up the backbone of the Enantiornithes clade. This is typical for matrices lacking coherent discriminatory signal and/or struggling with internal conflict.

Monday, December 3, 2018

The pedigree of grape varieties


We are all familiar with the concept of a family tree (formally called a pedigree). People have been compiling them for at least a thousand years, as the first known illustration is from c.1000 CE (see the post on The first royal pedigree). However, these are not really tree-like, in spite of their name, unless we exclude most of the ancestors from the diagram. After all, family histories consist of males and females inter-breeding in a network of relationships, and this cannot be represented as a simple tree-like diagram without leaving out most of the people. I have written blog posts about quite a few famous people who have really quite complex and non-tree-like family histories (including Cleopatra, Tutankhamun, Charles II of Spain, Charles Darwin, Henri Toulouse-Lautrec, and Albert Einstein).

A history of disease within an Amish community

Clearly, the history of domesticated organisms is even more complex than that of humans. After all, in most cases we have gone to a great deal of trouble to make these histories complex, by deliberately cross-breeding current varieties (of plants) and breeds (of animals) to make new ones. So, I have previously raised the question: Are phylogenetic trees useful for domesticated organisms? The answer is the same: no, unless you leave out most of the ancestry.

In most cases, we have no recorded history for domesticated organisms, because most of the breeding and propagating was undocumented. Until recently, it was effectively impossible to reconstruct the pedigrees. This has changed with modern access to genetic information; and there is now quite a cottage industry within biology, trying to work out how we got our current varieties of cats, dogs, cows and horses, as well as wheat, rye and grapes, etc. I have previously looked at some of these histories, including Complex hybridizations in wheat, and Complex hybridizations in barley and its relatives.

Grapes

One example of particular interest has been grape varieties. I have discussed some of the issues in a previous post: Grape genealogies are networks, not trees, including the effects of unsampled ancestors when trying to perform the reconstruction.

There are a number of places around the web where you can see heavily edited summaries of what is currently known about the grape pedigree. However, these simplifications defeat the purpose of this blog post, which is to emphasize the historical complexity. The only diagram that I know of that shows you the full network (as currently known) is one provided by Pop Chart (The Genealogy of Wine), a commercial group who provide infographic posters for just about anything. They will sell you a full-sized poster of the pedigree (3' by 2'), but here I have provided a simple overview (which you can click on to see somewhat larger).

Grape variety genealogy from Pop Chart

You can actually zoom in on the diagram on the Pop Chart web page to see all of the details. This allows you to spend a few happy hours finding your favorite varieties, and to see how they are related. You will presumably get lost among the maze of lines, as I did.

Monday, November 26, 2018

How languages lose body parts: once more about structural data in historical linguistics


This is a joint post by Guido Grimm and Johann-Mattis List.

Mattis’ last two blog posts dealt with problems of what linguists call "structural data". Here we discuss what this means for the inference of relationships between languages.

A closer look at structural data: the questionnaire issue

As pointed out before, what is called structural data in comparative linguistics is a very diverse mix of data solely unified by the idea of having some kind of questionnaire that a linguist may use when going into the field and trying to describe a certain language. These questionnaires are a bit different from the traditional concept lists usually used for the purpose of historical language comparison (see the collection of different lists in the Concepticon project by List et al. 2016). The main difference is that they are based on an imaginary question that a field worker asks an informant (which could as well be a written grammar of the language in question). Since questions can be asked in many different ways, while concepts in historical language comparison are usually restricted to the so-called "basic vocabulary", the diversity of structural datasets is much greater than the diversity we encounter when comparing questionnaires based on concept lists.

When analyzing these data, we deal with characters of very different nature, and likely different evolutionary pathways or histories. A biological analogy would probably be (true) total evidence data sets that combine genetic data from: genes/genomes with different inheritance pathways (paternally, maternally, biparentally; basic information level), morphological-anatomical data (visible form, phenotypic), palaeontological data (historical evidence), ontogenetic (life-history stages, developmental features), and biochemical data (expression level). The only difference is probably that the linguistic characters’ histories may be more complex. [Side-remark: ‘total evidence’ datasets found in the biological literature are typically just combinations of genetic and morphological data, allowing for the inclusion of extinct/fossil taxa.]

To give a specific example, let's have a look at the Chinese dataset by Szeto et al. (2018), mentioned in Mattis' blogpost from September. This dataset is now accessible as a GitHub repository (https://github.com/cldf-datasets/szetosinitic). Mattis added some information regarding the different features of the questionnaire. We list these features in slightly abbreviated form in the table below, adding rough categorizations by Mattis in the Comment column.

ID | Description | Comment
--- | --- | ---
p-1 | 5 or more tone categories | phonological / diachronic
p-2 | Retroflex fricative initials | phonological / diachronic
p-3 | Bilabial nasal coda | phonological / diachronic
p-4 | Stop codas | phonological / diachronic
p-5 | Monosyllabic word for 'snake' | lexical
p-6 | Differentiation between 'hand' and 'arm' | lexical / semantic
p-7 | Differentiation between 'defecate' and 'urinate' | lexical / semantic
p-8 | Differentiation between 'eat' and 'drink' | lexical / semantic
p-9 | Semantically void suffix in 'table' | lexical
p-10 | Different classifiers for humans and pigs | lexical / semantic
p-11 | [CLF N] constructions in subject position with definite reference | syntactic
p-12 | Reduplicated monosyllabic nouns | morphological
p-13 | Post-verbal modal auxiliary developed from 'ge/acquire' | syntactic / diachronic
p-14 | Modified-modifier order in animal gender marking | morphological / syntactic
p-15 | Post-verbal adverb meaning 'first' | lexical / syntactic
p-16 | [V DO IO] order in double object dative constructions | syntactic
p-17 | 'Give' as a disposal marker | syntactic / diachronic
p-18 | 'Give' as a passive marker | syntactic / diachronic
p-19 | 'Go' as a post-VP associated motion marker | syntactic / diachronic
p-20 | Marker-Standard-Adjective order in comparatives | syntactic
p-21 | Case system | morphological / syntactic

Mattis has tried to characterize the features, i.e. the matrix's characters, by generalizing linguistic categories: "phonological", pointing roughly to questions about pronunciation (the biological equivalent would be phenotypic traits in morphology or anatomy); "lexical", pointing to the words in the lexicon (this would be the DNA of a language); "morphological", pointing to the ways in which words are constructed; and "syntactic", pointing to the ways in which words are combined to form sentences. In combination, “morphological” and “syntactic” are equal to ‘meta-level’ biological traits, such as development-related features, ontogenetic evidence, and biochemical composition, i.e. the ways in which the genetic code is expressed or used in a living organism in adaptation to the environment.

Mattis also flagged some characters as "diachronic", to mark whether the respective feature was selected by the authors due to their independent knowledge about the history of the Chinese dialects. This is something rarely possible in biology, but imagine that we could go back in time to literally observe the evolution of a lineage over a given time-period, and code this observed evolution as traits. Note that this is not entirely science-fiction — there are two examples where we can observe directly pathways of biological evolution: mutation patterns in viruses, and horizontal modification of marine morphs in high-resolution sediment cores.

While one can discuss to what degree a certain feature should belong to this category, it is rather obvious that all phonological features are diachronic, because they name distinctions that reflect well-known processes of sound change, which happened in a couple of Chinese dialects and have been proposed in the past by dialectologists in order to classify the Chinese dialects historically.

For example, consider feature p-3 of the questionnaire: Does a given dialect have a syllable that ends in [-m]? From the history of the Chinese dialects we know that [-m] was present in Middle Chinese, but later merged with [-n] and [-ŋ] in many varieties. Given that we know that this happened, and that people have used it to mark a split, especially between the "innovative" dialects in the North and the more "conservative" ones in the South, it is clear that this feature bears explicit historical information. The same holds for all phonological features that we find in the data: p-1, the number of different tones in the dialects, again roughly reflects the differences between languages in the North and in the South (the North having lost many tones); p-2 reflects the retention or specific development of retroflex sounds (similar to sh in English as opposed to s), mostly in the North; and p-4 reflects whether a variety has syllables that can end in [-p, -t, -k], again a feature characteristic of the more "conservative" varieties in the South of China.

Figure 1: Overlap of features in Szeto et al.'s (2018) structural feature collection of Chinese dialects

Four lexical features have further been flagged as "semantic"; we query here existing or missing distinctions of concepts. People who learned, for example, Russian or certain German dialects know that it is rather common to have a single word for what other languages call "arm" and "hand" (see the respective entry in the CLICS database) or "foot" and "leg".

This diverse feature collection is coded as binary characters, reflected by presence/absence, or a yes/no answer to the question in the questionnaire. The choice of features is very selective. A biological analogy would be a matrix collecting incompatible splits of paternal (molecular) genealogies, along with a few prominent phenotypical traits (reflecting major evolutionary steps), and some traits that we expect to be primarily triggered not by genetics (inheritance) but by expression or adaptation to the environment. Biologists would not phylogenetically analyze such diverse and complex, potentially selection-biased data (although it could be very interesting), but linguists do.

In this context, it is remarkable, but also typical for this kind of data, that the 21-character feature collection by Szeto et al. (2018) has no feature in common with the collection by Norman (2003), a 15-character matrix, which we also converted to our Cross-Linguistic Data Formats (see Forkel et al. 2018) in order to increase the data comparability.


Figure 2: A Neighbor-net splits graph of the structural data by Szeto et al. (2018).
The typification, coded as binary matrix to infer the Neighbor-net splits graph in Figure 2, demonstrates some basic characteristics of such 2-dimensional graphs. Note four of the 'characters' (typification categories) correlate with an edge(-bundle) in the network, separating the 'taxa' (the queried features). All "semantic" taxa are also "lexical", but "lexical" is more comprehensive, hence, "semantic" is placed as 'descendant' of "lexical" (Neighbor-nets can visualize ancestor-descendant relationships to some degree). "Morphological" taxa are either just "morphological" or also "syntactic", hence the pronounced box.

For "diachronic" and "syntactic", we have no corresponding edge(-bundle), because one taxon is also "lexical", but the others are "diachronic" and "syntactic" — this is a conflict that cannot be resolved with two dimensions. To visualize all the resultant 'taxon' splits, called also taxon bipartitions, we would need a third dimension. Lacking a third dimension, the Neighbor-net prioritizes keeping most "syntactic" together, because the "diachronic-syntactic" are closer to "syntactic" (max. 1 'character' difference) than to "diachronic-phonological" (2 character difference). The "syntactic-lexical" has to be placed apart because it is equally close to "lexical" and "syntactic" 'taxa', but differs much from "morphological-syntactic" or "diachronic-syntactic", the closest two relatives of "syntactic"-only 'taxa'. It is resolved closer to the centre of the graph, because it is more closely related to the other "syntactic" taxa than to the rest of the "lexical" taxa. This is also the reason why the "syntactic"-only taxa have to be placed farther out: "Diachronic-phonological" and "syntactic-lexical" are closer to the other endpoints, and the distance of "syntactic"-only to "diachronic-phonological", "lexical" and "morphological" should be as large as possible.
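The character differences quoted in this paragraph can be re-derived directly from the Comment column of the table. The sketch below codes each combination of categories as a binary vector and counts the differing positions; the function names are hypothetical, and this plain difference count is only the input to a Neighbor-net, not the network algorithm itself.

```python
CATEGORIES = ["phonological", "lexical", "semantic",
              "morphological", "syntactic", "diachronic"]


def code(comment: str) -> tuple:
    """Turn a Comment entry such as 'syntactic / diachronic' into 0/1 states."""
    labels = {part.strip() for part in comment.split("/")}
    return tuple(int(category in labels) for category in CATEGORIES)


def hamming(a: tuple, b: tuple) -> int:
    """Number of typification categories in which two 'taxa' differ."""
    return sum(x != y for x, y in zip(a, b))


# The eight distinct combinations found in the Comment column.
types = [
    "phonological / diachronic", "lexical", "lexical / semantic",
    "syntactic", "morphological", "syntactic / diachronic",
    "morphological / syntactic", "lexical / syntactic",
]

for i, first in enumerate(types):
    for second in types[i + 1:]:
        print(f"{first} vs {second}: {hamming(code(first), code(second))}")
```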

Losing body parts: How data coding masks underlying processes

Most typologists collecting structural data are not per se interested in phylogenies. Yet the fact that scholars deliberately collect historical (diachronic) features shows that, even if they would not necessarily admit it, they have a genuine interest in uncovering the history of the languages in question; or at least, in how closely related the languages (or here: dialects) are. But this requires understanding the characters we analyze, the collected "structural data".

In evolutionary biology, the key question people (should) ask when trying to select characters is how their change can be modeled on a tree or a network. What processes could be expected that shaped the data? What is behind the diversity? Is similarity or dissimilarity instigated by:
  • [A] inheritance, i.e. passed from an ancestor to all / some of its descendants,
  • [B] random mutation and/or sorting, i.e. the product of a stochastic, evolutionary neutral process,
  • [C] non-random mutation, i.e. processes that recur frequently, may be beneficial and positively (gain, or negatively: loss) selected for, or
  • [D] secondary contact, mixing of lineages by hybridization (symmetric mixing) and introgression (asymmetric mixing)?
[A]–[C] are vertical processes following a tree, even if the tree does not necessarily need to be the same; [D] is (mostly) horizontal and can only be modeled using a network. For each of the above, we can find an analogy in the evolution of languages.

In addition, process [C], and to a lesser extent [D], can lead to what biologists call 'homoplasy', meaning that the same feature is observed in two unrelated or distantly related taxa. In the context of phylogenetic inferences, homoplasies inflict tree-incompatible signals, seemingly reticulate patterns originating from a tree-like evolution. Structural (or other) linguistic data and phenotypical biological data have a lot in common: complex processes are boiled down to mere absence or presence of features (or traits, as they are called in biology).

Figure 3: Basic evolutionary processes that we need to consider when looking at linguistic data (or at biological traits, if we replace simplification by adaptive evolution, i.e. positively selected traits).

If we check the features in our table above, and ask: to which degree can they be used to model these processes (see also David's last post on illogic in phylogenetics), e.g. simply distinguish between similarity by chance, relatedness, or secondary contact (mixing), we can easily see that they are by no means optimal for evolutionary investigations. This is not necessarily because of the processes they involve, but more importantly because of the data sampling, which makes modeling almost impossible, with each character needing its own model.

As an example, take the feature p-6 in our table. Whether or not a language makes a distinction between "arm" and "hand" does not seem to follow specific geographic or genealogical patterns. The following figure shows a plot from the CLICS database (List et al. 2018), visualizing the most frequently recurring polysemies (or colexifications) centering around the concept "arm". The full visualization in CLICS can be found here, where one can hover with the mouse over the link between "arm" and "hand" (marked in green below) for details.

Figure 4: Colexification network in the CLICS database.

From eye-balling the data, it is hard to find a consistent geographic / language-family pattern, which suggests that the feature p-6 is likely to show a high degree of homoplasy in the languages of the world. Obviously, different people decided not to distinguish between "hand" and "arm". But the example of the Sami languages in northern Scandinavia also demonstrates that some people using related, long-isolated languages consistently don't make the distinction. Here, the homoplasy is inherited (lineage-conserved). A biological analogy would be the rarely applied difference between a 'convergence' (a trait is independently evolved in different lineages) and a 'parallelism' (a trait is expressed by different but not all members of the same lineage).

Figure 5: Geographic distribution of arm/hand colexifications in the CLICS database.

A specific analogy to the "hand-arm" colexification / differentiation pattern is leaf shedding in oaks and their relatives (Fagaceae, the beech family). Some oak lineages (section Cerris of oaks, beech trees, chestnuts) are essentially or strictly deciduous, others (sections Cyclobalanopsis, Ilex, the sister sections of Cerris; Castanopsis, the sister genus of chestnuts) are always evergreen, and the biggest group (number of species) of all Fagaceae, subgenus Quercus, includes evergreen (1 section), mixed (the two by far largest sections), and deciduous (1 nearly extinct section) sublineages. To some extent this is linked to the climate in which the species thrive (high latitudes and/or per-humid = deciduous, low latitude and/or seasonally dry = evergreen), but consistently evergreen and deciduous lineages do co-exist.

Looking at the Chinese dialects, we see that p-6 represents a trivial split in the network.

Figure 6: A Neighbor-net inferred from the Szeto et al. matrix. Dialects that distinguish "arm" and "hand" with filled dots ('1' for character 6 in the matrix), those that don't ('0') with empty dots. We can put a single line separating all don't- from do-taxa (dialects), i.e. a bipartition of the taxon set fitting the character partition seen in (p-)6.

But, given the general patterning of the feature on a global scale, does this really mean that it is inherited — that is, a good feature to reflect relatedness?

Whether a feature is likely to be homoplastic is just one part of the story. Linguists typically have more information about how things change than do biologists, putting a double-edged sword in their hands (that they hardly ever use). Asking whether "hand" and "arm" are expressed by distinct words does not consider the underlying processes. Here, we can assume at least three different character states, namely:
  1. "arm" and "hand" are expressed by the same word, which is the original word for "arm",
  2. "arm" and "hand" are expressed by the same word, which is the original word for "hand", and
  3. "arm" and "hand" are expressed by different words.
We could even have a fourth state, in which a single word was always used, throughout the long history of the ancestral languages, to express "arm or hand" (i.e., both body parts). In that case, no differentiation and no later generalization from either "arm" or "hand" took place.

Figure 7: Left, current scoring; right, scoring taking into account the actual mutation process.

From Ancient Chinese, we know that "1" (Yes, I do differentiate between "arm" and "hand") was most likely the original state. We can further assume that once the distinction is dropped, it is less likely to come back again (although this can, of course, also happen). That is, our model involves two possible mutations (vertical process): we lose the word for "arm" due to its replacement by "hand", or we lose the word for "hand" due to its replacement by "arm", each with its own probability.

Figure 8: Probability distribution for transitions involving "hand" and "arm".

The probability, mutation or not, and which mutation, relates to four principal driving factors:
  1. probability of random loss (mutation)
  2. probability of random gain (mutation)
  3. global linguistic tendencies
  4. regional socially-enforced preference
Establishing p-arm (loss "arm") and p-hand (loss "hand") is not trivial, because they may be affected by what is the word for "arm" and "hand" (for simplicity we will assume that p+arm and p+hand are close to 0). We could expect a higher tendency to keep the word that is easier to pronounce or less easy to confuse with other words and, hence, is easier to understand. If two dialects with different states come into contact, this may also influence the decision to take over a state or not. In everyday language, a distinction between "arm" and "hand" may be useless because of the clear context in which both words are used, so p1-word > p2-words. However, closeness to administration centers or areas with a higher percentage of educated people could decrease p1-word, because it may be considered a sign of poor social standing to not make the difference between "arm" and "hand".
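To make the vertical part of this model concrete, here is a minimal sketch of the two loss processes as a discrete-time process. The state names, the per-step probabilities and the number of steps are invented, gains (p+arm, p+hand) are set to zero as assumed above, and contact between dialects is ignored.

```python
def step(distribution, p_loss_arm: float, p_loss_hand: float):
    """One time step: the state with two words can lose either word."""
    two_words, hand_word_only, arm_word_only = distribution
    stay = 1.0 - p_loss_arm - p_loss_hand
    return (
        two_words * stay,
        hand_word_only + two_words * p_loss_arm,   # "arm" replaced by "hand"
        arm_word_only + two_words * p_loss_hand,   # "hand" replaced by "arm"
    )


def evolve(n_steps: int, p_loss_arm: float, p_loss_hand: float):
    """Start with the distinction present (the assumed original state)."""
    distribution = (1.0, 0.0, 0.0)
    for _ in range(n_steps):
        distribution = step(distribution, p_loss_arm, p_loss_hand)
    return distribution


# Invented values: losing "arm" is twice as likely as losing "hand".
result = evolve(n_steps=50, p_loss_arm=0.02, p_loss_hand=0.01)
print([round(p, 3) for p in result])
```

Under the current binary scoring, the second and third entries of the result would be lumped into a single state "0", even though they arise with different probabilities.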

Figure 9: Vertical and horizontal processes involving transitions of "hand" and "arm".

Estimating p can only be left to phylogenetic algorithms (unless more detailed information is available). But we can (and should) design the questionnaire to capture as many of the processes as possible. In this case, to not only ask whether there is a distinction between "arm" and "hand", but also to find out whether the word "arm" or "hand" is used, e.g. by using two questions/binary characters:
  • Do we use "hand"?
  • Do we use "arm"?
Note that this question requires quite a deal of knowledge about the languages under investigation, since it may not be trivial to find out what was the "original" word for "arm" or "hand".
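The difference between the two ways of scoring can be written down explicitly. In the sketch below, the state labels are hypothetical names for the three states discussed above, and the two binary characters correspond to the two questions.

```python
# Hypothetical labels for the three character states discussed above.
RECODING = {
    "distinct_words":     {"uses_hand_word": 1, "uses_arm_word": 1},
    "hand_word_for_both": {"uses_hand_word": 1, "uses_arm_word": 0},
    "arm_word_for_both":  {"uses_hand_word": 0, "uses_arm_word": 1},
}


def current_scoring(state: str) -> int:
    """Single binary character: is there a distinction at all?"""
    return int(state == "distinct_words")


def two_character_scoring(state: str) -> tuple:
    """Two binary characters, one per question."""
    answers = RECODING[state]
    return answers["uses_hand_word"], answers["uses_arm_word"]


for state in RECODING:
    print(state, current_scoring(state), two_character_scoring(state))
```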

Therefore, a further step would be to replace the binary characters by a value measuring the similarity between the words used for "hand" and those used for "arm". One could again argue that adding this information would add historical information to the feature, but it is clear that the abstract nature of the question is hiding important phylogenetic (and also typological) information from us.
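One possible way to obtain such a value is a normalized edit distance between the two word forms. This is only one of many conceivable measures and an assumption on our side; the word forms below are invented and do not come from any dialect in the dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-symbol edits."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (symbol_a != symbol_b)  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical forms, 0.0 for forms sharing nothing."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Invented word pairs for "hand" and "arm".
print(similarity("kata", "kata"))  # same word for both concepts
print(similarity("kata", "kuta"))  # similar, possibly related words
print(similarity("kata", "miro"))  # unrelated words
```

A value of 1.0 corresponds to the old state "0" (no distinction), while everything below 1.0 spreads the old state "1" over a continuous scale.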

It seems, therefore, that instead of asking whether or not there is a distinction between "arm" and "hand", it would make much more sense to trace the cognacy (or homology) of the expressions for "arm" and "hand" across all taxa (languages, dialects), and think of ways how this could be scored and modeled by phylogenetic analyses. The structural data framework with its features based on simple yes-no questions therefore inevitably leads to a misinterpretation of processes when analyzing the data with phylogenetic software.

The need for exploratory data analysis

In reality, structural (or other) data sets in linguistics face problems similar to the ones palaeontologists face when trying to establish phylogenetic relationships between fossils (extinct organisms) — the probability for a mutation (visible change) is largely unknown, and differs not only from character to character but also within the same characters. A state 0, 1, 2 etc. may have a higher probability to manifest (or get lost) in one lineage than in another.

In addition, the linguistic problems resemble those of biologists working close to and below the species level (see also Guido's post on population dynamics and individual-based fossil phylogenies): reticulation is the rule rather than the exception, as similarity is triggered by contact, so that horizontal processes, not inheritance, may dominate evolutionary dynamics. Thus, the diversity pattern cannot be modeled by a tree alone. Establishing explicit probabilistic frameworks to deal with this may not only be difficult but even impossible (given the available data). Meanwhile, however, one can embrace exploratory data analysis as a heuristic tool.

So, let's look at the example. As in the original paper, we used the binary matrix of the 21 characters to infer a planar, 2-dimensional (meta-)phylogenetic network, a Neighbor-net splits graph. The resulting graph is a longitudinally inflated spider-web, with its endpoints defined by the southern Chinese dialects (e.g. Guangzhou, Nanning, Taishan) and the north-central dialects (e.g. Linxia and Xining). The latter are significantly closer (geographically and data-wise) to the Beijing version of Chinese.

Figure 10: The Neighbor-net based on simple mean (Hamming) pairwise binary character distances
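The distance underlying such a graph is simple to state: for each pair of dialects, it is the proportion of characters at which they differ. The sketch below uses a small, invented matrix (dialects "A" to "D" with five characters), not the actual 21-character data; the Neighbor-net itself would then be computed from these distances with dedicated software.

```python
def mean_hamming(row_a, row_b):
    """Proportion of binary characters at which two dialects differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of characters")
    differences = sum(1 for x, y in zip(row_a, row_b) if x != y)
    return differences / len(row_a)


def distance_matrix(matrix):
    """All pairwise distances, keyed by (dialect, dialect)."""
    names = sorted(matrix)
    return {(a, b): mean_hamming(matrix[a], matrix[b])
            for a in names for b in names}


# Hypothetical toy matrix: rows are dialects, columns are binary characters.
toy = {
    "A": [1, 1, 0, 0, 1],
    "B": [1, 1, 0, 0, 1],
    "C": [1, 0, 0, 1, 1],
    "D": [0, 0, 1, 1, 0],
}

distances = distance_matrix(toy)
print(distances[("A", "B")])  # 0.0, identical for all characters
print(distances[("A", "C")])  # 0.4
print(distances[("C", "D")])  # 0.6
```

Pairs with a distance of 0 are identical for every character in the matrix and cannot be separated by any method that starts from these distances.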

The first thing to note is that the matrix includes dialects that are indistinct (green stars) for all 21 characters, and some that are geographically and data-wise very similar to each other, while being distinct from all others (green ovals). In biology, we call this (taxic, lineage-)coherence. In addition to Linxia and Xining, we have Nanchang and Lichuan characterized by elongated ('tree-like') terminal edge-bundles. These obviously represent closely related dialects sharing a long(er) common history.

Others have more than one possible closest relative. For instance, Liuzhou may share quite a few features with Guangzhou, but it is equally close to the Nanchang-Lichuan pair (yellow fields). Dongtai (orange star) is unique, but its 'neighborhood' (orange-ish brackets), as defined by shared edge-bundles, includes Changsha (which again is most closely related to Jiujang) and Taiyuan plus Baotou, the latter two being substantially closer to the Beijing (red star) group.

Similar to Dongtai, and also connected to the central part of the graph, are dialects with long terminal branches (edges). Hefeng (blue star) is substantially different from Dongtai, and only has one further dialect in its neighborhood (blue bracket), Wangrong, a close relative of the Beijing group. The Wuhan, Chengdu, and Guiyang (gray field) dialects appear, on the other hand, to be completely isolated.

As explained above, there are different processes, vertical and horizontal ones, that may trigger similarity, and we want to get an idea as to which character may be influenced by which process. From the graph, several aspects are obvious:
  • geographic closeness plays a major role,
  • the signal provided by the data is not tree-like,
  • the data is highly homoplastic, and includes internal conflict.
Not so obvious is whether this situation is due to random or evolutionarily directed similarity, or to reticulation. Since the graph is planar, and puts the Chinese dialects in a circular order, we can order the character matrix accordingly to see how the traits form groups (which could be called cliques in this context). In the next step, we can then map each character onto this network, to see how well it fits with the overall similarity pattern. We showed this above for p-6 (hand-arm-distinction, one split), and here we add a character with quite a poor fit, p-17 (syntactic-diachronic), "give" as a disposal marker.
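A rough first check of how well a character fits the circular order can be sketched as follows: read the character's states along the circle and count the contiguous blocks of "1". One block means the character defines a single split that is compatible with the ordering; several blocks indicate a poorer fit. The ordering and the three characters below are invented for illustration, and a single block does not guarantee that the graph actually contains a corresponding edge-bundle.

```python
def count_blocks(states):
    """Number of contiguous runs of state 1 when `states` is read as a circle."""
    if not states:
        return 0
    if 0 not in states:
        return 1  # every taxon has state 1: a single block spanning the circle
    # A block starts wherever a 1 follows a 0; index -1 wraps around the circle.
    return sum(1 for i in range(len(states))
               if states[i - 1] == 0 and states[i] == 1)


# Hypothetical circular order of eight dialects and three toy characters.
circular_order = ["A", "B", "C", "D", "E", "F", "G", "H"]
characters = {
    "good_fit": {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0, "G": 0, "H": 0},
    "wraps":    {"A": 1, "B": 0, "C": 0, "D": 0, "E": 0, "F": 0, "G": 0, "H": 1},
    "poor_fit": {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0, "F": 1, "G": 1, "H": 0},
}

blocks = {
    name: count_blocks([states[taxon] for taxon in circular_order])
    for name, states in characters.items()
}
print(blocks)  # {'good_fit': 1, 'wraps': 1, 'poor_fit': 3}
```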

Figure 11: Character mapping for p-17 (filled dots, "give" used as disposal marker; empty, not used), with the p-6 split indicated as well. Red, splits (taxon bipartitions defined by character cliques) that have no corresponding edge-bundle (neighborhood); blue, splits with neighborhood; green, unique, isolated change (deviation from the rule) within the neighborhood.

The number of inferred mutations in the map is determined using Ockham’s Razor, upon which parsimony (tree and network) inference relies as well. Using such a map, we can even provide an estimate for how likely (qualitatively speaking) a change is under the assumption that neighborhoods in the graph represent either exchange (homogenization) between closely related dialects or are inherited, reflecting both horizontal and vertical relatedness. Mapping characters on a 2-dimensional network allows finding a scenario beyond a single tree hypothesis.

For p-6, we need just one change (i.e. loss in all more south-bound dialects), but we don't find an edge bundle corresponding to this unique change. Given what we discussed above about p-6, we have more independent losses than the simple reconstructed one. Social preference or general contact for retaining the primitive state of having two words could explain why dialects closer to the Beijing dialect area have a "0", although not all are closely related in general.

For p-17, we need at least four (independent) changes from "0" → "1", two of which have a corresponding edge bundle (blue, Nanchang plus Lichuan, Changsha plus Dongtai), one isolated (green, Luoyang), and one without a corresponding edge bundle (Wuhan and Hefeng dialects). The (equally parsimonious) alternative for p-17 would be a series of gains and losses, with the same number of steps:

Figure 12: Alternative scenario for p-17.

This is where one needs to consider additional knowledge about the probability of getting or retaining a certain feature. The state shared by most dialects across the entire net is “0”, irrespective of overall similarity, which would make it a natural pick for the primitive state. Thus, assuming four (or more) changes from 0 → 1 (acquisition of the queried feature) seems more plausible than assuming two independent acquisitions (starting with the Beijing group; note that the position of the root will not change the number of needed changes), followed by a loss (1 → 0) in many southbound dialects and a re-gain (0 → 1) in the Nanchang + Lichuan dialects.
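How much the choice between equally parsimonious scenarios depends on what we assume about gains and losses can be illustrated with weighted (Sankoff) parsimony. The sketch below works on a small invented tree rather than on the network discussed here, and the cost values are arbitrary assumptions; it only shows the principle that two scenarios with the same number of steps stop being equivalent once gains and losses are weighted differently.

```python
import math


def sankoff(tree, leaf_states, gain_cost=1.0, loss_cost=1.0):
    """Return (cost with state 0 at this node, cost with state 1 at this node).

    `tree` is a leaf name (str) or a tuple of subtrees.
    """
    step = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): gain_cost, (1, 0): loss_cost}
    if isinstance(tree, str):
        observed = leaf_states[tree]
        return tuple(0.0 if s == observed else math.inf for s in (0, 1))
    child_costs = [sankoff(child, leaf_states, gain_cost, loss_cost)
                   for child in tree]
    # For each parent state, every child contributes its cheapest option.
    return tuple(
        sum(min(step[(s, t)] + costs[t] for t in (0, 1)) for costs in child_costs)
        for s in (0, 1)
    )


# Hypothetical four-dialect tree and one binary character.
toy_tree = (("A", "B"), ("C", "D"))
toy_states = {"A": 1, "B": 0, "C": 1, "D": 0}

equal = sankoff(toy_tree, toy_states)                     # two gains or two losses
costly_loss = sankoff(toy_tree, toy_states, 1.0, 2.0)     # losses cost twice as much
costly_gain = sankoff(toy_tree, toy_states, 2.0, 1.0)     # gains cost twice as much
print(equal, costly_loss, costly_gain)
```

With equal costs, a root state of "0" (two gains) and a root state of "1" (two losses) are tied; making either kind of change more expensive breaks the tie in favor of the other scenario.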

The same assessment can be made for all of the characters, and we end up with something like this:

Figure 13: Fully annotated split network of the data. Changes relating to edge-bundles are colored accordingly; arcs indicate changes without a corresponding edge-bundle. Note the prominent yellow split that defines a neighborhood of dialects most similar to the Beijing dialect, although there is no character supporting this edge. The rather poor fit of many character splits (cliques) with edge-bundles relates to the fact that we visualize a highly complex diversification (multi-dimensional processes) using a planar, 2-dimensional graph.

While this figure may be confusing at first sight, it comprehensively shows what the characters contribute to the overall graph. We can discriminate more-likely from less-likely mutations (how many changes are needed at least), but also the character assemblies shared by putatively closely related dialects.
  • p-3 and p-11 are typical features of Guangzhou and allied dialects within the southern Chinese complex. p-3 is also present in Lichuan, and p-11 in Jixi (thus in not so distant dialects).
  • Features p-6 to p-9, p-16, and p-19 form a diagnostic suite for the Guangzhou dialects and other dialects related to them in one way or another, and distinguish them from, e.g., the Beijing group.
  • The latter, the Beijing group, has fewer diagnostic character assemblies. One characteristic sequence could be p-1, p-2, p-12, p-14, but this includes three features with a minimum of 3+ changes. Similarity here is mostly the result of a lack of (potentially) derived features (hence, the character-unsupported yellow edge-bundle defining a Beijing-including neighborhood).

Outlook and summary

In this re-investigation, we have, once more, commented on the problems we see with the use of structural features for the purpose of historical language comparison and phylogenetic reconstruction. We see the major problems in the (often) unfortunate choice of question, resulting in elicitations of features that cannot be easily modeled with current software for phylogenetic analyses. It is important to keep in mind, in linguistics and phylogenetics, that we can infer trees or networks based on data of no matter what quality and information content. But before we present the result, we should have taken a look at the primary data.
  • Does it fit with the resulting graph, or not?
  • Where does it fit, and where not?
In the context of our critique of linguistic questionnaires, the mapping strategy discussed above opens a potential avenue to identify:
  • stable / unstable features (geographically or evolution-wise) and
  • coherent / incoherent features.
Based on this, we can then inquire as to which degree language (or dialect) groups influenced, stabilized or modified each other by geographic proximity.

Inference-wise, the natural next step would be to use the information about the minimum number of necessary changes to counter-weight characters. This would eventually allow us to use median-network (and related) approaches on the data, which are currently the only way to explicitly identify ancestors using phylogenetic reconstructions. With the current matrices, the extreme homoplasy makes an unweighted application of median networks and related methods impossible.
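One simple way to sketch such a counter-weighting (an assumption for illustration; other weighting schemes are possible) is to give each character a weight inversely proportional to its minimum number of changes, so that highly homoplastic characters contribute less. The character names, step counts and dialect states below are invented, and the weighted distance is only used to show the effect of the weights.

```python
def character_weights(min_changes):
    """Weight each character by the inverse of its minimum number of changes."""
    return {name: 1.0 / max(1, steps) for name, steps in min_changes.items()}


def weighted_distance(states_a, states_b, weights):
    """Share of the total character weight on which two dialects differ."""
    total = sum(weights.values())
    differing = sum(weight for name, weight in weights.items()
                    if states_a[name] != states_b[name])
    return differing / total


# Hypothetical minimum numbers of changes read off a character map.
min_changes = {"c1": 1, "c2": 4, "c3": 2}
weights = character_weights(min_changes)  # {'c1': 1.0, 'c2': 0.25, 'c3': 0.5}

dialect_x = {"c1": 0, "c2": 1, "c3": 0}
dialect_y = {"c1": 0, "c2": 0, "c3": 0}

# The two dialects differ only in the homoplastic character c2:
# unweighted distance 1/3, weighted distance 0.25 / 1.75.
print(weighted_distance(dialect_x, dialect_y, weights))
```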

References

Forkel, R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 2393-2400.

List, J.-M., M. Walworth, S. Greenhill, T. Tresoldi, and R. Forkel (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2: 130–144.

Norman, J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.): The Sino-Tibetan languages. Routledge: London and New York, pp. 72-83.

Szeto, P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Supplementary data

The data we used to create the analyses and figures provided in this post are available at https://github.com/cldf-datasets/szetosinitic/tree/master/examples