The Genealogical World of Phylogenetic Networks: What we know, what we know we can know, and what we know we cannot know

This is a guest blog post by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology

In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.

The task of scholars working in these disciplines is similar to that of a crime investigator or a doctor: detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this mode of reasoning in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction infers a currently unknown general rule from an initial state and its result state, while deduction infers the result state from an initial state and a general rule. Abduction, by contrast, seeks to infer initial states from result states by employing a general rule.
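The three modes can be made concrete with a minimal Python sketch. It is an illustration only, not a model of any real sound change: the rule table `RULE`, the strings, and the function names `deduce`, `induce`, and `abduce` are all invented for this example. The toy rule lets two ancestral sounds merge, which is enough to show why abduction generally yields several candidate initial states rather than a single answer.

```python
from itertools import product

# Toy "general rule" (invented for illustration): each ancestral sound
# maps to one descendant sound. Both "a" and "o" end up as "a", so the
# rule cannot be inverted uniquely.
RULE = {"p": "f", "t": "s", "a": "a", "o": "a"}


def deduce(initial, rule):
    """Deduction: initial state + general rule -> result state."""
    return "".join(rule[sound] for sound in initial)


def induce(pairs):
    """Induction: observed (initial, result) pairs -> general rule."""
    rule = {}
    for initial, result in pairs:
        for source, target in zip(initial, result):
            rule[source] = target
    return rule


def abduce(result, rule):
    """Abduction: result state + general rule -> candidate initial states."""
    sources = {}
    for source, target in rule.items():
        sources.setdefault(target, []).append(source)
    # For every observed sound, list all ancestral sounds that could
    # have produced it, then enumerate every combination.
    options = [sorted(sources.get(sound, [])) for sound in result]
    return ["".join(combo) for combo in product(*options)]


print(deduce("pota", RULE))            # fasa
print(induce([("pota", "fasa")]))      # {'p': 'f', 'o': 'a', 't': 's', 'a': 'a'}
print(abduce("fasa", RULE))            # ['pata', 'pato', 'pota', 'poto']
```

Deduction and induction each return one answer here, whereas abduction returns four equally rule-conforming candidates, and nothing in the rule itself says which of them was the actual initial state.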

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have centered on the question of the "nature of the proto-language". Using comparative techniques, linguists in the second half of the 19th century started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language that was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of representing knowledge in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):

Russian солнце [sɔnʦə]
Polish słońce [swɔnjʦɛ]
French soleil [sɔlɛj]
Italian sole [sole]
German Sonne [sɔnə]
Swedish sol [suːl]

According to modern historical linguistic theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun", and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, a matter of controversy. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as pure fiction (Schmidt 1872).

Linear B

While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. Just as we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.

Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.

Conclusion

What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some clustering procedure to group genes into sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.
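The biological half of this point can be illustrated with a small Python sketch of what "a similarity threshold plus a clustering procedure" might look like. Everything here is an assumption made for illustration: the toy sequences, the similarity measure (one minus normalized edit distance), the single-linkage rule, and the function names are not taken from any particular homology pipeline.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: one minus the normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def presumed_homologs(seqs, threshold):
    """Single-linkage clustering: a sequence joins (and merges) every
    existing cluster that contains a member at or above the threshold."""
    clusters = []
    for seq in seqs:
        merged, rest = [seq], []
        for cluster in clusters:
            if any(similarity(seq, other) >= threshold for other in cluster):
                merged = cluster + merged
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


toy = ["ACGT", "ACGA", "TTTT"]  # invented sequences, not real genes
print(presumed_homologs(toy, 0.7))  # [['ACGT', 'ACGA'], ['TTTT']]
print(presumed_homologs(toy, 0.8))  # [['ACGT'], ['ACGA'], ['TTTT']]
```

Moving the threshold from 0.7 to 0.8 changes which sequences count as "homologs" although the sequences themselves have not changed. In that sense the resulting sets are constructs of the procedure, not direct observations of shared ancestry.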

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.

References

Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprachen. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

Wednesday, April 15, 2015
