Showing posts with label Linguistics. Show all posts

Monday, June 1, 2015

An artistic language tree


On this blog we occasionally draw your attention to the overlap between the scientific world and the artistic world. The language tree shown below is from the Stand Still Stay Silent site, which describes itself as "a post apocalyptic webcomic with elements from Nordic mythology". The tree data apparently come from the Ethnologue language database.

The detail about the Nordic languages derives from the fact that the author, Minna Sundberg, is Finnish-Swedish, and the Scandinavian languages have next to nothing in common with the Finno-Ugric languages.

Posters and prints of the tree are available for purchase.


Wednesday, May 13, 2015

Homology and cognacy: fundamental historical relations between words


This is a guest blog post, following on from his previous post, by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

Introduction

All languages constantly change. Words are lost when speakers cease to use them, new words are gained when new concepts evolve, and even the pronunciation of the words changes slightly over time. Slight modifications that can barely be noticed during a person's lifetime sum up to great changes in the system of a language over centuries. When the speakers of a language diverge, their speech keeps on changing independently in the two communities, and at a certain point of time the independent changes are so great that they can no longer communicate with each other — what was one language has become two.

Demonstrating that two languages once were one is one of the major tasks of historical linguistics. If no written documents of the ancestral language exist, one has to rely on specific techniques for linguistic reconstruction (see the examples in this previous post). These techniques require us to first identify those words in the descendant languages that presumably go back to a common word form in the ancestral language. In identifying these words, we infer historical relations between them. The most fundamental historical relation between words is the relation of common descent. However, similarly to evolutionary biology, where homology can be further subdivided into the more specific relations of orthology, paralogy, and xenology, more specific fundamental historical relations between words can be defined for historical linguistics, depending on the underlying evolutionary scenario.

Homology and Cognacy in Linguistics and Biology

In evolutionary biology there is a rather rich terminological framework describing fundamental historical relations between genes and morphological characters. Discussions regarding the epistemological and ontological aspects of these relations are still ongoing (see the overview in Koonin 2005, but also this recent post by David). Linguists, in contrast, have rarely addressed these questions directly. Rather, they have assumed that the fundamental historical relations between words are more or less self-evident, with only a few counter-examples, which were largely ignored in the literature (Arapov and Xerc 1974; Holzer 1996; Katičić 1966). As a result, our traditional terminology to describe the fundamental historical relations between words is very imprecise and often leads to confusion, especially when it comes to computational applications that are based on software originally developed for applications in evolutionary biology.

As an example, consider the fundamental concept of homology in evolutionary biology. According to Koonin (2005: 311), it "designates a relationship of common descent between any entities, without further specification of the evolutionary scenario". The terms orthology, paralogy, and xenology are used to address more specific relations. Orthology refers to "genes related via speciation" (Koonin 2005: 311); that is, genes related via direct descent. Paralogy refers to "genes related via duplication" (ibid.); that is, genes related via indirect descent. Xenology, a notion which was introduced by Gray and Fitch (1983), refers to genes "whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters" (Fitch 2000: 229); i.e. to genes related via descent involving lateral transfer.

In historical linguistics, the only relation that is explicitly defined is cognacy (also called cognation). Cognacy usually refers to words related via “descent from a common ancestor” (Trask 2000: 63), and it is strictly distinguished from descent involving lateral transfer (borrowing). The term cognacy itself, however, covers both direct and indirect descent. Hence, traditionally, German Zahn 'tooth' is cognate with English tooth, and German selig 'blessed' with English silly, and German Geburt 'birth' with English birth, although the historical processes that shaped the present appearance of these three word pairs are quite different. Apart from the sound shape, Zahn and tooth have regularly developed from Proto-Germanic *tanθ-; selig and silly both go back to Proto-Germanic *sæli- 'happy', but the meaning of the English word has changed greatly; Geburt and birth stem from Proto-Germanic *ga-burdi-, but the English word has lost the prefix as a result of specific morphological processes during the development of the English language (all examples follow Kluge and Seebold 2002, with modifications for the pronunciation of Proto-Germanic). Thus, of the three examples of cognate words given, only the first would qualify as having evolved by direct inheritance, while the inheritance of the latter two could be labelled as indirect, involving processes which are largely language-specific and irregular, such as meaning shift and morpheme loss. Trask (2000: 234) suggests the term oblique cognacy to label these cases of indirect inheritance, but this term seems to be rarely used in historical linguistics; and at least in the mainstream literature of historical linguistics I could not find even a single instance where the term was employed (apart from the passage by Trask).


In the table above (with modifications taken from List 2014: 39), I have tried to contrast the terminology used in evolutionary biology and historical linguistics by comparing to which degree they reflect fundamental historical relations between words or genes. Here, common descent is treated as a basic relation which can be further subdivided into relations of direct common descent, indirect common descent, and common descent involving lateral transfer. As one can easily see, historical linguistics lacks proper terms for at least half of the relations, offering no exact counterparts for homology, orthology, and xenology in evolutionary biology.

Cognacy in historical linguistics is often deemed to be identical with homology in evolutionary biology, but this is only true if one ignores common descent involving lateral transfer. One may argue that the notion of xenology is not unknown to linguists, since the borrowing of words is a very common phenomenon in language history. However, the specific relation which is termed xenology in biology has no direct counterpart in historical linguistics: the term borrowing refers to a distinct process, not a relation resulting from the process. There is no common term in historical linguistics which addresses the specific relation between such words as German kurz 'short' and English short. These words are not cognate, since the German word has been borrowed from Latin cŭrtus 'mutilated' (Kluge and Seebold 2002). They share, however, a common history, since Latin cŭrtus and English short both (may) go back to Proto-Indo-European *(s)ker- 'cut off' (Vaan 2008: 158). The specific history behind these relations is illustrated in the following figure.


A specific advantage of the biological notion of homology as a basic relation covering any kind of historical relatedness, compared to the linguistic notion of cognacy as a basic relation covering direct and indirect common descent, is that the former is much more realistic regarding the epistemological limits of historical research. Up to a certain point, it can be fairly reliably demonstrated that the basic entities in the respective disciplines (words, genes, or morphological characters) share a common history. Demonstrating that more detailed relations hold, however, is often much harder. The strict notion of cognacy has forced linguists to set goals for their discipline which may often be far too ambitious to achieve. We need to adjust our terminology accordingly and bring our goals into balance with the epistemological limits of our discipline. In order to do so, I have proposed to refine our current terminology in historical linguistics to the schema shown in the table below (with modifications taken from List 2014: 44):


Fifty Shades of Cognacy

In a recent blog post, David pointed to the relative character of homology in evolutionary biology in emphasizing that it "only applies locally, to any one level of the hierarchy of character generalization". Recalling his example of bat wings compared to bird wings, which are homologous when compared as forelimbs but analogous when compared as wings, we can find similar examples in historical linguistics.

If we consider words for 'to give' in the four Romance languages Portuguese, Spanish, Provencal and French, then we can state that Portuguese dar and Spanish dar are homologous, as are Provencal douna and French donner. The former pair go back to the Latin word dare 'to give', and the latter pair go back to the Latin word donare 'to gift (give as a present)'. When Latin was commonly spoken, dare and donare were clearly separate words, denoting clearly separate concepts and being used in clearly separate contexts. The verb donare itself was derived from Latin donum 'present, gift'. Similarly to English, where nouns can easily be used as verbs, Latin allowed for specific morphological processes. In contrast to English, however, these processes required that the form of the noun be modified (compare English gift vs. to gift with Latin donum vs. donare).

What the ancient Romans (who spoke Latin as their native tongue) were not aware of is that Latin donum 'gift' and Latin dare 'to give' themselves go back to a common word form. This was no longer evident in Latin, but it was in Proto-Indo-European, the ancestor of the Latin language. Thus, Latin dare goes back to Proto-Indo-European *deh3- 'to give', and Latin donum goes back to Proto-Indo-European *deh3-no- 'that which is given (the gift)' (Meiser 1999; what is written as *h3 in this context was probably pronounced as [x] or [h]). The word form *deh3-no- is a regular derivation from *deh3-, so at the Indo-European level both forms are homologous, since one is derived from the other. That means, in turn, that Latin dare and donum are also homologs, since they are the residual forms of the two homologous words in Proto-Indo-European. And since Latin donare is a regular derivation of donum, this means, again, that Latin dare and donare are also homologous, as are the words in the four descendant languages, Portuguese dar, Spanish dar, Provencal douna, and French donner. Depending on the time depth we apply, we will arrive at different homology decisions. I have tried to depict the complex history of the words in the following figure:


Judging from the treatment in linguistic databases, many scholars do not regard these different "shades of homology" as a real problem. In most cases, scholars use a "lumping approach" and label as cognates all words that go back to a common root, no matter how far that root goes back in time (compare, for example, the cognate labeling for reflexes of Proto-Indo-European *deh3- in the IELex).

This labeling practice, however, may be contrary to the models that are used to analyze the data afterwards. All computational analyses model language evolution as a process of word gain and word loss. The words for the analyses are sampled from an initial set of concepts (such as 'give', 'hand', 'foot', 'stone', etc.) which are translated into the languages under investigation. If we did not know about the deeper history of Latin dare and donare, we would assume a regular process of language evolution here: at some point, the speakers of Gallo-Romance would cease to use the word dare to express the meaning 'to give' and use the word donare instead, while the speakers of Ibero-Romance would keep on using the word dare. This well-known process of lexical replacement (illustrated in the graphic below), which may provide strong phylogenetic signals, is lost in the current encoding practice where all four words are treated as homologs. Our current practice of cognate coding masks vital processes of language change.
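To see how the coding decision affects the data that a gain-and-loss model receives, consider the following minimal Python sketch. It assumes a simple presence/absence encoding with one column per cognate set; the set labels, the variable names, and the helper functions are hypothetical and chosen only for illustration.

```python
LANGUAGES = ["Portuguese", "Spanish", "Provencal", "French"]

# Cognate-set labels for the concept 'give' under two coding practices.
# "Lumped" follows the deepest known root; "split" stops at the Latin words.
LUMPED = {"Portuguese": "deh3", "Spanish": "deh3",
          "Provencal": "deh3", "French": "deh3"}
SPLIT = {"Portuguese": "dare", "Spanish": "dare",
         "Provencal": "donare", "French": "donare"}


def binary_matrix(coding):
    """Turn cognate-set labels into presence/absence columns, one per set."""
    sets = sorted(set(coding.values()))
    return {s: [1 if coding[lang] == s else 0 for lang in LANGUAGES]
            for s in sets}


def varies(column):
    """A column can only distinguish languages if it contains both 0 and 1."""
    return 0 in column and 1 in column


for name, coding in [("lumped", LUMPED), ("split", SPLIT)]:
    for cognate_set, column in binary_matrix(coding).items():
        print(name, cognate_set, column, "varies" if varies(column) else "constant")
```

Under the lumped coding, the single column is constant across all four languages and therefore carries no information about the replacement of dare by donare. Under the split coding, the two columns separate Ibero-Romance from Gallo-Romance.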


Outlook

Historical linguistics needs a more serious analysis of the fundamental processes of language change and the fundamental historical relations resulting from these processes. In the last two decades a large arsenal of quantitative methods has been introduced in historical linguistics. The majority of these methods come from evolutionary biology. While we have quickly learned to adapt and apply these methods to address questions of language classification and language evolution, we have forgotten to ask whether the processes these methods are supposed to model actually coincide with the fundamental processes of language evolution. Rather than adapting only the methods from evolutionary biology, we should also consider adopting its habit of having deeper discussions regarding the very basics of our methodology.

References

Arapov MV, Xerc MM (1974) Математические методы в исторической лингвистике [Mathematical methods in historical linguistics]. Moscow: Nauka. German translation: Arapov, M. V. and M. M. Cherc (1983). Mathematische Methoden in der historischen Linguistik. Trans. by R. Köhler and P. Schmidt. Bochum: Brockmeyer.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5, 227-231.

Gray GS, Fitch WM (1983) Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Molecular Biology and Evolution 1.1, 57-66.

Holzer G (1996) Das Erschließen unbelegter Sprachen. Zu den theoretischen Grundlagen der genetischen Linguistik. Frankfurt am Main: Lang.

Katičić R (1966) Modellbegriffe in der vergleichenden Sprachwissenschaft. Kratylos 11, 49-67.

Kluge F, Seebold E (2002) Etymologisches Wörterbuch der deutschen Sprache. 24th ed. Berlin: de Gruyter.

List J-M (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Meiser G (1999) Historische Laut- und Formenlehre der lateinischen Sprache. Darmstadt: Wissenschaftliche Buchgesellschaft.

Trask RL (2000) The Dictionary of Historical and Comparative Linguistics. Edinburgh: Edinburgh University Press.

Vaan M (2008) Etymological Dictionary of Latin and the Other Italic Languages. Leiden and Boston: Brill.

Wednesday, April 15, 2015

What we know, what we know we can know, and what we know we cannot know


This is a guest blog post by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology

In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.


The task of scholars working in these disciplines is similar to that of a crime investigator or a doctor: detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state of an initial state and a general rule. On the other hand, abduction seeks to infer initial states from result states by employing a general rule.
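The three modes of reasoning can be contrasted in a small Python sketch. Everything in it is a toy assumption made for illustration: the "general rule" is an invented sound change (word-initial p becomes f), and the word forms are made up rather than taken from any real data.

```python
def p_to_f(word):
    """Toy general rule (an assumption): word-initial p becomes f."""
    return "f" + word[1:] if word.startswith("p") else word


def identity(word):
    """Toy alternative rule: nothing changes."""
    return word


def deduce(rule, initial):
    """Deduction: initial state + general rule -> result state."""
    return rule(initial)


def induce(pairs, rules):
    """Induction: keep the candidate rules consistent with all observed
    (initial state, result state) pairs."""
    return [r.__name__ for r in rules
            if all(r(initial) == result for initial, result in pairs)]


def abduce(rule, result, candidates):
    """Abduction: result state + general rule -> possible initial states."""
    return [c for c in candidates if rule(c) == result]


print(deduce(p_to_f, "pater"))
print(induce([("pater", "fater")], [p_to_f, identity]))
print(abduce(p_to_f, "fater", ["pater", "fater", "mater"]))
```

Note that abduction returns two candidate initial states here, pater and fater, since both yield the observed form under the rule. The observed result alone does not decide between them, which is the situation historical disciplines typically find themselves in.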

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
According to modern historical linguistic theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun" and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, controversially debated. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as pure fiction (Schmidt 1872).

Linear B

While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. In a similar way to that with which we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.


Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.

Conclusion

What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some cluster procedure to cluster genes in sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.

References

Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprache. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

Wednesday, November 26, 2014

An outline history of phylogenetic trees and networks


This is the 300th post on this blog, and so I thought we might have a bit of a summary. Here is the early history of phylogenetic trees and networks as we currently know it. There may, of course, be as yet undetected sources. Details of each of these historical notes (including illustrations) can be found elsewhere in this blog; you can use the search feature in the right side-bar to find them.

Biology

Genealogies as pedigrees (the history of individuals) have a long history. For example, they appear in inscriptions concerning the pharaohs of Ancient Egypt, although these are very imprecise and have caused many headaches for modern scholars. They appear as chains of ancestors and descendants in the Old Testament of the Christian Bible, often contradicting each other and claiming impossible lifespans. Most importantly for modern usage, they were employed in the New Testament to legitimize Jesus as the messiah foretold in the Old Testament. The first known illustration of this appeared in c.400 AD, and it was actually a network, as there were two lineages leading to Jesus (via both Joseph and Mary).

The apparent success of this application (later called the Tree of Jesse, pictures of which started appearing in the 10th century) has meant that both royalty and the nobility have subsequently used pedigrees to assert their own right to be regal and noble. The first known illustration of this is from c.1000 AD, in which Cunigunde of Luxembourg's ancestry was traced in a tree-like manner to include Charlemagne, thus legitimizing her claim to being royal.

Also, up until 1215 AD marriage within seven degrees of separation was not allowed by the Christian church, and intestate inheritance applied the same relationship limit. So, a record of blood ties among relatives was often needed; and these started appearing in family bibles, for example. The first recorded tree-like illustrated pedigree was for Lambert of Saint-Omer, which appeared in 1122 AD in his personal copy of his book Liber Floridus.

It seems obvious, then, to also construct genealogies for groups of organisms, which we now call phylogenies (a word coined by Ernst Haeckel in 1866). The Great Chain of Being was for a long time the most popular iconography for relationships, mainly because it neatly tied in with the Christian philosophy of a chain of intellectual ideas, leading from pragmatic earthly concerns and culminating in the idealistic heavens. Humans were, of course, at the head of the chain of earthly beings, and capable of ascending to the heavens.

However, this did not work from a purely observational point of view. Observed pedigrees were not linear, but branched with each generation and often fused again via marriage. Furthermore, biodiversity (the patterns among groups of organisms) also seemed to have multiple relationships. This led Vitaliano Donati in 1750 (Della Storia Naturale Marina dell' Adriatico) to suggest that:
In addition, the links of the chain are joined in such a way within the links of another chain, that the natural progressions should have to be compared more to a net than to a chain, that net being, so to speak, woven with various threads which show, between them, changing communications, connections, and unions. [from the original Italian]
He was not alone in this thought, although others chose different metaphors. For example, Carl von Linné in 1751 (Philosophia Botanica) wrote this:
All plants show affinities on either side, like territories in a geographical map. [from the original Latin]
Neither author published a reticulating diagram to illustrate their thoughts, although one of Linné's students subsequently produced a version of his ideas in 1792 (Caroli a Linné, Praelectiones in Ordines Naturales Plantarum).

So, it was Georges-Louis Leclerc, Comte de Buffon, who produced the first empirical phylogeny in 1755 (Histoire Naturelle Générale et Particulière, Tome V). This was a network showing the evolutionary origin of domesticated dog breeds. This was followed by Antoine Nicolas Duchesne in 1766 (Histoire Naturelle des Fraisiers), who produced a network showing the evolutionary origin of strawberry cultivars. In both cases the evolutionary process illustrated by the reticulations in the network was hybridization. Note that both of these diagrams refer to within-species genealogies, rather than to relationships between species; and neither author seems to have contemplated the idea of among-species phylogenies.

Thus, in both theory and practice modern phylogenetic metaphors started as networks, not trees. It was Peter Simon Pallas in 1776 (Elenchus Zoophytorum) who first suggested using a tree as a simplified metaphor:
As Donati has already judiciously observed, the works of Nature are not connected in series in a Scale, but cohere in a Net. On the other hand, the whole system of organic bodies may be well represented by the likeness of a tree that immediately from the root divides both the simplest plants and animals, [but they remain] variously contiguous as they advance up the trunk, Animals and Vegetables; [from the original Latin]
Again, no diagram was forthcoming to illustrate this. It was Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck, who finally produced an empirical phylogeny in 1809 (Philosophie Zoologique). This was a small tree showing the evolutionary relationships among the major groups of animals. However, it represented what we would now call transformational evolution, as Lamarck did not believe in extinction, and thus he showed one group transforming into another. This differed from both Buffon and Duchesne, who were illustrating a process of increasing diversity of groups. It also differed by referring to supra-species relationships.

For the next 50 years, diagrams showing biodiversity relationships illustrated what we now call patterns of affinity, rather than showing historical relationships. These affinity diagrams showed apparent similarities among groups of organisms, without any implication that the relationships were the result of evolutionary history. The majority of these diagrams were networks rather than trees, indicating that groups of organisms had observed similarities with several other groups.

It is Charles Darwin and Alfred Russel Wallace who are credited with introducing, in 1858, the idea that natural selection could be the important process by which new species arise, although the idea of natural selection itself had been "in the air" for more than half a century with respect to within-species variation. (In the case of Patrick Matthew, he had also suggested a role in the origin of new species; 1831, On Naval Timber and Arboriculture; with Critical Notes on Authors who have Recently Treated the Subject of Planting).

As was by now becoming a tradition, neither Darwin nor Wallace (nor Matthew) produced a diagram to illustrate their thoughts. Darwin did draw a theoretical diagram in his subsequent 1859 book (On the Origin of Species by Means of Natural Selection), but he used it to illustrate continuity of evolutionary descent and the processes of extinction and diversification, rather than strictly as representing a phylogeny. His famous "Tree of Life" metaphor had nothing to do with the diagram (it was a Biblical metaphor, to stimulate the imagination of his readers).

The first person to get into print what we could call an empirical diagram representing Darwin's idea was Johann Friedrich Theodor Müller in 1864 (Für Darwin), who drew a small (three-species) tree of amphipods. This was followed by St George Jackson Mivart in 1865 (Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592). This was a much more extensive diagram illustrating possible evolutionary relationships among primate species (including humans) based solely on their body skeleton.

Confusion between trees and networks reappeared at this time. In particular, Franz Martin Hilgendorf had produced an unpublished PhD thesis in 1863 (Beiträge zur Kenntniß des Süßwasserkalkes von Steinheim) during which he constructed an empirical network of relationships among extinct snail species; but he rejected this because it did not match the Darwinian idea of an evolutionary tree. He later collected more data, and instead published a phylogenetic tree in 1866 (Planorbis multiformis im Steinheimer Süßwasserkalk: ein beispiel von gestaltveränderung im laufe der zeit).

Thus, we last saw an explicit evolutionary network in 1766, referring to within-species variation. The first person to publish an evolutionary network showing relationships among species was apparently Ferdinand Albin Pax in 1888 (Monographische übersicht über die arten der gattung Primula. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241). He produced 14 networks of various primula species, apparently showing affinity relationships, but three of these also illustrate hybridization, which is strictly an evolutionary process.

Anthropology

Genealogies appear in anthropology as well as in biology. Any human creation can be considered to have a history of "descent with modification" if copies are passed from generation to generation (eg. languages, books, tales). For our purposes here, the most important historical developments were in linguistics (language studies) and in stemmatology (manuscript studies).

Georg Stiernhielm appears to have been the first linguist to draw a genealogy, when he produced a small network of Germanic languages in 1671 (De Linguarum Origine Præfatio, the preface to his edition of Evangelia ab Ulfila Gothorum). This was followed by Félix Gallet in c.1800 (Arbre Généalogique des Langues Mortes et Vivantes), who produced a single broadsheet with a network of Indo-European languages.

Note that, as for biology, the modern metaphors started as networks, not trees. More importantly, note that Stiernhielm's diagram pre-dated Buffon's dog network by more than 80 years — evolutionary ideas were less revolutionary in linguistics than they were in biology.

Darwin explicitly noted a connection between language genealogies and biology genealogies in 1859. However, the first people to get into print what we could call empirical diagrams representing Darwin's idea did so before Darwin published anything on the subject. In 1853 František Ladislav Čelakovský published a tree depicting a history of the Slavic languages (Čtení o Srovnávací Mluvnici Slovanské na Universitě Pražskě), and August Schleicher published one on the development of the Indo-Germanic language family (Die ersten Spaltungen des Indogermanischen Urvolkes. Allgemeine Monatsschrift für Wissenschaft und Literatur 1853: 786-787).

Stemmatology differs from linguistics and biology in first producing a tree rather than a network. Hans Samuel Collin and Carl Johan Schlyter produced this in 1827 (first volume of Corpus Iuris Sueo-Gotorum Antiqui), with a tree of relationships among hand-written copies of documents containing the Medieval laws of Sweden. This was also a tree that represented Darwin's genealogical idea, and so it may be considered to be the first one of that type to be published (ie. 25 years before Čelakovský and Schleicher, and 30 years before Darwin).

This early lead was followed by the first network in 1832, when Friedrich Wilhelm Ritschl's stemma of a book by Thomas Magister (Thomae Magistri sive Theoduli Monachi Ecloga vocum Atticarum) explicitly showed sources of contamination among the manuscript copies; that is, different parts of a manuscript were copied from different sources, rather than by strict ancestor-descendant copying.

Interestingly, the tree metaphor didn’t endure in anthropology as well as it did in biology. It was quickly replaced by alternative metaphors, such as wave, web, warp & weft, lattice and other continuously reticulating images. Horizontal flow of information has always been seen as a dominant force in anthropological histories.

Timeline

Networks

1671 Georg Stiernhielm — small language network
1750 Vitaliano Donati — biology network suggestion
1751 Carl von Linné — biology map suggestion
1755 Georges-Louis Leclerc, Comte de Buffon — intra-species network
1766 Antoine Nicolas Duchesne — intra-species network
1792 Carl von Linné — map
1800 Félix Gallet — language network
1832 Friedrich Wilhelm Ritschl — small manuscript network
1863 Franz Martin Hilgendorf — unpublished inter-species network
1888 Ferdinand Albin Pax — inter-species network

Trees

1776 Peter Simon Pallas — biology tree suggestion
1809 Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck — small inter-species tree
1827 Hans Samuel Collin and Carl Johan Schlyter — manuscript tree
1853 František Ladislav Čelakovský — language tree
1853 August Schleicher — language tree
1859 Charles Robert Darwin — generalized tree
1864 Johann Friedrich Theodor Müller — small inter-species tree
1865 St George Jackson Mivart — large inter-species tree
1866 Franz Martin Hilgendorf — large inter-species tree

Monday, May 12, 2014

Automated natural language processing


Natural language processing is all about getting computers to automatically extract information from natural (human) languages, rather than from specially designed computer languages, or even from mathematical datasets.

Each year the Conference on Computational Natural Language Learning (CoNLL) features a practical task, in which participants train and test their own language-parsing systems on exactly the same natural-language datasets. For the tenth CoNLL (CoNLL-X), in 2006, the task was Dependency Parsing. (Previous tasks had included chunking, clause identification, named entity recognition, and semantic role labeling.)


Parsing refers to identifying the words, their associated part of speech (noun, verb, etc) and their syntactic relations (subject, predicate, etc) based on the formal rules of grammar. In computational linguistics the result is often represented as a tree diagram showing the relationships among the words. From this tree we can try to understand the exact meaning of the text. Wikipedia, of course, has an article with more details, if you are interested.

For the 2006 CoNLL, the 18 parsing algorithms were tested using treebanks for 12 different languages. In linguistics, a treebank is a previously parsed body of text with the syntactic or semantic sentence structure annotated. So, the idea is to use some existing treebanks (produced by hand) to train the parsers, and then test them on some new treebanks, to see if they can produce the correct tree. In particular, the testing in 2006 involved what is called dependency grammar, which gives primacy to the verb as the structural center of a clause.
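As a rough illustration of what "producing the correct tree" can mean in practice, a dependency tree can be stored as one head position per word, and a parser can then be scored by the fraction of words whose predicted head matches the treebank. The sketch below is only a minimal example under that assumption: the sentence, the head positions and the function name are invented for illustration, and it is not the official CoNLL-X scoring procedure.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of words whose predicted head equals the treebank head.

    Each list holds, for the word at position i (counting from 1), the
    position of its head word; 0 stands for the artificial sentence root.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both analyses must cover the same words")
    if not gold_heads:
        raise ValueError("empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Hypothetical sentence: "The dog chased cats"
# Treebank tree: The -> dog, dog -> chased, chased -> root, cats -> chased
gold = [2, 3, 0, 3]
# Hypothetical parser output that wrongly attaches "The" to the verb
predicted = [3, 3, 0, 3]

score = unlabeled_attachment_score(gold, predicted)
print(f"{score:.2f}")  # 3 of the 4 heads agree with the treebank
```

Averaging such a score over all test sentences gives one success figure per parser and language, which is the kind of table analyzed below.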

The paper by Buchholz and Marsi (2006) discusses the treebanks for the 12 languages, describes how they were converted into the same dependency format, and provides an overview of the parsing approaches taken by the 18 participants. The methods are named after the first author of the associated paper.

I analyzed the results using a couple of phylogenetic networks. As usual, I used the Manhattan distance to evaluate the multivariate relationships in the data, and displayed this using a NeighborNet.
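For readers unfamiliar with it, the Manhattan distance between two parsing methods is simply the sum of the absolute differences in their scores across the languages. The following is a minimal sketch with invented numbers (the method names and scores are placeholders, not the CoNLL-X results); the NeighborNet step, which turns the resulting distance matrix into a network, is not shown.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between two equally long score profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def distance_matrix(profiles):
    """Pairwise Manhattan distances, keyed by (name, name)."""
    names = sorted(profiles)
    return {
        (a, b): manhattan_distance(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Invented percentage scores for three methods on three languages
profiles = {
    "method_a": [80, 70, 60],
    "method_b": [78, 69, 58],
    "method_c": [60, 50, 45],
}

distances = distance_matrix(profiles)
print(distances[("method_a", "method_b")])  # small: similar success
print(distances[("method_a", "method_c")])  # large: very different success
```

Transposing the table (one profile per language, one score per method) gives the distances among languages in the same way.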

The first graph shows the relationships among the different parsing methods. Methods near each other in the network have a similar parsing success, while methods further apart are progressively more different from each other.


The methods form a simple gradient of increasing average success, from top-left to bottom-right. This means that the methods do not vary much in their success from language to language — if they are successful at parsing one language then they are successful on the other languages as well, and if not then not.

Perhaps this is not unexpected. However, the two most successful methods, by McDonald and Nivre, have quite different approaches to parsing — they differ on nine of the ten characteristics listed by Buchholz and Marsi (2006). Their very similar success is therefore noteworthy — there is apparently more than one way of skinning this particular cat.

The second graph shows the relationships among the different languages used. Languages near each other in the network have a similar parsing success, while languages further apart are progressively more different from each other.


The languages also form a simple gradient of increasing average success, from top-right to bottom-left. The average success at parsing Japanese was 86% (range 65-92%) and the average success at parsing Turkish was 56% (range 38-66%). This does not necessarily mean that Japanese is generally easier to parse than Arabic, Slovene and Turkish, because the datasets themselves varied considerably in the type of text contained in their treebanks. Nevertheless, Arabic, Slovene and Turkish are all "morphologically rich" languages, and parsing them is expected to be hard. It is interesting to note that Dutch is different from the other Germanic languages (Danish, German and Swedish), and Spanish is different from Portuguese.

The practical task for the 2014 conference will be Grammatical Error Correction, which was also the task for 2013. The parsers will be given short English texts written by non-native speakers of English, and they will be evaluated on their ability to detect the grammatical errors and provide corrected texts. English is an ideal language for this task, as it is often suggested that for every native speaker of English there are 4–5 non-native speakers, and therefore automated correction of text would be of enormous practical value. (Mandarin Chinese has more speakers in total, but most of these are native speakers.)

Reference

Buchholz, Sabine and Marsi, Erwin (2006) CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of the 10th Conference on Computational Natural Language Learning, pp. 149-164. Association for Computational Linguistics, Stroudsburg, PA.

Monday, November 18, 2013

Language history and language weirdness


Native speakers of any language will judge the "difficulty" of another language by how much it differs from their own. For example, the Foreign Service Institute (FSI) of the U.S. Department of State lists five categories of increasing time taken for native English speakers to acquire "General Professional Proficiency" in other languages. This refers to an average, of course, and anyone may personally find one language or another easier or more difficult than others.


FSI Category I (the least time needed) includes most of the Germanic and Romance languages, since English was originally a Germanic language that received a huge Romance input after the Normans turned up in Britain in 1066. The exception is German itself, which is alone in Category II (needing longer), because of its more complex grammar. Category V (the longest time needed for proficiency) consists of Arabic, Cantonese, Japanese, Korean and Mandarin, with Japanese being considered the most difficult.

Most languages are in Category IV, including the rest of the Indo-European languages. The recognizably tougher ones in that group are the Uralic languages (Estonian, Finnish and Hungarian), because of their countless noun cases. Interestingly, Category III (easier than IV) consists of Indonesian, Malaysian and Swahili, which have no known historical connection to English — they just happen to have fewer linguistic differences than do the other languages.

And that is the point of this post — linguistic similarities don't necessarily reflect the evolutionary history of the languages. There are trees allegedly showing the genealogy of languages, because there is vertical transfer of information in the history of languages (generation to generation), but horizontal transfer has also been a powerful evolutionary force, as cultures come in contact with each other. The history of English, as noted above, shows both vertical (Germanic) and horizontal (Romance) influences. Language history is a reticulating network, not an evolutionary tree.

Just as importantly, though, languages can have coincidental similarities. There are, after all, not that many different ways of constructing a language, and there are reported to be ~6,900 distinct languages on this planet. So, chance similarities must abound — what in biology we would call parallelisms and convergences. This makes constructing the evolutionary history of languages difficult.

The complexity created by coincidences has led some people to wonder about how "unusual" any one language might be. This can be defined as how many of its characteristics occur commonly in other languages, and how many of them occur more rarely. The most unusual languages will be those that have lots of the rare features; and we might call them linguistic outliers. The Idibon blog has already had a look at this topic (The weirdest languages), and here I reconsider their data in the light of a phylogenetic network.

The data

The original data come from the World Atlas of Language Structures, which describes itself as "a large database of structural (phonological, grammatical, lexical) properties of languages gathered by a team of 55 authors". There are apparently 2,676 different languages in the database, coded for 192 linguistic features. Sadly, the database is very sparse, so that most languages have not yet been coded for most of the features (there are 5–1,519 languages coded for each feature).

So, the Idibon people selected a subset of the data: 1,693 languages and 21 features. These features were chosen to be an uncorrelated subset of those 165 features that have at least 100 languages coded; and the selected languages each have at least 10 features coded.

The features are certainly an eclectic collection, which you can read about on the WALS site:
83A: Order of Object and Verb
87A: Order of Adjective and Noun
143A: Order of Negative Morpheme and Verb
143G: Minor Morphological Means of Signaling Negation
69A: Position of Tense-Aspect Affixes
116A: Polar Questions
57A: Position of Pronominal Possessive Affixes
101A: Expression of Pronominal Subjects
6A: Uvular Consonants
71A: The Prohibitive
129A: Hand and Arm
130A: Finger and Hand
44A: Gender Distinctions in Independent Personal Pronouns
14A: Fixed Stress Locations
9A: The Velar Nasal
72A: Imperative-Hortative Systems
111A: Nonperiphrastic Causative Constructions
64A: Nominal and Verbal Conjunction
124A: 'Want' Complement Subjects
117A: Predicative Possession
19A: Presence of Uncommon Consonants
From the subset of languages, I chose all of those languages with at least 12 of these features coded, plus Icelandic (10 features), and Cornish and Gaelic (Scots) (11 features).

I then tried to fill in some of the missing data, to get as many languages as easily possible up to having 14 features coded (ie. two-thirds of the features). For the phonology features (6A, 9A, 19A), the relevant information can be looked up on the web, particularly in Wikipedia and the Native American Language Net. For the word features (129A, 130A), I used the LEXILOGOS Online Translation.

In the process, I found that Idibon has at least one feature mis-coded compared to the WALS web site: for feature 14A, some of the languages that should be coded "Second" have been coded as "Antepenultimate", and all of the others that should be coded "Second" have missing data.

I also found a few contradictions between the WALS coding and the information elsewhere on the web. In some of these cases I re-coded the WALS data.

My final spreadsheet is available online. There are 280 languages coded for at least 14 of the 21 features, compared to 239 such languages in the Idibon analysis. Overall, 19% of the data are still missing, varying from 0–53% across the 21 features.
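The kind of bookkeeping described above (keeping only languages with enough features coded, then checking how much is still missing per feature) can be sketched as follows. This is a toy example only: the language names, feature labels and values are placeholders rather than WALS data, and the layout of the actual spreadsheet is not assumed.

```python
def languages_with_coverage(data, min_coded):
    """Names of languages with at least `min_coded` features coded."""
    return sorted(
        lang
        for lang, feats in data.items()
        if sum(value is not None for value in feats.values()) >= min_coded
    )


def missing_fraction_by_feature(data, languages):
    """For each feature, the share of the given languages lacking a value."""
    if not languages:
        raise ValueError("no languages selected")
    features = sorted({f for lang in languages for f in data[lang]})
    return {
        f: sum(data[lang].get(f) is None for lang in languages) / len(languages)
        for f in features
    }


# Placeholder data: None marks a feature that has not been coded
data = {
    "lang_a": {"F1": "x", "F2": "y", "F3": None},
    "lang_b": {"F1": "z", "F2": None, "F3": None},
    "lang_c": {"F1": "z", "F2": "w", "F3": "v"},
}

kept = languages_with_coverage(data, min_coded=2)
missing = missing_fraction_by_feature(data, kept)
print(kept)
print(missing)
```

With the real data, `min_coded` would be 14 and the per-feature fractions would correspond to the 0–53% range quoted above.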

The network

My network is intended as an exploratory data analysis, rather than some attempt at an evolutionary diagram. Thus, the network simply displays the apparent similarity among the languages. That is, languages that are closely connected in the network are similar to each other based on their linguistic features, and those that are further apart are progressively more different from each other.

First, I recoded the multivariate linguistic data as 59 binary characters. Then the similarity among the 280 languages was calculated for each pair of languages using the Gower similarity index, which can accommodate missing data (by ignoring features that are missing for each pairwise comparison). A Neighbor-net analysis was then used to display the between-language similarities as a phylogenetic network.
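A small sketch may make these two steps more concrete. It assumes that each multistate feature is recoded as one binary character per state, and that each binary character is treated as a two-state qualitative character, so that shared 0s and shared 1s both count as matches; both choices are assumptions for illustration, not a record of the exact settings used here. The feature states and language names are placeholders.

```python
def one_hot(value, states):
    """Recode one multistate value as binary characters (None stays missing)."""
    if value is None:
        return [None] * len(states)
    return [1 if value == state else 0 for state in states]


def gower_similarity(a, b):
    """Share of matching characters among those coded in both languages.

    Characters missing in either language are skipped for that pair.
    Returns None when the two languages have no coded character in common.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


# Placeholder features: F1 has three states, F2 has two
f1_states = ["p", "q", "r"]
f2_states = ["u", "v"]

lang_a = one_hot("p", f1_states) + one_hot("u", f2_states)
lang_b = one_hot("q", f1_states) + one_hot(None, f2_states)  # F2 not coded
lang_c = one_hot("p", f1_states) + one_hot("v", f2_states)

print(gower_similarity(lang_a, lang_c))  # compared on all five characters
print(gower_similarity(lang_a, lang_b))  # compared on the three F1 characters only
```

A distance for the network can then be taken as one minus the similarity.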


The network is not very tree-like, is it? A few tentative groups can be recognized, as indicated by my colouring, but that is all. These groups do not correspond to any known language groups, meaning that the language features chosen do not reveal a traditional tree-like genealogy. Whether this reflects horizontal transfer of linguistic features, coincidence, or simply inadequate data, is not necessarily clear.

However, it seems most likely that much of the complexity represents coincidence. In the study of language evolution, parallelism and convergence are not nuisances, which is the way they are treated when constructing phylogenies of organisms. Coincidental similarities are a fundamental part of language history, but they are not necessarily the product of processes like natural selection, as they often are in biology.

If we look at some of the details, the nature of the complexity becomes clearer, as shown in the next figure. Here, I have colour-coded the Indo-European family of languages by their so-called "genus", plus the other languages that occur in Europe (the Uralic group, and Basque):
Albanian - pale brown
Armenian - dark brown
Baltic - orange
Celtic - pale blue
Germanic - black
Greek - pale green
Indic - pink
Iranian - blue
Romance - purple
Slavic - green
Uralic - red
Basque - grey


Note that the seven Germanic languages are clustered in a single location, as are the two Baltic languages. The others appear in either two (Celtic, Romance, Iranian) or four (Indic, Slavic, Uralic) locations. This implies considerable linguistic variation within most of what are considered to be closely related languages (that is why they are called language genera). A larger collection of features might change the pattern, of course, but I still reckon that there is a large component of non-vertical transmission here. This is either coincidence or horizontal transmission. For the Indo-European languages, the latter is perhaps quite likely; but it is equally likely that it is simply coincidence, even at this relatively fine scale.

The weirdest languages

The Idibon blog tried to reduce the multivariate data down to a single number for each language (scaled 0–1), representing its "weirdness" in terms of how many uncommon features it has. So, I have performed the same calculation for my expanded dataset.
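The exact formula is not reproduced in this post, so the following is only a sketch of one plausible way to compute such an index: for each feature, find how common a language's value is among the languages coded for that feature, average those frequencies over the features the language has coded, and subtract from 1. The averaging rule, the function names and the toy data are all assumptions for illustration, and are not necessarily the calculation used by Idibon or in the spreadsheet.

```python
from collections import Counter


def value_frequencies(data):
    """For each feature, the relative frequency of every coded value."""
    counts = {}
    for feats in data.values():
        for feature, value in feats.items():
            if value is not None:
                counts.setdefault(feature, Counter())[value] += 1
    return {
        feature: {v: n / sum(c.values()) for v, n in c.items()}
        for feature, c in counts.items()
    }


def weirdness(data, language):
    """1 minus the mean frequency of the language's own feature values."""
    freqs = value_frequencies(data)
    coded = [(f, v) for f, v in data[language].items() if v is not None]
    if not coded:
        return None  # nothing coded, so no score
    mean_frequency = sum(freqs[f][v] for f, v in coded) / len(coded)
    return 1 - mean_frequency


# Placeholder data: two features, None marks a missing value
data = {
    "lang_a": {"F1": "common", "F2": "common"},
    "lang_b": {"F1": "common", "F2": "common"},
    "lang_c": {"F1": "common", "F2": "rare"},
    "lang_d": {"F1": "rare", "F2": None},
}

for name in sorted(data):
    print(name, round(weirdness(data, name), 4))
```

Note that in this sketch a language with few coded features is scored on those features alone, so its score can shift noticeably when missing values are filled in.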

The complete list is in the spreadsheet, but here are the top and bottom most-unusual languages:
Top 20
Mixtec (Chalcatongo) - 0.9725
Seri - 0.9354
Nenets - 0.9346
Diegueño (Mesa Grande) - 0.9196
Oromo (Harar) - 0.9187
Choctaw - 0.9138
Kutenai - 0.9079
Iraqw - 0.9005
Danish - 0.8843
Kongo - 0.8830
Norwegian - 0.8751
Dutch - 0.8705
Swedish - 0.8585
German - 0.8581
Armenian (Eastern) - 0.8473
Abkhaz - 0.8445
Mumuye - 0.8410
Ju|'hoan - 0.8346
Khoekhoe - 0.8300
Ladakhi - 0.8247

Bottom 20
Kanuri - 0.2410
Kunama - 0.2401
Kiowa - 0.2361
Marathi - 0.2752
Khanty - 0.2149
Turkish - 0.2145
Bulgarian - 0.2112
Wichita - 0.2054
Manam - 0.2085
Kewa - 0.1984
Sentani - 0.1952
Bororo - 0.1534
Usan - 0.1508
Cantonese - 0.1435
Hungarian - 0.1316
Chamorro - 0.1285
Ainu - 0.1277
Cherokee - 0.1232
Purépecha - 0.0997
Hindi - 0.0872

My results differ from those of the Idibon blog for two reasons: more languages, and more data for some of the languages. Some of my added languages make it to the top of the weirdness list, including Seri, Danish and Swedish; and some of the other languages considerably change their score: for example, Hebrew, Welsh, Portuguese and Chechen are now near the top of the list, and Quechua, Basque, Saami and Cornish are no longer near the bottom. All of the big changes are increases in weirdness, suggesting that the missing data are important for this calculation.

Nevertheless, it is worth noting that five of the seven Germanic languages are in the top 15 (plus English is at 40 and Icelandic 47). Unusually, most of the Germanic languages still use cases (modifications to words that show how they relate to other words in a sentence). This means that you have to memorize a lot of different versions of each noun, just as you do in Latin. Moreover, these languages change the word order when asking a question as opposed to making a statement, whereas most languages add a particle instead. (In the most unusual language, Mixtec, a native language from Mexico, there is apparently no difference between a question and a statement!)

English has a lower score than other Germanic languages presumably because of the French influence mentioned above (French is ranked 42). For example, English now retains very few of the cases found in the other Germanic languages (only for some pronouns); instead, it uses a fairly strict word order to express grammatical relationships. (You will note that two of the English-speaking authors of this blog now live in countries with other Germanic languages, and so we know just how big a pain it is to learn illogical case endings.)

English does have one really odd feature, though, which is the use of the sound "th" (which is part of feature 19A). There are two forms of this sound, voiced (as in "the") and unvoiced (as in "thing"). These sounds do not exist in most languages, and they are rare even among the other Indo-European languages. That is why you often hear non-native speakers say "dis" and "zis" instead of "this" — "th" is a sound that they have no experience making.

Actually, the Indo-European languages are very diverse in their weirdness. Many of them are at the top of the list, but there are also some at the bottom, including Hindi which is dead last. Notably, three of the Romance languages are at the top (Spanish, Portuguese, French) and two are at the bottom (Romanian, Italian). This seems unlikely, given the overall similarity of Spanish and Italian, for example; and so it probably reflects the specific choice of linguistic features.

The data are also potentially sensitive to some of the feature coding. One notable example is for feature 19A in Arabic. WALS codes Arabic as having pharyngeals but not "th", while Wikipedia says that the pharyngeals are doubtful, but that Arabic has "th". So, the possible codings of Arabic, and their resulting weirdness, are:
"Th" sounds only: score 0.0893, weirdness 0.6788
Pharyngeals only: score 0.0469, weirdness 0.7416
Pharyngeals and "th": score 0.0045, weirdness 0.9245
So, this feature alone can potentially change Arabic from "normal" to "very weird", depending on how it is coded.

Conclusion

Languages do not have a tree-like evolutionary history. Even the relatively small dataset presented here seems to show the influence of horizontal evolution. But, more importantly, we should not underestimate the coincidental occurrence of language features (parallelism and convergence). These have usually been treated as a nuisance in phylogenetic studies of organisms, but they are likely to be important for the study of languages. I have discussed this further in a previous post (False analogies between anthropology and biology).

Monday, July 22, 2013

The earliest tree / network of languages (1671)


Urmas Sutrop (2012), who seems to have dug deeper into linguistic history than most other researchers, has noted that: "The first language family trees I managed to track down date from the 17th century. To my knowledge, the very first language family tree was published by the Estonian-Swedish scholar Georg Stiernhielm."

Actually, Stiernhielm's "tree" is a hybridization network, thus making it also the first known phylogenetic network, of any type.


Georg Stiernhielm (1598-1672) was a civil servant, linguist and poet. He is best known as "the father of Swedish poetry" (he didn't write many poems, but their language form was very influential), but here we are interested in his linguistic work. In particular we are interested in his 1671 edition of Wulfila's "Gothic Bible": D.N. Jesu Christi SS. Evangelia ab Ulfila Gothorum in Moesia Episcopo Circa Annum à Nato Christo CCCLX. Ex Græco Gothicé translata, nunc cum Parallelis Versionibus, Sveo-Gothicâ, Norrænâ, seu Islandicâ, & vulgatâ Latinâ edita (published by Nicolai Wankif, Stockholmiæ). A copy is available from Google Books.

The Gothic Bible or Wulfila Bible is the Christian Bible as translated by Bishop Wulfila into the Gothic language spoken by the Eastern Germanic, or Gothic, Tribes in c.350 AD. Wulfila invented the Gothic alphabet, composed of Greek letters and runic signs improvised by himself, so that he could do this, and it is thus considered to be the first text written in a Germanic language.

Stiernhielm's edition sets out four texts in parallel (ie. four columns per double page): Gothic, Icelandic, Swedish (called Suedo-Gothic), and finally "vulgar Latin". The transliteration of the Gothic text is in Latin font, the Icelandic and Swedish translations appear in the so-called "Gothic" letters, and the Latin translation is, naturally, in Latin font.

What is important to us, however, is that Stiernhielm took the opportunity to present a 48-page preface: De Linguarum Origine Præfatio [Preface on the Origin of Language], in which he discussed his ideas about the origins of languages. The diagram shown above (from page xxxvi) is apparently intended to illustrate the idea that three Germanic dialects [Svevica, Mechlenbergia, Brabantica] could gradually merge into one new dialect [Lingua Nova], which would be different from the earlier ones but would still be a Germanic dialect [ipsa Germanica]. This is thus explicitly a hybridization network.


However, Stiernhielm went much further than this. As Umberto Eco (1995) has described, there has long been the idea (dating back at least to the Christian Bible, and the story of the Garden of Eden) that there once existed a language which perfectly and unambiguously expressed the essence of all possible things and concepts, and that the jumble of modern languages is a confused corruption of this "perfect language" (this is the story of the Tower of Babel). Many European philosophers have speculated about a solution for this modern confusion, either by trying to retrieve the language spoken in the Garden of Eden, or by thinking of a "Language of Reason" that would possess the perfection of the lost speech of Eden.

The languages that have been proposed as this "perfect language" include, in time order:
Hebrew, Gaelic, Tuscan, Dutch, German, Swedish, English, and French.
Stiernhielm, in his Preface, was responsible for the suggestion of Swedish.

Stiernhielm's argument was that Old Swedish (Suedo-Gothic) came closest to the "primaeval language" because Old Swedish was a Japhethian language. In the Bible, Japheth had not been present under the Tower of Babel, and therefore was not involved in the subsequent confusion of languages. Stiernhielm argued that the language of Japheth and his descendants ought thus to be a continuation of the language spoken in the Garden of Eden. He concluded that all of the Gothic dialects arose from this stock (he illustrates this with family trees), and he considered Old Swedish to be the most archaic Japhethian language.

This patriotic conclusion was not at all out of place in 17th century Sweden. The Swedish empire was then at its height, covering most of northern Europe. Indeed, shortly after Stiernhielm, Olof Rudbeck (a professor at Uppsala University) wrote a four-volume work (Atlantica sive Manheim) supporting the idea that Swedish was the original language of Adam, and also identifying Sweden as Atlantis, the cradle of civilization, from which civilization spread to the rest of the world. He did some useful things, too, including founding what later became Linnaeus' botanical garden.

Thanks to Johann-Mattis List for alerting me to Sutrop's paper, and thus leading me to Stiernhielm's work.

References

Eco U. (1995) The Search for the Perfect Language. Wiley-Blackwell.

Sutrop U. (2012) Estonian traces in the Tree of Life concept and in the language family tree theory. Journal of Estonian and Finno-Ugric Linguistics 3: 297-326.

Monday, June 24, 2013

The first Darwinian evolutionary tree


Tassy (2011) has pointed out that a Darwinian evolutionary tree has certain key characteristics that (in combination) distinguish it from other models of evolution, such as those devised by Darwin's predecessors:
  • it includes ancestral and descendant forms
  • ancestral taxa are species not higher taxa
  • extant taxa are only at the leaves not the internal nodes or edges
  • there is a gradual transition between forms
  • there is splitting of lineages.

Almost all of the early trees and networks do not match at least one of these criteria. For example, the earliest networks, those of Buffon in 1755 and Duchesne in 1766, illustrated within-species relationships in dog breeds and strawberry cultivars, respectively, so that contemporary taxa appeared at internal nodes (ie. some breeds or cultivars were seen as ancestors of others).

Lamarck's famous tree of 1809 showed relationships between higher taxonomic groups rather than species, and had several such groups transforming into other groups, so that the interior nodes represented contemporary taxonomic groups. His view of evolution was thus fundamentally different to that of Darwin.

Most of the subsequent pre-1859 trees of biological relationships showed non-genealogical affinity, for example those of Agassiz, Augier, Bronn, and Hitchcock — these were not intended to be evolutionary diagrams, because their authors did not believe in evolution (Ragan 2009; Tassy 2011). Other people followed the lead of Lamarck, and thus drew similar trees, such as Barbançois, Strickland, and Wallace.

Atkinson and Gray (2005) point out that "Darwinian ideas of descent with modification were less revolutionary in linguistics than they were in biology", and so Darwinian trees appeared earlier in linguistics. For example, Schlegel (1808) is usually credited with introducing a "stammbaum" (family tree) approach to comparative grammar, along with Bopp (1816). The previous language tree of Gallet in c.1800 showed a combination of geographical and chronological relationships, rather than being strictly genealogical, and some contemporary languages were shown at internal nodes. Both Čelakovský and Schleicher in 1853 independently drew the first truly genealogical diagrams in linguistics. These had contemporary languages at the leaves but language groups on the internal edges, rather than ancestral languages.

This leaves open the question of who first drew a tree that could be considered to be completely Darwinian.

The first tree

A family-tree approach has also been developed for textual analysis, where genealogical diagrams are called "stemma", and this is where we actually find the first diagrams that match all of Darwin's ideas about evolution, as listed above.

In 1827 Hans Samuel Collin and Carl Johan Schlyter published the first volume of the Corpus Iuris Sueo-Gotorum Antiqui, which was a compilation of all of the medieval laws of Sweden, presented in both Latin and Swedish. Collin was involved as editor of volumes 1 and 2, with Schlyter acting as sole editor of volumes 3-13 (the last published in 1877, so that Schlyter spent 55 years on the project).

In order to compile the definitive version of the laws, the editors consulted all of the known documents (some 800 or so), which consist of hand-written manuscript copies, each one being a copy of some earlier copy. The editors performed a detailed comparative analysis of the texts in order to establish an authentic version of the original laws (their subsequent commentary in the books is longer than the laws themselves). This is, literally, a study of "descent with modification", and not merely an analogy with Darwin's famous expression.

The first volume, which covers the county laws of Västergötland, is unique among the 13 volumes in that the editors make the following comment in their Preface (page XXXVII):
Latin:
Quo evidentius appareat mutua illorum codicum nunc descriptorum ratio, qui continent textum Iuris VG. antiquioris vel recentioris, vel partem aliquam illius textus, hanc rationem, prout ex iis, in quibus inter se conveniunt aut differunt codices, iudicare potuimus, schemate quodam cognationis, Tab. III, exprimere tentavimus.
Swedish:
För att göra förhållandet emellan de nu beskrifna codices, som innehålla WGL:s text eller någon del deraf, enligt dess äldre eller yngre redaktion, så mycket mer åskådligt, hafva vi sökt att genom ett slags stamtafla, Tab. III, framställa deras slägtskap så som vi af deras inbördes öfverensstämmelser eller olikheter tryckt oss kunna sluta dertill. 
Holm (1972) translates the Swedish as:
To make the relationship all the clearer between the codexes now described, containing in whole or in part the text of the Västergötland Law in its older or younger redaction, we have attempted to present their affinities, as far as we could determine them from mutual agreements and differences, in a kind of family-tree in Table III.
There are two online copies of the book in Google Books, but neither of these displays Tab. III correctly, as it is apparently a foldout. So, I have included it here.

The 1827 stemma from Collin & Schlyter.
The manuscript texts are lettered. The vertical axis represents time,
with the dashed lines indicating 25-year intervals from 1300 to 1500.

O'Hara (1996) points out that the idea of establishing the most authentic version of a text by reconstructing its ancestry may have been part of an earlier monastic tradition, designed to elucidate the nature of the original scriptures. However, Collin and Schlyter appear to be the first to have done this in such a thorough manner, as most people did not bother to locate all of the extant texts (see Holm 1972). Moreover, their use of a genealogical diagram to illustrate their conclusions seems to be totally original (Timpanaro 2005; Robins 2007). The stemma matches all of the Darwinian criteria, and so it lays claim to being the first Darwinian tree. Its most obvious difference to Darwin's ideas is that it refers to individual objects rather than to groups such as species.

Holm (1972) attributes the figure (and the idea for it) to Schlyter alone, although there is nothing in the original text to support this assumption — all of the editorial comments are written in the plural. However, Frederiksen (2009) has revisited the background to the stemma, and she concludes that "there is every possibility" that Schlyter should be given the sole credit. She also points out that if the stemma is "regarded as a schema that draws up the principal lines [of descent] and disregards the contamination of the tradition it would seem to be almost accurate." In other words, the editors seem to have got it right the first time.

The issue of contamination is an important one, referring to the fact that many textual copies are actually compiled from multiple sources. Under these circumstances the stemma should be a reticulating network, not a tree. Indeed, Holm (1972) attributes the absence of stemma in any of the other 12 volumes to concern by Schlyter about contamination, and therefore about the actual usefulness of a tree. Nevertheless, several other people produced stemma of varying degrees of sophistication soon after 1827, including Carl Zumpt in 1831, Friedrich Ritschl in 1832, and Johan Madvig in 1833 (Holm 1972; O'Hara 1996; Timpanaro 2005). For this blog, it is worth pointing out that the stemma by Ritschl (1832) explicitly shows contamination, and is thus a reticulate network, the first of its kind in stemmatology.

Hilgendorf's 1866 phylogeny of fossil snails.
The fossils are aligned horizontally with respect to
their geological layers.

One very interesting feature of the Collin & Schlyter figure is its clear resemblance to the fossil diagrams produced independently by Franz Hilgendorf and Albert Gaudry in 1866. Both of these people studied fossils in situ, so that they could see their distribution in the geological layers, and the fossil record was complete enough for them to construct evolutionary scenarios that connected the fossils together. They thus both produced evolutionary trees with the vertical axis explicitly representing time. The only real difference from the stemma is that in their diagrams time proceeds from bottom to top (as do the fossil layers in situ).

Finally, it is worth noting one very modern feature of the stemma. In only one case is a manuscript indicated as being a direct descendant of another. In all other cases the internal nodes are unlabeled, so that the known texts show sister-group relationships rather than direct ancestor-descendant relationships. If we do not have independent evidence that an observed text (or fossil) is a direct ancestor of another text (or fossil), then we should not indicate it as such in the evolutionary history.

References

Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Bopp F (1816) Über das Conjugationssystem der Sanskritsprache, in Vergleichung mit jenem der griechischen, lateinischen, persischen und germanischen Sprache. Andreäischen, Frankfurt-am-Main.

Collin HS, Schlyter CJ (eds) (1827) Corpus Iuris Sueo-Gotorum Antiqui. Volumen 1. Westgötalagen. Haeggström, Stockholm.

Frederiksen BO (2009) Stemmaet fra 1827 over Västgötalagen: en videnskabshistorisk bedrift og dens mulige forudsætninger. Arkiv för Nordisk Filologi 124: 129-150.

Holm G (1972) Carl Johan Schlyter and textual scholarship. Saga och Sed (Kungl. Gustav Adolfs Akademiens Årsbok 1972): 48-80.

Lamarck J-B (1809) Philosophie Zoologique. Dentu et l'Auteur, Paris.

O'Hara R (1996) Trees of history in systematics and philology. Memorie della Società Italiana di Scienze Naturali e del Museo Civico di Storia Naturale di Milano 27: 81-88.

Ragan MA (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Ritschl F (1832) Thomae Magistri sive Theoduli Monachi Ecloga vocum Atticarum. Orphanotrophei, Halle.

Robins W (2007) Editing and evolution. Literature Compass 4: 89-120.

Schlegel F (1808) Über die Sprache und Weisheit der Indier: ein Beitrag zur Begründung der Alterthumskunde. Mohr und Zimmer, Heidelberg.

Tassy P (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Timpanaro S (2005) The Genesis of Lachmann's Method [translation]. University of Chicago Press, Chicago.

Wednesday, June 19, 2013

Using phylogenetic analyses for textual analysis


I have written before about the distinction between phylogenetic networks and other types of biological network (see Biological versus phylogenetic networks). Basically, a phylogenetic network starts with observed data and infers the network connections via some optimization procedure, whereas for most other networks the connections are the observed data and the network is summarized by one or more statistics such as Degree Centrality or Betweenness Centrality (for an explanation, see Network measures and phylogenetic networks).

This distinction is also important for the use of networks as data displays, both in biology and elsewhere. I have noted that splits networks, for example, are a very useful alternative to multivariate data-display analyses such as Principal Components Analysis (see Networks can outperform PCA ordinations in phylogenetic analysis). PCA can, for example, produce mathematical artifacts that distort the display, which is obviously undesirable (see Distortions and artifacts in Principal Components Analysis analysis of genome data).

It is interesting, therefore, to compare the different network types in terms of their ability to analyze and display a particular data set. To demonstrate the generality of the methods, here I discuss an analysis of a text document.

The one I have chosen has previously been analyzed by Seth Long (Text Network and Corpus Analysis of the Unabomber Manifesto). The Unabomber Manifesto is a 35,000 word document from 1995 entitled "Industrial Society and its Future", written by Theodore (Ted) Kaczynski, which is basically a critique of contemporary techno-capitalist society. A textual analysis is of interest because, as Seth notes: "The motives of all authors — or at least their traces — are always left behind in the lexical choices of their texts. Deliberate, written language is like a rhetorical fingerprint."

Textual analyses

Seth Long's textual analysis procedure was:
  1. import the text into an analytical tool (in this case AutoMap) in order to remove trivial words (eg. articles, conjunctions, pronouns), and to reduce inflected words to their base form;
  2. use the same tool to quantify what words are connected to what other words and how often (in this case using a two-word gap);
  3. import the result into a network analysis tool (in this case Gephi) in order to visualize the semantic connections; each word is visualized as a node in the network, and co-occurrences of words that appear next to each other are shown as edges in the network.
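
The three steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not a reproduction of what AutoMap or Gephi actually do: the stop-word list is a tiny made-up one, no reduction of inflected words to their base form is attempted, and the "two-word gap" is assumed here to mean that two words are linked when they lie at most two positions apart after the trivial words have been removed. The function names are hypothetical.

```python
from collections import Counter

# Assumption: a tiny illustrative stop list; a real delete list is much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def tokenize(text):
    """Lowercase the text, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def cooccurrence_edges(tokens, gap=2):
    """Count undirected links between words at most `gap` positions apart."""
    edges = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + gap]:
            if other != word:
                edges[frozenset((word, other))] += 1
    return edges

def degree(edges):
    """Number of distinct neighbours of each word (unnormalised degree)."""
    neighbours = {}
    for pair in edges:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {word: len(linked) for word, linked in neighbours.items()}

sample = "The system controls human behavior, and the system restricts human freedom."
tokens = tokenize(sample)
edges = cooccurrence_edges(tokens)
deg = degree(edges)

# List the words from most to least connected.
for word, d in sorted(deg.items(), key=lambda item: -item[1]):
    print(word, d)
```

The edge counts play the role of the observed network, and the degree of each word is the simplest of the summary statistics that are then calculated from it.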
According to Seth:
The two most important network visualizations, in my opinion, show nodes with the highest levels of Betweenness Centrality and the highest levels of Degree Centrality. The latter measures how many total connections a node has to other individual nodes. The former measures whether or not a node is connected to other nodes that themselves have many connections. 
In a textual network, a word with high degree centrality is a word used in connection with a myriad of other words. This simply tells you that a word is used frequently in a text and in a variety of contexts. A word with high betweenness centrality is a word used frequently and in conjunction with other words that also connect to other nodes to form community clusters. This tells you that a word is not only used frequently and not only in many contexts but also that it is used in connection with words that also do a lot of semantic work in the text. A word with high betweenness centrality is a word through which many meanings in a text circulate.
Nodes with the greatest Degree Centrality in the text

Nodes with the greatest Betweenness Centrality in the text

The size of the words in the two networks represents their "amount" of centrality (ie. their importance). Clearly, these networks are very complex, and it would be best to simplify them. Seth does this in some of his other textual analyses, where he uses "one of Gephi’s degree range tools to hide the most disconnected nodes, thereby ‘cleaning’ the visualization of all but the most prominent clusters and connections" (eg. see Meaning circulation in Lolita). This has not been done in this example.

I will not provide an interpretation of these two networks, which you can find in Seth's original post. The basic conclusion is that "Kaczynski is a primitivist who loves nature more than humanity."

Seth also notes:
One thing a text network does, beyond providing an interesting visualization, is to point the researcher in the direction of terms and n-grams that might be explored more granularly in a corpus analysis tool, such as the NLTK [Natural Language Toolkit]. It provides a map of a text’s semantic circulation, a map that can be followed when we return to the world of pure textuality.
The two corpus analyses that Seth Long provides are a histogram of the most frequent words, and a graph of where in the text the most frequent words fall (beginning, middle, end, throughout, etc).

Phylogenetic analyses

We can now compare these analyses to the use of phylogenetic trees and networks as heuristic tools for data analysis and display. The objective is the same as for the above analyses, and the general approach is also very similar. The main difference, as explained above, is that the nodes and edges are inferred rather than observed. This means that the words appear only at the ends of terminal edges, rather than being scattered throughout the network.

These analyses involve:
  1. remove trivial words, count the frequency of the remaining words, simplify the network by choosing how many of the words to display (50 in this case), and record their location in the text;
  2. calculate the semantic "distance" between words based on their co-occurrence in a sliding window (in this case 20 words), using some similarity measure (in this case the Jaccard coefficient);
  3. visualize the distances as an unrooted phylogenetic tree, in this case a neighbor-joining tree calculated using TreeCloud;
  4. visualize the distances as an unrooted phylogenetic network, in this case a neighbor-net (a splits graph) calculated using SplitsNetworkCloud.
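
Step 2 is the one that differs most from the previous procedure, so here is a minimal Python sketch of it. It assumes that each word is represented by the set of sliding windows in which it occurs, and that the distance between two words is one minus the Jaccard coefficient of those two sets. The exact formula used by TreeCloud may differ in detail, and the function names are hypothetical.

```python
from itertools import combinations

def windows_containing(tokens, width=20):
    """Map each word to the set of sliding-window indices in which it occurs."""
    n_windows = max(len(tokens) - width + 1, 1)
    occurrences = {}
    for start in range(n_windows):
        for word in tokens[start:start + width]:
            occurrences.setdefault(word, set()).add(start)
    return occurrences

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of window indices."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(tokens, words, width=20):
    """Pairwise Jaccard distances between the chosen words, stored symmetrically."""
    occ = windows_containing(tokens, width)
    dist = {}
    for u, v in combinations(words, 2):
        value = jaccard_distance(occ.get(u, set()), occ.get(v, set()))
        dist[(u, v)] = dist[(v, u)] = value
    return dist

# Toy text with a window of 3 words, so that the result can be checked by hand.
toy = "power process autonomy freedom nature power process wild nature freedom".split()
dist = distance_matrix(toy, ["power", "process", "autonomy", "wild"], width=3)
print(dist[("power", "process")])
print(dist[("autonomy", "wild")])
```

Words that usually share a window get a distance near 0, and words that never share one get a distance of 1. This matrix is the input to the tree and network inference in steps 3 and 4.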

TreeCloud of the text

NetworkCloud of the text

In these phylogenetic analyses, the size of the words represents their frequency in the text, and the colour of the words represents their location in the text (red near the beginning, blue near the end). This adds the corpus analyses to the network visualization, making the graphs more informative. This can happen because the visualization itself is the inferred network, rather than the visualization summarizing various aspects of centrality of the underlying observed network.
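
The two corpus quantities that are mapped onto the words can be computed very simply. The sketch below is an illustration only: it assumes that a word's "location" is summarized as the mean relative position of its occurrences, and that the colour is a linear blend from red to blue. Neither choice is taken from the TreeCloud documentation, and the function names are hypothetical.

```python
from collections import Counter

def frequency_and_position(tokens, top=50):
    """Return {word: (count, mean relative position)} for the `top` most frequent words."""
    counts = Counter(tokens)
    positions = {}
    for index, word in enumerate(tokens):
        positions.setdefault(word, []).append(index)
    last = max(len(tokens) - 1, 1)
    summary = {}
    for word, count in counts.most_common(top):
        mean_position = sum(positions[word]) / count / last  # 0 = start, 1 = end
        summary[word] = (count, mean_position)
    return summary

def position_to_rgb(p):
    """Linear blend from red (p = 0, beginning) to blue (p = 1, end)."""
    return (round(255 * (1 - p)), 0, round(255 * p))

toy = ["nature", "power", "power", "system", "nature", "system", "system"]
summary = frequency_and_position(toy)
for word, (count, where) in summary.items():
    print(word, count, position_to_rgb(where))
```

The count would set the font size of the word, and the colour would show whether it is concentrated near the beginning or the end of the text.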

The relative distance between two words in the text is given by the relative length of the path between them in the tree or network. Note that the "clean-up" of the graphs, by restricting the number of included words, helps a lot with the interpretation (as it would if also applied to the previous two graphs).
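
To make the idea of path length concrete, here is a compact Python sketch of the standard neighbor-joining algorithm, together with a function that sums the edge lengths along the path between two nodes. It is a textbook version written for illustration, not the code used by TreeCloud. The function names and the `nodeN` labels for inferred internal nodes are hypothetical, and the input labels are assumed not to clash with them.

```python
from itertools import combinations

def neighbor_joining(labels, dist):
    """Return the edges (node, node, length) of an unrooted neighbor-joining tree.

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    """
    d = dict(dist)
    active = list(labels)
    edges = []
    counter = 0
    while len(active) > 2:
        n = len(active)
        # Total distance from each active node to all of the others.
        total = {a: sum(d[frozenset((a, b))] for b in active if b != a) for a in active}
        # Pick the pair that minimises the neighbor-joining Q criterion.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        new = f"node{counter}"  # inferred internal node, not an observed word
        counter += 1
        length_i = 0.5 * dij + (total[i] - total[j]) / (2 * (n - 2))
        edges.append((i, new, length_i))
        edges.append((j, new, dij - length_i))
        # Distances from the new internal node to the remaining nodes.
        for k in active:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        active = [k for k in active if k not in (i, j)] + [new]
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

def path_length(edges, start, end):
    """Sum of the edge lengths along the unique path between two nodes of a tree."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, so_far = stack.pop()
        if node == end:
            return so_far
        for neighbour, length in adjacency[node]:
            if neighbour != parent:
                stack.append((neighbour, node, so_far + length))
    raise ValueError("nodes are not connected")

# A small tree-like (additive) distance matrix, made up for illustration.
labels = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
       ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree = neighbor_joining(labels, dist)
print(path_length(tree, "a", "d"))
```

When the input distances are perfectly tree-like, as in this toy matrix, the path lengths in the tree reproduce them exactly. With real co-occurrence distances the tree is only an approximation, which is part of the motivation for also looking at a network. Note that the observed items appear only at the leaves, while the `nodeN` nodes are inferred.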

The phylogenetic tree focuses on certain of the word connections, rather than trying to display them all; it tries to infer which connections are "important" based on the measure of semantic distance, rather than connecting all nearby words. The interpretations from the tree are similar to those from the previous networks, but in many ways the interpretations are displayed more directly by the inferred (phylogenetic) graph than by summarizing centrality (either degree or betweenness).

Finally, the network is much more complex than the tree, which is often the case. Note that for the tree the edge lengths are all equal, but in the network the "average" distance between two words is given by the length of the path between them. This is just for illustrative purposes, as the tree could be drawn with variable edge lengths or the network drawn with unit edge lengths. The main reason for using unit edge lengths is that the terminal edges are often very long and the structure of the tree or network is hidden in the centre, as discussed by Gambette et al. (2012).

The main patterns that the network adds to the tree are: (a) "human" is separately associated with "control" and "behavior" on one hand and "beings" on the other; (b) "power" is separately associated with "process" and "autonomy" on one hand and "satisfy" on the other; and (c) "primitive" is separately associated with "individuals", "societies" and "groups" on one hand and "modern" and "man" on the other.

Conclusion

The phylogenetic approach is helpful because it focuses on certain of the network connections, rather than trying to display them all, as do the other networks. It cannot separately analyze concepts like degree and betweenness, and so information is lost; but this is traded off against the ability to include corpus analyses such as word frequency and location. Phylogenetic trees and networks can thus be valuable tools for textual analysis.

The TreeCloud was introduced by Gambette and Véronis (2010). If you read French, then examples are presented by Amstutz and Gambette (2010) and by Gambette and Martinez (2012) (the latter has a comparison with some other multivariate data analyses).

Thanks to Philippe Gambette for producing the NetworkCloud.

References

Amstutz D., Gambette P. (2010) Utilisation de la visualisation en nuage arboré pour l'analyse littéraire. JADT'10: 10th International Conference on Statistical Analysis of Textual Data.

Gambette P., Gala N., Nasr A. (2012) Longueur de branches et arbres de mots. Corpus 11: 129-146.

Gambette P., Martinez W. (2012) L'affaire du Mediator au prisme de la textométrie. Manuscript.

Gambette P., Véronis J. (2010) Visualising a text with a tree cloud. In: Locarek-Junge H., Weihs C. (eds) Classification as a Tool of Research, Proceedings of IFCS'09 (11th Conference of the International Federation of Classification Societies), pp. 561-570.