The Genealogical World of Phylogenetic Networks: Character cliques and networks

[This post is the second part of our miniseries on the origin and evolution of sign language manual alphabets]

One aspect of exploratory data analysis (EDA) is trying to understand how our data relate to our inference(s). This is especially important when the signal from our data is increasingly complex. Sign language manual alphabets are such a case.

In our first post about sign language manual alphabets, I introduced the principal networks that we used to classify sign languages. Here, I'll describe our character mapping procedure and why we did it as part of our EDA framework, in order to establish scenarios for the origin and evolution of sign languages.

Characters and mapping

We encoded each hand-shape used to signify a certain concept, such as the letters of the standard Latin alphabet ("a", "b", "c", ... "x", "y", "z"), as a binary sequence: the presence or absence of a certain COGID (we will explain and discuss this in a later post). These binary sequences can be seen as analogous to the genetic code, as a sort of 'linguistic haplotype', and their evolution can be mapped onto a network based on the entire dataset.
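As a minimal sketch of this encoding, assume each alphabet records one COGID per concept. The function name `encode_binary` and the COGID numbers below are hypothetical placeholders for illustration, not our actual data or scoring script.

```python
def encode_binary(cogids_by_alphabet):
    """Turn {alphabet: {concept: cogid}} into presence/absence rows.

    Returns (columns, rows): columns is a sorted list of (concept, cogid)
    pairs; rows maps each alphabet to a list of 0/1 values, one per column.
    """
    columns = sorted({
        (concept, cogid)
        for concepts in cogids_by_alphabet.values()
        for concept, cogid in concepts.items()
    })
    rows = {
        alphabet: [1 if concepts.get(concept) == cogid else 0
                   for concept, cogid in columns]
        for alphabet, concepts in cogids_by_alphabet.items()
    }
    return columns, rows


# Hypothetical toy data: the COGID numbers are made up for illustration.
toy = {
    "Yebra": {"a": 1, "g": 10},
    "Bonet": {"a": 1, "g": 11},
    "Russian 1835": {"a": 1, "g": 12},
}
columns, rows = encode_binary(toy)
```

Each column is one (concept, COGID) pair, so three alternative hand-shapes for one concept become three binary columns, and the row of an alphabet is its 'linguistic haplotype'.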

For instance, our matrix has three binaries (haplotypes) for the concept [g] in the oldest set of sign languages (pre-1840), two of which can be found in the earliest alphabets in our dataset: those of Yebra 1593 and Bonet 1620. Russian 1835, the oldest Cyrillic alphabet, uses a somewhat different hand-shape for its counterpart of the Latin "g", the Cyrillic "г".

For the concept [g], we thus have three taxon cliques, each defined by a distinct binary/haplotype: the 'Yebra haplotype', the 'Bonet haplotype', and the 'Cyrillic haplotype'.
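Grouping the taxa by their haplotype is all that is needed to obtain such cliques. A minimal sketch follows; the function name is an assumption, and the two `placeholder_` taxa are hypothetical, added only so that the cliques have more than one member.

```python
from collections import defaultdict


def taxon_cliques(haplotypes):
    """Group taxa that share the same haplotype for a single concept."""
    cliques = defaultdict(set)
    for taxon, haplotype in haplotypes.items():
        cliques[haplotype].add(taxon)
    return dict(cliques)


# Haplotype labels for the concept [g]; placeholder taxa are hypothetical.
g_haplotypes = {
    "Yebra": "Yebra haplotype",
    "Bonet": "Bonet haplotype",
    "Russian 1835": "Cyrillic haplotype",
    "placeholder_1": "Yebra haplotype",
    "placeholder_2": "Bonet haplotype",
}
g_cliques = taxon_cliques(g_haplotypes)
```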

By mapping these haplotypes on the network, as shown in the next figure, we can see that there is a small edge bundle reflecting the basic split between the Yebra and Bonet haplotypes.

Hand-shape drawings are taken from the original manuscripts.

We can also see that the Russian haplotype either evolved from the Yebra haplotype kept in the older Austrian-origin Group (ie. it is an adaptation of the Yebra haplotype), or is a genuinely new invention; note the similarity of the Russian hand-shape to the letter г.

We repeated this procedure for all 26 concepts of the standard Latin alphabet, to get an idea of how often the encoded linguistic haplotypes fit with the overall pattern visualized in the inferred Neighbor-nets (ie. the neighborhoods as defined by edge bundles). This is shown in the next figure.

The arrows indicate inferred evolutionary processes (replacement or invention).
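One simple way to quantify this fit is to ask, for each character-based taxon clique, whether the network contains a split (edge bundle) that separates exactly that clique from all other taxa. The sketch below assumes that the splits have already been exported from the network as sets of taxon names; the taxa `A` to `E` and both function names are hypothetical. It checks for exact matches only, which is stricter than split compatibility.

```python
def matches_split(clique, taxa, splits):
    """True if some split separates `clique` from all remaining taxa."""
    side = frozenset(clique)
    other = frozenset(taxa) - side
    return any(frozenset(s) in (side, other) for s in splits)


def fit_fraction(cliques, taxa, splits):
    """Fraction of cliques that coincide with a split of the network."""
    hits = sum(matches_split(c, taxa, splits) for c in cliques)
    return hits / len(cliques)


# Hypothetical toy example: five taxa, two splits (one side given per split).
taxa = {"A", "B", "C", "D", "E"}
splits = [{"A", "B"}, {"D", "E"}]
cliques = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}]
fraction = fit_fraction(cliques, taxa, splits)
```

Here the first clique matches a split directly, the second matches via its complement, and the third has no counterpart in the network.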

Using this network mapping (which, in principle, uses the logic of parsimony/median networks), we can make direct inferences about the general mode of evolution.

For instance, even though Russian 1835 uses a different set of hand-shapes (ie. is defined by partly unique haplotypes), the hand-shapes for the concepts [p] and [z] are exclusively shared with the Austrian-origin Group. The biological equivalent would be: the 'Austrian haplotypes' are a uniquely shared derived feature reflecting a putative common origin of the Austrian and Russian lineages, ie. a potential linguistic synapomorphy. We can also see that the haplotypes that Russian shares with all ([a][c][f][r][u][y]) or part ([b][e][i][k][n][o][x]) of the French-origin Group, an alternative source that may have inspired this early Cyrillic alphabet, all lack this quality.
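The exclusivity part of this argument can be checked mechanically. The following is a minimal sketch with hypothetical taxon names, haplotype labels and function name; it only tests whether the focal alphabet shares its haplotype with members of a single group, not whether that haplotype is derived, which still has to be judged from the map.

```python
def exclusive_group(haplotypes, groups, focal):
    """Return the one group sharing the focal taxon's haplotype, else None."""
    state = haplotypes[focal]
    sharers = [t for t, h in haplotypes.items() if h == state and t != focal]
    sharer_groups = {groups[t] for t in sharers}
    return next(iter(sharer_groups)) if len(sharer_groups) == 1 else None


# Hypothetical group membership and haplotypes for two concepts.
groups = {
    "austrian_1": "Austrian-origin", "austrian_2": "Austrian-origin",
    "french_1": "French-origin", "french_2": "French-origin",
}
concept_p = {"focal": "h2", "austrian_1": "h2", "austrian_2": "h2",
             "french_1": "h1", "french_2": "h1"}
concept_a = {"focal": "h1", "austrian_1": "h1", "austrian_2": "h1",
             "french_1": "h1", "french_2": "h1"}

p_result = exclusive_group(concept_p, groups, "focal")
a_result = exclusive_group(concept_a, groups, "focal")
```

In this toy case the haplotype for the first concept is shared with one group only, whereas the second is shared across groups and therefore says nothing about a specific common origin.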

We can also make inferences about:

which hand-shape is the original one (O);
lineage-specific / diagnostic hand-shapes, eg. At. = Austrian, Da. = Danish (using two-letter abbreviations);
which hand-shapes are shared but apparently derived, eg. At.-Fr. are hand-shapes / haplotypes shared by members of the Austrian- and French-origin groups but not found in the Yebra or Bonet alphabets (C stands for cosmopolitan, non-original hand-shapes common in various lineages, including the British-origin Group, and D represents derived but rare hand-shapes without any clear lineage affiliation); and
which hand-shapes are alphabet-unique (ie. represent a linguistic autapomorphy).
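A rough version of this labelling can be sketched as a rule on the set of alphabets that carry a haplotype. The rule below is a simplification and an assumption on my side: it treats any haplotype found in the oldest alphabets as original (O), and it does not reproduce the C and D categories, which need a judgement about how widespread a hand-shape is. Taxon names other than Yebra and Bonet are hypothetical.

```python
def classify_haplotype(carriers, groups, oldest):
    """Label one haplotype from the set of alphabets that carry it."""
    carriers = set(carriers)
    if carriers & set(oldest):
        return "O"                      # found in an oldest alphabet
    if len(carriers) == 1:
        return "unique"                 # alphabet-unique (autapomorphy)
    # One abbreviation: lineage-diagnostic; several: shared but derived.
    return "-".join(sorted({groups[t] for t in carriers}))


# Hypothetical taxa with two-letter lineage abbreviations.
groups = {"at_1": "At.", "at_2": "At.", "fr_1": "Fr.", "da_1": "Da."}
oldest = {"Yebra", "Bonet"}

labels = {
    "h_old": classify_haplotype({"Yebra", "at_1", "fr_1"}, groups, oldest),
    "h_at": classify_haplotype({"at_1", "at_2"}, groups, oldest),
    "h_mix": classify_haplotype({"at_1", "fr_1"}, groups, oldest),
    "h_one": classify_haplotype({"da_1"}, groups, oldest),
}
```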

In addition, we can explore certain details, including patterns (character-based taxon cliques) that are at odds with the overall reconstruction. The latter are to be expected, because the graph is planar (2-dimensional) but the processes that shaped sign alphabets are likely to be multi-dimensional. For instance, our networks failed to resolve the affinity of the contemporary Norwegian Sign Language, the reason for which can be seen in the following character map.

Note the position of Norwegian 1955, which is still part of the Austrian-origin Group (like older manual alphabets used in the late 19th century in Norway). However, it is already influenced by international standardization: eg. concepts [k], [p], and [z] use(d) French hand-shapes. Hence, Norwegian 1955 shares quite a high number of lineage-diagnostic hand-shapes with Danish 1967 and the Icelandic Sign Language. These, and others, were further replaced in its contemporary counterpart (Norwegian SL) by hand-shapes borrowed from various lineages (eg. [c], [f] from the nearly extinct Austrian-origin Group, [p] from the Russian Group, and [k] as in the Spanish Group), as well as by unique hand-shapes, including hand-shapes that evolved from earlier forms or that have been genuinely invented.

Why we map character evolution along networks

In many cases, we have only one set of data from which to draw our conclusions, based on the graph(s) we infer. We cannot test to what degree our data (the way we scored the differentiation patterns) and inferences are systematically biased. Thus, we want to explore which aspects of our inference are supported by character splits, and establish taxon cliques and evolutionary pathways for the characters (scored traits). Lacking an independent source of data, the latter would involve circular reasoning, ie. mapping the traits along a tree derived from those same traits.

By inferring a tree, we crystallize one pattern dimension out of the data, although more often than not this will be a compromise from multidimensional signals. A network, such as a Neighbor-net, has two dimensions, and hence our mapping can consider two alternatives at the same time; this enables us to make a choice, if we have to. Another practical advantage of a Neighbor-net is that it is quick to infer, so that we can easily reduce the data set and use a more focused graph for the map.

In cases where 2-dimensional graphs don't suffice, there are still Consensus networks, which would allow mapping character evolution based on a sample of many alternative trees.

We could even eliminate the circular reasoning while maintaining a relatively stable inference framework. Deleting a character or several characters (or recoding them: see eg. "Should we try to infer trees on tree-unlikely matrices?") can easily lead to a new tree topology, although it has less effect on the structure of a Neighbor-net. When we need to worry about circular reasoning for mapping a certain concept, or two concepts that may have interacted, we just base our Neighbor-net on a distance matrix calculated from a reduced character matrix, and then map only those concepts not considered for the inference.
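The reduction step can be sketched as follows. This is a minimal example that assumes a simple mean character difference (normalized Hamming distance) between binary rows; the distance actually used for a given analysis may differ. The function name, the toy matrix and the COGID numbers are hypothetical, and the resulting distances would still have to be passed to a Neighbor-net implementation, which is not shown here.

```python
from itertools import combinations


def reduced_distances(columns, rows, held_out):
    """Pairwise mean character difference, ignoring held-out concepts."""
    keep = [i for i, (concept, _) in enumerate(columns)
            if concept not in held_out]
    dist = {}
    for a, b in combinations(sorted(rows), 2):
        diff = sum(rows[a][i] != rows[b][i] for i in keep)
        dist[(a, b)] = diff / len(keep) if keep else 0.0
    return dist


# Hypothetical toy matrix: one column per (concept, COGID) pair.
columns = [("a", 1), ("g", 10), ("g", 11), ("g", 12)]
rows = {
    "Yebra": [1, 1, 0, 0],
    "Bonet": [1, 0, 1, 0],
    "Russian 1835": [1, 0, 0, 1],
}

full = reduced_distances(columns, rows, held_out=set())
without_g = reduced_distances(columns, rows, held_out={"g"})
```

The concept [g] is then mapped on the network inferred from `without_g`, so the map is no longer based on the character being mapped.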

Other posts in this miniseries