This is a joint post by Guido Grimm and Johann-Mattis List.
Mattis’ last two blog posts dealt with
problems of what linguists call "structural data". Here we
discuss what this means for the inference of relationships between
languages.
A closer look at structural data: the questionnaire issue
As pointed out before, what is called
structural data in comparative linguistics is a very diverse mix of data solely
unified by the idea of having some kind of
questionnaire that a linguist may use when going into the field and trying to
describe a certain language. These questionnaires are a bit
different from the traditional
concept lists usually used for the
purpose of historical language comparison (see the collection of
different lists in the
Concepticon
project by List et al. 2016). The main difference is that they are
based on an imaginary question that a field worker asks an informant (which could as well be a
written grammar of the language in question). Since questions can
be asked in many different ways, while concepts in historical
language comparison are usually restricted to the so-called "basic
vocabulary", the diversity of structural datasets is much
greater than the diversity we encounter when comparing questionnaires
based on concept lists.
When analyzing these data, we deal with
characters of very different nature, and likely different evolutionary
pathways or histories. A biological analogy would probably be (true)
total evidence data sets that combine genetic data from: genes/genomes
with different inheritance pathways (paternally, maternally,
biparentally; basic information level), morphological-anatomical data
(visible form, phenotypic), palaeontological data (historical
evidence), ontogenetic (life-history stages, developmental features), and biochemical
data (expression level). The only difference is probably that the linguistic characters’
histories may be more complex. [Side-remark: ‘total evidence’
datasets found in the biological literature are typically just a
combination of genetic and morphological data, allowing for the inclusion of
extinct/fossil taxa.]
To give a specific example, let's have a look at the
Chinese dataset by Szeto et al. (2018), mentioned in Mattis'
blogpost
from September. This dataset is now
accessible as a GitHub repository (
https://github.com/cldf-datasets/szetosinitic). Mattis added some
information regarding the different features of the questionnaire. We
list these features in slightly abbreviated form in the table below,
adding rough categorizations by Mattis in the
Comment column.
| ID | Description | Comment |
|---|---|---|
| p-1 | 5 or more tone categories | phonological / diachronic |
| p-2 | Retroflex fricative initials | phonological / diachronic |
| p-3 | Bilabial nasal coda | phonological / diachronic |
| p-4 | Stop codas | phonological / diachronic |
| p-5 | Monosyllabic word for 'snake' | lexical |
| p-6 | Differentiation between 'hand' and 'arm' | lexical / semantic |
| p-7 | Differentiation between 'defecate' and 'urinate' | lexical / semantic |
| p-8 | Differentiation between 'eat' and 'drink' | lexical / semantic |
| p-9 | Semantically void suffix in 'table' | lexical |
| p-10 | Different classifiers for humans and pigs | lexical / semantic |
| p-11 | [CLF N] constructions in subject position with definite reference | syntactic |
| p-12 | Reduplicated monosyllabic nouns | morphological |
| p-13 | Post-verbal modal auxiliary developed from 'get/acquire' | syntactic / diachronic |
| p-14 | Modified-modifier order in animal gender marking | morphological / syntactic |
| p-15 | Post-verbal adverb meaning 'first' | lexical / syntactic |
| p-16 | [V DO IO] order in double object dative constructions | syntactic |
| p-17 | 'Give' as a disposal marker | syntactic / diachronic |
| p-18 | 'Give' as a passive marker | syntactic / diachronic |
| p-19 | 'Go' as a post-VP associated motion marker | syntactic / diachronic |
| p-20 | Marker-Standard-Adjective order in comparatives | syntactic |
| p-21 | Case system | morphological / syntactic |
Mattis has tried to characterize the features,
i.e. the matrix's characters, by generalizing linguistic categories: "phonological", pointing roughly to
questions about pronunciation (the biological equivalent would be
phenotypic traits in morphology or anatomy); "lexical", pointing to the words in
the lexicon (this would be the DNA of a language); "morphological", pointing to the ways
in which words are constructed; and "syntactic", pointing to the ways in
which words are combined to form sentences. In combination, “morphological” and
“syntactic” correspond to ‘meta-level’ biological traits, such as
development-related features, ontogenetic evidence, and biochemical
composition: the ways in which the genetic code is expressed or used in a
living organism in adaptation to the environment.
Mattis also flagged some characters as
"diachronic", to mark whether the respective feature was
selected by the authors due to their independent knowledge about the
history of the Chinese dialects. This is something rarely possible in biology,
but imagine that we could go back in time to literally observe the
evolution of a lineage over a given time-period, and code this
observed evolution as traits. Note that this is not entirely science fiction:
there are two examples where we can directly observe pathways of biological evolution, namely mutation patterns in viruses and horizontal modification of
marine morphs in high-resolution sediment cores.
While one can discuss to what degree a certain feature should belong to this
category, it is rather obvious that all phonological features are
diachronic, because they name distinctions that reflect well-known
processes of sound change, which happened in a couple of Chinese
dialects and have been proposed in the past by dialectologists in
order to classify the Chinese dialects historically.
For example, consider feature p-3 of the
questionnaire: Does a given dialect have a syllable that ends in [-m]?
From the history of the Chinese dialects we
know that [-m] was present in Middle Chinese, but later merged with [-n] and [-ŋ]
in many varieties. Given that we know that this happened, and that
people have used this to mark a split, especially between
the "innovative" dialects in the North and the more "conservative" ones in the South, it is
clear that this feature bears explicit historical information. The same holds for all phonological features that we find in the data:
p-1, the number of different tones in the dialects, again roughly
reflects the differences between languages in the North and in the
South (the North having lost many tones); p-2 reflects the retention
or specific development of retroflex sounds (similar to sh in English as opposed to s),
mostly in the North; and p-4 reflects whether a variety has syllables that
can end in [-p, -t, -k],
again a feature characteristic of the more "conservative"
varieties in the South of China.
Figure 1: Overlap of features in Szeto et al.'s (2018) structural feature collection of Chinese dialects
Four lexical features have further been flagged
as "semantic"; they query whether distinctions between concepts exist or
are missing. People who have learned, for example, Russian or certain
German dialects know that it is rather common to have a single word for
what other languages call "arm" and "hand" (see the respective entry in the
CLICS database) or "foot" and "leg".
This diverse feature collection is coded as
binary characters, reflected by presence/absence, or a yes/no answer to the
question in the questionnaire. The choice of features is very selective. A biological analogy would be a matrix
collecting incompatible splits of paternal (molecular) genealogies, along with a few prominent phenotypical traits (reflecting major
evolutionary steps), and some traits that we expect to be primarily triggered
not by genetics (inheritance) but by expression or adaptation to the
environment. Biologists would not phylogenetically analyze such
diverse and complex, potentially selection-biased data (although it
could be very interesting), but linguists do.
In this context, it is remarkable, but also
typical for this kind of data, that the 21-character feature collection by Szeto
et al. (2018) has no feature in common with
the collection by Norman (2003), a 15-character-matrix, which we also
converted to our
Cross-Linguistic
Data Formats (see Forkel et al. 2018) in order to increase the
data comparability.
Figure 2: A Neighbor-net splits graph of the structural data by Szeto et al. (2018).
The typification, coded as a binary matrix to infer the Neighbor-net splits
graph in Figure 2, demonstrates some basic characteristics of such 2-dimensional
graphs. Note that four of the 'characters' (typification categories)
correlate with an edge(-bundle) in the network, separating the 'taxa' (the queried features). All "semantic" taxa are also "lexical", but
"lexical" is more comprehensive, hence, "semantic" is placed as
'descendant' of "lexical" (Neighbor-nets can visualize
ancestor-descendant relationships to some degree). "Morphological" taxa
are either just "morphological" or also "syntactic", hence the
pronounced box.
For "diachronic" and "syntactic", we have no
corresponding edge(-bundle), because one taxon is also "lexical", but
the others are "diachronic" and "syntactic" — this is a conflict that cannot be
resolved with two dimensions. To visualize all the resultant 'taxon'
splits, also called taxon bipartitions, we would need a third dimension.
Lacking a third dimension, the Neighbor-net prioritizes keeping most
"syntactic" together, because the "diachronic-syntactic" are closer to
"syntactic" (max. 1 'character' difference) than to
"diachronic-phonological" (2 character difference). The
"syntactic-lexical" has to be placed apart because it is equally close
to "lexical" and "syntactic" 'taxa', but differs much from
"morphological-syntactic" or "diachronic-syntactic", the closest two
relatives of "syntactic"-only 'taxa'. It is resolved closer to the
centre of the graph, because it is more closely related to the other
"syntactic" taxa than to the rest of the "lexical" taxa. This is also the
reason why the "syntactic"-only taxa have to be placed farther out:
"Diachronic-phonological" and "syntactic-lexical" are closer to the other
endpoints, and the distance of "syntactic"-only to
"diachronic-phonological", "lexical" and "morphological" should be as
large as possible.
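The distances quoted above can be checked with a small sketch. It treats each category combination as a set of typification labels and counts the labels in which two combinations differ. The set-based representation and the helper name `label_difference` are assumptions made for illustration; this is not the matrix used for Figure 2.

```python
# Each 'taxon' is a set of typification labels taken from the Comment column.
# Representing them as sets is an illustrative assumption, not the authors' matrix.
def label_difference(a, b):
    """Count the typification 'characters' in which two 'taxa' differ."""
    return len(set(a) ^ set(b))

syntactic = {"syntactic"}
lexical = {"lexical"}
diachronic_syntactic = {"diachronic", "syntactic"}
diachronic_phonological = {"diachronic", "phonological"}
morphological_syntactic = {"morphological", "syntactic"}
syntactic_lexical = {"lexical", "syntactic"}

# "diachronic-syntactic" is closer to "syntactic" than to "diachronic-phonological".
print(label_difference(diachronic_syntactic, syntactic))                # 1
print(label_difference(diachronic_syntactic, diachronic_phonological))  # 2

# "syntactic-lexical" is equally close to "lexical" and to "syntactic" ...
print(label_difference(syntactic_lexical, lexical))    # 1
print(label_difference(syntactic_lexical, syntactic))  # 1
# ... but farther from the other mixed "syntactic" combinations.
print(label_difference(syntactic_lexical, morphological_syntactic))  # 2
print(label_difference(syntactic_lexical, diachronic_syntactic))     # 2
```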
Losing body parts: How data coding masks underlying processes
Most typologists collecting structural data
are not per se interested in phylogenies. Yet the fact that scholars
deliberately collect historical (diachronic) features shows that,
even if they would not necessarily admit it, they have a genuine
interest in uncovering the history of the languages in question, or at least in how closely related the languages (or here: dialects) are.
But this requires understanding the characters we analyze, the
collected "structural data".
In evolutionary biology, the key question
people (should) ask when trying to select characters is how their
change can be modeled on a tree or a network. What processes could be expected that shaped the data? What is behind the diversity? Is
similarity or dissimilarity instigated by:
- [A] inheritance, i.e. passed from an ancestor
to all / some of its descendants,
- [B] random mutation and/or sorting, i.e. the
product of a stochastic, evolutionary neutral process,
- [C] non-random mutation, i.e. processes that
recur frequently, may be beneficial and positively (gain, or
negatively: loss) selected for, or
- [D] secondary contact, mixing of lineages by
hybridization (symmetric mixing) and introgression (asymmetric
mixing)?
[A]–[C] are vertical processes following a
tree, even if the tree does not necessarily need to be the same; [D] is (mostly) horizontal and
can only be modeled using a network. For each of the above, we
can find an analogy in the evolution of languages.
In addition, process [C], and to a
lesser extent [D], can lead to what biologists call 'homoplasy', meaning that the same feature is observed in two unrelated or
distantly related taxa. In the context of phylogenetic inferences, homoplasies inflict tree-incompatible signals, seemingly reticulate patterns originating from a tree-like evolution. Structural (or other) linguistic data and
phenotypical biological data have a lot in common: complex processes
are boiled down to mere absence or presence of features (or traits, as they are called in biology).
Figure 3: Basic evolutionary processes we need to consider when looking at linguistic data, or at biological traits, when we replace simplification by adaptive evolution (positively selected traits).
If we check the features in our table above and
ask to which degree they can be used to model these processes (see also
David's last post on illogic in phylogenetics), e.g. to simply distinguish between similarity by chance, relatedness, or
secondary contact (mixing), we can easily see that they are by no means
optimal for evolutionary investigations. This is not necessarily because of the processes they involve, but more
importantly because of the data sampling, which makes modeling
almost impossible, with each character needing its own model.
As an example, take the feature p-6 in our
table. Whether or not a language makes a distinction between "arm"
and "hand" does not seem to follow specific geographic or
genealogical patterns. The following figure shows a plot from the CLICS
database (List et al. 2018), visualizing the most frequently
recurring polysemies (or colexifications) centering
around the concept "arm". The full visualization can be explored interactively in CLICS,
for instance by hovering with the mouse over the link between "arm"
and "hand" (marked in green in the figure below).
Figure 4: Colexification network in the CLICS database.
From eye-balling the data, it is hard to find a
consistent geographic / language-family pattern, which suggests that
the feature p-6 is likely to show a high degree of homoplasy in the
languages of the world. Obviously, different people decided not
to distinguish between "hand" and "arm". But the example of the
Sami languages in northern Scandinavia also demonstrates that some
people using related, long-isolated languages consistently don't
make the distinction. Here, the homoplasy is inherited
(lineage-conserved). A biological analogy would be the rarely applied
distinction between a 'convergence' (a trait evolved independently
in different lineages) and a 'parallelism' (a trait is
expressed by different but not all members of the same lineage).
Figure 5: Geographic distribution of arm/hand colexifications in the CLICS database.
A specific analogy to the "hand-arm" colexification / differentiation pattern is
leaf shedding in oaks and their relatives (Fagaceae, the beech family).
Some oak lineages (section Cerris of
oaks, beech trees, chestnuts) are essentially or strictly deciduous,
others (sections Cyclobalanopsis, Ilex,
the sister sections of Cerris; Castanopsis,
the sister genus of chestnuts) are always evergreen, and the biggest
group (number of species) of all Fagaceae, subgenus Quercus,
includes evergreen (1 section), mixed (the two by far largest sections), and
deciduous (1 nearly extinct section) sublineages. To some extent this is linked to the climate in
which the species thrive (high latitudes and/or per-humid =
deciduous, low latitude and/or seasonally dry = evergreen), but
consistently evergreen and deciduous lineages do co-exist.
Looking at the Chinese dialects, we see that p-6
represents a trivial split in the network.
Figure 6: A Neighbor-net inferred from the Szeto et al. matrix. Dialects that distinguish "arm" and "hand" are shown with filled dots ('1' for character 6 in the matrix), those that don't ('0') with empty dots. We can draw a single line separating all don't- from do-taxa (dialects), i.e. a bipartition of the taxon set fitting the character partition seen in (p-)6.
But, given the general patterning of the feature on a global scale, does this really mean that it is inherited — that is, a
good feature to reflect relatedness?
Whether a feature is likely to be
homoplastic is just one part of the story. Linguists typically have
more information about how things change than do biologists, putting a
double-edged sword in their hands (that they hardly ever use). Asking whether "hand"
and "arm" are expressed by distinct words does not
consider the underlying processes. Here, we can assume at least three
different character states, namely:
- "arm" and "hand" are expressed by the same word, which is the
original word for "arm",
- "arm" and "hand" are expressed by the same word, which is the
original word for "hand", and
- "arm" and "hand" are expressed by different words.
We could even have a fourth state, in which a single word was always used, throughout the whole long history of the ancestral languages, to express "arm or hand" (i.e., both body parts). No differentiation and no later generalization from either "arm" or "hand" took place.
Figure 7: Left, current scoring; right, scoring taking into account the actual mutation process.
From Ancient Chinese, we know that "1" (Yes, I do differentiate between "arm" and "hand") was most likely the original state. We can further assume that once the
distinction is dropped, it is less likely to come back again (although this can, of course, also happen). That is, our model involves two possible mutations (vertical process): we lose the word for "arm" due to its replacement by "hand", or we lose the word for "hand" due to its replacement by "arm", each with its own
probability.
Figure 8: Probability distribution for transitions involving "hand" and "arm".
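One way to make this model concrete is a small transition matrix over three states. The sketch below is only an illustration under stated assumptions: the state names, the function names `transition_matrix` and `evolve`, the discrete time steps, and all probability values are hypothetical and are not estimated from the Szeto et al. data.

```python
# States: the distinction is kept, or only one of the two words survives.
STATES = ("two words", "only 'hand'", "only 'arm'")

def transition_matrix(p_arm, p_hand, p_regain=0.0):
    """Row-stochastic matrix; rows and columns follow STATES.

    p_arm    : probability per step of losing the word for "arm"
    p_hand   : probability per step of losing the word for "hand"
    p_regain : probability per step of re-establishing the distinction
    """
    if min(p_arm, p_hand, p_regain) < 0 or p_arm + p_hand > 1 or p_regain > 1:
        raise ValueError("probabilities must lie in [0, 1]")
    return [
        [1 - p_arm - p_hand, p_arm, p_hand],
        [p_regain, 1 - p_regain, 0.0],
        [p_regain, 0.0, 1 - p_regain],
    ]

def evolve(distribution, matrix, steps):
    """Propagate a probability distribution over STATES for a number of steps."""
    size = len(matrix)
    for _ in range(steps):
        distribution = [
            sum(distribution[i] * matrix[i][j] for i in range(size))
            for j in range(size)
        ]
    return distribution

matrix = transition_matrix(p_arm=0.05, p_hand=0.02)  # hypothetical values
start = [1.0, 0.0, 0.0]  # the original state: both words are in use
after = evolve(start, matrix, steps=10)
for state, probability in zip(STATES, after):
    print(f"{state}: {probability:.3f}")
```

With `p_regain` left at 0, the two merged states are absorbing, which mirrors the assumption that a dropped distinction rarely comes back.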
The probability of whether a mutation occurs, and which one,
relates to four principal driving factors:
- probability of random loss (mutation)
- probability of random gain (mutation)
- global linguistic tendencies
- regional socially-enforced preference
Establishing p-arm (loss of "arm") and p-hand (loss of "hand") is not trivial,
because they may be affected by what the words for "arm" and "hand" are (for simplicity we will assume that
p+arm and p+hand are close to 0). We could expect a higher
tendency to keep the word that is easier to pronounce or less easy to confuse with other words and, hence, is easier to understand. If two dialects with
different states come into contact, this may also influence the decision to take over a
state or not. In everyday language, a distinction between "arm"
and "hand" may be useless because of the clear context in which both words
are used, so p1-word > p2-words.
However, closeness to administration
centers or areas with a higher percentage of educated people could
decrease p1-word,
because it may be considered a sign of poor social standing to not
make the difference between "arm" and "hand".
Figure 9: Vertical and horizontal processes involving transitions of "hand" and "arm".
Estimating p can only be left to phylogenetic algorithms (unless more detailed
information is available). But we can (and should) design the
questionnaire to capture as many of the processes as possible. In
this case, this means not only asking whether there is a distinction between "arm" and "hand", but also finding out whether the word "arm"
or "hand" is used, e.g. by using two questions/binary characters:
- Do we use "hand"?
- Do we use "arm"?
Note that these questions require a good deal of knowledge about the
languages under investigation, since it may not be trivial to find out
what was the "original" word for "arm" or "hand".
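The recoding can be sketched as follows. The mapping name `RECODING` and the helper `single_character` are hypothetical; the point is only that the single yes/no character collapses two different histories into the same "0", while the two-character coding keeps them apart.

```python
# The three states from the text, recoded as two binary characters:
# (is the original word for "hand" in use?, is the original word for "arm" in use?)
RECODING = {
    "different words": (1, 1),
    "one word, originally 'hand'": (1, 0),
    "one word, originally 'arm'": (0, 1),
}

def single_character(uses_hand, uses_arm):
    """Original scoring: 1 only if both words are in use (distinction kept)."""
    return int(uses_hand == 1 and uses_arm == 1)

for state, (uses_hand, uses_arm) in RECODING.items():
    print(state, (uses_hand, uses_arm), single_character(uses_hand, uses_arm))
```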
Therefore, a further step would be to replace the binary
characters by a value measuring the similarity between the words used
for "hand" and those used for "arm". One could again argue that adding this information
would add historical information to the feature, but it is clear that
the abstract nature of the question is hiding important phylogenetic
(and also typological) information from us.
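A similarity value of this kind could, for instance, be based on a normalized edit distance between the two word forms. The sketch below is one possible choice, not a measure used in the post; it compares plain strings, whereas real applications would compare phonetic transcriptions, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """1.0 for identical forms (colexification), lower for more distinct forms."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

# English spellings are used purely as placeholder strings.
print(similarity("hand", "arm"))   # 0.25
print(similarity("hand", "hand"))  # 1.0
```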
It therefore seems that,
instead of asking whether or not there is a distinction between "arm"
and "hand", it would make much more sense to trace the
cognacy (or homology) of the expressions for "arm" and
"hand" across all taxa (languages, dialects), and to think of ways in which this
could be scored and modeled by phylogenetic analyses. The structural data
framework with its features based on simple yes-no questions
therefore inevitably leads to a misinterpretation of processes when
analyzing the data with phylogenetic software.
The need for exploratory data analysis
In reality, structural (or other) data sets in
linguistics face problems similar to the ones palaeontologists face when trying to
establish phylogenetic relationships between fossils (extinct organisms): the probability
of a mutation (visible change) is largely unknown, and differs not
only from character to character but also within the same character.
A state 0, 1, 2 etc. may have a higher probability to manifest (or
get lost) in one lineage than in another.
In addition, the linguistic problems resemble those of biologists working close to and below the species level (see also
Guido's post on population dynamics and individual-based fossil phylogenies):
reticulation is the rule rather than the exception, as similarity is
triggered by contact, so that horizontal processes, not inheritance, may dominate evolutionary dynamics. Thus, the diversity pattern cannot be modeled by a tree alone. Establishing explicit probabilistic frameworks
to deal with this may be not only difficult but even impossible (given the available
data). Meanwhile, however, one can embrace exploratory data analysis as a heuristic tool.
So, let's look at the example. As in the original paper, we used the binary matrix of the 21 characters to infer a planar, 2-dimensional
(meta-)phylogenetic network, a Neighbor-net splits graph. The resulting graph is a longitudinally inflated spider-web, with its endpoints defined by the southern
Chinese dialects (e.g. Guangzhou, Nanning,
Taishan) and the north-central (e.g. Linxia and Xining) dialects. The latter
are significantly closer (geographically and data-wise) to the Beijing
version of Chinese.
Figure 10: The Neighbor-net based on simple mean (Hamming) pairwise binary character distances
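For readers unfamiliar with the distance used here, the following sketch computes mean Hamming distances between binary character vectors. The three "dialects" and their five character states are invented for illustration and are not taken from the Szeto et al. matrix.

```python
from itertools import combinations

def mean_hamming(a, b):
    """Proportion of characters in which two equally long binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of characters")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical toy matrix: three taxa, five binary characters each.
toy_matrix = {
    "dialect A": [1, 0, 1, 1, 0],
    "dialect B": [1, 1, 1, 0, 0],
    "dialect C": [0, 1, 0, 0, 1],
}

distances = {
    (first, second): mean_hamming(toy_matrix[first], toy_matrix[second])
    for first, second in combinations(toy_matrix, 2)
}
for pair, distance in distances.items():
    print(pair, distance)
```

A pairwise distance matrix of this kind is the input that a Neighbor-net implementation would take.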
The first thing to note is that the matrix
includes dialects that are indistinguishable from each other (green stars) for all 21 characters, and some
that are geographically and data-wise very similar to each other, while being
distinct from all others (green ovals). In biology, we call this (taxic, lineage-)coherence. In addition to Linxia and Xining, we have
Nanchang and Lichuan characterized by elongated ('tree-like')
terminal edge-bundles. These obviously represent closely related
dialects sharing a long(er) common history.
Others have more than one possible closest
relative. For instance, Liuzhou may share quite a few features with
Guangzhou, but it is equally close to the Nanchang-Lichuan pair (yellow fields).
Dongtai (orange star) is unique, but its 'neighborhood' (orange-ish brackets), as defined by shared
edge-bundles, includes Changsha (which again is most closely related to Jiujiang) and Taiyuan plus Baotou, the
latter two substantially closer to the Beijing (red star) group.
Similar to Dongtai, and also connected to the
central part of the graph, are dialects with long terminal branches
(edges). Hefeng (blue star) is substantially different from Dongtai, and only has
one further dialect in its neighborhood (blue bracket), Wangrong, a close relative
of the Beijing group. The Wuhan, Chengdu, and Guiyang (gray field) dialects appear,
on the other hand, to be completely isolated.
As explained above, there are different
processes, vertical and horizontal ones, that may trigger similarity,
and we want to get an idea as to which character may be influenced by which
process. From the graph, several aspects are obvious:
- geographic closeness plays a major role,
- the signal provided by the data is not
tree-like,
- the data is highly homoplastic, and includes internal
conflict.
Not so obvious is whether this situation is due to random or
evolutionarily directed similarity, or to reticulation. Since the graph is planar, and puts the Chinese dialects in a circular
order, we can order the character matrix accordingly to see how the
traits form groups (which could be called cliques in this context). In the next step, we can then map each
character onto this network, to see how well it fits with the overall
similarity pattern. We showed this above for p-6 (hand-arm distinction, one split), and here we add a
character with quite a poor fit, p-17 (syntactic-diachronic), "give"
as a disposal marker.
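A quick, rough check of how well a binary character fits the circular ordering is to count the contiguous blocks of "1" states along the circle. A character whose "1" taxa form a single block corresponds to a split that can be drawn as one line across the planar graph; several blocks indicate poorer fit. This is a heuristic sketch, not the mapping procedure used for the figures, and the block count is not the same thing as the minimum number of changes on the network. The function name and the example state vectors are hypothetical.

```python
def count_blocks(states):
    """Number of maximal runs of 1s when the sequence is read as a circle."""
    n = len(states)
    if n == 0:
        return 0
    if all(state == 1 for state in states):
        return 1
    # A block starts wherever a 1 follows a non-1; index -1 wraps around the circle.
    return sum(1 for i in range(n) if states[i] == 1 and states[i - 1] != 1)

# Hypothetical characters, listed in the circular order of the taxa.
good_fit = [1, 1, 0, 0, 0, 1]  # the 1s wrap around the circle: a single block
poor_fit = [1, 0, 1, 0, 0, 1]  # the 1s are scattered: two blocks

print(count_blocks(good_fit))  # 1
print(count_blocks(poor_fit))  # 2
```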
Figure 11: Character mapping for p-17 (filled dots, "give" used as disposal marker; empty, not used), with the p-6 split indicated as well. Red, splits (taxon bipartitions defined by character cliques) that have no corresponding edge-bundle (neighborhood); blue, splits with neighborhood; green, unique, isolated change (deviation from the rule) within the neighborhood.
The number of inferred mutations in the map follows Ockham’s Razor, upon which parsimony
(tree and network) inference relies as well. Using such a map, we can even provide an
estimate for how likely (qualitatively speaking) a change is, under the
assumption that neighborhoods in the graph represent either
exchange (homogenization) between closely related dialects or
inheritance, reflecting both horizontal and vertical relatedness.
Mapping characters on a 2-dimensional network allows finding a scenario
beyond a single tree hypothesis.
For p-6, we need just one change (i.e. loss in
all more south-bound dialects), but we don't find an edge bundle
corresponding to this unique change. Given what we discussed above about
p-6, there may well have been more independent losses than the single reconstructed
one. Social preference or general contact favoring the retention of the primitive state of having two
words could explain why dialects closer to the
Beijing dialect area have a "1", although not all are closely related
in general.
For p-17, we need at least four (independent)
changes from "0" → "1", two of which have a corresponding
edge bundle (blue, Nanchang plus Lichuan, Changsha plus Dongtai), one
isolated (green, Luoyang), and one without a corresponding edge bundle
(Wuhan and Hefeng
dialects). The (equally parsimonious) alternative for p-17 would be a series of gains and
losses, with the same number of steps:
Figure 12: Alternative scenario for p-17.
This is where one needs to consider additional
knowledge about the probability of getting or retaining a certain
feature. The state shared by most dialects across the entire
net is “0”, irrespective of overall similarity, which would make
it a natural pick for the primitive state. Thus, assuming four (or more) changes from 0 → 1 (acquisition of the queried feature) appears more plausible than assuming two independent acquisitions (starting with the Beijing group; note that the position of the root will not change the number of needed changes), followed by a loss (1 → 0) in many southbound dialects and a re-gain (0 → 1) in the Nanchang + Lichuan dialects.
The same assessment can be made for all of the characters, and we end
up with something like this:
Figure 13: Fully annotated split network of the data. Changes relating to edge-bundles are colored accordingly; arcs indicate changes without a corresponding edge-bundle. Note the prominent yellow split that defines a neighborhood of dialects most similar to the Beijing dialect, although there is no character supporting this edge. The rather poor fit of many character splits (cliques) with edge-bundles relates to the fact that we visualize a highly complex diversification (multi-dimensional processes) using a planar, 2-dimensional graph.
While this figure may be confusing at first sight, it comprehensively shows what the characters contribute to the overall
graph. We can discriminate more-likely from less-likely mutations (how many changes are needed at least), but also the character assemblies shared by putatively closely related dialects.
- p-3 and p-11 are typical features of Guangzhou and
allied dialects within the southern Chinese complex. p-3 is also present in Lichuan, and p-11 in Jixi (thus in not so distant dialects).
- Features p-6 to p-9, p-16, and p-19 form a diagnostic suite
for the Guangzhou dialects and other dialects related to them in
one way or another, and distinguish them from, e.g., the Beijing
group.
- The latter, the Beijing group, has fewer diagnostic
character assemblies. One characteristic sequence could be p-1, p-2,
p-12, p-14, but this includes three features with a minimum of 3+
changes. Similarity here is mostly the result of a lack of
(potentially) derived features (hence the character-unsupported yellow edge-bundle defining a Beijing-including neighborhood).
Outlook and summary
In this re-investigation, we have, once more, commented on the problems we see with the use of structural features for the purpose of historical language comparison and phylogenetic reconstruction. We see the major problems in the (often) unfortunate choice of questions, resulting in the elicitation of features that cannot be easily modeled with current software for phylogenetic analyses. It is important to keep in mind, in linguistics and phylogenetics, that we can infer trees or networks based on data of no matter what quality and information content. But before we present the result, we should have taken a look at the primary data.
- Does it fit with the resulting graph, or not?
- Where does it fit, and where not?
In the context of our critique of linguistic questionnaires, the mapping strategy discussed above opens a potential avenue to identify:
- stable / unstable features (geographically or evolution-wise) and
- coherent / incoherent features.
Based on this, we can then ask to which degree language (or dialect) groups influenced, stabilized or modified each other through geographic proximity.
Inference-wise, the natural next step would be to use the information about the minimum number of necessary changes to counter-weight characters. This would eventually allow the use of median-network (and related) approaches on the data, which are currently the only way to explicitly identify ancestors using phylogenetic reconstructions. With the current matrices, the extreme homoplasy makes an unweighted application of median networks and related methods impossible.
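What such a counter-weighting could look like is sketched below. The inverse weighting (weight = 1 / minimum number of changes) is an assumption chosen for simplicity, not a scheme proposed in this post, and the change counts and state vectors are invented for illustration.

```python
def weights_from_changes(min_changes):
    """Down-weight characters that need many changes on the network."""
    return [1 / max(1, changes) for changes in min_changes]

def weighted_distance(a, b, weights):
    """Weighted proportion of differing characters between two binary vectors."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    differing = sum(w for x, y, w in zip(a, b, weights) if x != y)
    return differing / sum(weights)

# Hypothetical minimum numbers of changes for three characters.
weights = weights_from_changes([1, 4, 2])
print(weights)  # [1.0, 0.25, 0.5]

# Two hypothetical taxa differing in the first and third character.
print(weighted_distance([1, 0, 1], [0, 0, 0], weights))
```

Characters with many inferred changes then contribute less to the pairwise distances, which is the general idea behind weighting homoplastic characters before applying median-network approaches.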
References
Forkel, R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics.
Scientific Data 5.180205: 1-10.
List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In:
Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 2393-2400.
List, J.-M., M. Walworth, S. Greenhill, T. Tresoldi, and R. Forkel (2018) Sequence comparison in computational historical linguistics.
Journal of Language Evolution 3.2: 130–144.
Norman, J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.): The Sino-Tibetan languages. Routledge: London and New York, pp. 72-83.
Szeto, P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach.
Linguistic Typology 22.2: 233-275.
Supplementary data
The data we used to create the analyses and figures provided in this post are available at
https://github.com/cldf-datasets/szetosinitic/tree/master/examples