Monday, September 24, 2018

Structural data in historical linguistics

The majority of historical linguists compare words to reconstruct the history of different languages. However, in phylogenetic studies focusing on cognate sets reflecting shared homologs across the languages under investigation, there exists another data type that people have been trying to explore in the past. The nature of this data type is difficult to understand for non-linguists, given that it has a very abstract nature. In the past, it has led to a considerable amount of confusion both among linguists and among non-linguists who tried to use this data for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost comes along with two example datasets that we converted into a computer-readable format (with much help from David), since the original papers only offered the data as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will try to explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No.  Feature                                           Beijing  Suzhou  Meixian  Guangzhou
1    The third person pronoun is tā, or cognate to it  +        -       -        -
4    Velars palatalize before high-front vowels        +        +       -        -
7    The qu-tone lacks a register distinction          +        -       +        -
12   The word for "stand" is zhàn or cognate to it     +        -       -        -

In this example, the data is based on a questionnaire that poses specific questions, and for each of the languages in the sample, the dataset answers each question with either + or -. Many of these datasets are binary in nature, but this is not a necessary condition: questionnaires can also query categorical variables. The major type of word order, for example, might have three categories (subject-object-verb, subject-verb-object, or other).
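To make this structure concrete, here is a minimal Python sketch that stores the excerpt from Norman's table as a language-by-feature matrix. The names (`RAW`, `to_matrix`, `check_categorical`) are my own choices for illustration and are not part of any published dataset or library. The word-order check only demonstrates what a categorical variable looks like; it is not applied to any real language here.

```python
# Excerpt of Norman's (2003) table as shown above.
LANGUAGES = ["Beijing", "Suzhou", "Meixian", "Guangzhou"]

FEATURES = {
    1: "The third person pronoun is tā, or cognate to it",
    4: "Velars palatalize before high-front vowels",
    7: "The qu-tone lacks a register distinction",
    12: 'The word for "stand" is zhàn or cognate to it',
}

# One string per feature, one symbol per language (same order as LANGUAGES).
RAW = {
    1: "+---",
    4: "++--",
    7: "+-+-",
    12: "+---",
}


def to_matrix(raw, languages):
    """Return {language: {feature_id: bool}} from rows of '+'/'-' symbols."""
    matrix = {lang: {} for lang in languages}
    for feature_id, row in raw.items():
        if len(row) != len(languages):
            raise ValueError(f"feature {feature_id}: wrong number of values")
        for lang, symbol in zip(languages, row):
            if symbol not in "+-":
                raise ValueError(f"feature {feature_id}: unknown symbol {symbol!r}")
            matrix[lang][feature_id] = symbol == "+"
    return matrix


# Hypothetical categorical variable: three admissible word-order categories.
WORD_ORDER_CATEGORIES = ("SOV", "SVO", "other")


def check_categorical(value, categories=WORD_ORDER_CATEGORIES):
    """Accept a value only if it is one of the predefined categories."""
    if value not in categories:
        raise ValueError(f"{value!r} is not one of {categories}")
    return value


MATRIX = to_matrix(RAW, LANGUAGES)

if __name__ == "__main__":
    for lang in LANGUAGES:
        present = [fid for fid, value in MATRIX[lang].items() if value]
        print(lang, present)
```

Binary features become booleans, while a categorical feature is simply a value drawn from a closed list of categories.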

We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts, List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts that are useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or on syntax, opens up an enormous number of different possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.

Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, presence or absence of articles, etc.), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets". The term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected most famously in the World Atlas of Language Structures (Dryer and Haspelmath 2013).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman we saw above) also use explicit "historical" (diachronic in linguistic terminology) questions in their questionnaires. In his paper describing the dataset, Norman defends this practice, arguing that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.

The extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation very difficult. This becomes especially evident when the data is used in combination with computational methods for phylogenetic reconstruction, and it is problematic for two major reasons.
  1. Since questions are by nature less restricted regarding their content, scholars can easily pick and choose the features in such a way that they confirm the theory they want them to confirm rather than testing it objectively. Since scholars can select suitable features from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection. 
  2. If features are mixed, phylogenetic methods that work on explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical. While a feature like "the language has an article" may be interpreted as a gain-loss process (at some point, the language has no article, then it gains the article, then it loses it, etc.), features showing the results of processes, like "the words that originally started in [k] followed by a front vowel are now pronounced as []", cannot be interpreted as a gain-loss process, since the feature itself already describes a process.
For these reasons, all phylogenetic studies that make use of structural data, in contrast to purely lexical datasets, should be treated with great care, not only because they tend to yield unreliable results, but more importantly because they are extremely difficult to compare across different language families, given that scholars have far too much freedom when compiling them. Feature collections provided in structural datasets are an interesting resource for diversity linguistics, but they should not be used to make primary claims about external language history or subgrouping.
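The first problem can be illustrated with the four features from Norman's table shown earlier. The following Python sketch groups the dialects by their values on a chosen subset of features; the function name `partition` and the data layout are my own assumptions for illustration. With only four features this is a toy example, not an analysis.

```python
from collections import defaultdict

LANGUAGES = ["Beijing", "Suzhou", "Meixian", "Guangzhou"]

# Feature number -> one '+'/'-' symbol per language (same order as LANGUAGES),
# copied from the excerpt of Norman's (2003) table above.
RAW = {
    1: "+---",
    4: "++--",
    7: "+-+-",
    12: "+---",
}


def partition(selected):
    """Group languages that show identical values on the selected features."""
    groups = defaultdict(list)
    for index, lang in enumerate(LANGUAGES):
        pattern = "".join(RAW[feature][index] for feature in selected)
        groups[pattern].append(lang)
    return sorted(groups.values())


if __name__ == "__main__":
    # Choosing feature 4 alone pairs Beijing with Suzhou ...
    print(partition([4]))      # [['Beijing', 'Suzhou'], ['Meixian', 'Guangzhou']]
    # ... while feature 7 alone pairs Beijing with Meixian.
    print(partition([7]))      # [['Beijing', 'Meixian'], ['Suzhou', 'Guangzhou']]
    # Features 1 and 12 together isolate Beijing from all others.
    print(partition([1, 12]))  # [['Beijing'], ['Suzhou', 'Meixian', 'Guangzhou']]
```

Depending on which features are selected, the same four dialects fall into conflicting groups, which is exactly why an unconstrained choice of features is hard to defend as objective.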

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.

Both datasets target Chinese dialect classification, the former being proposed by Norman (2003), and the latter reflecting a new data collection that was recently used by Szeto et al. (2018) to propose a North-South split of dialects of Mandarin Chinese with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to collect various structural datasets that have been published in the literature in the past, and to give those people interested in the data, be it for replication studies or to test alternative approaches, easy access to the data in various formats.
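Neighbor-Net operates on a matrix of pairwise distances, so a structural dataset first has to be turned into distances between language varieties. As a rough sketch, the following Python code computes normalized Hamming distances (the share of features on which two varieties differ) for the excerpt of Norman's data shown earlier. Note that the choice of this distance is my assumption for illustration; I am not claiming it is the measure used by Szeto et al. (2018). The names `PROFILES` and `hamming` are likewise made up for this example.

```python
from itertools import combinations

LANGUAGES = ["Beijing", "Suzhou", "Meixian", "Guangzhou"]

# One '+'/'-' symbol per feature (features 1, 4, 7, 12 of Norman's table).
PROFILES = {
    "Beijing": "++++",
    "Suzhou": "-+--",
    "Meixian": "--+-",
    "Guangzhou": "----",
}


def hamming(a, b, missing="?"):
    """Share of differing features, ignoring positions with missing data."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    compared = [(x, y) for x, y in zip(a, b) if missing not in (x, y)]
    if not compared:
        raise ValueError("no comparable features")
    return sum(x != y for x, y in compared) / len(compared)


# Distances for every unordered pair, keyed in the order given by LANGUAGES.
DISTANCES = {
    (a, b): hamming(PROFILES[a], PROFILES[b])
    for a, b in combinations(LANGUAGES, 2)
}

if __name__ == "__main__":
    for (a, b), distance in DISTANCES.items():
        print(f"{a:<10}{b:<10}{distance:.2f}")
```

A distance matrix of this kind can then be handed to software that implements Neighbor-Net.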

The basic format follows the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples of best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages being linked to the Glottolog database, Hammarström et al. 2018) and also to Nexus format. The datasets are versioned and may be updated in the future. Interested readers can study the code used to generate the specific data format from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
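To give a rough idea of what such a conversion involves, here is a small Python sketch that reads a table of values and writes a Nexus data block. The column names (ID, Language_ID, Parameter_ID, Value) follow the default value table of a CLDF structure dataset as I understand the specification; the row identifiers and the inline data are made up from the excerpt of Norman's table, and the actual files in the collection may be organized differently. This is not the conversion code used for the published datasets.

```python
import csv
import io

# Hypothetical value table, built from the excerpt of Norman's (2003) table.
VALUES_CSV = """ID,Language_ID,Parameter_ID,Value
Beijing-1,Beijing,1,+
Beijing-4,Beijing,4,+
Beijing-7,Beijing,7,+
Beijing-12,Beijing,12,+
Suzhou-1,Suzhou,1,-
Suzhou-4,Suzhou,4,+
Suzhou-7,Suzhou,7,-
Suzhou-12,Suzhou,12,-
Meixian-1,Meixian,1,-
Meixian-4,Meixian,4,-
Meixian-7,Meixian,7,+
Meixian-12,Meixian,12,-
Guangzhou-1,Guangzhou,1,-
Guangzhou-4,Guangzhou,4,-
Guangzhou-7,Guangzhou,7,-
Guangzhou-12,Guangzhou,12,-
"""


def read_values(text):
    """Return {language: {parameter: value}} from CSV text."""
    matrix = {}
    for row in csv.DictReader(io.StringIO(text)):
        matrix.setdefault(row["Language_ID"], {})[row["Parameter_ID"]] = row["Value"]
    return matrix


def to_nexus(matrix):
    """Write a Nexus DATA block; '+' becomes 1, '-' becomes 0, gaps become '?'."""
    taxa = list(matrix)
    params = sorted({p for values in matrix.values() for p in values}, key=int)
    recode = {"+": "1", "-": "0"}
    width = max(len(taxon) for taxon in taxa)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={len(params)};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    for taxon in taxa:
        states = "".join(recode.get(matrix[taxon].get(p), "?") for p in params)
        lines.append(f"    {taxon.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


NEXUS = to_nexus(read_values(VALUES_CSV))

if __name__ == "__main__":
    print(NEXUS)
```

The recoding step matters because "-" conventionally marks a gap in Nexus files, so absence has to be written as 0 rather than copied over from the table.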

Final remarks on publishing structural datasets online

Given that we provide only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.

The truth is that the situation in historical linguistics and language typology has for a very long time been very unsatisfactory. Most of the research based on data did not supply the data with the paper, and often authors directly refuse to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered by providing data only in PDF format in tables inside the paper (or even worse: long tables in the supplement of a paper), which forces scholars wishing to check a given analysis themselves to reverse-engineer the data from the PDF. That data is provided in a form difficult to access is not even necessarily the fault of the authors, since some journals restrict the form of supplementary data to PDF only, giving authors who wish to share their data in an appropriate form a difficult time.

Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF and Nexus formats, in which the two Chinese datasets are now published in this open repository collection, may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers that make use of structural datasets would submit their data in CLDF and Nexus format, similar to the practice in biology where scholars submit data to GenBank (Benson et al. 2013), so that their colleagues can easily build on their results and test them for potential errors.


References

Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Res. 41.Database issue: 36-42.

Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.

Campbell L. and W. Poser (2008) Language classification. History and method. Cambridge University Press: Cambridge.

Cathcart C., G. Carling, F. Larsson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dryer M. and M. Haspelmath (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.

Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.

Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945-959.

Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.


  1. Since questions are by nature less restricted regarding their content, scholars can easily pick and choose the features in such a way that they confirm the theory they want them to confirm rather than testing it objectively.

    I'm a phylogeneticist in biology. We cannot avoid accidental sampling bias – except by simply making our datasets large enough. As long as we don't end up with redundant characters (i.e. the same character twice in different wordings), this works; there are simulation studies to show that.

    Total-evidence approach!

    If features are mixed, phylogenetic methods that work on explicit statistical models (like gain and loss of character states, etc.) may often be inadequate to model the evolution of the characters, especially if the characters are historical.

    Then use parsimony. The behavior of parametric methods (maximum likelihood, Bayesian inference) when given datasets with realistic distributions of missing data is not well understood anyway; it hasn't really been tested.

    (I have a section discussing that in this preprint.)

    Most of the research based on data did not supply the data with the paper, and often authors directly refuse to share the data when asked after publication (see also the post on Sharing supplementary data).

    ...Huh. The journals I've published in require authors to publish their data.

    1. Thanks a lot for your comments. As your surprise in the last part shows, the situation in linguistics is at times a bit different from biology, although I think that we are making progress. Our datasets, however, are still very small (a dataset with 200 characters per taxon is already close to being considered a large one), while on the other hand, parsimony has a bad reputation in our field, so that barely any study published in the last 10 years has tested it against the Bayesian methods that are generally considered to be more robust. But we'll see what the future brings, especially if, as I hope, more data of different kinds is shared publicly in easily accessible formats, so people can play with the data and test its usefulness for phylogenetic reconstruction.