Friday, August 17, 2018

Distinguishability in Phylogenetic Networks, photos

Evidence that we were in the Netherlands.

Evidence that we did some work.

Left to right: Steven Kelk, David Morrison, Mike Steel, Philippe Gambette (obscured), Tiago Tresoldi, Claudia Solis-Lemus, Fabio Pardi, Simone Linz, Mark Jones.

Left to right: David Morrison, Cecile Ané, Philippe Gambette (obscured), Katharina Huber, Leen Stougie, Remie Janssen, Yukihiro Murakami, Mattis List, Gereon Kaiping and Charles Semple.

Left to right: David Morrison (obscured), Axel Janke, Steven Kelk, Charles Semple, Claudia Solis-Lemus, Mark Jones (obscured), Fabio Pardi, Leo van Iersel, Simone Linz and Vincent Moulton.

Céline Scornavacca lectures Cecile Ané.

Axel Janke and Leo van Iersel contemplate methods for infering hybridization.

Philippe Gambette and Guido Grimm.

Mozes Blom and Jim Whitfield.

Mike Steel and Luay Nakhleh.

Luay delivers his Final Message, to Mozes Blom, Cecile Ané, Katharina Huber and Charles Semple.

Monday, August 13, 2018

Workshop: Distinguishability in Phylogenetic Networks

This week we are back in Leiden (in the Netherlands), for the third workshop sponsored by the Lorentz Center. The first workshop, in October 2012, is discussed in this blog post: Workshop: The Future of Phylogenetic Networks. The second one, in July 2014, is discussed here: Workshop: Touching the Data.

I say "we" because all of the blog authors are attending, making up nearly one-quarter of the participants. As before, it has been organized by Steven Kelk, Leo van Iersel, and David Morrison, this time along with Céline Scornavacca. The program and abstracts can be found here. It runs for the whole week 13 August – 17 August 2018.

The workshop is similar to the previous one, in that it is intended to be a small and well-focused event. The basic aim, as before, is to get biologists and computational people to sit down in a small group and actually talk about real phylogenetic issues. The main issue at hand this time can be called "indistinguishability", which significantly complicates the reconstruction, analysis and interpretation of phylogenetic networks. This includes the problems of: (i) distinguishing horizontal from vertical descent, (ii) distinguishing among reticulate processes, (iii) distinguishing reticulate evolution from incomplete lineage sorting, and (iv) distinguishing among network topologies.

You will note that we seem to be in the Netherlands in World Cup years. This is quite safe this year, but last time the workshop was in July, and my wife and I traveled through Germany at the end of the campaign, which was an "interesting" experience.

I am hoping to add some blog posts based on what happens at the workshop, either as it proceeds or at the end.

Monday, August 6, 2018

Trivial data, but not so trivial graphs

One may expect that perfectly compatible, trivial data will lead to perfect trees that are trivial to interpret. And this may really be the case when phylogenetics is restricted to contemporary taxa and molecular data. Adding to various earlier posts that deal with data patterns and their representation in inference graphs (e.g. Networks can outperform PCA..., Stacking neighbour-nets..., Clades, cladograms, cladistics ... and networks ...), I will show in this post what we get when we deal with very trivial, straightforward to interpret, data.

Two trivial scenarios: a linear and a dichotomous evolutionary sequence

The virtual data matrix for our experiment comprises seven taxa (OTUs) from different time scales and six binary (Dollo) characters. There are two historical scenarios that are supported by patterns in the data (see the first figure).

The linear scenario has a mother taxon that evolves by acquiring a unique, persistent trait, and is replaced by its daughter taxon through time. In contrast, the dichotomous scenario has two subsequent events of cladogenesis: the all-ancestor A splits into two taxa (B, E), each defined by a unique change in a binary character passed on to their descendants. B and E then underwent a second cladogenetic event, giving rise to C+D and F+G.

The resultant data matrices have different properties. In the case of the linear evolution, all changes lead to synapomorphies sensu Hennig (characters #1–#5) along with one terminal autapomorphy of the latest member of the lineage, G (character #6).

In the case of the dichotomous evolution, we have two synapomorphies supporting the BCD and EFG clades (characters #1, #4), respectively, and four autapomorphies (each one for C, D, F and G, the youngest set of taxa).

The following figure shows the character-based splits (taxon bipartitions) for the linear evolution scenario:
(Trivial splits, one taxon separated from all others, in blue)

Reconstructing the (true) evolutionary pathway is trivial based on this perfect split pattern, especially if we know that A is the oldest taxon and G the youngest.

It's equally straighforward for our second scenario, with perfectly dichotomous evolution:

Character 1 and character 4 define taxon cliques comprising B,C,D and E,F,G. The remaining characters indicate that C,D and F,G derive from B and E, respectively.

Explicit inferences

As stated above, the data properties for both scenarios are different. The matrices have a different number of parsimony-informative characters (4 for linear, 2 for dichotomous). Accordingly, the reconstructed optimal trees (here using the maximum parsimony, least-squares, and maximum likelihood criteria), are better resolved / more correct for the linear than for the dichotomous evolution.

MPT = most-parsimonious tree; ML = maximum likelihood. *Corrected for ascertainment bias.

Using all of the variable characters, NJ and ML are generally more decisive and produce higher support for the right branches. But for the dichotomous evolution scenario, they also show ghost-clades ("para-clades" as they include close relatives sharing a recent common origin, but do not represent monophyletic groups sensu Hennig) with low support. The corresponding MPT has no ghost-clades, but it also provides no clues to how B,C,D and E,F,G are related to each other.

Beyond this, and as can be seen in many real-world examples, there is no fundamental difference between character-based inferences such as maximum parsimony (MP) or maximum likelihood (ML) and distance-based inferences (NJ) fulfilling (here) the least-squares criterion (sometimes still called "phenetic" inferences in contrast to the "phylogenetic" parsimony, Bayesian inference and maximum likelihood).

The differences diminish further when we look at the phylograms instead of the cladograms, as shown next.

Another observation we can make is that for the linear-evolution scenario (four synapormophies), the ascertainment bias correction under ML has little effect, but it is crucial for the dichotomous evolution (two synapomorphies) to get sensible branch lengths.

Parsimony provides the most conservative (and least decisive) results for the dichotomous-evolution scenario, also because of the way I applied it: PAUP* allows optimizing trees with hard polytomies when using the default branch-and-bound search (for tree inference as well as bootstrapping), whereas the NJ / BioNJ algorithm and the ML implementation in RAxML will always produce fully dichotomized trees, including zero-length or near-zero-length branches. This explains the difference in the support values of preferred and alternative splits.

(Non-filtered) Bootstrap support consensus networks for the linear evolution scenario. Same scale for all graphs, trivial splits (dashed lines) collapsed.
(Non-filtered) Bootstrap support consensus networks for the dichotomous evolution scenario.

Trees are not wrong, but they miss the point

None of the graphs above show anything strongly erroneous, but they also don't fully capture the evolutionary pathways — that is, the actual ancestor-descendant relationships. This is because our taxon set includes ancestral forms, which, in traditional trees, have to be placed as sisters to part or all of their descendants. Networks provide a quick solution to this limitation.

Median-joining networks inferred with NETWORK for both scenarios, with the inferred (and real) character changes annotated along edges.

Neighbour-nets inferred with SplitsTree 4.13.1 for both scenarios, based on the mean (Hamming) pairwise distances.

The two (perfectly tree-like) graphs, one parsimony-based, the other distance-based, look identical, and place all of the taxa exactly where they should be: the ancestors on the nodes ("medians"), and their (latest) descendants at the tips. But note that in the case of the Neighbour-net this is a visual illusion / approximation: in fact, the ancestors are actually connected by zero-length edges to the node they appear to be sitting on.

Given that both scenarios used here produce trivial, straightforward to interpret, data patterns (see the first figures), the failure of the traditional tree inferences to get it completely right can be a bit unsettling. Trees including primitive-old and derived-new forms are common in the (palaeontological) literature, and typically show many branches lacking high support (note that only ML produced a bootstrap support >90 for a true-tree branch, and only for the linear evolution scenario). To address evolution over time, networks should hence be standard applications, rather than the exception. Cladograms should be long gone, as they show very little beyond the most trivial.

If we want trees (and many of us want trees!), we need tree inferences that can optimize an older taxon on an internal branch or node, to accommodate potentially ancestral forms.

Related blog posts

In Clades, cladograms, cladistics, and why networks are inevitable, I argue that we cannot get around networks when we aim to study taxa from different time scales using their morphologies.

Digging deeper: Population dynamics and individual-based fossil phylogenies raises the question of what we deal with when we use individual fossils (i.e. long-dead individuals) as OTUs in our phylogenetic inferences.
Monophyletic groups in networks by David gives an introduction into (fringe) terminology. What to do when dealing with more than a single most-recent common ancestor and past reticulation?

Networks and most recent common ancestors by David discusses the concepts of conservative MRCAs (most recent common ancestors), fuzzy MRCAs and (alternative) LCA — lowest (last) common ancestors in the face of reticulation.

In Stacking neighbour-nets: ancestors and descendants, I outline how one may (and why one should) stack Neighbour-nets to analyse the evolutionary history of a group including (mostly) fossil representatives.

The first Darwinian evolutionary tree[s] show features one rarely finds in a modern-day phylogenetic tree: ancestral and descendant forms, ancestral taxa addressed as species and not higher taxa, and gradual transition between forms (post by David).

Tree metaphors and mathematical trees by David, which introduces János Podani's concept about "branching silhouettes" and how to depict an actual evolutionary tree.

Where have all the ancestors gone? discusses the common notion that we don't have to deal with ancestor-descendant problems in phylogenetics at all, because the scarcity of the (terrestrial) fossil records ensures to only find extinct side (sister) lineages. 

Monday, July 30, 2018

Networks of polysemous and homophonous words

When I was very young, maybe even before I went to school, we often played a game with my parents and grandparents, during which we had to select two homophonous words (that is, one word form that expresses two rather different meanings), and the other people had to guess which word we had selected. This game is slightly different from its Anglo-Saxon counterpart, the homophone game.

In Germany, this game is called Teekesselchen: "little teapot". Therefore, people now also use the word Teekesselchen to denote cases of homophonoy or very advanced polysemy. In this sense, the word Teekesselchen itself becomes polysemous, since it denotes both a little teacup, and the phenomenon that word forms in a given language may often denote multiple meanings.

Homophony and polysemy

In linguistics, we learn very early that we should rigorously distinguish the phenomenon of homophony from the phenomenon of polysemy. The former refers to originally different word forms that have become similar (and even identical) due to the effects of sound change — compare French paix "peace" and pet "fart", which are now both pronounced as []. The latter refers to cases where a word form has accumulated multiple meanings over time, which are shifted from the original meaning — compare head as in head of department vs. head as in headache.

Given the difference of the processes leading to homophony on the one hand and polysemy on the other, it may seem justified to opt for a strict usage of the terms, at least when discussing linguistic problems. However, the distinction between homophony and polysemy is not always that easy to make.

In German, for example, we have the same word Decke for "ceiling" and "blanket" (Geyken 2010). This may seem to reflect a homophony at first sight, given that the meanings are so different, so that it seems simpler to assume a coincidence. However, it is in fact a polysemy (cf. Pfeiffer 1993, s. v. «Decke»). This can be easily seen from the verb (be)decken "to cover", from which Decke was derived. While the ceiling covers the room, the blanket covers the body.

Given that we usually do not know much about the history of the words in our languages, we often have difficulties deciding whether we are dealing with homophonies or with polysemies when encountering ambiguous terms in the languges of the world. The problem of the two terms is that they are not descriptive, but explanative (or ontological): they do not only describe a phenomenon ("one word form is ambiguous, having multiple meanings"), but also the origin of this phenomenon (sound change or semantic change).

In this context, the recently coined term colexification (François 2008) has proven to be very helpful, as it is purely descriptive, referring to those cases where a given language has the same word form to express two or more different meanings. The advantage of descriptive terminology is that it allows us to identify a certain phenomenon but analyze it in a separate step — that is, we can already talk about the phenomenon before we have found out its specific explanation.

A new contribution

Having worked hard during recent years writing computer code for data curation and analysis (cf. List et al 2018a), my colleagues and I have finally managed to present the fascinating phenomena of colexifications (homophonies and polysemies) in the languages of the world in an interactive web application. This shows which colexifications occur frequently in which languages of the world.

In order to display how often the languages in the world express different concepts using the same word, we make use of a network model, in which the concepts (or meanings) are represented by the nodes in the networks, and links between concepts are drawn whenever we find that any of the languages in the sample colexifies the concepts. The following figure illustrates this idea.

Colexification network for concepts centering around "FOOD" and "MEAL".

This database and web application is called CLICS, which stands for the Database of Cross-Linguistic Colexifications (List et al. 2018b), and was published officially during the past week ( — it can now be freely accessed by all who are interested. In addition, we describe the database in some more detail in a forthcoming article (List et al. 2018c), which is already available in form of a draft.

The data give us fascinating insights into the way in which the languages of the world describe the world. At times, it is surprising how similar the languages are, even if they do not share any recent ancestry. My favorite example is the network around the concept FUR, shown below. When inspecting this network, one can find direct links of FUR to HAIR, BODY HAIR, and WOOL on one hand, as well as LEATHER, SKIN, BARK, and PEEL on the other. In some sense, the many different languages of the world, whose data was used in this analysis, reflect a general principle of nature, namely that the bodies of living things are often covered by some protective substance.

Colexification network for concepts centering around "FUR".

Although we have been working with these networks for a long time, we are still far from understanding their true potential. Unfortunately, nobody in our team is a true specialist in complex networks. As a result, our approaches are always limited to what we may have read by chance about all of those fascinating ways in which complex networks can be analyzed.

For the future, we hope to convince more colleagues of the interesting character of the data. At the moment, our networks are simple tools for exploration, and it is hard to extract any evolutionary processes from them. With more refined methods, however, it may even be possible to use them to infer general tendencies of semantic change in language evolution.


Geyken A. (ed.) (2010) Digitales Wörterbuch der deutschen Sprache DWDS. Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Berlin-Brandenburgische Akademie der Wissenschaften: Berlin.

François A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.) From Polysemy to Semantic Change, pp 163-215. Benjamins: Amsterdam.

List J.-M., M. Walworth, S. Greenhill, T. Tresoldi, R. Forkel (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2.

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, R. Forkel (forthcoming) CLICS². An improved database of cross-linguistic colexifications: Assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2.

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (eds.) (2018) CLICS: Database of Cross-Linguistic Colexifications. Max Planck Institute for the Science of Human History: Jena.

Pfeifer W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.

Monday, July 23, 2018

Sequence alignment is still an open computational problem

I recently submitted an invited manuscript about multiple sequence alignment to a bioinformatics journal, but it did not fare well with the reviewers (ominously, there were more than the usual two, and it took a couple of months to get the reviews). The bioinformatics referees simply rejected the notion that a multiple alignment is an object in its own right, which is the basic premise of the manuscript.

To explain this: if we think of the normal tabular arrangement of a multiple sequence alignment, then the historical relationships among the rows (the taxa) are drawn as a phylogeny, while the historical relationships among the columns represent the homologies among the characters. There is no necessary primary importance of the phylogeny relationships over the homology relationships. However, phylogenies are much more prominent in the literature; and, indeed, sequence alignment is often seen as nothing more than a pesky step on the way to getting a phylogeny.

However, if we accept this notion, that homology relationships are both important and interesting in their own right, then multiple sequence alignment is certainly still an open computational problem, because most automated sequence alignments currently do not represent homology relationships. Instead, they represent sequence similarity of various sorts, and thus they only represent homology to the extent that similarity reflects history. In fact, similarity = homology + analogy, and the latter is not trivial.

I have previously written about the topic of alignment-as-homology for the biological audience:
  • Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
  • Morrison DA, Morgan MJ, Kelchner SA (2015) Molecular homology and multiple sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.
This new manuscript is intended to be the equivalent for the bioinformatics audience, explaining why homology ≠ similarity, and therefore why the current alignment algorithms are inadequate.

Rather than let it languish, and since it is likely to be the last single-author paper that I ever write, I tried to add it to the bioRxiv repository, for everyone to read. Sadly, their reviewers decided that it is insufficiently original, but is merely a summary of existing information. So, I guess that they are not impressed by the novel ideas, either.

I also tried the arXiv, which may seem to be more appropriate, given the audience, but they no longer recognize my user account, which means that the manuscripts I have there now exist in limbo. The world is apparently against my manuscript!

So, I am linking the paper here, instead:
Please have a look; and if you think it is worth it, then please spread the word. Moreover, if you are computationally inclined, then feel free to be inspired to tackle the problem described therein.

PS. I also once wrote a brief blog post about this:

Monday, July 16, 2018

A network of World happiness

This is a joint post by Guido Grimm and David Morrison.

You may never have heard of it, but the there is a World Happiness Report. This is sponsored by The Sustainable Development Solutions Network (SDSN) and The Global Happiness Council (GHC). Reports were produced in 2012, 2013, 2015 and 2017, but here we are going to look at the World Happiness Report 2018.

To quote the Report:
The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2018 ranks 156 countries by their happiness levels, and 117 countries by the happiness of their immigrants.
The rankings use data that come from the Gallup World Poll (GWP). The rankings are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2015-2017.
The Report is very comprehensive in its discussion of methodology, and its limitations. It is also very ambitious in its conclusions. The main focus of the 2018 Report is comparing the happiness of immigrants with their local counterparts. Interestingly, they found no important differences between these two groups.

More importantly for this blog, the raw data are provided in an Appendix, so that anyone can look at what is going on. We have decided to do just that.

The Report's happiness index

Below is the first little bit of Figure 2.2 (extracted from the report), which "shows the average ladder score (the average answer to the Cantril ladder question, asking people to evaluate the quality of their current lives on a scale of 0 to 10) for each country, averaged over the years 2015-2017." As you can see, the people who claim that they are happiest are those in the Nordic countries (Finland plus the Scandinavian countries: Norway, Denmark, Iceland and Sweden). These are the people whom the world's cultural cliché sees as sitting for half the year in the gloom! Apparently, you have all got it wrong.

As we have noted before, an index can often do a poor job of summarizing data, because it reduces complex data down to just one dimension. The Happiness Report tries to alleviate this limitation by adding information about some of the other variables that correlate with the Happiness score, using colors:
Each of these bars is divided into seven segments, showing our research efforts to find possible sources for the ladder levels. The first six sub-bars show how much each of the six key variables is calculated to contribute to that country’s ladder score, relative to that in a hypothetical country called Dystopia, so named because it has values equal to the world’s lowest national averages for 2015-2017 for each of the six key variables
However, we can do much better than this, by using all of these variables in a phylogenetic network. The key variables are (color-coded from left to right in the figure above):
  1. Gross Domestic Product (GDP) per capita is in terms of Purchasing Power Parity (PPP)
  2. Social support [the national average of the binary responses to the Gallup World Poll]
  3. The time series of healthy life expectancy at birth
  4. Freedom to make life choices [the national average of binary responses to the GWP question]
  5. Generosity [the residual of regressing the national average of GWP responses]
  6. Perceptions of corruption [ the average of binary answers to two GWP questions]
For the network, we simply put all of these variables into the analysis, along with the Happiness score.

[Technical details of our analysis: Qatar was deleted because it has too many missing values; each data variable was then standardized to zero mean and unit variance; the gower similarity was calculated, which ignores missing values, and this was converted to a distance; the distances were then displayed as a neighbor-net splits graph.]

A network analysis

The resulting network is shown next. Each point represents a country, with the name codes following the ISO-3166-1 standard. The spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions (asterisks highlight single states that don't group with the rest of their geographical region). We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.

In this blog post we do not want to risk over-interpreting the data, as explained in the final paragraphs below. However, it is obvious that there are distinct patterns in the network. Happiness, and its correlates are not randomly distributed on this planet but, not unexpectedly, relate to the local socio-political situation.

Starting at the bottom-left, we have a geographically heterogeneous cluster of very well-off countries, either welfare states (as in northern Europe), capitalist democracies (eg. the USA, Singapore, Hong Kong), or oil-rich monarchies with high levels of public spending (as in the Middle East). Moving clockwise, the next cluster has much of the rest of the western and central European countries, along with the financially well-off parts of South America and Asia. The next cluster has many of the remaining eastern European countries, plus the nearest parts of Asia, where government spending on welfare is still apparent. Clearly, national wealth plays a large part in happiness, in spite of the well-known adage to the contrary.

This is followed, at the top-middle of the network, by a broad neighborhood (not a distinct cluster), where government spending on welfare is much less apparent, at least to an outsider. The countries here come from Europe, Asia, and Central plus South America (including, at the moment, Greece). Happiness and its correlates is reported to be much lower here.

To make this situation clearer, here is a version of the network with some of the happiness scores annotated — values are provided for the first and last 10th percentile of the happiness score, and the 10 largest (by population) countries in the world.

On the opposite side of the network, happiness is also apparently lower, but with a different set of correlations among the variables. There is a two-part cluster of geographically heterogeneous countries at the bottom-middle, plus a neighborhood at the bottom-right. The latter includes China and India, the two most populous countries (with one-third of our people), while Indonesia (4th) and Brazil (6th) are in the neighborhood at the top of the network.

Finally, the cluster at the right consists mostly of African countries, plus Pakistan (the 5th most-populous country). In this cluster, happiness is reported to be at its lowest observed level. Much of the world's monetary aid is spent in Africa, of course, to try to improve the situation, although there is clearly a long way to go. Not unexpectedly, most of the world's migrants come from the right-hand part of the network, which is one of the main focuses of the Happiness Report.

Final comments

It is interesting to note that the Bhutan (code BTN) government reportedly aims to increase the Gross National Happiness rather than the GDP (see Gross national happiness in Bhutan: the big idea from a tiny state that could change the world). The network shows that their 2015-2017 happiness is quite different to that of their geographical neighbors. However, it also suggests that they still have a long way to go.

We should finish the discussion with a general point about surveys, such as the Gallup Poll on which the Happiness Report is based. Respondents are not always completely honest when answering survey questions, which is why pre-election polls sometimes get it wrong — people are most serious when faced with an actual decision, rather than a question. All of the results here need to be interpreted in this light — they may not be far wrong, but they are unlikely to be completely right.

Apart from anything else, there can be cultural differences in the way in which the answers to the Gallup World Poll questions are treated. Does "happiness" really mean the same thing across all cultures? We know that "beauty" does not, and "freedom" does not; so why not "happiness"? After all, things like reported happiness are likely to be confounded with other feelings such as national pride. This issue could presumably be addressed by looking at other answers from the Gallup Poll.

Monday, July 9, 2018

Using splits graphs for multivariate data analysis

Data containing multiple measurements for each of a set of objects are usually too complex to be viewed easily in their raw form. Therefore, methods have been developed to usummarize the data down to something simpler. This is called multivariate data analysis.

One of the issues that needs to be addressed is that a data summary is designed to lose information. The goal is to somehow keep the most important information in the summary. Clearly, the simpler is the summary then the more information we are likely to lose.

This post is a simplistic introduction to why splits graphs, which were originally developed to summarize multivariate phylogenetic data, are usually very good data summaries. It compares the ability of maps, indexes and networks to summarize data.


A map is a 2-dimensional drawing of some piece of 4-dimensional space-time. For example, the map shown here represents the southern part of Scandinavia.

A map is quite successful as a data summary. It reduces the 4-dimensional world down to 2+ dimensions — latitude and longitude are represented accurately; we use symbols or colors/shading to represent altitude; and we choose one specific time (thus eliminating that dimension). We can therefore reconstruct much of the 3-dimensional world from looking at a map (ie. much of the original information is retained in the summary).

In our example, we can see even from a glance at the map that Denmark is as flat as a pancake, Norway is very hilly, and Sweden is somewhere in between. We can also see that Uppsala and Oslo are at the same latitude, and that the simplest way to get from Uppsala to Trondheim is likely to be via Östersund rather than Oslo.


An index is a linear ordering of numbers measuring some calculated characteristic of a set of objects. It condenses a series of measurements for each object down to a single number. The index shown here refers to the hotels in Östersund (which we might stay at on our way from Uppsala to Trondheim), and indicates the overall quality score from a well-known online booking site. The index summarizes a set of features of the hotels that might be of interest to potential guests.

Hotell Emma
Clarion Hotell Grand
Hotell Stortorget
Quality Hotell Frösö Park
Hotell Jämteborg
Best Western Hotell Ett
Best Western Hotell Gamla Teatern
Hotell Älgen
Hotell Zäta

Unfortunately, an index is rarely very successful as a data summary. It reduces multi-dimensional data down to only 1 dimension. Therefore, we cannot tell which dimensions contribute to each value of the index — the same value could arise in many different ways. We therefore cannot reconstruct any of the original dimensions — what goes into the summary cannot come back out (as it can for a map).

Free WiFi
Value for money
Quality Hotel
Frösö Park

In our example, two of the hotels have exactly the same index score, but this does not necessarily mean that the two hotels are the same as regards the quality features, as shown above. For instance, there are notable differences between them in Location and Value for Money, and even larger differences in Cleanliness and Facilities. This information is lost in the calculation of the quality index.


A splits graph (a type of phylogenetic network) is a 2-dimensional drawing of some multi-dimensional set of data, such as might be used to calculate an index. The network shown here is based on the same data used to calculate the quality index above.

A network reduces multi-dimensional data down to 2+ dimensions. Each object is represented as a point — the spatial relationship of the points (their neighborhood) has meaning; and the inter-connecting lines have meaning (they are groups supported by the data). Such a network is therefore much more successful as a summary than is an index. Like a map, it will be very successful for 3-dimensional data, with potentially reduced success as the number of dimensions increases — the rate of information loss will depend on how well-correlated are the dimensions.

In our example, the main pattern in the network shows the relative quality of the hotels, as measured by the index, descending from top to bottom (so that all of the information form the index is in the network). However, the graph also emphasizes the difference between the two hotels with identical index scores. Indeed, it shows us that the Quality Hotell Fröösö Park is probably more similar to the Clarion Hotell Grand than to the Hotell Stortorget.


There are other forms of multivariate data analysis that are often used instead of networks. Two common ones are: an ordination, which reduces multi-dimensional data down to 2 dimensions only; and a cluster tree, which reduces to 1 dimension only. These are therefore often less successful as data summaries. Indeed, a network is very much like a combination of an ordination and a cluster tree, with the best features of both methods and fewer of their limitations.

Further reading

How to interpret splits graph

Primer of Phylogenetic Networks

Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.

Monday, July 2, 2018

Reticulation at its best — an example from the oaks

One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks, provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and – at one point – decided to not sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this imposes a new problem — the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1, and the structurally quite constrained ITS2, and have been frequently advocated as plant barcodes, turns out to be a two-edged sword. Relationships between the major intra-generic lineages is relatively clear, the ITS is pretty divergent down to the species level, but at the individual level, one faces a intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist from individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their genepools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vector for distributing acorns, the jaybirds, are sedentary birds. Their backup-vector, the squirrels are known to hoard a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. CMhosts, independent used were morphological matrices. ENT — entropy, FRQ — frequence, CON — strict consensus, MOD — modal consensus, and SIZ — sample size, are character transformations implemented in Markus' g2cef, PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead absence/presence. H1/H2 = hosts with different sets of associates (A1–A6)

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many shared 5S-IGS variants among individuals of the same species, but occasionally also shared by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample: compare e.g. the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks') — all individuals share a pair of very similar to identical 5S-IGS clones; and the divergent cluster 4, the 'Vallonea' oaks — all individuals have different sets of clones, but uniuqe 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last minute addition, because a happy little "accident" happened along the way, and the networks we were working with and looking at while drafting the paper where not PBC networks, as I thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]

However, the SplitsTree program didn't bother about an all-zero diagonal, so the AVG and MAX transformed distance matrices will produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.

Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite long to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved spanning up according boxes, and the species aggregates (clusters) did make sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1-hybrids spanning up according boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because, the distance is a "phylogenetic" version of the well-known Bray-Curtis distances).

A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has ones preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data — for example, individuals have identical clones (MIN = 0), intra-individual divergence may be higher or not than inter-individual (MAX intra-individual > MIN inter-individual), and individuals may have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN transformed networks, since this provides a valid distance matrix, but then ignored them. But I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason to not just run whatever distance one has on hand or can easily establish. Maybe even the biologically less-sound will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances)

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplemantary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the topfolder of the archive.) 

Monday, June 25, 2018

Horizontal and vertical language comparison

In the traditional handbooks on historical language comparison, one can often find the claim that there are two fundamentally different, but equally important, means of linguistic reconstruction. One is usually called "external reconstruction" (or alternatively the "comparative method"), and one is called "internal reconstruction". If we think of sequence comparison in historical linguistics in the form of a table, in which concepts are arranged on the vertical axis, and different languages on the horizontal axis, we can look at the two different modes of language comparison (external vs. internal) as the horizontal and the vertical axes of the table. Horizontal language comparison refers to external reconstruction — scholars compare forms (not necessarily of the same meaning) across the horizontal axis, that is, across different languages. Internal language comparison is vertical — scholars search inside one and the same language for structures that allow to infer its older stages.

In past blog posts I have been talking a lot about horizontal / external language comparison, for which especially the notion of sound correspondences is crucial. But in the same way in which we use the evidence across languages to infer the past states of a given language family, we can make use of language-internal evidence to learn more about the history — not only of a given language,- but also of a group of languages.

Vertical Language Comparison

A classical example of vertical or internal language comparison is the investigation of paradigms, that is, the inflection systems of the verbs or nouns in a given language. This, of course, makes sense only if the respective languages have verbal or nominal morphology, ie. if we find differences in the verb forms for the first, second, or third person singular or plural, or for the case system. The principle would not work in Chinese, although we have different means to compare languages without inflection vertically, as I'll illustrate below.

As a simplified example of internal reconstruction, consider the verbal paradigm of the verb esse "to be" in Latin:
Person Singular Plural
first sum sumus
second es estis
third est sunt

If you try to memorize this pattern, you will quickly realize that it is not regular, and you will have difficulties to identify patterns that assist in memorizing the forms. A much more regular pattern would be the following:
Person Singular Plural
first es-um es-umus
second es-Ø es-tis
third es-t es-unt

This pattern would still require us to memorize six different endings, but we could safely remember that the beginning of all forms is the same, and that there are six different endings, accounting for person and number at the same time (which is anyway typical for inflecting languages).

An alternative pattern that would be easier to remember is the following one:
Person Singular Plural
first es-um s-umus
second es-ø s-tis
third es-t s-unt

While it may seem that this pattern is slightly more complicated at first glance, it would still be more regular than the pattern we actually observe, and we would now have two different aspects expressing the meaning of the different forms: the alternation of the root es- vs. s- accounts for the singular-plural distinction, while the endings express again both number and person.

If we look at older stages of Latin, we can, indeed, find evidence for the first person singular, which was written esom in ancient documents (see Meier-Brügger 2002 for details on the reconstruction of this paradigm in Indo-European). If we look at other languages, like Sanskrit and Ancient Greek, we can further see that our alternation between es- and s- in the root (thus our last example) comes also much closer to the supposed ancient state, even if we don't find complete evidence for this in Latin alone.

What we can see, however, is that the inspection of alternating forms of the same root can reveal ancient states of a language. The key assumption is that observed irregularities usually go back to formerly regular patterns.

Horizontal language comparison

The classical example for horizontal or external language comparison is the typical wordlists in which words with similar meanings across different languages are arranged in tabular form. I have mentioned before that it was in great part Morris Swadesh (1909-1967) who popularized the simple tabular perspective that puts a concept and its various translations in the center of historical language comparison. Before the development of this concept-based approach to historical linguistics, scholars would pick examples based on their similarity in form, allowing for great differences in the semantics of the words being assigned to the same slot of cognate words; and this exclusively form-based approach to external language comparison is still the prevalent one in most branches of historical linguistics.

No matter what approach we employ in this context — be it the concept- or the form-based — as long as we compare forms across different languages, we carry out external language comparison, and our main concern is then the identification of regular sound correspondences across the languages in our sample, which enable us to propose ancestral sounds for the ancestral language.

Problems of vertical language comparison

As can be seen from my above example of the inflection of esse in Latin, it is not obvious how the task of internal language comparison could be formalized and automated. There are two main reasons for this. First, inflection paradigms vary greatly among the languages of the world, which makes it difficult to come up with a common way to investigate them.

Second, since we are usually looking for irregular cases that we try to explain as having evolved from former regularities, it is clear that our data will be extremely sparse. Often, it is only the paradigm of one word that we seek to explain, as we have seen for Latin esse, and patterns of irregularities across many verbs are rather rare (although we can also find examples for this). As a result, internal reconstruction is dealing with even fewer data than external reconstruction, where data are also not necessarily big.

Formalizing the language-internal analysis of word families

Despite the obvious problems of exploiting the language-internal perspective in historical language comparison, there are certain types of linguistic analysis that are amenable to a more formal treatment in this area. One example that we are currently testing is the inference and annotation of word families within a given language. It is well known that large number of words in human languages are not unrelated atomic units, but have themselves been created from smaller parts. Linguists distinguish derivation and compounding as the major techniques here, by which new words are created from existing ones.

Derivation refers to those cases where a word is being modified by a form unit that could not form a word of its own, usually a suffix or a prefix. As an example, consider the suffix -er in English which can be attached to verbs in order to form a noun that usually describes the person that regularly carries out the action denoted by the original verb (eg. examineexaminer, teachteacher, etc.). While the original verb form exists without the suffix in the English language, the form -er only occurs as part of verbs. In contrast to derivation, compounding refers to the process by which two word forms that can be used in isolation are merged to form a new expression (compare foot and ball with football).

Searching for suffixes and compounds in unannotated language data is a very difficult task. Although scholars have been working on automatic methods that split a given monolingual dictionary into its smallest meaning-bearing form units (morphemes), these methods usually only work on very large datasets (Creutz and Laugs 2005). Trained linguists, on the other hand, can easily detect patterns, even when working on smaller datasets of a few hundred words.

The reason why linguists are successful in analysing the morphology of languages, in contrast to machine-learning approaches, is that they make active use of their external knowledge about the potential semantics underlying the patterns, while current methods for automatic morpheme detection usually only consider the forms, and disregard the semantics. Semantics, however, are important to distinguish words that form a true family (in that they share cognate material) from words that are similar only due to chance.

It is clear that languages may have words that sound alike but convey different meanings. As an extreme example, consider French paix [] "peace" vs. pet [] "fart".Although both words are pronounced the same, we know that they are not cognate, going back to different ancestral forms, as is also reflected in the French writing system. But even if we lacked the evidence of the French orthography, we could easily justify that the words do not form a family, since (a) their meaning is quite different, and (b) their genus is different as well (la paix vs. le pet). An automatic method that disregards semantics and external evidence (like the orthography or the gender of nouns in our case) cannot distinguish words that are similar due to chance from words that are similar due to their history.

As a further example illustrating the importance of semantics, consider the data for Achang, a Burmish language, spoken in Myanmar (data from Huáng 1992), which is shown in the following graphic (derived from the EDICTOR tool and analyzed by Nathan W. Hill).

Word families in Achang, a Burmish language.

In this figure, we can see six words which all share tɕʰi⁵⁵ (high numbers represent tones) as their first part. As we can see from the detailed analysis of these compounds in Achang, which is given in the column "MORPHEMES" in the figure, our analysis claims that the form tɕʰi⁵⁵, which expresses the concepts "foot" or "leg" in isolation, recurs in the words for "hoof", "claw", "knee", and "thigh", but not in the word for ""ant". While the semantic commonalities among the former are plausible, as they all denote body parts which are closely related to "feet" or "legs", we do not find any transparent motivation for why the speakers should have used a compound containing the word for "foot" to denote an ant. Although we cannot demonstrate this at this point, we are hesitant to add the Achang word for "ant" to the word family based on compounds containing the word for "foot".

Bipartite networks of word families

For the time being, we cannot automate this analysis, since we lack data for the testing and training of potential algorithms. We can, however, formalize it in a very straightforward way: with help of a bipartite network (see Hill and List 2017). Bipartite networks are networks with two kinds of nodes, which are usually thought of as representing different types. While we can easily assign different types to all nodes in any network we are dealing with, bipartite networks only allow us to link nodes of different types. In our bipartite network of word families, the first type of nodes represent the forms of the words, while the second type represent the meanings attributed to the sub-parts of the words. In the figure above, the former can be found in the column "tokens", where the symbol "+" marks the boundaries, and the latter can be found in the column "MORPHEMES".

The following figure shows the bipartite network underlying the word family relations following from our analysis of words built with the morpheme "foot" in Achang.

Bipartite network of word families: nodes in red text represent the (reconstructed) meaning of the morphemes, and blue nodes the words in which those occur as parts.


The bipartite network above shows only a small part of the word family structure of one language, and the analysis and formalization of word families with help of bipartite networks thus remains exemplary and anecdotal. I hope, however, that the example illustrates how important it is to keep in mind that language change is not only about sound shifts that can be analyzed with help of language-external, horizontal comparison. Investigating the vertical (the language-internal) perspective of language evolution is not only fascinating, offering many so far unresolved methodological problems, it is at least as important as the horizontal perspective for a proper understanding of the dynamics underlying language change.


Creutz M. and Lagus K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, 2005, 81.

Hill N. and List J.-M. (2017) Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1. 47–76.

Meier-Brügger M. (2002) Indogermanische Sprachwissenschaft. de Gruyter: Berlin.

Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.

Monday, June 18, 2018

To boldy go where no one has gone before – networks of moons

This is a joint post by Timothy Holt and Guido Grimm

One ‘a-phylogenetic’ application of phylogenetic methods is the classification of stellar (in the widest sense) objects, so-called "astrocladistics" (see Didier Fraix-Burnet’s dedicated blog: Traditionally, the objects would be characterized and their (dis)similarity translated into a plot (eg. using PCoA) or a tree (eg. a UPGMA tree). Such cluster analysis frameworks would then be the basis for the classification of the objects.

In ‘astrocladistics’, phylogenetic trees that fulfill the maximum parsimony or minimum evolution criteria, are used instead. But why should we stop with trees (see the prior blog post Astrocladistics: a network analysis)? For this post, we have used the matrices of a recent astrocladistic paper by Holt et al. (2018) to highlight an as yet under-explored application of phylogenetic methods in classification: exploratory data analysis (EDA).

Why exploratory data analysis

As noted in the earlier post on astrocladistics, one problem is that one infers phylogenetic trees based on a data sets that are not the product of an evolutionary process. Some objects may evolve from others (eg. a satellite may evolve from planetary ring matter), but this is not a dichotomous splitting process through time. And any non-dichotomous process can lead to tree-incompatible signals, which will then hamper tree inference in a biological context. Any tree using astral objects (galaxies, stars, planets, moons) as OTUs is per se a faux phylogeny (some examples for faux phylogenies are collected here and here).

Another problem is a data-inherent bias. The matrices are coded in a fashion that reflects an a priori hypothesis of derivation. For instance, by inferring that objects farther away are older and closer ones are younger, we can make hypotheses about maturation of galaxies, and hard-code this hypothesis into the data matrix. This will infer a tree that was coded into the matrix.

Guido’s starting argument is that when our main goal is classification and not inferring evolutionary relationships, the topology of the tree (or alternative trees) is the least of our concerns. What we want to know is to what degree our data converges to the same groupings, supports coherent classes. This is exploratory data analysis, and Neighbor-nets are then a powerful tool to visualize any differentiation pattern (see some recent a-biological examples: U.S. gun legislation, cryptocurrencies, where to retire Worldwide and within the USA)

Instead of inferring trees, as in the original paper about two satellite systems (Jupiter and Saturn), here we use the matrices to infer Neighbor-nets, map character support (non-parametric bootstrap support) on the resultant networks, and discuss the prospects and perils of ‘astro-Neighbor-nets’ when it comes to classification of astronomical bodies.

Data properties and analysis set-up

In order to construct the matrices, three different types of characteristics were used: dynamical, physical and compositional. Dynamical characteristics are the positions of the various satellites, how far they are away from the planet (semi-major axis), their inclination to the plane and eccentricity of their orbit. Several of the satellites also orbit opposite to the planet's rotation (they are on a retrograde orbit), which is also code. Physical characteristics are two properties of the satellites: their albedo, or how reflective they are, and their density. Any characteristics related to mass and size are specifically avoided, as this would hide any parent/daughter relationships resulting from breakups. The compositional characteristics are the most numerous ones in the analysis. These are binary characteristics indicating the presence/absence of chemical species, eg. water, iron, methane, etc.

Five of the characteristics, semi-major axis, inclination, eccentricity, albedo and density, are ordered and continuous. These prose a problem for standard cladistic analysis using parsimony, which needs discrete character states. Hence, these characteristics are binned using a python program. Each character-set is binned independently, and for each of the Jovian and Saturnian systems. The aforementioned python program iterates the number of bins until a linear regression model between binned and unbinned sets achieves a coefficient of determination (r2) score of > 0.99. All characteristics are binned in a linear fashion, with the majority increasing in progression. The exception to the linear increase is the density character set, with a reversed profile. All of the continuous, binned characteristic sets are (by definition) ordered characters.

Thus, the matrices comprised two sorts of characters with strongly different properties, when it comes to explicit inferences: binary characters, and highly ordered characters (the binned ones) with up to 11 states. For the graphs used here, we didn’t apply any weighting, which means that in the most extreme case complete difference in a binned character counts 11-times more than a difference in any of the binary characters. This bias is compensated to some degree by the number of binary characters (33, with 31 variable) vs. binned characters (5), when restricting the analysis to well-known planetary objects.

The matrices are comprehensive, and include little-known objects with a lot of missing data (>80% of the characters cannot be scored), which should be included. A matrix-based classification makes most sense, when one uses character sets that are defined for most or all of the objects. Thus, to see how the little-known objects relate to the well-known, we eliminated all poorly covered characters, leaving us with two binary and five binned ones. To not lose the information from the binary characters when calculating the inter-object distances, we gave them a weight of 7–8. This ensures that a 0↔1 difference in a binary character more or less equals the maximum possible difference in a binned character (on average 8 bins for the Jupiter dataset, and7 for the Saturn data set).

Fig. 1 The orbits of the satellites in the Jovian satellite system. Colours represent traditionally recognized groups.

The Jovian moons (and ring)

The Jovian system (Fig. 1) is dominated by the famous Galilean satellites (moons): Io, Europa, Ganymede, and Callisto. In between these moons and Jupiter there is a faint ring system, and four small satellites, the Amalthea family. Outside the Galilean system, there is a system of 67 small irregular satellites that have much wilder orbits, some going in the opposite direction to the other satellites. These are thought to be captured asteroids.

As seen in Fig. 2, the data analysis supports a somewhat tree-like network. The Galilean Group (the large moons), the association of Amalthea with Jupiter’s Main Ring, and the Himalia Family, but it rejects the traditional division of the remaining well-known moons (captured asteroids) into three families: two of the three Pasiphae Family satellites are very similar to Carme. Although Ananke is somewhat different, it is substantially more similar to Carme and the Pasiphae Family satellites than the inter-group differentiation found elsewhere. One commonly shared idea about classification is that one should erect classes that have similar quality and are defined by high intra-class coherence and inter-class differentiation.

Fig. 2 Neighbor-net of Jovian moons, well-covered by data (<1% or no missing data). Colouration addresses traditionally recognised groups, same colours used as in Fig. 1.

By reducing the character set to those characters that are defined for most or all moons, we naturally take away some of the potential differentiation. Nonetheless, the resulting graph (Fig. 3) provides a structure that may well be used to place less-known objects, identify their closest best-known counterpart(s), erect a classification, and discuss current classification schemes.

Fig. 3 Neighbor-net of all Jovian moons based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

We can see that we lose differentiation within the well-known (and well-supported; Fig. 2) groups, especially regarding the distinctness between members of the Galilean Group and to the Amaltheae Family. However, the basic structure of the graph remains the same. Based on the scored data, the Ananke, Pasiphae and Carmes Families are not supported. A sub-division may be possible, but would require some re-shuffling of the moons. For instance, a group including Ananke-like satellites would not include Euporie, Iocaste and Hermippe, but may include Callirhoe. A Pasiphae s.str. group would make sense when excluding Aoede, Helike, Sinope, Autonoe, and Eurydome, with the latter three being (nearly) identical to Carmes or members of the Carmes Family.

Fig. 4 The orbits of the satellites in the Saturnian satellite system. Colours represent traditionally recognised groups.

The Saturnian moons and rings

The Saturn system (Fig. 4) is similar to of the Jupiter one.

The ring structure is one of Saturn’s most distinctive features, with structures seen even with a modest telescope. Imbedded in the rings are small moon-lets. The co-orbitals Janus and Epimetheus, just outside the main rings, swap orbits every four years. There are eight mid-sized satellites, including Titan, a small world in itself with a methane-based weather system. Of the other icy satellites, Mimas, Enceladus, Tethys, Dione and Rhea, are embedded in the diffuse E-ring. The source of this ring is cryovolcanic plumes on Enceladus, a possible location for life beyond Earth.

Unique to the Saturn system are Trojan satellites, in the same orbit as their parent satellites, one 60o ahead, and 60o behind. Tethys has Telesto and Calypso as Trojan satellites, while Helene and Polydeuces are Trojan satellites of Dione. Between the orbits of Mimas and Enceladus, there are the Alkyonides (Methone, Anthe and Pallene) recently discovered by the Cassini spacecraft. Each of the Alkyonides have their own faint ring arcs comprised of similar material to the satellite.

As with Jupiter, there is also a system of 38 small irregular satellites outside the inner system. This system is dominated by Phoebe — at ~240 km across it is six times the size of the next largest irregular. It is also the only irregular satellite to have its photo taken, with the Cassini spacecraft flying within 2,000 km of the surface, taking high-res images as it went. Using these new data, a picture is emerging of Phoebe as a captured outer solar system object.

For Jupiter’s smaller sibling (Saturn), a less tree-like network is inferred (Fig. 5). Since it is not the product of an evolutionary process in a biological sense (ie. a phylogeny), but instead including patterns related to parent/descendant relationships (rings-moons, breakups), we should not necessarily expect tree-like graphs.

Fig. 5 Neighbor-net of well-known Saturnian planetary objects (<1% or no missing data). Colouration addresses traditionally recognised groups
using the same colours than in Fig. 4.

Nonetheless, the graph could be the basis of an objective classification. The elements of the ‘Main Ring Group’ have high intra-group coherence, but also include Calypso and Telesto of the ‘E Ring Group’. On the other hand, the ‘Outer Satellite Group’ is very heterogeneous. One straightforward option would be to fuse this group with the E Ring Group; another is to exchange Enceladus for Hyperion. The Norse Group’s (= Phoebe Family) representative Phoebe is clearly distinct from any other object and would need a class of its own.

As in the case of Jupiter ,we can add and try to classify the remaining little-known objects (Fig. 6), to some degree.

Fig. 6 Neighbor-net of all Saturnian moons and rings based on a distance-matrix reflecting the known (scored) similarity and differences. Colouration as above.

In contrast to Jupiter, the reduced character set (just four characters, one binary, three binned-ordered) loses the differentiation between objects of the Main Ring, E Ring and Outer Satellite groups included in Fig. 5. They are virtually identically for these characters. The two groups not covered in Fig. 5, the Siarnaq (named after Inuit gods) and Albiorix Family (Gallic gods), are close to each other. The Albiorix Family forms a distinct subset of the Siarnaq Family. The moons of the coherent Phoebe Family (named after Norse mythological figures) are all close to each other, and this group includes various newly discovered satellites. Interesting is also the position of the Phoebe Ring compared to its name-giving moon and the remainder of the Phoebe Family.

Comparison with the tree-based analyses

In comparison with the astrocladistical work of Holt et al. (2018), the network-based analysis captures most of the meta-structure of the satellite classifications.

Compared with the Jovian trees, the network-based analysis shows the distinction between the inner Galilean group and the outer ‘irregular’ satellites and separates the Himalia family. The differences are in how each of the analyses handles the retrograde irregulars, the Pasiphae, Carme and Ananke families. It should be noted that these bodies are woefully under-studied, and have very little information available, making any inferences difficult.

In Holt et al.’s trees, the Ananke, Carme and Sinope subfamiles are unresolved, but are supported using Multivariate Hierarchical Cluster Analysis (example provided in Fig. 7). This method uses clustering in parameter space to justify collisional families. Though the particular members are different, the network-based analysis still identifies clusters around the largest irregular satellites, Anake, Sinope, Carme and Pasiphae. This further supports further the theory that these families are remnants of collision breakups. As usual with science, there is far more work to be done here.

Fig 7: Clustering of several Jovian Irregular satellites in three dimensional parameter space using Semi-major axis (a), eccentricity (e) and inclination (i).

In the Saturnian system, the outer satellites also prove to be problematic. Holt et at. split the Aegir and Ymir subfamilies from within the Phoebe family. These subfamilies are distinct from Phoebe and its ring, following a narrative of a different origin for Phoebe and the rest of the irregular satellites. The capture of Phoebe would have major disruptive effects on the satellites. As the dynamical characteristics play such a large role, they are the only information available for some of the satellites, so that little sub-structuring can be seen. As with the Jovian irregular satellites, more information is needed.

The inner system of Saturn also warrants mention, particularly the case of Telesto and Calypso, the Trojans of Tethys. In the network analysis, they are associated with main ring objects, rather than with Tethys itself. There is a possibility that these two Trojans are captured main-ring objects, and this would support that hypothesis. Dione, and its Trojan Helene, are both closely associated with one another in both analyses, indicating a parent/daughter relationship (keep in mind that phylogenetic trees cannot discern between parent/daughter and sister relationships).

Phoebe as seen by the Cassini spacecraft, NASA/JPL/Space Science Institute, PIA06064 (the NASA provides more than 1000 media files covering the Cassini-Huygens mission)

Boldly gone – networks as tools in classification
The idea discussed here appears worth exploring — using distance-based or other (meta-)phylogenetic networks for the classification of objects not necessarily following any phylogeny. It has some obvious advantages over astrocladistics (especially when using maximum parsimony as the tree-optimizing criterion) or traditional classification methods (PCoA, simple clustering approaches):
  • Distance-matrices are easy and quick to generate based on any data; and they can also be used for more traditional classification means such as PCoA.
  • Neighbor-nets are very quick to calculate, and can capture more aspects of the actual differentiation than can cluster analysis (e.g. UPGMA trees, PCoA) or astrocladistic methods; in some sense they represent a fusion of the best aspects of both approaches.
  • In contrast to a tree, where tree-incompatible signals can massively distort branch-length patterns, or rogue objects interfere with establishing a finely resolved topology, a Neighbor-net can be straightforwardly interpreted regarding group coherence.
Perhaps, the main disadvantage is of this approach is the need for a distance matrix with meaningful pairwise distances. If missing data distort the general (dis)similarity patterns, then Neighbor-net may have branching (edging) artifacts.

However, using the Neighbor-net as a basis for classification, groups also allows us to quickly test for character sampling bias, eg. by re-calculating the distance input matrix using weighting schemes, or different distance calculations (eg. instead of binning the continuous characters, they could be used as-is), or reduced character or taxon sets. Also, when it comes to classifying non-living objects, it’s always good to keep it as simple as possible, while being able to explore the signal in the data matrix.

More results, the data matrices used, and the template analysis files can be found on figshare. The archive includes also the (simple) NEXUS-formatted files with the PAUP* command blocks we used for the analyses. The one for Jupiter is fully annotated with comments on the code lines for PAUP* to assist inexperienced users and to facilitate export (and subsequent) import into SplitsTree.

The archive includes also the code for and results of a full bootstrap-support analysis (currently two optimality criteria: Least-squares and Maximum parsimony, Maximum likelihood to be added) — even when preferring the astrocladistic approach, networks are handy to summarize the bootstrap pseudoreplicate sample.


Holt TR, Brown AJ, Nesvorný D, Horner J, Carter B (2018) Cladistical analysis of the Jovian and Saturnian satellite systems. Astrophysical Journal 859(2): 97, 20 pp; arXiv: 1706.0142