Monday, July 30, 2018

Networks of polysemous and homophonous words


When I was very young, maybe even before I went to school, we often played a game with my parents and grandparents, during which we had to select two homophonous words (that is, one word form that expresses two rather different meanings), and the other people had to guess which word we had selected. This game is slightly different from its Anglo-Saxon counterpart, the homophone game.

In Germany, this game is called Teekesselchen: "little teapot". Therefore, people now also use the word Teekesselchen to denote cases of homophony or very advanced polysemy. In this sense, the word Teekesselchen itself becomes polysemous, since it denotes both a little teapot and the phenomenon that word forms in a given language may often denote multiple meanings.

Homophony and polysemy

In linguistics, we learn very early that we should rigorously distinguish the phenomenon of homophony from the phenomenon of polysemy. The former refers to originally different word forms that have become similar (and even identical) due to the effects of sound change: compare French paix "peace" and pet "fart", which are now both pronounced as [pɛ]. The latter refers to cases where a word form has accumulated multiple meanings over time, which are shifted from the original meaning: compare head as in head of department vs. head as in headache.

Given the difference of the processes leading to homophony on the one hand and polysemy on the other, it may seem justified to opt for a strict usage of the terms, at least when discussing linguistic problems. However, the distinction between homophony and polysemy is not always that easy to make.

In German, for example, we have the same word Decke for "ceiling" and "blanket" (Geyken 2010). This may seem to reflect a homophony at first sight, given that the meanings are so different, so that it seems simpler to assume a coincidence. However, it is in fact a polysemy (cf. Pfeifer 1993, s. v. «Decke»). This can be easily seen from the verb (be)decken "to cover", from which Decke was derived. While the ceiling covers the room, the blanket covers the body.

Given that we usually do not know much about the history of the words in our languages, we often have difficulties deciding whether we are dealing with homophonies or with polysemies when encountering ambiguous terms in the languages of the world. The problem with the two terms is that they are not descriptive, but explanatory (or ontological): they do not only describe a phenomenon ("one word form is ambiguous, having multiple meanings"), but also the origin of this phenomenon (sound change or semantic change).

In this context, the recently coined term colexification (François 2008) has proven to be very helpful, as it is purely descriptive, referring to those cases where a given language has the same word form to express two or more different meanings. The advantage of descriptive terminology is that it allows us to identify a certain phenomenon but analyze it in a separate step — that is, we can already talk about the phenomenon before we have found out its specific explanation.

A new contribution

Having worked hard during recent years writing computer code for data curation and analysis (cf. List et al. 2018a), my colleagues and I have finally managed to present the fascinating phenomena of colexifications (homophonies and polysemies) in the languages of the world in an interactive web application. This shows which colexifications occur frequently in which languages of the world.

In order to display how often the languages in the world express different concepts using the same word, we make use of a network model, in which the concepts (or meanings) are represented by the nodes in the networks, and links between concepts are drawn whenever we find that any of the languages in the sample colexifies the concepts. The following figure illustrates this idea.

Colexification network for concepts centering around "FOOD" and "MEAL".
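The network model described above can be sketched in a few lines of Python. This is only a minimal illustration, not the CLICS code itself: the German Decke example is taken from this post, while the languages "LangA" and "LangB" and their word forms are invented placeholders, and the function name colexification_network is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy word list: (language, word form, concept). The German "Decke" pair comes
# from the post; the LangA / LangB rows are invented placeholders.
WORDLIST = [
    ("German", "Decke", "CEILING"),
    ("German", "Decke", "BLANKET"),
    ("German", "Fell", "FUR"),
    ("LangA", "kapa", "FUR"),
    ("LangA", "kapa", "HAIR"),
    ("LangB", "tolo", "FUR"),
    ("LangB", "tolo", "HAIR"),
    ("LangB", "miru", "SKIN"),
]


def colexification_network(wordlist):
    """Map each linked concept pair to the set of languages that colexify it."""
    # One form expressing several concepts in one language is a colexification.
    concepts_by_form = defaultdict(set)
    for language, form, concept in wordlist:
        concepts_by_form[(language, form)].add(concept)

    edges = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for first, second in combinations(sorted(concepts), 2):
            edges[frozenset((first, second))].add(language)
    return dict(edges)


network = colexification_network(WORDLIST)
for pair, languages in sorted(network.items(), key=lambda kv: sorted(kv[0])):
    print(" - ".join(sorted(pair)), "| languages:", len(languages))
```

The number of languages attached to each link can serve as an edge weight, so that frequent colexifications stand out from those found in a single language.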

This database and web application is called CLICS, which stands for the Database of Cross-Linguistic Colexifications (List et al. 2018b), and was published officially during the past week (http://clics.clld.org) — it can now be freely accessed by all who are interested. In addition, we describe the database in some more detail in a forthcoming article (List et al. 2018c), which is already available in form of a draft.

The data give us fascinating insights into the way in which the languages of the world describe the world. At times, it is surprising how similar the languages are, even if they do not share any recent ancestry. My favorite example is the network around the concept FUR, shown below. When inspecting this network, one can find direct links of FUR to HAIR, BODY HAIR, and WOOL on one hand, as well as LEATHER, SKIN, BARK, and PEEL on the other. In some sense, the many different languages of the world, whose data was used in this analysis, reflect a general principle of nature, namely that the bodies of living things are often covered by some protective substance.

Colexification network for concepts centering around "FUR".

Although we have been working with these networks for a long time, we are still far from understanding their true potential. Unfortunately, nobody in our team is a true specialist in complex networks. As a result, our approaches are always limited to what we may have read by chance about all of those fascinating ways in which complex networks can be analyzed.

For the future, we hope to convince more colleagues of the interesting character of the data. At the moment, our networks are simple tools for exploration, and it is hard to extract any evolutionary processes from them. With more refined methods, however, it may even be possible to use them to infer general tendencies of semantic change in language evolution.

References

Geyken A. (ed.) (2010) Digitales Wörterbuch der deutschen Sprache DWDS. Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Berlin-Brandenburgische Akademie der Wissenschaften: Berlin. http://dwds.de

François A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.) From Polysemy to Semantic Change, pp 163-215. Benjamins: Amsterdam.

List J.-M., M. Walworth, S. Greenhill, T. Tresoldi, R. Forkel (2018a) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3.2. http://dx.doi.org/10.1093/jole/lzy006

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, R. Forkel (2018c, forthcoming) CLICS². An improved database of cross-linguistic colexifications: Assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2. https://doi.org/10.1515/lingty-2018-0010

List J.-M., S. Greenhill, C. Anderson, T. Mayer, T. Tresoldi, and R. Forkel (eds.) (2018b) CLICS: Database of Cross-Linguistic Colexifications. Max Planck Institute for the Science of Human History: Jena. http://clics.clld.org

Pfeifer W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.

Monday, July 23, 2018

Sequence alignment is still an open computational problem


I recently submitted an invited manuscript about multiple sequence alignment to a bioinformatics journal, but it did not fare well with the reviewers (ominously, there were more than the usual two, and it took a couple of months to get the reviews). The bioinformatics referees simply rejected the notion that a multiple alignment is an object in its own right, which is the basic premise of the manuscript.

To explain this: if we think of the normal tabular arrangement of a multiple sequence alignment, then the historical relationships among the rows (the taxa) are drawn as a phylogeny, while the historical relationships among the columns represent the homologies among the characters. There is no necessary primary importance of the phylogeny relationships over the homology relationships. However, phylogenies are much more prominent in the literature; and, indeed, sequence alignment is often seen as nothing more than a pesky step on the way to getting a phylogeny.

However, if we accept this notion, that homology relationships are both important and interesting in their own right, then multiple sequence alignment is certainly still an open computational problem, because most automated sequence alignments currently do not represent homology relationships. Instead, they represent sequence similarity of various sorts, and thus they only represent homology to the extent that similarity reflects history. In fact, similarity = homology + analogy, and the latter is not trivial.

I have previously written about the topic of alignment-as-homology for the biological audience:
  • Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
  • Morrison DA, Morgan MJ, Kelchner SA (2015) Molecular homology and multiple sequence alignment: an analysis of concepts and practice. Australian Systematic Botany 28: 46-62.
This new manuscript is intended to be the equivalent for the bioinformatics audience, explaining why homology ≠ similarity, and therefore why the current alignment algorithms are inadequate.

Rather than let it languish, and since it is likely to be the last single-author paper that I ever write, I tried to add it to the bioRxiv repository, for everyone to read. Sadly, their reviewers decided that it is insufficiently original, but is merely a summary of existing information. So, I guess that they are not impressed by the novel ideas, either.

I also tried the arXiv, which may seem to be more appropriate, given the audience, but they no longer recognize my user account, which means that the manuscripts I have there now exist in limbo. The world is apparently against my manuscript!

[ Note: This issue has now been resolved, and the manuscript can be accessed as arXiv:1808.07717 ]

So, I am linking the paper here, instead:
Please have a look; and if you think it is worth it, then please spread the word. Moreover, if you are computationally inclined, then feel free to be inspired to tackle the problem described therein.

PS. I also once wrote a brief blog post about this:

Monday, July 16, 2018

A network of World happiness


This is a joint post by Guido Grimm and David Morrison.

You may never have heard of it, but there is a World Happiness Report. This is sponsored by The Sustainable Development Solutions Network (SDSN) and The Global Happiness Council (GHC). Reports were produced in 2012, 2013, 2015 and 2017, but here we are going to look at the World Happiness Report 2018.


To quote the Report:
The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2018 ranks 156 countries by their happiness levels, and 117 countries by the happiness of their immigrants.
The rankings use data that come from the Gallup World Poll (GWP). The rankings are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2015-2017.
The Report is very comprehensive in its discussion of methodology, and its limitations. It is also very ambitious in its conclusions. The main focus of the 2018 Report is comparing the happiness of immigrants with their local counterparts. Interestingly, they found no important differences between these two groups.

More importantly for this blog, the raw data are provided in an Appendix, so that anyone can look at what is going on. We have decided to do just that.

The Report's happiness index

Below is the first little bit of Figure 2.2 (extracted from the report), which "shows the average ladder score (the average answer to the Cantril ladder question, asking people to evaluate the quality of their current lives on a scale of 0 to 10) for each country, averaged over the years 2015-2017." As you can see, the people who claim that they are happiest are those in the Nordic countries (Finland plus the Scandinavian countries: Norway, Denmark, Iceland and Sweden). These are the people whom the world's cultural cliché sees as sitting for half the year in the gloom! Apparently, you have all got it wrong.


As we have noted before, an index can often do a poor job of summarizing data, because it reduces complex data down to just one dimension. The Happiness Report tries to alleviate this limitation by adding information about some of the other variables that correlate with the Happiness score, using colors:
Each of these bars is divided into seven segments, showing our research efforts to find possible sources for the ladder levels. The first six sub-bars show how much each of the six key variables is calculated to contribute to that country’s ladder score, relative to that in a hypothetical country called Dystopia, so named because it has values equal to the world’s lowest national averages for 2015-2017 for each of the six key variables
However, we can do much better than this, by using all of these variables in a phylogenetic network. The key variables are (color-coded from left to right in the figure above):
  1. Gross Domestic Product (GDP) per capita is in terms of Purchasing Power Parity (PPP)
  2. Social support [the national average of the binary responses to the Gallup World Poll]
  3. The time series of healthy life expectancy at birth
  4. Freedom to make life choices [the national average of binary responses to the GWP question]
  5. Generosity [the residual of regressing the national average of GWP responses]
  6. Perceptions of corruption [the average of binary answers to two GWP questions]
For the network, we simply put all of these variables into the analysis, along with the Happiness score.

[Technical details of our analysis: Qatar was deleted because it has too many missing values; each data variable was then standardized to zero mean and unit variance; the Gower similarity was calculated, which ignores missing values, and this was converted to a distance; the distances were then displayed as a neighbor-net splits graph.]
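For readers who want to try something similar, the first steps of this pipeline might be sketched as below. This is an illustration under stated assumptions, not our actual script: the three rows of values are invented (they are not from the Report), the function names are hypothetical, the Gower similarity is written in its usual range-scaled form for numeric variables, and the conversion to a distance is assumed to be simply 1 minus the similarity. The neighbor-net step itself is not shown.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance; None marks a missing value."""
    scaled_columns = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        mu, sigma = mean(present), pstdev(present)
        scaled_columns.append([None if v is None else (v - mu) / sigma for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def gower_distance(rows):
    """Pairwise Gower distance for numeric variables, skipping missing values."""
    ranges = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        ranges.append(max(present) - min(present))

    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Range-scaled differences, only for variables observed in both rows.
            scaled = [abs(a - b) / r
                      for a, b, r in zip(rows[i], rows[j], ranges)
                      if a is not None and b is not None]
            if not scaled:
                dist[i][j] = dist[j][i] = math.nan
                continue
            similarity = 1.0 - sum(scaled) / len(scaled)
            dist[i][j] = dist[j][i] = 1.0 - similarity  # assumed conversion
    return dist


# Invented values for three countries and three variables.
toy = [
    [7.6, 1.30, 0.95],
    [5.2, 0.90, None],
    [3.1, 0.40, 0.60],
]
distances = gower_distance(standardize(toy))
for row in distances:
    print([round(value, 3) for value in row])
```

The resulting square matrix is the kind of input that a neighbor-net program expects.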

A network analysis

The resulting network is shown next. Each point represents a country, with the name codes following the ISO-3166-1 standard. The spatial relationship of the points contains the summary information — points near each other in the network are similar to each other based on the data variables, and the further apart they are then the less similar they are. The points are color-coded based on major geographic regions (asterisks highlight single states that don't group with the rest of their geographical region). We have added some annotations for the major network groups, indicating which geographical regions are included — these groups are the major happiness groupings.


In this blog post we do not want to risk over-interpreting the data, as explained in the final paragraphs below. However, it is obvious that there are distinct patterns in the network. Happiness and its correlates are not randomly distributed on this planet but, not unexpectedly, relate to the local socio-political situation.

Starting at the bottom-left, we have a geographically heterogeneous cluster of very well-off countries, either welfare states (as in northern Europe), capitalist democracies (eg. the USA, Singapore, Hong Kong), or oil-rich monarchies with high levels of public spending (as in the Middle East). Moving clockwise, the next cluster has much of the rest of the western and central European countries, along with the financially well-off parts of South America and Asia. The next cluster has many of the remaining eastern European countries, plus the nearest parts of Asia, where government spending on welfare is still apparent. Clearly, national wealth plays a large part in happiness, in spite of the well-known adage to the contrary.

This is followed, at the top-middle of the network, by a broad neighborhood (not a distinct cluster), where government spending on welfare is much less apparent, at least to an outsider. The countries here come from Europe, Asia, and Central plus South America (including, at the moment, Greece). Happiness and its correlates are reported to be much lower here.

To make this situation clearer, here is a version of the network with some of the happiness scores annotated — values are provided for the first and last 10th percentile of the happiness score, and the 10 largest (by population) countries in the world.


On the opposite side of the network, happiness is also apparently lower, but with a different set of correlations among the variables. There is a two-part cluster of geographically heterogeneous countries at the bottom-middle, plus a neighborhood at the bottom-right. The latter includes China and India, the two most populous countries (with one-third of our people), while Indonesia (4th) and Brazil (6th) are in the neighborhood at the top of the network.

Finally, the cluster at the right consists mostly of African countries, plus Pakistan (the 5th most-populous country). In this cluster, happiness is reported to be at its lowest observed level. Much of the world's monetary aid is spent in Africa, of course, to try to improve the situation, although there is clearly a long way to go. Not unexpectedly, most of the world's migrants come from the right-hand part of the network, which is one of the main focuses of the Happiness Report.

Final comments

It is interesting to note that the Bhutan (code BTN) government reportedly aims to increase the Gross National Happiness rather than the GDP (see Gross national happiness in Bhutan: the big idea from a tiny state that could change the world). The network shows that their 2015-2017 happiness is quite different to that of their geographical neighbors. However, it also suggests that they still have a long way to go.

We should finish the discussion with a general point about surveys, such as the Gallup Poll on which the Happiness Report is based. Respondents are not always completely honest when answering survey questions, which is why pre-election polls sometimes get it wrong — people are most serious when faced with an actual decision, rather than a question. All of the results here need to be interpreted in this light — they may not be far wrong, but they are unlikely to be completely right.

Apart from anything else, there can be cultural differences in the way in which the answers to the Gallup World Poll questions are treated. Does "happiness" really mean the same thing across all cultures? We know that "beauty" does not, and "freedom" does not; so why not "happiness"? After all, things like reported happiness are likely to be confounded with other feelings such as national pride. This issue could presumably be addressed by looking at other answers from the Gallup Poll.

Monday, July 9, 2018

Using splits graphs for multivariate data analysis


Data containing multiple measurements for each of a set of objects are usually too complex to be viewed easily in their raw form. Therefore, methods have been developed to summarize the data down to something simpler. This is called multivariate data analysis.

One of the issues that needs to be addressed is that a data summary is designed to lose information. The goal is to somehow keep the most important information in the summary. Clearly, the simpler the summary, the more information we are likely to lose.

This post is a simplistic introduction to why splits graphs, which were originally developed to summarize multivariate phylogenetic data, are usually very good data summaries. It compares the ability of maps, indexes and networks to summarize data.

Maps

A map is a 2-dimensional drawing of some piece of 4-dimensional space-time. For example, the map shown here represents the southern part of Scandinavia.

A map is quite successful as a data summary. It reduces the 4-dimensional world down to 2+ dimensions — latitude and longitude are represented accurately; we use symbols or colors/shading to represent altitude; and we choose one specific time (thus eliminating that dimension). We can therefore reconstruct much of the 3-dimensional world from looking at a map (ie. much of the original information is retained in the summary).


In our example, we can see even from a glance at the map that Denmark is as flat as a pancake, Norway is very hilly, and Sweden is somewhere in between. We can also see that Uppsala and Oslo are at the same latitude, and that the simplest way to get from Uppsala to Trondheim is likely to be via Östersund rather than Oslo.

Indexes

An index is a linear ordering of numbers measuring some calculated characteristic of a set of objects. It condenses a series of measurements for each object down to a single number. The index shown here refers to the hotels in Östersund (which we might stay at on our way from Uppsala to Trondheim), and indicates the overall quality score from a well-known online booking site. The index summarizes a set of features of the hotels that might be of interest to potential guests.

Hotel                                Quality score
Hotell Emma                          8.9
Clarion Hotell Grand                 8.7
Hotell Stortorget                    8.6
Quality Hotell Frösö Park            8.6
Hotell Jämteborg                     8.3
Best Western Hotell Ett              8.1
Best Western Hotell Gamla Teatern    8.0
Hotell Älgen                         7.9
Hotell Zäta                          7.8

Unfortunately, an index is rarely very successful as a data summary. It reduces multi-dimensional data down to only 1 dimension. Therefore, we cannot tell which dimensions contribute to each value of the index — the same value could arise in many different ways. We therefore cannot reconstruct any of the original dimensions — what goes into the summary cannot come back out (as it can for a map).



Feature            Hotell Stortorget    Quality Hotell Frösö Park
Staff              8.9                  8.7
Location           9.4                  8.9
Cleanliness        9.1                  8.3
Comfort            8.5                  8.2
Facilities         7.7                  8.8
Breakfast          8.5                  8.5
Free WiFi          9.1                  8.7
Value for money    8.3                  8.9

In our example, two of the hotels have exactly the same index score, but this does not necessarily mean that the two hotels are the same as regards the quality features, as shown above. For instance, there are notable differences between them in Location and Value for Money, and even larger differences in Cleanliness and Facilities. This information is lost in the calculation of the quality index.
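The loss of information can be made concrete with a short Python sketch that uses the feature scores tabulated above. Note the assumption: we do not know how the booking site computes its index, so a plain mean of the eight features stands in for it here, purely for illustration.

```python
FEATURES = ["Staff", "Location", "Cleanliness", "Comfort",
            "Facilities", "Breakfast", "Free WiFi", "Value for money"]
STORTORGET = [8.9, 9.4, 9.1, 8.5, 7.7, 8.5, 9.1, 8.3]
FROSO_PARK = [8.7, 8.9, 8.3, 8.2, 8.8, 8.5, 8.7, 8.9]

# Per-feature difference (Stortorget minus Frösö Park), rounded to one decimal.
differences = {name: round(a - b, 1)
               for name, a, b in zip(FEATURES, STORTORGET, FROSO_PARK)}

# Assumption: a plain mean stands in for the site's undisclosed index formula.
mean_stortorget = sum(STORTORGET) / len(STORTORGET)
mean_froso = sum(FROSO_PARK) / len(FROSO_PARK)

# Rank the features by how strongly the two hotels differ.
ranked = sorted(differences.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, diff in ranked:
    print(f"{name:16s} {diff:+.1f}")
print(f"means: {mean_stortorget:.2f} vs {mean_froso:.2f}")
```

The two means differ by less than a tenth of a point, while individual features differ by up to more than a full point, with opposite signs that cancel out in the summary.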

Networks

A splits graph (a type of phylogenetic network) is a 2-dimensional drawing of some multi-dimensional set of data, such as might be used to calculate an index. The network shown here is based on the same data used to calculate the quality index above.

A network reduces multi-dimensional data down to 2+ dimensions. Each object is represented as a point — the spatial relationship of the points (their neighborhood) has meaning; and the inter-connecting lines have meaning (they are groups supported by the data). Such a network is therefore much more successful as a summary than is an index. Like a map, it will be very successful for 3-dimensional data, with potentially reduced success as the number of dimensions increases — the rate of information loss will depend on how well-correlated are the dimensions.


In our example, the main pattern in the network shows the relative quality of the hotels, as measured by the index, descending from top to bottom (so that all of the information from the index is in the network). However, the graph also emphasizes the difference between the two hotels with identical index scores. Indeed, it shows us that the Quality Hotell Frösö Park is probably more similar to the Clarion Hotell Grand than to the Hotell Stortorget.

Alternatives

There are other forms of multivariate data analysis that are often used instead of networks. Two common ones are: an ordination, which reduces multi-dimensional data down to 2 dimensions only; and a cluster tree, which reduces to 1 dimension only. These are therefore often less successful as data summaries. Indeed, a network is very much like a combination of an ordination and a cluster tree, with the best features of both methods and fewer of their limitations.

Further reading

How to interpret splits graph

Primer of Phylogenetic Networks

Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.

Monday, July 2, 2018

Reticulation at its best — an example from the oaks


One particular case where networks turn out to be a versatile tool is the study of low-level evolutionary patterns. This is especially so when we leave the comfort zone of well-sorted molecular markers, and use more than a single individual per species. Our recently published data set on (mostly Mediterranean) oaks provides a nice example of this.

Why so few people study oaks at the intra-generic level

Oaks are notoriously difficult to study because they don't bother too much about species boundaries (which can be more or less obvious) and, at one point, decided to not sort their plastids at all (and full plastomes, as I once saw for myself first-hand, won't help). Hence, all reasonable phylogenetic reconstructions of oak evolution have been based on genetic data from the nucleome. However, this poses a new problem: the sequenced nuclear gene regions allow the recognition of the major lineages (which recently have been formalized), but the closer one comes to the species level the more difficult it is to resolve anything at all.

Even the famous ITS region, which includes the weakly constrained internal transcribed spacer ITS1 and the structurally quite constrained ITS2, and has been frequently advocated as a plant barcode, turns out to be a two-edged sword. Relationships between the major intra-generic lineages are relatively clear, and the ITS is pretty divergent down to the species level; but at the individual level, one faces an intra-genomic divergence that often outmatches inter-species differentiation.

In some groups, like the most speciose and most widespread white oaks (sect. Quercus), identical ITS variants exist in individuals / species separated today by thousands of kilometers of ocean or icy wasteland. One possible explanation is that oaks have very large population sizes, and they are wind-pollinated, so that they have a high capacity to permanently homogenize their gene pools. Plastids, on the other hand, are only transmitted via the large fruit, the acorns, and the main animal vectors for distributing acorns, the jaybirds, are sedentary birds. Their backup vector, the squirrels, are known for hoarding a lot of acorns in a single place, but not for migrating globally (unless we assist them).

Nonetheless, we readily notice that the intra-individual differentiation patterns appear not to be entirely random, and so in our study we moved to another nuclear multi-copy spacer known to be more variable than the ITS1 and ITS2 (hence, largely ignored by molecular phylogeneticists) — the 5S intergenic spacer (5S IGS). It didn't help too much for solving the white oak puzzle (in western Eurasia), but did give us new insights into the two other western Eurasian sections: Ilex and Cerris.

The 'host-associate' framework

A cloned 5S-IGS (or ITS) sequence is not a good OTU, because we are usually not interested in a clone phylogeny (a mere sequence genealogy), but in the phylogenetic relationships between the individuals or species carrying the cloned sequence variants: the nuclear spacer population. Even networks struggle with such data, and my colleague Markus Göker came up with the idea to treat this in the form of hosts, the individuals, and associates, the cloned sequences found in the individual (Göker & Grimm, BMC Evol. Biol. 2008 — open access). There are several options to transfer the primary clone (associate) data into individual (host) data.

Options that we tested for transferring associate data into host data.
CM = character matrix, DM = distance matrix. The independent CMhosts used were morphological matrices. ENT (entropy), FRQ (frequency), CON (strict consensus), MOD (modal consensus) and SIZ (sample size) are character transformations implemented in Markus' g2cef; PBC and MIN are distance transformations implemented in pbc (these and other little helper programmes can be found here).

Using three cloned (ITS) datasets, we found that for these data the "Phylogenetic Bray-Curtis" (PBC — see the next figure) distance transformation outperforms the other tested options.

Computation of the "Phylogenetic Bray-Curtis" distance. It's a modification of the Bray-Curtis dissimilarity using the minimum distance for each covered row/column instead of absence/presence. H1/H2 = hosts with different sets of associates (A1–A6).

Incidental but interesting insights

Whenever I come into contact with such data I advise the use of the PBC distance transformation as the basis for the main individual-level network, but also to run the MIN distance transformation: MIN will just calculate the minimum inter-clone distance between the clone samples of two individuals, and use this as the inter-individual distance.

Neighbour-net using the MIN transformation

The MIN network (above) is quite bushy for these data, because we naturally have many shared 5S-IGS variants among individuals of the same species, but occasionally also shared by individuals of different species. Nonetheless, it visualizes some basic differentiation patterns in the clone sample: compare e.g. the coherent cluster 3, the crenata-suber lineage (the 'Cork Oaks'), in which all individuals share a pair of very similar to identical 5S-IGS clones; and the divergent cluster 4, the 'Vallonea' oaks, in which all individuals have different sets of clones, but unique 5S-IGS variants separating them from all other Cerris oaks (long proximal edge bundle).

Furthermore, we have potential F1 hybrids (morphologically intermediate) in our sample, and such hybrids, e.g. tj08, should have very low (to zero) MIN distances with members of their parental lineages.

However, the PBC network (below) is as beautiful as it gets — I really love this transformation, as it always comes up with something usable and interpretable.

Neighbor-net based on PBC-transformed inter-individual distances. See Simeone et al. (PeerJ PrePrints 2018 — open access pre-print) for a discussion.

However, this network was a last-minute addition, because a happy little "accident" happened along the way, and the networks we were working with and looking at while drafting the paper were not PBC networks, as I thought.

It happened this way. Also implemented in Markus' little helper program are AVG, the average inter-clone distance, and MAX, the maximum inter-clone distance. AVG and MAX don't result in a proper distance matrix, because the diagonal will be the average or maximum distance between the clones of a single individual, and not all-zero as it should be (for MIN it's always zero). [We discussed a few options to modify AVG and MAX to ensure a zero diagonal, but couldn't devise something that makes sense.]
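The MIN, AVG and MAX transformations are simple enough to sketch in Python. This is not Markus' program, just a minimal illustration of the idea: the clone names and distances are invented, the function names are hypothetical, and I assume here that each summary runs over all clone pairs between two individuals, including a clone paired with itself when an individual is compared with itself.

```python
from itertools import product

# Hypothetical clone-to-clone distances, keyed by unordered clone pair.
CLONE_DIST = {
    frozenset(("a1", "a2")): 0.04,
    frozenset(("a1", "b1")): 0.00,  # individuals A and B share an identical clone
    frozenset(("a1", "b2")): 0.10,
    frozenset(("a2", "b1")): 0.04,
    frozenset(("a2", "b2")): 0.08,
    frozenset(("b1", "b2")): 0.10,
}
CLONES = {"A": ["a1", "a2"], "B": ["b1", "b2"]}  # individual -> its clones


def clone_distance(x, y):
    """Distance between two clones; a clone is at distance zero from itself."""
    return 0.0 if x == y else CLONE_DIST[frozenset((x, y))]


def host_distance(host1, host2, summary):
    """Summarize all clone pairs between two individuals (min, max or average)."""
    values = [clone_distance(x, y)
              for x, y in product(CLONES[host1], CLONES[host2])]
    return summary(values)


def average(values):
    return sum(values) / len(values)


for name, summary in [("MIN", min), ("AVG", average), ("MAX", max)]:
    matrix = [[host_distance(h1, h2, summary) for h2 in CLONES] for h1 in CLONES]
    print(name, matrix)
```

In this toy case the diagonal is zero for MIN but not for AVG or MAX, which is exactly why the latter two do not yield a proper distance matrix.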

However, the SplitsTree program did not insist on an all-zero diagonal, so the AVG- and MAX-transformed distance matrices will still produce a Neighbor-net. So, what I assumed were PBC networks were in fact AVG networks.


Neighbor-net based on AVG-transformed inter-individual distances.

It took me quite a long time to recognize this "error" because, in contrast to the AVG (and MAX) networks I looked at when we did the 2008 paper, the one for the oaks made a lot of sense. Notably, the suspected F1 hybrids were perfectly resolved, spanning the corresponding boxes, and the species aggregates (clusters) did make sense regarding the general geographic setting, the history of the region under study, and their morphology.

Same graph as above, highlighting known or potential F1 hybrids spanning the corresponding boxes.

For these data (with a minimum of four clones available per individual, individuals covering all species, and including the entire range of the section in western Eurasia), the AVG network better shows the potential F1 hybrids (or introgrades) than the (more methodologically sophisticated) PBC network. However, the latter makes more sense regarding speciation processes and the history of the group (because the distance is a "phylogenetic" version of the well-known Bray-Curtis distance).


A "cactus-oak" fusion graph depicting nuclear and plastid differentiation (and evolution) in Quercus Group Cerris.

Take-home message

First, it's always good to delegate work you can do by heart to somebody new to it! This forces its propagation, which is important. More importantly, though, one has one's preferences and established analysis pipelines, and they may have become restricted in scope. I mainly used the -a (AVG), -i (MIN) and -x (MAX transformation) options in the little helper program to quickly summarize some of the primary differentiation data: for example, whether individuals have identical clones (MIN = 0), whether intra-individual divergence is higher than inter-individual divergence (MAX intra-individual > MIN inter-individual), and whether individuals have strongly divergent clones (high MAX). AVG was computed and tabulated but never cherished by me. I always looked at the MIN-transformed networks, since this provides a valid distance matrix, but then ignored them. And I never again tried to infer a Neighbor-net based on AVG or MAX transformations after our 2008 paper.

Second, Neighbor-nets are so quick to infer that there is no resource- or logic-related reason to not just run whatever distance one has on hand or can easily establish. Maybe even the biologically less sound distances will reveal some interesting aspect (there are a lot of biological arguments that can be put forward for dismissing AVG distances in favour of PBC distances).

Paper (pre-print) and open data
Simeone MC, Cardoni S, Piredda R, Imperatori F, Avishai M, Grimm GW, Denk T. 2018. Comparative systematics and phylogeography of Quercus Section Cerris in western Eurasia: inferences from plastid and nuclear DNA variation. PeerJ Preprints 6: e26995v1.
Primary data and analysis files are included in the Online Supplementary Archive: Simeone et al., PeerJ Preprints, doi: 10.7287/peerj.preprints.26995v1/supp-4. (See Readme.txt included in the top folder of the archive.)