Wednesday, March 5, 2014

Recognizing groups in splits graphs

Splits graphs are produced by distance-based network methods such as NeighborNet and Split Decomposition, by character-based methods such as Median Networks and Parsimony Splits, and by tree-based methods such as Consensus Networks and SuperNetworks. They represent sets of node clusters that may overlap. If the clusters are nested then the graph will be tree-like, but if they overlap then the graph will show complex reticulation patterns. In the latter case, there is no simple way to summarize the patterns as a set of "groups" of nodes, although there is clearly a strong tendency in the literature for practitioners to try to do so.

I have written before about "How to interpret splits graphs". In such a graph, the edges represent the separation between two clusters of nodes in the network (i.e. they split the graph in two). Recognizing groups of nodes should therefore be based on the splits. Ideally, each group of nodes should represent a split in the network, preferably a well-supported split.
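As a rough illustration of what "a group represents a split" means in practice, here is a minimal Python sketch. It assumes that the splits have already been exported from the network program as a list of (one side of the bipartition, weight) pairs. The function name `find_supporting_split`, the toy taxa and the weights are all hypothetical, chosen only for this example.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple

# A split is stored as one side of the bipartition plus a weight (assumed format).
Split = Tuple[FrozenSet[str], float]


def find_supporting_split(
    group: Iterable[str], taxa: Iterable[str], splits: List[Split]
) -> Optional[Split]:
    """Return the split that separates exactly `group` from all other taxa, or None."""
    group = frozenset(group)
    complement = frozenset(taxa) - group
    for side, weight in splits:
        # A split A|B is the same bipartition whichever side was stored.
        if side == group or side == complement:
            return (side, weight)
    return None


# Hypothetical toy data, not taken from any real network.
taxa = {"a", "b", "c", "d", "e"}
splits = [
    (frozenset({"a", "b"}), 0.8),
    (frozenset({"b", "c"}), 0.1),
    (frozenset({"d", "e"}), 0.5),
]

print(find_supporting_split({"a", "b"}, taxa, splits))  # matched by the first split
print(find_supporting_split({"a", "c"}, taxa, splits))  # None: no single split supports it
```

A group that returns `None` here is not necessarily meaningless, but it cannot be justified by pointing to one split; the returned weight gives a first idea of how well supported a matching split is.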

However, if the split pattern is complex then recognizing groups of nodes will also be complex. This can be seen in the following splits graph, which is taken from the paper by Robert M. Ross, Simon J. Greenhill and Quentin D. Atkinson (2013. Population structure and cultural geography of a folktale in Europe. Proceedings of the Royal Society B 280: 20123065). The network shows the relationships among 32 ethnolinguistic cultures based on the characteristics of one of their folktales.

This network is not very tree-like, and yet the authors recognize five main ethnolinguistic groups (shown in different colors). Inspection of these groups reveals:
  • The light-orange group represents a well-supported split in the graph, and is thus uncontroversial; but none of the other groups are represented by a single split.
  • The pink group represents two splits, one clustering English, Irish, Scottish and Danish, and one clustering Danish, Latvian and German. These splits are incompatible with only one other minor split, and so the group is relatively uncontroversial.
  • The green group also represents two splits, one clustering Armenian and Turkish and one clustering Turkish and Greek. These are well-supported splits, with only minor incompatibility with other splits, and so perhaps this group is also uncontroversial.
  • The purple group is supported by a single split only if Greek is included in the group. Clearly, this conflicts with the green grouping. However, without Greek there is not much in the way of splits that support the purple grouping.
  • There is a very poorly supported split that unites the dark-orange group only if Bulgarian and Czech are included in the group. There are three well-supported splits that combine to support the group provided that Bulgarian is included. In both cases this conflicts with the purple grouping.
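The reason that the pink and green groups each need two splits can be checked mechanically. Two splits A|B and C|D are compatible (they can sit together on a tree) if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The Python sketch below applies this test to the clusters named in the list above. Note the assumptions: the taxon set contains only the eleven cultures mentioned by name in this post, not all 32 in the network, and the function name `compatible` is my own.

```python
def compatible(side_a, side_b, taxa):
    """True if the splits side_a|rest and side_b|rest could both appear on one tree."""
    a, b = frozenset(side_a), frozenset(side_b)
    taxa = frozenset(taxa)
    not_a, not_b = taxa - a, taxa - b
    # Compatible if at least one of the four pairwise intersections is empty.
    return any(len(x & y) == 0 for x in (a, not_a) for y in (b, not_b))


# Assumption: only the cultures named in the post, not the full set of 32.
taxa = {
    "English", "Irish", "Scottish", "Danish", "Latvian", "German",
    "Armenian", "Turkish", "Greek", "Bulgarian", "Czech",
}

pink_1 = {"English", "Irish", "Scottish", "Danish"}
pink_2 = {"Danish", "Latvian", "German"}
green_1 = {"Armenian", "Turkish"}
green_2 = {"Turkish", "Greek"}

print(compatible(pink_1, pink_2, taxa))    # False: they overlap at Danish without nesting
print(compatible(green_1, green_2, taxa))  # False: they overlap at Turkish without nesting
print(compatible(pink_1, green_1, taxa))   # True: the two clusters are disjoint
```

So, within each of these two groups the supporting splits conflict with each other, which is exactly what produces the box-like reticulation in that part of the graph.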
So, at least two of the recognized groups can be considered doubtful, as groups, based on the network alone. The authors' motivation for their groupings is at least partially based on geographical considerations:
"The NeighbourNet in figure 2 represents graphically the pattern of regional clustering in folktale variation. The five clusters we identify provide insights into possible cultural spheres of influence in Europe since the folktale’s inception."
Nevertheless, it seems unwise to recognize all five colored regions of the network as "groups" or "clusters" of nodes, since it is not obvious that the network actually supports each of them as a group. Perhaps we should call them "neighborhoods", or some similar term, so as not to be misleading. We could define a neighborhood as a collection of nodes that are in close proximity in the splits graph but that do not necessarily represent any unique combination of well-supported splits.
