The Genealogical World of Phylogenetic Networks: Time consistency

Showing posts with label Time consistency. Show all posts

Monday, June 10, 2019

Why don't people draw evolutionary networks sensibly?

In phylogenetics there are two types of network:

those where the network edges have a time direction, whether explicit or implied; and
those where the edges are undirected.

The latter networks are among the most valuable tools ever devised for the exploration of multivariate data patterns; and this blog is replete with examples drawn from all fields that produce quantitative data (see the Analyses blog page). The first type of network, however, is the only one that can display hypothesized evolutionary histories — that is, they can truly be called evolutionary networks.

Evolutionary networks have a set of characteristics that are essential in order to successfully display biological histories, such as:

no directed cycles, because otherwise one of the descendants would be its own ancestor;
time consistency, meaning that reticulations in the network only occur between contemporaries.

The latter requirement is not needed for the history of human artifacts, because the ideas on which those artifacts are based can be recorded, and then not used until much later — ideas can "leap forward" in time. There are a number of examples of this in this blog, as discussed in last week's post (A phylogenetic network outside science).

However, time consistency is pretty much universal in biology (see the post on Time inconsistency in evolutionary networks). Natural hybridization and introgression require two living organisms in order to occur, as does horizontal gene transfer. This is basic biology, at least outside the laboratory.

So, the question posed in this post's title refers to the fact that so many people draw their evolutionary networks in a manner that appears to violate time consistency.

Consider this example (from: Interspecies hybrids play a vital role in evolution. Quanta Magazine):

Note that the reticulation edges (the dashed lines) represent gene transfers by introgression or hybidization, and yet none of them are drawn vertically, as they would need to be in order to be time consistent (since time travels from left to right).

It might be argued that most of these are not all that important in practice, but the one to the left quite definitely matters very much. It shows gene transfer between: (i) an organism that speciated 3.65 million years ago and (ii) an organism that is the descendant of one that speciated 3.47 million years ago. The 180,000 years between those two events are not irrelevant; and they make the claimed gene transfer impossible.

One might think that this is simply the general media misunderstanding the network requirements, but this is not so. The diagram is actually a quite accurate representation of the one from the original scientific publication (from: Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Science Advances 3: e1700299; 2017.):

The network shows the same series of hybridizations / introgressions. However, this time three sets of gene transfers are shown to be time consistent, represented by the horizontal arrows (since time flows from top to bottom). Two of the three diagonal arrows (light blue and orange) could be made time consistent (ie. drawn horizontally), although the authors have chosen not to do so, apparently for artistic reasons. However, the first reticulation cannot be made time consistent, for the reason outlined above.

So, people, please think about what you are drawing, and don't show things that are biologically impossible,

Wednesday, May 21, 2014

Phylogenetics of computer viruses?

There is a difference between phylogenetics and clustering or classification. The latter processes put objects into groups based on some intrinsic features, but the former uses their intrinsic features to expresses their evolutionary history. Not all objects have an evolutionary history, even though they can all be put into groups. Furthermore, even objects that do have a history do not necessarily have an evolutionary history. Evolution involves ancestor-descendant relationships (as well as sister-group relationships), and not all of history involves ancestors and descendants.

This distinction is important for the use of phylogenetics as a metaphor for the history of non-biological objects. Outside of biology, many things are claimed to have a "phylogenetic history", including languages and most human artifacts. As I have noted before, one has to be careful when applying this metaphor (see False analogies between anthropology and biology).

One particular example that I have encountered involves the development of computer viruses and other malware (Iliopoulos et al. 2008). Metaphorically, such viruses can be seen to be phylogenetically related, because new viruses are often based on previous ones — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the metaphor is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Sorkin (1994) seems to have been the first to discuss the possibility of computer virus evolution, but the first empirical attempt to reconstruct a digital phylogeny appears to have been by Hull (1995b), who studied the Stoned computer virus (a virus that infected the boot sector of PCs between 1990 and 1995), as shown below.

Since then, phylogenetics has been a popular topic in the study of computer malware (eg. Goldberg et al. 1996; Carrera & Erdélyi 2004; Karim et al. 2005a,b; Ma et al. 2006; Wehner 2007; Walenstein et al. 2007; Wagener et al. 2008: Hayes et al. 2009; Khoo & Lió 2011; Guan et al. 2012). As noted by Webster & Malcolm (2007) these studies "present classifications of malware based on phylogenetic trees, in which the lineage of computer viruses can be traced and a 'family tree' of viruses constructed based on similar behaviors." (Note that there are other possible uses of phylogenetics related to computer programming, as exemplified by Ji et al. 2008.)

In all cases the phylogeny was produced using a distance-based clustering algorithm (ie. the tree is a phenetic one). Some of the distances are well motivated in terms of historical changes in the intrinsic attributes of the malware, but that does not necessarily make the resulting phenogram a phylogeny. So, the methods use basic clustering techniques to produce a tree, thus treating classification as phylogenetics (or phenetics as phylogenetics). This certainly clusters the objects, but there is no necessary reason for the clustering to reflect phylogeny.

Thus, a simple concept of clustering is inadequate, even though it can be used to construct a tree. A phylogenetic tree expresses the nested hierarchy formed by the shared derived character states, but not by anything else. A tree expresses nested clusters, but it is a form of "special nesting" that expresses a phylogeny. Only one form of tree is relevant to phylogenetics, and trees formed in other ways are likely to be suitable only under the simplest circumstances.

Furthermore, the general conception in these papers of a virus phylogeny as tree-like is clear. As noted by Hull (1995a):

Computer viruses evolve in complex ways not usually encountered in nature. The transplantation of large segments of computer code from one virus to another need not represent evolutionary relationship, for example. A newer virus may just represent a debugged or patched earlier version. The virus author may have deliberately incorporated parts of other viruses as a short cut, or because the plagiarized code is useful. If the virus incorporates code generating 'engines', similar code may appear in viruses with no other similarities. Structural similarities deriving from functional similarities likewise derive from several sources.

As we now know, these sorts of evolutionary events are usually found in nature, but they create reticulate histories, involving horizontal as well as vertical evolution. That is, computer virus phylogenetics involves reticulation — new computer code takes bits and pieces from various previous viruses. In particular, there is also what is called "oblique" evolution, in which there is horizontal evolution between generations. This is a characteristic of many histories involving human artifacts (see Time inconsistency in evolutionary networks), and it allows information to "time travel", so that the information available for horizontal transmission can come from the distant past as well as from the present.

So, malware evolution is not tree-like. Only two of the papers cited above seem to acknowledge this fact. Khoo & Lió (2011) were quite conventional in using splits graphs rather than unrooted trees to display their data, although they do not specify the algorithm for producing the networks. They do, however, claim that "networks were more useful for visualising short nop-equivalent code metamorphism than trees".

Goldberg et al. (1996) were more innovative, and analyzed their data using what they called a "phyloDAG", which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network). Interestingly, they note that "Beyond the computer virus realm for which it was conceived, the phyloDAG is also a plausible model for evolution of bacterial populations." Indeed, the possibility of multiple roots has been explicitly suggested for prokaryote phylogenetics (see Can networks have multiple roots?). I wouldn't doubt that it is also feasible for language history.

References

Carrera E, Erdélyi G (2004) Digital genome mapping – advanced binary malware analysis. Virus Bulletin Conference 2004.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Guan Q, Tang Y, Liu X (2012) A malware homologous analysis method based on sequence of system function. Advanced Science and Technology Letters, ASTL 15: Advanced Computer Science and Technology. Science and Engineering Research Support Society, Sandy Bay, Tasmania, Australia.

Hayes M, Walenstein A, Lakhotia A (2009) Evaluation of malware phylogeny modelling systems using automated variant generation. Journal in Computer Virology 5: 335-343.

Hull DB (1995a) Computer viruses: naming and classification. Virus Bulletin Sept: 15-17.

Hull DB (1995b) Computer viruses: naming and classification, part II. Virus Bulletin Oct: 16-17.

Iliopoulos D, Adami C, Ször P (2008) Darwin inside the machines: malware evolution and the consequences for computer security. Virus Bulletin Conference 2008.

Ji J-H, Park S-H, Woo G, Cho H-G (2008) Generating pylogenetic tree of homogeneous source code in a plagiarism detection system. International Journal of Control, Automation, and Systems 6: 809-817.

Karim ME, Walenstein A, Lakhotia A (2005a) Malware phylogeny using maximal pi-patterns. Proceedings of the EICAR 2005 Conference, pp 156-174.

Karim ME, Walenstein A, Lakhotia A, Parida L (2005b) Malware phylogeny generation using permutations of code. Journal in Computer Virology 1: 13-23.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Ma J, Dunagan J, Wang HJ, Savage S, Voelker GM (2006) Finding diversity in remote code injection exploits. Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp 53-64. ACM, New York.

Sorkin GB (1994) Grouping related computer viruses into families. Proceedings of the IBM Security ITS 1994.

Wagener G, State R, Dulaunoy A (2008) Malware behaviour analysis. Journal in Computer Virology 4: 279–287.

Walenstein A, Hayes M, Lakhotia A (2007) Phylogenetic comparisons of malware. Virus Bulletin Conference 2007.

Webster M, Malcolm G (2007) Classification of computer viruses using the theory of affordances. Second International Workshop on the Theory of Computer Viruses.

Wehner S (2007) Analyzing worms and network traffic using compression. Journal of Computer Security 15: 303-320. (arXiv:cs/0504045v1, 2007)

Wednesday, July 4, 2012

Time inconsistency in evolutionary networks

The temporal ordering of the nodes (and branches) is usually treated as an important feature in an evolutionary network of biological organisms, because the order must be time consistent (Baroni et al. 2004, 2006; Moret et al. 2004). That is, for reticulation events the "horizontal" gene flow can only occur between species that are contemporaries. So, speciation events occur successively but reticulation events occur instantaneously (Sang and Zhong 2000).

For example, it would be unrealistic to hypothesize either a hybridization or a horizontal gene transfer event between a species and one of its own ancestors. Furthermore, each reticulation event must not only be consistent on its own but must be consistent in relation to all of the other events.

Mathematically, inconsistency creates directed pseudo-cycles in the network graph, so that it is not acyclic, as required for an evolutionary history (see previous blog post). Time consistency is thus seen as a useful means of validating a network as a potential biological history, and can even be used as a criterion for choosing among otherwise equally optimal networks.

However, evolutionary analysis is not applied only to biological organisms. It has also been applied to the study of languages (Atkinson & Gray 2005) and to cultural objects (Collard et al. 2006). Indeed, Darwin himself recognized early on that it would be important to show that language (a characteristic solely of humans) had a natural origin and that it develops in a genealogical fashion (ie. it has a pedigree).

Thus, both language and cultural objects have an historical component that can be studied, and both can fit into an evolutionary framework of variation + transmission + selection (Dagg 2011). Moreover, the evolutionary history also consists of both vertical and horizontal transmission. This means that the same data-analysis techniques can potentially be applied to biology, language and culture (Heggarty et al. 2010; Gray et al. 2010).

The issue that I wish to raise here is that time consistency is not a requirement of the evolution of either language or cultural objects, the way that it is for biological organisms. Organisms store the information (that is vertically and horizontally transmitted) in genes that they carry with them, which is what restricts reticulation to occurring only between contemporaries. However, language and culture store their "information" externally, either in the minds of people or in permanent or semi-permanent records (either written or pictorial). Thus, the information available for horizontal transmission can come from the distant past, as well as from the present. *

It is important to note that for language and culture the biological ideas of vertical and horizontal transmission of genetic information need modification (Cavalli-Sforza and Feldman 1981). Vertical (or descending) transmission still involves faithful copying of the information (with perhaps some losses or minor modifications). Lateral transfer, however, can be either horizontal transmission (between contemporary generations) or oblique transmission (between different generations), and it is the latter that allows time-travel of information.

Lateral transfer in this context may be a form of hybridization, in which new concepts are added from elsewhere (eg. synonymous words), but is likely to be a form of HGT in which concepts are simply replaced with something from elsewhere (eg. a new word effectively replaces an old word). Recombination, in which concepts are mutually exchanged, may be rather rare.

As an illustration, Dagg (2011) provides some interesting examples of lateral transfer in the parts of mouse traps. For example, he notes that: "Torsion power may have been transmitted laterally from Egyptian torsion traps to prefabricated dead-fall traps." These traps need not be contemporaneous, because the ideas being transferred may be from pictures or descriptions of old traps rather than from concurrently existing traps. (Joachim Dagg also has a couple of blog posts where he further discusses the evolution of mouse traps: post 1 — post 2.)

As an alternative example, Johnson et al. (1989) provide an evolutionary network showing the history of the various software (mostly) and hardware components of the revolutionary Xerox 8010 "Star" Information System (ie. computer), introduced in April 1981. Note that almost all of the lateral transfer events (single arrows; mostly hybridization) are time inconsistent. To quote the authors: "Although Star was conceived as a product in 1975 and was released in 1981, many of the ideas that went into it were born in projects dating back over three decades."

Fig. 8 – How systems influenced later systems.
This graph summarizes how various systems related to Star have influenced one another over the years. Time progresses downwards. Double arrows indicate direct successors (i.e., follow-on versions). Many "influence arrows" are due to key designers changing jobs or applying concepts from their graduate research to products.

The implications of time-travelling laterally transferred information for network construction methods may be unfortunate, in the sense that evolutionary networks in biology may be quite different from those for language and culture, with the latter pair requiring somewhat different methods. At a minimum, the requirements for choosing among alternative networks will be different.

A quick look at the current literature involving network analysis of languages and cultural artifacts shows an almost universal use of unrooted graphs, most often a Neighbor-Net, Reduced-Median or Median-Joining network. Such networks cannot directly represent evolutionary history because there is no time direction in the graph. This type of analysis thus neatly side-steps the issue of representing time-travelling information in an evolutionary diagram; and it suggests that social scientists have not yet considered the consequences of the potential lack of time consistency in their data.

* Footnote: I suppose that I should be precise, and note that a modern gene bank does allow genetic information to time travel, as well.

References

Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Baroni M, Semple C, Steel M (2004) A framework for representing reticulate evolution. Annals of Combinatorics 8: 391–408.

Baroni M, Semple C, Steel M (2006) Hybrids in real time. Systematic Biology 55: 46–56.

Cavalli-Sforza LL, Feldman MW (1981) Cultural Transmission and Evolution. Princeton University Press, Princeton.

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169–184.

Dagg JL (2011) Exploring mouse trap history. Evoluton: Education and Outreach 4: 397–414.

Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.

Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Johnson J, Roberts TL, Verplank W, Smith DC, Irby C, Beard M, Mackey K (1989) The Xerox "Star": a retrospective. IEEE Computer 22: 11-29.

Moret BME, Nakhleh L, Warnow T, Linder CR, Tholse A, Padolina A, Sun J, Timme R (2004) Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 13–23.

Sang T, Zhong Y (2000) Testing hybridization hypotheses based on incongruent gene trees. Systematic Biology 49: 422–434.