Tuesday, March 21, 2017

Computer viruses and phylogenetic networks

I have written before about the Phylogenetics of computer viruses. This is an example of the use of phylogenetics as a metaphor for the history of non-biological objects. By analogy, computer viruses and other malware can be seen to be phylogenetically related, because new viruses are usually generated using existing malicious computer code — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the analogy is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Furthermore, the model of historical change in computer viruses is often the same as that for biological viruses — recombination rather than substitution. That is, like real viruses, new computer viruses are often created by recombining chunks of functional information from pre-existing viruses, rather than by an accumulation of small changes. Coherent subsets of the current computer code are combined to form the new programs.

From this perspective, it is unexpected that the principal phylogenetic model in the study of computer viruses has been a tree rather than a network — a recombinational history requires a network representation, not a tree, and thus malware evolution is not tree-like. As noted by Liu et al. (2016): "Although tree-based models are the mainstream direction, they are not suited to represent the reticulation events which have happened in malware generation."
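The difference between a tree and a network can be made concrete with a small sketch. The lineage below is entirely hypothetical (the names A to D are invented for illustration): program D is assumed to have been assembled from chunks of both B and C, so it has two parents, which no rooted tree can represent.

```python
# Hypothetical lineage: each key is a program, each value lists its parents.
parents = {
    "A": [],          # root ancestor
    "B": ["A"],       # derived from A by modification
    "C": ["A"],       # derived from A by modification
    "D": ["B", "C"],  # recombinant: code chunks taken from both B and C
}

def reticulate_nodes(parents):
    """Return the nodes with more than one parent (reticulation events)."""
    return sorted(node for node, ps in parents.items() if len(ps) > 1)

def is_tree(parents):
    """Check the tree condition: one root and no node with multiple parents.

    Assumes the input is acyclic and connected; this sketch does not verify that.
    """
    roots = [node for node, ps in parents.items() if not ps]
    return len(roots) == 1 and not reticulate_nodes(parents)

print(reticulate_nodes(parents))  # ['D']
print(is_tree(parents))           # False
```

Dropping either of D's parents would make the history a tree again, but only by discarding part of what actually happened.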

In my previous (2014) post, I noted only two known papers that used a network rather than a tree to represent malware evolution:
  • Goldberg et al. (1996) analyzed their data using what they call a phyloDAG, which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network; described in more detail in Phylogenetics of computer viruses);
  • Khoo & Lió (2011) used splits graphs rather than unrooted trees to display their data, although they did not specify the algorithm for producing their networks.
Unfortunately, malware researchers have continued to treat a phylogeny as simply a form of classification, and have therefore continued to produce tree-like phylogenies using some form of hierarchical agglomerative clustering algorithm (e.g. Bernardi et al. 2016).
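To see why agglomerative clustering always yields a tree, consider the minimal single-linkage sketch below. It is not taken from any of the cited papers; the function name, the sample names and the distances are all assumptions made for illustration. Each step merges exactly two clusters, so the result is a nested hierarchy whatever the data look like.

```python
from itertools import combinations

def single_linkage(names, dist):
    """Agglomerative clustering; returns nested tuples (a binary tree).

    dist maps frozenset({a, b}) to the distance between samples a and b.
    """
    # Each cluster is stored as (subtree, set of member names).
    clusters = [(n, frozenset([n])) for n in names]
    while len(clusters) > 1:
        def gap(pair):
            # Single linkage: distance between the closest members.
            (_, m1), (_, m2) = pair
            return min(dist[frozenset([a, b])] for a in m1 for b in m2)
        c1, c2 = min(combinations(clusters, 2), key=gap)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] | c2[1]))
    return clusters[0][0]

# Hypothetical distances: D is close to both B and C (a possible recombinant).
pairs = {("A", "B"): 4, ("A", "C"): 4, ("A", "D"): 5,
         ("B", "C"): 6, ("B", "D"): 2, ("C", "D"): 3}
dist = {frozenset(k): v for k, v in pairs.items()}

tree = single_linkage(["A", "B", "C", "D"], dist)
print(tree)  # ('A', ('C', ('B', 'D')))
```

In this toy output D is paired with B alone; the hierarchy has no way to record that D might descend from both B and C.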

More positively, however, some papers have appeared that have instead pursued the idea of using a network model rather than a tree:
  • Liu et al. (2016) provided median-joining networks, which are unrooted splits graphs, to display relationships within each of three different virus groups;
  • Jang et al. (2013) inferred a directed acyclic graph using a minimum spanning tree algorithm, with a post-processing step to allow nodes to have multiple parents;
  • Anderson et al. (2014) presented a novel algorithm based on a graphical lasso, which builds the phylogeny as an undirected graph, to which directionality is then added using a post-hoc heuristic;
  • Oyen et al. (2016) "present a novel Bayesian network discovery algorithm for learning a DAG [directed acyclic graph] via statistical inference of conditional dependencies from observed data with an informative prior on the partial ordering of variables. Our approach leverages the information on edge direction that a human can provide and the edge presence inference which data can provide."
It is important to note that only the works producing directed graphs can represent a phylogeny; the other works produce unrooted graphs that may or may not reflect phylogenetic history. The Bayesian work of Oyen et al. (2016) is particularly interesting:
"Directionality is inferred by the learning process, but in many cases it is difficult to infer, therefore prior information is included about the edge directions, either from human experts or a simple heuristic. This paper introduces a novel approach to combining human knowledge about the ordering of variables into a statistical learning algorithm for Bayesian structure discovery. The learning algorithm with our prior combines the complementary benefits of using statistical data to infer dependencies while leveraging human knowledge about the direction of dependencies."
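The general idea of adding direction to an undirected graph from prior knowledge about ordering can be sketched very simply. The code below is not the algorithm of Anderson et al. (2014) or Oyen et al. (2016); it is a deliberately naive heuristic, and the sample names, edges and first-seen dates are all hypothetical. Each undirected edge is pointed from the earlier sample to the later one, which guarantees an acyclic result and allows a node to have several parents.

```python
def orient_edges(undirected_edges, first_seen):
    """Orient each undirected edge from the earlier sample to the later one.

    Edges between samples with equal dates are dropped, since the ordering
    gives no basis for choosing a direction.
    """
    directed = []
    for a, b in undirected_edges:
        if first_seen[a] < first_seen[b]:
            directed.append((a, b))
        elif first_seen[b] < first_seen[a]:
            directed.append((b, a))
    return directed

def parents_of(directed):
    """Map each child to the list of its parents in the directed graph."""
    result = {}
    for parent, child in directed:
        result.setdefault(child, []).append(parent)
    return result

# Hypothetical undirected similarity edges and first-seen years.
edges = [("B", "A"), ("A", "C"), ("D", "B"), ("C", "D")]
first_seen = {"A": 2001, "B": 2003, "C": 2004, "D": 2006}

dag = orient_edges(edges, first_seen)
print(dag)                   # [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
print(parents_of(dag)["D"])  # ['B', 'C']
```

Because every retained edge runs strictly forward in time, no cycle can form, and D keeps both of its putative parents.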


Anderson B, Lane T, Hash C (2014) Malware phylogenetics based on the multiview graphical lasso. Lecture Notes in Computer Science 8819: 1-12.

Bernardi ML, Cimitile M, Mercaldo F (2016) Process mining meets malware evolution: a study of the behavior of malicious code. Proceedings of the 2016 Fourth International Symposium on Computing and Networking, pp 616-622. IEEE Computer Society, Washington, DC.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Jang J, Woo M, Brumley D (2013) Towards automatic software lineage inference. Proceedings of the Twenty-Second USENIX Conference on Security, pp 81-96. USENIX Association, Berkeley, CA.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society, Washington, DC.

Liu J, Wang Y, Wang Y (2016) Inferring phylogenetic networks of malware families from API sequences. Proceedings of the 2016 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp 14-17. IEEE Computer Society, Washington, DC.

Oyen D, Anderson B, Anderson-Cook C (2016) Bayesian networks with prior knowledge for malware phylogenetics. The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence: Artificial Intelligence for Cyber Security, Technical Report WS-16-03, pp 185-192. Association for the Advancement of Artificial Intelligence, Palo Alto, CA.
