The Genealogical World of Phylogenetic Networks: June 2014

Saturday, June 28, 2014

Phylogenetic network Millennium problems

It was 14 years ago that the Millennium started, but there are therefore still 986 years left to solve the following seven phylogenetic network Millennium problems. These are not necessarily the most important problems to solve from a biological point of view, but are challenging computational problems that have (at least) some biological relevance. The problems are all about phylogenetic networks, except for Problems 2 and 7 which are about the closely related topic of agreement forests. Solving these problems will not be rewarded with $1,000,000 but only with eternal fame.

In each of these problems, a phylogenetic network on X is a directed acyclic graph with a single root and no vertices that have only one incoming and only one outgoing arc, and in which each leaf is labelled by an element of X and each element of X labels one leaf.

Problem 1. Is the Hybridization Number problem fixed-parameter tractable (FPT) if the input is an unrestricted set of nonbinary trees and the only parameter is the hybridization number? Hybridization Number is the following problem. Given a finite set X, a collection T of rooted (possibly nonbinary) phylogenetic trees on X and a natural number k, decide if there exists a rooted phylogenetic network on X that displays all trees from T and has reticulation number at most k. See e.g. (van Iersel, Kelk, 2013) for more detailed definitions.

Problem 2. Does there exist a polynomial-time 2-approximation algorithm for MAF on two binary trees? Maximum Agreement Forest (MAF) on two binary trees can be defined as follows. Given a finite set X and two rooted binary phylogenetic trees on X, what is the minimum number number of components in a forest on X that can be obtained from each of the input trees by deleting vertices, deleting edges and suppressing indegree-1 outdegree-1 vertices? For a 2.5-approximation see (Shi, You, Feng, 2014).

Problem 3. Is there an FPT algorithm for finding a level-k phylogenetic network consistent with a given dense set of rooted triplets, if k is the parameter? A rooted triplet is a phylogenetic tree with three leaves. A set of rooted triplets is called dense if it contains at least one triplet for each combination of three leaves. A network is level-k if it can be turned into a tree by deleting at most k edges per biconnected component. This problem is known to be solvable in polynomial time if k is fixed, see (Habib and To 2012).

Problem 4. Is Tree Containment polynomial-time solvable or NP-hard for reticulation visible networks? Tree Containment is the problem of deciding if a given phylogenetic network displays a given tree. A phylogenetic network is called reticulation visible if for each reticulation there exists some leaf such that each path from the root to that leaf passes through that reticulation. Tree Containment is known to be NP-hard for general networks and for some restricted classes of networks; see (Kanj, Nakhleh, Than, Xia, 2008) and (van Iersel, Semple, Steel 2010).

Problem 5. Is there a constant-factor approximation algorithm for computing the softwired parsimony score of a binary tree-child network and a binary character? Given a network and a character state (0 or 1) for each leaf, the softwired parsimony score is the minimum number of state-changes in any tree (on all leaves) displayed by the network, over all possible assignments of states to the internal vertices. A phylogenetic network is called tree-child if each non-leaf vertex has at least one child that is not a reticulation. This problem does not have a constant factor approximation for general networks or for other (less severely) restricted classes of networks, unless P = NP (Fischer, van Iersel, Kelk, Scornavacca 2013).

Problem 6. Given k > 1, what is the maximum value of p such that for any set of rooted triplets there exists some level-k phylogenetic network on n leaves that is consistent with at least a fraction p of the input triplets? For k = 0 the maximum is p = 1/3 and for k = 1 it is roughly 0.48, see (Byrka, Gawrychowski, Huber, Kelk 2009).

Problem 7. Is there an O(c^n) algorithm for Maximum Acyclic Agreement Forest (MAAF) on two binary phylogenetic trees with c < 2? An acyclic agreement forest is an agreement forest (see above) for which the following directed graph D is acyclic. D has a vertex for each component of the forest and there is an arc from component A to component B if in at least one of the input trees there is a directed path from the root of A to the root of B. It is known that there exist an O*(2^n) algorithm for this problem (van Iersel, Kelk, Lekic, Stougie, 2013).

Wednesday, June 25, 2014

Non-phylogenetic trees

I recently published a post on Evolution and timelines, in which I pointed out that presenting historical data as a timeline is a very poor way of representing an evolutionary history. Evolutionary history is much better presented as a phylogeny, which will be either a tree or a network. However, this does not mean that all histories that are presented as a tree, for example, necessarily represent a phylogeny.

I have encountered a few examples of history-as-tree that seem to have very little connection to a phylogeny. That is, the relationships among the objects are presented along the branches of a tree, but the relationships along the branches seem to be little more than a timeline. So, the whole structure is simply a series of interconnected timelines.

Consider this first example, which is a poster purporting to show for the USA:

the evolution of jazz in its more than one hundred year history. From Archaic to Avant Garde, from blues to bebop, from radio to fusion, from spirituals to swing, from Armstrong to Zawinul, the jazz pedigree presents the diverse history and development of jazz in a clear way.

Perhaps it is the strong central trunk that gives it away as a non-phylogeny. The side-branches do group the jazz performers roughly by genre, but that is all they do. The actual title is a bit more accurate about the content — it is a "Story" rather than a phylogeny.

This poster is accompanied by a European counterpart with an even stronger central trunk. It is labeled as a "Community", but it still claims to "display the history and development of European jazz".

As another example, in 1946, the magazine P.M.published a tree by Ad Reinhardt with a sardonic view of modern American art. [Thanks to Joachim Dagg for alerting me to this example.]

At least there is no central trunk this time, but the clustering of artists along the branches seems to have less to do with phylogenetic history than with artistic genre (and satire). There was a follow-up example 15 years later, in which the sardonic humor plays much the strongest role in the relationships represented.

Finally, here is an example of a timeline that really should be represented using a phylogenetic tree. It is difficult to believe that the group of professions illustrated form a transformational series, as implied by the timeline that is actually shown. Most of the entrepreneur groups depicted actually still exist to this day, rather than being extinct, and so we have here a history of variational evolution, instead of a transformation.

Monday, June 23, 2014

Phylognetic trivia

Phylogenetics plays no part in games like Trivial Pursuit, but the web offers more opportunities. The Fun Trivia web site, for example, offers a page on Phylogenetics. You should try it, and see how well you do.

The answers (and explanations) are quite good, but the wording of some of the questions leaves a lot to be desired.

Wednesday, June 18, 2014

Blog posts and formal publications

Monday, June 16, 2014

Haeckel and the March of Progress

In a previous blog post (Tattoo Monday VIII), I noted that the usual "March of Progress" image that the general public associates with the concept of "evolution" is originally based on the frontispiece to Thomas Henry Huxley's book Evidence as to Man's Place in Nature (1863. Williams & Norgate, London). A century later, this image was expanded and updated in the book Early Man by the anthropologist Francis C. Howell (1965. Time-Life International, New York) — this picture, with labels, can be viewed here.

What is perhaps less well known is that Ernst Haeckel also made a contribution to this genre. Shown here are the frontispiece and title page of Haeckel’s Natürliche Schöpfungsgeschichte (1868. Verlag von Georg Reimer, Berlin), usually translated as "The History of Creation". This book was Haeckel's attempt to introduce the idea of evolution to the German-speaking general public, after his detailed specialist two-volume book Generelle Morphologie der Organismen (1866. Verlag von Georg Reimer, Berlin). This previous book was difficult to read, and was also full of invective against doubters and supposed opponents; so a more readable approach was needed (the original text itself was apparently derived from one of his student's notes taken during Haeckel's lectures!).

The frontispiece lithograph (by Gustav Müller) is labeled as "The family group of the Catarrhines". It was notoriously supposed to demonstrate (as explained on page 555 of the book) "the highly important fact" that the "lowest humans" stand "much nearer" to the "highest apes" than to the "highest human". The various images are labeled (from "highest" to "lowest"):

"Indo-German"
"Chinese"
"Fuegian"
"Australian Negro"
"African Negro"
"Tasmanian"
gorilla
chimpanzee
orangutang
gibbon
proboscis monkey
mandrill.

The book was a best seller, and remained in print until the 1920s. Fortunately, the frontispiece was quickly changed. For example, in the 4th edition (1873) the frontispiece was a collage of various calcareous sponges, and in the 8th edition (1889) it was a picture of Haeckel himself (as it also was for the 5th and all subsequent editions). The book actually went through 12 editions, with the number and composition of the figure plates changing several times, in addition to the changes to the frontispiece.

Wednesday, June 11, 2014

Evolution and timelines

Any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

A timeline simply represents the timing of certain events. These events are presumably related in some way, but they do not necessarily refer to the history of a set of objects, or even concepts, as we might expect for an evolutionary history. Here is classic example of a perfectly valid timeline that refers to a disparate set of objects / concepts.

Apparently we are expected to infer from this timeline that McDonald's attitude to providing the public with information about the nutritional value of their fast-food products has changed over the decades. But the idea that this changed attitude might involve some sort of evolutionary process is stretching an analogy a bit too far. The timeline certainly represents a journey, as claimed, but not an evolutionary one.

For most members of the general public, "evolution" is a story of the transformation of some object or idea through time, with each stage replacing the previous one. This is a simple story with a beginning, a middle and (possibly) an end. The story can usually be presented as a timeline, of course, with each stage of the transformation arranged in the correct time order. For a biologist, this is a transformation series, representing "transformational evolution", which follows the history of a single lineage through time (ie. a history chain).

There are plenty of examples of this use of a timeline to represent transformational evolution. For instance, consider corporate logos, such as those of these two well-known beverage manufacturers. Each new logo replaced the previous one, thus providing an analogy to evolution of a single object.

The word "evolution" as used here is not one that a biologist would use, but many other people would do so. Bank notes in the USA show a similar phenomenon — in this case, the people involved appear to get younger through time! [The same thing happens on the $100 bill, as well.]

We can even take the idea of transformational evolution and use it for prediction, as was done by Takeshi Fukuda in 2002:

However, biologists do not see the evolution of organisms in this way, at all. For them, evolution is a process of variation, with lots of new forms appearing and some old ones disappearing. So, rather than an ordered series of forms, each one replacing the previous one through time, biologists see an increasing diversity of forms that is counter-acted by loss of forms (ie. extinction). This is "variational evolution" rather than transformational evolution.

Variational evolution is usually represented using a phylogeny, which will be a network or a tree, depending on the particular history, rather than a timeline chain. A phylogeny shows the relationships among a wide variety of objects, many of which will exist (or have existed) at the same time. There may have been replacement of some objects by others, but in general it is the diversity of objects existing at the same time that is of principal interest.

The issue here is that a timeline is a poor way of representing variational evolution. A timeline enforces a linear ordering of relationships, solely because "time's arrow" has one direction only. But a linear temporal order cannot reflect the complex evolutionary relationships among the objects.

Consider this example from McDonald's in Canada. There is a clear timeline here but it does not refer to transformational evolution — instead, it refers to variational evolution. These breakfast items have not necessarily replaced each other, and thus their evolutionary relationships are more complex than can be represented by a timeline.

Indeed, many of these breakfast items are still on the menu today, including: Egg McMuffin, Scrambled Eggs, Hash Browns, Hot Cakes and Sausage, Sausage McMuffin, Sausage McMuffin with Egg, Breakfast Burritos (Sausage), Bagel (Bacon, Egg Cheese, Steak, Egg Cheese), and the Fruit 'N Yoghurt Parfait.

Here is another seemingly simple image from McDonald's but with the same complexity problem — it is variational not transformational.

And finally, here is a much more complex history from Apple computers:

A timeline shows the timing of certain events, which do not necessarily involve replacement. It might be a useful way to represent transformational evolution, but it is a poor way to represent variational evolution. A phylogeny is much more appropriate.

Monday, June 9, 2014

Network publications

Some of you may have noticed that the Who is Who in Phylogenetic Networks database is currently down while it is being moved to a new server. It will soon be back, but in the meantime here is a graph showing the number of publications in the database in its most recent iteration. The data are grouped according to their authors, with the top 20 most prolific authors separated on the left. The next 9 authors are included solely to get myself onto the graph.

There are no real surprises here; and there are plenty of other authors in the database with fewer publications to their name. The database data will be updated when the web site returns, in which case I might update this graph.

Wednesday, June 4, 2014

Phylogenetic networks as multivariate data displays

Over the past two years or so of blogging, I have presented a number of empirical examples in which I have used splits graphs as general multivariate data summaries, rather than using them for the analysis of what we might call strictly phylogenetic data. I have listed these analyses at the bottom of this post.

There have been two reasons for doing these analyses. First, I wish to emphasize that unrooted networks are a form of data display rather than being evolutionary diagrams. That is, they do not display evolutionary history, in the same manner as is intended for rooted phylogenetic trees, for example. Unrooted networks can be a valuable tool for exploring phylogenetic data, but they do not display a phylogeny. They are a form of exploratory data analysis.

Second, these networks form part of a much larger class of methods for the analysis of multivariate data. Indeed, I believe that they are a very valuable part of this class. One way to illustrate this has been to analyze a whole series of datasets that have little to do with phylogenetic analysis. That is, the data are not necessarily even related to a historical trend. This illustrates just what can be done with these methods.

I have now formalized this point of view in a peer-reviewed publication:

Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(4): in press (Online Early). doi:10.1002/widm.1130

Abstract:
Exploratory data analysis (EDA) involves both graphical displays and numerical summaries of data, intended to evaluate the characteristics of the data as well as providing a form of data mining. For multivariate data, the best-known visual summaries include discriminant analysis, ordination and clustering, particularly metric ordinations such as Principal Components Analysis. However, these techniques have limiting mathematical assumptions that are not always realistic. Recently, network techniques have been developed in the biological field of phylogenetics that address some of these limitations. They are now widely used in biology under the name phylogenetic networks, but they are actually of general applicability to any multivariate dataset. Phylogenetic networks are fast and relatively easy to calculate, which makes them ideal as a tool for EDA. This review provides an overview of the field, with particular reference to the use of what are called splits graphs. There are several types of splits graph, which summarize the multivariate data in different ways. Example analyses are presented based on the neighbor-net graph, which seems to be the most generally useful of the available algorithms. This should encourage the more widespread use of these networks whenever a summary of a multivariate dataset is required.

If you don't have subscription access to the journal, you can contact me for a PDF copy.

Blog posts with multivariate data summaries:

Datasets involving temporal patterns

Network analysis of Genesis 1:3
Network of ancient Thai bronze Buddha images
Language history and language weirdness
Pacific rock art - ordinations and networks
The network history of the Carnival of Evolution
The rise and fall of "David"

Datasets with no phylogenetic pattern

Eurovision Song Contest 2006: a network analysis
Network analysis of scotch whiskies
Network analysis of Bordeaux wine critics
Network analysis of Bordeaux wine critics II
A network analysis of Médoc wines
Eurovision Song Contest 2012: a network analysis
Phylogenetic network of the FIFA World Cup
Astrocladistics: a network analysis
Network analysis of McDonald's fast-food
Is there good and bad fast-food?
The mysterious rankings in Forbes' Celebrity 100
Network analysis of Michelin starred restaurants
Network analysis of New York neighborhoods
A network analysis of Simon and Garfunkel
Network analysis of Manhattan apartment buildings
A network analysis of the Bundesliga
Networks of the "Sight & Sound" film polls
A network analysis of London's theatres in 1965
The acoustics of the Sydney Opera House
A network of New Zealand's livestock regions
A network analysis of airplane disasters
World ice hockey champions — a network
Fast-food maps — a network analysis
Single-malt scotch whiskies — a network
Which cars are good, really?
The Netherlands is more than just tulips and sea-dykes
Automated natural language processing
Cancer rates and diagnosis

Theoretical considerations

Distortions and artifacts in Principal Components Analysis analysis of genome data
Networks can outperform PCA ordinations in phylogenetic analysis
Multivariate data displays are not always necessary

Monday, June 2, 2014

Cancer rates and diagnosis

Most people don't want to think about cancer, but everyone over the age of 40 should do so, and do so regularly. This is because early diagnosis dramatically increases your chance of survival, and early diagnosis is almost entirely up to you — you will be the first person to notice the symptoms. To put it bluntly, you can ignore the first signs of cancer and thus live for another five years or so, or you can got to a doctor and thus live for another twenty. The choice is yours, not Fate's.

Cancer is a disease of old age. Back in the dim dark past, people usually lived only until about 40 years of age, and so cancer was not a big problem. It is doubtful that it was a major cause of death amongst humans. But as we slowly but surely have increased our life expectancy, cancer has become more and more of an issue. For example, for people in the USA in 2010, cancer was the No. 1 cause of death for people aged between 40 and 80, as shown in this table. Indeed, you will note that for females cancer was in one of the top two spots for all age groups.

The incidence of cancer varies dramatically among different organs, and this variation has itself changed over time. In the USA, lung cancer has been the biggest cause of cancer deaths for males since the 1950s and for females since the 1980s, as shown in the next graph. In both cases the death rate has decreased since the recent active attempts to reduce the smoking epidemic. [Note: medically it is considered to be an epidemic, just as obesity is currently considered an epidemic in the Western world.]

The second biggest cause of death for males is prostate cancer, and for females it is breast cancer, followed in both cases by cancer of the colon and/or rectum. In all of these cases the death rate is decreasing through time, ostensibly due to changes in risk factors but most importantly the introduction of screening.

Screening is particularly important for cancer of the reproductive organs. There is only one "major" cancer-related organ for males (the prostate), but for females there are three: the breasts, the uterus and the ovaries. As you can see in the next graph, the stage at which the cancer is first diagnosed varies quite a lot among these organs.

To explain the stages: (i) localized means that the cancer is confined to one part of the organ concerned; (ii) regional means that the cancer has spread throughout the organ; and (iii) distant means that it has spread from there to other nearby organs. For effective treatment, and therefore maximum probability of survival, the cancer needs to be detected before it has reached the third stage. You will note that this is particularly problematic for ovarian cancer, which is usually not diagnosed until this stage.

Interestingly, death rates due to cancer are not randomly distributed in space, as shown in the next graph for the states of the USA. The data analyzed were for death rate (per 100,000 people) for 2006-2010 for the most common types of cancer (breast, colorectum, lung, non-hodgkin lymphoma, pancreas, prostate). I used the manhattan distance to evaluate the multivariate relationships in the data, and displayed this using a NeighborNet.

The graph shows the relationships among the different states. States near each other in the network have similar death rates for the different cancers, while states further apart are progressively more different from each other.

The states mostly form a gradient of increasing cancer death rates from the top-left towards the bottom of the network. Utah stands out because it has much the lowest death rates for colorectum, lung, and pancreas cancers amongst both men and women. DC stands out because it has nearly twice the rate of prostate cancer deaths than the other locations, presumably due to the distinctly older-male biased population of Washington city.

Finally, we can look at some international data. These data are solely for ovarian cancer, involving the 1-year survival rate after diagnosis. The data refer to survival in three different age classes of women (15-49, 50-69, 70-99 years old) for the three different stages at diagnosis. They were analyzed in the same way as above to produce the network.

On average, the Canadian survival is the highest, followed by the Norwegians, with the females from NSW (in Australia) faring the worst (particularly in the oldest age group). However, these data are rather mixed. For example, the Danish survival is worse than average for the oldest age group in the localised stage but better than average in the regional stage.

There is clearly a long way to go in the diagnosis and treatment of cancers.

Sources of data

Siegel R, Ma J, Zou Z, Jemal A (2014) Cancer Statistics, 2014. CA: A Cancer Journal for Clinicians 64: 9-29.

Maringe C, Walters S, Butler J, Coleman MP, Hacker N, Hanna L, Mosgaard BJ, Nordin A, Rosen B, Engholm G, Gjerstorff ML, Hatcher J, Johannesen TB, McGahan CE, Meechan D, Middleton R, Tracey E, Turner D, Richards MA, Rachet B, ICBP Module 1 Working Group (2012) Stage at diagnosis and ovarian cancer survival: evidence from the International Cancer Benchmarking Partnership. Gynecologic Oncology 127: 75-82.

[Declaration of interest: I have had skin cancer for nearly 30 years, which is a product of growing up in Australia, the country with the highest rate of skin cancer in the world; and recently three female members of my family have been diagnosed with cancer of their reproductive organs. So, I'm thinking about it even if you haven't been.]