The Genealogical World of Phylogenetic Networks: August 2013

Wednesday, August 28, 2013

Why are there conflicting placental roots?

Last week I noted that there has been recent activity concerning the "placental root" problem, in which different genetic datasets support different phylogenetic trees for the root of the placental mammal clade (Conflicting placental roots: network or tree?). There are two articles (by Morgan et al. and Romiguier et al.) in the current issue of Molecular Biology & Evolution that address this problem with genomic data, and find two different well-supported trees.

This is an issue that I also addressed in a much earlier post (EDA or post-optimality analysis of phylogenetic data?), based on the genomic dataset of Meredith et al., in which I concluded:

It is not immediately obvious that a tree-building analysis is going to be of much use for this dataset. There is certainly some "power of building phylogenies from large densely sampled datasets", but this does not automatically mean that those phylogenies will be tree-like. Evolution involves a more diverse process than that.

In all of these cases, sophisticated substitution models (nucleotide or amino acid) were used as the basis for building a phylogenetic tree, whereas the network analysis of Hallström & Janke suggests that mammalian evolution may not be strictly bifurcating.

My interest in this blog post is in investigating the relative roles of the data and the substitution models in producing the phylogenetic trees. I use splits graphs of the recent data (using the SplitsTree program) as an exploratory data analysis, to visualize the signals in the datasets and which trees they might support under different circumstances.

The analyses

Any phylogenetic analysis depends on the quality of the data, in terms of the sampling of both taxa and characters. Both Morgan et al. and Romiguier et al. used the protein-coding sequences for most of the 40 currently available mammalian genomes.

However, it is worth noting at the outset that the sampling of the root taxa is rather poor. The root involves the relative relationships of the Xenarthra and the Afrotheria, and yet there are only two sampled Xenarthra species and three sampled Afrotheria (the remaining taxa are split between the Laurasiatheria and Euarchontoglires). Perhaps we are asking too much in expecting these data to resolve the root at all.

We can start the investigation with the data of Morgan et al., based on the concatenated amino acid sequences. The first NeighborNet analysis uses the simplest model possible, the Hamming distance (which is simply the number of alignment differences between the taxa). I have colour-coded the four taxonomic groups, for convenience.
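As a rough illustration of what the Hamming distance measures, here is a minimal Python sketch. The taxon names and sequence fragments are invented for the example, and the treatment of gaps (counted like any other character) is my assumption; it is not necessarily what SplitsTree does internally.

```python
def hamming_distance(seq_a, seq_b, normalise=False):
    """Count the alignment columns at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    if normalise:
        # Proportion of differing columns; an empty alignment gives 0.0.
        return differences / len(seq_a) if seq_a else 0.0
    return differences


# Toy aligned amino-acid fragments (invented for illustration only).
alignment = {
    "taxon_A": "MKTAYIAKQR",
    "taxon_B": "MKTAYLAKQR",
    "taxon_C": "MRTSYLAKHR",
}

# Pairwise distance table, of the kind a NeighborNet analysis starts from.
names = sorted(alignment)
distances = {
    (x, y): hamming_distance(alignment[x], alignment[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

for pair, d in distances.items():
    print(pair, d)
```

No substitution model is involved here: every difference counts once, regardless of which amino acids are exchanged.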

Note that all four taxonomic groups appear to be monophyletic (ie. they are each supported by a unique split), as also is the Xenarthra+Afrotheria group. However, the raw data attach the outgroup to the placental group away from both the Xenarthra and the Afrotheria. Indeed, the data suggest that the Insectivora (Sorex+Erinaceus) are candidates as the sister to the rest of the placentals.

The effect of the substitution model on the data analysis can be evaluated by including a more sophisticated genetic distance. I have chosen the JTT amino-acid model, with the inclusion of a proportion of invariant sites (estimated by SplitsTree to be 30%). The corresponding NeighborNet is shown in the second graph.

This network attaches the outgroup near the "expected" taxa (Xenarthra, Afrotheria), although the location of Sorex is rather problematic. However, the split supporting the group Xenarthra+Afrotheria as the sister to the rest of the placentals is still very small, being ranked only 28th of the 82 non-trivial splits that involve at least one placental species. So, even this simple model does not provide strong support for the root location. However, it seems obvious that the root location is being determined as much by the substitution model as by the data, suggesting that the data cannot provide convincing evidence alone.

We can now proceed to study the data of Romiguier et al., based on the maximum-likelihood gene trees (GTR+GAMMA model) from the 560 genes, rather than the original alignment data. Here I have used a Consensus Network that displays all of those splits occurring in at least 24% of the trees. This percentage is the smallest that produces only a single reticulation in the network.

So, the most ambiguous part of the set of trees (ie. where there is most conflict among the trees) turns out to be where the outgroup attaches to the placental group. This is hardly surprising. What is more interesting is that the split support for each of the three alternative attachment points is very similar:
Outgroup+Xenarthra 0.00566
Xenarthra+Afrotheria 0.00557
Outgroup+Afrotheria 0.00496

So, the gene-tree data do not favour any one of the three alternative placental roots.
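The thresholding used for the Consensus Network can be sketched in a few lines of Python. Two splits conflict (and so force a reticulation) when all four intersections of their sides are non-empty. The four-taxon set and the frequencies below are invented for illustration; they are not the values from the Romiguier et al. trees, and "Rest" is a hypothetical label standing for all the remaining placentals.

```python
from itertools import combinations


def compatible(side_a, side_b, taxa):
    """Two splits are compatible if at least one of the four intersections is empty."""
    a, b = frozenset(side_a), frozenset(side_b)
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


def consensus_splits(split_freqs, threshold):
    """Keep the splits found in at least `threshold` (a proportion) of the trees."""
    return {s: f for s, f in split_freqs.items() if f >= threshold}


def count_conflicts(splits, taxa):
    """Number of pairwise-incompatible splits among those retained."""
    return sum(1 for s, t in combinations(splits, 2) if not compatible(s, t, taxa))


taxa = frozenset({"Outgroup", "Xenarthra", "Afrotheria", "Rest"})

# Invented proportions of gene trees containing each split (illustration only).
split_freqs = {
    frozenset({"Outgroup", "Xenarthra"}): 0.40,
    frozenset({"Xenarthra", "Afrotheria"}): 0.35,
    frozenset({"Outgroup", "Afrotheria"}): 0.25,
}

for threshold in (0.5, 0.3, 0.2):
    kept = consensus_splits(split_freqs, threshold)
    print(threshold, len(kept), count_conflicts(kept, taxa))
```

Lowering the threshold admits more splits, and each new conflicting pair adds to the reticulation in the network; this is why the threshold can be tuned until only a single area of conflict is displayed.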

Conclusion

It is clear from these exploratory analyses that the genomic data do not, on their own, provide conclusive evidence regarding the root of the placental clade. The approach of Morgan et al. and Romiguier et al. has been to use a tree model based on sophisticated substitution models, thus arriving at conclusions that depend as much on their models as on the data. They used different models and got different trees, based on roughly the same data.

This is one approach to phylogenetics, to use more sophisticated models; but an alternative is to recognize that evolution itself is sophisticated, and therefore does not necessarily produce a dichotomous tree. In this case, it seems more likely that the conflicting signals at the placental root reflect non-tree-like processes (such as hybridization), so that tree-based analyses are inappropriate, no matter how fancy the models are.

References

Hallström, Janke (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology & Evolution 27: 2804-2816.

Meredith et al. (2011) Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 334(6055): 521-524.

Morgan et al. (2013) Heterogeneous models place the root of the placental mammal phylogeny. Molecular Biology & Evolution 30: 2145-2156.

Romiguier et al. (2013) Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Molecular Biology & Evolution 30: 2134-2144.

Monday, August 26, 2013

The acoustics of the Sydney Opera House

Growing up in Sydney in the 1960s (as I did) meant watching the politically ludicrous construction of the Sydney Opera House, a building now celebrating its 40th birthday. Indeed, in the late 1970s I went to a talk by Peter Hall, who had eventually taken over the design of the building interior, and he said that the first job he was given was to decide exactly what shade of white the exterior tiles should be!

As was noted at the time of its completion: "It was a brilliant conception, but fatally flawed." In fact, it had, and still has, many flaws; but the one being referred to is the acoustics. Recently, in a ranking by Limelight magazine of 20 performance halls for classical music in Australia, ranked by the professionals who have to play in them repeatedly, the Concert Hall came 18th and the Opera Theatre came 20th (ie. dead last). [Note: The Sydney Opera House also has a Drama Theatre, and some other performance spaces, as well as the two large halls.]

For a catalog of some of the comments about the acoustics by performers, see Darryn King's article: This is not an opera house: beautiful on the outside — the tragedy of Bennelong Point.

Here, I use a network to explore the problems with these acoustics.

The Opera House

Construction started in 1959 and was scheduled to be completed in 1963 at a cost of $A7 million. The building was finally opened in 1973 at a final cost of $A102 million. The New South Wales state government even had to institute a governmental lottery (the Opera House Lottery) to pay for it. I guess that this makes it as much a monument to successful gambling as to music.

The Opera House has since become iconic, of course, being one of the most recognized buildings in the world. It is now a UNESCO World Heritage Site, unlike the tram sheds that previously occupied the site (the Fort Macquarie Tram Depot). It has four resident companies: Opera Australia, the Australian Ballet, the Sydney Theatre Company, and the Sydney Symphony Orchestra. Visitors flock to the site, given its spectacular harbour-side location; and Sydney would not be Sydney without it.

However, the complex has been plagued by problems as a performance venue. Indeed, the Sydney Theatre Company argued against using the Drama Theatre; and they eventually got their own theatre in a converted warehouse on the other side of the Sydney Harbour Bridge. The Opera Theatre has very cramped space for the orchestra. Moreover, the opera company cannot use sets from touring productions, because there are no wings to move scenery on and off the stage, necessitating one-off designs. Furthermore, many of the seats in the Concert Hall have very poor views.

But what is worst is that the acoustics in the two major halls are awful, and always have been.

In the Limelight ranking mentioned above, the Perth Concert Hall was the clear winner. It was built at the same time as the Sydney Opera House for a cost of $A3.2 million (the original budget was $A3.1 million). Sadly, the architectural style is what is commonly called Brutalist, which gives you some idea of what it looks like (the Sydney one is Expressionist). The important point, however, is that the Perth hall itself is shoe-box shaped, like almost all the great concert halls of the world, and unlike the Sydney ones.

Most of the Sydney problems stemmed from the original design, which was not a set of detailed plans as specified by the selection committee, but simply a series of conceptual drawings by the Danish architect Jørn Utzon. These drawings did not even match the original specifications. For example, the original brief specified a large hall for 3,000 people and a small hall for 1,200, but Utzon's acoustic plan had only 2,800 seats in the large hall, and even this was an over-estimate of how many could be fitted into his plan. Moreover, the building did not actually fit onto the specified site (a narrow peninsula sticking out into the harbour). Needless to say, the committee originally rejected the design out of hand; but they were over-ruled by outside circumstances (created principally by Eero Saarinen, a Finnish-born architect, who was at the same time involved in the design of the Trans World Flight Center at New York's John F. Kennedy International Airport, which not co-incidentally looks like a flattened version of the Opera House).

So, Utzon was given the job, and he turned out to be a very dogmatic person to work with. Indeed, he was eventually forced to resign (in 1966), but not before he had created many headaches for everyone else concerned. Ove Arup (the structural engineer) seems to have been the hero in dealing with Utzon's outrageous architectural demands. Peter Murray, in his history of the saga, notes:

Following Utzon's resignation, the acoustic advisor, Lothar Cremer, confirmed to the Sydney Opera House Executive Committee that Utzon's original acoustic design only allowed for 2,000 seats in the main hall and further stated that increasing the number of seats to 3,000 as specified in the brief would be disastrous for the acoustics.

Initially (1973), the poor acoustics were dealt with by suspending acoustic clouds (or donuts) from the ceiling, as shown in the photo above, apparently with small success. This was mainly so the performers could actually hear each other playing. Still, the conductor of the Sydney Symphony Orchestra, Edo de Waart, has since claimed that the Concert Hall has all the acoustics of a car park. [Note: Another Limelight magazine survey ranked the SSO as the clear winner among Australia's six state symphony orchestras, which makes its home in the SOH doubly ironic.]

So, what did the people of Sydney get for their money? A building that was 10 years late, cost 14 times as much as budgeted, had a capacity only 90% of what was asked, and with the worst acoustics in the country. I guess they are lucky that most people think it looks better than the tram-sheds ("it should be seen and not heard").

The acoustics were re-designed in 2009 (see Kirkegaard et al. 2010; Taylor & Claringbold 2010), but it is difficult to make major changes to a heritage building, even though it is being refurbished after 40 years, at an estimated cost of $A1,100 million.

The Fort Macquarie Tram Depot

The acoustics

It is therefore instructive to look at Utzon's original design presentation from March 1958, known as the Red Book, and the comments about the acoustics therein. Each design consultant (one each for structures, acoustics, mechanical services, electrical engineering, theatre techniques) prepared their own section of the book. The acoustics section was prepared by Vilhelm Lassen Jordan.

Of particular interest is the section entitled "Some Examples of Existing Large Halls and their Acoustic Data", which compares several characteristics (Volume, Number of seats, Volume per seat, Reverberation time empty, Reverberation time with audience) for seven existing concert halls, plus the proposed Sydney concert hall. As Jordan noted: "Satisfactory acoustics are based on a number of factors: reverberation time, sound distribution, sound diffusion and the overall dimensions." Of these, reverberation time was a key factor in the acoustics problems created by trying to fit the required number of seats into the hall as originally designed by Utzon.

I have analyzed the Red Book acoustic data using a phylogenetic network as a tool for exploratory data analysis. The analysis follows the same procedure as that for A network analysis of London's theatres. So, concert halls that are closely connected in the network are similar to each other based on their acoustic characteristics, and those that are further apart are progressively more different from each other.

The network clearly shows the claim being made in the 1958 Red Book — the new Concert Hall will be similar to the Concertgebouw, in Amsterdam, which is still rated as one of the top five halls in the world (see the ranking by Beranek 2003, 2004).

What a load of nonsense! In Leo Beranek's ranking of concert halls and opera houses throughout the world, the Sydney Opera House Concert Hall is ranked 53rd out of his 58 listed halls, whereas the Concertgebouw is ranked 5th. Something really went wrong, somewhere. Of the other halls shown in the network, the Göteborg Konserthus is ranked in a collection of equal halls at 21-40th, Usher Hall is 43rd, and Royal Festival Hall is 46th. (St Andrews Hall burnt down in 1962, and the two Danish halls are not listed by Beranek.) These rankings seem to be quite consistent with their relative locations in the network, except for the Sydney hall.

However, things are not always what they seem. It turns out that acoustic data do not always reflect the musical quality of a concert hall, as perceived by human ears. Beranek's ranking is based on the judgement of professional performers and music lovers, and we can quantitatively compare this judgement to various acoustic characteristics of most of the 58 ranked halls. Acoustical consultant Magne Skålevik has provided data for ten acoustic variables for 52 of the halls; and I have performed a network analysis of these data as well.

So, the network is based on the measured acoustic characteristics; and I have highlighted in red the top 11 ranked halls as rated by the listeners. Most of these top halls have similar acoustic qualities (ie. they cluster together in the network), with the Großer Musikvereinsaal, in Vienna, as the top ranker. Of these halls, eight were built before 1908 (the Großer Musikvereinsaal was built in 1870), which shows you how little we have learned recently about designing concert halls.

What is most interesting, however, is that the network shows that there are eight other halls with similar acoustic qualities to the top-ranked halls but with much lower rankings. These include the Tokyo Suntory Hall (ranked 17), Meyerhoff Symphony Hall (20), De Doelen Concertgebouw, Leipzig Gewandhaus, Kyoto Concert Hall, Tokyo Metropolitan Art Space, Christchurch Town Hall (all 21-40), and the Sydney Opera House Concert Hall (53). Apparently, there is more that meets the ear than can be measured by acoustic instruments.

I was also surprised to note that 7 of the bottom 15 ranked halls are in the U.K., although the only other two listed U.K. halls are ranked in the top dozen (St David's Hall, Cardiff, 10th; Colston Hall, Bristol, 12th).

What is also notable is that the reverberation time of almost all of the top-ranked halls is shorter than that of the Sydney Opera House Concert Hall (at 2.2 seconds). As Beranek (2004) notes: "For halls that generally feature standard orchestral repertoires ... the mid-frequency reverberation times should ... optimally [be] between 1.8 and 2.1 sec with the hall fully occupied." The Concert Hall thus reverberates for 0.2-0.3 seconds longer than the top-ranked halls, and this seems to be the ultimate source of its acoustic failure.


Conclusion

To quote Beranek (2004):

Architects design for clients, and either may have specific goals in mind. ... The architect may wish to build a monument that the public will travel far to see and that will win international awards. Either through lack of knowledge or interest, architects and owners may fail to build for, arguably, the most important feature of a hall for musical performance: how the acoustics of such a creation will or should sound.

Utzon built an architectural masterpiece, not a space for music performances.

Failure to consider the external design of the building, however, can also be fatal. As noted with regard to another opera house designed by a Dane (Henning Larsen's Greatest Building was also his Greatest Failure):

Nowhere is Larsen's power to change a city's skyline on better display than in Copenhagen, where his Opera House dominates the waterfront, the undisputed icon of the harbor's transformation from a naval-industrial base to a cultural center. At half a million square feet and 14 stories, it is one of the city’s largest buildings ... It was Henning Larsen’s signature achievement, and, he later wrote, "my greatest failure." He thought it looked like a toaster.

Finally, it is worth noting that at the time of its construction the Sydney Opera House was architecturally unique (even given the TWA Flight Center, referred to above), but this is no longer so. In 1986, the Lotus Temple was opened in New Delhi, India, which has rather obvious similarities to the SOH, although it is much smaller (and has pools instead of a harbour). It was designed (starting in 1976) by Iranian-born architect Fariborz Sahba, who used three rings of nine shells each around a nine-sided dome to imitate a lotus flower (rather than imitating the boat sails that inspired Utzon). It is apparently more visited even than the Taj Mahal (3.5 million per year versus 3 million). I have no idea about its acoustics.

References

Leo L. Beranek (2003) Subjective rank-orderings and acoustical measurements for fifty-eight concert halls. Acta Acustica 89: 494-508.

Leo L. Beranek (2004) Concert Halls and Opera Houses: Music, Acoustics, and Architecture, 2nd edition. Springer-Verlag, New York.

R. Lawrence Kirkegaard, Timothy E. Gulsrud, Shimby McCreery (2010) Acoustics of the Sydney Opera House Concert Hall, Part Two: The acoustician's perspective. Proceedings of 20th International Congress on Acoustics, ICA 2010, 23-27 August 2010, Sydney, Australia.

Peter Murray (2004) The Saga of Sydney Opera House: The Dramatic Story of the Design and Construction of the Icon of Modern Australia. Spon Press, London.

Lisa Taylor, David Claringbold (2010) Acoustics of the Sydney Opera House Concert Hall, Part One: The client's perspective. Proceedings of 20th International Congress on Acoustics, ICA 2010, 23-27 August 2010, Sydney, Australia.

Wednesday, August 21, 2013

Conflicting placental roots: network or tree?

In this blog we champion networks as a fundamental model for phylogenetics. Networks are more general than trees, in the sense that some networks are more tree-like than are others. However, I have noted before that the current trend in phylogenetics seems to be to try to use more and more complex trees as the phylogenetic model, rather than embracing networks as a more flexible model (Resistance to network thinking).

An interesting example of this trend is in the current issue of Molecular Biology & Evolution. There are two articles that investigate the root of the placental clade, by Morgan et al. and Romiguier et al., along with an editorial commentary by Teeling & Hedges.

The "placental root" problem has been difficult to resolve as a bifurcating process because different genetic datasets support different trees. As noted by Teeling & Hedges: "Untangling the root of the evolutionary tree of placental mammals has been nearly an impossible task. The good news is that only three possibilities are seriously considered ... Now, two groups of researchers have scrutinized the largest available genomic data sets bearing on the question and have come to opposite conclusions". The three alternative tree histories for the clade root are shown in the figure.

Both of the new empirical studies are based on the protein-coding sequences for most of the 40 currently available mammalian genomes. Morgan et al. use heterogeneous substitution models to account for tree and dataset heterogeneity, and get strong support for option (c). Romiguier et al. divide their dataset into GC-rich and AT-rich genes, conclude that the GC-rich genes are most likely to suffer from long-branch attraction, and get strong support from the AT-rich genes for option (a).

Teeling & Hedges continue: "Needless to say, more research is needed." No! Previous genome-scale analyses of more than one million amino acid sites from orthologous protein-coding genes have not rejected any of the three alternatives, despite the statistical estimate that 20,000 amino acid sites should be sufficient to resolve the question at this level of divergence given the tree structure, branch lengths, and number of substitutions (Hallström & Janke 2010). Doesn't this mean that we have enough evidence already?

Clearly, the conflicting results should lead the reader to at least consider the idea that something might be wrong with the underlying tree model itself. Both of these new analyses are still based on tree models, no matter how sophisticated those models might be (see also the several other papers cited by Teeling & Hedges), and no matter how much data are involved.

An alternative perspective is provided by Hallström & Janke (2010): "Mammalian evolution may not be strictly bifurcating". Their network analysis of retroposon insertion data supports an alternative hypothesis for the history of placentals: the early divergences involved incomplete lineage sorting and hybridization. Neither of these two evolutionary processes is accounted for in the tree models of Morgan et al. and Romiguier et al., but both can be integral parts of a network model.

Conclusion

I think that we can see the suggested move from trees to networks as a form of Kuhnian paradigm shift. In Kuhn's historical model, during the period of "normal science" the failure of results to conform to the current paradigm is not seen as refuting the paradigm, but instead is seen as resulting from errors by researchers (e.g. use of inadequate models, acquisition of unreliable data). However, in the Kuhn model, as anomalous results accumulate a new paradigm emerges that subsumes the old results along with the anomalous results, forming a single new framework or paradigm.

Non-tree-like phylogenetic results are currently not seen by most phylogeneticists as refuting the paradigm of a phylogenetic tree, but instead as the result of inadequate phylogenetic tree-models and/or insufficient data (as exemplified by Salichos and Rokas 2013). Nevertheless, these results can also be seen as refuting that paradigm. In that case, a shift to network thinking would embrace all of the tree results as well as the non-tree ones, and would thus form a viable new paradigm.

We should not really call this a Kuhnian "revolution", of course, since tree-thinking and network-thinking are not incompatible, but rather the one is an extension of the other.

Note: There is a follow-up post — Why are there conflicting placental roots?

References

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology & Evolution 27: 2804-2816.

Morgan CC, Foster PG, Webb AE, Pisani D, McInerney JO, O’Connell MJ (2013) Heterogeneous models place the root of the placental mammal phylogeny. Molecular Biology & Evolution 30: 2145-2156.

Romiguier J, Ranwez V, Delsuc F, Galtier N, Douzery EJP (2013) Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Molecular Biology & Evolution 30: 2134-2144.

Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327-331.

Teeling EC, Hedges SB (2013) Making the impossible possible: rooting the tree of placental mammals. Molecular Biology & Evolution 30: 1999-2000.

Monday, August 19, 2013

World Heritage proposal in systematics

Since most systematists do not know it, I thought that I should mention that there is actually a suggested World Heritage initiative called "The Rise of Systematic Biology". I mention this now because there is currently a public display about it here in Uppsala (Sweden).

Technically, this is what the UNESCO World Heritage Centre calls a "tentative proposal". It is being co-ordinated by the Uppsala County Administrative Board [Länsstyrelsen i Uppsala län] but involves a cultural landscape nomination of 12 sites in 8 countries. The "tentative" nature of the proposal reflects the fact that the four Swedish landscapes were agreed by the Swedish National Heritage Board [Riksantikvarieämbetet] back on 15 June 2009, and they submitted their part of the proposal to UNESCO on 2 December 2009. However, the proposal will not be complete until the other countries have formally agreed to nominate their sites.

Talks are currently underway with the relevant authorities in the proposed partner countries. The final nomination will be submitted to UNESCO by the Swedish government at the earliest in February 2015. In that case, a UNESCO decision should be made by the middle of 2016.

The proposal centres around Carl von Linné and his students. In Sweden, the cultural landscapes include his birthplace (Råshult), his Uppsala University garden (Linnéträdgården) and house (Linnémuseet), his personal home (Hammarby), and some of the areas around Uppsala where he conducted his botanical excursions (Herbationes Upsalienses). Elsewhere, they include places associated with his studies (Hortus Botanicus, Leiden, in the Netherlands), and the travels of his students (in Australia, France, Japan, South Africa, United Kingdom, United States of America).

The current status of the proposal can be viewed at the UNESCO World Heritage Centre site, and a PDF English-language summary of the proposal (listing the other cultural sites) is available from the Uppsala County Administrative Board site.

PS. After my recent visit to Råshult, I have now visited all of the suggested sites in Sweden, Australia and the Netherlands.

Wednesday, August 14, 2013

How to construct a consensus network from the output of a bayesian tree analysis

In an earlier blog post I argued that We should present bayesian phylogenetic analyses using networks. The rationale for this is that a bayesian analysis is concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. In phylogenetics based on Markov Chain Monte Carlo (MCMC) methods, which produce a set of trees sampled in proportion to their posterior probabilities, the tree topologies can thus be summarized using a consensus network. This should be more in keeping with the bayesian philosophy than is producing a single tree, the so-called MAP tree. The MAP tree is based on combining those taxon partitions with the greatest frequency in the MCMC sample, so that the probability distribution is reduced down to a single tree with posterior probability values on the branches. On the other hand, a network produced from all of the partitions that appear in the MCMC sample, weighted according to their frequency, would be much closer to the bayesian aim. [Note: For a clarification of this point, see Leonardo de Oliveira Martins' comment at the end of this post.]
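To make the idea of frequency-weighted partitions concrete, here is a minimal Python sketch that tallies the non-trivial bipartitions in a sample of trees. It assumes that the trees are already available as nested tuples of taxon numbers (a toy representation of my own; real treefiles would first need a Newick parser) and that the burn-in trees have already been discarded. It is not how any of the bayesian programs does the calculation.

```python
from collections import Counter


def clades(tree):
    """Return the leaf set of `tree` and the leaf sets of all its internal nodes."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side lacking the lowest-numbered taxon."""
    all_taxa, found = clades(tree)
    reference = min(all_taxa)
    splits = set()
    for clade in found:
        side = clade if reference not in clade else all_taxa - clade
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
    return splits


def split_frequencies(trees):
    """Proportion of the sampled trees in which each bipartition occurs."""
    counts = Counter(split for tree in trees for split in bipartitions(tree))
    return {split: n / len(trees) for split, n in counts.items()}


# Invented sample of four trees on five taxa (illustration only).
sample = [
    ((1, 2), (3, 4), 5),
    ((1, 2), (3, 4), 5),
    ((1, 2), (3, 5), 4),
    ((1, 3), (2, 4), 5),
]

freqs = split_frequencies(sample)
for split, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(sorted(split), freq)
```

A consensus network uses the whole of this table (or all entries above some threshold), whereas a single summary tree keeps only a mutually compatible subset of it.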

The practical issue with trying to do this is that at the moment it is not straightforward to get the consensus network from the output of any of the bayesian computer programs. These programs usually produce a file containing all of the sampled trees (from which the burn-in trees can be deleted). The simplest way to get the consensus network would be to use the SplitsTree or Spectronet programs to produce the network directly from this treefile. This can be done for files with a small number of trees; but no-one recommends doing bayesian analysis with a small number of MCMC-sampled trees. When the treefile contains tens of thousands of trees this is pushing the limit of SplitsTree and Spectronet, and they crash.

An alternative is to use the smaller "trprobs" file that is provided by, for example, MrBayes, which contains only the unique trees along with a weight indicating their relative frequency. Unfortunately, SplitsTree and SpectroNet do not currently read tree weights in treefiles. So, Holland et al. (2005, 2006) produced a Python script to create the required input files, which can then be input to SplitsTree or SpectroNet. A copy of this script is provided here.

However, this approach is still limited, and so it is not the approach that I used in the example analysis provided in my previous blog post. The MrBayes program, for example, also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than from the trees, and so this information can also be used instead of the treefile. This can be provided to SplitsTree in a nexus-formatted Splits block (derived from the bipartitions) rather than in a Trees block (derived from the treefile).

There are two practical problems with this approach. First, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions, in order to decide how many of the bipartitions should be included in the network. This makes the process tedious. Second, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. I will try to explain this process here.
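The tedious part (building one Splits block per candidate threshold) is easy to script. The following Python sketch is only an illustration: the function name and the four-taxon data are invented, and the layout simply copies that of the Splits block shown later in this post. It writes the Splits block alone, not a complete nexus file.

```python
def splits_block(splits, ntax, min_weight):
    """Format the splits with weight >= min_weight as a nexus Splits block.

    `splits` is a list of (weight, taxon_numbers) pairs, trivial splits included.
    """
    kept = sorted(
        (item for item in splits if item[0] >= min_weight),
        key=lambda item: -item[0],  # heaviest first; ties keep their input order
    )
    lines = [
        f"BEGIN Splits; [Bipartitions with weight >= {min_weight}]",
        f"DIMENSIONS ntax={ntax} nsplits={len(kept)};",
        "FORMAT labels=no weights=yes confidences=no intervals=no;",
        "MATRIX",
    ]
    for index, (weight, taxa) in enumerate(kept, start=1):
        members = " ".join(str(t) for t in sorted(taxa))
        lines.append(f"[{index}] {weight:.6f} {members},")
    lines += [";", "END; [Splits]"]
    return "\n".join(lines)


# Invented four-taxon example: four trivial splits plus two conflicting ones.
demo = [(1.0, [1]), (1.0, [2]), (1.0, [3]), (1.0, [4]), (0.6, [1, 2]), (0.3, [1, 3])]

# One block per candidate threshold, each to be pasted into its own nexus file.
blocks = {t: splits_block(demo, ntax=4, min_weight=t) for t in (0.5, 0.1)}
print(blocks[0.5])
```

Each resulting block can then be opened in turn, in order to decide how many of the bipartitions to include in the final network.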

Manual procedure

The nexus-format file used for my previous analysis is here. It contains the original sequence data (the Data block), the instructions for MrBayes (the MrBayes block), the treefile produced by MrBayes (the Trees block), and the bipartitions information (the Splits block). This should reproduce the first consensus network shown in the previous blog post, and thus it shows you what a nexus-formatted file looks like.

The Splits block looks like this. The first column of the Matrix is simply an index to label the splits; the second column is the bipartition weight; and the third column lists the taxa in one of the two parts of each bipartition (you can choose either partition, but clearly it is quicker to list the taxa in the smaller partition).

BEGIN Splits; [Bipartitions occurring in >5% of the trees]
DIMENSIONS ntax=17 nsplits=46;
FORMAT labels=no weights=yes confidences=no intervals=no;
MATRIX
[1] 1.000000 1,
[2] 1.000000 2,
[3] 1.000000 3,
[4] 1.000000 4,
[5] 1.000000 5,
[6] 1.000000 6,
[7] 1.000000 7,
[8] 1.000000 8,
[9] 1.000000 9,
[10] 1.000000 10,
[11] 1.000000 11,
[12] 1.000000 12,
[13] 1.000000 13,
[14] 1.000000 14,
[15] 1.000000 15,
[16] 1.000000 16,
[17] 1.000000 17,
[18] 1.000000 12 13 14 15 16 17,
[19] 1.000000 16 17,
[20] 1.000000 12 13 14 15,
[21] 0.990441 9 10,
[22] 0.986401 12 13,
[23] 0.981162 3 4 7,
[24] 0.950994 2 5 6 8 9 10 11 12 13 14 15 16 17,
[25] 0.940455 3 4,
[26] 0.884189 12 13 15,
[27] 0.858771 5 6 8 9 10 11 12 13 14 15 16 17,
[28] 0.467503 5 6 8 11 12 13 14 15 16 17,
[29] 0.359641 6 8,
[30] 0.327114 6 11,
[31] 0.299736 5 11,
[32] 0.298256 6 8 11 12 13 14 15 16 17,
[33] 0.264989 11 12 13 14 15 16 17,
[34] 0.264549 6 8 11,
[35] 0.229872 6 11 12 13 14 15 16 17,
[36] 0.218812 5 6 8 11,
[37] 0.165447 5 11 12 13 14 15 16 17,
[38] 0.146118 9 10 12 13 14 15 16 17,
[39] 0.144918 6 8 12 13 14 15 16 17,
[40] 0.135209 5 6 8 9 10 11,
[41] 0.130750 6 12 13 14 15 16 17,
[42] 0.114961 5 12 13 14 15 16 17,
[43] 0.109871 5 9 10,
[44] 0.105942 14 15,
[45] 0.084963 2 5 6 8 9 10 11,
[46] 0.070264 5 9 10 11 12 13 14 15 16 17,
;
END; [Splits]

The information about the taxa occurring in each bipartition is taken from the following table, which appears in the MrBayes output. The ID is used as the first column of the Splits block; and the asterisks have simply been converted to the relevant taxon number (the Partition columns represent taxa 1–17, in order, so that an asterisk indicates that the particular taxon is included in that partition).

ID -- Partition
-----------------------
1 -- .****************
2 -- .*...............
3 -- ..*..............
4 -- ...*.............
5 -- ....*............
6 -- .....*...........
7 -- ......*..........
8 -- .......*.........
9 -- ........*........
10 -- .........*.......
11 -- ..........*......
12 -- ...........*.....
13 -- ............*....
14 -- .............*...
15 -- ..............*..
16 -- ...............*.
17 -- ................*
18 -- ...........******
19 -- ...............**
20 -- ...........****..
21 -- ........**.......
22 -- ...........**....
23 -- ..**..*..........
24 -- .*..**.**********
25 -- ..**.............
26 -- ...........**.*..
27 -- ....**.**********
28 -- ....**.*..*******
29 -- .....*.*.........
30 -- .....*....*......
31 -- ....*.....*......
32 -- .....*.*..*******
33 -- ..........*******
34 -- .....*.*..*......
35 -- .....*....*******
36 -- ....**.*..*......
37 -- ....*.....*******
38 -- ........**.******
39 -- .....*.*...******
40 -- ....**.****......
41 -- .....*.....******
42 -- ....*......******
43 -- ....*...**.......
44 -- .............**..
45 -- .*..**.****......
46 -- ....*...*********
-----------------------

The bipartition weights come from this table, which also appears in the MrBayes output. The relevant information is in column three. The IDs are the same as in the previous table. (Note that IDs 1–17 have a "Probab." of 1.000000 by definition.)

Summary statistics for informative taxon bipartitions
ID #obs Probab. Sd(s)+ Min(s) Max(s) Nruns
------------------------------------------------------------------
18 100008 1.000000 0.000000 1.000000 1.000000 8
19 100008 1.000000 0.000000 1.000000 1.000000 8
20 100008 1.000000 0.000000 1.000000 1.000000 8
21 99052 0.990441 0.001920 0.987521 0.993121 8
22 98648 0.986401 0.002285 0.984481 0.991041 8
23 98124 0.981162 0.004318 0.974082 0.986721 8
24 95107 0.950994 0.010545 0.939765 0.964723 8
25 94053 0.940455 0.004211 0.934885 0.946564 8
26 88426 0.884189 0.019000 0.851532 0.919526 8
27 85884 0.858771 0.018347 0.839453 0.885449 8
28 46754 0.467503 0.033135 0.432525 0.528278 8
29 35967 0.359641 0.072244 0.286057 0.512679 8
30 32714 0.327114 0.048225 0.243341 0.390609 8
31 29976 0.299736 0.048974 0.233661 0.382049 8
32 29828 0.298256 0.050175 0.201584 0.358851 8
33 26501 0.264989 0.049249 0.185345 0.345732 8
34 26457 0.264549 0.039611 0.204304 0.316375 8
35 22989 0.229872 0.042602 0.141909 0.275578 8
36 21883 0.218812 0.031232 0.160867 0.246460 8
37 16546 0.165447 0.051763 0.116151 0.251500 8
38 14613 0.146118 0.023326 0.094152 0.172466 8
39 14493 0.144918 0.018488 0.108631 0.164947 8
40 13522 0.135209 0.020371 0.106072 0.156707 8
41 13076 0.130750 0.017912 0.108391 0.161507 8
42 11497 0.114961 0.017107 0.096072 0.143429 8
43 10988 0.109871 0.016253 0.075194 0.123830 8
44 10595 0.105942 0.018071 0.072074 0.136069 8
45 8497 0.084963 0.016141 0.065035 0.102872 8
46 7027 0.070264 0.021858 0.051276 0.108951 8
------------------------------------------------------------------

So, some of the information in these two output tables is used to manually produce the Splits block, in the appropriate format, as shown. It would also be possible to write a script to automate this process (e.g. in Perl or Python).
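
As a rough illustration of what such a script might look like, here is a minimal Python sketch. It assumes that the two tables have been copied out of the MrBayes output as plain text in exactly the layout shown above; all of the function names are my own and hypothetical. For each bipartition it lists the smaller side, so for some splits it will list the complementary set of taxa to the one in the hand-made block, which is equivalent.

```python
def pattern_to_taxa(pattern):
    """Taxon numbers (1-based) on the smaller side of a '.'/'*' pattern."""
    starred = [i + 1 for i, c in enumerate(pattern) if c == "*"]
    dotted = [i + 1 for i, c in enumerate(pattern) if c == "."]
    return starred if len(starred) <= len(dotted) else dotted

def parse_partitions(text):
    """Map ID -> pattern from rows such as '18 -- ...........******'."""
    parts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0].isdigit() and fields[1] == "--":
            parts[int(fields[0])] = fields[2]
    return parts

def parse_probabilities(text):
    """Map ID -> 'Probab.' (third column) from the summary statistics rows."""
    probs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            probs[int(fields[0])] = float(fields[2])
    return probs

def splits_block(parts, probs, ntax):
    """Format a Splits block; IDs missing from probs (single taxa) get weight 1."""
    rows = []
    for split_id in sorted(parts):
        weight = probs.get(split_id, 1.0)
        taxa = " ".join(str(t) for t in pattern_to_taxa(parts[split_id]))
        rows.append(f"[{split_id}] {weight:.6f} {taxa},")
    header = [
        "BEGIN Splits;",
        f"DIMENSIONS ntax={ntax} nsplits={len(rows)};",
        "FORMAT labels=no weights=yes confidences=no intervals=no;",
        "MATRIX",
    ]
    return "\n".join(header + rows + [";", "END; [Splits]"])
```

Calling `splits_block(parse_partitions(table1), parse_probabilities(table2), 17)` on the two tables above should then give a block of the same form as the one shown earlier.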

Producing the separate files with Splits blocks containing different percentages of bipartitions is straightforward. As shown above, for the example data there are 46 bipartitions needed for the weights to sum to >0.95 (which is the MrBayes default number). If we choose 0.90 as the sum instead, then only the first 44 bipartitions are needed for the example data, while 0.85 requires only the first 37, and so on:
0.95 46
0.90 44
0.85 37
0.80 36
0.75 34
0.67 29
0.50 27
All that is needed is to delete the unwanted bipartitions from the Splits block, and then save the result as a separate nexus file.
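
The counts in the list above can be reproduced with a short Python sketch. For these data they coincide with counting the bipartitions whose frequency is greater than one minus the chosen level (for example, more than 0.05 for the 0.95 level), plus the 17 single-taxon splits. The "Probab." values are copied from the summary table; the function name is hypothetical.

```python
# "Probab." values for bipartitions 18-46, copied from the summary table.
PROBS = [
    1.000000, 1.000000, 1.000000, 0.990441, 0.986401, 0.981162, 0.950994,
    0.940455, 0.884189, 0.858771, 0.467503, 0.359641, 0.327114, 0.299736,
    0.298256, 0.264989, 0.264549, 0.229872, 0.218812, 0.165447, 0.146118,
    0.144918, 0.135209, 0.130750, 0.114961, 0.109871, 0.105942, 0.084963,
    0.070264,
]
N_TRIVIAL = 17  # bipartitions 1-17, one per taxon, always kept

def n_splits_to_keep(level):
    """Number of rows of the Splits block to keep for a given level."""
    cutoff = 1.0 - level
    return N_TRIVIAL + sum(1 for p in PROBS if p > cutoff)

for level in (0.95, 0.90, 0.85, 0.80, 0.75, 0.67, 0.50):
    print(level, n_splits_to_keep(level))
```

Because the rows of the Splits block are already sorted by decreasing weight, keeping the first n rows is enough; presumably the nsplits value in the DIMENSIONS line should be changed to match.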

Thanks to Mark for asking me to provide this blog post.

Monday, August 12, 2013

Who first used the term "phylogenetic network"?

In a previous blog post (Ngrams and phylogenetics) I noted that Google Books first detects the use of the term "phylogenetic network" in the late 1970s. However, the history of the use (and meaning) of this expression is actually rather confused.

Sadly, none of the early uses of the term actually referred to a reticulating network — they all referred to a non-reticulated tree. Indeed, the very first illustrations of phylogenetic networks (way back in the 1700s) were actually described as trees: “arbre généalogique” (French: family tree). (NB. This terminology is quite natural, because a family pedigree is actually a network when both parents are included from each generation — each offspring is then a hybrid of two parents.)

Grant (1953) came close to modern usage when he used the expression "phylogenetic net" to refer to what we would now call a "hybridization network" in his study of the plant genus Gilia:

The occurrence of sporadic hybridizations during the course of Angiosperm evolution may be the factor which has caused this group to grow up, not as a phylogenetic tree, but as a gigantic, snarled phylogenetic net.

He illustrated this with a rather reticulate genealogy.

Unfortunately, Holmquist (1978) then muddied the water by using the expression "phylogenetic network" to refer to an abstract phylogenetic tree. Although he was not explicit about it, he used "tree" to refer to a rooted phylogenetic diagram and "network" to refer to the unrooted topology of the tree. This is very similar to the use of the term "network" in cladistics, as discussed in an earlier blog post (Some odd network definitions and terms). (NB. The cladistic idea was that an unrooted tree represents a set of rooted trees, one potential root per edge in the tree.)

Throughout the 1980s there were a number of literature uses of "phylogenetic network" that were a reference to Holmquist's paper, although the usage seems to have died out after that. There were also other references to unrooted trees as "networks", but the term "unrooted tree" finally became universal, instead of "network".

Avise et al. (1979a,b) then used the term "phylogenetic network" to refer to what is now called a "haplotype network". They manually created unrooted haplotype graphs by combining several gene networks. However, while there could be reticulations in the individual gene networks, the combined haplotype data had no reticulations, and they thus formed non-reticulate trees.

Throughout the 1980s there were a number of literature uses of "phylogenetic network" that were a reference to this work, although most people actually referred to the paper by Lansman et al. (1983), since this contained a detailed description of the manual technique. It seems to be Excoffier & Smouse (1994) who first generalized haplotype trees to networks, but they did not call them "phylogenetic networks". (NB. They used the union of all minimum spanning trees as a minimum spanning network.)

Finally, it was Bandelt (1994) who formalized the reference to unrooted splits graphs as "phylogenetic networks"; previously, these had been simply called "splits graphs". This use of the term "phylogenetic" does not refer to genealogies, of course, since the graphs are unrooted and thus have no time dimension.

Thus, before the 1990s the term "phylogenetic network" was actually used to refer to various forms of unrooted tree. During that decade the concept was explicitly generalized from trees to unrooted reticulate networks. Only Verne Grant was referring to a rooted genealogy, and he preferred to use the term "net" rather than "network".

That leaves open the question of who first used the expression "phylogenetic network" explicitly in reference to a genealogy.

References

Avise JC, Giblin-Davidson C, Laerm J, Patton JC, Lansman RA (1979 Dec) Mitochondrial DNA clones and matriarchal phylogeny within and among geographic populations of the pocket gopher, Geomys pinetis. Proceedings of the National Academy of Sciences of the United States of America 76(12): 6694-6698.

Avise JC, Lansman RA, Shade RO (1979 May) The use of restriction endonucleases to measure mitochondrial DNA sequence relatedness in natural populations. I. Population structure and evolution in the genus Peromyscus. Genetics 92: 279-295.

Bandelt H-J (1994) Phylogenetic networks. Verhandlungen des Naturwissenschaftlichen Vereins Hamburg 34: 51-71.

Excoffier L, Smouse PE (1994) Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: molecular variance parsimony. Genetics 136: 343-359.

Grant V (1953) The role of hybridization in the evolution of the leafy-stemmed gilias. Evolution 7: 51-64.

Holmquist R (1978) A measure of the denseness of a phylogenetic network. Journal of Molecular Evolution 11: 225-231.

Lansman RA, Avise JC, Aquadro CF, Shapira JF, Daniel SW (1983) Extensive genetic variation in mitochondrial DNA's among geographic populations of the deer mouse, Peromyscus maniculatus. Evolution 37: 1-16.

Wednesday, August 7, 2013

Network of apple cultivars

As is always emphasized in this blog, it is best to explore the nature of any phylogenetic dataset, before proceeding to a formal data analysis. Usually, I discuss examples where important insights are revealed by using a phylogenetic network as a form of Exploratory Data Analysis. Here, instead, I note an example where there are few noteworthy features, in addition to those emphasized by the phylogenetic tree — some datasets really are tree-like.

The paper under discussion is:
Nikiforova S.V., Cavalieri D., Velasco R., Goremykin V. (2013) Phylogenetic analysis of 47 chloroplast genomes clarifies the contribution of wild species to the domesticated apple maternal line. Molecular Biology & Evolution 30: 1751-1760.

The data involve 47 chloroplast genomes from cultivated apple varieties and wild apple species (genus Malus). The nucleotide alignment is 134,553 bp; and the dataset is available in the Dryad database.

The authors did check some of the basic assumptions of their proposed phylogenetic analysis, such as whether the nucleotide substitutions are saturated and whether the nucleotide composition is homogeneous. They report that the data are very well-behaved: the alignment is unproblematic, so there is no ambiguity about homology; the P-distances equal the corrected distances, so that it is unimportant which nucleotide substitution model is chosen; the nucleotide composition is homogeneous; and most of the site variation is binary. The authors conclude that: "phylogenetic signal is well preserved in the data and is not distorted by multiple substitutions and strong compositional bias."
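
To see why near-equality of the P-distances and the corrected distances makes the choice of model unimportant, here is a minimal Python sketch. It uses the Jukes-Cantor correction purely as an illustration; I am not assuming that this is the correction the authors used, and the function names are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites with a gap or ambiguity code."""
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# The correction grows with divergence: negligible when p is small.
for p in (0.001, 0.01, 0.1, 0.3):
    print(p, jc69_distance(p))
```

When the observed proportion of differences is very small, the corrected value is almost identical to the uncorrected one, because few sites are expected to have changed more than once; the correction only matters at larger divergences.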

This does not, however, examine whether the phylogenetic signal is tree-like or not. This is best done with a phylogenetic network. So, I have used a NeighborNet network based on the P-distances, as shown below.

NeighborNet network, with some of the labels (names and bootstrap values) reproduced from the original tree.

In their tree-based analysis (a bootstrapped maximum-likelihood tree) the authors recognize five monophyletic groups (labeled A to E) plus the outgroup Pyrus. The network reveals that the major groups (A–E) are tree-like except for three things:

(1) the A + B grouping has 87% bootstrap support in the tree-based analysis but is not supported by the network analysis;
(2) the grouping of M. zhaojiaoensis with group C has 90% bootstrap support in the tree-based analysis but is not supported by the network analysis;
(3) the relationship of M. fusca and M. micromalus to group A is not clear in the network.

Points (1) and (2) indicate that only branches with 100% bootstrap values (nothing less) are well-supported by the data. Indeed, the branches with 90% and 87% support are very short branches, so there is no significant character data support.

For point (3), the tree-building analysis makes a somewhat arbitrary decision to resolve the conflicting relationships — it shows M. fusca as the sister to group A, but it includes M. micromalus within the group.

Otherwise, the authors' confidence in their tree-based results seems to be well justified.

Monday, August 5, 2013

Ngrams and phylogenetics

Google Books is an archive of published books. For example, it is a very good place to search for scanned copies of books that are in the public domain, which can then be downloaded as PDFs. It also has copies of modern in-print books, which can be searched but not downloaded. The Ngram Viewer is a part of Google Books. It plots a graph showing the number of book occurrences of a given expression through time, from 1800 to 2008.

So, I thought that it might be interesting to search for a few expressions of relevance to readers of this blog. I will let the graphs speak for themselves.

Just to be clear about the scale of the vertical axis, I quote from the instructions:

What the y-axis shows is this: of all the bigrams [two-word expressions] contained in our sample of books written in English and published in the United States, what percentage of them are "phylogenetic network" or "phylogenetic tree"?

It is worth noting that Google Books contains some journal volumes, and so its definition of "book" is rather vague. Also, the dating of some of the books can best be described as bizarre.

We can expand the "phylogenetic network" search, to get more detail. [Note: we could also scale the graph above, as discussed in the Comment below by Joachim Dagg.] We could also compare this graph to Philippe Gambette's publication graph for Who is Who in Phylogenetic Networks, which doesn't show the same explosive growth after 2000. Mind you, before 2000 there weren't many books that could mention networks.

We could also try alternative "tree" searches. We might then ask: why does "evolutionary tree" die off as an expression after 2000?