Wednesday, August 14, 2013

How to construct a consensus network from the output of a bayesian tree analysis


In an earlier blog post I argued that we should present bayesian phylogenetic analyses using networks. The rationale for this is that a bayesian analysis is concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. In phylogenetics based on Markov chain Monte Carlo (MCMC) methods, which produce a set of trees sampled in proportion to their posterior probabilities, the tree topologies can thus be summarized using a consensus network. This should be more in keeping with the bayesian philosophy than is producing a single tree, the so-called MAP tree. The MAP tree is based on combining those taxon partitions with the greatest frequency in the MCMC sample, so that the probability distribution is reduced to a single tree with posterior probability values on the branches. On the other hand, a network produced from all of the partitions that appear in the MCMC sample, weighted according to their frequency, would be much closer to the bayesian aim. [Note: For a clarification of this point, see Leonardo de Oliveira Martins' comment at the end of this post.]

The practical issue with trying to do this is that, at the moment, it is not straightforward to get the consensus network from the output of any of the bayesian computer programs. These programs usually produce a file containing all of the sampled trees (from which the burn-in trees can be deleted). The simplest way to get the consensus network would be to use the SplitsTree or Spectronet programs to produce the network directly from this treefile. This can be done for files with a small number of trees; but no-one recommends doing a bayesian analysis with a small number of MCMC-sampled trees. When the treefile contains tens of thousands of trees, this pushes SplitsTree and Spectronet past their limits, and they crash.

An alternative is to use the smaller "trprobs" file that is provided by, for example, MrBayes, which contains only the unique trees, along with a weight indicating their relative frequency. Unfortunately, SplitsTree and Spectronet do not currently read tree weights in treefiles. So, Holland et al. (2005, 2006) produced a Python script to create the required input files, which can then be loaded into SplitsTree or Spectronet. A copy of this script is provided here.
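
For those who want to roll their own, here is a minimal sketch (in Python) of the same general idea. It is not the Holland et al. script, and it assumes that the tree weights appear as [&W ...] comments on the tree lines of the trprobs file (check your own file before relying on this). It simply writes each unique tree out several times, in proportion to its weight, so that a program that ignores tree weights still sees approximately the correct topology frequencies.

# trprobs2trees.py -- a minimal sketch only (NOT the Holland et al. script)
import re
import sys

infile, outfile = sys.argv[1], sys.argv[2]
TOTAL = 1000                       # approximate total number of replicate trees

weight_re = re.compile(r"\[&W\s*([0-9.eE+-]+)\]")

header = []                        # #NEXUS line, Begin trees, Translate table, ...
trees = []                         # (weight, newick) pairs
for line in open(infile):
    m = weight_re.search(line)
    if m:
        newick = line[line.index("(", m.end()):].strip()
        trees.append((float(m.group(1)), newick))
    elif not trees:
        header.append(line.rstrip("\n"))

with open(outfile, "w") as out:
    out.write("\n".join(header) + "\n")
    for i, (weight, newick) in enumerate(trees, 1):
        copies = max(1, int(round(weight * TOTAL)))
        for j in range(copies):
            out.write("   tree rep_%d_%d = %s\n" % (i, j + 1, newick))
    out.write("end;\n")

The TOTAL value controls the rounding: larger values reproduce the weights more precisely, at the cost of a larger output file.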

However, this approach is still limited, and so it is not the approach that I used in the example analysis provided in my previous blog post. The MrBayes program, for example, also produces a partition table, showing the relative frequency of the bipartitions found in the sample of trees. The consensus network is actually produced from these bipartitions, rather than from the trees, and so this information can also be used instead of the treefile. This can be provided to SplitsTree in a nexus-formatted Splits block (derived from the bipartitions) rather than in a Trees block (derived from the treefile).

There are two practical problems with this approach. First, SplitsTree currently does not construct networks with different percentages of splits when data are input via a Splits block, only when the data are input via a Trees block. So, a series of Splits blocks needs to be constructed, each with the appropriate number of bipartitions, in order to decide how many of the bipartitions should be included in the network. This makes the process tedious. Second, MrBayes does not produce a nexus-format file with the bipartition information, and so the available information must be manually converted to a Splits block and put into a nexus file. I will try to explain this process here.

Manual procedure

The nexus-format file used for my previous analysis is here. It contains the original sequence data (the Data block), the instructions for MrBayes (the MrBayes block), the treefile produced by MrBayes (the Trees block), and the bipartitions information (the Splits block). This should reproduce the first consensus network shown in the previous blog post, and thus it shows you what a nexus-formatted file looks like.

The Splits block looks like this. The first column of the Matrix is simply an index to label the splits; the second column is the bipartition weight; and the third column lists the taxa in one of the two parts of each bipartition (you can choose either partition, but clearly it is quicker to list the taxa in the smaller partition).

BEGIN Splits; [Bipartitions occurring in >5% of the trees]
  DIMENSIONS ntax=17 nsplits=46;
  FORMAT labels=no weights=yes confidences=no intervals=no;
MATRIX
[1]   1.000000 1,
[2]   1.000000 2,
[3]   1.000000 3,
[4]   1.000000 4,
[5]   1.000000 5,
[6]   1.000000 6,
[7]   1.000000 7,
[8]   1.000000 8,
[9]   1.000000 9,
[10]   1.000000 10,
[11]   1.000000 11,
[12]   1.000000 12,
[13]   1.000000 13,
[14]   1.000000 14,
[15]   1.000000 15,
[16]   1.000000 16,
[17]   1.000000 17,
[18]   1.000000 12 13 14 15 16 17,
[19]   1.000000 16 17,
[20]   1.000000 12 13 14 15,
[21]   0.990441 9 10,
[22]   0.986401 12 13,
[23]   0.981162 3 4 7,
[24]   0.950994 2 5 6 8 9 10 11 12 13 14 15 16 17,
[25]   0.940455 3 4,
[26]   0.884189 12 13 15,
[27]   0.858771 5 6 8 9 10 11 12 13 14 15 16 17,
[28]   0.467503 5 6 8 11 12 13 14 15 16 17,
[29]   0.359641 6 8,
[30]   0.327114 6 11,
[31]   0.299736 5 11,
[32]   0.298256 6 8 11 12 13 14 15 16 17,
[33]   0.264989 11 12 13 14 15 16 17,
[34]   0.264549 6 8 11,
[35]   0.229872 6 11 12 13 14 15 16 17,
[36]   0.218812 5 6 8 11,
[37]   0.165447 5 11 12 13 14 15 16 17,
[38]   0.146118 9 10 12 13 14 15 16 17,
[39]   0.144918 6 8 12 13 14 15 16 17,
[40]   0.135209 5 6 8 9 10 11,
[41]   0.130750 6 12 13 14 15 16 17,
[42]   0.114961 5 12 13 14 15 16 17,
[43]   0.109871 5 9 10,
[44]   0.105942 14 15,
[45]   0.084963 2 5 6 8 9 10 11,
[46]   0.070264 5 9 10 11 12 13 14 15 16 17,
;
END; [Splits]

The information about the taxa occurring in each bipartition is taken from the following table, which appears in the MrBayes output. The ID is used as the first column of the Splits block; and the asterisks have simply been converted to the relevant taxon numbers (the Partition columns represent taxa 1–17, in order, so that an asterisk indicates that the particular taxon is included in that partition). A short code fragment illustrating this conversion is given after the table.

      ID -- Partition
      -----------------------
       1 -- .****************
       2 -- .*...............
       3 -- ..*..............
       4 -- ...*.............
       5 -- ....*............
       6 -- .....*...........
       7 -- ......*..........
       8 -- .......*.........
       9 -- ........*........
      10 -- .........*.......
      11 -- ..........*......
      12 -- ...........*.....
      13 -- ............*....
      14 -- .............*...
      15 -- ..............*..
      16 -- ...............*.
      17 -- ................*
      18 -- ...........******
      19 -- ...............**
      20 -- ...........****..
      21 -- ........**.......
      22 -- ...........**....
      23 -- ..**..*..........
      24 -- .*..**.**********
      25 -- ..**.............
      26 -- ...........**.*..
      27 -- ....**.**********
      28 -- ....**.*..*******
      29 -- .....*.*.........
      30 -- .....*....*......
      31 -- ....*.....*......
      32 -- .....*.*..*******
      33 -- ..........*******
      34 -- .....*.*..*......
      35 -- .....*....*******
      36 -- ....**.*..*......
      37 -- ....*.....*******
      38 -- ........**.******
      39 -- .....*.*...******
      40 -- ....**.****......
      41 -- .....*.....******
      42 -- ....*......******
      43 -- ....*...**.......
      44 -- .............**..
      45 -- .*..**.****......
      46 -- ....*...*********
      -----------------------
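
Converting one of these asterisk patterns into the taxon list needed for the Splits block is easy to script. For example, this little Python fragment (using the pattern for ID 23, and numbering the positions 1–17) prints the taxon numbers of the asterisked part:

pattern = "..**..*.........."      # ID 23: '*' marks an included taxon
taxa = [str(i) for i, c in enumerate(pattern, start=1) if c == "*"]
print(" ".join(taxa))              # prints: 3 4 7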

The bipartition weights come from this table, which also appears in the MrBayes output. The relevant information is in the third column (Probab.). The IDs are the same as in the previous table. (Note that IDs 1–17 have a "Probab." of 1.000000 by definition.)

      Summary statistics for informative taxon bipartitions
      ID   #obs      Probab.     Sd(s)+      Min(s)      Max(s)   Nruns 
      ------------------------------------------------------------------
      18  100008    1.000000    0.000000    1.000000    1.000000    8
      19  100008    1.000000    0.000000    1.000000    1.000000    8
      20  100008    1.000000    0.000000    1.000000    1.000000    8
      21   99052    0.990441    0.001920    0.987521    0.993121    8
      22   98648    0.986401    0.002285    0.984481    0.991041    8
      23   98124    0.981162    0.004318    0.974082    0.986721    8
      24   95107    0.950994    0.010545    0.939765    0.964723    8
      25   94053    0.940455    0.004211    0.934885    0.946564    8
      26   88426    0.884189    0.019000    0.851532    0.919526    8
      27   85884    0.858771    0.018347    0.839453    0.885449    8
      28   46754    0.467503    0.033135    0.432525    0.528278    8
      29   35967    0.359641    0.072244    0.286057    0.512679    8
      30   32714    0.327114    0.048225    0.243341    0.390609    8
      31   29976    0.299736    0.048974    0.233661    0.382049    8
      32   29828    0.298256    0.050175    0.201584    0.358851    8
      33   26501    0.264989    0.049249    0.185345    0.345732    8
      34   26457    0.264549    0.039611    0.204304    0.316375    8
      35   22989    0.229872    0.042602    0.141909    0.275578    8
      36   21883    0.218812    0.031232    0.160867    0.246460    8
      37   16546    0.165447    0.051763    0.116151    0.251500    8
      38   14613    0.146118    0.023326    0.094152    0.172466    8
      39   14493    0.144918    0.018488    0.108631    0.164947    8
      40   13522    0.135209    0.020371    0.106072    0.156707    8
      41   13076    0.130750    0.017912    0.108391    0.161507    8
      42   11497    0.114961    0.017107    0.096072    0.143429    8
      43   10988    0.109871    0.016253    0.075194    0.123830    8
      44   10595    0.105942    0.018071    0.072074    0.136069    8
      45    8497    0.084963    0.016141    0.065035    0.102872    8
      46    7027    0.070264    0.021858    0.051276    0.108951    8
      ------------------------------------------------------------------

So, some of the information in these two output tables is used to manually produce the Splits block, in the appropriate format, as shown. It would also be possible to write a script to automate this process (e.g. in Perl or Python).
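
As a sketch of how such a script might look (in Python), the following assumes that the two tables shown above have been copied into plain-text files; the names parts.txt and tstat.txt are just placeholders of mine, not files produced by MrBayes. It writes a Splits block like the one listed earlier, keeping the bipartitions that occur in more than 5% of the trees. Note that it always lists the smaller side of each bipartition, so for a few of the splits it lists the complement of the side shown in the block above (the split itself is the same).

# make_splits_block.py -- a minimal sketch only, not a polished tool
import re

NTAX = 17                           # number of taxa
CUTOFF = 0.05                       # keep bipartitions occurring in >5% of the trees

# 1. Read the "ID -- Partition" table: one pattern of '.' and '*' per bipartition.
patterns = {}
for line in open("parts.txt"):
    m = re.match(r"\s*(\d+)\s+--\s+([.*]+)\s*$", line)
    if m:
        patterns[int(m.group(1))] = m.group(2)

# 2. Read the bipartition statistics table: the weight is the "Probab." column.
weights = dict((i, 1.0) for i in range(1, NTAX + 1))    # trivial splits = 1.0
for line in open("tstat.txt"):
    m = re.match(r"\s*(\d+)\s+\d+\s+([0-9.]+)", line)
    if m:
        weights[int(m.group(1))] = float(m.group(2))

# 3. Convert each pattern to a taxon list (the smaller side) and write the block.
splits = []
for sid in sorted(patterns):
    weight = weights.get(sid, 0.0)
    if weight <= CUTOFF:
        continue
    taxa = [i for i, c in enumerate(patterns[sid], start=1) if c == "*"]
    if len(taxa) > NTAX // 2:                           # list the smaller partition
        taxa = [i for i in range(1, NTAX + 1) if i not in taxa]
    splits.append((sid, weight, taxa))

with open("splits_block.nex", "w") as out:
    out.write("BEGIN Splits; [Bipartitions occurring in >5% of the trees]\n")
    out.write("  DIMENSIONS ntax=%d nsplits=%d;\n" % (NTAX, len(splits)))
    out.write("  FORMAT labels=no weights=yes confidences=no intervals=no;\n")
    out.write("MATRIX\n")
    for sid, weight, taxa in splits:
        out.write("[%d]   %f %s,\n" % (sid, weight, " ".join(str(t) for t in taxa)))
    out.write(";\nEND; [Splits]\n")

For the example data this reproduces the 46 splits shown above.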

Producing the separate files with Splits blocks containing different percentages of bipartitions is straightforward. As shown above, for the example data there are 46 bipartitions with weights >0.05, that is, occurring in more than 5% of the trees (which is the MrBayes default for reporting them); this corresponds to a level of 0.95. If we choose 0.90 as the level instead, so that only bipartitions with weights >0.10 are kept, then only the first 44 bipartitions are needed for the example data, while 0.85 requires only the first 37, and so on:
Level  Bipartitions
0.95   46
0.90   44
0.85   37
0.80   36
0.75   34
0.67   29
0.50   27
All that is needed is to delete the unwanted bipartitions from the Splits block, and then save the result as a separate nexus file.
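
This step, too, is easy to script. The following sketch (again in Python, and again using my placeholder file names) takes the full Splits block written by the previous script and, for each level, writes a separate copy that keeps only the bipartitions whose weight exceeds one minus that level (so 0.90 keeps weights >0.10, and so on). Each resulting block then just needs to be pasted into a nexus file along with the Taxa (or Data) block, as in the example file.

# filter_splits.py -- a minimal sketch, with placeholder file names
import re

LEVELS = [0.95, 0.90, 0.85, 0.80, 0.75, 0.67, 0.50]
split_re = re.compile(r"\[\d+\]\s+([0-9.]+)\s")         # matches the split lines

lines = open("splits_block.nex").read().splitlines()

for level in LEVELS:
    cutoff = 1.0 - level
    kept = []
    nsplits = 0
    for ln in lines:
        m = split_re.match(ln)
        if m and float(m.group(1)) <= cutoff:
            continue                                    # drop this bipartition
        if m:
            nsplits += 1
        kept.append(ln)
    # update the nsplits count in the DIMENSIONS line
    kept = [re.sub(r"nsplits=\d+", "nsplits=%d" % nsplits, ln) for ln in kept]
    outname = "splits_%d.nex" % round(100 * level)
    with open(outname, "w") as out:
        out.write("\n".join(kept) + "\n")
    print("%s: %d bipartitions" % (outname, nsplits))

For the example data this reproduces the counts listed above (46, 44, 37, 36, 34, 29, 27).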

Thanks to Mark for asking me to provide this blog post.

5 comments:

  1. Actually the MAP tree is simply the most frequent tree amongst the sampled ones, that is, the first one in the 'trprobs' file. The consensus tree, OTOH, is based on split frequencies. Both point estimates try to minimize the Bayesian risk, but based on distinct loss functions (e.g. http://sysbio.oxfordjournals.org/content/60/4/528). I don't see them as less or more Bayesian, but I understand your point that a network can represent much more information than a single tree -- and the beauty of a Bayesian analysis is in not neglecting the uncertainty ;)

    Thanks for the posts, by the way, I really enjoy this blog.

    1. Leonardo, Thanks for the clarification of an important point; and also for reading the blog. David

  2. Really helpful post. Thanks.

    An alternative approach is to use the 'consensusNet' function in Klaus Schliep's R package 'phangorn'. Of the full 10,000 trees, I sampled 1,000 post-burnin trees randomly to calculate the networks. I repeated this a few times to make sure the results did not change drastically.

    It was a very easy procedure.

    1. Thanks for telling us that, Rupert; it is not a package that I have had a chance to look at. Even 10,000 trees is quite a small sample for many people's datasets, and this is where reading all trees becomes problematic. David

    2. Assuming your machine has enough RAM, there's no reason why R cannot process >10,000 trees. It just takes a bit of time, that's all. A job to leave running overnight perhaps.

      Maybe some benchmarking studies exist to see how this scales up with more trees/taxa etc?
