Wednesday, May 29, 2013

Should phylogenetic modelling proceed from simple to complex or vice versa?

In statistical model testing, models can be tested by starting with the simplest model and progressively adding model complexity until the desired level of model fit is achieved. Alternatively, one can start with the most complex model and progressively delete unnecessary components while maintaining the desired level of model fit. The first approach is constructive, in the sense that the model is constructed piece by piece (stepwise addition), while the second approach is reductive, in the sense that the full model is pared down to its simplest form (stepwise deletion).

This distinction in approaches to modelling is relevant to the difference between using trees and networks as phylogenetic models.

At the moment, the most common approach to phylogenetic analysis is the constructive one. One starts with the simplest model, a bifurcating tree, and assesses the degree to which it fits the data. If the fit is poor, as it often is with multi-gene data, especially if the gene data are concatenated, then complexity is added. For example, one might include incomplete lineage sorting (ILS) in the model, which allows the different genes to fit different trees, while still maintaining the need for a single dichotomous species tree. Alternatively, one might consider gene duplication-loss as a possible addition to the model, which is another major source of incompatibility between multi-gene data and a single species tree. Only if these additional complexities also fail to attain the desired degree of fit does one consider adding components of reticulate evolution to the model, such as hybridization or horizontal gene transfer (HGT).

The reductive (or simplification) approach, however, proceeds the other way. A general network model is used as the starting point. The various components of this model would include a dichotomous tree as a special case, along with ILS, duplication-loss, hybridization, and HGT as individual components. These special cases are evaluated simultaneously, and each one is dropped if it is contributing nothing worthwhile to the model fit. The final model consists of the simplest combination of components that still maintains the specified fit of data and model; this may indeed be a simple tree.

The main advantage of the latter approach is that all of the components of the model are evaluated simultaneously, so that their potential interactions can be quantitatively assessed. Components are dropped from the model only if they contribute nothing to the model, either independently or in synergy with the other components. That is, they are dropped only if they can be shown to be redundant.

This does not happen with the constructive approach to modelling. Here, the components are evaluated in some specified order, and components that are later in the order will not be evaluated unless the earlier components prove to be inadequate. These later components are thus potentially excluded from statistical consideration. This means that their possible contribution to biological explanation may never be quantitatively assessed.
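The contrast between the two strategies can be sketched with a toy example. Everything here is hypothetical: the fit scores, the target of 85, and the assumption that hybridization and HGT improve the fit only when both are present. It is meant only to show how a fixed order of addition can stop before the later components are ever evaluated, not to describe any real phylogenetic software.

```python
COMPONENTS = ["ILS", "duplication_loss", "hybridization", "HGT"]
TARGET_FIT = 85  # desired fit, in arbitrary percentage points (hypothetical)


def fit(model):
    """Toy fit score for a set of components added to a bifurcating tree."""
    score = 60  # the tree on its own (hypothetical value)
    if "ILS" in model:
        score += 15
    if "duplication_loss" in model:
        score += 10
    # Hypothetical synergy: these two help only when both are present.
    if "hybridization" in model and "HGT" in model:
        score += 30
    return min(score, 100)


def constructive(order):
    """Stepwise addition: add components in a fixed order until the fit is adequate."""
    model = []
    for component in order:
        if fit(model) >= TARGET_FIT:
            break  # the remaining components are never evaluated
        model.append(component)
    return model


def reductive(components):
    """Stepwise deletion: start with everything, drop whatever costs the least fit."""
    model = list(components)
    while model:
        # Evaluate the removal of every remaining component.
        candidates = [[c for c in model if c != dropped] for dropped in model]
        best = max(candidates, key=fit)
        if fit(best) < TARGET_FIT:
            break  # every further deletion would break the target
        model = best
    return model


print(constructive(COMPONENTS))  # ['ILS', 'duplication_loss']
print(reductive(COMPONENTS))     # ['hybridization', 'HGT']
```

Both final models meet the target, but they are different models, and the constructive one never looked at reticulation at all. With other made-up scores the two approaches could just as easily agree.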

So, in practice, evolutionary reticulation is considered to be a "last resort" in current phylogenetic analyses. It is considered as a possible biological explanation only if all else has already failed.

This philosophy seems to be as much a historical artifact as anything else. The first phylogenetic diagrams (by Buffon and Duchesne) were networks not trees, but they were replaced a century later by the tree model suggested by Darwin; and the tree has retained its primacy since that time. This leads naturally to the constructive approach to modelling, which is so prevalent in the current literature.

However, there is no necessary statistical superiority of the constructive approach to modelling. Indeed, statisticians seem to consider forward and backward selection of model components to be essentially equivalent, although they may lead to different models for any given dataset. The most commonly specified advantage of the constructive approach to modelling is that it is likely to avoid possible problems arising from having too many components in the model.

Nevertheless, the reductive approach has the distinct advantage of simultaneously evaluating all possible special cases of a network, and thus does not exclude any possible biological explanation that might apply to the observed data. This may provide more biological insight than does the constructive approach to phylogenetic modelling.

Monday, May 27, 2013

Charles Darwin's family pedigree network

It is widely known that Charles Darwin was married to his first cousin Emma Wedgwood. Emma came with a substantial dowry, being the grand-daughter of Josiah Wedgwood, the founder of the Wedgwood pottery firm (as, indeed, was Darwin himself, via his mother). Darwin already had a substantial allowance from his own father (a successful physician, real estate speculator, investor, and money lender), and the combined incomes allowed him to live the life of a "gentleman of independent means". He thus conducted his scientific work unhindered by the practical concerns of the rest of us.

What is perhaps less well known is that Darwin was interested in (and concerned about) the genetic effect on his children of his consanguineous marriage. He performed many experiments on inbreeding in plants, and demonstrated that the offspring of cross-fertilized plants were more vigorous and numerous than the offspring of self-fertilized plants. It occurred to him that the same thing might be true for animals, as well, including humans.

Furthermore, he thought that this might be an explanation for the unhealthy nature of his own children. Three of his ten children died young, and three more of them had long-term marriages that produced no offspring (implying infertility). These data stand out even within the Darwin-Wedgwood families, let alone outside them.

In birth order, the children were:
William Erasmus – married, no children
Anne Elizabeth – died young (tuberculosis)
Mary Eleanor – died young
Henrietta Emma – married, no children
George Howard – married, four children
Elizabeth – unmarried, no children (apparently had difficulties with words and pronunciation)
Francis – married twice, two children
Leonard – married twice, no children
Horace – married, three children
Charles Waring – died young

Part of the Darwin / Wedgwood pedigree is shown in the figure, which is taken from the 2010 paper by Tim M. Berra, Gonzalo Alvarez and Francisco C. Ceballos (Was the Darwin / Wedgwood dynasty adversely affected by consanguinity? BioScience 60: 376-383). Note that the family tree is drawn as a hybridization network (also called a "path diagram"), rather than a traditional family tree, which is an important point that I have previously emphasized for pedigrees (Family trees, pedigrees and hybridization networks).

The diagram shows only four of the people from Darwin's children's generation (including only one of his own children), but all four of these people (and their unshown siblings) are the offspring of first-cousin marriages. Indeed, Louisa Frances Wedgwood's parents were double first cousins (ie. they were cousins via both of their parents). These consanguineous marriages all involved the children of Josiah Wedgwood II (they are four of his eight children who survived to adulthood) — this is not a family tradition that should be encouraged. (You will note that Darwin's sister Caroline married Emma's brother Josiah III, thus literally keeping everything in the family.)

The inbreeding coefficient (the probability that at a given locus an individual receives two identical genes as a result of common ancestry) of Louisa Frances is 0.126, while that of the other three people is 0.063. Most of the other people in the Darwin / Wedgwood family have inbreeding coefficients of 0.000. Berra and his coauthors compared the child mortality with the inbreeding coefficients for four generations of the family, and concluded that there is a statistically significant relationship.
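For readers who want to see where values of this size come from, here is a minimal sketch of Wright's path-counting formula. It assumes that the common ancestors are not themselves inbred and that there is no more remote shared ancestry. I have not checked how Berra and his coauthors derived their published figures, so the function name and inputs below are mine, not theirs.

```python
def inbreeding_coefficient(paths):
    """Wright's path-counting formula, assuming non-inbred common ancestors.

    `paths` holds one entry per common ancestor of the two parents: the number
    of generations from that ancestor down to the father and to the mother.
    """
    return sum(0.5 ** (n_father + n_mother + 1) for n_father, n_mother in paths)


# Offspring of first cousins: the parents share two grandparents.
first_cousins = [(2, 2), (2, 2)]

# Offspring of double first cousins: the parents share all four grandparents.
double_first_cousins = [(2, 2)] * 4

print(inbreeding_coefficient(first_cousins))         # 0.0625
print(inbreeding_coefficient(double_first_cousins))  # 0.125
```

These textbook values are very close to the 0.063 and 0.126 quoted above. Any small excess would have to come from additional shared ancestry, which this sketch ignores.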

The data look like this for the 20 marriages in the final three generations:
                         Child mortality to 10 years
                                 =0     >0
Inbreeding coefficient =0        11      5
                       >0         1      3
Clearly, the second sample size is rather small, but the unconditional test of two independent proportions yields p=0.076. The relative risk is 2.4 (ie. the children of first-cousin marriages were >2 times more likely to die before 10 years of age than were the other children).
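The relative risk can be recomputed directly from the table. This is only the arithmetic, using the four counts shown above; the variable names are mine. Note that the units are marriages, so the "risk" is the proportion of marriages with any child mortality to 10 years.

```python
# Counts of marriages, taken from the table above.
not_inbred = {"no_child_deaths": 11, "child_deaths": 5}  # inbreeding coefficient = 0
inbred = {"no_child_deaths": 1, "child_deaths": 3}       # inbreeding coefficient > 0


def risk(row):
    """Proportion of marriages in this row with at least one child death."""
    return row["child_deaths"] / sum(row.values())


relative_risk = risk(inbred) / risk(not_inbred)

print(risk(not_inbred))         # 0.3125  (5 of 16 marriages)
print(risk(inbred))             # 0.75    (3 of 4 marriages)
print(round(relative_risk, 1))  # 2.4
```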

Darwin did not have easy access to these data, of course, but they justify his concern for the effect on humans of inbreeding. Indeed, he went so far as to suggest that the 1871 British census should enquire about consanguineous marriages ("the returns would show whether married cousins have in their households on the night of the census as many children as have parents who are not related; and should the number prove fewer, we might safely infer either lessened fertility in the parents, or which is more probable, lessened vitality in the offspring"). This suggestion was not implemented.

However, his son George (the oldest fertile child) did pursue the matter of inbreeding. Indeed, he introduced the idea of using the frequency of occurrence of the same (birth) surname among married couples as a means to study the level of inbreeding in a population. Such surname models are still used in human population biology today.

[Image: Henri de Toulouse-Lautrec]

Incidentally, many other famous people have married their first cousin, although unlike Darwin they did not necessarily have any children with them. For example, Albert Einstein married Elsa Löwenthal (née Einstein), his first cousin through their mothers and second cousin through their fathers; however, his three children were from his relationship with his first wife, Mileva Marić. H.G. Wells' first marriage was to Isabel Wells, a first cousin, but his four children were with his second wife and two of his lovers. Edgar Allan Poe's only marriage was to his cousin Virginia Clemm, but they had no children. [See the later post: Albert Einstein's consanguineous marriage]

Sadly, there are also well-known cases where the offspring of first cousins seem to have suffered badly. Perhaps the best known of these is the artist Henri de Toulouse-Lautrec. Henri's two grandmothers were sisters, so that his parents were first cousins, and he suffered from congenital health conditions that are usually attributed to genetic disorders. For example, Henri fractured his right thigh bone when he was 13 and his left at 14, and the breaks did not heal properly. His legs ceased to grow, so that he achieved the shape for which he is best known, with an adult-sized torso but child-sized legs. He died at the young age of 36. [See the later post Toulouse-Lautrec: family trees and networks]

First-cousin marriages have declined significantly since Darwin's time. According to Adam Kuper (2010. Incest and Influence: the Private Life of Bourgeois England. Harvard University Press), cousin marriages have declined from 1:25 marriages (among the upper middle classes) in the 19th century to 1:6,000 in the 1930s and 1:25,000 in the 1960s. Kuper's book provides an interesting insight into why such marriages were previously so common among the upper bourgeoisie and why they are much rarer now.

Wednesday, May 22, 2013

Are phylogenetic trees useful for domesticated organisms?

When looking at the population genetics literature I have noticed that many papers still present very traditional phylogenetic analyses, particularly in what can broadly be called agricultural studies. For instance, genetic distances might be calculated between the samples and a "tree of genetic relationships" presented based on UPGMA clustering.

The problem with this sort of approach to genotype data analysis is that it forces the data into an ultrametric tree, which has long been shown to be inappropriate as a model for evolutionary relationships. Furthermore, there is no indication of the robustness of this tree, nor even whether a tree model is appropriate in the first place.
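What "ultrametric" means can be checked directly on a distance matrix: for every three samples, the two largest of the three pairwise distances must be equal. The following is a minimal sketch of that three-point check. The two small matrices are made up for illustration and have nothing to do with the grape data.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - middle > tol:
            return False
    return True


# Hypothetical distances that fit an ultrametric tree exactly.
clocklike = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

# Hypothetical distances that do not: sample 2 is closer to sample 1 than to sample 0.
not_clocklike = [
    [0, 2, 6],
    [2, 0, 5],
    [6, 5, 0],
]

print(is_ultrametric(clocklike))      # True
print(is_ultrametric(not_clocklike))  # False
```

UPGMA will return an ultrametric tree for either matrix, which is the problem: for the second one the tree distances cannot match the data.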

As a specific example, we can look at the microsatellite data presented by Carimi et al. (2010) for various Sicilian grape cultivars. For grape varieties, where hybridization among cultivars has been the historical norm, an ultrametric tree seems singularly inappropriate.

Wine grapes have been grown on Sicily for more than 2,000 years, and at least 120 grape-vine cultivar names are known in the literature. The authors sampled 82 of the cultivars from the Institute of Plant Genetics (Palermo) germplasm collection, with 1-5 clones sampled per cultivar. They assessed six polymorphic microsatellite loci, producing diploid (co-dominant) data. Only 70 distinct genotypes were detected, which were then subjected to data analysis.

The authors used the "Simple Matching coefficient for co-dominant and multiallelic data" to estimate the genetic distances between samples. Unfortunately, this has been shown to have odd properties for diploid microsatellite data (Kosman and Leonard 2005). Therefore, in my analysis I have used the simple metric of Kosman and Leonard (2005), instead, in which genotype distances are calculated as a proportion of the shared alleles at each locus (averaged across loci). This was calculated using the mmod R package (Winter 2012).
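As a rough illustration of a shared-allele distance for diploid co-dominant data, here is a minimal sketch. It is my own simplification of the description above (one minus the proportion of shared alleles at each locus, averaged across loci), not the mmod code, and I have not verified that it matches Kosman and Leonard's definition in every detail. The allele sizes are invented.

```python
from collections import Counter


def locus_distance(genotype1, genotype2):
    """One minus the proportion of alleles shared at a single diploid locus."""
    # Multiset intersection, so a homozygote (a, a) shares only one allele with (a, b).
    shared = sum((Counter(genotype1) & Counter(genotype2)).values())
    return 1 - shared / 2


def genotype_distance(individual1, individual2):
    """Average of the per-locus distances across all loci."""
    per_locus = [locus_distance(a, b) for a, b in zip(individual1, individual2)]
    return sum(per_locus) / len(per_locus)


# Hypothetical two-locus microsatellite genotypes (allele sizes in base pairs).
cultivar_a = [(120, 124), (200, 200)]
cultivar_b = [(120, 130), (200, 204)]
cultivar_c = [(118, 122), (196, 198)]

print(genotype_distance(cultivar_a, cultivar_a))  # 0.0
print(genotype_distance(cultivar_a, cultivar_b))  # 0.5
print(genotype_distance(cultivar_a, cultivar_c))  # 1.0
```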

The authors then used the "UPGMA (Unweighted Pair-Group Method with Arithmetical Averages)" clustering method to produce their ultrametric tree from the distance data. This is the most commonly encountered agglomerative hierarchical clustering method to be found in the literature. Instead, I used a NeighborNet network to evaluate whether the data are tree-like, calculated using the SplitsTree program.

The resulting network is shown in the first graph. Cultivars that are closely connected in the network are similar to each other based on their microsatellite profiles, and those that are further apart are progressively more different from each other.

The network shows that there is very little hierarchical structure to the grape-vine microsatellite data. The data do not clearly distinguish "six main groups", as interpreted by the original authors based on their tree (which is shown below). [Note that one of the authors' groups (cluster E) is more heterogeneous than the others, and to be comparable should be divided into either two or three groups.]

Note that the network emphasizes two things: (1) there are no clear groupings of the grape cultivars, and (2) the data are rather "noisy", as microsatellite data often are (e.g. Leroy et al. 2009), with many incompatible signals.

As far as the phylogenetic history is concerned, there is no evidence of "several origins for Sicilian grape-vine germplasm", as interpreted by the authors. Instead, there seems to have been continuous mixing of the genotypes, probably including cultivars from elsewhere in Italy, and even further afield around the Mediterranean. This type of complex genetic history seems to be quite common in domesticated organisms, and a tree-based analysis is therefore unlikely to be appropriate for studying them; see, for example, Decker et al. (2009) for cows, Leroy et al. (2009) for horses, and Kijas et al. (2012) for sheep.


Carimi F, Mercati F, Abbate L, Sunseri F (2010) Microsatellite analyses for evaluation of genetic diversity among Sicilian grapevine cultivars. Genetic Resources and Crop Evolution 57: 703–719.

Decker JE, Pires JC, Conant GC, McKay SD, Heaton MP, Chen K, Cooper A, Vilkki J, Seabury CM, Caetano AR, Johnson GS, Brenneman RA, Hanotte O, Eggert LS, Wiener P, Kim J-J, Kim KS, Sonstegard TS, Van Tassell CP, Neibergs HL, McEwan JC, Brauning R, Coutinho LL, Babar ME, Wilson GA, McClure MC, Rolf MM, Kim J, Schnabel RD, Taylor JF (2009) Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the U.S.A. 106: 18644-18649.

Kijas JW, Lenstra JA, Hayes B, Boitard S, Porto Neto LR, San Cristobal M, Servin B, McCulloch R, Whan V, Gietzen K, Paiva S, Barendse W, Ciani E, Raadsma H, McEwan J, Dalrymple B, other members of the International Sheep Genomics Consortium (2012) Genome-wide analysis of the world's sheep breeds reveals high levels of historic mixture and strong recent selection. PLoS Biology 10: e1001258.

Kosman E, Leonard KJ (2005) Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology 14: 415–424.

Leroy G, Callède L, Verrier E, Mériaux JC, Ricard A, Danchin-Burge C, Rognon X (2009) Genetic diversity of a large set of horse breeds raised in France assessed by microsatellite polymorphism. Genetics Selection Evolution 41: 5.

Winter DJ (2012) mmod: an R library for the calculation of population differentiation statistics. Molecular Ecology Resources 12: 1158–1160.

Monday, May 20, 2013

Destroying the Tree of Life?

In my previous blog post (Resistance to network thinking) I noted that a phylogenetic network is a generalization of a phylogenetic tree because "a network simplifies to a tree if there are no incompatible phylogenetic signals". Given this, to me it has often seemed somewhat odd that so many of the people who are interested in generalizing the Tree of Life into a Network of Life use metaphors suggesting that the tree first needs to be destroyed.

This approach was popularized by Ford Doolittle, who entitled his 2000 Scientific American [282(2): 90–95] article "Uprooting the Tree of Life", although this particular metaphor had previously been used by, for example, Elizabeth Pennisi [Science 284: 1305-1307].

This approach reached its apogee with the ridiculous cover of New Scientist in January 2009. The cover accompanied an article by Graham Lawton now mildly entitled: "Why Darwin was wrong about the Tree of Life" [201(2692): 34-39], although the editor (Roger Highfield) originally called it "Axing Darwin's tree".

As was noted at the time, this cover was "a misdirected and entirely inappropriate piece of sensationalism", which did no one any good (least of all the editor). A subsequent Letter to the Editor [by Dennett, Coyne, Dawkins and Myers] noted: "Nothing in the article showed that the concept of the Tree of Life is unsound; only that it is more complicated than was realised before the advent of molecular genetics."

So, it seems likely that the tree needs to be neither axed nor uprooted, nor "trashed" [Laura Franklin-Hall], nor even "politely buried" [Michael Rose]. In many cases all that is needed is some osculations between the branches. Indeed, most of the scientific discussion is about how many osculations there are, and how we can best detect where they are, rather than about destroying the tree itself. A network is more general than a tree, rather than being a fundamentally different structure. Nevertheless, some people, such as Michael Syvanen, have been quoted as saying: "We've just annihilated the Tree of Life", when referring to their new network.

Wednesday, May 15, 2013

Resistance to network thinking

Phylogeneticists are used to the idea of tree thinking, in which evolutionary history is seen as a branching tree-like pattern. Clearly, for many phylogeneticists this has not yet been extended to network thinking, in which evolutionary history can also be seen as a reticulating network. Indeed, I have recently come across several people who have actively insisted that "trees are still central" to phylogenetics (to quote one of my correspondents). As Mindell (2013) has claimed, the Tree of Life is still a useful metaphor, model and heuristic device.

So, there is not just indifference to networks but there seems also to be some resistance to them. This is somewhat unexpected, as a network simplifies to a tree if there are no incompatible phylogenetic signals, and so there is no intrinsic reason to restrict phylogenies to being tree-like.

As a typical example from the literature, Losos et al. (2012) have recently commented:
"Although molecular data have rarely changed our understanding of the major multicellular groups of the evolutionary tree of life, they have suggested changes in the relationships within many groups, such as the evolutionary position of whales in the clade of even-toed ungulates. Further investigation has usually resolved conflicts, often by revealing inadequacies in previous morphological studies. This has led to a presumption by many in favor of molecular data."
Needless to say this is a biased point of view, because conflicts can also be resolved by revealing inadequacies in molecular studies. For example, molecular analyses involve many subjective decisions about substitution models and rates of molecular change, and any one of the underlying assumptions may be violated. There is no theoretical justification for favouring one source of data over another.

Similarly, there is no theoretical justification for trying to resolve conflicts by preferring one hypothesis over another. Phylogenetic conflicts can also be "resolved" by recognizing that evolutionary history is not necessarily tree-like. Losos et al. do not even consider this possibility:
"When two phylogenies are fundamentally discordant, at least one data set must be misleading."
In fact, the only misleading thing here is the word "must", because both datasets may be perfectly correct but are simply the product of two different evolutionary histories.

This point is perhaps most obvious when comparing molecular datasets. The evolutionary history revealed by between-gene evolutionary processes (e.g. recombination, hybridization, horizontal gene transfer) often conflicts with that from within-gene processes (e.g. nucleotide substitutions and insertions / deletions), and this leads to a reticulating evolutionary history.

Indeed, the more we learn about genomes the less tree-like does the evolutionary history of species seem to be. There are long-standing controversies regarding the evolutionary history of many taxonomic groups, and it has been hoped that genome-scale data would resolve these controversies. However, to date none of these controversies has been satisfactorily resolved into an unambiguous tree-like genealogical history using genome data. They all apparently involve reticulate evolutionary processes.

For example, the estimated relationships among humans, chimpanzees and gorillas did not change as a result of genome sampling (Galtier and Daubin 2008), nor did those of malaria species (Kuo et al. 2008) nor those of placental superorders (Hallström and Janke 2010). In all three cases the estimated relationships were just as complex after the genome sequencing as before. The resolution of controversial branches in our trees has not occurred as a result of increased access to character data or improved data analyses, but our recognition of reticulating relationships certainly has occurred.

There are many other examples where increased character sampling is yet to resolve long-standing controversies about branching patterns, and where reticulation may also be the true explanation. Birds seem to provide many of these examples (eg. Smith et al. 2013), but insects are a rich source as well (eg. Thomas et al. 2013), and sometimes even plants (eg. Goremykin et al. 2013).

Clearly, when two or more phylogenies are fundamentally discordant, none of the datasets needs to be misleading, because a reticulating history may be involved. Network thinking should thus be a standard tool in the arsenal of every phylogeneticist. Tree thinking excludes networks but network thinking does not exclude trees, and so the more general model will always be the more useful one.

[Note: An empirical example is discussed in this later blog post: Conflicting placental roots: network or tree?]


Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences 363: 4023-4029.

Goremykin VV, Nikiforova SV, Biggs PJ, Zhong B, Delange P, Martin W, Woetzel S, Atherton RA, McLenachan PA, Lockhart PJ (2013) The evolutionary root of flowering plants. Systematic Biology 62: 50-61.

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology and Evolution 27: 2804-2816.

Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology and Evolution 25: 2689-2698.

Losos JB, Hillis DM, Greene HW (2012) Who speaks with a forked tongue? Science 338: 1428-1429.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Smith JV, Braun EL, Kimball RT (2013) Ratite nonmonophyly: independent evidence from 40 novel loci. Systematic Biology 62: 35-49.

Thomas JA, Trueman JW, Rambaut A, Welch JJ (2013) Relaxed phylogenetics and the Palaeoptera problem: resolving deep ancestral splits in the insect phylogeny. Systematic Biology 62: 285-297.

Monday, May 13, 2013

Non-randomness in Forbes' Celebrity 100 ranking

Some time ago I blogged about The mysterious rankings in Forbes' Celebrity 100. I noted at the time that "There are some other things that we can learn from an analysis of the Celebrity 100 list, but they have nothing to do with networks, so I will not cover them here." I will, however, cover these things now.

Each year since 1999 Forbes magazine has produced a list called the Celebrity 100, which purports "to list the 100 most powerful celebrities of the year" within the USA. The list is based on entertainment-related earnings plus media visibility (exposure in print, television, radio, and online). The 2012 list generated plenty of negative comments around the web, and my network analysis of the data showed that there is little apparent mathematical logic to some of the rankings.

However, the data do also reveal interesting patterns about the perception of celebrity in the media, provided that we accept the quality of Forbes' data (even if we find fault with what Forbes did with those data). In the graphs below I have simply used the information provided by Forbes in order to take a look at some of the features that Forbes did not comment upon.

The first graph plots the celebrity ranking by sex and "profession". Each dot represents one celebrity, with one black and one blue dot per celebrity representing their sex and profession, respectively. They are arranged in the Celebrity 100 order, left to right. You will note that the data are not randomly distributed among the groups.

The graph shows that one third of the celebrities are female, and they dominate the top 10 and the bottom 30. So, in order to get a very high ranking it is best to be female, but outside the top 10 it becomes a handicap.

The other groupings are based on the Forbes description of each celebrity's principal claim to fame. Clearly, in terms of celebrity status: being a musician is better than being an athlete, which is better than being an actor, which is better than being an actress. Being a TV or radio personality is not bad, either. Note that this explains the bi-modal distribution of females: the music females are in the top 10 while the acting females are in the bottom 30.

For the rest, if you are a male, then being a producer/director is marginally better than being an author, which is marginally better than being a comedian. If you are female, then being a model is much worse than being a singer or an actress. Being an entrepreneur works only if you are Donald Trump.

The second graph compares each celebrity's money ranking (based on an estimate of their earnings) with their overall ranking. This is an attempt to see who is financially benefitting from their celebrity status (or vice versa). Once again, each dot represents one celebrity, with the location reflecting their Celebrity 100 rank (decreasing to the right, horizontally) and their earnings (decreasing towards the top, vertically). The two lines on the graph show that for most celebrities (those between the lines) their financial status closely follows their celebrity status.

However, for those at the top-left of the graph their celebrity standing is higher than their earnings. (They are ranked in the top 30 on overall celebrity status but are not in the top 25 money earners.) This means that their manager is "not getting them what they are worth". These people are, from top to bottom on the graph:
Jennifer Aniston
Kim Kardashian – television personality
Angelina Jolie
Brad Pitt
Adele Adkins
Beyoncé Knowles
Katy Perry
Jennifer Lopez
Stefani Germanotta (Lady Gaga)
Rihanna Fenty
Justin Bieber
You will note that there are nine females but only two males in this list. Note, also, the number of singers in the list, indicating that being a singer will get you more celebrity than money.

For those at the bottom-right of the graph their celebrity standing is less than their monetary worth. (They are in the top 25 money earners but are not ranked in the top 25 on overall celebrity status.) This means that their publicity agent is not doing their job (or not being asked to!). These people are, from right to left on the graph:
Mark Burnett – television producer
Kenny Chesney – country music singer
Toby Keith – country music singer
Jerry Bruckheimer – film and television producer
James Patterson – author
George Lucas – film director and producer
Michael Bay – film director and producer
Howard Stern – radio personality
These people are all male, so it is the males who have more money than celebrity. Most of these men do not work directly in the public spotlight, or they perform country music rather than pop music.
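The two rules used to pick out these groups are simple enough to write down. This is a minimal sketch using the cut-offs stated above (top 30 overall but not top 25 for money; top 25 for money but not top 25 overall). The function name and the example ranks are hypothetical, not Forbes' data.

```python
def classify(overall_rank, money_rank):
    """Flag celebrities whose overall rank and money rank are out of step."""
    if overall_rank <= 30 and money_rank > 25:
        return "more celebrity than money"
    if money_rank <= 25 and overall_rank > 25:
        return "more money than celebrity"
    return "in step"


# Hypothetical (overall rank, money rank) pairs.
print(classify(5, 60))   # more celebrity than money
print(classify(70, 10))  # more money than celebrity
print(classify(3, 2))    # in step
```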

One can perform a similar analysis to compare the celebrities' TV/Radio rank with their Press rank. This produces a very similar graph. It turns out that the people whose TV/Radio rank is poor compared to their Press rank are mostly athletes (David Beckham, Roger Federer, Lionel Messi, Li Na, Cristiano Ronaldo, Maria Sharapova), along with one model (Kate Moss) and one producer/director (Steven Spielberg). The thirteen people whose Press rank is poor compared to their TV/Radio rank are almost all TV/Radio "personalities", as expected.

I am sure that there is more to be found in this dataset, if anyone cares to look.

Wednesday, May 8, 2013

Journal of Phylogenetics & Evolutionary Biology?

Many of you will have recently received an email (or two) announcing the impending inaugural issue of the Journal of Phylogenetics & Evolutionary Biology, "an open access, peer-reviewed journal which aims to provide the most rapid and reliable source of information on current developments in the field of phylogenetics and evolutionary biology."

The journal promotional material notes that: "The emphasis will be on publishing quality papers [that will] help establish its high standard and facilitate the journal to be indexed by prestigious ISI and PubMed". Sadly, the journal's flyer indicates that the journal is unlikely to achieve any of these aims, because the people in charge have very little idea of what phylogenetics is:

Only one of these images explicitly relates to a rooted evolutionary history (and it even has reticulations!), but the other images vary from irrelevant to downright wrong.

Publishing "quality papers" will get them nowhere, since we cannot tell whether they will be high quality or low quality, good quality or poor quality. I am sure they will have some sort of quality, because even a used car has that. Caveat emptor. Moreover, perpetuating the transformational view of evolution will not attract the favourable attention of either ISI or PubMed, although this particular viewpoint might be appropriate for the evolution of scientific publishing:

Monday, May 6, 2013

Network analysis of Manhattan apartment buildings

Manhattan has been described as one of the most real-estate-obsessed neighborhoods on earth (after Monaco), and it is just as obsessed with prestige. So, a comparison of the most prestigious apartment (Co-operative and Condominium) buildings is of especial interest, I guess.

In Manhattan, prestige seems to result from such things as the building's overall architecture, the scale and layout of the apartments, the notoriety of its current and past residents, the sheer cost of buying any of the apartments, and the requirement that a purchaser be able to stomach the exorbitant monthly maintenance fees.

However, these are not readily quantifiable attributes, except for the monetary ones, which change from year to year.

CityRealty (a New York City apartment search and resource site) has addressed this conundrum by evaluating the best-known apartment buildings based on a consistent set of non-monetary criteria: CityRealty's New York City Condos & Co-ops. They note:
We rate each building based on its architecture, location and features, using the same scoring methods and criteria for all buildings. The maximum number of points for Architecture is 44, Location 36, and Features 39. However, it is virtually impossible for any building to get the full amount of points for any category.
There are 18 criteria for Architecture (each scoring 1-8 points), 14 criteria for Location (1-5 points), and 22 criteria for Features (1-5 points). CityRealty list 3,085 apartment buildings, but only 1,943 of them have ratings.

I have concerned myself only with those buildings that have ratings ≥ 88 (the ratings vary from 30 to 99), which amounts to 95 buildings. These buildings differ very little in the top-scoring criteria, which is to be expected if they are considered to be the top-rated buildings. These criteria include: Distinction of exterior (8 points), Retail quality (5), Street ambience (5), Distance to business district (5), Distinction of lobby (5), and Number of units per floor (5).

However, these buildings do differ considerably on the other criteria, which include presence or absence of all sorts of "desirable" characteristics, such as: gargoyles, illumination, water element, recreational roof, garage, maid's room, elevator person, and absence of external air conditioners. This makes a network analysis possible, which will summarize the similarities among the various buildings.

The analysis

I compiled a list of 57 of the buildings for analysis, including all of the top buildings as rated by CityRealty (37 buildings; scores 92–99), plus a selection of others (20 buildings; scores 88–91) that appear to be noteworthy as indicated by various internet lists (eg. based on architecture, prestige, history, cost of apartments). I then collated the data provided by CityRealty. (See the Postscript for a comment on the data.) There are 29 Co-operatives and 28 Condominiums in the analysis. However, two buildings have identical scores, because the Time Warner Center and the Mandarin Oriental are two towers in the same development — the apartments in the south tower have a One Central Park address (Time Warner Center) and those in the north tower are The Residences at the Mandarin Oriental.

As usual for my data analyses, I have compared the buildings based on what is (appropriately!) called the Manhattan distance and then calculated a NeighborNet network. Buildings that are closely connected in the network are similar to each other based on the various criteria used by CityRealty, and those that are further apart are progressively more different from each other.
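
For readers unfamiliar with it, the Manhattan distance between two buildings is simply the sum of the absolute differences in their scores across all of the criteria. The following is a minimal sketch in Python of how such a distance matrix could be computed. The building names and scores are made up for illustration (they are not the CityRealty data), the function names are my own, and the sketch assumes that every building is scored on the same criteria in the same order.

```python
from itertools import combinations


def manhattan_distance(a, b):
    """Sum of absolute differences between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(scores):
    """Return a symmetric dict-of-dicts of pairwise Manhattan distances."""
    names = list(scores)
    # Every building is at distance 0 from itself.
    matrix = {name: {name: 0} for name in names}
    for first, second in combinations(names, 2):
        d = manhattan_distance(scores[first], scores[second])
        matrix[first][second] = d
        matrix[second][first] = d
    return matrix


# Hypothetical scores on four criteria (NOT the real CityRealty data).
example_scores = {
    "Building A": [8, 5, 1, 0],
    "Building B": [8, 4, 0, 1],
    "Building C": [6, 5, 5, 5],
}

matrix = distance_matrix(example_scores)
for name, row in matrix.items():
    print(name, [row[other] for other in example_scores])
```

A matrix like this is only the input to the network step. The NeighborNet calculation itself would be done by a program that implements that algorithm (SplitsTree is one such program), and it is not reproduced here.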

The network shows seven clusters, which I have color-coded. These clusters represent buildings that have many characteristics in common. Notably, these buildings are also clustered in space, as shown in the map below (also available on Google Maps), which is color-coded to match the network. (Note that yellow is a bit hard to represent in the network.) In particular, the colors occur as follows:
  • light blue – around the fringe of the Upper East Side
  • red – Upper East Side next to Central Park
  • yellow – along the west side of Central Park
  • pink – Upper West Side and south-west Central Park
  • purple – south-east corner of Central Park + 2 Upper East Side + 1 Financial District
  • blue – mostly around the southern and eastern sides of Central Park + 2 Midtown East + 1 Financial District
  • green – Downtown + 2 west of Central Park

The strong geographical clustering of the different types of buildings within Manhattan is not unexpected, since many of the areas were developed at the same time and in a similar architectural style. (This geographical result is not because of the importance of Location in the CityRealty scores, since most of the buildings score very well on all of the Location criteria, except Traffic noise.) Important differentiating Architecture criteria include: presence of a Plaza or Atrium, Water Element, Illumination, Non Rectilinear Form, Ceiling Height, and Balconies.

There are also perceived differences in the desirability of the various areas, which means that nearby buildings often provide similar Features (such as Recreational Roof, Elevator Person, Maids Room, Garage, or Catering). This is further related to the non-random distribution of the two apartment types: all green are condominiums; all light blue and yellow are co-operatives; all except one of the red are co-ops; all except two of the blues are condos; and only the pink and purple are mixtures of the two types. Co-operative buildings tend to provide a range of more expensive features than do the condominiums (most of the co-ops are at the bottom-left of the network graph).

The purple group are all similar in ambience, in that they are buildings that include both a hotel and apartments (usually, the lower part is the hotel and the tower is a co-op or condo). The exception is the Cipriani building, which is part of a world-wide chain. The pink group were all built at a similar time (1902-1908, except the Dakota in 1882) and in a similarly opulent style, and they are now designated as historic landmarks.

CityRealty also provides lists of the Top 10 Most Prestigious Co-ops and the Top 10 Most Prestigious Condos. These are indicated with numbers and letters, respectively, in the network diagram. These buildings are somewhat clustered in the network, but it is clear that "prestige" is not directly related to the criteria used by CityRealty in their ratings (if it were, then the buildings would be much more clustered in the network). Furthermore, buildings such as River House are not necessarily as prestigious as they once were (see here and here), and so their places in these lists might be contested.

It is also worth noting that not all of the most expensive buildings are necessarily in the list analyzed. For example, in 2012 some very expensive apartments were also sold in the co-op buildings at 785 Fifth Avenue, 884 Fifth Avenue, and 1030 Fifth Avenue, which are not included in the network analysis.

Finally, not all of the apartment buildings discussed here are necessarily lived in by their owners, particularly those in condominium buildings. For example, the New York Times has noted:
In a large swath of the East Side bounded by Fifth and Park Avenues and East 49th and 70th Streets, about 30 percent of the more than 5,000 apartments are routinely vacant more than 10 months a year because their owners or renters have permanent homes elsewhere, according to the Census Bureau’s latest American Community Survey.
This is particularly true of the most expensive condo apartments:
Pieds-à-terre exist throughout the New York City condo market, a separate little world of vacation homes and investment properties. But the higher the price, the higher the concentration is likely to be of owners who spend only a few months, a few weeks or even just a few days each year in their apartments. This very costly form of desolation means that some of the city’s most expensive residential buildings stand mostly dark, lonesome and empty on the inside.

Postscript

I should point out, in passing, that CityRealty have not been as consistent in their ratings as might be hoped. For example, they note that: "On occasion, we may add (or subtract) a few points based on our subjective view of the building, so if the numbers don't add up exactly as you expect, that's why." I have ignored these extra subjective points in my analysis.

What I have not been able to ignore is some of the other inconsistencies. For example, "The Collection" building stands out like a sore thumb in the Sutton Place area (it is glass while its near neighbors are brick), as does "40 Bond Street" (it is bright green while its neighbors are brick or stone). Nevertheless, CityRealty have coded them: "Contextual Design: No, but Very Good", and scored both buildings 3 out of 3 on this criterion. Clearly, this confounds two criteria, Distinction of Exterior and Contextual Design, as CityRealty are allowing their claim that each building is an "Architectural Masterpiece" (and thus they score 8 out of 8 on Distinction of Exterior) to cloud their decision about whether each building also has "Contextual Design" (where CityRealty admit that each building should score 0 out of 3). Even more oddly, they also code "The Gainsborough" building exactly the same way ("Contextual Design: No, but Very Good") and yet, in the photo they show, this building seems to fit perfectly into its context. Indeed, "The Collection" might also fit in, if it were in a more modern location than Sutton Place, but there seems to be no such hope for "40 Bond Street".

Wednesday, May 1, 2013

Releasing phylogenetic data

One approach that I have taken in this blog to popularizing the use of networks in phylogenetic analysis has been to investigate published data using network techniques. However, this is often difficult because the data have not been publicly made available (eg. Phylogenetic position of turtles: a network view).

I am not the only person to find fault with the failure to release phylogenetic data, although there are recognized reasons why data sometimes cannot be released. Razib Khan at the Gene Expression blog recently had this to say (Why not release data for phylogenetic papers?):
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn't release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs). 
Yesterday I noticed PLoS Genetics published a paper on the genomics of Middle Eastern populations ... The results were moderately interesting, but bravo to the authors for putting their new data set online. The reason is simple: reading the paper I wanted to see an explicit phylogenetic tree/graph to go along with their figures (e.g., with TreeMix). Now that I have their data I can do that.
In this particular case the data were made available on the homepage of one of the authors, which is better than nothing but is clearly less than ideal. There are a number of formal repositories for phylogenetic data, all of which should have greater longevity than any personal homepage, including:
The first of these databases has a long history of storing phylogenetic trees and their associated datasets. It has not yet lived up to its full potential, but people like Rod Page are pushing for it to do so eventually.

Dryad is a more general data repository (ie. not just for phylogenetic data), and its use is now encouraged by many of the leading journals — Systematic Biology, for example, makes its use mandatory, at least for data during the submission process, and also for "data files and/or other supplementary information related to the paper" for the published version.

Phylogeny databases are not without their skeptics, however. For example, Rod Page (Data matters but do data sets?) has noted:
How much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses. 
Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much"). 
But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would needs to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.
However, all of this begs the question that seems to me to be central to science. Science is unique in being based primarily on evidence rather than expert opinion, and therefore the core of science must be direct access to the original evidence, rather than some statistical summary of it or someone's opinion about it. How can I evaluate evidence if I don't have access to it? How can I verify it, explore it, or re-analyze it? Being given the raw data (eg. the sequences) is one thing, but being given the data you actually analyzed and based your conclusions on (eg. the aligned sequences) is another thing entirely.

In short, if you won't openly give me your dataset then I don't see how you can call yourself a serious scientist.

Note: see also this later post: Public availability of phylogenetic data