Wednesday, April 29, 2015

Consanguinity and incest can produce the same effects

I have noted before that Pedigrees and phylogenies are networks not trees. For example, a human family "tree" is a tree only if it includes one sex alone. Otherwise, it must be a network when traced backwards from any single individual through both parents, because the lineages must eventually coalesce in a pair of shared common ancestors.

This potentially creates a problem for maintaining genetic diversity within species. If a pedigree is tree-like, then each person would, for example, have 32 great-great-great grand-parents. These 32 people's genes are mixed more-or-less randomly (depending on recombination and assortment) to produce the great-great-great grand-child. This heterozygosity is a good thing, evolutionarily, because there is then genetic diversity within that person.

However, inbreeding turns a tree into a network. This increases the probability that identical alleles will be paired in any one individual. If deleterious recessive alleles are thereby expressed, then genetic problems can ensue, which is called inbreeding depression. This outcome is not inevitable, though; it depends on the probability of alleles becoming paired. Indeed, for domesticated organisms, inbreeding is the norm (see Thoroughbred horses and reticulate pedigrees).

I have discussed examples of well-known historical figures who have encountered the unfortunate effects of inbreeding, including Charles Darwin (Charles Darwin's family pedigree network) and Henri Toulouse-Lautrec (Toulouse-Lautrec: family trees and networks). In both cases the problems arose because of consanguineous relationships, which involve people who are first cousins or more closely related.

I have also discussed the extreme case of consanguinity, incest. In particular, royalty have often been exempt from taboos against sibling and parent-child couplings, as noted in Tutankhamun and extreme consanguinity and also in Cleopatra, ambition and family networks. At least for Tutankhamun there is evidence of genetic problems (an accumulation of malformations is evident), but apparently not in Cleopatra's case (there is no convincing evidence of infertility, infant mortality or genetic defects, for example). Royalty have not been the only exceptions to the incest taboo (see Evolutionary fitness and incest).

In Tutankhamun's case it has been suggested that his mother was a sister (name not known) of his father, Akhenaten. This is surprising, because only two wives of Akhenaten, Nefertiti and Kiya, are known to have had the title of Great Royal Wife, which the mother of the royal heir should bear. As a way out of this dilemma, Marc Gabolde has suggested that the apparent genetic closeness of Tutankhamun's parents is because his mother was his father's first cousin, Nefertiti. The apparent genetic closeness is then not the result of a single brother-sister mating but instead is due to three successive instances of marriage between first cousins.

To explain this idea we can look at a historical example. How consanguinity can produce the same genetic effects as incest is illustrated by the Spanish branch of the Habsburg dynasty in 1700, as discussed in Family trees, pedigrees and hybridization networks.

This example can be explained using inbreeding F values. For any specified offspring, these indicate the probability of paired alleles being identical by descent (ie. due to the close relationship of the parents). For close family relationships the F values are:
uncle-niece = aunt-nephew: 0.125
double first cousins: 0.125
first cousins: 0.063
first cousins once removed: 0.031
second cousins: 0.016
Note that incest produces F values of 0.250 while consanguinity values are 0.063 or greater.
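As a rough check on where such F values come from, here is a minimal Python sketch of the standard path-counting formula, F = sum of (1/2)^n over every loop that joins the two parents through a common ancestor, where n is the number of individuals on the path from one parent up to the ancestor and down to the other parent. It assumes that the common ancestors are not themselves inbred. The function name and the table of path sizes are my own illustration, not taken from any particular program.

```python
from fractions import Fraction


def inbreeding_f(path_sizes):
    """Return F for an offspring whose parents are joined by the given loops.

    Each entry is the number of individuals on one path running from one
    parent, up to a common ancestor, and down to the other parent.
    Assumption: the common ancestors themselves have F = 0.
    """
    return sum(Fraction(1, 2) ** n for n in path_sizes)


# One entry per common ancestor (illustrative table, not from the post).
RELATIONSHIPS = {
    "parent-offspring": [2],            # the parent is the common ancestor
    "brother-sister": [3, 3],           # two shared parents
    "uncle-niece": [4, 4],              # two shared (grand)parents
    "double first cousins": [5, 5, 5, 5],
    "first cousins": [5, 5],
    "first cousins once removed": [6, 6],
    "second cousins": [7, 7],
}

if __name__ == "__main__":
    for name, paths in RELATIONSHIPS.items():
        f = inbreeding_f(paths)
        print(f"{name:28s} F = {f} (about {float(f):.4f})")
```

Exact fractions are used so that the values can be compared without rounding; first cousins, for example, come out as exactly 1/16.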

If we consider the case of King Charles II of Spain (1661-1700), then his inbreeding F = 0.254, which was achieved entirely without incestuous relationships. His pedigree is shown in the post Family trees, pedigrees and hybridization networks.

This pedigree shows that the parents of each person had the following relationships:

himself = uncle-niece [ie. his parents were uncle and niece]

father = first cousins once removed [ie. his father's parents were first cousins once removed]
mother = first cousins

father's father = (a) = uncle-niece
father's mother = (b) = uncle-niece
mother's father = first cousins
mother's mother = first cousins once removed

father's father's father = not closely related
father's father's mother = first cousins
father's mother's father = not closely related
father's mother's mother = not closely related
mother's father's father = uncle-niece
mother's father's mother = second cousins
mother's mother's father = see person (a)
mother's mother's mother = see person (b)

Thus, on his father's side he was the third generation of consecutive consanguinity, and on his mother's side he was the fourth generation of consecutive consanguinity. This is simply an accumulating series of probabilities — consanguinity potentially produces problems and consecutive consanguinity simply increases the probability.
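The accumulation can be illustrated with a small recursive calculation of the kinship coefficient, since the F of a child equals the kinship coefficient of its parents. The sketch below is only an illustration: the pedigree is entirely hypothetical (it is not the Habsburg pedigree), and the names PEDIGREE, kinship and inbreeding are my own. It shows an uncle-niece marriage in which the uncle's own parents were first cousins.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: child -> (parent, parent). Anyone not listed as a
# key is treated as an unrelated, non-inbred founder.
PEDIGREE = {
    "B1": ("G1", "G2"),  # B1 and B2 are siblings
    "B2": ("G1", "G2"),
    "C1": ("B1", "X1"),  # C1 and C2 are first cousins
    "C2": ("B2", "X2"),
    "S1": ("C1", "C2"),  # S1 and S2 are siblings, children of first cousins
    "S2": ("C1", "C2"),
    "N": ("S2", "Y"),    # N is the niece of S1
    "K": ("S1", "N"),    # K is the child of an uncle-niece marriage
}


def depth(person):
    """Generations between this person and the most distant founder."""
    parents = PEDIGREE.get(person)
    if parents is None:
        return 0
    return 1 + max(depth(p) for p in parents)


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that random alleles drawn from a and b are identical by descent."""
    if a == b:
        parents = PEDIGREE.get(a)
        if parents is None:
            return Fraction(1, 2)
        return Fraction(1, 2) * (1 + kinship(*parents))
    # Expand the deeper individual, which cannot be an ancestor of the other.
    if depth(a) < depth(b):
        a, b = b, a
    parents = PEDIGREE.get(a)
    if parents is None:  # two different founders are assumed unrelated
        return Fraction(0)
    first, second = parents
    return Fraction(1, 2) * (kinship(first, b) + kinship(second, b))


def inbreeding(person):
    """F of a person is the kinship coefficient of that person's parents."""
    parents = PEDIGREE.get(person)
    return Fraction(0) if parents is None else kinship(*parents)
```

In this made-up pedigree, inbreeding("S1") is 1/16 (the child of first cousins), while inbreeding("K") is 9/64. That is a little above the 1/8 expected for an uncle-niece marriage on its own, because the earlier cousin marriage adds to it.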

It is not surprising, then, that Charles suffered genetic problems (he was disfigured, physically disabled and mentally retarded) to such an extent that his royal lineage came to an end, and the Spanish branch of the Habsburg dynasty ceased to rule.

Incidentally, the scientist who devised the quantity F, Sewall Wright, himself had a rather high amount of inbreeding — his parents were first cousins.

Monday, April 27, 2015

A phylogenetic network of late-night US television shows

"Late night" broadcasting on United States network / cable TV starts at about 11:00 or 11:30 pm, and goes for a couple of hours. Many networks broadcast similar shows during this time, which directly compete against each other for the available audience (which is currently estimated to be slightly in excess of 10 million people per night at 11:30 pm). Many of these shows have been on for a long time. Most of them are recorded on several weekday nights in front of a live audience, and they are usually associated with only a very few presenters over time (almost always men!).

For example, since the early 1990s we have had:
NBC Tonight Show: Jay Leno 1992-2009; Conan O'Brien 2009-2010; Jay Leno 2010-2014; Jimmy Fallon 2014-
NBC Late Night: David Letterman 1982-1993; Conan O'Brien 1993-2009; Jimmy Fallon 2009-2014; Seth Meyers 2014-
CBS Late Show: David Letterman 1993-2015
CBS Late Late Show: Tom Snyder 1995-1999; Craig Kilborn 1999-2004; Craig Ferguson 2005-2014; James Corden 2015-
ABC Kimmel Live: Jimmy Kimmel 2003-
ABC Nightline: Ted Koppel 1980-2005; Three-anchor team 2005-
ComedyCentral Daily Show: Craig Kilborn 1996-1998; Jon Stewart 1999-
ComedyCentral Colbert Report: Stephen Colbert 2005-2014
TBS Conan: Conan O'Brien 2010-

Eventually, the presenters retire or move elsewhere, and the other presenters then move around among the shows. This has led to the so-called "Late night wars", in which the NBC studio executives in charge repeatedly show that their personnel management skills are often lacking. For example, David Letterman was expected to replace Johnny Carson when he retired as the host of the NBC Tonight Show in 1992, but the job was given to Jay Leno, instead. So, Letterman moved to a directly competing show on CBS. When Leno subsequently moved to another show, Conan O'Brien took over. However, Leno then moved back again, and so O'Brien moved to a directly competing show on TBS. The media interest in these shenanigans exceeded their interest in the shows themselves.

Another substantial decision was that by ABC, at the end of 2012, to swap the timeslots of Nightline (which used to run 11:35-12:00) and Kimmel Live (which ran 12:00-1:00). This had a notable effect on the audience numbers, because Nightline was one of the top two shows in its original timeslot whereas Kimmel Live currently gets about 1 million viewers fewer per night in that same slot. On the other hand, Nightline in its new timeslot gets about the same audience as Kimmel Live did when it occupied the slot. That seems to be a net loss of audience for ABC.

The Nielsen Media Research viewing data are available online at the TV by the Numbers site. They provide the weekly averages for each show in millions of viewers, based on what is known as "live plus same day" viewing (ie. the audience at the time of broadcast plus same-day viewing of video recordings). The data I have looked at run from early December 2011 to the end of December 2014 (161 weeks). Unfortunately, these data rely on NBC press releases (rather than direct access to Nielsen), so there are some missing data.

The comparison of these shows can be visualized using a phylogenetic network, as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the nine shows using the Manhattan distance; and a Neighbor-net analysis was then used to display the between-show similarities as a phylogenetic network. So, shows that are closely connected in the network are similar to each other based on their audience figures across the three years, and those that are further apart are progressively more different from each other.
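The distance step can be sketched in a few lines of Python. This is only an illustration: the show names and audience figures below are invented, and the way missing weeks are handled (summing over the weeks that both shows have, then rescaling to the full number of weeks) is my assumption rather than a description of the analysis actually performed. The Neighbor-net step itself is not shown.

```python
from itertools import combinations

# Hypothetical weekly audiences in millions; None marks a missing week.
AUDIENCE = {
    "Show A": [3.5, 3.6, None, 3.4],
    "Show B": [2.8, 2.9, 2.7, None],
    "Show C": [0.8, 0.9, 0.8, 0.7],
}


def manhattan(x, y):
    """Manhattan distance over the shared weeks, rescaled to all weeks."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("the two series have no weeks in common")
    total = sum(abs(a - b) for a, b in pairs)
    # Rescale so that pairs with more missing weeks are not made to look closer.
    return total * len(x) / len(pairs)


def distance_matrix(data):
    """Return a symmetric dict of pairwise distances keyed by (name, name)."""
    names = sorted(data)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        dist[(a, b)] = dist[(b, a)] = manhattan(data[a], data[b])
    return dist


if __name__ == "__main__":
    for (a, b), d in sorted(distance_matrix(AUDIENCE).items()):
        if a < b:
            print(f"{a} vs {b}: {d:.2f}")
```

A matrix of this kind is the input that a network program needs; two shows with similar weekly audiences end up with a small distance and therefore sit close together in the network.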

The network shows a gradient of increasing audience size, from bottom-left to top-right. So, the Tonight Show consistently got an average nightly audience of c. 3.5 million people, while Conan had c. 0.8 million. The two CBS shows both consistently did somewhat worse than their NBC timeslot competitors.

The two ABC shows apparently did well, but this is confounded by the timeslot swap noted above. Nightline did well for the first year (before it was moved) but not for the following two years, while Kimmel Live did the opposite. This is what creates the big reticulation in the middle of the network, as all of the other shows had fairly consistent audiences throughout the three years.

However, there was a steady decrease in the total audience size across the three years, from c. 12 million per night (at 11:30 pm) at the end of 2011 to c. 10 million at the end of 2014. The only major exception to this was at the time when Jimmy Fallon took over from Jay Leno (early 2014). For several weeks the Tonight Show audience increased to >8 million per night, so that the total audience was c. 15.5 million (a 50% increase). This shows just how many people are available to be added to the late-night viewing, compared to how many watch regularly. So, why are they not watching in the other weeks? It seems that Late Night Television is not reaching its full potential.

Wednesday, April 22, 2015

Do we need more terms for homology?

Homology is a concept that is fundamental to biological studies, and yet it is difficult to define. Generally, characters are considered to be homologous among organisms if they have been inherited from a common ancestral character.

Homology is thus at the heart of phylogenetics, as it expresses the historical relationships among characters, whereas a phylogeny expresses the historical relationships among taxa (including individuals). Since the relationships among the taxa are based on pre-existing information about the relationships among the characters, homology must be established first. It is for this reason that multiple sequence alignments, for example, are so valuable.

However, homology is a relative concept; that is, it is context sensitive. It only applies locally, to any one level of the hierarchy of character generalization. The classic example of this idea is bird wings versus bat wings. These structures are homologous as forelimbs but not as wings – birds and bats independently modified their forelimbs into wings. So, homology exists at the more general level (forelimbs) but not at the less general level (wings). Forelimbs developed first in evolutionary history (the common ancestor of animals with four legs is ancient), and later these forelimbs were modified in different descendants, with some developing wings, some flippers, and some arms. Wings, flippers and arms are more recent, and are thus less general.

So, we can conceptualize characters as existing at many hierarchical levels of generality, depending on when they developed. We might have (going from specific to general) nucleotides, amino acids, protein domains, proteins, biosynthetic pathways, developmental origins, and anatomy, among many possible conceptual levels. Lower levels in the hierarchy "control" the upper levels, so that nucleotides code for amino acids, domains consist of strings of amino acids, proteins function as enzymes in biosynthesis, and development is controlled by biosynthetic pathways.

A nucleotide insertion and compensatory deletion results in two amino acid substitutions,
so that simultaneously aligning homologous nucleotides and homologous amino acids is no longer possible

The issue is that homology among characters can only be determined within any one hierarchical level. As noted by Fitch (2000): "Life would have been simple if phylogenetic homology necessarily implied structural homology or either of them had necessarily implied functional homology. However, they map onto each other imperfectly".

For example, homology of amino acids among a group of organisms does not necessarily imply that all of their coding nucleotides are homologous (see the figure above) — originally the nucleotides would also have been homologous, but insertions and deletions through time can break the original relationship between the amino acids and their coding nucleotides. So, one cannot always simultaneously align homologous amino acids and homologous nucleotides.

Similarly, homology of two anatomical features does not necessarily imply that their developmental sequences are homologous. This is an issue that the study of evo-devo has made increasingly obvious. That is, sometimes identity of morphological characters is not the result of identity of the sets of genes that control their development (Meyer 1999; Mindell and Meyer 2001; Wagner 2014) — non-homologous genes and gene networks can produce morphological structures that are usually considered to be homologs, and non-homologous structures can express homologous genes.

Developmental biologists therefore often prefer a process-oriented concept of homology, which they call 'biological homology', where homologous features are those sharing a set of developmental constraints (Wagner 1989). Indeed, the terms 'syngeny' (Butler and Saidel 2000) and 'homocracy' (Nielsen and Martinez 2003) have been coined to describe morphological features that are organized through the expression of homologous gene networks, irrespective of whether those features are evolutionarily homologous or convergent.

Reticulation and homology

This idea can be extended to other evolutionary scenarios. The one I am particularly interested in here is the consequence of reticulation. In the situations discussed above the character modifications (ancestral to derived) come from "within" the lineage (traditional ancestor-descendant gene inheritance), but the modifications can also come from "outside", by gene flow.

For example, Andam and Gogarten (2013) have noted that horizontal gene transfer (HGT) can in fact be used to provide information for the concept of a Tree of Life, because a transferred gene can also be regarded as a shared derived character. That is, HGT of a gene into an ancestor forms a synapomorphy for its descendants. This gene may subsequently diversify among those descendants, even following a simple tree-like pattern of descent.

This creates a terminological issue. If diversification occurs, then these genes are homologous in the traditional sense (they are modified descendants of a common ancestral character). However, how do they compare to genes in the descendants of species that did not receive the HGT, and to the genes from which the transfer occurred? In the first case they are not applicable (just as the concept of wings is not applicable to animals with flippers). In the second case our current concept of homology does not apply in any simple sense.

The hierarchical concept of homology is tied to a tree model of evolution. The hierarchical nature of characters results from the nested hierarchy of taxon relationships. If there is no nested hierarchy of taxon relationships then our current concepts of homology are inadequate. We need terms that describe possible reticulate relationships among the characters, not just hierarchical ones.

Thus, along with modifications to the concept of monophyly (see Monophyletic groups in networks), networks imply that we need modifications to the concept of homology, as well.


It is worth noting that a similar issue applies in other fields that are based on a concept of evolutionary history. For example, in historical linguistics words are considered to descend from ancestral languages and diversify among multiple daughter languages. These words are considered to be cognate (cf. homologous). However, words are also borrowed from unrelated languages, and these are loan words (cf. HGT). Loan words may also diversify among the daughter languages, both in the original language and in the borrowing language.

For example, the Germanic word *rīks (ruler) was borrowed from Celtic *rīxs (king), and it has come down to modern times as German 'Reich', English 'rich' (West Germanic), Swedish 'rike' (North Germanic), and Gothic 'reiks' (East Germanic) (see Wikipedia). This diversification has followed Grimm's Law, a regular phonological change that defines the Germanic family — so, the subsequent development of the loan word allows reconstruction of the evolutionary history, and the descendants are cognate. But are they cognate to the words descended from *rīxs within Celtic?


Andam CP, Gogarten JP (2013) Biased gene transfer contributes to maintaining the Tree of Life. In: Lateral Gene Transfer in Evolution (U Gophna, ed.), pp 263-274. Springer: New York.

Butler AB, Saidel WM (2000) Defining sameness: historical, biological, and generative homology. Bioessays 22: 846-853.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16: 227-231.

Meyer A (1999) Homology and homoplasy: the retention of genetic programmes. In: Homology (GR Bock, G Cardew, eds), pp. 141-157. Wiley: Chichester.

Mindell DP, Meyer A (2001) Homology evolving. Trends in Ecology and Evolution 16: 434-440.

Nielsen C, Martinez P (2003) Patterns of gene expression: homology or homocracy? Development Genes and Evolution 213: 149-154.

Wagner GP (1989) The biological homology concept. Annual Review of Ecology and Systematics 20: 51-69.

Wagner GP (2014) Homology, Genes, and Evolutionary Innovation. Princeton University Press: Princeton NJ.

Monday, April 20, 2015

Domestication networks are complicated

Phylogenetic networks were developed as a professional tool for displaying complicated evolutionary histories. However, this does not mean that such networks cannot be used elsewhere.

As an example, Pete Buchholz produces drawings of dinosaurs as the artist Ornithischophilia at the DeviantArt web site. Among these drawings are some phylogenies, and two of them are networks.

The first one is labelled Citrus is complicated, and refers to the origin of citrus cultivars.

The phylogenetic tree at the left is sourced from the American Journal of Botany, while the network at the right is from information in Wikipedia. The combination of the two appears to be original to the artist. The network is read from left to right; for example, the Limequat is a hybrid of the Key Lime and the Kumquat. Compared to the original Wikipedia text, the picture speaks a thousand words.

The second network is labelled Apples are complicated, and refers to the origin of some of the apple cultivars.

No source is given for the information, but I assume that it also comes from Wikipedia. Note that, as before, the network is read from left to right, but this time there is a time scale at the top. The artist refers to it as a "spaghetti diagram", and notes that:
Colors are based on the major parent that the "story" revolves around; purple for Honeycrisp, Yellow for Golden Delicious, Red for Jonathan, Maroon for Red Delicious, Orange for Cox's Orange Pippin, Teal for McIntosh, Green for Granny Smith, and Blue for Topaz.

Wednesday, April 15, 2015

What we know, what we know we can know, and what we know we cannot know

This is a guest blog post by:

Johann-Mattis List

Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology

In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.

The task of scholars working in these disciplines is similar to the task of a crime investigator or a doctor: Detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state of an initial state and a general rule. On the other hand, abduction seeks to infer initial states from result states by employing a general rule.

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
According to modern historical linguistics theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun" and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, controversially debated. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as a pure fiction (Schmidt 1872).

Linear B

While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. In a similar way to that with which we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.

Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.


What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some cluster procedure to cluster genes in sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.


Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprache. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

Monday, April 13, 2015

Evolution and timelines, 2

I have noted before (Evolution and timelines) that any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

One ambitious example is The Evolution of the Web. Two images are shown below, which illustrate some of the transformational history of web browsers and technology, depicted as complex timelines. This represents complex transformational evolution (see The evolutionary March of Progress in popular culture), rather than variational evolution.

The full majesty, and complexity, of the timeline can be seen at the interactive version linked above.

Wednesday, April 8, 2015

Using networks, not trees, to display hybrids

Phylogenetic networks are intended to display reticulate evolutionary histories, rather than strictly divergent or transformational histories. This idea applies both to species and higher taxa (where the ancestors might be inferred), and to individuals and populations (where some of the ancestors might be sampled). However, the literature is still replete with studies that use one or more phylogenetic trees for displaying reticulate phylogenies.

A recent example is shown by: Umer Chaudhry, Elizabeth M. Redman, Muhammad Abbas, Raman Muthusamy, Kamran Ashraf, John S. Gilleard (2015) Genetic evidence for hybridisation between Haemonchus contortus and Haemonchus placei in natural field populations and its implications for interspecies transmission of anthelmintic resistance. International Journal for Parasitology 45: 149-159.

These authors sampled nematode parasites from sheep, goats, cattle and buffaloes at abattoirs in Pakistan and southern India. These parasites were morphologically characterized as being predominantly either Haemonchus contortus or Haemonchus placei. The worms were then genotyped in several ways, including: SNPs of rDNA ITS-2, microsatellite markers, sequences of nuclear isotype-1 of β-tubulin, and sequences of mitochondrial NADH dehydrogenase subunit 4. The genotyping revealed several individual worms that were considered to be inter-species F1 hybrids.

The phylogenetic tree from the β-tubulin sequences is shown in the first figure. There were 25 haplotypes identified among the worms. Most of the worms were homozygous, with haplotypes that were identified as either H. contortus or H. placei. However, five worms were discovered to be heterozygous, with one haplotype considered to have come from each of the species.

The hybrid status of the worms is shown in the phylogenetic tree by having the hybrids appear twice, once for each of their haplotypes, with the other worms appearing only once. Thus, the actual reticulate history is not made visually obvious.

A better approach would be to use a phylogenetic network. This is straightforward in this case. From the perspective of the worms (rather than the haplotypes), the phylogenetic tree is a so-called MUL-tree, in which some of the taxon labels appear multiple times (and some appear only once). The labels that appear once represent homozygous worms, which can be seen as being "monoploid" for this locus. The labels that appear twice represent heterozygous worms, which can be seen as being "diploid".
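As a minimal sketch of the bookkeeping described above, the following Python counts how often each leaf label occurs in a MUL-tree written in Newick format. It does not reproduce what the Padre program does. The tree string and the worm names are hypothetical, chosen only for illustration, and the function name `leaf_label_counts` is my own.

```python
import re
from collections import Counter


def leaf_label_counts(newick: str) -> Counter:
    """Count how many times each leaf label occurs in a Newick string."""
    # A leaf label directly follows "(" or ","; internal-node labels
    # follow ")" and branch lengths follow ":", so neither is captured.
    labels = re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
    return Counter(labels)


# Hypothetical MUL-tree (not the published data): worm_3 carries one
# haplotype from each of the two clades, so its label appears twice.
mul_tree = "((worm_1,worm_2,worm_3),(worm_3,worm_4,worm_5));"

counts = leaf_label_counts(mul_tree)
once = sorted(name for name, n in counts.items() if n == 1)   # "monoploid" labels
twice = sorted(name for name, n in counts.items() if n == 2)  # "diploid" labels

print("appear once: ", once)
print("appear twice:", twice)
```

Labels that occur twice are the candidates for reticulation nodes when the MUL-tree is folded into a network.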

MUL-trees where the labels represent different ploidy levels can easily be turned into a network using the Padre program. The result is shown in the next figure, which is therefore a hybridization network.

The actual history of the worms is now clear. Interestingly, one of the hybridization events seems to be older than the other four.

As an aside, it is also worth pointing out a mis-interpretation of the phylogenetic tree produced from the mitochondrial ND4 sequences. This tree is shown in the next figure — I have added the annotations at the right.

The phylogeny shows 12 haplotypes considered to be H. contortus and 14 haplotypes considered to be H. placei. One of the hybrids clearly has a H. contortus haplotype, indicating that its maternal parent came from this species. However, the other four hybrids cannot be unequivocally identified as having H. placei mothers (as claimed by the authors), as their haplotypes are all sisters to the H. placei haplotypes — all of the H. placei haplotypes share a common ancestor that is not shared with the hybrids. Given the root of the tree, H. placei is a more likely identification than is H. contortus, but the tree does not provide unequivocal evidence.

Monday, April 6, 2015

Network of business office-space costs

The cost of renting or leasing office space differs dramatically around the world. This is obviously of great importance to businesses, as their profitability depends on the balance between income and costs. Their expenditure on office space can thus determine whether or not it is profitable for them to do business in certain cities.

The CBRE Group Inc. is an American commercial real estate company, and they provide an annual Global Prime Office Occupancy Costs report that addresses this business cost. It is a survey of office occupancy costs for prime office space in a large number of cities worldwide. Occupancy costs for business premises represent rent, plus local taxes and service charges. The report notes that: "The occupation cost figures have also been adjusted to reflect different measurement practices from market to market."

Each report lists the top 50 most expensive office locations in the world during the previous year, along with the average occupancy cost (in US$ / sq ft / annum). The locations examined may be the central business district of each city or several parts of some cities, depending on how much office space is available. The list of locations continues to expand every year, but only the top 50 are ever listed in each report.

The CBRE web site currently contains the data for the years 2008-2010 and 2012-2014. There are 71 locations that have appeared in these six top-50 lists, although only 30 of them have appeared in the top 50 in all six years (and seven have appeared only once).
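The counting involved here is simple set arithmetic. As a small sketch, using made-up location names and three short lists rather than the actual CBRE data, it could be done like this in Python:

```python
from collections import Counter

# Hypothetical yearly "top" lists (placeholders, not the CBRE data).
yearly_lists = {
    2012: ["A", "B", "C"],
    2013: ["A", "B", "D"],
    2014: ["A", "C", "E"],
}

# Number of reports in which each location appears.
appearances = Counter(
    loc for locations in yearly_lists.values() for loc in set(locations)
)

ever_listed = set(appearances)
always_listed = {loc for loc, n in appearances.items() if n == len(yearly_lists)}
listed_once = {loc for loc, n in appearances.items() if n == 1}

print(len(ever_listed), sorted(always_listed), sorted(listed_once))
```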

Of course, a phylogenetic network could be used to visualize the data for each location across the six reports, as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the 30 main locations using the Gower similarity; and a Neighbor-net analysis was then used to display the between-location similarities as a phylogenetic network. So, locations that are closely connected in the network are similar to each other based on their office costs across the six years, and those that are further apart are progressively more different from each other.
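For readers unfamiliar with the Gower similarity, here is a minimal sketch for purely quantitative data, written in plain Python. For each variable (here, a year) the absolute difference between two locations is divided by the range of that variable, and the similarity is one minus the average of these scaled differences. The cost figures below are hypothetical, not taken from the CBRE reports, and the Neighbor-net step is not attempted.

```python
def gower_similarity(rows):
    """Return the matrix of Gower similarities for purely numeric rows."""
    n = len(rows)
    n_vars = len(rows[0])
    # Range of each variable (column) across all rows.
    ranges = [
        max(r[k] for r in rows) - min(r[k] for r in rows) for k in range(n_vars)
    ]
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Variables with zero range carry no information and are skipped.
            terms = [
                abs(rows[i][k] - rows[j][k]) / ranges[k]
                for k in range(n_vars)
                if ranges[k] > 0
            ]
            s = 1.0 - sum(terms) / len(terms) if terms else 1.0
            sim[i][j] = sim[j][i] = s
    return sim


# Hypothetical occupancy costs (US$ / sq ft / annum) for two years.
costs = {
    "Location A": [200.0, 220.0],
    "Location B": [150.0, 160.0],
    "Location C": [100.0, 100.0],
}
names = list(costs)
similarity = gower_similarity([costs[name] for name in names])

for name, row in zip(names, similarity):
    print(name, [round(s, 2) for s in row])
```

A similarity matrix of this kind (or the corresponding distances, one minus the similarity) is what a network method would take as its input.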

The network shows a gradient of decreasing office costs, from bottom-left to top-right. So, the consistently most expensive locations have been the West End of London and central Hong Kong, followed by Moscow and central Tokyo. London City and Kowloon, in Hong Kong, are not far behind, showing that you cannot avoid high costs for prime office space in these two cities.

Across the locations, the most expensive ones cost on average 3.4 times as much as the cheapest locations. Note that Midtown Manhattan is not nearly as expensive as people might think, and certainly not as expensive for office rental as it is for living accommodation. Switzerland has only two cities (Geneva, Zurich), and both of them are in the middle of the network; so it is not cheap, either. Australia has five main cities but only two of them are in the list (Perth, Sydney); Sydney is also one of the most expensive cities in the world for general living expenses.

In the network, Dubai and central Mumbai are somewhat isolated from the other locations because their office rents have decreased over the six reports, unlike any of the other locations. In the case of Mumbai, the most expensive offices recently have been in the Bandra Kurla complex, instead of Nariman Point.

So, if you are planning on expanding your business globally, you now know where to avoid.

Wednesday, April 1, 2015

The first post-Darwinian phylogeny

It is tolerably well known that Alfred Russel Wallace developed the idea of evolution via natural selection quite independently of Charles Darwin, and that, indeed, it was Wallace's revelation of this fact that prompted Darwin to finally publish his ideas (Bannister et al. 2014).

Some people are even aware that Wallace developed the Tree of Life metaphor independently, as well (Wallace 1855), a fact of which Darwin himself was perfectly well aware (e.g. Bradman and Bartlett 1998):
"the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others."
What is less well known is Wallace's contribution to phylogenetic imagery.

The Darwinian version of a phylogenetic tree is, of course, something usually considered to post-date 1859, when Darwin published his best-known book. However, producing such a tree was apparently a rather slow process. For example, in 1863, Franz Hilgendorf wrote a PhD thesis for which he produced a hand-drawn phylogeny, but he did not actually include this in the thesis; and he significantly modified it for its publication in 1866. In 1864 Fritz Müller published a couple of three-taxon trees. Also in 1864, Ernst Haeckel claimed to have started work on his series of phylogenetic trees, but the resulting book was not published until 1866. This means that the first substantial tree to appear in print was that of Mivart (1865).

However, long before this, Wallace was already moving ahead. In 1856 Wallace took the tree imagery from his 1855 publication and applied it to the relationships among bird groups. This publication was his first clearly evolutionary empirical contribution. He adapted the unrooted diagram of Strickland (1841), which represented "the natural system" of bird relationships, and gave it a clearly evolutionary interpretation. So, while Strickland's work was strictly atemporal and non-evolutionary, Wallace produced an evolutionary view of the world, with his two trees representing the end-product of change through time.

Wallace was in South-East Asia at the time of this work, collecting specimens among the islands of what is now Indonesia. He returned to England in 1862, thus having been absent during Darwin's rise to fame. However, he did return before anyone else had tackled Darwin's ideas empirically, and he was in an ideal position to do so himself (Beckenbauer et al. 2010). It would therefore be surprising if he had not done so.

Recently, it has become clear, as a result of the work done for the Wallace Correspondence Project, that Wallace did, indeed, produce a post-Darwinian phylogenetic diagram before any of his contemporaries, although it remained unpublished (Becker and Borg 2014). Not unexpectedly, it also refers to the relationships among birds. What is most interesting for us, however, is that it was a phylogenetic network, not a tree.

You will note that it is an unrooted network, in the same manner as his unrooted bird trees from 1856. In this, his presentation differed from that of Müller, Hilgendorf, Mivart and Haeckel, who all indicated a common ancestor. On the other hand, the branch lengths represent the "relative amount of affinity" between the named taxa, unlike the diagrams of his contemporaries. This means that the diagram can, indeed, be interpreted (in modern terms) as an unrooted phylogenetic network.

In his bird paper, Wallace (1856) had noted that producing the tree diagrams is not easy, as "you will most likely find that you have set down some conflicting affinities, or that you have mistaken some mere analogies for affinities". This seems to be the origin of his interest in the alternative model of a network, rather than a tree (Brabham and Berger 2014), thus making him the first person to use a data-display network to represent conflicting character data.

This post was inspired by the work of Torvill and Dean (1996). Happy April 1.


Bannister RG, Ballesteros-Sota S, Bjørndalen OE (2014) Running, swinging and skiing — the private life of Alfred Russel Wallace. Studia Wallaceana 6: 82-96.

Becker BF, Borg BR (2014) The phylogenetics of A.R. Wallace, and its relation to the science of tennis. Journal of Phylogenetic Inference 13: 101-110.

Beckenbauer FA, Best G, Bruyneel J (2010) Association football as a metaphor for phylogenetics. Is it a sport or a science? Phyloinformatics 7:1.

Brabham JA, Berger G (2014) The speed required to achieve the publication rate of A.R. Wallace. Philosophy and History of Biology 102: 89-92.

Bradman DG, Bartlett KC (1998) Wallace Down Under: the work of Alfred Russel Wallace in the southern hemisphere. Systematic Zoology 47: 767-780.

Haeckel E (1866) Generelle Morphologie der Organismen. Verlag von Georg Reimer, Berlin.

Hilgendorf F (1866) Planorbis multiformis im Steinheimer Süßwasserkalk: ein Beispiel von Gestaltveränderung im Laufe der Zeit. Buchhandlung von W. Weber, Berlin.

Mivart StG (1865) Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592.

Müller F (1864) Für Darwin. Verlag von Wilhelm Engelmann, Leipzig.

Strickland HE (1841) On the true method of discovering the natural system in zoology and botany. Annals and Magazine of Natural History 6: 184-194.

Torvill J, Dean CC (1996) Skating on thin ice. Systematic Biology 45: 641-650.

Wallace AR (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History 16 (2nd series): 184-196.

Wallace AR (1856) Attempts at a natural arrangement of birds. Annals and Magazine of Natural History 18 (2nd series): 193-216.