The Genealogical World of Phylogenetic Networks: November 2013

Wednesday, November 27, 2013

Within-species networks

In this blog we have consistently championed the idea that within-species relationships are better represented by a network than by a tree. We have done this for humans and their relatives:

Networks and human inter-population variation
Human races, networks and fuzzy clusters
Why do we still use trees for the Neandertal genealogy?

and for other species as well:

Are phylogenetic trees useful for domesticated organisms?
Why do we still use trees for the dog genealogy?
Network of apple cultivars

Genetically, a within-species network is a haplotype network. Also, when dealing with individuals in a sexually reproducing species it is a hybridization network, as I have noted:

Family trees, pedigrees and hybridization networks
Charles Darwin's family pedigree network
Toulouse-Lautrec: family trees and networks

We are not the only blog to emphasize intra-species networks, of course. As far as humans are concerned, one of the more vocal blogs has been Gene Expression, run by Razib Khan over at Discover magazine. For example, when discussing phylogenetic trees (Burning down the trees in historical population genetics), Khan notes:

These sorts of trees range from Ernst Haeckel's classical attempt, depicting relationships which biologists derived from intuition within the framework of a grand evolutionary scheme, all the way down to modern methods implemented in software packages such as Mr. Bayes, which many frankly utilize in a "turnkey" manner. These trees are abstractions, in that they reduce down a wide range of phenomena into schematic representations which impart aspects of particular interest in a stylized form. This is important, because the actual nature of the phenomena being represented may be more complex than is being represented.

Phylogenetic analysis involving distinct species has its own problems, but they are dwarfed by what must confront those who attempt to parse out relatedness of populations within species. Because of the ubiquity of gene flow across populations within species, attempts to generate a tree of relationships of populations is always bound to be a gross simplification. Instead of a sequence of bifurcations the true relationship of putative populations is more accurately represented by a networked graph.

When discussing alternative evolutionary models (Unveiling the genealogical lattice), Khan notes:

It seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.

And here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.

Mind you, the network and lattice metaphors are not the only ones he has up his sleeve (When trees turn into brambles):

With the expansion of genomics from humans to a wide range of species I suspect that we’ll see a lot more blurring of distinctions between species on the margins. This will be particularly true of those lineages with wide and continuous distributions. It will also be most salient and surprising for mammalian populations, where our prejudices about the primacy of a biological species concept are most strongly developed. In a phylogenetic sense when you shift the grain of analysis to a finer scale the tree of life becomes much more of a bramble in many cases.

Indeed.

Monday, November 25, 2013

Toulouse-Lautrec: family trees and networks

In a previous blog post (Charles Darwin's family pedigree network), I mentioned several well-known people who were involved in a consanguineous marriage, which is defined as the union of two people who are related as closer than second cousins. In that post I discussed in detail Charles Darwin (who married his first cousin); and in this post I discuss the artist Henri Toulouse-Lautrec, who was the offspring of a marriage between first cousins.

I thought that this would be a simple post, because there must be people who have studied the Toulouse-Lautrec-Montfa genealogy, given Henri's fame as a Post-Impressionist artist, along with the widespread knowledge that his physical disabilities were genetic. But it turned out not to be so: there is no broad family tree that I could find, and no detailed discussion of inbreeding. The main information easily available is the direct lineage of inheritance of the various noble titles to which Henri would have been heir (had he survived his father, the Comte de Toulouse-Lautrec-Montfa), which can be traced back for more than 1000 years (see Vizegrafschaft Lautrec). However, the main interest for biology lies in his genetic relationship with his cousins, as we shall see below.

So, I sat down for a day to compile the family history for myself. The resulting genealogy is incomplete, but all of the relevant people are in it. I could not find all of the details about some of these people, either, which are apparently not available on the web; and some of the actual dates are inconsistent across different sources. In general, I have followed Dupic (2012).

When genealogical trees become networks

The point of this post is that marriages within a family turn the family tree into a network. So, a pedigree can be tree-like or not. In the latter case it is an example of a hybridization network.

This first genealogy shows a standard family tree for a single individual, looking backward in time from the bottom. So, this person is #1, the parents are #2 (father) and #3 (mother), and so on back through the generations, always with the male parent on the left (as is the convention). This example covers six generations, showing that without inbreeding everyone has 32 great-great-great grand-parents. These 32 people's genes are mixed more-or-less randomly (depending on recombination and assortment) to produce person #1. This is a good thing, evolutionarily, because there is then genetic diversity within #1.

However, with inbreeding some part of the ancestry disappears (when looking backward in time), because another part of the ancestry is duplicated in its place (this is called "pedigree collapse").

The second genealogy shows what happens when person #7 is the daughter of someone else in the same pedigree. If she is the daughter of #10 and #11, for example, then #5 and #7 would be sisters, and #2 and #3 would be first cousins. Now, person #1 has only 24 great-great-great grand-parents, and some of them are contributing to their descendants twice, rather than once (ie. #40–#47). This means that the genetic diversity in person #1 is less than it would be without the inbreeding. More to the point, any recessive alleles that exist in the ancestry have an increased probability of being homozygous in #1, and thus being expressed in the phenotype.
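The counting in this example can be checked with a short sketch. It assumes the usual ancestor numbering described above (the father of person n is 2n, the mother is 2n+1); the function names and the `COUSIN_MERGE` mapping are illustrative only, and simply encode "the parents of #7 are the same people as #10 and #11".

```python
from collections import Counter

# Hypothetical encoding of the example: #7's parents (#14, #15)
# are the same people as #10 and #11.
COUSIN_MERGE = {14: 10, 15: 11}


def canonical(n, merges):
    """Return the lowest ancestor number that denotes the same person as n."""
    if n == 1:
        return 1
    child = canonical(n // 2, merges)   # canonical number of the child
    number = 2 * child + (n % 2)        # father is 2n, mother is 2n + 1
    return merges.get(number, number)


def distinct_ancestors(generation, merges):
    """Distinct people in a generation (0 = person #1, 5 = great-great-great-grandparents)."""
    slots = range(2 ** generation, 2 ** (generation + 1))
    return {canonical(n, merges) for n in slots}


def doubled_ancestors(generation, merges):
    """People who fill more than one slot in the given generation."""
    slots = range(2 ** generation, 2 ** (generation + 1))
    counts = Counter(canonical(n, merges) for n in slots)
    return sorted(person for person, count in counts.items() if count > 1)


print(len(distinct_ancestors(5, {})))            # no inbreeding
print(len(distinct_ancestors(5, COUSIN_MERGE)))  # first-cousin parents
print(doubled_ancestors(5, COUSIN_MERGE))
```

With the merge applied, the sixth generation shrinks from 32 to 24 distinct people, and the doubled slots are #40 to #47, matching the description above.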

Toulouse-Lautrec's ancestry

This is, unfortunately, exactly what happened to Henri Toulouse-Lautrec, whose pedigree network is shown in the next figure. It is complete for six generations, plus an important part of the seventh. It is difficult to be complete beyond this generation, as the information becomes sparse, particularly about the female family members.

As shown, Henri's parents were first cousins, because their mothers were sisters. In addition, his maternal grandfather (#6) also had recent inbreeding in his history, because his mother (#13) was the daughter of a first-cousin marriage. This is not nearly as much inbreeding as has been implied by most commentators about Henri's life, but it is enough to potentially create genetic problems.

Note that it was Henri's mother's side of the family that was involved in the recent inbreeding, but the de Toulouse-Lautrec-Montfa side was prone to the same thing, as are most titled families. As noted above, Henri died before inheriting his title. The title Comte de Toulouse-Lautrec-Montfa passed from his father, Alphonse, to Alphonse's next brother, Charles (1840-1917), who had no children, and thence to the next brother, Odon (1842-1937), and finally to Odon's son, Robert (1887-1972), who also had no children. The Internet seems to be silent about what happened to it after that.

Consequences of inbreeding

For Henri, life was tragic because he ended up with two copies of one particular recessive allele. The medical profession has been interested in this ever since his death, and much information is therefore now available about his condition (eg. Albury & Weisz 2013; Leigh 2013).

Albury & Weisz (2013) note:

The condition from which he probably suffered was first described in 1954 by the French physician Robert Weissman-Netter. It was named pycnodysostosis in 1962 by Marateaux and Lamy and was soon attributed to this artist as the "Toulouse-Lautrec Syndrome" ... Pycnodysostosis is a hereditary autosomal recessive dysplasia caused by an enzyme deficiency, namely of cathepsin K (cysteine protease deficiency in osteoclasts), reducing the normal bone resorption and leaving an incomplete matrix decomposition ... Toulouse Lautrec had a short stature with shortened legs, a large head due to a lack of closure of the fontanellae (which he usually covered with a hat), a shortened mandible with an obtuse angle (covered with a thick beard), dental deformities that required several surgical interventions, a large tongue, thick lips, profuse salivation, and a sinus obstruction with post-nasal drip. With fractures of the long bones during childhood, later on of the clavicle, with progressive hearing problems and cranio-facial deformities, Lautrec’s condition would complete the diagnosis of pycnodysostosis.

It seems to be widely recognized that Henri threw himself into his art at least partly to compensate for the psychological damage produced by his physical condition (he also became an alcoholic). As Leigh (2013) notes, his mother's side of the family had money (his father's side had a title but little money), and so Henri was financially free to do what he liked. He worked at a prodigious rate, and produced a life-time's worth of art in just 15 years — perhaps most famously his flamboyant lithograph posters (still as popular today as they were in his own time), but also oil paintings, watercolours, sculptures, ceramics and stained glass. He died at his mother's Château Malromé at age 36, after a stroke, but ultimately probably from tuberculosis (Albury & Weisz 2013).

Further inbreeding in the family

I noted in my previous post about Charles Darwin that, not only did he marry his cousin, his own sister married his wife's brother, thus literally keeping things in the family. In Henri Toulouse-Lautrec's case, the same thing happened: his paternal aunt married his maternal uncle, as shown in the next figure. This pedigree shows some more information about Henri's closest relatives, emphasizing the pair of consanguineous marriages.

There are 14 people shown in Henri's generation, all born to first-cousin marriages. (There may have been two more children in the Alix–Amédée marriage, but I have been unable to find any direct reference to them.) Of these people, six seem to have had disabilities similar to Henri's: Henri himself; his brother, who died the day before his first birthday; Madeleine, who died as a teenager; Geneviève; Béatrix; and Fides. The latter was so small that apparently she lived her entire life in a baby carriage (Rosenhek 2009). The photo below shows Henri with most of the Tapié de Céleyran family. It was taken in the summer of 1896 at Château du Bosc, where Henri had been born.

The two elderly women in the middle are Gabrielle (left) and Louise (right), the maternal and paternal grandmothers (they were sisters, remember). The father, Amédée, is at the rear centre (sticking his tongue out at the photographer), and the mother, Alix, is standing at the far right. Standing next to her is the oldest son, Raoul; and his wife, Elisabeth, is seated at the far left. The next two sons, Gabriel and Odon, are absent, along with their wives. The next son, Emmanuel, is standing at the back left; and his wife, Marie-Thérèse, is seated next to the pram (middle right). The youngest sons are sitting on the ground at the front centre, with Alexis on the left and Olivier on the right. The first-born daughter, Madeleine, was already dead when the photo was taken. The next three daughters are sitting at the middle left, with Germaine sitting on Elisabeth's lap, Geneviève in front of her, and then Marie seated on the ground. Béatrix is at the middle right, sitting next to Marie-Thérèse, and Fides is in her pram. Henri himself is seated on the ground at the far left. His brother, Richard, had also died before the photo was taken. The remaining four people (standing either side of Amédée) are other relatives.

Nevertheless, this large family did manage to survive the effects of inbreeding, unlike Henri's own family. At least seven of the children survived to have children of their own (~19 grand-children):

Person: Spouse; Children
Raoul: Elisabeth DAUDÉ de LAVALETTE (1870-1956); 4 children
Gabriel: Anne de TOULOUSE-LAUTREC (1873-1944); 3 children
Odon: Marguerite TAILLEFER de LAPORTALIÈRE (1878-1958); 1 child
Emmanuel: Marie-Thérèse des CORDES; 2 children
Germaine: Alexandre d'ANSELME (1876-1912); 2 children
Marie: Adrien de RODAT d'OLEMPS (1806-1884); 3 children
Alexis: Anne Marie de MALVIN de MONTAZET (1885-1974); 4 children

Note that Gabriel and Anne were third cousins, since they had great-grand-fathers who were brothers; nevertheless, they had 3 female children, at least one of whom also had 3 children. One of Alexis' sons (ie. Henri's second cousin once removed) was the well-known art critic Michel Tapié de Céleyran (1909-1987), who married and had seven children, two of whom died in infancy.

Inbreeding increases the probability that recessive alleles will be expressed, but it does not make this inevitable. In Henri's case, two disabled children in succession seems to have dissuaded his parents, and they separated, whereas his aunt and uncle had a healthy child the second time, and so they continued producing a family. However, these days it is not recommended that you marry any of your first cousins.
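To put rough numbers on this, here is a minimal sketch using the standard population-genetics formulas rather than anything computed in this post. It assumes Wright's path-counting formula for the inbreeding coefficient F, with common ancestors who are not themselves inbred, and the usual expression q²(1 − F) + qF for the chance of being homozygous for a recessive allele of frequency q. The allele frequency used below is purely hypothetical.

```python
from fractions import Fraction


def path_inbreeding(paths):
    """Wright's path formula, assuming non-inbred common ancestors.

    Each entry is (steps from father to the common ancestor,
                   steps from mother to the common ancestor).
    """
    return sum(Fraction(1, 2) ** (up + down + 1) for up, down in paths)


def homozygous_recessive_probability(q, inbreeding_coefficient):
    """P(aa) = q^2 (1 - F) + q F for allele frequency q."""
    return q * q * (1 - inbreeding_coefficient) + q * inbreeding_coefficient


# First cousins share two grandparents; each is two steps above both parents.
F_FIRST_COUSINS = path_inbreeding([(2, 2), (2, 2)])

q = Fraction(1, 100)  # hypothetical allele frequency, for illustration only
outbred = homozygous_recessive_probability(q, 0)
inbred = homozygous_recessive_probability(q, F_FIRST_COUSINS)

print(F_FIRST_COUSINS)   # 1/16
print(inbred / outbred)  # 115/16, i.e. roughly a seven-fold increase
```

The rarer the allele, the larger the relative increase, but the absolute probability remains small, which is consistent with most of the children in these families being unaffected.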

Conclusion

Evolution is about biodiversity at all hierarchical levels, not just between or within species, but within individuals as well. Average intra-individual genetic diversity reaches a maximum when the ancestry is tree-like, and reduces with each instance of inbreeding, which turns the tree into a network of increasingly greater complexity.

I have discussed an even more extreme example of consanguinity in a previous post (Family trees, pedigrees and hybridization networks), in which the inbreeding became so severe that the royal family lineage actually came to an end.

References

Albury WR, Weisz GM (2013) Toulouse-Lautrec and medicine: a triumph over infirmity. Hektoen International 5: 3.

Dupic S (2012) Toulouse-Lautrec - Généalogie 87, le site de référence de la généalogie de la Haute-Vienne.

Leigh FW (2013) Henri Marie Raymond de Toulouse-Lautrec-Montfa (1864-1901): artistic genius and medical curiosity. Journal of Medical Biography 21: 19-25.

Rosenhek J (2009) Picture imperfect: tiny Henri de Toulouse-Lautrec’s talent – and troubles – were larger than life. Doctor's Review Oct 2009.

Wednesday, November 20, 2013

Bioinformaticians look at bioinformatics

Bioinformatics as a term dates back to the 1970s, usually credited to Paulien Hogeweg, of the Bioinformatics group at Utrecht University, in The Netherlands, although it apparently did not make it into print until 1988 (Paulien Hogeweg. 1988. MIRROR beyond MIRROR, puddles of Life. In: Artificial Life, C. Langton, ed. Addison Wesley, pp. 297-315.).

In the 1990s the field expanded rapidly and became recognized as a discipline of its own, as a subset of computational science. However, Christos A. Ouzounis (2012. Rise and demise of bioinformatics? Promise and progress. PLoS Computational Biology 8: e1002487) has noted a distinct decrease in the use of the term itself, as shown by this graph.

Ouzounis recognizes three (admittedly artificial) periods in the history: Infancy (1996-2001), Adolescence (2002-2006) and Adulthood (2007-2011). Along the way, the practice of bioinformatics has received a lot of criticism. I have noted some of this before, in previous blog posts:

Poor bioinformatics?
Archiving of bioinformatics software

What is perhaps most important is that much of this criticism comes from bioinformaticians themselves, rather than from biologists. Moreover, this criticism does not seem to have had much effect on how bioinformatics is practiced, given the length of time over which it has been made.

For example, Carole Goble (2007. The seven deadly sins of bioinformatics. Keynote talk at the Bioinformatics Open Source Conference Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007) produced this list of what she called "intractable problems in bioinformatics":

1. Parochialism and insularity.
2. Exceptionalism.
3. Autonomy or death!
4. Vanity: pride and narcissism.
5. Monolith megalomania.
6. Scientific method sloth.
7. Instant gratification.

More recently, Manuel Corpas, Segun Fatumo & Reinhard Schneider (2012. How not to be a bioinformatician. Source Code for Biology and Medicine 7: 3) pointed out what they call "a series of disastrous practices in the bioinformatics field", which look very similar:

1. Stay low level at every level.
2. Be open source without being open.
3. Make tools that make no sense to biologists.
4. Do not provide a graphical user interface: command line is always more effective.
5. Make sure the output of your application is unreadable, unparseable and does not comply to any known standards.
6. Be unreachable and isolated.
7. Never maintain your databases, web services or any information that you may provide at any time.
8. Blindly believe in the predictions given, P-values or statistics.
9. Do not ever share your results and do not reuse.
10. Make your algorithm or analysis method irreproducible.

You can peruse the originals to check out the details of these problems, and whether they sound uncomfortably familiar.

Monday, November 18, 2013

Language history and language weirdness

Native speakers of any language will judge the "difficulty" of another language by how much it differs from their own. For example, the Foreign Service Institute (FSI) of the U.S. Department of State lists five categories of increasing time taken for native English speakers to acquire "General Professional Proficiency" in other languages. This refers to an average, of course, and anyone may personally find one language or another easier or more difficult than others.

FSI Category I (the least time needed) includes most of the Germanic and Romance languages, since English was originally a Germanic language that received a huge Romance input after the Normans turned up in Britain in 1066. The exception is German itself, which is alone in Category II (needing longer), because of its more complex grammar. Category V (the longest time needed for proficiency) consists of Arabic, Cantonese, Japanese, Korean and Mandarin, with Japanese being considered the most difficult.

Most languages are in Category IV, including the rest of the Indo-European languages. The recognizably tougher ones in that group are the Uralic languages (Estonian, Finnish and Hungarian), because of their countless noun cases. Interestingly, Category III (easier than IV) consists of Indonesian, Malaysian and Swahili, which have no known historical connection to English — they just happen to have fewer linguistic differences than do the other languages.

And that is the point of this post — linguistic similarities don't necessarily reflect the evolutionary history of the languages. There are trees allegedly showing the genealogy of languages, because there is vertical transfer of information in the history of languages (generation to generation), but horizontal transfer has also been a powerful evolutionary force, as cultures come in contact with each other. The history of English, as noted above, shows both vertical (Germanic) and horizontal (Romance) influences. Language history is a reticulating network, not an evolutionary tree.

Just as importantly, though, languages can have coincidental similarities. There are, after all, not that many different ways of constructing a language, and there are reported to be ~6,900 distinct languages on this planet. So, chance similarities must abound — what in biology we would call parallelisms and convergences. This makes constructing the evolutionary history of languages difficult.

The complexity created by coincidences has led some people to wonder about how "unusual" any one language might be. This can be defined as how many of its characteristics occur commonly in other languages, and how many of them occur more rarely. The most unusual languages will be those that have lots of the rare features; and we might call them linguistic outliers. The Idibon blog has already had a look at this topic (The weirdest languages), and here I reconsider their data in the light of a phylogenetic network.

The data

The original data come from the World Atlas of Language Structures, which describes itself as "a large database of structural (phonological, grammatical, lexical) properties of languages gathered by a team of 55 authors". There are apparently 2,676 different languages in the database, coded for 192 linguistic features. Sadly, the database is very sparse, so that most languages have not yet been coded for most of the features (there are 5–1,519 languages coded for each feature).

So, the Idibon people selected a subset of the data: 1,693 languages and 21 features. These features were chosen to be an uncorrelated subset of those 165 features that have at least 100 languages coded; and the selected languages each have at least 10 features coded.

The features are certainly an eclectic collection, which you can read about on the WALS site:

83A: Order of Object and Verb
87A: Order of Adjective and Noun
143A: Order of Negative Morpheme and Verb
143G: Minor Morphological Means of Signaling Negation
69A: Position of Tense-Aspect Affixes
116A: Polar Questions
57A: Position of Pronominal Possessive Affixes
101A: Expression of Pronominal Subjects
6A: Uvular Consonants
71A: The Prohibitive
129A: Hand and Arm
130A: Finger and Hand
44A: Gender Distinctions in Independent Personal Pronouns
14A: Fixed Stress Locations
9A: The Velar Nasal
72A: Imperative-Hortative Systems
111A: Nonperiphrastic Causative Constructions
64A: Nominal and Verbal Conjunction
124A: 'Want' Complement Subjects
117A: Predicative Possession
19A: Presence of Uncommon Consonants

From the subset of languages, I chose all of those languages with at least 12 of these features coded, plus Icelandic (10 features), and Cornish and Gaelic (Scots) (11 features).

I then tried to fill in some of the missing data, to get as many languages as easily possible up to having 14 features coded (ie. two-thirds of the features). For the phonology features (6A, 9A, 19A), the relevant information can be looked up on the web, particularly in Wikipedia and the Native American Language Net. For the word features (129A, 130A), I used the LEXILOGOS Online Translation.

In the process, I found that Idibon has at least one feature mis-coded compared to the WALS web site: for feature 14A, some of the languages that should be coded "Second" have been coded as "Antepenultimate", and all of the others that should be coded "Second" have missing data.

I also found a few contradictions between the WALS coding and the information elsewhere on the web. In some of these cases I re-coded the WALS data.

My final spreadsheet is available online. There are 280 languages coded for at least 14 of the 21 features, compared to 239 such languages in the Idibon analysis. Overall, 19% of the data are still missing, varying from 0–53% across the 21 features.

The network

My network is intended as an exploratory data analysis, rather than some attempt at an evolutionary diagram. Thus, the network simply displays the apparent similarity among the languages. That is, languages that are closely connected in the network are similar to each other based on their linguistic features, and those that are further apart are progressively more different from each other.

First, I recoded the multivariate linguistic data as 59 binary characters. Then the similarity among the 280 languages was calculated for each pair of languages using the Gower similarity index, which can accommodate missing data (by ignoring features that are missing for each pairwise comparison). A Neighbor-net analysis was then used to display the between-language similarities as a phylogenetic network.
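The first two steps can be sketched as follows. This is not the code used for the analysis: the recoding scheme (one binary character per feature state), the simple-matching form of the Gower index, and the toy languages and feature states are all assumptions made for illustration. The Neighbor-net step itself is not shown; it would take the distances 1 − similarity as input.

```python
def to_binary_characters(value, states):
    """Recode one multistate feature as one binary character per state.

    A missing value (None) stays missing in every derived character.
    """
    if value is None:
        return [None] * len(states)
    return [1 if value == state else 0 for state in states]


def encode(language, feature_states):
    """Concatenate the binary characters of all features, in a fixed order."""
    characters = []
    for feature, states in feature_states.items():
        characters.extend(to_binary_characters(language.get(feature), states))
    return characters


def gower_similarity(a, b):
    """Share of matching characters among those coded in both languages."""
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        return None  # nothing to compare for this pair
    return sum(1 for x, y in compared if x == y) / len(compared)


# Hypothetical toy data: two features with two states each.
FEATURE_STATES = {"order": ["OV", "VO"], "velar_nasal": ["yes", "no"]}
lang_a = {"order": "OV", "velar_nasal": "yes"}
lang_b = {"order": "OV", "velar_nasal": None}   # second feature not coded
lang_c = {"order": "VO", "velar_nasal": "yes"}

a, b, c = (encode(lang, FEATURE_STATES) for lang in (lang_a, lang_b, lang_c))
print(gower_similarity(a, b))  # compared on "order" only
print(gower_similarity(a, c))
```

Note that a pair compared on very few shared features (as for the first pair here) gets a similarity that rests on little evidence, which is one reason sparse coding matters for the resulting network.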

The network is not very tree-like, is it? A few tentative groups can be recognized, as indicated by my colouring, but that is all. These groups do not correspond to any known language groups, meaning that the language features chosen do not reveal a traditional tree-like genealogy. Whether this reflects horizontal transfer of linguistic features, coincidence, or simply inadequate data, is not necessarily clear.

However, it seems most likely that much of the complexity represents coincidence. In the study of language evolution, parallelism and convergence are not nuisances, which is the way they are treated when constructing phylogenies of organisms. Coincidental similarities are a fundamental part of language history, but they are not necessarily the product of processes like natural selection, as they often are in biology.

If we look at some of the details, the nature of the complexity becomes clearer, as shown in the next figure. Here, I have colour-coded the Indo-European family of languages by their so-called "genus", plus the other languages that occur in Europe (the Uralic group, and Basque):

Albanian - pale brown
Armenian - dark brown
Baltic - orange
Celtic - pale blue
Germanic - black
Greek - pale green
Indic - pink
Iranian - blue
Romance - purple
Slavic - green
Uralic - red
Basque - grey

Note that the seven Germanic languages are clustered in a single location, as are the two Baltic languages. The others appear in either two (Celtic, Romance, Iranian) or four (Indic, Slavic, Uralic) locations. This implies considerable linguistic variation within most of what are considered to be closely related languages (that is why they are called language genera). A larger collection of features might change the pattern, of course, but I still reckon that there is a large component of non-vertical transmission here. This is either coincidence or horizontal transmission. For the Indo-European languages, the latter is perhaps quite likely; but it is equally likely that it is simply coincidence, even at this relatively fine scale.

The weirdest languages

The Idibon blog tried to reduce the multivariate data down to a single number for each language (scaled 0–1), representing its "weirdness" in terms of how many uncommon features it has. So, I have performed the same calculation for my expanded dataset.
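The exact formula is not reproduced here, so the following is only a sketch of one plausible way to build such an index, not the Idibon calculation itself. It assumes that each feature value is scored by its relative frequency among the languages coded for that feature, that a language's scores are averaged over its coded features, and that the result is inverted and rescaled to 0–1. The function names and toy data are hypothetical.

```python
from collections import Counter


def value_frequencies(languages):
    """Relative frequency of each value of each feature, ignoring missing data."""
    counts = {}
    for features in languages.values():
        for feature, value in features.items():
            if value is not None:
                counts.setdefault(feature, Counter())[value] += 1
    return {
        feature: {value: n / sum(counter.values()) for value, n in counter.items()}
        for feature, counter in counts.items()
    }


def weirdness(languages):
    """Assumed index: 0 = most common feature values, 1 = rarest."""
    freqs = value_frequencies(languages)
    mean_freq = {}
    for name, features in languages.items():
        coded = [freqs[f][v] for f, v in features.items() if v is not None]
        if coded:
            mean_freq[name] = sum(coded) / len(coded)
    lowest, highest = min(mean_freq.values()), max(mean_freq.values())
    if highest == lowest:
        return {name: 0.0 for name in mean_freq}
    return {name: (highest - m) / (highest - lowest) for name, m in mean_freq.items()}


# Hypothetical toy data: four languages, two features, one missing value.
toy = {
    "A": {"f1": "x", "f2": "p"},
    "B": {"f1": "x", "f2": "p"},
    "C": {"f1": "x", "f2": "q"},
    "D": {"f1": "y", "f2": None},
}
scores = weirdness(toy)
print(sorted(scores, key=scores.get, reverse=True))  # weirdest first
```

Under these assumptions a language coded for only a few features can swing to either end of the scale on the strength of a single value, so missing data would be expected to have a large effect on the ranking.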

The complete list is in the spreadsheet, but here are the top and bottom most-unusual languages:

Top 20
Mixtec (Chalcatongo): 0.9725
Seri: 0.9354
Nenets: 0.9346
Diegueño (Mesa Grande): 0.9196
Oromo (Harar): 0.9187
Choctaw: 0.9138
Kutenai: 0.9079
Iraqw: 0.9005
Danish: 0.8843
Kongo: 0.8830
Norwegian: 0.8751
Dutch: 0.8705
Swedish: 0.8585
German: 0.8581
Armenian (Eastern): 0.8473
Abkhaz: 0.8445
Mumuye: 0.8410
Ju|'hoan: 0.8346
Khoekhoe: 0.8300
Ladakhi: 0.8247

Bottom 20
Kanuri: 0.2410
Kunama: 0.2401
Kiowa: 0.2361
Marathi: 0.2752
Khanty: 0.2149
Turkish: 0.2145
Bulgarian: 0.2112
Wichita: 0.2054
Manam: 0.2085
Kewa: 0.1984
Sentani: 0.1952
Bororo: 0.1534
Usan: 0.1508
Cantonese: 0.1435
Hungarian: 0.1316
Chamorro: 0.1285
Ainu: 0.1277
Cherokee: 0.1232
Purépecha: 0.0997
Hindi: 0.0872

My results differ from those of the Idibon blog for two reasons: more languages, and more data for some of the languages. Some of my added languages make it to the top of the weirdness list, including Seri, Danish and Swedish; and some of the other languages considerably change their score. For example, Hebrew, Welsh, Portuguese and Chechen are now near the top of the list, and Quechua, Basque, Saami and Cornish are no longer near the bottom. All of the big changes are increases in weirdness, suggesting that the missing data are important for this calculation.

Nevertheless, it is worth noting that five of the seven Germanic languages are in the top 15 (plus English is at 40 and Icelandic 47). Unusually, most of the Germanic languages still use cases (modifications to words that show how they relate to other words in a sentence). This means that you have to memorize a lot of different versions of each noun, just as you do in Latin. Moreover, these languages change the word order when asking a question as opposed to making a statement, whereas most languages add a particle instead. (In the most unusual language, Mixtec, a native language from Mexico, there is apparently no difference between a question and statement!)

English has a lower score than the other Germanic languages, presumably because of the French influence mentioned above (French is ranked 42). For example, English now has very few cases (only for some pronouns), unlike the other Germanic languages; instead it uses a fairly strict word order to express grammatical relationships. (You will note that two of the English-speaking authors of this blog now live in countries with other Germanic languages, and so we know just how big a pain it is to learn illogical case endings.)

English does have one really odd feature, though, which is the use of the sound "th" (which is part of feature 19A). There are two forms of this sound, voiced (as in "the") and unvoiced (as in "thing"). These sounds do not exist in most languages, and they are rare even among the other Indo-European languages. That is why you often hear non-native speakers say "dis" and "zis" instead of "this" — "th" is a sound that they have no experience making.

Actually, the Indo-European languages are very diverse in their weirdness. Many of them are at the top of the list, but there are also some at the bottom, including Hindi which is dead last. Notably, three of the Romance languages are at the top (Spanish, Portuguese, French) and two are at the bottom (Romanian, Italian). This seems unlikely, given the overall similarity of Spanish and Italian, for example; and so it probably reflects the specific choice of linguistic features.

The data are also potentially sensitive to some of the feature coding. One notable example is for feature 19A in Arabic. WALS codes Arabic as having pharyngeals but not "th", while Wikipedia says that the pharyngeals are doubtful, but that Arabic has "th". So, the possible codings of Arabic, and their resulting weirdness, are:

Feature                 Score     Weirdness
"Th" sounds only        0.0893    0.6788
Pharyngeals only        0.0469    0.7416
Pharyngeals and "th"    0.0045    0.9245

So, this feature alone can potentially change Arabic from "normal" to "very weird", depending on how it is coded.

Conclusion

Languages do not have a tree-like evolutionary history. Even the relatively small dataset presented here seems to show the influence of horizontal evolution. But, more importantly, we should not underestimate the coincidental occurrence of language features (parallelism and convergence). These have usually been treated as a nuisance in phylogenetic studies of organisms, but they are likely to be important for the study of languages. I have discussed this further in a previous post (False analogies between anthropology and biology).

Wednesday, November 13, 2013

Monophyletic groups in networks

I have noted before that taxonomic groups that are represented in any tree-like parts of a phylogeny can be considered to be monophyletic, but those that consist of hybrids cannot, unless we hypothesize a single hybrid origin for each group (How should we treat hybrids in a taxonomic scheme?). This issue arises from the concept that monophyletic groups must share an exclusive Most Recent Common Ancestor (MRCA), and this concept is not straightforward for a network compared to a tree.

This topic has been tackled mathematically a couple of times (see Huson and Rupp 2008; Fischer and Huson 2010), resulting in the recognition that for a network there are three main types of MRCA: conservative MRCA (or stable MRCA), Lowest Common Ancestor (or minimal common ancestor), and Fuzzy MRCA (see Networks and most recent common ancestors). These have definitions based on the Least Upper Bound and Greatest Lower Bound of mathematical lattices.

Unfortunately, there has been very little discussion of the topic in the biological literature. However, recently Wheeler (2013) has made a start. There is no reference to the mathematical work on MRCAs, but he considers what to do about the concepts of monophyly, paraphyly and polyphyly with respect to networks.

Basically, he suggests three new types of phyletic group: periphyletic, epiphyletic, and anaphyletic. He provides algorithmic definitions of these groups, relating them to the previous algorithmic definitions of monophyly, paraphyly and polyphyly. These new types concern groups that are monophyletic on a tree, but have additional gains or losses of members from network edges — that is, they lie somewhere between monophyletic and paraphyletic.

For example, an epiphyletic group would be one that is otherwise monophyletic but also contains one or more hybrids that have one of their parents from outside the group, while a periphyletic group would be monophyletic but has contributed as a parent to at least one hybrid outside the group. An anaphyletic group would have done both of these things. For clarification, Wheeler provides the following empirical example, based on Indo-European languages (where English is recognized as a "hybrid" of Germanic and Romance languages).

Reproduced from Wheeler (2013).
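The verbal definitions above can be turned into a small sketch. This is not Wheeler's algorithm; it simply follows the description given in this post, and it assumes that we already know which taxa form a clade on the underlying tree and which taxa are the parents of each hybrid. The function name, the taxon names and the group memberships are all illustrative assumptions, and they are not taken from Wheeler's figure.

```python
def classify_group(group, hybrid_parents):
    """Classify a group that is monophyletic on the underlying tree.

    group: set of taxa forming a clade on the tree (assumed, not checked).
    hybrid_parents: dict mapping each hybrid taxon to the set of its parents.
    """
    group = set(group)
    # Gain: a hybrid inside the group has at least one parent outside it.
    gains = any(
        hybrid in group and bool(set(parents) - group)
        for hybrid, parents in hybrid_parents.items()
    )
    # Contribution: a group member is a parent of a hybrid outside the group.
    contributes = any(
        hybrid not in group and bool(set(parents) & group)
        for hybrid, parents in hybrid_parents.items()
    )
    if gains and contributes:
        return "anaphyletic"
    if gains:
        return "epiphyletic"
    if contributes:
        return "periphyletic"
    return "monophyletic"


# Toy data (hypothetical): English as a hybrid with one Germanic parent
# and one Romance parent.
germanic = {"German", "Dutch", "Old English", "English"}
romance = {"Old French", "French", "Spanish"}
slavic = {"Russian", "Polish"}
hybrid_parents = {"English": {"Old English", "Old French"}}

for name, members in [("Germanic", germanic), ("Romance", romance), ("Slavic", slavic)]:
    print(name, classify_group(members, hybrid_parents))
```

With this toy input, the Germanic group contains a hybrid with an outside parent, the Romance group contributes a parent to a hybrid elsewhere, and the Slavic group is involved in no reticulation at all.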

In terms of MRCA, it seems to me that all three new group types use the Lowest Common Ancestor model, which is the shared ancestor that is furthest from the root along any path (i.e. the LCA is not an ancestor of any other common ancestor of the taxa concerned). However, this is only clear when we consider hybrids, in which the two (or more) parents contribute equally to the hybrid offspring. When dealing with introgression or horizontal gene transfer, where the parentage is unequal, then we approach the Fuzzy MRCA model, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants.
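To make the Lowest Common Ancestor idea concrete, here is a minimal sketch that applies the definition in parentheses above to a rooted directed acyclic graph. The representation (a dictionary mapping each node to its list of parents), the function names and the toy network are assumptions made for illustration; this is a naive implementation of the definition, not the algorithm of Fischer and Huson (2010).

```python
def ancestors(node, parents):
    """Return all proper ancestors of `node` in a rooted DAG."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def lowest_common_ancestors(taxa, parents):
    """Common ancestors that are not ancestors of any other common ancestor."""
    common = set.intersection(*(ancestors(t, parents) for t in taxa))
    return {
        c for c in common
        if not any(c in ancestors(other, parents) for other in common)
    }


# Toy network (hypothetical): h1 and h2 are hybrids that share parents A and B.
parents = {
    "A": ["root"], "B": ["root"],
    "x": ["A"], "y": ["B"],
    "h1": ["A", "B"], "h2": ["A", "B"],
}

print(lowest_common_ancestors(["x", "y"], parents))    # only the root qualifies
print(lowest_common_ancestors(["x", "h1"], parents))   # A is lower than the root
print(lowest_common_ancestors(["h1", "h2"], parents))  # both A and B qualify
```

The last call illustrates why the concept is less straightforward for a network than for a tree: two hybrids with the same pair of parents have two lowest common ancestors rather than one.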

What is not yet clear is whether we would want to recognize any of these new group types in a taxonomic scheme. I guess that this is something that the PhyloCode will have to think about, since it is based strictly on clades (although they are allowed to overlap).

References

Fischer J, Huson DH (2010) New common ancestor problems in trees and directed acyclic graphs. Information Processing Letters 110: 331–335.

Huson DH, Rupp R (2008) Summarizing multiple gene trees using cluster networks. Lecture Notes in Bioinformatics 5251: 296–305.

Wheeler WC (2014) Phyletic groups on networks. Cladistics 30: 447–451. (Published online early in 2013.)

Monday, November 11, 2013

Presenting complex splits graphs

One common problem when presenting results using a data-display network is the complexity of the relationships among the samples, especially when there is a large number of them. It is often the case that the relationships among closely related samples are impossible to see clearly.

A recent paper (El Baidouri F, Diancourt L, Berry V, Chevenet F, Pratlong F, Marty P, Ravel C (2013) Genetic structure and evolution of the Leishmania genus in Africa and Eurasia: what does MLSA tell us. PLoS Neglected Tropical Diseases 7: e2255) presents an interesting solution to this problem. Basically, the idea is to present a series of graphs, with the main graph showing the overall relationships and a collection of small graphs showing the details of different parts of the network.

This takes longer, of course, as it involves doing a series of analyses, one for each subset of the data, but this is easy enough to do in programs like SplitsTree. It seems to be an idea worth considering.

Tuesday, November 5, 2013

Using constraints to get a handle on the space of phylogenetic networks?

The following two problems will be familiar to researchers working on evolutionary phylogenetic networks.

1) The severe computational intractability associated with globally optimizing most objective functions over the space of phylogenetic networks.

2) The fact that within the space of potential solutions, there are typically very many that an end-user biologist will want to exclude from consideration, for context-specific biological reasons that the software does not know about. This hidden information often only becomes available at the end of the analysis. It is not unusual to receive comments such as: "Thanks for the networks but they can't be good, because experimentalists strongly believe that taxon X is a hybrid of taxa Y and Z, and we also think that taxon group C should be monophyletic ... this is not visible in your networks."

In a recent opinion piece added to the arXiv ("Fighting network space: it is time for an SQL-type language to filter phylogenetic networks"), David, Simone Linz and I pose the question of whether it might be possible to address both of these problems at the same time, using constraint-based modelling. The core of the idea is that, via some kind of comparatively easy-to-use modelling language (e.g. something with an SQL flavour), the end-user biologist should be able to specify characteristics that all candidate solutions must (or must not) have.

The win-win scenario would be that this (a) tempers the intractability of the search problem, by cutting out large swathes of irrelevant networks in the vast search space and (b) invites biologists to incorporate their context-specific knowledge "upstream", reducing the risk that the networks generated by the software are mis-interpreted. In the context of phylogenetic trees, the idea is not new: in 1986 Constantinescu and Sankoff showed that the use of a constrained tree indeed reduces the search space remarkably.

It seems a natural idea to do this for networks, but the question of course is how feasible all this is. Constraint-based pruning of intractable search spaces is seductive but technically challenging for all kinds of reasons. Depending on the constraints used it might help a lot or a little; it is certainly no silver bullet. We might nevertheless hope that in many cases end-user biologists have so much implicit knowledge that the search space is massively shrunk. The question of the modelling language is also tricky because we need to decide upon a set of network constraints that biologists want and need: the dominant topological feature of trees, the clade, is no longer sufficient to describe (or constrain) the topologically richer space of phylogenetic networks. Furthermore, the constraints themselves should not become a new source of intractability.

In the opinion piece we make a few basic suggestions for atomic network constraints and how they might be combined via an SQL-style language. This, of course, is only the starting point for what we hope will be a wider discussion.
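To give a flavour of what "must" and "must not" constraints could look like in practice, here is a rough sketch. It does not reproduce the syntax or the atomic constraints proposed in the opinion piece: the two constraints (`is_hybrid_of` and `has_cluster`), the `select` function and the three toy networks are all hypothetical, and the sketch only filters a list of finished candidates rather than pruning the search itself.

```python
from itertools import permutations


def proper_ancestors(node, parents):
    """Return all proper ancestors of `node` in a rooted DAG."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_hybrid_of(x, y, z):
    """Constraint: x has one parent on y's lineage and another on z's."""
    def check(parents):
        anc_y = proper_ancestors(y, parents)
        anc_z = proper_ancestors(z, parents)
        return any(p in anc_y and q in anc_z
                   for p, q in permutations(parents.get(x, ()), 2))
    return check


def has_cluster(taxa):
    """Constraint: some node has exactly `taxa` as its descendant leaves."""
    target = set(taxa)

    def check(parents):
        internal = {p for ps in parents.values() for p in ps}
        leaves = (set(parents) | internal) - internal
        below = {}
        for leaf in leaves:
            for anc in proper_ancestors(leaf, parents):
                below.setdefault(anc, set()).add(leaf)
        return any(s == target for s in below.values())
    return check


def select(networks, must=(), must_not=()):
    """Keep the names of networks meeting every `must` and no `must_not`."""
    return [name for name, parents in networks.items()
            if all(c(parents) for c in must)
            and not any(c(parents) for c in must_not)]


# Three toy candidates, each stored as a child -> parents dictionary.
networks = {
    "tree": {"u": ["root"], "w": ["root"], "Z": ["root"],
             "X": ["u"], "Y": ["u"], "A": ["w"], "B": ["w"]},
    "hybrid_split": {"u": ["root"], "v": ["root"], "Y": ["u"], "A": ["u"],
                     "Z": ["v"], "B": ["v"], "X": ["u", "v"]},
    "hybrid_clade": {"u": ["root"], "v": ["root"], "w": ["root"], "Y": ["u"],
                     "Z": ["v"], "X": ["u", "v"], "A": ["w"], "B": ["w"]},
}

# "X is a hybrid of Y and Z, and {A, B} should form a group."
print(select(networks, must=[is_hybrid_of("X", "Y", "Z"),
                             has_cluster({"A", "B"})]))
```

Filtering after the fact addresses only the second of the two problems listed above. The harder question raised in the opinion piece is whether such constraints can be pushed into the search itself, so that excluded networks are never generated at all.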

We're very keen to hear your thoughts about this!

Monday, November 4, 2013

Gastropods on Monday

This week, for Monday we have a phylogenetic tree constructed using examples of the organisms whose relationships are being represented.

In 1867, Franz Martin Hilgendorf published the tree shown in the first figure, which is illustrated with pictures of the fossil snails being discussed. This may well have been the first time that this form of illustrated phylogeny was produced.

From Hilgendorf (1867).

In 2000, Hilgendorf's papers and original fossil materials were re-discovered in the Palaeontological Collections of the Natural History Museum, Berlin, to which they had been donated by Hilgendorf's heirs. Among these was a series of cards to which snails had been glued, illustrating the morphological transitions within and between the taxa, as described in the original paper. One of these cards illustrates the phylogeny, as shown in the next figure.

From Glaubrecht (2012).

This is not the only example of this type produced by Hilgendorf. Another one had previously been discovered in the State Museum of Natural History, Stuttgart, as shown in the final figure. This copy was apparently produced much later.

From Rasser (2006).

References

Glaubrecht M. (2012) Franz Hilgendorf's dissertation "Beiträge zur Kenntnis des Süßwasserkalks von Steinheim" from 1863: transcription and description of the first Darwinian interpretation of transmutation. Zoosystematics & Evolution 88: 231-259.

Hilgendorf F. (1867) Über Planorbis multiformis im Steinheimer Süsswasserkalk. Monatsberichte der Königliche Preussischen Akademie der Wissenschaften zu Berlin 1866: 474-504.

Rasser M.W. (2006) 140 Jahre Steinheimer Schnecken-Stammbaum: der älteste fossile Stammbaum aus heutiger Sicht. Geologica et Palaeontologica 40: 195-199.
