Showing posts with label Genealogy. Show all posts

Wednesday, November 19, 2014

How confusing were the first written genealogies?


In a previous post I introduced the Great Stemma as the earliest known pedigree, being a genealogical view of biblical history (The first infographic was a genealogy). In it I noted that people were enclosed in circles, which were connected by lines showing relationships, much as we still do today. However, the lines combined marriage, parent-offspring and brotherly relationships without distinction. So, while it is a good first attempt, the Great Stemma leaves room for informational confusion, and this was not corrected at any time during its centuries of being copied. (In fact, confusion was increased through embellishments, deletions and modifications; but that is another story.)

To illustrate the potential problem of interpreting this early type of genealogy, I have included here a specific example.


The above excerpt from the Stemma shows the children of Jacob by his wife Leah (who is shown at the top centre), and their subsequent children (ie. Leah's grandchildren). I have annotated the diagram to show parent-offspring (P), brother (B) and half-brother (HB) relationships. Note that all relationships are between males unless specified otherwise (so, half-brothers have the same father).

Leah is at the top [generation 1], with her six sons in a row below her (in birth order left to right), and her daughter to the side [generation 2]. Below this is the first-born son of each of the sons [generation 3], followed in columns down the page by their later sons, in birth order. Sons by later relationships are shown as half-brothers. At the bottom are two of Leah's great-grandchildren [generation 4].

Thus, the genealogical diagram does not effectively separate the generations visually, and parental and fraternal relationships are depicted in the same way. These days we solve this, of course, by keeping each generation as a single row and linking each child directly to the parent. It is easy to get used to the Stemma way of doing it, because it is fairly consistent about the arrangement. If there is confusion, then each circle does specify the relationship in words.

So, as I noted, this is a good first attempt, but some of the things that we now feel need distinguishing were not distinguished by the (unknown) original author.

However, the 24 extant copies of the Stemma are not identical, and two of them try to fit more information into Leah's family tree than is shown above. This information concerns the origin of the fourth generation, which is accurately depicted as far as it goes, but the above figure leaves out a lot. Some of the extra information is shown in the Stemma version below, which adds two extra people, both of them wives. I have annotated this version the same as the previous one, except that this pedigree adds one more relationship to the mix — marriage (M).


The extra details come from Genesis 38, which describes a set of relationships that would make a modern television soap-opera scriptwriter jealous. The story goes something like this (I have indicated the named people with letters in the diagram above, with Leah as L):
Judah (J) marries the [unnamed] daughter (W) of Shua. Judah and his wife have three children, Er (E), Onan (O), and Shelah (S). Er marries Tamar (T), but God kills him because he "was wicked in the sight of the Lord" (Gen. 38:7). Tamar becomes Onan's wife in accordance with the custom of the time, but he too is killed by God after he refuses to father children for his older brother's childless widow, and "spills his seed on the ground" instead (Gen. 38:8-10). Although Tamar should marry Shelah, the remaining brother, Judah does not consent, for fear of his son's life (Gen. 38:11). In response, after Judah's wife has died, Tamar deceives Judah into having intercourse with her, by pretending to be a prostitute (Gen. 38:12-23). When Judah discovers that Tamar is pregnant he prepares to have her killed, but recants and confesses when he finds out that he is the father (Gen. 38:24-26). The result is twin boys, Zerah (Z) and Perez (P) (Gen. 38:27), who are accepted as Judah's sons.
Biblically, this story is important because Judah became the founder of the Tribe of Judah, one of the twelve Tribes of Israel. Their land encompassed most of the southern portion of the Land of Israel, including Jerusalem. Both the Book of Ruth and the Gospel of Matthew identify Tamar's son Perez as an ancestor of King David, which makes Judah and Tamar also ancestors of Jesus.

For our purposes here, though, the interesting thing is the confusion caused by trying to add the two marriage relationships to the pedigree. These are in no way distinguished visually from the paternal and fraternal relationships, although the circled text does specify the relationship in words. Today, we solve this potential confusion by using horizontal lines for marriage relationships and vertical lines for parent-offspring relationships.
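The modern distinction between marriage and parent-offspring lines can also be made explicit in data rather than in line direction. The following is a minimal sketch in Python, assuming a simple list of labelled edges; the tuple layout and the function names `parents_of` and `spouses_of` are hypothetical choices for illustration, and the edges simply restate the Genesis 38 summary given above.

```python
# Relationship labels follow the annotations used in this post:
# "M" for marriage and "P" for parent-offspring.
# The (person, person, label) layout is an assumption for illustration only.
EDGES = [
    ("Judah", "Shua's daughter", "M"),
    ("Judah", "Er", "P"),
    ("Judah", "Onan", "P"),
    ("Judah", "Shelah", "P"),
    ("Shua's daughter", "Er", "P"),
    ("Shua's daughter", "Onan", "P"),
    ("Shua's daughter", "Shelah", "P"),
    ("Er", "Tamar", "M"),
    ("Onan", "Tamar", "M"),
    ("Judah", "Perez", "P"),
    ("Judah", "Zerah", "P"),
    ("Tamar", "Perez", "P"),
    ("Tamar", "Zerah", "P"),
]


def parents_of(person, edges=EDGES):
    """Return the sorted parents of a person, reading only the P edges."""
    return sorted(src for src, dst, kind in edges if kind == "P" and dst == person)


def spouses_of(person, edges=EDGES):
    """Return the sorted spouses of a person, reading only the M edges."""
    found = set()
    for a, b, kind in edges:
        if kind != "M":
            continue
        if a == person:
            found.add(b)
        elif b == person:
            found.add(a)
    return sorted(found)


print(parents_of("Perez"))
print(spouses_of("Tamar"))
```

Because the two relationship types are stored separately, a query for the parents of Perez is never confused by Tamar's marriages to Er and Onan, which is exactly the ambiguity that the undifferentiated lines of the Stemma leave open.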

Equally importantly, note that Tamar's (legal) relationship supplants the (biological) parent-offspring relationship between Judah and her sons — you would never conclude from the diagram that Perez was Judah's son, for example, rather than Er's. However, note the neat attempt to keep Tamar's children in a single column by putting one twin above her and one below (perhaps also signifying simultaneous birth).

The above part of this post was inspired by a blog post from Jean-Baptiste Piggin (The Tamar Storyboard). The first picture above is from an unnamed manuscript in the Biblioteca Medicea Laurenziana, Florence, Plut.20.54, dated c. 1050 AD. The second picture is from an unnamed manuscript in the Pierpont Morgan Library, New York, M.644, dated 940-945 AD.

Moving on, the scribes of that time tried to go even further in complicating simple genealogies, as shown in the next figure. This is drawn by Stephanus Garsia Placidus, and is taken from the Saint-Sever Beatus in the Bibliothèque Nationale de France, Paris, ms. lat 8878, dated c. 1060 AD.


It shows the non-Semitic (ie. polytheistic) part of Noah's family. Noah is at the top right (sacrificing two doves), with his son Japheth (J) to the left and son Ham (H) below. Their wives (W) are indicated by intersecting circles, rather than by lines, which is a more successful approach than in the Stemma. Their descendants are shown in roughly the same style as above, with the first-born son followed by the later ones in order (so that the P and B relationships are not clearly distinguished) — Japheth has seven sons and Ham has four.

However, the illustrator has also tried to include a lot of history in this genealogy. For example, the sons of Ham's son Cush end with Nimrod (N), who has a small essay attached to his name. Among other things, he founded Babel, the city that plays an important role later in the Bible. Moreover, the sons of Ham's son Canaan (C) are shown as a reticulating network rather than as a simple chain. This apparently represents their roles as founders of the 11 tribes who originally occupied the ancient Land of Canaan, and who were later driven out and enslaved by the Israelites. These lines thus represent later history rather than parental or fraternal relationships.

This diagram is thus not a simple pedigree of the kind that we would usually draw today.

Monday, November 17, 2014

The first infographic was a genealogy (c. 400 AD)


The New Testament was originally written in Greek, and it apparently did not occur to the writers that a visualization of the many (and lengthy) Biblical genealogies would be helpful. They knew a lot about geometry but nothing about infographics.

Given the importance of the Old and New Testament genealogies for the foundation of Christianity (see The role of biblical genealogies in phylogenetics), it is not at all surprising that eventually someone had a go at summarizing them all in one place. However, this did not happen until several centuries later, when the Bible was being translated into Latin. Perhaps this delay had something to do with the biblical prohibition on images.

The first known attempt to draw a biblical pedigree, rather than writing out the relationships as text, also appears to have been the first attempt at a genealogy of any sort. Jean-Baptiste Piggin has been researching this document since 2009, and he has remarkably extensive notes about it at his web site Macro-Typography. Piggin dates the document to sometime in the decades before 427 AD, which is surprisingly early and thus unique in its historical context (Late Antiquity).

Importantly, the pedigree is actually an infographic in the modern sense, in that the figure itself conveys almost all of the information, with the text acting as a supplement. Thus, a single image allows the viewer to grasp the overview (of biblical history in this case), as well as providing access to the details. This is an idea that did not really catch on until the Medieval period, when Latin manuscripts started to use images as pedagogic devices, in addition to their textual descriptions. An obvious example is the so-called Tree of Porphyry in logic, which was first described in words by Porphyry of Tyre in c. 270 AD (Isagoge), sketched by Boëthius c. 520 AD (In Porphyrium Commentariorum), and finally reproduced as an actual tree diagram in Medieval manuscripts (being named arbor Porphyrii by Petrus Hispanus in 1240, in Summulae Logicales).


Sadly, there is no extant copy of this early biblical pedigree, and so we do not know who produced it or exactly when; nor do we have any of the copies made during the following 500 years. We do, however, have 24 complete or partial copies from the period 950-1250, many of them incorporated into Spanish editions of the Bible. Piggin has studied these copies extensively, and tried to reconstruct what he thinks the original document most probably looked like.

Piggin reconstructs the document (shown above), which he calls the Great Stemma, as a single scroll made from papyrus, designed to be unrolled and read from the upper left towards the middle right. All extant copies, however, break the figure up into sections, for inclusion as pages in a parchment manuscript (a codex) typical of the Medieval period.

Reconstruction was not an easy task, given the later modifications, digressions and embellishments made with each successive hand-drawn copy. In particular, the process of reducing the long scroll to sequential pages apparently introduced many errors; and subsequent modifications degraded the logic of the original intention. Incidentally, embellishments do not improve the communication of information (see Mistaken improvements), and nor necessarily do modifications, since in this case they often created contradictions.

Above is a schematic overview of the reconstructed original scroll, but you can zoom in to all of the details by visiting Piggin's original reconstruction. Each circle represents one person (out of 540), with connecting lines showing their genealogical relationships — marriage, parent-offspring or brotherly (these are inter-mixed). Time is read left to right along the top (Adam is at the top-left), with vertical excursions downwards for lineages that do not lead to Jesus (who is at the middle-right). Note that the pedigree is drawn using nodes and lines, as we still do, but it is not drawn anything like a tree (ie. a "family tree"). Indeed, it is actually a network, since two ancestral lineages converge on Jesus (via Joseph and Mary), and elsewhere there are 13 simultaneous appearances of the same person in two places (to avoid complicated connections; see Reducing networks to trees).

The diagram also has a distinct timeline superimposed, shown as the elements without circles, which attempts to synchronize biblical events with contemporaneous secular history. So, Piggin notes that the Stemma is "not just a genealogy, but a graphic version of the universal chronicles which attempted in antiquity to cross reference the histories of different civilizations to establish an overview of Middle Eastern and Graeco-Roman history." However, the timeline is not calibrated in any way (ie. time changes are not constant).

[Note: There is an update about the reconstruction in this blog post: The origin of an idea: reducing networks to trees]

Below, I have included pages from some of the extant manuscripts, to show their variety after more than 500 years of scribes making copies.


The above figure is the first page from the Roda Codex, in the Real Academia de la Historia (Madrid) cod.78 (dated 990 AD). This is the start of the genealogy, with Adam at the top-left, and illustrating his family.


The above figure is the third page from an unnamed manuscript in the Pierpont Morgan Library (New York) M.644 (dated 940-945 AD). This one shows Noah and his non-Semite descendants.


The above figure is the final page from an unnamed manuscript in the Plutei collection at the Biblioteca Medicea Laurenziana (Florence) Plut.20.54 (dated 1050 AD). This shows the incarnation of Jesus, at the end of the genealogy, illustrating the confluence of the lineages described by Matthew (at the top) and Luke (at the bottom).

Piggin notes that there may actually have been few early copies of the Stemma, because of the difficulty of transcribing illustrations by hand. That is, it is very difficult to accurately hand-copy a diagram, as opposed to copying text (where only the words matter, not their visual style). Indeed, to what extent did the scribes actually understand that they needed a precise copy? Copying complex technical drawings requires careful measurement and layout, and yet some of the copies seem to have been very badly planned. Piggin suggests that "the serious corruption done to the Great Stemma early in its diffusion led to it ultimately being discarded and begun all over again by medieval writers such as Peter of Poitiers." The reference is to the Compendium Historiae in Genealogia Christi by Petrus Pictaviensis (Peter of Poitiers) produced in c.1185 AD, and for which there are many extant copies dated from that time to 1650 AD; he used long rolls for his genealogies.

Finally, Piggin even has a suggestion for a small ancient board game that might have provided inspiration for the form of the infographic (see Board Game). This is important, because there are no known prior models for constructing such a diagram — apart from geometry, no-one had previously produced an image that illustrated non-corporeal ideas.


Footnote: The word stemma referred originally to an ancient Roman genealogy (displayed in noble homes), which is roughly how it is used by Piggin. However, these days the word is more commonly used in anthropology to refer to a genealogy of manuscript copies. A genealogy of manuscripts is more properly called a stemma codicum.

Wednesday, October 22, 2014

Is phylogenomics tree-like?


Phylogenomics, the idea of applying genomic data to phylogenetic studies, has been around for quite a while now (Eisen 1998), although it was probably Rokas et al. (2003) who drew the first widespread attention among phylogeneticists. Molecular phylogenetics started off using the sequence of a single locus (often small-subunit rRNA) as the data, and slowly progressed from there to multiple loci. Currently, it is considered good practice to use half-a-dozen loci, sampling the main genomes (nucleus, mitochondrion, plastid); and genomics offers the possibility of a fast and cost-effective means of generating large amounts of multi-locus sequence data.


Review papers are beginning to appear based explicitly on next-generation sequencing (NGS), such as those of Lemmon & Lemmon (2013) and McCormack et al. (2013), replacing the earlier work of Philippe et al. (2005), and there are suggestions for how phylogenetic analyses might need to change in response to NGS data (Chan and Ragan 2013). These all treat phylogenomics as being very similar to traditional molecular phylogenetics, in the sense that many people are expecting phylogenomics to provide tree-like resolution of questions that remain unresolved with the current smaller datasets. In the words of Rokas et al. (2003), phylogenomics is intent on "resolving incongruence in molecular phylogenies". That is, incongruent gene trees are seen as the major obstacle to be overcome by phylogenetic data analysis (see also Jeffroy et al. 2006).

However, this might be a naive expectation. After all, the existing phylogenetic conflicts are there for a reason. If we cannot resolve certain parts of organismal history in terms of a phylogenetic tree when we use the current levels of multi-locus data (say <10 loci), then there is no real reason to think that this will happen just because we increase the number of loci. There are plenty of other reasons for incongruence among genes, the most obvious one being that the history is not tree-like in the first place. The advantage of phylogenomics, then, would be its ability to clarify the phylogenetic history rather than to resolve incongruence.

There are now quite a few published empirical phylogenomic studies, which allows us to provide a preliminary answer to the question about whether phylogenomic patterns are tree-like or not. There are a few published studies where the authors claim resolution in terms of a tree, at least for part of their phylogeny (e.g. Wang et al. 2012), but it seems to me that there are far more studies where the incongruence remains even with genomic data. Below, I briefly introduce a few arbitrarily chosen examples.

So, complex genealogical problems often remain complex even after using genomic data. We haven’t "solved" any of the so-called genealogy problems, we have simply made clear in what way they are complex. That is, genomics data generally reveal reticulate evolutionary histories, not simple tree-like ones.

This leads me to conclude that phylogenomics is about reticulate evolution, and it is thus time for phylogeneticists to abandon trees as a model for genealogies. We have probably already resolved most of the simple tree-like genealogical patterns, using non-genomic data, and from here on we will be using genomics to study gene flow in addition to parental gene inheritance.

Examples

(1) Galtier and Daubin (2008) were among the earliest researchers to try to "deal with incongruence in phylogenomic analyses", and one of their examples was the long-standing problem of deciphering the relationships among the closest relatives of humans. However, the genomics data make it clear that, while humans share slightly more genes with chimpanzees than with other great apes, we still share some with gorillas but not chimpanzees, and with orangutans but not either chimpanzees or gorillas. Also, chimpanzees share some genes with gorillas that we do not share. The situation is now clearer, but the tree incongruence remains.


(2) At the same time, Kuo et al. (2008) looked at the then-available genomes for members of the Apicomplexa, which are unicellular eukaryotic parasites. The genomic data confirmed the current groupings of Haemosporidians, Piroplasmids and Coccidians (shown as branches with high support in the diagram) but completely failed to resolve the relationships between these groups (shown as branches with low support). Things are no better today, when we have at least some data for 35 genomes.


(3) The relationships among mammal superorders, particularly the placentals, have been an ongoing area of debate. I have already covered this in some previous blog posts, notably Conflicting placental roots: network or tree? and Why are there conflicting placental roots? There are three possible ways of resolving a tree at the root of the placental phylogeny, and genomic datasets seem to support all three of them; the different published trees therefore result from variation in the model used for data analysis. As Hallström and Janke (2010) have noted, there was probably incomplete lineage sorting and hybridization in the early placental mammalian divergences, rather than a truly tree-like history.

(4) Dell'Ampio et al. (2014) looked at the phylogenetic relationships of the wingless insects, and tried to come to grips with the incongruence among genes. They considered three main tree-based hypotheses for the relationships, and found that genomic support was pretty evenly spread among the three topologies. They dryly note that after their hard work the relationships "are still considered unresolved."


(5) Relationships among hominids have been a popular subject of study for many years, and not unexpectedly there has been a burst of activity as a result of genomic data, especially as there are now SNP micro-arrays available to simplify the data collection. I have covered this in previous posts, as well, notably Why do we still use trees for the Neandertal genealogy? The bottom line is that the genomic data provide evidence of extensive introgression (or admixture) between humans and their nearest relatives throughout their time of co-existence. This example is from Reich et al. (2011).


References

Chan CX, Ragan MA (2013) Next-generation phylogenomics. Biology Direct 8: 3.

Dell'Ampio E, Meusemann K, Szucsich NU, Peters RS, Meyer B, Borner J, Petersen M, Aberer AJ, Stamatakis A, Walzl MG, Minh BQ, von Haeseler A, Ebersberger I, Pass G, Misof B (2014) Decisive data sets in phylogenomics: lessons from studies on the phylogenetic relationships of primarily wingless insects. Molecular Biology and Evolution 31: 239-249.

Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research 8: 163-167.

Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 363: 4023-4029.

Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology and Evolution 27: 2804-2816.

Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends in Genetics 22: 225-231.

Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology and Evolution 25: 2689-2698.

Lemmon EM, Lemmon AR (2013) High-throughput genomic data in systematics and phylogenetics. Annual Review of Ecology, Evolution, and Systematics 44: 99-121.

McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT (2013) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution 66: 526-538.

Philippe H, Delsuc F, Brinkmann H, Lartillot N (2005) Phylogenomics. Annual Review of Ecology, Evolution, and Systematics 36: 541-562.

Reich D, Patterson N, Kircher M, Delfin F, Nandineni MR, Pugach I, Ko AM, Ko Y-C, Jinam TA, Phipps ME, Saitou N, Wollstein A, Kayser M, Pääbo S, Stoneking M (2011) Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. American Journal of Human Genetics 89: 516-528.

Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425: 798-804.

Wang N, Braun EL, Kimball RT (2012) Testing hypotheses about the sister group of the Passeriformes using an independent 30 locus dataset. Molecular Biology and Evolution 29: 737-750.

Wednesday, October 8, 2014

Thoroughbred horses and reticulate pedigrees


I noted recently that the best documented human genealogies are those for the various Anabaptist populations (including the Mennonites, Hutterites and Amish) (The importance of the Amish for reticulate genealogies). They have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. This makes them ideal for genealogical studies involving reticulation, including being a source of "known" reticulate histories for testing network algorithms.

If we move outside of Homo sapiens then a genealogy that is equally well documented (if not better) is that of English Thoroughbred horses. This breed was developed as a result of the enthusiasm of the British aristocracy for racing in the 17th century. Thoroughbred pedigree records are regarded as the most comprehensive records detailing ancestral relationships among domestic animal breeds, and they have been formally catalogued since the appearance of the first edition of the General Stud Book in 1791.


As noted by Binns et al. (2012):
The Thoroughbred horse breed was established in England in the early 1700s based on crosses between stallions of Arabian origin and indigenous mares. The founder population was small, with all current males tracing back to one of three stallions, the Godolphin Arabian, the Byerley Turk and the Darley Arabian; in contrast, on the female side, about 70 foundation mares have been identified. A stud book for Thoroughbred horses was initiated in 1791, and pedigree records for the breed, which now number about five hundred thousand horses, are maintained by Thoroughbred registries worldwide.
For the males, the story is continued by Bower et al. (2012):
All living Thoroughbreds trace paternally to just three stallions imported into England in the late 17th and early 18th centuries: Byerley Turk (1680s), Darley Arabian (1704) and Godolphin Arabian (1729). Furthermore, a small number of stallions exerted disproportionate influence on early Classic races resulting in their greater popularity at stud. Therefore, the Thoroughbred gene pool has been restricted by small foundation stock and subsequent limited paternal contributions as a result of sire preference and selection. [Our] historic samples were related largely via the Darley Arabian sire line to which 95% of all living Thoroughbreds can be traced in their paternal lineage.
Actually, 95% of living Thoroughbreds trace their male lineage to Eclipse (1764), a great-great grandson of the Darley Arabian, so that it is Eclipse who appears as the progenitor in most published genealogies (eg. see the one below). Information about these early males is available at this Thoroughbred Heritage page.

Females have been of less interest to horse breeders, and so in many cases we do not know who they were, and in many others we have only a generic name (eg. "Miss Darcy's pet mare", "old Montagu mare", "royal mare", etc). This means that in modern horses there is a high level of mtDNA diversity due to multiple female lineages but there is very little sequence diversity on the Y chromosome (Wallner et al. 2013). Nevertheless, Hill et al. (2002) have tried to trace the influence of the early females on current genotypes, singling out 19 of them as having large influence (on the mitochondrial genealogy), while Bower et al. (2011) provide a broader analysis. Information about these early females is available at this Thoroughbred Heritage page.

The relevance of this information for genealogy studies is that it tells us the Thoroughbred genealogy is effectively closed (little outside breeding), and it is thoroughly documented. This is potentially another source of known reticulate genealogies.

Of particular interest to horse breeders is inbreeding (see Binns et al. 2012). In suitable doses this is seen as a Good Thing, because it can produce the homozygous appearance of desirable racing characteristics. However, inbreeding should not be too recent. For example, if we look at the list of the Blood-Horse Top 100 Thoroughbreds of the 20th Century then none of them have inbreeding in the previous generation and only one has inbreeding in the one before that. However, 54% of the horses have inbreeding in the fourth ancestral generation, and 18% in each of the third and fifth generations. Only 9 horses had no inbreeding during the five previous generations.

For this reason, the standard version of horse genealogies only goes back five generations. This is the stage at which the inbreeding coefficient becomes <1% — inbreeding earlier than five generations has no practical effect on homozygosity. There are potentially 32 ancestors in the 5th generation, contributing 1/32=3% of the DNA on average. This inbreeding is of interest to us because it creates extensive reticulation in horse genealogies.
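The arithmetic behind the figure of 32 ancestors can be laid out for each generation. This is a minimal sketch in Python, assuming only that every horse has two parents and that no ancestor is repeated; the function names are hypothetical, and the sketch covers the ancestor counts only, not the inbreeding coefficient itself.

```python
def ancestors_in_generation(g):
    """Maximum number of ancestors g generations back (two parents each)."""
    return 2 ** g


def expected_share(g):
    """Average fraction of DNA contributed by one ancestor g generations back."""
    return 1 / ancestors_in_generation(g)


# Tabulate the five generations shown in a standard horse pedigree.
for g in range(1, 6):
    print(g, ancestors_in_generation(g), f"{expected_share(g):.1%}")
```

For the fifth generation this gives 32 potential ancestors, each contributing 1/32 of the DNA on average, which is the roughly 3% quoted above.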

Pedigree data are readily available at sites like Pedigree Online. Pedigrees are usually drawn as treemaps (see the blog post Trees, treemaps and networks) with horses being repeated as often as necessary to be able to draw the network as a tree (see the blog post Reducing networks to trees). Here is a typical example, for the horse Maddox, without recent inbreeding. Males are in blue and females pink, with the parents at the left and their ancestors proceeding to the right.


Here is an example, for the horse Induna Mkubwa, with inbreeding in the 3rd+4th ancestral generations (highlighted in purple) and also in the 4th+5th generations (in green). Note that the horse Be My Chief is also inbred, in his 4th ancestral generation (in green).


Clearly, this second genealogy should more properly be drawn as a reticulating network. Once this sort of thing is done the reticulations become obvious. Here is an example network for the horse known as Roberto. The horses are numbered in the manner conventional for human pedigrees, with the males on the left of each pair. This is about as complex as it gets for these horses; and you will note that there are only two-thirds of the "expected" number of ancestors.
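The shortfall in ancestors can be counted directly from pedigree records. Here is a minimal sketch in Python, assuming the pedigree is stored as a mapping from each horse to its (sire, dam) pair; the function name `ancestor_counts` and the toy pedigree are hypothetical, and the toy data are not Roberto's actual pedigree.

```python
def ancestor_counts(pedigree, individual, generations):
    """Return (distinct, expected) ancestor counts over the given generations.

    `expected` assumes two new ancestors per individual in every generation,
    so any repeated ancestor makes `distinct` smaller than `expected`.
    """
    current = [individual]
    distinct = set()
    for _ in range(generations):
        next_generation = []
        for animal in current:
            # Animals with unrecorded parents simply contribute nothing.
            for parent in pedigree.get(animal, ()):
                next_generation.append(parent)
                distinct.add(parent)
        current = next_generation
    expected = 2 ** (generations + 1) - 2
    return len(distinct), expected


# Hypothetical toy pedigree: the sire A and the dam B of X share the sire C.
TOY_PEDIGREE = {
    "X": ("A", "B"),
    "A": ("C", "D"),
    "B": ("C", "E"),
}

print(ancestor_counts(TOY_PEDIGREE, "X", 2))
```

In the toy case the shared grandsire C is counted once, so there are five distinct ancestors where six would be expected without inbreeding. Each such repeat is a reticulation when the pedigree is drawn as a network instead of a tree.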


Finally, here is an example network from the paper by Bower et al. (2012), covering a longer time period but restricted to selected male horses (ie. the female lineages that lead to the reticulation are not named).


Thanks to Induna Mkubwa for the idea for this post.

References

Binns MM, Boehler DA, Bailey E, Lear TL, Cardwell JM, Lambert DH (2012) Inbreeding in the Thoroughbred horse. Animal Genetics 43: 340-342.

Bower MA, Campana MG, Whitten M, Edwards CJ, Jones H, Barrett E, Cassidy R, Nisbet RE, Hill EW, Howe CJ, Binns M (2011) The cosmopolitan maternal heritage of the Thoroughbred racehorse breed shows a significant contribution from British and Irish native mares. Biology Letters 7: 316-320.

Bower MA, McGivney BA, Campana MG, Gu J, Andersson LS, Barrett E, Davis CR, Mikko S, Stock F, Voronkova V, Bradley DG, Fahey AG, Lindgren G, MacHugh DE, Sulimova G, Hill EW (2012) The genetic origin and history of speed in the Thoroughbred racehorse. Nature Communications 3: 643.

Hill EW, Bradley DG, Al-Barody M, Ertugrul O, Splan RK, Zakharov I, Cunningham EP (2002) History and integrity of thoroughbred dam lines revealed in equine mtDNA variation. Animal Genetics 33: 287-294.

Wallner B, Vogl C, Shukla P, Burgstaller JP, Druml T, Brem G (2013) Identification of genetic variation on the horse Y chromosome and the tracing of male founder lineages in modern breeds. PLoS One 8: e60015.

Wednesday, October 1, 2014

A fundamental limitation of pedigrees and networks but not trees


It would be nice to think that genealogical history can be reconstructed with ease. However, this is known not to be so. In particular, being able to reconstruct an overall history from a collection of sub-histories, which can be thought of as the "building blocks", is not necessarily guaranteed.

That is, even given a complete collection of all of the sub-histories it is not necessarily possible to reconstruct a unique overall history. In other words, there can be pairs of graphs that do not represent the same evolutionary histories, but still display exactly the same collection of building blocks. ("Display" means roughly that a building block can be obtained by simply deleting some of the edges and vertices in the graph.) Mathematically, the sub-histories do not determine (or encode) the history.


For example, it is known that pedigrees cannot necessarily be reconstructed from a collection of all of the sub-pedigrees (Thatte 2008). Pedigrees are the traditional "family trees" showing the ancestry of individuals. Pedigrees differ from phylogenies in that all of the individuals have two parents (rather than possibly having a single immediate ancestor) and there are probably multiple roots (unless there is considerable inbreeding).

Phylogenetic trees, on the other hand, can be uniquely reconstructed from a collection of all of the possible sub-trees (see Dress et al. 2012). This is one of the things that makes trees valuable as a phylogenetic model: it is theoretically possible to collect enough information to construct a unique phylogenetic tree.

Rooted phylogenetic networks do not, however, share this property. For some time it has been known that networks cannot necessarily be built from their building blocks, whether those blocks are rooted trees (Willson 2011) or triplets (= rooted 3-taxon trees) or clusters (= rooted sub-trees = clades) (Gambette and Huber 2012).

This is illustrated in the next figure (adapted from Huber et al.), which shows two networks at the top and below that the four trees that are displayed by both of them (by deleting one of each pair of incoming edges at the two reticulation nodes). Given these four trees we cannot reconstruct a unique network, and yet they are the only four trees associated with either network.
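The idea of a network "displaying" a tree can be made concrete with a minimal Python sketch. It keeps exactly one incoming edge at every reticulation node and lists the resulting edge sets. The network below, with node names such as `p`, `q`, `h1` and `h2`, is a made-up example and is not one of the networks of Huber et al. The sketch also stops short of tidying the results: dangling branches and nodes with a single child are not removed, so two edge sets that differ here could still correspond to the same tree once cleaned up.

```python
from itertools import product


def displayed_trees(edges):
    """Return one edge set per way of keeping a single incoming edge
    at every reticulation node (a node with more than one parent)."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, []).append(parent)
    reticulations = sorted(c for c, ps in parents.items() if len(ps) > 1)

    results = []
    # Each element of the product picks the surviving parent of each reticulation.
    for choice in product(*(parents[r] for r in reticulations)):
        kept = dict(zip(reticulations, choice))
        results.append(frozenset(
            (p, c) for p, c in edges if c not in kept or kept[c] == p
        ))
    return results


# Hypothetical rooted network: h1 and h2 each have the two parents p and q.
network = [
    ("root", "u"), ("root", "v"),
    ("u", "a"), ("u", "p"),
    ("v", "q"), ("v", "d"),
    ("p", "h1"), ("q", "h1"),
    ("p", "h2"), ("q", "h2"),
    ("h1", "b"), ("h2", "c"),
]

trees = displayed_trees(network)
print(len(trees))  # two reticulations with two parents each
```

With two reticulation nodes there are 2 x 2 choices, which matches the four trees per network mentioned above. The point of the post is that running this in reverse, from the trees back to a single network, is not always possible.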


To make matters worse, Huber et al. (in press) have now revealed that we can't reconstruct rooted phylogenetic networks even from sub-networks. To do this they show that networks cannot necessarily be built from trinets (= rooted 3-taxon networks). Certain types of networks (e.g. level-1, level-2, tree-child) can be reconstructed (van Iersel and Moulton 2014), but Huber et al. show the example in the second figure, which shows two networks at the top and below that the four trinets that are displayed by both of them. Given these four trinets we cannot reconstruct a unique network, and yet they are the only four trinets associated with either network.


This means that "even if all of the building blocks for some reticulate evolutionary history were to be taken as the input for any given network building method, the method might still output an incorrect history." The best analogy here is Humpty Dumpty — even given all of the pieces, we literally might not be able to put him back together again. We could if he is a rooted tree, but we cannot guarantee it if he is a rooted network or pedigree.

This may not matter in practice, given that we don't yet know the circumstances under which it is possible to uniquely reconstruct networks, but it does mean that we acquire a certain degree of uncertainty as we move from "tree thinking" to "network thinking".

References

Dress A, Huber KT, Koolen J, Moulton V, Spillner A (2012) Basic Phylogenetic Combinatorics. Cambridge University Press.

Gambette P, Huber K (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology 65: 157-180.

Huber KT, van Iersel L, Moulton V, Wu T (in press) How much information is needed to infer reticulate evolutionary histories? Systematic Biology.

van Iersel L, Moulton V (2014) Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology 68: 1707-1729.

Thatte BD (2008) Combinatorics of pedigrees I: counterexamples to a reconstruction problem. SIAM Journal on Discrete Mathematics 22: 961-970.

Willson SJ (2011) Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 785-796.

Monday, September 29, 2014

Goofy genealogies


Family pedigrees seem to be confusing things, because there are two distinct interpretations of the expression "family tree".

First, the pedigree tree could be drawn with a particular contemporary person at the root of the tree, so that the tree expands backwards in time to increasing numbers of ancestors at the leaves (ie. an "ascent tree"). In some ways this seems quite illogical as an analogy, given that the base of a real tree is the origin of its growth.


Second, the pedigree tree could be drawn with a particular ancestor at the root of the tree, so that the tree expands forwards in time to increasing numbers of descendants at the leaves (ie. a "descent tree"). This is more logical, although we often draw the root at the top. (The following example is actually a network, rather than strictly a tree; see also Pedigrees and phylogenies are networks not trees.)


Pedigrees are generally somewhat different from phylogenies, but in phylogenetics we do choose the latter option for interpreting trees — we start with a collection of contemporary leaves and try to reconstruct the tree backwards towards the common ancestor. Thus the root is at the "base" of the tree, even when we draw the root at the top of the diagram.

In popular usage these distinctions are often blurred. Consider this "family tree" of the Disney character Goofy. It is taken from Gilles R. Maurice's Calisota web page, where the character names are listed clearly.


This is based on the first usage described above, since Goofy himself is at the base and his ancestors are at the leaves. This is actually closer to a lineage than to a tree, especially as no females seem to be involved at any stage.

However, roughly the same information can be presented the other way around. This cartoon is taken from a different Calisota page.


Here, Goofy is now at the top of the tree and his ancestry proceeds downwards, with the oldest ancestor at the base (except for his son!). This really is confusing.

Monday, September 22, 2014

Reducing networks to trees


I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.


However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.






Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.
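The name-repetition trick is easy to quantify. The following Python sketch counts how often each name has to be written when a reticulate genealogy is forced into a tree: once for the person's birth (if their parents are recorded) and once for every partnership. The data structure and the function name are my own assumptions, and the three partnerships listed are only a small illustrative subset of the Olympian genealogy, not the full diagram.

```python
from collections import Counter


def names_written_in_tree(unions):
    """Count the copies of each name needed when every partnership is
    drawn separately, so that no node has reticulating connections.

    `unions` maps a (parent, parent) pair to the list of their children.
    """
    copies = Counter()
    for (parent_a, parent_b), children in unions.items():
        copies[parent_a] += 1      # one copy per partnership
        copies[parent_b] += 1
        for child in children:
            copies[child] += 1     # one copy for the birth
    return copies


# Illustrative subset only: Zeus and two of his sisters.
unions = {
    ("Cronus", "Rhea"): ["Zeus", "Hera", "Demeter"],
    ("Zeus", "Hera"): ["Ares"],
    ("Zeus", "Demeter"): ["Persephone"],
}

copies = names_written_in_tree(unions)
print(copies["Zeus"], copies["Hera"], copies["Ares"])
```

In a network drawing every name appears exactly once, so any count above one in this tally marks a place where the "tree" is hiding a reticulation.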

Wednesday, September 10, 2014

The importance of the Amish for reticulate genealogies


I noted in my previous blog post (Charles Darwin and the coalescent) that the multispecies coalescent needs to be based on a network model not a tree model. This is because reticulation processes occur both within species and between species — there is gene flow within genealogies and within phylogenies.

Reticulate genealogies are nothing new, and I have blogged about some of the best-known human genealogies with reticulations due to consanguinity (marriage between close relatives):
King Charles II of Spain
Charles Darwin
Henri Toulouse-Lautrec
Albert Einstein
Pharaoh Tutankhamun
Pharaoh Cleopatra

Importantly, in the modern world there are quite a few genealogical datasets available for study. For example, the Kinsources repository has c. 100 datasets from around the world, covering multi-generational histories for nearly 350,000 individuals. These data are actively used for research (eg. Bailey et al. 2014).

However, the best documented human genealogies are those for the various Anabaptist populations, who moved from Europe to North America during the 18th and 19th centuries. Anabaptists have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. These populations include the Mennonites, Hutterites and Amish, the latter being the best known.

As noted by Agarwala et al. (2001):
The term "Anabaptist" literally means "rebaptizer" and is used to refer to a Christian movement that arose in central Europe in the first half of the 16th century. Adherents support adult baptism, pacifism, and separation of church and state. Among the large Anabaptist groups existing today are Mennonites (who were originally followers of Menno Simons), Amish (originally followers of Jakob Ammann who split away from the Mennonites at the end of the 17th century), and Hutterites (originally followers of Jakob Hutter). Amish and Mennonites emigrated to North America in multiple waves in the 18th and 19th centuries. The Hutterites began emigrating to the northern and western parts of North America in the late 1800s.
Distribution of Amish settlements in North America
Note the rapid expansion over the past 25 years.

The Mennonites originated in the Swiss Alps, and diffused northward into Germany and the Netherlands. The Dutch/North German Mennonites began the migration to America in the 1680s, followed by a much larger migration of Swiss/South German Mennonites beginning in 1707. The Amish are an early split from the Swiss/South German group that occurred in 1693. There are now at least 200,000 Amish in the eastern United States and eastern Canada (see the map above, taken from here), with the numbers apparently growing rapidly with recently increasing movement westward. There are various subgroups (eg. Old Order Amish, New Order Amish). There are about 1.7 million Mennonites worldwide, with c. 150,000 in the eastern United States and eastern Canada. The genealogies of 295,000 Mennonite and Amish individuals from the eastern USA have been databased (Agarwala et al. 2001).

The Hutterites originated as an Anabaptist offshoot in the Tyrolean Alps in the 1500s, but now there are c. 135,000 Hutterites living on 1,350 communal farms in the northern United States (principally South Dakota) and western Canada. Genealogical records trace all extant Hutterites to 90 ancestors who lived during the early 1700s to the early 1800s (see Ober et al. 1999).

These Anabaptist groups are frequently used in medical studies, because it is possible to relate disease occurrences to the recorded genealogy, and thus to assess the genetic component of the disease (eg. Dorsten et al. 1999, Hou et al. 2013). So, the literature is replete with figures showing the distribution of different diseases plotted onto the genealogy. I have included some of the Amish ones here, to illustrate the extreme reticulation that results when inbreeding is ongoing over many generations.

This first one is from Georgi et al. (2014). The diseased people are marked in red.


The next one is from Garner et al. (2001).


This one is from Lee et al. (2008).


The final one is from Racette et al. (2002).


Here is one small part of this genealogy, which emphasizes that between-generation marriages are an important component of the consanguinity.


References

Agarwala R, Schaffer A, Tomlin J (2001) Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Human Biology 73: 533-545.

Bailey DH, Hill KR, Walker RS (2014) Fitness consequences of spousal relatedness in 46 small-scale societies. Biology Letters 10: 20140160.

Dorsten L, Hotchkiss L, King T (1999) The effect of inbreeding on early childhood mortality: twelve generations of an Amish settlement. Demography 36: 263-271.

Garner C, McInnes LA, Service SK, Spesny M, Fournier E, Leon P, Freimer NB (2001) Linkage analysis of a complex pedigree with severe bipolar disorder, using a Markov chain Monte Carlo method. American Journal of Human Genetics 68: 1061-1064.

Georgi B, Craig D, Kember RL, Liu W, Lindquist I, Nasser S, Brown C, Egeland JA, Paul SM, Bućan M (2014) Genomic view of bipolar disorder revealed by whole genome sequencing in a genetic isolate. PLoS Genetics 10: e1004229.

Hou L, Faraci G, Chen DT, Kassem L, Schulze TG, Shugart YY, McMahon FJ (2013) Amish revisited: next-generation sequencing studies of psychiatric disorders among the Plain people. Trends in Genetics 29: 412-418.

Lee SL, Murdock DG, McCauley JL, Bradford Y, Crunk A, McFarland L, Jiang L, Wang T, Schnetz-Boutaud N, Haines JL (2008) A genome-wide scan in an Amish pedigree with parkinsonism. Annals of Human Genetics 72: 621-629.

Ober C, Hyslop T, Hauck WW (1999) Inbreeding effects on fertility in humans: evidence for reproductive compensation. American Journal of Human Genetics 64: 225-231.

Racette BA, Rundle M, Wang JC, Goate A, Saccone NL, Farrer M, Lincoln S, Hussey J, Smemo S, Lin J, Suarez B, Parsian A, Perlmutter JS (2002) A multi-incident, Old-Order Amish family with PD. Neurology 58: 568-574.

Monday, September 8, 2014

Inbreeding creates the most complex networks


In an earlier blog post (The ultimate phylogenetic network?) I reproduced the lattice network from the anthropologist Franz Weidenreich. This comes close to being as complex as a network can get when applied to groups of organisms. However, when we study the genealogy of individuals, the network can get much more complex. This will be most true when there are marriages between close relatives (consanguinity), which creates inbreeding.

The family pedigree (or family tree!) shown here is for a group of people in a recently isolated population from the southwestern area of The Netherlands. There are 4,645 people involved, covering 18 generations (one row each). The average number of consanguineous loops for the 103 study individuals is 71.7, which is what is creating all of the cross-connections that make the network look so horrendous. (Consanguineous or inbreeding loops are illustrated here.)
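A consanguineous loop arises whenever the two parents of an individual share an ancestor. The Python sketch below only detects such shared ancestors; it does not reproduce the loop-counting method behind the figure of 71.7 quoted above. The `parents` dictionary and all of the individual labels are hypothetical, chosen to represent a single first-cousin marriage.

```python
def ancestors(person, parents):
    """Return the set of all recorded ancestors of `person`.

    `parents` maps an individual to a (father, mother) tuple;
    founders are simply absent from the mapping.
    """
    found = set()
    stack = list(parents.get(person, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def shared_ancestors(person, parents):
    """Ancestors reachable through both the father and the mother.

    Each parent is included in their own set, so that matings between
    an ancestor and a descendant are also detected.
    """
    if person not in parents:
        return set()
    father, mother = parents[person]
    paternal = ancestors(father, parents) | {father}
    maternal = ancestors(mother, parents) | {mother}
    return paternal & maternal


# Hypothetical pedigree: C1 and C2 are first cousins who have a child.
parents = {
    "S1": ("G1", "G2"),
    "S2": ("G1", "G2"),
    "C1": ("S1", "X1"),
    "C2": ("X2", "S2"),
    "child": ("C1", "C2"),
}

print(sorted(shared_ancestors("child", parents)))
```

An empty result means that the individual sits in a tree-like part of the pedigree. Each shared ancestor closes at least one loop, which is why the cross-connections pile up so quickly in an isolated population.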


The genealogy is from:
Liu F, Arias-Vásquez A, Sleegers K, Aulchenko YS, Kayser M, Sanchez-Juan P, Feng BJ, Bertoli-Avella AM, van Swieten J, Axenovich TI, Heutink P, van Broeckhoven C, Oostra BA, van Duijn CM (2007) A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. American Journal of Human Genetics 81: 17-31.

Wednesday, September 3, 2014

Charles Darwin and the coalescent


The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics); previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species.
However, as specifically emphasized by Reid et al.:
In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance.
They examined this situation by studying the fit of the multispecies coalescent model:
to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ...
Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe.
In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:
The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times.
So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.

Reference

Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.

Wednesday, August 20, 2014

The role of biblical genealogies in phylogenetics


Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.

However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.

If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). However, if we include the rest of the paternal lines of descent then family genealogies can be represented as a tree. If we also include some or all of the maternal lineages, then family genealogies can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)
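The distinction between a chain, a tree and a network comes down to how many parents and children each individual is connected to. Here is a minimal Python sketch of that classification, assuming the genealogy is given as a connected, acyclic list of (parent, child) pairs. The function name and the single-letter labels are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def classify_genealogy(edges):
    """Classify a genealogy given as (parent, child) pairs.

    network: somebody has more than one recorded parent
    tree:    at most one recorded parent each, but somebody has several children
    chain:   at most one recorded parent and one recorded child each
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    if any(len(ps) > 1 for ps in parents.values()):
        return "network"
    if any(len(cs) > 1 for cs in children.values()):
        return "tree"
    return "chain"


# A single line of succession.
print(classify_genealogy([("A", "B"), ("B", "C")]))
# Paternal lines only, with two sons of A.
print(classify_genealogy([("A", "B"), ("A", "C"), ("B", "D")]))
# Both the father F and the mother M of K are recorded.
print(classify_genealogy([("F", "K"), ("M", "K")]))
```

Under this definition a genealogy becomes a network as soon as both parents of anyone are recorded, which is the sense used above. Cousin marriages and the like then add closed loops on top of that.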

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. As with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).

References

Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie Royale.

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische Übersicht über die Arten der Gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10: 75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

Monday, August 4, 2014

A network of cheese rind microorganisms?


Cheese making is about 8,000 years old, and there are now about 1,000 distinct types of cheese throughout the world. As with most ancient crafts, the art of making cheese is to get the microbes to do most of the work for you.

To this end, there has been much interest in the microbial communities that occur in cheese rinds (the bit around the outside). Different communities are expected to be associated with different styles of cheese, since the production process can be quite different. This is shown in the first figure, which emphasizes that much of the difference between cheeses is due to different maturation procedures.

From Wolfe et al. (2014).

Recently, Wolfe BE, Button JE, Santarelli M, and Dutton R (2014. Cheese rind communities provide tractable systems for in situ and in vitro studies of microbial diversity. Cell 158: 422-433) had a look at the dominant genera of bacteria and microfungi in the rind communities of 137 different types of cheese. They don't actually tell us much about which cheeses these were, merely claiming:
We attempted to evenly sample across rind type (24 bloomy rind cheeses, 52 washed rind cheeses, and 61 natural rind cheeses) and geographic regions (87 European cheeses across 9 countries; 50 American cheeses across 13 states from the West Coast to the East Coast). We also attempted to sample across different milk types (77 cow milk, 34 goat milk, 21 sheep milk, and 5 mixed milk) and milk treatments (99 raw milk, 38 pasteurized).
Based on sequencing the bacterial 16S and fungal ITS loci, the authors identified 14 bacterial and 10 fungal genera (moulds and yeasts) that occurred with an average abundance of >1%, as shown in the next figure.

The 137 rind samples with their bacterial (middle row) and fungal (bottom row) genera indicated
by different colours. The order of the samples was determined by UPGMA clustering (top row).

The authors also used shotgun metagenomic sequencing to identify a range of genes in the microorganisms. They present a phylogeny of one particular gene (shown in the next figure) that shows a close relationship between some of the cheese microbes and marine bacteria:
The widespread distribution and high abundance of marine-associated gamma-Proteobacteria, enriched in both washed and bloomy rind cheeses, was an unexpected finding in our survey of taxonomic diversity ... One possible source of these marine microbes is the sea salt used in cheese production.
[Note: the other cheese rind bacterium shown in the phylogeny, Brevibacterium linens, is the one responsible for the unbelievable smell of washed-rind cheeses such as Epoisses, Münster and Limburger. It is also responsible for personal-hygiene issues such as foot odour. You can imagine how it first got into cheese making!]


However, Ropars J, Cruaud C, Lacoste S, and Dupont J (2012. A taxonomic and ecological overview of cheese fungi. International Journal of Food Microbiology 155: 199-210), in a related study, have pointed out the usual problem with microbial phylogenies: gene trees are frequently incongruent. So, the gene phylogeny shown above is not likely to be the species phylogeny. It would thus be of great interest to investigate the full microbial network, rather than looking at a single tree.

Wednesday, July 23, 2014

Evolutionary fitness and incest


I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.
The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.
So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than offset by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.
In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.


In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (e.g. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.
For comparison, the inbreeding F values for these family relationships are:
self: 0.500
parent-offspring = siblings: 0.250
uncle-niece = double first cousins: 0.125
first cousins: 0.063
first cousins once removed: 0.031
second cousins: 0.016
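As a rough check, the F values listed above can be reproduced with Wright's path-counting formula, F = sum of (1/2)^(n1 + n2 + 1) over the common ancestors of the two mates, where n1 and n2 are the numbers of generations from each mate back to that ancestor. The sketch below assumes that the common ancestors are not themselves inbred; the function name and the way the relationships are encoded are illustrative choices, not taken from any of the cited papers.

```python
# Minimal sketch: inbreeding coefficient F of the offspring of two relatives,
# using Wright's path-counting formula and assuming non-inbred common ancestors.

def inbreeding_coefficient(paths):
    """Return F given one (n1, n2) pair per common ancestor of the two mates.

    n1 and n2 are the numbers of generations separating each mate from the
    common ancestor (0 if the mate is that ancestor).
    """
    return sum(0.5 ** (n1 + n2 + 1) for n1, n2 in paths)


# Hypothetical encoding of the relationships listed in the text.
RELATIONSHIPS = {
    "self": [(0, 0)],                          # the individual is its own mate
    "parent-offspring": [(0, 1)],              # the parent is the common ancestor
    "siblings": [(1, 1), (1, 1)],              # shared mother and father
    "uncle-niece": [(1, 2), (1, 2)],           # the uncle's two parents
    "double first cousins": [(2, 2)] * 4,      # all four grandparents shared
    "first cousins": [(2, 2), (2, 2)],         # two shared grandparents
    "first cousins once removed": [(2, 3), (2, 3)],
    "second cousins": [(3, 3), (3, 3)],        # two shared great-grandparents
}

for name, paths in RELATIONSHIPS.items():
    print(f"{name}: {inbreeding_coefficient(paths):.4f}")
```

The exact values for the last three relationships are 0.0625, 0.03125 and 0.015625, which the list gives to three decimal places.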

However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.
This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...
The practice seems to have stopped solely because it was contrary to Roman law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.

References

Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

Thursday, July 3, 2014

Are genotype or phenotype data more tree-like?


I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
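For readers unfamiliar with the measure, the ensemble Retention Index is conventionally computed as RI = (G - S) / (G - M), where, summed over all characters, M is the minimum possible number of steps, S is the number of steps observed on the tree, and G is the maximum possible number of steps on any tree. The sketch below illustrates only that arithmetic, using made-up step counts; it is not the PAUP* implementation, and the function name and example numbers are assumptions for illustration.

```python
# Minimal sketch of the ensemble Retention Index, RI = (G - S) / (G - M).
# The step counts below are invented for illustration, not real data.

def ensemble_retention_index(characters):
    """Return the ensemble RI for a list of (m, s, g) step counts.

    For each character: m = minimum possible steps, s = steps observed on
    the tree, g = maximum possible steps on any tree.
    """
    characters = list(characters)
    total_m = sum(m for m, _, _ in characters)
    total_s = sum(s for _, s, _ in characters)
    total_g = sum(g for _, _, g in characters)
    if total_g == total_m:
        raise ValueError("RI is undefined when no character is informative")
    return (total_g - total_s) / (total_g - total_m)


# Every character fits the tree with its minimum number of steps.
perfect = [(1, 1, 3), (1, 1, 2)]
# A mixture of fully consistent and homoplastic characters.
mixed = [(1, 1, 3), (1, 2, 3), (1, 3, 3)]
# Every character needs its maximum number of steps.
worst = [(1, 3, 3), (1, 2, 2)]

for label, data in [("perfect", perfect), ("mixed", mixed), ("worst", worst)]:
    print(label, ensemble_retention_index(data))
```

Because the sums are taken across characters before dividing, characters with a larger range between G and M carry more weight in the ensemble value than they would in a simple average of per-character indices.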

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?


You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006); the graphed data are taken from the paper, as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBASE (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.