Monday, May 25, 2020

General remarks on rhyming (From rhymes to networks 2)


In this month's post, I want to provide some general remarks on rhyming and rhyme practice. I hope that they will help lay the foundations for tackling the problem of rhyme annotation, in the next post. Ideally, I should provide a maximally unbiased overview that takes all languages and cultures into account. However, since this would be an impossible task at this time (at least for myself), I hope that I can, instead, look at the phenomenon from a viewpoint that is a bit broader than the naive prescriptive accounts of rhyming used by teachers to torture young school kids mentally.

What is a rhyme?

It is not easy to give an exact and exhaustive definition of rhyme. As a starting point, one can have a look at Wikipedia, where we find the following definition:
A rhyme is a repetition of similar sounds (usually, exactly the same sound) in the final stressed syllables and any following syllables of two or more words. Most often, this kind of perfect rhyming is consciously used for effect in the final positions of lines of poems and songs. Wikipedia: s. v. "Rhyme", accessed on 21.05.2020
This definition is a good starting point, but it does not apply to rhyming in general, but rather to rhyming in English as a specific language. While stress, for example, seems to play an important role in English rhyming, we don't find stress being used in a similar way in Chinese, so if we tie a definition of rhyming to stress, we exclude all of those languages in which stress plays a minor role or no role at all.

Furthermore, the notion of similar and identical sounds is also problematic from a cross-linguistic perspective on rhyming. It is true that rhyming requires some degree of similarity of sounds, but where the boundaries are being placed, and how the similarity is defined in the end, can differ from language to language and from tradition to tradition. Thus, while in German poetry it is fine to rhyme words like Mai [mai] and neu [noi], it is questionable whether English speakers would ever think that words like joy could form a rhyme with rye. Irish seems to be an extreme case of very complex rules underlying what counts as a rhyme, where consonants are clustered into certain classes (b, d, g, or ph, f, th, ch) that are defined to rhyme with each other (provided the vowels also rhyme), and as a result, words like oba and foda are judged to be good rhymes (Ó Cuív 1966).

When looking at philological descriptions of rhyme traditions of individual languages, we often find a distinction between perfect rhymes on the one hand and imperfect rhymes on the other. But what counts as perfect or imperfect often differs from language to language. Thus, while French largely accepts the rhyming of words that sound identical, this is considered less satisfactory in English and German, and studies seem to have confirmed that speakers of French and English indeed differ in their intuitions about rhyme in this regard (Wagner and McCurdy 2010).

Peust (2014) discusses rhyme practices across several languages and epochs, suggesting that similarity in rhyming was based on some sort of rhyme phonology, that would account for the differences in rhyme judgments across languages. While the ordinary phonology of a language is a classical device in linguistics to determine those sounds that are perceived as being distinctive in a given language, rhyme phonology can achieve the same for rhyming in individual languages.

While this idea has some appeal at first sight, given that the differences in rhyme practice across languages often follow very specific rules, I am afraid it may be too restrictive. Instead, I rather prefer to see rhyming as a continuum, in which a well-defined core of perfect rhymes is surrounded by various instances of less perfect rhymes, with language-specific patterns of variation that one would still have to compare in detail.

Beyond perfection

If we accept that all languages have some notion of a perfect rhyme that they distinguish from less perfect rhymes, which will, nevertheless, still be accepted as rhymes, it is useful to have a quick look at differences in deviation from the perfect. German, for example, is often used as an example where vowel differences in rhymes are treated rather loosely; and, indeed, we find that diphthongs like the above-mentioned [ai] and [oi] are perceived as rhyming well by most German speakers. In popular songs, however, we find additional deviations from the perceived norm, which are usually not discussed in philological descriptions of German rhyming. Thus, in the famous German Schlager Griechischer Wein by Udo Jürgens (1934-2014), we find the following introductory lines:
Es war schon dunkel, als ich durch Vorstadtstrassen heimwärts ging.
Da war ein Wirtshaus, aus dem das Licht noch auf den Gehsteig schien.
[Translation: It was already dark, when I went through the streets outside of the city. There was a pub which still emitted light that was shining on the street.]
There is no doubt that the artist intended these two lines to rhyme, given that the song as a whole follows a strict AABCCB schema. So, in this particular case, the artist judged that rhyming ging [gɪŋ] with schien [ʃiːn] would be better than not attempting a rhyme at all, and it shows that it is difficult to assume one strict notion of rhyme phonology to guide all of the decisions that humans make when they create poems.

More extreme cases of permissive rhyming can be found in some traditions of English poetry, including Hip Hop (of course), but also the work of Bob Dylan, who does not have a problem rhyming time with fine, used with refused, or own with home, as in Like a Rolling Stone. In Spanish, where we also find a distinction between perfect (rima consonante) and imperfect rhyming (rima asonante), basically all that needs to coincide are the vowels, which allows Silvio Rodriguez to rhyme amór with canción in Te doy una canción.

While most languages coincide on the notion of perfect rhymes (notwithstanding certain differences due to general differences in their phonology), the interesting aspects for rhyming are those where they allow for imperfection. Given that rhyming seems to be something that reflects, at least to some extent, a general linguistic competence of the native speakers, a comparison of the practices across languages and cultures may help to shed light on general questions in linguistics.

Rhyming is linear

When discussing with colleagues the idea of making annotated rhyme corpora, I was repeatedly pointed to the worst cases, which I would never be able to capture. This is typical for linguists, who tend to see the complexities before they see what's simple, and who often prefer to not even try to tackle a problem before they feel they have understood all the sub-problems that could arise from the potential solution they might want to try.

One of the worst cases, when we developed our first annotation format as presented last year (List et al. 2019), was the problem of intransitive rhyming. The idea behind this is that imperfect rhyming may lead to a situation where one word rhymes with a word that follows, and this again rhymes with a word that follows that, but the first and the third would never really rhyme themselves. We find this clearly stated in Zwicky (1976: 677):
Imperfect rhymes can also be linked in a chain: X is rhymed (imperfectly) with Y, and Y with Z, so that X and Z may count as rhymes thanks to the mediation of Y, even when X and Z satisfy neither the feature nor the subsequence principle.
Intransitive rhyming is, indeed, a problem for annotation, since it would require that we think of very complex annotation schemas in which we assign words to individual rhyme chains instead of just assigning them to the same group of rhymes in a poem or a song. However, one thing that I realized afterwards, and which one should never forget, is that rhyming is linear. Rhyming does proceed in a chain. We first hear one line, then we hear another line, etc., so that each line is based on a succession of words that we all hear through time.

It is just as the famous Ferdinand de Saussure (1857-1913) said about the linguistic sign and its material representation, which can be measured in a single dimension ("c'est une ligne", Saussure 1916: 103). Since we perceive poetry and songs in a linear fashion, we should not be surprised that the major attention we give to a rhyme when perceiving it is on those words that are not too far away from each other in their temporal arrangement.
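To make the idea of linear rhyme chains more concrete, here is a minimal sketch in Python. It is not the annotation format discussed in these posts; the function name, the `window` parameter, and the toy rhyme judgments are all assumptions made for illustration. Each line-final word is linked to the nearest preceding word it rhymes with, as long as that word is still within a few lines.

```python
def rhyme_chains(words, rhymes, window=2):
    """Assign each line-final word to a rhyme chain.

    words  : line-final words in the order in which they are heard
    rhymes : hypothetical pairwise judgment, rhymes(a, b) -> bool
    window : number of preceding lines that are still "in earshot"
    """
    chain_of = []   # chain index for each word, in order
    n_chains = 0
    for i, word in enumerate(words):
        linked = None
        # look backwards through the preceding lines, nearest first
        for j in range(i - 1, max(-1, i - 1 - window), -1):
            if rhymes(words[j], word):
                linked = chain_of[j]
                break
        if linked is None:      # no rhyme partner nearby: start a new chain
            linked = n_chains
            n_chains += 1
        chain_of.append(linked)
    return chain_of


# Zwicky's abstract case: X rhymes with Y, Y with Z, but X does not rhyme with Z.
accepted = {frozenset(("X", "Y")), frozenset(("Y", "Z"))}

def toy_rhymes(a, b):
    return frozenset((a, b)) in accepted

print(rhyme_chains(["X", "Y", "Z"], toy_rhymes))   # [0, 0, 0]

# An AABCCB stanza, with made-up labels standing in for the line-final words.
stanza = ["a1", "a2", "b1", "c1", "c2", "b2"]

def same_label(a, b):
    return a[0] == b[0]

print(rhyme_chains(stanza, same_label, window=3))  # [0, 0, 1, 2, 2, 1]
print(rhyme_chains(stanza, same_label, window=2))  # [0, 0, 1, 2, 2, 3]
```

In the first example, X and Z end up in the same chain through the mediation of Y, although they are never compared directly. The last two calls show how much the result depends on how far back the listener is assumed to remember.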

The same holds accordingly for the concrete comparison of words that rhyme: since words are sequences of sounds, the similarity of rhyme words is a similarity of sequences. This means we can make use of the typical methods for automated and computer-assisted sequence comparison in historical linguistics, which have been developed during the past twenty years (see the overview in List 2014), when trying to analyze rhyming across different languages and traditions.
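As a rough illustration of what comparing rhyme words as sequences can mean, the following sketch computes the plain edit distance between two lists of sound segments. This is the textbook Levenshtein algorithm, not one of the specialized methods surveyed in List (2014), and the segmentations of the example words are simplified assumptions based on the transcriptions given above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,           # delete seg_a
                            curr[j - 1] + 1,       # insert seg_b
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


# Rhyme parts (vowel plus what follows) of the German examples above.
mai_neu = edit_distance(["a", "i"], ["o", "i"])        # Mai vs. neu
ging_schien = edit_distance(["ɪ", "ŋ"], ["iː", "n"])   # ging vs. schien
print(mai_neu, ging_schien)  # 1 2
```

A plain edit distance treats every mismatch as equally bad, so [ɪ] versus [iː] costs as much as [ɪ] versus [a]. A more realistic comparison would make substitutions between similar sounds cheaper, which is where language-specific rhyme practice would have to enter.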

Conclusion

When writing this post, I realized that I still feel like I am swimming in an ocean of ignorance when it comes to rhyming and rhyming practices, and how to compare them in a way that takes linguistic aspects into account. I hope that I can make up for this in the follow-up post, where I will introduce my first solutions for a consistent annotation of poetry. By then, I also hope it will become clearer why I give so much importance to the notion of imperfect rhymes, and the emphasis on the linearity of rhyming.

References

Ó Cuív, Brian (1966) The phonetic basis of Classical Modern Irish rhyme. Ériu 20: 94-103.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Peust, Carsten (2014) Parametric variation of end rhyme across languages. In: Grossmann et al. Egyptian-Coptic Linguistics in Typological Perspective. Berlin: Mouton de Gruyter, pp. 341-385.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wagner, M. and McCurdy, K. (2010) Poetic rhyme reflects cross-linguistic differences in information structure. Cognition 117.2: 166-175.

Zwicky, Arnold (1976) Well, This rock and roll has got to stop. Junior’s head is hard as a rock. In: Papers from the Twelfth Regional Meeting of the Chicago Linguistic Society 676-697.

Monday, May 18, 2020

Supernetworks and gene tree incongruence


One of the most interesting side effects of using Big Data in phylogenetics has been the realization that individual gene trees may be wrong as an estimation of the genome phylogeny. In the past, the easy "solution" was just to add more and more gene data until any topological ambiguity disappears, leaving us with a fully resolved, all-inclusive, tree.

However, even with increasing amounts of data, some relationships remain ambiguous. Adding data sometimes reveals primary conflict in the data, eg. because of a reticulate phylogenetic history. With modern-day software, we can quickly infer trees and compare them. Less straightforward is the interpretation of the observed conflict and what to do with it.

In our The Emperor has no clothes blog mini-series [Pt 1 – the mighty matK, Pt 2 – a thicket of trees, Pt 3 – conflict or not], I discussed gene incongruence in the complete angiosperm plastome data, which some people have interpreted as evidence for inter-plastome reticulation (recombination), as well as discussing the role that a single gene may play when inferring trees based on a concatenated, multi-gene data set.

A new analysis

In this new post, I will show what a supernetwork can tell us about gene conflict using the data from a recent (open access and open data) paper investigating deep relationships in (land) plants.
Sousa F, Civáň P, Brazão J, Foster PG, Cox CJ. 2020. The mitochondrial phylogeny of land plants shows support for Setaphyta under composition-heterogeneous substitution models. PeerJ 8:e8995.
This is what the authors state:
Majority-rule consensus trees resulting from the best-fitting Bayesian MCMC analyses of individual genes had low resolution in general, but liverworts were supported (>95% posterior probability (PP)) as the earliest-branching lineage in two genes (nad 3 and nad 5), whereas the mosses were supported as the earliest-branching lineage in one gene (ccm C). All other resolutions of the bryophyte lineages relative to the tracheophyte clade were not statistically supported, and the Setaphyta clade was not resolved in any gene tree.
The authors' objective was to filter the data, leaving those genes and information that can provide a better tree, ie. a less ambiguous one, resolving the deep relationships in land plants. Filtering the additive weak signals in each gene, and countering the problem of codon bias, leads to the fully resolved, unambiguously supported concatenated tree shown below.

Sousa et al.'s fig. 2, a fully resolved (all branches have PP = 1) Bayesian tree inferred from the concatenated 36 gene amino acid data. The novelty is the unambiguous support and recognition of a Setaphyta clade composed of liverworts (turquoise) and mosses (brown).

The zenodo link provides full access to the data used. It includes:
  • the single-gene nucleotide and translated amino acid matrices,
  • the single-gene Bayesian majority rule consensus trees (MRC),
  • a concatenated 36-gene nucleotide and amino acid matrix, plus one cleaned for substitution bias ("codon degenerate"), and the resulting ML trees.
For this post, my objective is to visualize how the genes' information differs using a somewhat under-used (and under-appreciated) method: the supernetwork.

Why use a Supernetwork

Typically, we might just display the Consensus network of the single-gene trees (eg. summarizing branch-lengths or using averaged branch-lengths), and then compare it to the "preferred" concatenated tree (above). But, the data at hand includes missing gene partitions: some gene trees have fewer leaves. Consensus networks need tree samples that all have the same set of tips, while Supernetworks, on the other hand, can handle trees with different tip sets.
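Whether a set of gene trees qualifies for a Consensus network can be checked mechanically. The sketch below assumes, purely for illustration, that each tree is written as nested Python tuples with strings as tip labels; the function names and the toy trees are not taken from the Sousa et al. data.

```python
def tips(tree):
    """Return the set of tip labels of a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= tips(child)
    return labels


def share_all_tips(trees):
    """True if every tree has exactly the same set of tips."""
    tip_sets = [tips(t) for t in trees]
    return all(s == tip_sets[0] for s in tip_sets)


# Toy gene trees with made-up taxon labels.
gene1 = (("A", "B"), ("C", "D"))
gene2 = (("A", "C"), ("B", "D"))
gene3 = (("A", "B"), "C")          # taxon D is missing for this gene

print(share_all_tips([gene1, gene2]))         # True: a Consensus network is possible
print(share_all_tips([gene1, gene2, gene3]))  # False: a Supernetwork is needed
```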

Here is the Supernetwork for our data, using the z-closure algorithm and tree-size averages for edge lengths (Huson et al. 2004, IEEE/ACM Trans. Comput. Biol. Bioinform. 1: 151–158).

Supernetwork of the 36 mitochondrial gene trees. Edge bundles corresponding to branches in Sousa et al.'s fully resolved tree in blue, conflicting edge bundles in red. Gray, systematic groups not represented by according branches (taxon bipartitions) in any gene tree.

This Supernetwork is not based on the MRC provided, but on gene-based ML trees. MRC trees are not strictly phylogenetic trees (Losing information in phylogenetic consensus; Why should we present a Bayesian phylogenetic analyses using networks): they can include, as in this case, polytomies originating from collapsing branches below a certain PP threshold. To distill and visualize the primary (raw) signal, I also did not partition the codons; treating the often saturated (in the case of mt DNA) third codon position as a distinct data partition would be obligatory if we wanted to have the best-possible phylogenies.

Although the network is quite boxy (not tree-like), it does contain nearly all of the edges (branches; in blue) that make up Sousa et al.'s fully resolved, concatenated amino acid tree, with the exception of the Setaphyta edge (as noted by Sousa et al.). Although we used the suboptimal nucleotide matrices and none of our gene trees was fully congruent with the concatenated amino acid tree, they all included aspects of it (despite not filtering or correcting for codon bias).
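Whether an edge bundle agrees or conflicts with a branch of a reference tree comes down to the compatibility of two taxon bipartitions (splits). The sketch below shows the standard test: two splits of the same taxon set fit on one tree if at least one of the four intersections of their sides is empty. This is not the z-closure algorithm itself; handling trees with different tip sets by restricting both splits to their shared taxa is a simplifying assumption, and the taxon labels are made up.

```python
def compatible(split1, split2):
    """Splits over the same taxa are compatible if at least one of the
    four pairwise intersections of their sides is empty."""
    a1, b1 = split1
    a2, b2 = split2
    return any(not (x & y) for x in (a1, b1) for y in (a2, b2))


def compatible_on_shared(split1, split2):
    """Compare two partial splits on the taxa that both of them contain."""
    shared = (split1[0] | split1[1]) & (split2[0] | split2[1])
    restricted1 = (split1[0] & shared, split1[1] & shared)
    restricted2 = (split2[0] & shared, split2[1] & shared)
    return compatible(restricted1, restricted2)


reference = ({"A", "B"}, {"C", "D", "E"})     # a branch of the reference tree
agreeing = ({"A", "B", "C"}, {"D", "E"})      # fits on the same tree
conflicting = ({"A", "C"}, {"B", "D", "E"})   # would be drawn as a conflicting edge
partial = ({"A", "C"}, {"B", "D"})            # from a gene tree lacking taxon E

print(compatible(reference, agreeing))           # True
print(compatible(reference, conflicting))        # False
print(compatible_on_shared(reference, partial))  # False
```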

When talking about gene incongruence, use a Supernetwork

Supernetworks (see also Whitfield et al. 2008, Syst. Biol. 57: 939–947) are much under-used graphs, ideal to visualize statements such as the one by Sousa et al. quoted above. They not only show the full extent of the gene tree incongruence but also to what degree it relates to short or long branches in the gene trees; this is something that, eg., cloudograms fail to do. While a cloudogram is a summary of cladograms (sister relationships), a Supernetwork is a summary of phylograms, ie. phylogenetic trees informing us about the amount of change. Keeping an eye on branch lengths helps a lot when interpreting branch support and discussing tree topologies, especially trees based on concatenated gene data (filtered or not).

Monday, May 11, 2020

A new SARS-CoV-2 variant?


In previous blog posts, Guido has examined the phylogenetic patterns in the current SARS-CoV-2 outbreak, responsible for the socially disruptive Covid-19 pandemic.
These patterns are traceable because, being a virus, there is a high mutation rate in the genome, and many genomes have been sequenced. Even on the Diamond Princess boat, it is clear that a number of genetic variants arose during its few weeks of quarantine.

Guido analyzed in detail some of these known variants, and their associated genome mutations. He carefully tried to distinguish possible sequencing artifacts from genuine mutations, and which of the latter seem to be the result of genomic recombination among different strains. Naturally, he did this in the context of using phylogenetic networks as the preferred tool of analysis.


Needless to say, Guido is not the only person to have tried this sort of analysis, although people do not really seem to have grasped that recombination as a molecular process requires the concept of a phylogenetic network. There is an intellectual fixation with phylogenetic trees rather than networks. The tree approach is to detect incompatibilities among the trees, and to deduce recombination as the cause. However, why demonstrate that your preferred analysis method fails, and reach a conclusion from this, when you could simply analyze the data appropriately in the first place?

One recent pre-print that has attracted a lot of attention, based on looking for genetic mutations in a single gene, and then using a tree-based analysis, is:
 Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2

The attention-getting part of the paper is that a particular mutation variant of the virus seems to be getting more common among hosts, and in some places has become the dominant strain. The authors conclude that the mutation has been positively selected due to greater infectivity. This is potentially important because the gene being studied is the Spike (or S) protein, which creates the distinctive crown-like appearance of the virus itself. This crown mediates infection of host cells, and is thus the target of most vaccine strategies and antibody-based therapies. Clearly, then, this variant might be of great practical interest.

However, while the press coverage has been enthusiastic, most of the professional commentary so far has been unimpressed with the authors' conclusions. Basically, the reaction to the authors has been "not so fast, guys". The evidence is suggestive at best, and not yet verified (see We don’t know yet whether a mutation has made SARS-CoV-2 more infectious).

Comments

My points in this blog post are about the analyses. There are two parts to the analyses: the identification of mutations and selection, and the study of recombination.

First, only one mutation has been identified, which appears to increase in prevalence through time. So, the conclusion that the new variant is more virulent seems to be based on the idea that it becomes the dominant strain in any population. If this is so, then we still have only one main variant to deal with, in terms of medical response. Indeed, if this variant has been around since February, as the report claims, then most infected people must have it. The only people who wouldn't have this one would be the very earliest cases.

Moreover, if a mutation is positively selected, then it must be difficult to distinguish reticulation from convergence. If variants that gain a mutation via reticulation become dominant, then with every generation we increase the probability that the same mutation will be independently obtained by another virus lineage. Being positively selected, these independent mutations will quickly be dispersed. Given that the virus has been around now for nearly 5 months, with a steadily increasing and diversifying available-host population, there would be plenty of time for convergent evolution of the same beneficial mutation.

Second, phylogenetic trees are often used to try to study the origin of genetic variation, especially if there has been recurrent emergence of particular variants, each of which has subsequently diverged independently. This was Charles Darwin's idea when he talked about the tree as a model for evolution. However, Darwin's book also has a long chapter on hybridization, which cannot easily be studied using the tree model. This apparent contradiction did not concern Darwin, because his book is mostly about the continuity of evolutionary history, which was his main motivation for using the tree model. Hybridization is evidence for continuity, even though the tree model is too simple for studying it. The same argument applies to the study of introgression.

It is the same for processes like recombination, which is conceptually no different, although it occurs at the molecular level, instead. As far as the new paper is concerned, its Figure 1, which is a couple of phylogenetic trees, does not fit well with Figure 6, which is a set of alignments illustrating recombination. Why authors cannot see contradictions between different parts of their own work remains a mystery.

As a final note, the authors raise the specter of re-infection by the new SARS-CoV-2 variant. However, it is our developed immunity (ie. production of antibodies) that protects us, epidemiologically. To allow re-infection, the virus would need to avoid these antibodies. Being more infectious does not automatically make a virus able to avoid antibodies. Nevertheless, I would not be surprised if we learn that some people become ill more than once. (NB. This is different from saying that people have multiple strains. Multiple infections do not necessarily result in multiple illnesses, because of the antibodies.) A bigger concern for new illnesses is likely to be the observed large variation in the amount of antibodies that people produce (more is better, of course).

Monday, May 4, 2020

Finding the CoV-2 root


In my last post, I looked at the prospects and pitfalls of using Median networks to trace virus evolution in the case of the SARS-CoV-2 virus. In this post, I will explore how we can try to root the CoV-2 MJ network, and why using an outgroup, as done by Forster et al., PNAS (2020), is not the best choice.

We'll stick to our 88 sequence dataset because I have already investigated its characteristics in my last post (XLSX-file included in the figshare file set). Here's the unweighted MJ network that can be inferred from these data, including all 146 mutation patterns (145 characters because one indel overlaps with a SNP – single-nucleotide polymorphism).

Median-joining network for the 88 samples in our early March harvest, color-coded for provenance and with sample dates. Four mutations (purples) are resolved as homoplasies. Red edges: potential recombination with unsampled types; line thickness gives the number of deviating SNPs. Forster et al.'s types are given for orientation.

As in Forster et al.'s graph, we have one box in the central part of the graph, probably between Forster et al.'s type B (the big pie in the center and its satellites) and their type C (here: the long-edge global group including the Australian and European samples).

There's a useful rule-of-thumb in population genetics: a widespread, frequent haplotype with many satellite types is often the ancestral type of the investigated sample. This, in our case, includes the reference CoV-2 genome ("Wuhan 1"; NC_045512, sampled 26/12/2019). Having investigated in detail the data behind the graph (see the last post; adding sample date, provenance, graph above), we can put forward hypotheses as to what degree the parallel edge bundles represent alternative evolutionary scenarios, or are alternatively the result of potential recombinants between CoV-2 sub-lineages.
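The rule-of-thumb can be turned into a simple ranking. The sketch below is only an illustration on invented toy sequences, not on the 88-genome data: it counts how often each haplotype occurs and how many other haplotypes differ from it by exactly one mutation (its "satellites"). Geographic spread, the third ingredient of the rule, is left out here.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def rank_candidates(sequences):
    """Rank haplotypes by frequency, then by number of one-step satellites."""
    counts = Counter(sequences)
    types = list(counts)
    scores = []
    for hap in types:
        satellites = sum(1 for other in types if hamming(hap, other) == 1)
        scores.append((hap, counts[hap], satellites))
    return sorted(scores, key=lambda s: (s[1], s[2]), reverse=True)


# Invented toy alignment (equal-length sequences, variable sites only).
toy = ["CCGA"] * 4 + ["CUGA", "CCGG", "UCGA"] + ["UUGG"] * 2

for hap, freq, sats in rank_candidates(toy):
    print(hap, freq, sats)
# The first line printed is: CCGA 4 3
```

Here CCGA is both the most frequent type and the one with the most one-step neighbours, so it would be the first candidate for the ancestral type under this heuristic.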

This allows us to depict an evolutionary scenario for our early samples, to picture how (i) the putative original variant (Wuhan 1/Type B) was distributed during the initial phase (largely unmodified; light gray arrows in the next figure), (ii) where mutations happened to give rise to sequentially new (sub)types, and (iii) where recombination may have happened (crosses in the figure). Some links (the dotted lines) require further data in order to decide whether the shared mutation is lineage-diagnostic (as indicated by the MJ network) or a convergence.


Early evolution of CoV-2 in time (earliest dates) and space (coloring). Different grays distinguish the main two/three lineages: 20% gray, original Wuhan type (Forster et al.'s type B), dispersed unmodified to rest of China (sampled), Nepal (not sampled), the cruise-ship (sampled) and North America (not sampled); 40% gray, potential type C differing by one transversion (basic type not sampled); 60% gray, Forster et al.'s type A differing by two transitions, basic variant found in a sample from Taiwan (Jan 31st). The circle sizes give the number of additional mutations within a lineage and geographic cluster; the crosses indicate potential recombination (within or between main types/lineages).

The early samples demonstrate that the later USA samples were infected by various (sub)types by mid-/end of January (by up to six lineages), while most of the variation arising in locked-down Wuhan did not escape (at this early stage) — the earliest two samples from 23/12 (MT019529) and 26/12 (reference genome) differ by three mutations.

The quarantined cruise-ship in Japan was infected with the unmodified Wuhan 1 type, which then evolved within the vessel's population. So, this quarantine worked, because the vessel's mutated viruses are not found elsewhere. While the 11121-transition has probably been propagated in the vessel's population via recombination, its occurrence outside (in the Jetsetter/USA lineage, type C?, and USA-Type A) could be due to homoplasy: both the Jetsetter/USA and the A-type USA genomes are (strongly) derived. The 24072 and 28892-transitions point to reticulation between (less evolved) American B- and (highly evolved) A-type lineages; the MJ network can't resolve the resulting box because the American A-type showing the 24072 mutation is strongly derived.

Note: It's also interesting to compare our graph with the tree-based virus "phylogeny" on the GISAID page, which doesn't seem to include the cruise-ship samples. Note that most of the deep branches of the GISAID tree are unsupported ("no mutation"), and samples identical to the reference can be found among the early samples of most main "clades" depicted in the GISAID "phylogeny".

Substitution probabilities

It is also straightforward to identify likely (→ U) and less likely substitutions (all others), as shown in the table.


There is a clear substitutional bias, as transitions are more likely than transversions; the approximate substitution model is abaaba for substitutions replacing the reference / CoV-2-consensus nucleotide. But the model is asymmetrical: Us are more likely to replace C than vice versa, while A/G transitions are balanced. Stochastically distributed singleton/rare mutations have a high probability to show a U, in general. So, a shared C is more likely to be a conserved, shared ancestral pattern (what Hennig called a "symplesiomorphy"). A shared U may be a uniquely shared, derived pattern (a "synapomorphy"), or a convergently (in parallel) obtained, derived pattern, a homoplasy. Low-frequency Cs, but also A and Gs at predominately U positions, are most probably synapomorphies as well (based on the data situation and observed substitution probabilities).
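For readers unfamiliar with the six-letter model codes, the following sketch shows where "abaaba" comes from. It assumes the common ordering of the six nucleotide pairs (AC, AG, AU, CG, CU, GU), with one rate class for transversions ("a") and one for transitions ("b"). The list of observed changes at the end is invented, only to show how directed substitutions could be tallied to reveal an asymmetry.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def kind(ref, alt):
    """Classify a substitution as a transition or a transversion."""
    if ref == alt:
        raise ValueError("not a substitution")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def model_string():
    """Six-letter code in the assumed order AC, AG, AU, CG, CU, GU."""
    pairs = [("A", "C"), ("A", "G"), ("A", "U"),
             ("C", "G"), ("C", "U"), ("G", "U")]
    return "".join("b" if kind(x, y) == "transition" else "a"
                   for x, y in pairs)


print(model_string())  # abaaba

# Invented (reference, observed) changes, tallied by direction.
toy_changes = [("C", "U"), ("C", "U"), ("C", "U"), ("U", "C"),
               ("A", "G"), ("G", "A"), ("A", "C")]
directed = Counter(toy_changes)
print(directed[("C", "U")], directed[("U", "C")])  # 3 1
```

The six-letter code only describes an undirected model; the asymmetry between C → U and U → C becomes visible only when the changes are counted by direction, as in the last two lines.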

Currently, there is no maximum likelihood analog to Median networks, but one could weight mutation patterns differently (see, e.g., guidelines provided in NETWORK under the Help > About menu item in the Median Joining analysis window).

With each successive virus generation, the probability for a homoplasious U increases. Thus, when using MJ networks for virus evolution, we should consider analyzing the data at different time-points, rather than including all of the data in one large analysis (see also our posts on stacking Neighbor-nets: introduction, fossil king ferns, and manual alphabets).
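Splitting the data by time-point is easy to prepare before any network is inferred. The sketch below builds cumulative subsamples from sample dates; the sample names, dates and cutoffs are invented, and the network inference itself (eg. in NETWORK) is not part of the example.

```python
from datetime import date


def cumulative_slices(samples, cutoffs):
    """For each cutoff date, list the samples collected on or before it.

    samples : list of (name, sampling date) pairs
    cutoffs : dates at which a separate analysis would be run
    """
    return {cutoff: [name for name, day in samples if day <= cutoff]
            for cutoff in cutoffs}


# Invented toy samples.
samples = [
    ("s1", date(2019, 12, 26)),
    ("s2", date(2020, 1, 15)),
    ("s3", date(2020, 1, 31)),
    ("s4", date(2020, 2, 20)),
]
cutoffs = [date(2019, 12, 31), date(2020, 1, 31), date(2020, 2, 29)]

for cutoff, names in cumulative_slices(samples, cutoffs).items():
    print(cutoff.isoformat(), names)
```

Each slice can then be analysed on its own, so that mutations appearing late do not obscure the relationships among the early types.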

Homoplasy + distant outgroups = wrong roots

By relying on a distantly related sister-lineage to infer an outgroup root of the MJ network, Forster et al. likely got the basic relationships wrong.

Central part of the original outgroup-rooted "phylogenetic network". Coloring after Forster et al. (2020).

Their Type A is probably not ancestral to Type B/Wuhan 1, but derived from it or representing an early split.

Same graph, mutation arrows taking into account observed mutation probabilities (our 88 genomes data) and assuming that there was no recombination among earliest types of each lineage.

The 3 Us shared by the bat outgroup and (part of) Type A (8782, 18060, 29095) likely represent homoplasy in distantly related sister lineages (cf. our last SARS virus post). Being homoplasies, they produce a network box reflecting alternative mutational pathways but not recombination. Homoplasious (convergently evolved) mutational patterns accumulate with increasing phylogenetic distance. Neutral mutations have a generally higher chance to replace a C by a U, back-mutations are less likely, and some sites are more likely to be mutated than others. Hence, there is a good chance that the bat sister-CoV-virus shows more shared mutational patterns with a derived CoV-2 lineage (ie. derived Type A variants) than with the ancestral one (Type B). Distant outgroups should not be used to root Median networks (see also: How do we interpret a rooted median network).

The only possibly genuine mutation would be the shared C (Forster et al.'s pos. 28144, pos. 28219 in our alignment) opposing a U in all Type B and Type C, differentiated only by two incompatible mutations, G → U transitions. The U at pos 28144 may have evolved in parallel in the B and C types; and the actual all-ancestor of CoV-2 (as indicated above) is neither included in Forster et al.'s sample, nor in the current GISAID sample (or our harvest).

Outlook

It will be interesting to infer MJ networks on time-stamped and geo-referenced subsamples collected in the GISAID database, once the virus has had half a year (or more) to evolve, to see (i) how common homoplasy is, (ii) which sites are likely to accumulate → U substitutions independently of ancestry and (iii) whether there are further and more obvious examples of recombination. The further that genotypes evolve from the original stock, the more diagnostic their sequences may become, and the easier it will be to decide whether shared but incompatible sequence features are the result of homoplasy or recombination.