When it comes to phylogenetic networks, there is often misunderstanding between biological and computational scientists, because the former tend to focus on the biological processes underlying the network whereas the latter focus on the patterns needing to be analyzed to produce the networks.
Here, I try to provide a summary of the different processes and patterns involved in reticulation, so that both "sides" get an overview, and hopefully can communicate more easily. I am principally discussing the development of networks that display evolutionary history.
In phylogenetics, historical processes create contemporary patterns; we then try to detect those patterns and assess them, in order to determine which process created each one. Computationally, algorithms detect certain data patterns and display them in a directed acyclic graph, which is then interpreted biologically. We therefore need to identify the possible patterns created by the different processes, so that algorithms can be developed to detect them. It is doubtful that an algorithm will be able to identify all of the individual processes; it will be up to biologists to work out which process created each pattern detected.
In what follows, there are major simplifications from both the biological and computational points of view, so please be aware of that. In particular, note that I have not discussed either deep coalescence or gene duplication-loss which, if present, will confound the detection of reticulation patterns.
Hybridization (hybrid speciation)
This is the formation of a new species via sexual reproduction. There are two basic forms that are of interest:
Homoploid Hybridization, in which one copy of the genome is inherited from each parent species (e.g. diploid parents create a diploid hybrid);
Polyploid Hybridization, in which multiple copies of the genome are inherited from each parent species (e.g. diploid parents create a polyploid hybrid).
Polyploid hybridization is usually assessed by sequencing each copy of the genome in the hybrid species, and treating each copy as a terminal in the data analysis. This produces a multi-labelled genome tree, which is then turned into a single-labelled species network.
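As a rough illustration of the first step only, the sketch below reads a multi-labelled tree and reports, for each label that occurs more than once, the sister group of each copy. It assumes trees written as nested Python tuples, and the taxon names and function names are hypothetical. It does not build the species network itself, which requires a dedicated algorithm.

```python
from collections import defaultdict

def leaf_names(tree):
    """List the leaf labels of a nested-tuple tree, keeping duplicates."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]

def sisters_by_label(tree, found=None):
    """For every leaf copy, record the leaves of its sibling subtrees."""
    if found is None:
        found = defaultdict(list)
    if isinstance(tree, str):
        return found
    for i, child in enumerate(tree):
        if isinstance(child, str):
            siblings = [name
                        for j, other in enumerate(tree) if j != i
                        for name in leaf_names(other)]
            found[child].append(sorted(siblings))
        else:
            sisters_by_label(child, found)
    return found

def putative_parents(tree):
    """Keep only labels that occur more than once (the multi-labelled taxa)."""
    return {label: sorted(groups)
            for label, groups in sisters_by_label(tree).items()
            if len(groups) > 1}

# Hypothetical multi-labelled genome tree: H has two genome copies,
# one placed next to A and one placed next to C.
mul_tree = ((("A", "H"), "B"), (("C", "H"), "D"))

print(putative_parents(mul_tree))  # {'H': [['A'], ['C']]}
```

Here the two copies of H sit beside A and C respectively, which is the pattern that would be drawn as a reticulation joining those two lineages.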
At the species level, homoploid hybridization is usually assessed by sequencing several genes in the hybrid species (often from both the nuclear and non-nuclear genomes) and producing independent gene trees. The species network is created by resolving conflicts among the gene trees. This form of analysis assumes a data pattern that is very similar to that of HGT.
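The conflict among gene trees can be made concrete with a minimal sketch. It assumes rooted trees written as nested Python tuples, and treats two clusters (sets of taxa below a node) as conflicting when they overlap without one containing the other. The trees and names are hypothetical, and real analyses must also allow for estimation error in the gene trees.

```python
def clusters(tree):
    """Return the clusters (leaf sets below each internal node) of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found

def compatible(a, b):
    """Two clusters fit on one rooted tree if they are nested or disjoint."""
    return a <= b or b <= a or not (a & b)

def conflicts(tree1, tree2):
    """List the pairs of clusters, one per tree, that cannot share a tree."""
    return sorted(
        (sorted(a), sorted(b))
        for a in clusters(tree1)
        for b in clusters(tree2)
        if not compatible(a, b)
    )

# Hypothetical gene trees for species A, B, C and a putative hybrid H.
nuclear_tree = ((("A", "H"), "B"), "C")
plastid_tree = (("A", ("B", "H")), "C")

gene_tree_conflicts = conflicts(nuclear_tree, plastid_tree)
print(gene_tree_conflicts)  # [(['A', 'H'], ['B', 'H'])]
```

The single conflicting pair places H with A in one gene and with B in the other, which a network would resolve by giving H two parent lineages.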
In population studies, homoploid hybridization is usually assessed at the sequence level, using multiple-copy nuclear genes, where hybrids are detected by additive polymorphisms at some alignment positions.
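A minimal sketch of what an additive polymorphism looks like in aligned sequences follows. It assumes that the candidate hybrid's mixed sites are scored with the standard IUPAC two-base ambiguity codes, and the sequences and function name are hypothetical.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

def additive_sites(parent1, parent2, candidate):
    """Return 0-based alignment positions where the two parents carry
    different single bases and the candidate carries both of them."""
    sites = []
    for pos, (p1, p2, c) in enumerate(zip(parent1, parent2, candidate)):
        b1, b2 = IUPAC[p1], IUPAC[p2]
        if len(b1) == 1 and len(b2) == 1 and b1 != b2 and IUPAC[c] == b1 | b2:
            sites.append(pos)
    return sites

# Hypothetical aligned sequences of equal length.
parent1 = "ACGTACGT"
parent2 = "ACATACCT"
candidate = "ACRTACST"

print(additive_sites(parent1, parent2, candidate))  # [2, 6]
```

A candidate that is additive at most or all of the sites where the putative parents differ is the pattern of interest; a single additive site on its own is weak evidence.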
Introgression (introgressive hybridization)
This is the transfer of genetic material from one species to another via sexual reproduction. This happens when hybrid individuals back-cross preferentially to one of the parental species, rather than forming a new hybrid species. It can involve anything from 1-49% of the genome (at 50% it is best called hybridization). The data pattern created is very similar to that of HGT (the transfer of genetic material from one species to another via non-sexual means).
It is usually assessed at the population level, by sequencing one or more genes (often from both the nuclear and non-nuclear genomes) from many individuals, and demonstrating that identical haplotypes (haploid genotypes) occur in what are recognized as separate species. This is done by constructing a haplotype network. Often, individuals are detected where the non-nuclear haplotype differs from the nuclear haplotype (as shown in the figure).
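The haplotype-sharing pattern can be sketched as a simple tabulation. The sample records, species names and haplotype labels below are all hypothetical, and the sketch only flags the pattern; it does not build a haplotype network or indicate the direction of transfer.

```python
from collections import defaultdict

# Hypothetical records: (individual, species, nuclear haplotype, non-nuclear haplotype)
samples = [
    ("ind1", "sp_X", "n1", "c1"),
    ("ind2", "sp_X", "n1", "c1"),
    ("ind3", "sp_X", "n2", "c3"),
    ("ind4", "sp_Y", "n3", "c3"),
    ("ind5", "sp_Y", "n3", "c4"),
]

def shared_haplotypes(samples, column):
    """Return haplotypes (from the given tuple column) found in more than one species."""
    species_by_haplotype = defaultdict(set)
    for record in samples:
        species_by_haplotype[record[column]].add(record[1])
    return {hap: sorted(species)
            for hap, species in species_by_haplotype.items()
            if len(species) > 1}

nuclear_shared = shared_haplotypes(samples, 2)
non_nuclear_shared = shared_haplotypes(samples, 3)

# Individuals whose non-nuclear haplotype crosses the species boundary
# while their nuclear haplotype does not.
carriers = [ind for ind, _, nuc, org in samples
            if org in non_nuclear_shared and nuc not in nuclear_shared]

print(nuclear_shared)      # {}
print(non_nuclear_shared)  # {'c3': ['sp_X', 'sp_Y']}
print(carriers)            # ['ind3', 'ind4']
```

Both carriers of the shared haplotype are reported, because sharing alone does not show which species was the donor.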
Horizontal Gene Transfer
This is the transfer of genetic material from one species to another via non-sexual means (e.g. transformation, transduction, or conjugation). The data pattern created is very similar to that of introgression (the transfer of genetic material from one species to another via sexual reproduction).
It is sometimes assessed by sequencing several genes and producing independent gene trees. The species network is created by resolving conflicts among the gene trees. This form of analysis assumes data that are very similar to those of homoploid hybridization or recombination.
Alternatively, it is often assessed by comparing gene trees to a species tree (either pre-specified, or derived from multi-gene data). The species network is created by resolving conflicts between the gene trees and the species tree.
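One naive way to locate a conflict between a gene tree and a species tree is to ask whether removing a single taxon makes the two trees identical; any such taxon is a candidate for having been moved. The sketch below assumes nested-tuple trees on the same taxa, with hypothetical names. It handles only a single misplaced taxon, and it cannot say whether HGT or some other process caused the move.

```python
def prune(tree, taxon):
    """Remove one leaf and suppress any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [p for p in (prune(child, taxon) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Sort children recursively so that equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def leaf_names(tree):
    """List the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]

def candidate_movers(gene_tree, species_tree):
    """Taxa whose removal makes the gene tree match the species tree."""
    if canonical(gene_tree) == canonical(species_tree):
        return []
    return sorted(
        taxon for taxon in leaf_names(species_tree)
        if canonical(prune(gene_tree, taxon)) == canonical(prune(species_tree, taxon))
    )

# Hypothetical trees: in the gene tree, B sits next to D instead of A.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = (("A", "C"), (("B", "D"), "E"))

print(candidate_movers(gene_tree, species_tree))  # ['B']
```

In a network, the reported taxon would receive a reticulation edge from the lineage it joins in the gene tree (here, the lineage leading to D).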
Homologous Recombination and Viral Reassortment
These involve homologous parts of a genome breaking apart and re-arranging themselves, often during sexual reproduction. With cross-over the two genomes exchange material, and with gene conversion one genome acquires material from the other. There are three basic forms that are of interest:
Intra-genic Recombination, in which the break-points occur within a single gene;
Inter-genic Recombination, in which the break-points occur in different genes or non-coding spaces between genes;
Reassortment, in which segmented viruses re-combine their segments to create new strains (similar to gene conversion); this is basically inter-genic recombination without sex.
Intra-genic recombination is usually analyzed at the sequence level, based on ordered data. The gene network is constructed by identifying break-points, and thus the recombined segments. It is also possible for one of the donors of a recombined sequence to be missing from the dataset, in which case the data pattern will be the same as for HGT without the donor sampled.
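Break-point identification on ordered data can be sketched as follows, assuming a query sequence and two already-chosen candidate parent sequences of equal aligned length (all hypothetical). The sketch scores every possible single break-point by the number of mismatches it leaves, and reports all positions that share the lowest score. It is a toy version of the idea, not any particular published method.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def breakpoint_interval(query, left_parent, right_parent):
    """Score each split k (query[:k] from left_parent, query[k:] from
    right_parent) and return the lowest cost with every k achieving it."""
    costs = [
        mismatches(query[:k], left_parent[:k]) + mismatches(query[k:], right_parent[k:])
        for k in range(len(query) + 1)
    ]
    lowest = min(costs)
    return lowest, [k for k, cost in enumerate(costs) if cost == lowest]

# Hypothetical parents differing at 0-based positions 2, 6 and 9.
parent1 = "ACGTACGTACGT"
parent2 = "ACTTACCTAGGT"
# Simulated recombinant: parent1 up to position 5, parent2 afterwards.
query = parent1[:5] + parent2[5:]

print(breakpoint_interval(query, parent1, parent2))  # (0, [3, 4, 5, 6])
```

The result is an interval rather than a single position: the break-point can only be placed somewhere between the two flanking sites at which the parents differ.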
Inter-genic recombination will produce the same pattern as hybridization, if both break-points are outside the region sequenced. Furthermore, homoploid hybridization can be thought of as recombination of whole chromosomes.
Viral reassortment is usually assessed by comparing strains with each other based on presence-absence of segmental haplotypes (rather similar to haplotyping of sexual organisms). This is a unique form of analysis, and it can produce incredibly complex networks.
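The comparison of segmental haplotypes can be sketched as a search for mosaic strains: a strain is flagged when every one of its segments matches one of two other strains, and each of those strains contributes at least one segment that the other does not. The strain names, segment names and haplotype labels are hypothetical, and real datasets with many strains and repeated reassortment are far messier.

```python
from itertools import combinations

# Hypothetical strains: the haplotype label carried at each genome segment.
strains = {
    "strain1": {"seg1": "a", "seg2": "a", "seg3": "a", "seg4": "a"},
    "strain2": {"seg1": "b", "seg2": "b", "seg3": "b", "seg4": "b"},
    "strain3": {"seg1": "a", "seg2": "b", "seg3": "a", "seg4": "b"},
}

def reassortant_candidates(strains):
    """Return (mosaic, parent1, parent2) triples consistent with reassortment."""
    hits = []
    for child, segments in strains.items():
        others = [name for name in strains if name != child]
        for p1, p2 in combinations(others, 2):
            from_p1 = {seg for seg, hap in segments.items() if strains[p1][seg] == hap}
            from_p2 = {seg for seg, hap in segments.items() if strains[p2][seg] == hap}
            covers_all = (from_p1 | from_p2) == set(segments)
            # Each parent must supply a segment the other cannot explain.
            if covers_all and (from_p1 - from_p2) and (from_p2 - from_p1):
                hits.append((child, p1, p2))
    return hits

print(reassortant_candidates(strains))  # [('strain3', 'strain1', 'strain2')]
```

Each flagged triple corresponds to one reticulation in the strain network, so the number of triples grows quickly as more strains are compared.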
Polyploid hybridization (species)
Homoploid hybridization (species)
Homoploid hybridization (population)
Horizontal gene transfer (species)
incongruent gene trees
sequence additive polymorphisms
incongruent gene trees
incongruent gene/species trees
incongruent gene trees
It may be impossible ever to reliably distinguish homoploid hybridization, introgression, HGT and inter-genic recombination from each other by pattern analysis alone, at least not without genome-scale data.