Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.
This affects scientific work because all scientific questions are about the population, not the sample (i.e., we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.
In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally <500 bp and often <100 bp.)
The potential for sampling (ascertainment) bias has long been recognized for the detection of SNPs. This bias occurs because SNPs are often developed using only a small group of samples from which to choose the polymorphic markers. The resulting collection of markers samples only a small fraction of the diversity that exists in the population, and analyses based on these markers can therefore mis-estimate phylogenetic relationships.
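A toy simulation makes the SNP case concrete. The sketch below is only an illustration under stated assumptions, not a model of any real marker panel: the allele-frequency distribution (a Beta distribution skewed toward rare alleles), the panel size of four chromosomes, and the function name `simulate_ascertainment` are all hypothetical choices of mine. A site counts as a "discovered" SNP only if both alleles turn up in the small discovery panel.

```python
import random
import statistics


def simulate_ascertainment(n_sites=20000, panel_size=4, seed=1):
    """Return (all_freqs, found_freqs) for a toy SNP-discovery experiment.

    all_freqs   -- population allele frequency at every simulated site
    found_freqs -- frequencies of only those sites that look polymorphic
                   in a small discovery panel
    """
    rng = random.Random(seed)
    all_freqs = []
    found_freqs = []
    for _ in range(n_sites):
        # Assumption: most variants are rare (Beta skewed toward 0).
        p = rng.betavariate(0.3, 3.0)
        all_freqs.append(p)
        # Each panel chromosome carries the variant with probability p.
        carriers = sum(rng.random() < p for _ in range(panel_size))
        # The site is ascertained only if the panel shows both alleles.
        if 0 < carriers < panel_size:
            found_freqs.append(p)
    return all_freqs, found_freqs


if __name__ == "__main__":
    all_freqs, found_freqs = simulate_ascertainment()
    print("sites simulated:      ", len(all_freqs))
    print("sites ascertained:    ", len(found_freqs))
    print("mean freq, all sites: ", round(statistics.fmean(all_freqs), 3))
    print("mean freq, ascertained:", round(statistics.fmean(found_freqs), 3))
```

Under these assumptions the ascertained markers are a minority of the sites, and their mean allele frequency is markedly higher than that of the full set, because rare variants are rarely caught by a small panel. Any downstream statistic that depends on the frequency spectrum inherits that skew.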
However, it is entirely possible that any attempt to collect high-quality NGS data actually results in poor-quality sampling; that is, we end up with high-quality but biased genome sequences. Genome sequencing is all about the volume of data collected, and yet data volume cannot be used to address bias (it can only be used to address stochastic variation). It would be ironic if phylogenomics turns out to have poorer data than traditional sequence-based phylogenetics, but biased genomic data are unlikely to be of any more use than non-genome sequences.
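The point about data volume can be checked with a few lines of simulation. This is a minimal sketch under assumptions of my own: the "population" is a standard normal distribution with true mean 0, and the hypothetical inclusion rule discards half of all values above zero. Neither is meant to represent a real sequencing protocol.

```python
import random
import statistics


def draw_sample(rng, n, biased):
    """Draw n values from a standard normal population (true mean 0)."""
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)
        # Hypothetical biased protocol: positive values are dropped
        # half of the time, so they are under-represented.
        if biased and x > 0 and rng.random() < 0.5:
            continue
        kept.append(x)
    return kept


def mean_estimate(n, biased, seed=0):
    """Sample mean of n values under the fair or the biased protocol."""
    rng = random.Random(seed)
    return statistics.fmean(draw_sample(rng, n, biased))


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        fair = mean_estimate(n, biased=False)
        skewed = mean_estimate(n, biased=True)
        print(f"n={n:>7}  fair mean={fair:+.3f}  biased mean={skewed:+.3f}")
```

As n grows, the fair estimate tightens around the true mean of 0, whereas the biased estimate tightens around roughly -0.27 under this particular rule. More data buys precision in both cases, but in the second case it is precision about the wrong value.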
The basic issue is that attempts to get high-quality genome data usually involve leaving out data, either because the initial sequencing protocol never collects the data in the first place, or because the subsequent assembly protocols delete the data as being below a specified quality threshold. If these data are left out in a non-random manner, which is very likely, then sampling bias inevitably results. Unfortunately, both the sequencing and bioinformatic protocols are usually beyond the control of the phylogeneticist, and sampling bias can thus go undetected.
Two recent papers highlight common NGS steps that potentially result in biased samples.
First, Robert Ekblom, Linnéa Smeds and Hans Ellegren (2014. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics 15: 467) discuss genome coverage bias, using mtDNA as an example. They note:
It is known that the PCR step involved in sequencing-by-synthesis methods introduces coverage bias related to GC content, possibly due to the formation of secondary structures of single stranded DNA. Such GC dependent bias is seen on a wide variety of scales ranging from individual nucleotides to complete sequencing reads and even large (up to 100 kb) genomic regions. Systematic bias could also be introduced during the DNA fragmentation step or caused by DNA isolation efficacy, local DNA structure, variation in sequence quality and mapability of sequence reads.
In addition to variation in coverage, there may be sequence dependent variation in nucleotide specific error rates. Such systematic patterns of sequencing errors can also have consequences for downstream applications as errors may be taken for low frequency SNPs, even when sequencing coverage is high. GC rich regions and sites close to the ends of sequence reads typically show elevated error rates and it has also been shown that certain sequence patterns, especially inverted repeats and "GGC" motifs are associated with an elevated rate of Illumina sequencing errors. Such sequence specific miscalls probably arise due to specific inhibition of polymerase binding. Homopolymer runs cause problems for technologies utilising a terminator free chemistry (such as Roche 454 and Ion Torrent), and specific error profiles exist for other sequencing technologies as well.
Sequencing coverage showed up to six-fold variation across the complete mtDNA and this variation was highly repeatable in sequencing of multiple individuals of the same species. Moreover, coverage in orthologous regions was correlated between the two species and was negatively correlated with GC content. We also found a negative correlation between the site-specific sequencing error rate and coverage, with certain sequence motifs "CCNGCC" being particularly prone to high rates of error and low coverage.
The second paper is by Frederic Bertels, Olin K. Silander, Mikhail Pachkov, Paul B. Rainey and Erik van Nimwegen (2014. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution 31: 1077-1088). They discuss the situation where raw short-sequence reads from each DNA sample are directly mapped to the genome sequence of a single reference genome. They note:
There are reasons to suspect that such reference-mapping-based phylogeny reconstruction methods might introduce systematic errors. First, multiple alignments are traditionally constructed progressively, that is, starting by aligning the most closely related pairs and iteratively aligning these subalignments. Aligning all sequences instead to a single reference is likely to introduce biases. For example, reads with more SNPs are less likely to successfully and unambiguously align to the reference sequence, as is common in alignments of more distantly related taxa. This mapping asymmetry between strains that are closely and distantly related to the reference sequence may affect the inferred phylogeny, and this has indeed been observed. Second, as maximum likelihood methods explicitly estimate branch lengths, including only alignment columns that contain SNPs and excluding (typically many) columns that are nonpolymorphic, may also affect the topology of the inferred phylogeny. This effect has been described before for morphological traits and is one reason long-branch attraction can be alleviated with maximum likelihood methods when nonpolymorphic sites are included in the alignment.
We identify parameter regimes where the combination of single-taxon reference mapping and SNP extraction generally leads to severe errors in phylogeny reconstruction. These simulations also show that even when including nonpolymorphic sites in an alignment, the effect of mapping to a single reference can lead to systematic errors. In particular, we find that when some taxa are diverged by more than 5-10% from the reference, the distance to the reference is systematically underestimated. This can generate incorrect tree topologies, especially when other branches in the tree are short.
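The mapping asymmetry described in this quotation can be illustrated with a deliberately crude model. To be clear, this is not the simulation used by Bertels et al.; it is a sketch under assumptions of my own: reads are 100 bp long, each site differs from the reference independently with probability equal to the true divergence, and a read "maps" only if it has at most five mismatches. The function name and all parameter values are hypothetical.

```python
import random


def estimated_divergence(true_div, read_len=100, max_mismatches=5,
                         n_reads=5000, seed=42):
    """Estimate divergence from the reference using only mappable reads.

    Returns (estimate, fraction_mapped). The estimate is None if no
    simulated read passes the mismatch threshold.
    """
    rng = random.Random(seed)
    mapped = []
    for _ in range(n_reads):
        # Number of sites in this read that differ from the reference.
        mismatches = sum(rng.random() < true_div for _ in range(read_len))
        # Toy mapping rule: reads with too many mismatches are discarded.
        if mismatches <= max_mismatches:
            mapped.append(mismatches)
    if not mapped:
        return None, 0.0
    estimate = sum(mapped) / (len(mapped) * read_len)
    return estimate, len(mapped) / n_reads


if __name__ == "__main__":
    for true_div in (0.01, 0.05, 0.10):
        est, frac = estimated_divergence(true_div)
        print(f"true={true_div:.2f}  estimated={est:.3f}  mapped={frac:.1%}")
```

In this toy setting a taxon close to the reference is estimated almost correctly, because nearly all of its reads map. For a more divergent taxon, only the least divergent reads survive the threshold, so the estimate falls below the true value and the shortfall grows with divergence. That is the same direction of error the authors report, although the magnitudes here depend entirely on my assumed parameters.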
These issues are part of the current "gee-whizz" phase of phylogenomics, in which over-optimism prevails over realism, and volume of data is seen as the important thing. David Roy Smith (2014. Last-gen nostalgia: a lighthearted rant and reflection on genome sequencing culture. Frontiers in Genetics 5: 146) has recently commented on this:
The promises of NGS have, at least for me, not lived up to their hype and often resulted in disappointment, frustration, and a loss of perspective.
I was taught to approach research with specific hypotheses and questions in mind. In the good ol' Sanger days it was questions that drove me toward the sequencing data. But now it’s the NGS data that drive my questions ... I'm trapped in a cycle where hypothesis testing is a postscript to senseless sequencing.
As we move toward a world with infinite amounts of nucleotide sequence information, beyond bench-top sequencers and hundred-dollar genomes, let’s take a moment to remember a simpler time, when staring at a string of nucleotides on a screen was special, worthy of celebration, and something to give us pause. When too much data were the least of our worries, and too little was what kept us creative. When the goal was not to amass but to understand genetic data.