Suspicious K-mer Content at Start of Illumina Reads Generated Using Nextera Library Kits
Transposon insertion bias or a different sort of sequencing artifact?
May 2nd, 2014
Many researchers have reported some form of biased representation of k-mer content at the beginning (5'-end) of sequencing data generated from Illumina library prep kits. Here are two threads with discussion of this phenomenon for Nextera kits (HERE) and TruSeq kits (HERE). The general consensus is that the patterns of bias are produced by modest and uncontrollable preferential interactions between certain nucleic acid sequences and the random hexamer PCR primers or transposable elements (here's a paper demonstrating such a bias for the Tn5 transposable element related to the one used in the Nextera kit).
The main question is: "What kind of effect might this have on assembly?" The feedback from a number of researchers was that they observed no significant differences in assembly whether or not they trimmed these features (i.e. the first 14bp of affected reads). For one of those researchers, however, the lack of improvement meant abandoning his data. Interestingly, the problem presented by this unexplained k-mer pattern is not readily addressed using trimming software, which hunts for adapter sequences. This isn't surprising given that the k-mer sequences weren't found in the sequencing adapters or indices (see the list of the most abundant repeats reconstructed from the k-mer chart, left of figure 1).
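For readers who want to try the trimming comparison themselves, here is a minimal Python sketch of removing a fixed number of bases from the 5'-end of every read. It assumes plain four-line FASTQ records, and it trims all reads rather than only the affected ones; the function name trim_fastq_5prime is made up for illustration and is not part of any trimming tool.

```python
def trim_fastq_5prime(lines, n=14):
    """Yield FASTQ lines with the first n bases and qualities removed.

    Assumes each record is exactly four lines: header, sequence,
    separator ('+'), and quality string.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            # Sequence and quality must be trimmed by the same amount
            # so that they stay the same length.
            yield from (header, seq[n:], plus, qual[n:])
            record = []


if __name__ == "__main__":
    example = ["@read1", "A" * 14 + "CGTCGT", "+", "I" * 14 + "FFFFFF"]
    for out_line in trim_fastq_5prime(example):
        print(out_line)
```

The generator accepts any iterable of lines, so an open file handle can be passed in directly and the output written line by line to a new file.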
NOTE: One more readily solved, but different problem, is trimming reads which had such short fragments that the sequenced read included the opposing paired end adapter. This can be remedied using trimming software (like Trimmomatic) and is described HERE (@ pathogenomenick).
So, if I haven't lost you by assuaging your worries with "some people just don't worry about it", I'll explain what prompted me to write this piece. I have become skeptical about the explanation based on transposon bias because of the uncanny consistency with which these patterns of bias materialize across different samples and projects. All eight of the samples I prepared (and sequenced over two separate runs) had an identical pattern, despite the expectation of a very different nucleic acid composition, as the libraries were made from complex soil microbial communities. My suspicions grew when I began searching for explanations and found identical, or nearly identical, FastQC output graphs from four other researchers in related posts on forums and blogs.
In my opinion, even if transposons had ultra-specific target sites in downstream locations from where the adapter sequence gets inserted, there is a low probability of getting such clean and comparable data by accident. I may be jaded by how few of my experiments ever look so clean, but I would be surprised if we had just accidentally nailed down the biased recognition sites of whatever variant of Tn5 transposon the Nextera people are using. The more likely explanation would involve something to do with the manufacturing process of the Nextera kits, whether it be a characteristic of how the adapters or primers are synthesized or part of the transposase sequence itself being inserted within the read. I admit to not having an answer, but I hope this post might spark a more complete explanation. I have contacted Illumina and will report back their answer.
In the meantime, PLEASE COMMENT @ the SEQanswers Thread Addressing This Question Started HERE.
-- UPDATE --
(May 7th, 2014)
This post generated some good discussion on the SEQanswers thread, and both the Illumina rep and a thoughtful contributor in the thread referenced the following paper, which compared different library preparation methods to "tagmentation" with transposases. They detected bias as a result of the native Tn5 transposase recognition site: AGNTYWRANCT (where N is any nucleotide, R is A or G, W is A or T, and Y is C or T). When I search through my reads based on the 6-base k-mers I reconstruct from the table below, I DO find similarity to the reported recognition site. The similarity is not perfect, but this may be because the transposase used in the Nextera kit is not the same as the one used in the paper.
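To make the degenerate recognition site concrete, here is a minimal Python sketch that expands the IUPAC codes quoted above into a regular expression and reports where it matches in a read. It only finds exact matches to the motif, so it would miss the imperfect similarity I describe; the helper names (motif_to_regex, find_motif) and the example read are made up for illustration.

```python
import re

# IUPAC codes that appear in the motif, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def find_motif(read, pattern, window=None):
    """Return 0-based start positions where pattern matches in read.

    If window is given, only the first `window` bases are searched.
    """
    region = read if window is None else read[:window]
    # Wrapping the pattern in a lookahead reports overlapping matches too.
    lookahead = re.compile("(?=" + pattern.pattern + ")")
    return [m.start() for m in lookahead.finditer(region)]


TN5_SITE = motif_to_regex("AGNTYWRANCT")

if __name__ == "__main__":
    # Hypothetical read built to contain one exact match at position 2.
    print(find_motif("CCAGATCAGAACTGG", TN5_SITE))
```

Passing window=14 restricts the search to the start of the read, which is where the bias shows up in the FastQC plots.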
Now, with the actual sequence information related to the transposon bias in hand, I'm much more comfortable with that explanation. And, when I look at the total number of sequences which contain one of these k-mers in the first 14 bases of sequence, I find that the counts are really quite small (0.3%). My initial fears were based on an incorrect calculation where I summed the "count" column in the FastQC output for the seven major over-represented k-mers and divided by the total number of sequences. When I was reviewing this, I realized that the count data provided by FastQC is the total number of occurrences anywhere in the read and was, therefore, a vast overestimate of the reads with disproportionate abundances at the start of the read.
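The difference between the two calculations can be sketched in a few lines of Python. This is only an illustration of the counting logic described above, not FastQC's own code; the function name count_kmer_hits and the toy reads and k-mer are placeholders, not values from my data.

```python
def count_kmer_hits(reads, kmers, window=14):
    """Return (reads_with_hit_in_window, total_occurrences, n_reads).

    reads_with_hit_in_window: reads with at least one k-mer lying
        entirely within the first `window` bases.
    total_occurrences: every occurrence of every k-mer anywhere in
        any read (overlapping occurrences included).
    """
    reads_with_start_hit = 0
    total_occurrences = 0
    n_reads = 0
    for read in reads:
        n_reads += 1
        head = read[:window]
        if any(k in head for k in kmers):
            reads_with_start_hit += 1
        for k in kmers:
            total_occurrences += sum(
                1 for i in range(len(read) - len(k) + 1)
                if read.startswith(k, i)
            )
    return reads_with_start_hit, total_occurrences, n_reads


if __name__ == "__main__":
    # Toy reads and a placeholder k-mer, for illustration only.
    toy_reads = ["ACGTACGT" + "T" * 12 + "ACGTAC", "T" * 20 + "ACGTAC"]
    hits, occurrences, n = count_kmer_hits(toy_reads, ["ACGTAC"])
    print("fraction of reads with a hit at the start:", hits / n)
    print("occurrences anywhere divided by reads:", occurrences / n)
```

Dividing total occurrences by the number of reads can exceed the true fraction by a wide margin, because a single read can contribute several occurrences and most of them may lie outside the first 14 bases.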
One additional finding, which further diminishes the importance of this bias, is that when I BLAST the first 14 bp of a random selection of reads containing the repeats, I do not see a consistent taxonomic signal.
I thank everyone for sharing information and I hope this may have helped some newbies, like myself, in assessing the quality of their sequencing data. I am preparing to assemble my metagenomes; I will make one final brief update regarding the success of the assembly to finish this post.
- Roli Wilhelm
My Own Example of K-mer Bias From 100bp Paired-end Reads from a HiSeq Run.
recurring sequence in reads
transposase recognition site
Last updated Jan. 26, 2017