This notebook is meant to accompany the following assigned article: Compeau, Phillip E. C., Pavel A. Pevzner, and Glenn Tesler. 2011. “How to Apply de Bruijn Graphs to Genome Assembly.” Nature Biotechnology 29 (11): 987–91. https://doi.org/10.1038/nbt.2023.
By the end of this notebook you should:
In this notebook we are going to write Python functions to assemble a contig from a set of sequenced reads broken into kmers. This differs from the last two notebooks where we have generated kmers directly from the known genome sequence. Here the reads will cover small portions of the full genome, and they will contain variation. We are getting closer to the real task of genome assembly, which is to infer the genome sequence from the kind of data that available to us. Next week we will explore how we add long-read sequences to gain additional information. But for now, we are sticking with assembly based only on short read sequences.
Shotgun sequencing is an imperfect way to assemble genomes, but remains the most common starting point for genome assembly today. It is cheap compared to long-read methods, it does not require any special methods for DNA extraction (e.g., high molecular weight extractions that minimize breakage to get longer reads), and it doesn't require much in terms of preparing the genomic library for sequencing. Shotgun sequencing typically involves either a sonication step, or enzymatic digestion to cut the genome randomly into smaller fragments. These fragments are expected to randomly start at any point throughout the genome. Thus, if we sequence enough of them we expect to have full coverage of the entire genome.
In practice, we typically aim to cover every base in the genome with at least 50X coverage in order to accurately identify which regions are heterozygous, and to correct for sequencing errors. For large genomes this can require enormous sequencing efforts, with literally trillions of reads. It's worth noting that although this method is considered cheap, especially for small genomes, to attain 50X coverage of very large genomes (>10Gb), as can be found in organisms like ferns, amphibians, and shrimps, can cost >$50K, and sometimes much more. So cheap is relative. Illumina is the dominant technology for short read sequencing. For genome assembly paired-end sequencing is always used since the distance between paired reads provides additional information that can be used to span repeats.
Let's generate a random genome sequence that will serve as the template for generating short read data that we will then use to try to re-assemble the genome. The function short_read_sequencing()
below will generate short reads from this genome.
import random
import toyplot
def random_sequence(seqlen):
return "".join([random.choice("ACGT") for i in range(seqlen)])
def short_read_sequencing(sequence, nreads, readlen):
"generate short reads from a circular genome"
# do not allow reads to be longer than genome
assert len(sequence) > readlen, "readlen must be shorter than sequence"
# get random start positions of short reads
starts = [random.randint(0, len(sequence)) for i in range(nreads)]
# return reads as a list, generate reads by slicing from sequence
reads = []
for position in starts:
end = position + readlen
# if read extends past end then loop to beginning of sequence
if end > len(sequence):
read = sequence[position:len(sequence)] + sequence[0:end-len(sequence)]
else:
read = sequence[position:position + readlen]
# append to reads list
reads.append(read)
return reads
# get a random genome sequence
genome = random_sequence(100)
# get 100 reads randomly drawn from the genome
reads = short_read_sequencing(genome, 100, 25)
# print the first 20 reads
reads[:20]
We now have a 100bp genome (treated as circular) that has been sequenced to produce 100 reads that are each 25bp in length. The chances that these 100 reads started at every possible point in the genome is really low, and so it will be hard to build a de Bruijn graph from these reads as they are currently. Instead, we can break them down into smaller kmers. Here we will recycle the function we used last time to get kmers from a sequence, and we will instead get all kmers from all of the reads.
def get_kmer_count_from_sequence(sequence, k=3, cyclic=True):
"""
Returns dictionary with keys representing all possible kmers in a sequence
and values counting their occurrence in the sequence.
"""
# dict to store kmers
kmers = {}
# count how many times each occurred in this sequence (treated as cyclic)
for i in range(0, len(sequence)):
kmer = sequence[i:i + k]
# for cyclic sequence get kmers that wrap from end to beginning
length = len(kmer)
if cyclic:
if len(kmer) != k:
kmer += sequence[:(k - length)]
# if not cyclic then skip kmers at end of sequence
else:
if len(kmer) != k:
continue
# count occurrence of this kmer in sequence
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
return kmers
def get_kmer_count_from_reads(reads, k=3):
"Combines results of 'get_kmer_count_from_sequence()' across many reads"
# a dictionary to store kmer counts in
kmers = {}
# iterate over reads
for read in reads:
# get kmer count for this read
ikmers = get_kmer_count_from_sequence(read, k, cyclic=False)
# add this kmer count to the global kmer counter across all reads
for key, value in ikmers.items():
if key in kmers:
kmers[key] += value
else:
kmers[key] = value
# return kmer counts
return kmers
get_kmer_count_from_reads(reads)
Again, this is the same function we used in the previous notebook for build a graph of edges form the kmers.
def get_debruijn_edges_from_kmers(kmers):
"""
Every possible (k-1)mer (n-1 suffix and prefix of kmers) is assigned
to a node, and we connect one node to another if the (k-1)mer overlaps
another. Nodes are (k-1)mers, edges are kmers.
"""
# store edges as tuples in a set
edges = set()
# compare each (k-1)mer
for k1, count1 in kmers.items():
for k2, count2 in kmers.items():
# if both kmers occurred in the sequence
if (count1 & count2) and (k1 != k2):
# if they overlap then add to edges
if k1[1:] == k2[:-1]:
edges.add((k1[:-1], k2[:-1]))
if k1[:-1] == k2[1:]:
edges.add((k2[:-1], k1[:-1]))
return edges
def plot_debruijn_graph(edges, width=500, height=500, vlshow=True):
"returns a toyplot graph from an input of edges"
graph = toyplot.graph(
[i[0] for i in edges],
[i[1] for i in edges],
width=width,
height=height,
tmarker=">",
vsize=5,
vstyle={"stroke": "black", "stroke-width": 2, "fill": "black"},
vlshow=vlshow,
estyle={"stroke": "black", "stroke-width": 2},
layout=toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges()))
return graph
We now have all of the functions we need to assemble a genome sequence. Let's start with an easy example. Here we want to assemble a 50bp genome and we sequenced 1000 reads that are each 15bp long. So we can assume that the genome is probably covered completely by kmers of these reads. Using k=8 we can see below that a single Eulerian path is assembled from the short reads. This means the genome was successfully assembled.
random.seed(123)
genome = random_sequence(50)
reads = short_read_sequencing(genome, 1000, 15)
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome)
What if we now reduce the sequencing coverage to a much lower amount, so that there is possibly parts of the genome that are not covered by n-1 overlapping kmers. When we use 8mers like above, but with lower coverage, we now see that the genome cannot be assembled, and is instead broken into several smaller contigs.
random.seed(123)
genome = random_sequence(50)
reads = short_read_sequencing(genome, 100, 15)
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
Heterozygosity makes it much harder to assemble genomes since it introduces bubbles into the de Bruijn graph where a kmer can connect to multiple others. Let's look at an example where this is easier to see, here a 25bp genome. In the first example we sequence 1000 15bp reads and the genome assembles easily. In the second example, we sequence 500 15bp reads from each of two different copies of the genome which differ at two heterozygous sites. As a result, this introduces a number of loops into the de Bruijn graph, and thus makes it difficult to know unambiguously what the true sequence is. Finally, in a third example we assemble a genome where the two heterozygous sites are futher apart, further than the kmer size. You can see that now it does not create a loop. Thus, the kmer size and heterozygosity interact in setting the size of unique blocks that are useful for assembling the genome using de Bruijn graphs. In general, it is easiest to assemble genomes that are not highly heterozygous. For this reason researchers often use inbred individuals.
random.seed(1234)
genome1 = random_sequence(20)
reads1 = short_read_sequencing(genome1, 1000, 15)
kmers = get_kmer_count_from_reads(reads1, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
random.seed(1234)
genome1 = random_sequence(20)
# 500 reads from genome copy 1
reads1 = short_read_sequencing(genome1, 500, 15)
# 500 reads from genome copy 2
genome2 = list(genome1)
genome2[5] = "T"
genome2[9] = "T"
genome2 = "".join(genome2)
reads2 = short_read_sequencing(genome2, 500, 15)
reads = reads1 + reads2
# db graph
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome1)
print(genome2)
random.seed(1234)
genome1 = random_sequence(20)
# 500 reads from genome copy 1
reads1 = short_read_sequencing(genome1, 500, 15)
# 500 reads from genome copy 2
genome2 = list(genome1)
genome2[1] = "T"
genome2[19] = "C"
genome2 = "".join(genome2)
reads2 = short_read_sequencing(genome2, 500, 15)
reads = reads1 + reads2
# db graph
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome1)
print(genome2)
To plot a graph without the text labels shown use the argument vlshow=False
. This will be useful since you only need to evaluate whether the graph looks fully connected below for larger graphs, and not necessarily need to read the node labels.
# example
random.seed(1234)
genome1 = random_sequence(20)
reads1 = short_read_sequencing(genome1, 1000, 15)
kmers = get_kmer_count_from_reads(reads1, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=250, height=250);
# set a seed to make it repeatable
random.seed(1234)
# enter length of 50 here
genome1 = random_sequence(50)
# enter 1000 and 30 here
reads1 = short_read_sequencing(genome1, 1000, 30)
# choose a k size
kmers = get_kmer_count_from_reads(reads1, k=12)
# plot the graph
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=250, height=250);
# set a seed to make it repeatable
random.seed(1234)
# enter length of 50 here
genome1 = random_sequence(200)
# enter 1000 and 30 here
reads1 = short_read_sequencing(genome1, 1000, 50)
# choose a k size
kmers = get_kmer_count_from_reads(reads1, k=20)
# plot the graph
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=550, height=550);
There are many loops in the graph when it cannot be assembled. When the kmer size was smaller there were many more nodes and edges in the graph. This made it harder to read but and induced more loops in it. Larger genome size was harder to get a clean graph for because there were more loops probably caused by repeat kmers. The more reads and larger kmer size that was used the better the graph assembled.