Notebook 10.3: denovo Assembly difficulties

Reading:

This notebook is meant to accompany the following assigned article: Compeau, Phillip E. C., Pavel A. Pevzner, and Glenn Tesler. 2011. “How to Apply de Bruijn Graphs to Genome Assembly.” Nature Biotechnology 29 (11): 987–91. https://doi.org/10.1038/nbt.2023.

Learning objectives:

By the end of this notebook you should:

  1. Understand how short-reads and genetic variation affect genome assembly.
  2. Be able to follow code for de Bruijn graph assembly from short reads.

Overview:

In this notebook we are going to write Python functions to assemble a contig from a set of sequenced reads broken into kmers. This differs from the last two notebooks where we have generated kmers directly from the known genome sequence. Here the reads will cover small portions of the full genome, and they will contain variation. We are getting closer to the real task of genome assembly, which is to infer the genome sequence from the kind of data that available to us. Next week we will explore how we add long-read sequences to gain additional information. But for now, we are sticking with assembly based only on short read sequences.

Shotgun genome sequencing

Shotgun sequencing is an imperfect way to assemble genomes, but remains the most common starting point for genome assembly today. It is cheap compared to long-read methods, it does not require any special methods for DNA extraction (e.g., high molecular weight extractions that minimize breakage to get longer reads), and it doesn't require much in terms of preparing the genomic library for sequencing. Shotgun sequencing typically involves either a sonication step, or enzymatic digestion to cut the genome randomly into smaller fragments. These fragments are expected to randomly start at any point throughout the genome. Thus, if we sequence enough of them we expect to have full coverage of the entire genome.

In practice, we typically aim to cover every base in the genome with at least 50X coverage in order to accurately identify which regions are heterozygous, and to correct for sequencing errors. For large genomes this can require enormous sequencing efforts, with literally trillions of reads. It's worth noting that although this method is considered cheap, especially for small genomes, to attain 50X coverage of very large genomes (>10Gb), as can be found in organisms like ferns, amphibians, and shrimps, can cost >$50K, and sometimes much more. So cheap is relative. Illumina is the dominant technology for short read sequencing. For genome assembly paired-end sequencing is always used since the distance between paired reads provides additional information that can be used to span repeats.

Generate sequenced reads

Let's generate a random genome sequence that will serve as the template for generating short read data that we will then use to try to re-assemble the genome. The function short_read_sequencing() below will generate short reads from this genome.

In [1]:
import random
import toyplot
In [2]:
def random_sequence(seqlen):
    return "".join([random.choice("ACGT") for i in range(seqlen)])
In [3]:
def short_read_sequencing(sequence, nreads, readlen):
    "generate short reads from a circular genome"
    
    # do not allow reads to be longer than genome
    assert len(sequence) > readlen, "readlen must be shorter than sequence"
    
    # get random start positions of short reads
    starts = [random.randint(0, len(sequence)) for i in range(nreads)]
    
    # return reads as a list, generate reads by slicing from sequence
    reads = []
    for position in starts:
        end = position + readlen
        
        # if read extends past end then loop to beginning of sequence
        if end > len(sequence):
            read = sequence[position:len(sequence)] + sequence[0:end-len(sequence)]
        else:
            read = sequence[position:position + readlen]
        
        # append to reads list
        reads.append(read)
    return reads
In [4]:
# get a random genome sequence
genome = random_sequence(100)

# get 100 reads randomly drawn from the genome
reads = short_read_sequencing(genome, 100, 25)

# print the first 20 reads
reads[:20]
Out[4]:
['ATACGCGGGCCAGGATGTGCTGCTT',
 'CCTTGGCCAAAAATCCCCACGGGTT',
 'GAGCCTTGGCCAAAAATCCCCACGG',
 'TGCACGAAAATACGCGGGCCAGGAT',
 'CAGGATGTGCTGCTTGCCGGAAAAT',
 'GAGCGGGAGCCTTGGCCAAAAATCC',
 'CTTGCCGGAAAATCAAAAGAGCGGG',
 'GATGTGCTGCTTGCCGGAAAATCAA',
 'AATCCCCACGGGTTTGAGGAAACAG',
 'AGGAAACAGTACTAATGCACGAAAA',
 'TGCTGCTTGCCGGAAAATCAAAAGA',
 'CACGGGTTTGAGGAAACAGTACTAA',
 'ATACGCGGGCCAGGATGTGCTGCTT',
 'ATGTGCTGCTTGCCGGAAAATCAAA',
 'AGAGCGGGAGCCTTGGCCAAAAATC',
 'TGTGCTGCTTGCCGGAAAATCAAAA',
 'GCCTTGGCCAAAAATCCCCACGGGT',
 'GAAAATACGCGGGCCAGGATGTGCT',
 'AAAAATCCCCACGGGTTTGAGGAAA',
 'TGGCCAAAAATCCCCACGGGTTTGA']

Generating Kmers

We now have a 100bp genome (treated as circular) that has been sequenced to produce 100 reads that are each 25bp in length. The chances that these 100 reads started at every possible point in the genome is really low, and so it will be hard to build a de Bruijn graph from these reads as they are currently. Instead, we can break them down into smaller kmers. Here we will recycle the function we used last time to get kmers from a sequence, and we will instead get all kmers from all of the reads.

In [5]:
def get_kmer_count_from_sequence(sequence, k=3, cyclic=True):
    """
    Returns dictionary with keys representing all possible kmers in a sequence
    and values counting their occurrence in the sequence.
    """
    # dict to store kmers
    kmers = {}
    
    # count how many times each occurred in this sequence (treated as cyclic)
    for i in range(0, len(sequence)):
        kmer = sequence[i:i + k]
        
        # for cyclic sequence get kmers that wrap from end to beginning
        length = len(kmer)
        if cyclic:
            if len(kmer) != k:
                kmer += sequence[:(k - length)]
        
        # if not cyclic then skip kmers at end of sequence
        else:
            if len(kmer) != k:
                continue
        
        # count occurrence of this kmer in sequence
        if kmer in kmers:
            kmers[kmer] += 1
        else:
            kmers[kmer] = 1
    
    return kmers
In [6]:
def get_kmer_count_from_reads(reads, k=3):
    "Combines results of 'get_kmer_count_from_sequence()' across many reads"
   
    # a dictionary to store kmer counts in
    kmers = {}
    
    # iterate over reads
    for read in reads:
        
        # get kmer count for this read
        ikmers = get_kmer_count_from_sequence(read, k, cyclic=False)
        
        # add this kmer count to the global kmer counter across all reads
        for key, value in ikmers.items():
            if key in kmers:
                kmers[key] += value
            else:
                kmers[key] = value
                
    # return kmer counts
    return kmers
In [7]:
get_kmer_count_from_reads(reads)
Out[7]:
{'ATA': 25,
 'TAC': 45,
 'ACG': 73,
 'CGC': 24,
 'GCG': 43,
 'CGG': 93,
 'GGG': 68,
 'GGC': 51,
 'GCC': 95,
 'CCA': 74,
 'CAG': 45,
 'AGG': 47,
 'GGA': 84,
 'GAT': 26,
 'ATG': 44,
 'TGT': 24,
 'GTG': 26,
 'TGC': 96,
 'GCT': 49,
 'CTG': 23,
 'CTT': 51,
 'CCT': 27,
 'TTG': 70,
 'TGG': 26,
 'CAA': 47,
 'AAA': 229,
 'AAT': 90,
 'ATC': 51,
 'TCC': 25,
 'CCC': 47,
 'CAC': 49,
 'GGT': 24,
 'GTT': 23,
 'GAG': 62,
 'AGC': 41,
 'GCA': 24,
 'CGA': 24,
 'GAA': 68,
 'CCG': 21,
 'TCA': 25,
 'AAG': 20,
 'AGA': 19,
 'TTT': 21,
 'TGA': 24,
 'AAC': 20,
 'ACA': 20,
 'AGT': 19,
 'GTA': 19,
 'ACT': 20,
 'CTA': 19,
 'TAA': 20}

Get a de Bruijn graph

Again, this is the same function we used in the previous notebook for build a graph of edges form the kmers.

In [8]:
def get_debruijn_edges_from_kmers(kmers):
    """
    Every possible (k-1)mer (n-1 suffix and prefix of kmers) is assigned
    to a node, and we connect one node to another if the (k-1)mer overlaps 
    another. Nodes are (k-1)mers, edges are kmers.
    """
    # store edges as tuples in a set
    edges = set()
    
    # compare each (k-1)mer
    for k1, count1 in kmers.items():
        for k2, count2 in kmers.items():
            
            # if both kmers occurred in the sequence
            if (count1 & count2) and (k1 != k2):
            
                # if they overlap then add to edges
                if k1[1:] == k2[:-1]:
                    edges.add((k1[:-1], k2[:-1]))
                if k1[:-1] == k2[1:]:
                    edges.add((k2[:-1], k1[:-1]))

    return edges
In [9]:
def plot_debruijn_graph(edges, width=500, height=500, vlshow=True):
    "returns a toyplot graph from an input of edges"
    graph = toyplot.graph(
        [i[0] for i in edges],
        [i[1] for i in edges],
        width=width,
        height=height,
        tmarker=">", 
        vsize=5,
        vstyle={"stroke": "black", "stroke-width": 2, "fill": "black"},
        vlshow=vlshow,
        estyle={"stroke": "black", "stroke-width": 2},
        layout=toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges()))
    return graph

Combine functions together

We now have all of the functions we need to assemble a genome sequence. Let's start with an easy example. Here we want to assemble a 50bp genome and we sequenced 1000 reads that are each 15bp long. So we can assume that the genome is probably covered completely by kmers of these reads. Using k=8 we can see below that a single Eulerian path is assembled from the short reads. This means the genome was successfully assembled.

In [10]:
random.seed(123)
genome = random_sequence(50)
reads = short_read_sequencing(genome, 1000, 15)
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome)
AGATGAATGGACCGGCCATATAAGTAAACCAGTTGTAGGTCGATTTTGAC
AAACCAGAACCAGTAAGTAAAAATGGACACAGATGACCAGTTACCGGCCAGATGAAAGGTCGAAGTAAACAGTTGTAATAAGTAATATAAGATGAATGATGGACCATTTTGACAGATGACAGTTGTCATATAACCAGTTGCCATATACCGGCCACGATTTTCGGCCATGAATGGAGACAGATGACCGGCGATGAATGATTTTGGCCATATGGACCGGGGCCATAGGTCGATGTAAACCGTAGGTCGTCGATTGTTGTAGTAAACCATAAGTAATAGGTCGTATAAGTTCGATTTTGAATGGTGACAGATGGACCGTGTAGGTTTGACAGTTGTAGGTTTGACATTTTGAC

Low coverage sequencing

What if we now reduce the sequencing coverage to a much lower amount, so that there is possibly parts of the genome that are not covered by n-1 overlapping kmers. When we use 8mers like above, but with lower coverage, we now see that the genome cannot be assembled, and is instead broken into several smaller contigs.

In [11]:
random.seed(123)
genome = random_sequence(50)
reads = short_read_sequencing(genome, 100, 15)
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
AAACCAGAACCAGTAAGTAAAAATGGACACAGATGACCAGTTACCGGCCAGATGAAAGGTCGAAGTAAACAGTTGTAATAAGTAATATAAGATGAATGATGGACCATTTTGACAGATGACAGTTGTCATATAACCAGTTGCCATATACCGGCCACGATTTTCGGCCATGAATGGAGACAGATGACCGGCGATGAATGATTTTGGCCATATGGACCGGGGCCATAGGTCGATGTAAACCGTAGGTCGTCGATTGTTGTAGTAAACCATAAGTAATAGGTCGTATAAGTTCGATTTTGAATGGTGACAGATGGACCGTGTAGGTTTGACAGTTGTAGGTTTGACATTTTGAC

Heterozygosity

Heterozygosity makes it much harder to assemble genomes since it introduces bubbles into the de Bruijn graph where a kmer can connect to multiple others. Let's look at an example where this is easier to see, here a 25bp genome. In the first example we sequence 1000 15bp reads and the genome assembles easily. In the second example, we sequence 500 15bp reads from each of two different copies of the genome which differ at two heterozygous sites. As a result, this introduces a number of loops into the de Bruijn graph, and thus makes it difficult to know unambiguously what the true sequence is. Finally, in a third example we assemble a genome where the two heterozygous sites are futher apart, further than the kmer size. You can see that now it does not create a loop. Thus, the kmer size and heterozygosity interact in setting the size of unique blocks that are useful for assembling the genome using de Bruijn graphs. In general, it is easiest to assemble genomes that are not highly heterozygous. For this reason researchers often use inbred individuals.

In [12]:
random.seed(1234)
genome1 = random_sequence(20)
reads1 = short_read_sequencing(genome1, 1000, 15)
kmers = get_kmer_count_from_reads(reads1, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
AAAAAAGAAAAAGCAAAAGCAAAAGCAAAAAGTTCAAGCAAAAAGTTCAAATAAAAACAATAAAGCAAAGAGTTCACATAAAAACAAAGTTCAATAAACACAATAGCAAAGTGTTCACATAAAAAATCACAATTTCACAA
In [13]:
random.seed(1234)
genome1 = random_sequence(20)

# 500 reads from genome copy 1
reads1 = short_read_sequencing(genome1, 500, 15)

# 500 reads from genome copy 2
genome2 = list(genome1)
genome2[5] = "T"
genome2[9] = "T"
genome2 = "".join(genome2)
reads2 = short_read_sequencing(genome2, 500, 15)
reads = reads1 + reads2

# db graph
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome1)
print(genome2)
TAAAAAAGCAAAGTTCACAA
TAAAATAGCTAAGTTCACAA
AAAAAAGAAAAAGCAAAAGCAAAAATAGAAAGCAAAAAGTTCAAATAGCAAGCAAAAAGTTCAAATAAAAAATAGCTACAATAAAGCAAAGAGCTAAGAGTTCACATAAAAAATAAAATATAGCTACAAAGTTCAATAAACACAATACTAAGTTGCAAAGTGCTAAGTGTTCACATAAAAAATAAAATATAAGTTCTAGCTAATCACAATTTCACAA
In [14]:
random.seed(1234)
genome1 = random_sequence(20)

# 500 reads from genome copy 1
reads1 = short_read_sequencing(genome1, 500, 15)

# 500 reads from genome copy 2
genome2 = list(genome1)
genome2[1] = "T"
genome2[19] = "C"
genome2 = "".join(genome2)
reads2 = short_read_sequencing(genome2, 500, 15)
reads = reads1 + reads2

# db graph
kmers = get_kmer_count_from_reads(reads, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges);
print(genome1)
print(genome2)
TAAAAAAGCAAAGTTCACAA
TTAAAAAGCAAAGTTCACAC
AAAAAAGAAAAAGCAAAAGCAAAAGCAAAAAGTTCAAGCAAAAAGTTCAAATAAAAACAATAAACACTTAACTTAAAAGCAAAGAGTTCACATAAAAACAAAGTTCAATAAACACAATACACACTTCACTTAACTTAAAAGCAAAGTGTTCACATAAAAAATAAAAAGTCACAATTCACACTTTAAAAATTCACAATTCACAC

Tip for questions below:

To plot a graph without the text labels shown use the argument vlshow=False. This will be useful since you only need to evaluate whether the graph looks fully connected below for larger graphs, and not necessarily need to read the node labels.

In [15]:
# example
random.seed(1234)
genome1 = random_sequence(20)
reads1 = short_read_sequencing(genome1, 1000, 15)
kmers = get_kmer_count_from_reads(reads1, k=8)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=250, height=250);
[6] Action: Using the functions from above, complete the following: [1] Generate a random genome sequence of length 50; [2] generate 1000 short reads of length 30 from this genome; [3] get kmers from these reads using whatever k size you wish; [4] get de Bruijn graph edges and plot the graph. You should be able to copy and modify the code above to accomplish this.
In [16]:
# set a seed to make it repeatable
random.seed(1234)

# enter length of 50 here
genome1 = random_sequence(50)

# enter 1000 and 30 here
reads1 = short_read_sequencing(genome1, 1000, 30)

# choose a k size
kmers = get_kmer_count_from_reads(reads1, k=12)

# plot the graph
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=250, height=250);
[7] Action: Using the functions from above, complete the following: [1] Generate a random genome sequence of length 200; [2] try generating 100 or 1000 short reads of length 20 or 50 from this genome; [3] get kmers from these reads using whatever k size you wish; [4] get de Bruijn graph edges and plot the graph. You should be able to copy and modify the code above to accomplish this. Explore how the graph looks when you change the parameters, leave the final executed code in whichever example you like.
In [17]:
# set a seed to make it repeatable
random.seed(1234)

# enter length of 50 here
genome1 = random_sequence(200)

# enter 1000 and 30 here
reads1 = short_read_sequencing(genome1, 1000, 50)

# choose a k size
kmers = get_kmer_count_from_reads(reads1, k=20)

# plot the graph
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, vlshow=False, width=550, height=550);
[8] Question: From looking at the graphs in the previous question, what did the graph look like when it could not be fully assembled? What kind of relationship did you observe between genome size, read length, and the number of reads needed to reach a fully assembled graph?

There are many loops in the graph when it cannot be assembled. When the kmer size was smaller there were many more nodes and edges in the graph. This made it harder to read but and induced more loops in it. Larger genome size was harder to get a clean graph for because there were more loops probably caused by repeat kmers. The more reads and larger kmer size that was used the better the graph assembled.