Notebook 10.2: De Bruijn graphs

Reading:

This notebook is meant to accompany the following assigned article: Compeau, Phillip E. C., Pavel A. Pevzner, and Glenn Tesler. 2011. “How to Apply de Bruijn Graphs to Genome Assembly.” Nature Biotechnology 29 (11): 987–91. https://doi.org/10.1038/nbt.2023.

Learning objectives:

By the end of this notebook you should:

  1. Understand the example from Box 1 of your reading.
  2. Be able to follow along with code for constructing a de Bruijn graph.
  3. Be able to visually interpret a de Bruijn graph as a genome assembly.
In [1]:
import random
import toyplot

Functions for inferring a de Bruijn graph

In the example below we start by looking at the examples from Box 1 and Fig. 2 from the assigned reading. Can we construct the string "0001110100" from its component 3-mers? To start, we need to extract the 3mers from the sequence. The function below will return a dictionary object where the keys include all observed kmers in the input alphabet, and the values represent the number of times that kmer was observed in the input sequence.

In [2]:
def get_kmer_count_from_sequence(sequence, k=3, cyclic=True):
    """
    Returns dictionary with keys representing all possible kmers in a sequence
    and values counting their occurrence in the sequence.
    """
    # dict to store kmers
    kmers = {}
    
    # count how many times each occurred in this sequence (treated as cyclic)
    for i in range(0, len(sequence)):
        kmer = sequence[i:i + k]
        
        # for cyclic sequence get kmers that wrap from end to beginning
        length = len(kmer)
        if cyclic:
            if len(kmer) != k:
                kmer += sequence[:(k - length)]
        
        # if not cyclic then skip kmers at end of sequence
        else:
            if len(kmer) != k:
                continue
        
        # count occurrence of this kmer in sequence
        if kmer in kmers:
            kmers[kmer] += 1
        else:
            kmers[kmer] = 1
    
    return kmers
In [3]:
# the binary sequence
binary = "0001110100"

# get all 3mers in the binary sequence
kmers = get_kmer_count_from_sequence(binary, k=3, cyclic=False)

# return the kmers dictionary
kmers
Out[3]:
{'000': 1,
 '001': 1,
 '011': 1,
 '111': 1,
 '110': 1,
 '101': 1,
 '010': 1,
 '100': 1}

Build the de Bruijn graph

A graph can be constructed from a list of edges that connect nodes. For example, [(A, B), (B, C), (C, A)] would describe a graph connecting the three nodes A, B, and C. As a directed graph, this states that A connects to B, B to C, and C to A. In terms of Python code we can represent this is many possible ways. Here we will use a set of tuples to store pairs of (k-1)mers representing an edge.

We can construct a graph connecting (k-1)mers as nodes using the code below. This compares all kmers to each other, if the n-1 suffix of one matches the n-1 prefix on another (i.e, they overlap) then the pair (which together form a kmer) is stored as an edge in the graph. It might surprise you to learn that we can infer a de Bruijn graph with only about 10 lines of Python code. Cool right?

In [4]:
def get_debruijn_edges_from_kmers(kmers):
    """
    Every possible (k-1)mer (n-1 suffix and prefix of kmers) is assigned
    to a node, and we connect one node to another if the (k-1)mer overlaps 
    another. Nodes are (k-1)mers, edges are kmers.
    """
    # store edges as tuples in a set
    edges = set()
    
    # compare each (k-1)mer
    for k1 in kmers:
        for k2 in kmers:
            if k1 != k2:            
                # if they overlap then add to edges
                if k1[1:] == k2[:-1]:
                    edges.add((k1[:-1], k2[:-1]))
                if k1[:-1] == k2[1:]:
                    edges.add((k2[:-1], k1[:-1]))

    return edges

This example is from Figure 2 of your reading. We use 4mers to get a de Bruijn graph of a long binary sequence.

In [5]:
# the binary sequence
binary = "0000110010111101"

# get all 4mers in the binary sequence
kmers = get_kmer_count_from_sequence(binary, k=4, cyclic=True)
print(kmers)

# get a set of edges for all 4-mers matching by n-1
edges = get_debruijn_edges_from_kmers(kmers)

# return edges
edges
{'0000': 1, '0001': 1, '0011': 1, '0110': 1, '1100': 1, '1001': 1, '0010': 1, '0101': 1, '1011': 1, '0111': 1, '1111': 1, '1110': 1, '1101': 1, '1010': 1, '0100': 1, '1000': 1}
Out[5]:
{('000', '000'),
 ('000', '001'),
 ('001', '010'),
 ('001', '011'),
 ('010', '100'),
 ('010', '101'),
 ('011', '110'),
 ('011', '111'),
 ('100', '000'),
 ('100', '001'),
 ('101', '010'),
 ('101', '011'),
 ('110', '100'),
 ('110', '101'),
 ('111', '110'),
 ('111', '111')}

Plot the de Bruijn graph

Once the graph is constructed we are not yet finished, we still need to find the shortest Eulerian path through the graph. This will represent the most accurate ordered assembly of kmers that will match the full string sequence. In this simple example the set of kmers fits Euler's theorem perfectly, each kmer in the string was observed, and each occurs only once, thus there is a simple path that is clearly visible in the graph.

The function below uses the Python library toyplot to generate a graph from a set of edges. It uses a force-directed algorithm to randomly space the nodes to make it easy to read. The edges of the graph are shown with arrows which indicates the direction. The function below itself simply takes the edge information we generated and a number of styling options for the plot.

In [6]:
def plot_debruijn_graph(edges, width=500, height=500):
    "returns a toyplot graph from an input of edges"
    graph = toyplot.graph(
        [i[0] for i in edges],
        [i[1] for i in edges],
        width=width,
        height=height,
        tmarker=">", 
        vsize=25,
        vstyle={"stroke": "black", "stroke-width": 2, "fill": "none"},
        vlstyle={"font-size": "11px"},
        estyle={"stroke": "black", "stroke-width": 2},
        layout=toyplot.layout.FruchtermanReingold(edges=toyplot.layout.CurvedEdges()))
    return graph
In [7]:
# print the cyclic binary string represented by the de Bruijn graph 
print(binary)

# plot the graph
plot_debruijn_graph(edges);
0000110010111101
000001010011100101110111
[4] Question: Compare the graph above to the one in Figure 2 of your reading. Although the placement of the nodes is not exactly the same, are the connections among them (i.e., the graph structure) the same? To check this, choose a (k-1)mer node on the graph and follow along the directed edges of the graph, and check the binary string to see if you can trace the entire path. Answer in Markdown below.

Yes the graph structure is the same in this figure as in Fig. 2 of the reading. If you start from any position in this graph you can trace a path through the nodes that would reconstruct the cyclic superstring.

Example with sequence data

In this example we will look at sequence data instead of a binary string, and we will explore how kmer length affects our ability to identify a single Eulerian path, versus multiple conflicting paths. We can easily construct a de Bruijn graph from the sequence data just like we did with the binary data by using the same functions we used above. Let's start by generating some random sequence data.

In [8]:
def random_sequence(seqlen):
    "Generate a random DNA sequence of a given length "
    return "".join([random.choice("ACGT") for i in range(seqlen)])
In [9]:
# set a random seed 
random.seed(123)

# get a random genome sequence
genome1 = random_sequence(25)
genome1
Out[9]:
'AGATGAATGGACCGGCCATATAAGT'

kmers and edges for sequence data

And now let's apply our functions to get kmers and edges from the sequence.

In [10]:
# not all possible kmers occur in this sequence, some occur multiple times.
kmers = get_kmer_count_from_sequence(genome1, k=3)
kmers
Out[10]:
{'AGA': 1,
 'GAT': 1,
 'ATG': 2,
 'TGA': 1,
 'GAA': 1,
 'AAT': 1,
 'TGG': 1,
 'GGA': 1,
 'GAC': 1,
 'ACC': 1,
 'CCG': 1,
 'CGG': 1,
 'GGC': 1,
 'GCC': 1,
 'CCA': 1,
 'CAT': 1,
 'ATA': 2,
 'TAT': 1,
 'TAA': 1,
 'AAG': 1,
 'AGT': 1,
 'GTA': 1,
 'TAG': 1}
In [11]:
# edges of connected (k-1)mers for k=3 representing the db graph
edges = get_debruijn_edges_from_kmers(kmers)
edges
Out[11]:
{('AA', 'AG'),
 ('AA', 'AT'),
 ('AC', 'CC'),
 ('AG', 'GA'),
 ('AG', 'GT'),
 ('AT', 'TA'),
 ('AT', 'TG'),
 ('CA', 'AT'),
 ('CC', 'CA'),
 ('CC', 'CG'),
 ('CG', 'GG'),
 ('GA', 'AA'),
 ('GA', 'AC'),
 ('GA', 'AT'),
 ('GC', 'CC'),
 ('GG', 'GA'),
 ('GG', 'GC'),
 ('GT', 'TA'),
 ('TA', 'AA'),
 ('TA', 'AG'),
 ('TA', 'AT'),
 ('TG', 'GA'),
 ('TG', 'GG')}

Get kmers and graph edges and plot for k=6

Let's put these functions all together to infer a de Bruijn graph and plot it for a sequence using kmers of size=6. You can see in the example below that the 6mers from this 25bp genome are able to uniquely map a Eulerian graph to represent the full genome sequence. The thing to take-away from this plot is the relationship of the directed edges. We can see that there is only one path that connects every node in the graph. This path spells out the genome sequence. In other words, we assembled the genome from its kmers!

In [12]:
# get kmers
kmers = get_kmer_count_from_sequence(genome1, k=6, cyclic=True)

# get db graph
edges = get_debruijn_edges_from_kmers(kmers)

# plot db graph
plot_debruijn_graph(edges, width=600, height=400);

# print the true sequence
print("the true sequence: {}".format(genome1))
the true sequence: AGATGAATGGACCGGCCATATAAGT
AAGTAAATGGACCGGAGATGAGTAGATAAGATATAATGAAATGGACATATCCATACCGGCCGGCCGAATGGACCGGATGAGCCATGGACCGGCCAGTAGATAAGTTAGATTATAATGAATTGGAC

We can run the same code as above but this time we provide the function get_kmer_count_from_sequence() with the argument cyclic=False. This is more realistic for a linear chromosome, rather than circular one. Here you can see the genome sequence even more clearly (again, ignore the loopiness of the graph edges, pay attention to which nodes are pointing to which).

In [13]:
# get kmers
kmers = get_kmer_count_from_sequence(genome1, k=6, cyclic=False)

# get db graph
edges = get_debruijn_edges_from_kmers(kmers)

# plot db graph
plot_debruijn_graph(edges, width=600, height=400);

# print the true sequence
print("the true sequence: {}".format(genome1))
the true sequence: AGATGAATGGACCGGCCATATAAGT
AATGGACCGGAGATGATAAGATATAATGAAATGGACATATCCATACCGGCCGGCCGAATGGACCGGATGAGCCATGGACCGGCCATATAATGAATTGGAC

So what's so hard about genome assembly? Well, let's try k=4:

Let's take a look at what happens when we try to infer the same graph using smaller kmers. What in the world! As described in box 2 of your reading, the real world of genome assembly is much more difficult than the toy examples we've seen so far. One reason for this is that given the kmer size that we choose, which is often a limitation imposed by the short read datatype, there may be too much repetition of edges connecting the same kmers such that there is not a single unique Eulerian path through the graph. In the example below we can see that a few nodes have multiple edges coming in or out. In this case, because each internal edge has the same number of edges coming in as out, it still actually forms a Eulerian path! However, it is ambiguous to us which way through the path is the true one. These loops in the path are called bubbles, and genome assembly software is optimized to try to solve them.

In [14]:
kmers = get_kmer_count_from_sequence(genome1, k=4, cyclic=False)
edges = get_debruijn_edges_from_kmers(kmers)
plot_debruijn_graph(edges, width=800, height=400);
AAGAATACCAGAATAATGCATCCACCGCGGGAAGACGATGCCGGAGGCTAATATTGATGG
[5] Question: Box 2 of your reading describes four hidden assumptions of de Bruijn graph assembly. List these four assumptions below. For each list whether we have addressed this assumption yet in our examples.

(1) We generated all kmers present in the genome. Here we used code to generate all kmers so we know that none are missing but in real genomes it is possible that some kmers simply were not sequenced.

(2) Kmers will be error free. We did not address this in our code, but instead generated them as error free. We could have introduced random errors into our sequences to test their effect on the kmer count and graph construction.

(3) Each kmer occurs only once. We did not address this in our code. Using longer kmers will increase the probability that kmers are not repeated often in the sequence. For highly repetitive genomes longer kmers work better. Otherwise repeats will induce loops in the graph.

(4) Genomes are cyclic. We addressed this in our functions above by allowing an option to either include kmers that stretch from the end of the string back to the beginning or not.