Notebook 11.2: de novo PacBio genome assembly

Learning objectives:

By the end of this notebook you will:

  • Experience running PacBio long-read assembly.
  • Compare a long-read assembly to the short-read and hybrid assemblies from the last notebook.
  • Learn about alternative assembly software.

Assigned readings:

Read these papers carefully; we will discuss them in class:

  • Li, Fay-Wei, and Alex Harkess. “A Guide to Sequence Your Favorite Plant Genomes.” Applications in Plant Sciences 6, no. 3 (n.d.): e1030. https://doi.org/10.1002/aps3.1030.
  • Wang et al. (2020) The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies. GigaScience. link.
  • Lightfoot et al. (2017) Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution. link.

You do not need to read this entire paper, but it will be used as a reference for this notebook:

  • Sumpter, Nicholas, Margi Butler, and Russell Poulter. “Single-Phase PacBio De Novo Assembly of the Genome of the Chytrid Fungus Batrachochytrium Dendrobatidis, a Pathogen of Amphibia.” Microbiology Resource Announcements 7, no. 21 (November 2018). https://doi.org/10.1128/MRA.01348-18.
In [1]:
import pandas as pd
In [2]:
# conda install bioconda::miniasm
# conda install bioconda::minimap2
# conda install bioconda::canu

Batrachochytrium dendrobatidis genome assembly

In this notebook we will continue to examine the published genome assembly data for Bd (https://www.ncbi.nlm.nih.gov/pubmed/30533847), which was recently assembled with PacBio reads. In the last notebook we assembled a related genome using short-read and hybrid short-read + PacBio data. Here we will assemble the genome using only PacBio long reads. In terms of the algorithms involved, this is a quite different approach from the de Bruijn graph method used in the first notebook. We will try two different approaches: a very fast assembler called miniasm, and a much slower but more accurate method called canu. The latter is the method the authors used in the study. Both are OLC (overlap-layout-consensus) assemblers, which build an overlap graph by mapping all reads against each other and then search for a path (a Hamiltonian path) through that graph.
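To make the OLC idea concrete, here is a toy sketch (not part of the real pipeline): three made-up 8 bp "reads" are compared all-against-all to find suffix-prefix overlaps, and the resulting overlap graph is walked to produce a contig. Real assemblers do this with noisy multi-kilobase reads and far more sophisticated algorithms.

```python
# Toy illustration of the overlap-layout-consensus (OLC) idea, using
# made-up 8 bp "reads"; real reads are noisy and thousands of bp long.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

reads = ["ATGCGTAC", "GTACGGTT", "GGTTACCA"]

# overlap graph: an edge i -> j wherever read i's suffix matches read j's prefix
edges = {(i, j): overlap(reads[i], reads[j])
         for i in range(len(reads))
         for j in range(len(reads)) if i != j}
edges = {k: v for k, v in edges.items() if v > 0}
print(edges)  # {(0, 1): 4, (1, 2): 4}

# layout: walk the single unambiguous path 0 -> 1 -> 2, merging overlaps
contig = reads[0]
for i, j in [(0, 1), (1, 2)]:
    contig += reads[j][edges[(i, j)]:]
print(contig)  # ATGCGTACGGTTACCA
```

The graph here has only one path; the hard part of real assembly is resolving the loops and bubbles that repeats and sequencing errors introduce.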

Miniasm assembly

The miniasm assembly is performed in two steps. We first use the tool minimap2 to calculate the overlaps among all of the PacBio reads in the fastq data file. Then we input this overlap information along with the raw data to miniasm, which infers a graph assembly -- the best way to uniquely connect all overlapping sequences. This involves splitting ambiguous loops in the graph and calculating the shortest path through it. The result can be converted to a final fasta file representing the consensus assembled contigs.

Overlap: get overlap of PacBio reads

Here we align all of the PacBio reads to each other. For very large datasets this can be time consuming. New algorithms (e.g., minimap2) have dramatically improved the speed of this step. Although it runs fairly fast, it is too memory-intensive (it requires many GB of RAM) to run on the small servers we have available on binder or codio. We will instead continue to examine pre-prepared result files.

In [3]:
%%bash

# minimap2 -x ava-pb -t 8 SRR7825134.fastq SRR7825134.fastq > SRR7825134.paf

This returns a result in pairwise alignment format (a .paf file). To save you from having to compute this file, I uploaded its first 20 lines to a URL so you can easily look at it. Because the format is tabular we can load it as a DataFrame with pandas, as below. The file records the positions where long reads in the data overlap each other.

In [4]:
# URL to the snippet of the PAF file hosted online
url = ("https://raw.githubusercontent.com/genomics-course/"
       "11-assembly/master/notebooks/SRR7825134.20lines.paf")

# load the PAF file as a dataframe and return
pd.read_csv(url, sep="\t", header=None)
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 SRR7825134.1 7563 729 7455 + SRR7825134.66305 9555 1321 8140 1083 6903 0 tp:A:S cm:i:74 s1:i:1008 dv:f:0.1064
1 SRR7825134.1 7563 85 5480 + SRR7825134.135869 13250 7604 12972 995 5486 0 tp:A:S cm:i:69 s1:i:935 dv:f:0.1004
2 SRR7825134.1 7563 729 7404 + SRR7825134.50128 16210 653 7265 970 6743 0 tp:A:S cm:i:54 s1:i:913 dv:f:0.1179
3 SRR7825134.1 7563 97 7455 - SRR7825134.7296 13396 5948 13367 950 7458 0 tp:A:S cm:i:59 s1:i:910 dv:f:0.1181
4 SRR7825134.1 7563 55 6643 - SRR7825134.119454 10268 218 6869 957 6739 0 tp:A:S cm:i:72 s1:i:889 dv:f:0.1062
5 SRR7825134.1 7563 55 6046 + SRR7825134.29632 9799 302 6226 946 6066 0 tp:A:S cm:i:69 s1:i:887 dv:f:0.1042
6 SRR7825134.1 7563 85 7455 + SRR7825134.17180 11608 3420 10764 938 7461 0 tp:A:S cm:i:58 s1:i:878 dv:f:0.1188
7 SRR7825134.1 7563 1218 7521 - SRR7825134.102167 12373 5408 11550 937 6353 0 tp:A:S cm:i:58 s1:i:865 dv:f:0.1130
8 SRR7825134.1 7563 882 7455 + SRR7825134.130891 14700 6664 13189 937 6668 0 tp:A:S cm:i:60 s1:i:865 dv:f:0.1131
9 SRR7825134.1 7563 55 7455 + SRR7825134.76456 12524 1553 8891 940 7497 0 tp:A:S cm:i:62 s1:i:862 dv:f:0.1163
10 SRR7825134.1 7563 1417 7455 + SRR7825134.108690 12058 87 6116 904 6114 0 tp:A:S cm:i:52 s1:i:860 dv:f:0.1154
11 SRR7825134.1 7563 85 7455 - SRR7825134.30530 12614 3930 11182 916 7419 0 tp:A:S cm:i:59 s1:i:857 dv:f:0.1182
12 SRR7825134.1 7563 55 7455 + SRR7825134.30529 14545 3138 10545 903 7501 0 tp:A:S cm:i:68 s1:i:848 dv:f:0.1129
13 SRR7825134.1 7563 622 7455 + SRR7825134.106363 13941 1245 8103 892 6945 0 tp:A:S cm:i:59 s1:i:843 dv:f:0.1154
14 SRR7825134.1 7563 85 7404 + SRR7825134.79502 23813 15497 22975 905 7521 0 tp:A:S cm:i:54 s1:i:833 dv:f:0.1212
15 SRR7825134.1 7563 729 7455 - SRR7825134.159776 13305 6592 13260 889 6800 0 tp:A:S cm:i:51 s1:i:828 dv:f:0.1203
16 SRR7825134.1 7563 85 7455 + SRR7825134.64273 15867 4127 11510 858 7453 0 tp:A:S cm:i:64 s1:i:817 dv:f:0.1151
17 SRR7825134.1 7563 55 7404 + SRR7825134.142751 15244 2637 9951 865 7422 0 tp:A:S cm:i:53 s1:i:815 dv:f:0.1219
18 SRR7825134.1 7563 95 6474 - SRR7825134.131064 22929 55 6394 867 6458 0 tp:A:S cm:i:56 s1:i:812 dv:f:0.1148
19 SRR7825134.1 7563 55 7455 + SRR7825134.20818 11224 2186 9710 877 7583 0 tp:A:S cm:i:52 s1:i:808 dv:f:0.1229
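The unnamed columns above are easier to read with labels attached. The first 12 columns of a PAF file are fixed by the minimap2 PAF specification; everything after them is an optional SAM-style tag (tp, cm, s1, dv in this dataset). A small sketch, using one overlap record copied from the output above in place of the full file:

```python
import io
import pandas as pd

# The 12 mandatory PAF fields (minimap2 PAF specification); trailing
# columns are optional SAM-style tags.
paf_cols = ["query", "qlen", "qstart", "qend", "strand",
            "target", "tlen", "tstart", "tend",
            "matches", "aln_len", "mapq"]

# one overlap record copied from the output above, standing in for the file
record = ("SRR7825134.1\t7563\t729\t7455\t+\tSRR7825134.66305\t9555\t1321\t"
          "8140\t1083\t6903\t0\ttp:A:S\tcm:i:74\ts1:i:1008\tdv:f:0.1064\n")

paf = pd.read_csv(io.StringIO(record), sep="\t", header=None)
paf.columns = paf_cols + ["tp", "cm", "s1", "dv"]

# how much of the query read is covered by this overlap
paf["q_ov_len"] = paf["qend"] - paf["qstart"]
print(paf.loc[0, ["query", "target", "strand", "q_ov_len"]])
```

So the first record says that read SRR7825134.1 overlaps read SRR7825134.66305 on the same strand over roughly 6.7 kb of its 7.6 kb length.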

Layout: assemble contigs from overlapping reads

The program miniasm is an OLC assembler: it takes the raw long-read data (.fastq) and the mapping information (.paf) and builds contigs by constructing a graph that connects sequences based on their overlaps. The output is in graphical fragment assembly format (.gfa). This format can be used by some software (e.g., Bandage) to visualize the assembly. It contains the assembled contig sequences as well as the loops and bubbles (ambiguities) in the assembly graph.
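The GFA format itself is simple and tab-delimited: "S" lines hold the segment (contig) sequences and "L" lines record the links between segments. A minimal parsing sketch, using two made-up records (the segment names and sequences here are invented for illustration, not taken from our assembly):

```python
# Minimal sketch of the GFA format miniasm emits. "S" lines are
# segments (name, sequence); "L" lines are links between segments
# (from, orientation, to, orientation, overlap CIGAR).
gfa_text = """S\tutg000001l\tACGTACGTAC
S\tutg000002l\tTTGGCCAATT
L\tutg000001l\t+\tutg000002l\t+\t4M
"""

segments, links = {}, []
for line in gfa_text.splitlines():
    fields = line.split("\t")
    if fields[0] == "S":        # segment record: store name -> sequence
        segments[fields[1]] = fields[2]
    elif fields[0] == "L":      # link record: store the connection
        links.append(tuple(fields[1:6]))

print(len(segments), "segments,", len(links), "link(s)")
```

Bandage reads exactly this structure to draw the graph: segments become nodes and links become the edges connecting them.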

The code to run the miniasm assembler is commented below and you do not need to run it.

In [5]:
%%bash

# miniasm -f SRR7825134.fastq SRR7825134.paf > SRR7825134.gfa

Please run the two code cells below, which will download the Bandage command line tool and execute it on the inferred GFA assembly graph to create a visualization. Then run the cell after that to display the assembly graph image. You can see that the assembly has produced several distinct, disconnected contigs, and that several of them contain bubbles.

In [6]:
%%bash

# download and unzip the Bandage command line tool
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip -q
unzip -qq -o Bandage_Ubuntu_dynamic_v0_8_1.zip

# create an assembly graph image with Bandage 
./Bandage image \
    SRR7825134.gfa \
    SRR7825134.gfa.png \
    --names \
    --lengths \
    --outline 0 \
    --colour random \
    --fontsize 10 \
    --iter 4
In [7]:
from IPython.display import Image
Image(filename='SRR7825134.gfa.png')
Out[7]:

If we just want the best estimate of the assembly we can extract the contig sequences from this file to get a fasta file. This can be done with the unix tool awk, extracting the lines of the .gfa file that start with "S" and converting them to fasta format, as shown below.

In [8]:
%%bash

# awk '$1 == "S" {print ">"$2"\n"$3}' SRR7825134.gfa > assembly_miniasm.fasta
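The same conversion can be done in Python, which makes the logic explicit: keep only the "S" (segment) records and print each as a fasta entry. This sketch operates on an in-memory example (the segment names and sequences are made up) rather than the real SRR7825134.gfa file:

```python
# Python equivalent of the awk one-liner: convert GFA "S" records to fasta.
# The records below are invented for illustration.
gfa_lines = [
    "S\tutg000001l\tACGTACGT",
    "L\tutg000001l\t+\tutg000002l\t+\t4M",
    "S\tutg000002l\tTTGGCCAA",
]

fasta = []
for line in gfa_lines:
    fields = line.split("\t")
    if fields[0] == "S":              # keep only segment records
        fasta.append(f">{fields[1]}\n{fields[2]}")

print("\n".join(fasta))
```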

Consensus: polish the assembled contigs by mapping reads again

Technically, the last step run by miniasm performed both the layout and consensus steps of OLC assembly: the assembly fasta file was constructed from the overlaps among the long reads. However, with long-read assemblies it is common to apply one or more additional consensus steps that aim to polish the final base calls. Example programs for final polishing include racon and quiver, which use PacBio reads to polish the assembly, or pilon, which can use Illumina data to do so (the latest version of pilon added support for using long reads as well). We will skip this step for now since it is often very time consuming.
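The core idea behind polishing can be sketched in a few lines: where several reads align over the same stretch of the draft contig, take the majority base at each position. This toy example (with invented, pre-aligned sequences) only illustrates the concept; tools like racon work on real alignments and handle insertions, deletions, and quality scores.

```python
from collections import Counter

# Toy illustration of consensus polishing: majority vote per position
# among reads aligned over a draft contig. Sequences are made up.
draft = "ACGTTCGA"          # draft contig with an error at position 4
aligned_reads = [
    "ACGTACGA",
    "ACGTACGA",
    "ACCTACGA",             # this read has its own error at position 2
]

polished = "".join(
    Counter(read[i] for read in aligned_reads).most_common(1)[0][0]
    for i in range(len(draft))
)
print(polished)  # ACGTACGA: the vote corrects the draft's position 4
```

Note how the single read error at position 2 is outvoted, which is why polishing works best with high read coverage.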

Before we analyze this genome file, let's try assembling the genome again from PacBio data using a different program.

Canu assembly

The program canu is considered one of the best long-read assemblers, but it is much slower than the approach we just tried. While the miniasm assembly above finished in just a few minutes, the canu assembly below took several hours. Once we get to the assembly comparisons you can decide whether you think the longer run time was worth it.

One of the reasons canu is much slower is that it performs the full suite of steps we've discussed: pre-cleanup of reads, read overlap calculations, layout graph construction, consensus contig construction, and post-analysis contig polishing. A nice feature of canu is that you can tell it to do all of this with a single command rather than requiring multiple different programs or commands.

In [9]:
# canu parameters explained
# -p: prefix name for output files
# -d: output directory name
# genomeSize=: estimate of the genome size
# -pacbio-raw: the raw fastq data
In [10]:
%%bash

# canu -p SRR7825124 -d assembly_canu genomeSize=30m -pacbio-raw SRR7825134.fastq

While canu runs it prints a long log to the screen explaining each step of the analysis and how many resources it is consuming, very similar to SPAdes in the last notebook. We'll skip the log file for now and move on to comparing the resulting files.

Compare all four genome assemblies

Again we will use quast to assess the quality of our genome assemblies. The command below inputs the assembled contigs (in fasta format) from all four assemblies we've completed using four different approaches. It measures summary statistics of assembly quality as well as BUSCO scores. You can read the report here.

In [11]:
%%bash

# quast.py \
#     assembly_spades/scaffolds.fasta \
#     assembly_spades_hybrid/scaffolds.fasta \
#     assembly_miniasm.fasta \
#     assembly_canu/SRR7825124.fasta \
#     -o comparison \
#     --conserved-genes-finding \
#     --fungus
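One of quast's headline statistics is N50: sort the contigs from longest to shortest, and N50 is the length of the contig at which the running total first covers half the assembly. A small sketch with made-up contig lengths:

```python
# N50: length of the contig at which the cumulative length of contigs,
# taken longest-first, first reaches half the total assembly size.
def n50(lengths):
    half = sum(lengths) / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length

contig_lengths = [500, 800, 1200, 3000, 4500]   # total = 10,000
print(n50(contig_lengths))  # 3000
```

This is why a few long contigs (as in the long-read assemblies) push N50 up so dramatically compared to many short ones.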
[8] Question: Which of the four assembly methods yielded the best assembly in your opinion? Which statistics did you base this on? Why do you think the contig N50 was higher in some assemblies versus others? Why do you think the complete BUSCO score was higher in some versus others? Answer in Markdown below.

It seems overall that the canu long-read assembly is the best. This is based on the number of contigs (100), the N50 (860 Kb), and the BUSCO scores (90%). The longer reads in the PacBio assemblies allow the assembler to connect many smaller fragments together, yielding a larger N50. The canu assembly probably does best because it error-corrects the long reads before finding their overlaps. I think the high error rate in the PacBio-only assemblies may make it hard to identify homology with the single-copy orthologs in BUSCO, and so they have lower scores.

[9] Question: What assembly software did Sumpter et al. use in the paper to assemble the *Bd* genome? How do their assembly statistics compare to ours? Why do you think they might differ?

Sumpter et al. used Canu. Our canu assembly actually has a better N50 and fewer contigs than theirs. This might be because the canu software has improved and we used a newer version.

You just completed several genome assemblies that in total took only a few hours. Are you surprised by how easy or difficult this was? The time and expense involved in obtaining the data is the hardest part, followed by getting sufficient computational resources to assemble the genome. But overall, it's not a terribly difficult task: the canu and spades assemblies could each be done with a single command.

[10] Question: Search Google Scholar for another published genome from within the last three years. In the Markdown cell below, report the citation for the publication, how large the genome is, which technology was used to sequence it, and a summary of the genome assembly (e.g., N50 and/or number of scaffolds/contigs). Try to find an example for an organism that is of interest to you.
In [ ]:
# answer here