Notebook 11.2: de novo PacBio genome assembly

Learning objectives:

By the end of this notebook you will:

  • Experience running PacBio long-read assembly.
  • Compare a long-read assembly to the short-read and hybrid assemblies from the last notebook.
  • Learn about alternative assembly software.

Assigned readings:

Read these papers carefully; we will discuss them in class:

  • Li, Fay-Wei, and Alex Harkess. “A Guide to Sequence Your Favorite Plant Genomes.” Applications in Plant Sciences 6, no. 3 (n.d.): e1030. https://doi.org/10.1002/aps3.1030.
  • Wang et al. (2020) The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies. GigaScience. link.
  • Lightfoot et al. (2017) Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution. link.

You do not need to read this entire paper, but it will be used as a reference for this notebook:

  • Sumpter, Nicholas, Margi Butler, and Russell Poulter. “Single-Phase PacBio De Novo Assembly of the Genome of the Chytrid Fungus Batrachochytrium Dendrobatidis, a Pathogen of Amphibia.” Microbiology Resource Announcements 7, no. 21 (November 2018). https://doi.org/10.1128/MRA.01348-18.
In [1]:
import pandas as pd
In [2]:
# conda install bioconda::miniasm
# conda install bioconda::minimap2
# conda install bioconda::canu

Batrachochytrium dendrobatidis genome assembly

In this notebook we will continue to examine the published genome assembly data for Bd (https://www.ncbi.nlm.nih.gov/pubmed/30533847), which was recently assembled with PacBio reads. In the last notebook we assembled a related genome using short-read and hybrid short-read + PacBio data. Here we will assemble the genome using only PacBio long reads. In terms of the algorithms involved, this is a quite different approach from the de Bruijn graph method used in the first notebook. We will try two different approaches: a very fast assembler called miniasm, and a much slower but more accurate method called canu. The latter is the method the authors used in the study. Both are OLC (overlap-layout-consensus) assemblers, which build an overlap graph by mapping all reads against each other and then search for a path (a Hamiltonian path) through that graph.
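To make the OLC idea concrete, here is a toy sketch (not part of the real pipeline): three made-up 8 bp "reads" are compared all-against-all to find suffix-prefix overlaps, and the resulting overlap graph is walked to produce a contig. Real assemblers do this with noisy multi-kilobase reads and far more sophisticated algorithms.

```python
# Toy illustration of the overlap-layout-consensus (OLC) idea, using
# made-up 8 bp "reads"; real reads are noisy and thousands of bp long.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

reads = ["ATGCGTAC", "GTACGGTT", "GGTTACCA"]

# overlap graph: an edge i -> j wherever read i's suffix matches read j's prefix
edges = {(i, j): overlap(reads[i], reads[j])
         for i in range(len(reads))
         for j in range(len(reads)) if i != j}
edges = {k: v for k, v in edges.items() if v > 0}
print(edges)  # {(0, 1): 4, (1, 2): 4}

# layout: walk the single unambiguous path 0 -> 1 -> 2, merging overlaps
contig = reads[0]
for i, j in [(0, 1), (1, 2)]:
    contig += reads[j][edges[(i, j)]:]
print(contig)  # ATGCGTACGGTTACCA
```

The graph here has only one path; the hard part of real assembly is resolving the loops and bubbles that repeats and sequencing errors introduce.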

Miniasm assembly

The miniasm assembly is performed in two steps. We first use the tool minimap2 to calculate the overlaps among all of the PacBio reads in the fastq data file. Then we input this overlap information along with the raw data to miniasm, which infers a graph assembly -- the best way to uniquely connect all overlapping sequences. This involves splitting ambiguous loops in the graph and calculating the shortest path through it. The result can be converted to a final fasta file representing the consensus assembled contigs.

Overlap: get overlap of PacBio reads

Here we align all of the PacBio reads to each other. For very large datasets this can be time consuming. New algorithms (e.g., minimap2) have dramatically improved the speed of this step. Although it runs fairly fast, it is too memory-intensive (it requires many GB of RAM) to run on the small servers we have available on binder or codio. We will instead continue to examine pre-prepared result files.

In [3]:
%%bash

# minimap2 -x ava-pb -t 8 SRR7825134.fastq SRR7825134.fastq > SRR7825134.paf

This returns a result in pairwise alignment format (a .paf file). To save you from having to compute this file, I uploaded its first 20 lines to a URL so you can easily look at it. Because the format is tabular we can load it as a DataFrame with pandas, as below. The file records the positions where long reads in the data overlap each other.

In [4]:
# URL to the snippet of the PAF file hosted online
url = ("https://raw.githubusercontent.com/genomics-course/"
       "11-assembly/master/notebooks/SRR7825134.20lines.paf")

# load the PAF file as a dataframe and return
pd.read_csv(url, sep="\t", header=None)
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 SRR7825134.1 7563 729 7455 + SRR7825134.66305 9555 1321 8140 1083 6903 0 tp:A:S cm:i:74 s1:i:1008 dv:f:0.1064
1 SRR7825134.1 7563 85 5480 + SRR7825134.135869 13250 7604 12972 995 5486 0 tp:A:S cm:i:69 s1:i:935 dv:f:0.1004
2 SRR7825134.1 7563 729 7404 + SRR7825134.50128 16210 653 7265 970 6743 0 tp:A:S cm:i:54 s1:i:913 dv:f:0.1179
3 SRR7825134.1 7563 97 7455 - SRR7825134.7296 13396 5948 13367 950 7458 0 tp:A:S cm:i:59 s1:i:910 dv:f:0.1181
4 SRR7825134.1 7563 55 6643 - SRR7825134.119454 10268 218 6869 957 6739 0 tp:A:S cm:i:72 s1:i:889 dv:f:0.1062
5 SRR7825134.1 7563 55 6046 + SRR7825134.29632 9799 302 6226 946 6066 0 tp:A:S cm:i:69 s1:i:887 dv:f:0.1042
6 SRR7825134.1 7563 85 7455 + SRR7825134.17180 11608 3420 10764 938 7461 0 tp:A:S cm:i:58 s1:i:878 dv:f:0.1188
7 SRR7825134.1 7563 1218 7521 - SRR7825134.102167 12373 5408 11550 937 6353 0 tp:A:S cm:i:58 s1:i:865 dv:f:0.1130
8 SRR7825134.1 7563 882 7455 + SRR7825134.130891 14700 6664 13189 937 6668 0 tp:A:S cm:i:60 s1:i:865 dv:f:0.1131
9 SRR7825134.1 7563 55 7455 + SRR7825134.76456 12524 1553 8891 940 7497 0 tp:A:S cm:i:62 s1:i:862 dv:f:0.1163
10 SRR7825134.1 7563 1417 7455 + SRR7825134.108690 12058 87 6116 904 6114 0 tp:A:S cm:i:52 s1:i:860 dv:f:0.1154
11 SRR7825134.1 7563 85 7455 - SRR7825134.30530 12614 3930 11182 916 7419 0 tp:A:S cm:i:59 s1:i:857 dv:f:0.1182
12 SRR7825134.1 7563 55 7455 + SRR7825134.30529 14545 3138 10545 903 7501 0 tp:A:S cm:i:68 s1:i:848 dv:f:0.1129
13 SRR7825134.1 7563 622 7455 + SRR7825134.106363 13941 1245 8103 892 6945 0 tp:A:S cm:i:59 s1:i:843 dv:f:0.1154
14 SRR7825134.1 7563 85 7404 + SRR7825134.79502 23813 15497 22975 905 7521 0 tp:A:S cm:i:54 s1:i:833 dv:f:0.1212
15 SRR7825134.1 7563 729 7455 - SRR7825134.159776 13305 6592 13260 889 6800 0 tp:A:S cm:i:51 s1:i:828 dv:f:0.1203
16 SRR7825134.1 7563 85 7455 + SRR7825134.64273 15867 4127 11510 858 7453 0 tp:A:S cm:i:64 s1:i:817 dv:f:0.1151
17 SRR7825134.1 7563 55 7404 + SRR7825134.142751 15244 2637 9951 865 7422 0 tp:A:S cm:i:53 s1:i:815 dv:f:0.1219
18 SRR7825134.1 7563 95 6474 - SRR7825134.131064 22929 55 6394 867 6458 0 tp:A:S cm:i:56 s1:i:812 dv:f:0.1148
19 SRR7825134.1 7563 55 7455 + SRR7825134.20818 11224 2186 9710 877 7583 0 tp:A:S cm:i:52 s1:i:808 dv:f:0.1229
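The unnamed columns above are easier to read with labels attached. The first 12 columns of a PAF file are fixed by the minimap2 PAF specification; everything after them is an optional SAM-style tag (tp, cm, s1, dv in this dataset). A small sketch, using one overlap record copied from the output above in place of the full file:

```python
import io
import pandas as pd

# The 12 mandatory PAF fields (minimap2 PAF specification); trailing
# columns are optional SAM-style tags.
paf_cols = ["query", "qlen", "qstart", "qend", "strand",
            "target", "tlen", "tstart", "tend",
            "matches", "aln_len", "mapq"]

# one overlap record copied from the output above, standing in for the file
record = ("SRR7825134.1\t7563\t729\t7455\t+\tSRR7825134.66305\t9555\t1321\t"
          "8140\t1083\t6903\t0\ttp:A:S\tcm:i:74\ts1:i:1008\tdv:f:0.1064\n")

paf = pd.read_csv(io.StringIO(record), sep="\t", header=None)
paf.columns = paf_cols + ["tp", "cm", "s1", "dv"]

# how much of the query read is covered by this overlap
paf["q_ov_len"] = paf["qend"] - paf["qstart"]
print(paf.loc[0, ["query", "target", "strand", "q_ov_len"]])
```

So the first record says that read SRR7825134.1 overlaps read SRR7825134.66305 on the same strand over roughly 6.7 kb of its 7.6 kb length.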

Layout: assemble contigs from overlapping reads

The program miniasm is an OLC assembler: it takes the raw long-read data (.fastq) and the mapping information (.paf) and builds contigs by constructing a graph that connects sequences based on their overlaps. The output is in graphical fragment assembly format (.gfa). This format can be used by some software (e.g., Bandage) to visualize the assembly. It contains the assembled contig sequences as well as the loops and bubbles (ambiguities) in the assembly graph.
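The GFA format itself is simple and tab-delimited: "S" lines hold the segment (contig) sequences and "L" lines record the links between segments. A minimal parsing sketch, using two made-up records (the segment names and sequences here are invented for illustration, not taken from our assembly):

```python
# Minimal sketch of the GFA format miniasm emits. "S" lines are
# segments (name, sequence); "L" lines are links between segments
# (from, orientation, to, orientation, overlap CIGAR).
gfa_text = """S\tutg000001l\tACGTACGTAC
S\tutg000002l\tTTGGCCAATT
L\tutg000001l\t+\tutg000002l\t+\t4M
"""

segments, links = {}, []
for line in gfa_text.splitlines():
    fields = line.split("\t")
    if fields[0] == "S":        # segment record: store name -> sequence
        segments[fields[1]] = fields[2]
    elif fields[0] == "L":      # link record: store the connection
        links.append(tuple(fields[1:6]))

print(len(segments), "segments,", len(links), "link(s)")
```

Bandage reads exactly this structure to draw the graph: segments become nodes and links become the edges connecting them.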

The code to run the miniasm assembler is commented below and you do not need to run it.

In [5]:
%%bash

# miniasm -f SRR7825134.fastq SRR7825134.paf > SRR7825134.gfa

Please run the two code cells below, which will download the Bandage command line tool and execute it on the inferred GFA assembly graph to create a visualization. Then run the cell after that to display the assembly graph image. You can see that the assembly has produced several distinct, disconnected contigs, and that several of them contain bubbles.

In [6]:
%%bash

# download and unzip the Bandage command line tool
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip -q
unzip -qq -o Bandage_Ubuntu_dynamic_v0_8_1.zip

# create an assembly graph image with Bandage 
./Bandage image \
    SRR7825134.gfa \
    SRR7825134.gfa.png \
    --names \
    --lengths \
    --outline 0 \
    --colour random \
    --fontsize 10 \
    --iter 4
In [7]:
from IPython.display import Image
Image(filename='SRR7825134.gfa.png')
Out[7]:

If we just want the best estimate of the assembly we can extract the contig sequences from this file to get a fasta file. This can be done with the unix tool awk, extracting the lines of the .gfa file that start with "S" and converting them to fasta format, as shown below.

In [8]:
%%bash

# awk '$1 == "S" {print ">"$2"\n"$3}' SRR7825134.gfa > assembly_miniasm.fasta
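The same conversion can be done in Python, which makes the logic explicit: keep only the "S" (segment) records and print each as a fasta entry. This sketch operates on an in-memory example (the segment names and sequences are made up) rather than the real SRR7825134.gfa file:

```python
# Python equivalent of the awk one-liner: convert GFA "S" records to fasta.
# The records below are invented for illustration.
gfa_lines = [
    "S\tutg000001l\tACGTACGT",
    "L\tutg000001l\t+\tutg000002l\t+\t4M",
    "S\tutg000002l\tTTGGCCAA",
]

fasta = []
for line in gfa_lines:
    fields = line.split("\t")
    if fields[0] == "S":              # keep only segment records
        fasta.append(f">{fields[1]}\n{fields[2]}")

print("\n".join(fasta))
```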

Consensus: polish the assembled contigs by mapping reads again

Technically, the last step run by miniasm performed both the layout and consensus steps of OLC assembly: the assembly fasta file was constructed from the overlaps among the long reads. However, with long-read assemblies it is common to apply one or more additional consensus steps that aim to polish the final base calls. Example programs for final polishing include racon and quiver, which use PacBio reads to polish the assembly, or pilon, which can use Illumina data to do so (the latest version of pilon added support for using long reads as well). We will skip this step for now since it is often very time consuming.
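The core idea behind polishing can be sketched in a few lines: where several reads align over the same stretch of the draft contig, take the majority base at each position. This toy example (with invented, pre-aligned sequences) only illustrates the concept; tools like racon work on real alignments and handle insertions, deletions, and quality scores.

```python
from collections import Counter

# Toy illustration of consensus polishing: majority vote per position
# among reads aligned over a draft contig. Sequences are made up.
draft = "ACGTTCGA"          # draft contig with an error at position 4
aligned_reads = [
    "ACGTACGA",
    "ACGTACGA",
    "ACCTACGA",             # this read has its own error at position 2
]

polished = "".join(
    Counter(read[i] for read in aligned_reads).most_common(1)[0][0]
    for i in range(len(draft))
)
print(polished)  # ACGTACGA: the vote corrects the draft's position 4
```

Note how the single read error at position 2 is outvoted, which is why polishing works best with high read coverage.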

Before we analyze this genome file, let's try assembling the genome again from PacBio data using a different program.

Canu assembly

The program canu is considered one of the best long-read assemblers, but it is much slower than the approach we just tried. While the miniasm assembly above finished in just a few minutes, the canu assembly below took several hours. Once we get to the assembly comparisons you can decide whether you think the longer run time was worth it.

One of the reasons canu is much slower is that it performs the full suite of steps we've discussed: pre-cleanup of reads, read overlap calculations, layout graph construction, consensus contig construction, and post-analysis contig polishing. A nice feature of canu is that you can tell it to do all of this with a single command rather than requiring multiple different programs or commands.

In [9]:
# canu parameters explained
# -p: prefix name for output files
# -d: output directory name
# genomeSize=: estimate of the genome size
# -pacbio-raw: the raw fastq data
In [10]:
%%bash

# canu -p SRR7825124 -d assembly_canu genomeSize=30m -pacbio-raw SRR7825134.fastq

While canu runs it prints a long log to the screen explaining each step of the analysis and how many resources it is consuming, very similar to SPAdes in the last notebook. We'll skip the log file for now and move on to comparing the resulting files.

Compare all four genome assemblies

Again we will use quast to assess the quality of our genome assemblies. The command below inputs the assembled contigs (in fasta format) from all four assemblies we've completed using four different approaches. It measures summary statistics of assembly quality as well as BUSCO scores. You can read the report here.

In [11]:
%%bash

# quast.py \
#     assembly_spades/scaffolds.fasta \
#     assembly_spades_hybrid/scaffolds.fasta \
#     assembly_miniasm.fasta \
#     assembly_canu/SRR7825124.fasta \
#     -o comparison \
#     --conserved-genes-finding \
#     --fungus
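One of quast's headline statistics is N50: sort the contigs from longest to shortest, and N50 is the length of the contig at which the running total first covers half the assembly. A small sketch with made-up contig lengths:

```python
# N50: length of the contig at which the cumulative length of contigs,
# taken longest-first, first reaches half the total assembly size.
def n50(lengths):
    half = sum(lengths) / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length

contig_lengths = [500, 800, 1200, 3000, 4500]   # total = 10,000
print(n50(contig_lengths))  # 3000
```

This is why a few long contigs (as in the long-read assemblies) push N50 up so dramatically compared to many short ones.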
[8] Question: Which of the four assembly methods yielded the best assembly in your opinion? Which statistics did you base this on? Why do you think the contig N50 was higher in some assemblies versus others? Why do you think the complete BUSCO score was higher in some versus others? Answer in Markdown below.

It seems overall that the canu long-read assembly is the best. This is based on the number of contigs (100), the N50 (860 Kb), and the BUSCO scores (90%). The longer reads in the PacBio assemblies allow the assembler to connect many smaller fragments together, yielding a larger N50. The canu assembly probably does best because it error-corrects the long reads before finding their overlaps. I think the high error rate in the PacBio-only assemblies may make it hard to identify homology with the single-copy orthologs in BUSCO, and so they have lower scores.

[9] Question: What assembly software did Sumpter et al. use in the paper to assemble the *Bd* genome? How do their assembly statistics compare to ours? Why do you think they might differ?

Sumpter et al. used Canu. Our canu assembly actually has a better N50 and fewer contigs than theirs. This might be because the canu software has improved and we used a newer version.

You just completed several genome assemblies that in total took only a few hours. Are you surprised by how easy or difficult this was? The time and expense involved in obtaining the data is the hardest part, followed by getting sufficient computational resources to assemble the genome. But overall, it's not a terribly difficult task: the canu and spades assemblies could each be done with a single command.

[10] Question: Search Google Scholar for another published genome from within the last three years. In the Markdown cell below, report the citation for the publication, how large the genome is, which technology was used to sequence it, and a summary of the genome assembly (e.g., N50 and/or number of scaffolds/contigs). Try to find an example for an organism that is of interest to you.
In [ ]:
# answer here