By the end of this notebook you will:
Read these papers carefully; we will discuss them in class:
You do not need to read this entire paper, but it will be used as a reference for this notebook:
import pandas as pd
# conda install bioconda::miniasm
# conda install bioconda::minimap2
# conda install bioconda::canu
In this notebook we will continue to examine the published genome assembly data for Bd (https://www.ncbi.nlm.nih.gov/pubmed/30533847) that was recently assembled with PacBio reads. In the last notebook we assembled a related genome using short-read and hybrid short-read + PacBio data. Here we will assemble the genome using only PacBio long reads. In terms of the algorithms involved, this is a quite different approach from the de Bruijn graph method used in the first notebook. We will try two different tools: a very fast assembler called miniasm
, and a much slower but more accurate method called canu
. The latter is the method that the authors used in the study. Both are OLC (overlap-layout-consensus) assemblers, which build an overlap graph by mapping all reads against each other and then seek a Hamiltonian path through that graph.
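The overlap step can be illustrated with a toy example. The sketch below is a highly simplified stand-in for a real overlapper: it finds only exact suffix-prefix matches between short made-up "reads" (real tools like minimap2 tolerate sequencing errors and use indexed sketches to stay fast). The `overlap` helper and the example reads are hypothetical, chosen just to show what the edges of an overlap graph represent.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    start = 0
    while True:
        # find the next place where the first min_len bases of b occur in a
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # check whether the whole remaining suffix of a matches a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

# three tiny made-up "reads" that chain together
reads = ["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]

# all-vs-all overlaps: these (i, j) -> length pairs are the edges
# of an OLC overlap graph
edges = {(i, j): overlap(a, b)
         for i, a in enumerate(reads)
         for j, b in enumerate(reads)
         if i != j and overlap(a, b) >= 3}
print(edges)  # {(0, 1): 4, (1, 2): 7}
```

Following those edges in order (read 0 → 1 → 2) visits every read exactly once, which is the Hamiltonian-path intuition behind the layout step.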
The miniasm assembly is performed in two steps. We first use the tool minimap2
to calculate the overlaps among all of the pacbio reads in the fastq data file. Then we input this overlap information along with the raw data to miniasm
which will infer an assembly graph -- the best way to uniquely connect all overlapping sequences. This involves splitting ambiguous loops in the graph and finding a path through it. The result can be converted to a final fasta file representing the assembled contigs.
Here we align all of the PacBio reads to each other. For very large datasets this can be time consuming. New algorithms (e.g., minimap2
) have dramatically improved the speed of this step. Although it runs fairly fast, the step is too memory-intensive (it requires many GB of RAM) to run on the small servers we have available on Binder or Codio. We will instead continue to examine pre-prepared result files.
%%bash
# minimap2 -x ava-pb -t 8 SRR7825134.fastq SRR7825134.fastq > SRR7825134.paf
This returns a result in the form of a pairwise alignment format (.paf) file. To save you from having to compute this file, I uploaded the first 20 lines of it to a URL so you can easily inspect it. Because the format is tabular we can load it as a DataFrame with pandas, below. This file contains information about the positions where long reads in the data overlap each other.
# URL to the snippet of the PAF file hosted online
url = ("https://raw.githubusercontent.com/genomics-course/"
"11-assembly/master/notebooks/SRR7825134.20lines.paf")
# load the PAF file as a dataframe and return
pd.read_csv(url, sep="\t", header=None)
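The PAF columns are positional, so the DataFrame above has numeric headers. The sketch below labels the 12 mandatory columns of the standard minimap2 PAF layout using a single made-up record in place of the real file (real minimap2 output also appends optional SAM-style tags after these 12 fields, which this toy record omits):

```python
import io
import pandas as pd

# the 12 mandatory PAF columns, in order
PAF_COLS = [
    "query", "qlen", "qstart", "qend", "strand",
    "target", "tlen", "tstart", "tend",
    "matches", "aln_len", "mapq",
]

# a single hypothetical PAF record standing in for the real file
example = "readA\t5000\t100\t4900\t+\treadB\t6000\t0\t4800\t4500\t4800\t0\n"

df = pd.read_csv(io.StringIO(example), sep="\t", header=None, names=PAF_COLS)

# fraction of matching bases in the aligned region
print(df.loc[0, "matches"] / df.loc[0, "aln_len"])
```

Reading the real file the same way (swapping `io.StringIO(example)` for the URL) makes columns like `qstart`/`qend` much easier to interpret.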
The program miniasm
is an OLC assembler: it takes the raw long-read data (.fastq) and the mapping information (.paf) and constructs contigs by building a graph that connects sequences based on their overlaps. The output is in graphical fragment assembly format (.gfa). This format can be used in some software (e.g., Bandage) to visualize the assembly. It contains the assembled contig sequence information as well as the loops and bubbles (ambiguities) in the assembly graph.
The code to run the miniasm
assembler is commented below and you do not need to run it.
%%bash
# miniasm -f SRR7825134.fastq SRR7825134.paf > SRR7825134.gfa
Please run the two code cells below, which will download the Bandage
graphical tool and execute it on the inferred GFA assembly graph to create a visualization of the assembly. Then run the following cell to display the image. You can see that the assembly has resulted in several distinct, disconnected contigs, several of which contain bubbles.
%%bash
# download and unzip the Bandage command line tool
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip -q
unzip -qq -o Bandage_Ubuntu_dynamic_v0_8_1.zip
# create an assembly graph image with Bandage
./Bandage image \
SRR7825134.gfa \
SRR7825134.gfa.png \
--names \
--lengths \
--outline 0 \
--colour random \
--fontsize 10 \
--iter 4
from IPython.display import Image
Image(filename='SRR7825134.gfa.png')
If we just want the best estimate of the assembly we can extract the contig sequences from this file to get a fasta file. This can be done using the Unix tool awk
to extract the lines from the .gfa
file that start with "S" and convert them to fasta format, like below:
%%bash
# awk '$1 ~/S/ {print ">"$2"\n"$3}' SRR7825134.gfa > assembly_miniasm.fasta
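The same extraction can be sketched in Python, as a minimal stand-in for the awk one-liner. It assumes only the documented GFA layout for segment records: tab-separated lines where column 1 is `S`, column 2 is the segment name, and column 3 is the sequence. The tiny GFA snippet below is made up for illustration.

```python
def gfa_to_fasta(gfa_lines):
    """Yield fasta-formatted records from GFA segment ('S') lines."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":  # S <name> <sequence> [optional tags...]
            yield f">{fields[1]}\n{fields[2]}"

# tiny hypothetical GFA: one segment line and one link line
gfa = ["S\tutg000001l\tACGTACGT\tLN:i:8\n",
       "L\tutg000001l\t+\tutg000002l\t-\t55M\n"]

print("\n".join(gfa_to_fasta(gfa)))
```

Link (`L`) lines record the graph edges and are simply skipped, which is exactly what the awk command does by matching only `S` records.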
Technically the last step by miniasm
performs the layout step of the OLC assembly, and its fasta file is stitched together directly from the overlapping long reads without a true consensus step. It is therefore common with long-read assemblies to apply one or more additional consensus steps that polish the final base calls. Example programs for final polishing include racon
and quiver
which use PacBio reads to polish the data, or pilon
which can use Illumina data to polish it (the latest version of pilon
added support for using long reads as well). We will skip this step for now since it is often very time consuming.
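The idea behind polishing can be illustrated with a toy majority-vote consensus. The aligned reads below are hypothetical and already laid out against the same window of a draft contig; real polishers like racon and pilon work from full read alignments and also handle insertions and deletions, which this sketch ignores.

```python
from collections import Counter

# hypothetical error-containing reads aligned to the same contig window
aligned = ["ACGTTAGA",
           "ACGTAAGA",
           "ACCTTAGA"]

# per-column majority vote: the most common base at each position wins
consensus = "".join(Counter(col).most_common(1)[0][0]
                    for col in zip(*aligned))
print(consensus)  # ACGTTAGA
```

Each read disagrees with the others at one position, but the column-wise vote recovers a sequence more accurate than any single read, which is why polishing raises the base-level quality of long-read assemblies.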
Before we analyze this genome file, let's try assembling the genome again from PacBio data using a different program.
The program Canu is considered one of the best long-read assemblers, but it is much slower than the last approach we tried. While the assembly above using miniasm finished in just a few minutes, the canu assembly below took several hours. Once we get to the assembly comparisons you can decide if you think the longer run time was worth it.
One of the reasons canu is much slower is that it performs the full suite of steps we've discussed: pre-cleanup of reads, read overlap calculations, layout graph construction, consensus contig construction, and post-analysis contig polishing. A nice feature of canu is that you can tell it to do all of this with a single command rather than requiring multiple different programs or commands.
# canu parameters explained
# -p: prefix name for output files
# -d: output directory name
# genomeSize: estimated genome size
# -pacbio-raw: the raw PacBio fastq data
%%bash
# canu -p SRR7825124 -d assembly_canu genomeSize=30m -pacbio-raw SRR7825134.fastq
While canu is running it prints a long log to the screen explaining each step of the analysis and how many resources it is consuming, much like spades in the last notebook. We'll skip the log file for now, though, and move on to comparing the resulting assemblies.
Again we will use quast to assess the quality of our genome assemblies. The command below takes as input the assembled contigs (in fasta format) from all four assemblies we've completed using four different approaches. It computes summary statistics of assembly quality as well as BUSCO scores. You can read the report here.
%%bash
# quast.py \
# assembly_spades/scaffolds.fasta \
# assembly_spades_hybrid/scaffolds.fasta \
# assembly_miniasm.fasta \
# assembly_canu/SRR7825124.fasta \
# -o comparison \
# --conserved-genes-finding \
# --fungus
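One of the summary statistics quast reports is N50: the contig length such that contigs of that size or longer contain at least half of the total assembly length. A minimal sketch of how it can be computed, using made-up contig lengths:

```python
def n50(lengths):
    """Smallest length L such that contigs >= L cover >= half the assembly."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for L in lengths:
        running += L
        if running >= half:
            return L

# hypothetical contig lengths: total 1000, so half is 500
print(n50([100, 200, 300, 400]))  # 300
```

Walking down from the longest contig, 400 covers less than half of the 1000 bp total, and adding the 300 bp contig crosses the halfway mark, so N50 is 300. This is why a few long contigs raise N50 far more than many short ones.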
It seems overall that the canu
long-read assembly is the best. This is based on the number of contigs (100), N50 (860 Kb), and BUSCO scores (90%). The longer reads in PacBio assemblies allow them to connect many smaller fragments together, yielding a larger N50. The canu assembly probably does best because it error-corrects the long reads before finding their overlaps. I think the high error rate in the PacBio-only assemblies might make it hard to identify homology with the single-copy orthologs in BUSCO, which would explain their lower scores.
Sumpter et al. used Canu. Our canu assembly actually has a better N50 and fewer contigs than theirs. This might be because the canu software has improved and we are running a newer version.
You just completed several genome assemblies that in total took only a few hours. Are you surprised by how easy or difficult this was? The time and expense involved in obtaining the data is the hardest part, followed by securing sufficient computational resources to assemble the genome. But overall, it's not a super difficult task. The canu
and spades
assemblies could be done with a single command to the programs.
# answer here