EEEB GU4055
1. Review notebook assignments: read lengths.
2. Discuss the assigned readings: long read technologies.
3. Introduce new topic: de Bruijn graphs.
1. DNA sequencing review; and intro to Jupyter/Python.
2. Python bootcamp I: Basic objects.
3. Python bootcamp II: Scientific libraries.
4. Homology/Blast/GFF: Genome structure
5. Phylogenetics I: Sanger sequences to trees.
6. Recombination and Meiosis.
7. Inheritance and pedigrees.
8. Intro to Illumina and read mapping.
9. Intro to long-read technologies and read mapping.
10. Genome Assembly.
Loading fasta files as SeqIO record objects.
# The SeqIO module is useful for working with Fasta files
from Bio import SeqIO
# load a Fasta file from a path with Bio
record = SeqIO.parse(path, format="fasta")
# the record object makes the seq data accessible (e.g. length)
len(record)
600
A function to load fasta files and return lengths.
def getLengthDistribution(path):
"Return a list with lengths of each fasta record"
lenList = []
for record in SeqIO.parse(path, "fasta"):
lenList.append(len(record))
return lenList
# example:
getLengthDistribution("./files/sanger.aftertrim.removeCT.min500bp.fasta")
[600, 550, 811, 430, 762, ... ]
A dictionary to access long file names easier.
# Global dictionary of technology name to file paths.
file_path = {
'Sanger': 'files/sanger.total.aftertrim.removeCT.min500bp.fasta',
'PacBio': 'files/PB.Cell1and2.raw.fasta',
'Nanopore': 'files/LejlaControl.2D.min500bp.fasta',
}
# Global dictionary of technology name to colors for plotting
colors = {
'Sanger': '#4daf4a',
'PacBio': '#377eb8',
'Nanopore': '#984ea3',
}
A function to plot a histogram with matplotlib.
def makeLengthPlot(thisRunName, ax):
"Plot histogram of read lengths. Gets filepath from global dict"
# get readlengths
filepath = file_path[thisRunName])
lengths = getLengthDistribution(filepath)
# set histogram bin size
bins = range(500, 5000, 100)
# make histogram
sns.distplot(
length, bins=bins, kde=False,
color=colors[thisRunName],
ax=ax,
label=thisRunName,
hist_kws={"alpha": 1})
# plot styling: legend, xlimit, ylabel, ylimit
ax.legend()
ax.set_xlim([1, 5000])
ax.set_ylabel('number of {} reads'.format(thisRunName))
ax.set_ylim([0, 14000])
# make unique y-axis label for each run
if thisRunName == 'Sanger':
ax.set_ylim([0, 800])
A function to plot a histogram with matplotlib.
# initalize plot with 3 rows and 1 column
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8,20))
# Plot Sanger, Pacbio, and Nanopore read length distributions
makeLengthPlot('Sanger', ax[0])
makeLengthPlot('PacBio', ax[1])
makeLengthPlot('Nanopore', ax[2])
# add label to x axis
ax[2].set_xlabel('size [bp]')
# Show the plot
plt.show()
Question 4 (2 points): Print the average read length and the top ten longest reads of each run.
# 1. get filepath for Nanopore data set
npath = file_path["Nanopore"]
# 2. call getLengthDistribution on this file
readlens = getLengthDistribution(npath)
# 3. get top ten longest reads
topten = sorted(readlens)[::-1][:10]
print(topten)
# 4. get average length (of all reads)
avglen = sum(readlens) / len(readlens)
print(avglen)
1512.50977897917
[298549, 108360, 72322, 67205, 61366, 60592, 45980, 45605, 41057, 39953]
Question 4 (2 points): Print the average read length and the top ten longest reads of each run.
# get read lenghts of PacBio reads
readlens = getLengthDistribution(file_path["PacBio"]
# get top ten longest reads
topten = sorted(readlens)[::-1][:10]
print(topten)
# get average length (of all reads)
avglen = sum(readlens) / len(readlens)
print(avglen)
1370.1363583000705
[4641, 4327, 4200, 4173, 4125, 4018, 4008, 3989, 3943, 3872]
Question 5 (1 point): What is the avg. len of PacBio reads for Susie?
12.5KbQuestion 6 (2 points): In Gordon et al. (2016), the authors used PacBio to assemble an entire genome, while Loomis et al. (2013) used PacBio to sequence just one gene. What is that gene and why did the authors decide to use long-read sequencing to sequence it?
The gene is Fmr1, and it is implicated in Fragile X syndrome. Fragile X is an example of a 'trinucleotide repeat disorder': healthy individuals have between 6 - 53 CGG repeats in the 5' UTR of the Fmr1 gene, while individuals with Fragile X syndrome typically have 230+ CGG repeats. Genetic analysis of these diseases have been limited by an inability to sequence regions with expanded trinucleotide repeats using short-read DNA sequencing; however, PacBio's long reads can span the length of the repetitive region to better characterize the extent of trinucleotide expansion.
Question 7. (3 points) Homopolymers are consecutive repetitions of a letter in a DNA sequence. For example, the sequence CATAAAG has a homopolymer of length 3, composed to the letter A. Use graph above to explain why homopolymer-length errors are common in nanopore sequencing.
There are 1024 different 5-mers but the signals typically range from about 40-70 pA: this means that normal distributions for certain 5-mers overlap, which can make it difficult to infer which 5-mer generated a particular event. If the signals for adjacent 5-mers of the DNA strand are very similar, e.g. when sequencing through long homopolymers, then the machine may not detect a change in current, which introduces sequencing error.
# read mapping always starts by "indexing" the reference fasta
# and here also links the nanpore data.
nanopolish index -d ecoli ecoli_2kb_region/fast5_files/ ecoli_2kb_region/reads.fasta
[readdb] indexing ecoli_2kb_region/fast5_files/
[readdb] num reads: 112, num reads with path to fast5: 112
# map ONT data to the reference and write output to SAM
minimap2 -ax map-ont \
ecoli_2kb_region/ref.fa \
ecoli_2kb_region/reads.fasta > ecoli_2kb_region/aligned.sam
[M::mm_idx_gen::0.289*0.60] collected minimizers
[M::mm_idx_gen::0.318*0.80] sorted minimizers
[M::main::0.319*0.80] loaded/built the index for 4319 target sequence(s)
[M::mm_mapopt_update::0.333*0.81] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4319
[M::mm_idx_stat::0.340*0.82] distinct minimizers: 731433 (98.40% are singletons); average occurrences: 1.024; average spacing: 5.463
[M::worker_pipeline::0.470*1.38] mapped 112 sequences
[M::main] Version: 2.15-r915-dirty
[M::main] CMD: ./minimap2/minimap2 -ax map-ont ecoli_2kb_region/ref.fa ecoli_2kb_region/reads.fasta
[M::main] Real time: 0.478 sec; CPU: 0.656 sec; Peak RSS: 0.051 GB
Align k-mer 'events' to the reference genome. We will run eventalign and pass in the FASTA file, the bam file, and the reference genome. The program will output the file aligned.events.txt.
# get aligned kmer events
nanopolish eventalign \
--reads ecoli_2kb_region/reads.fasta \
--bam ecoli_2kb_region/aligned.sorted.bam \
--genome ecoli_2kb_region/ref.fa \
--scale-events > ecoli_2kb_region/aligned.events.txt
[post-run summary] total reads: 537, unparseable: 0, qc fail: 0,
could not calibrate: 0, no alignment: 0, bad fast5: 0
Question 9. (3 points) Why are there four events at position 3? What happened?
The first three rows represent a scenerio in which the event detector erroneously called three events where only one should have been emitted. Note that the current for these three events are all plausibly drawn from the expected distribution `N(101.43, 1.632)`. In the fourth row, the `nanopolish` model detects that this event is a likely sequencing artefact, and therefore the output `NNNNNN` appears in the model_kmer column so that the site can be discarded before downstream analysis.
Question 10. (1 point) Why is the sequence of the 6-mer of the reference kmer different than the model 6-mer?
It is the reverse complement.
Question 11. (1 point) Each event contains a 6-mer. What is the total length of the DNA sequence that contains the set of all 6-mers represented in the `aligned_events` table?
print(aligned_events.position.min())
print(aligned_events.position.max())
# 0
# 4275
The first mapped position is 0, the last mapped position is 4275, the k-mer length is 6bp; thus: 1 (account for index=0) + 4275 + 6bp = 4282 bp