EEEB GU4055
1. Review notebook assignments: Hi-C.
2. Discuss the assigned readings: Composite assembly.
3. Introduce new topic: reduced-representation sequencing.
1. DNA sequencing review; and intro to Jupyter/Python.
2. Python bootcamp I: Basic objects.
3. Python bootcamp II: Scientific libraries.
4. Homology/BLAST/GFF: Genome structure.
5. Phylogenetics I: Sanger sequences to trees.
6. Recombination and Meiosis.
7. Inheritance and pedigrees.
8. Intro to Illumina and read mapping.
9. Intro to long-read technologies and read mapping.
10. Genome Assembly in theory.
11. Genome Assembly in practice.
12. Scaffolding.
13. Phylogenetics II: RAD-seq
14. Phylogenetics II: SNPs, gene trees and species trees
Chromosome conformation capture (3C) describes the structure of the
genome within a cell: its organization and physical arrangement. It can
do better than microscopy at telling us how close together (and
potentially interacting) specific regions of the genome are, such as
promoters and enhancers.
Hi-C: A high-throughput version of 3C, based on a library preparation
that builds chimeric reads, followed by paired-end short-read
sequencing. It creates a contact map of interactions whose frequencies
are correlated with spatial distance.
Restriction digestion; streptavidin bead extraction; paired-seq.
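To make the idea of a contact map concrete, here is a minimal sketch that bins the mapped positions of read pairs into a symmetric count matrix. The function name `contact_map`, the bin size, and the four read-pair positions are hypothetical values chosen for illustration; real Hi-C pipelines also filter and normalize the counts.

```python
import numpy as np

def contact_map(pairs, genome_length, bin_size):
    "count read pairs linking each pair of genomic bins (symmetric matrix)"
    nbins = int(np.ceil(genome_length / bin_size))
    matrix = np.zeros((nbins, nbins), dtype=int)
    for pos1, pos2 in pairs:
        # convert each mapped position to a bin index
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        # mirror off-diagonal contacts so the map is symmetric
        if i != j:
            matrix[j, i] += 1
    return matrix

# hypothetical mapped positions of the two ends of four chimeric read pairs
pairs = [(120, 180), (150, 9200), (4100, 4900), (300, 9800)]
cmap = contact_map(pairs, genome_length=10000, bin_size=1000)
print(cmap)
```

Cells far from the diagonal with high counts, such as bins 0 and 9 here, mark regions that are distant along the sequence but were frequently ligated together.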
When a genome is digested with a restriction enzyme, it is broken into smaller fragments. Each fragment begins and ends with a characteristic overhang left by the enzyme. For the restriction enzyme HindIII, the recognition site is AAGCTT, and the cut occurs between the two A's at the 5' end of the site (A^AGCTT), so that it leaves one A at the end of a fragment and AGCTT at the beginning of the next fragment. Let's see what this looks like:
import numpy as np

def random_sequence(length):
    "return a random sequence of DNA"
    return "".join(np.random.choice(list("ACGT"), size=length))
def restriction_digest(sequence, recognition, cut):
    """
    Restriction digest a genome sequence at the given (recognition) site and
    split the site at the given position (cut) to leave overhangs.
    """
    # cut sequence at every occurrence of recognition site
    fragments = sequence.split(recognition)
    # add overhang that results from sequence splitting within the recognition site
    # (for simplicity the two outermost fragments also get an overhang at the sequence ends)
    fragments = [recognition[cut:] + i + recognition[:cut] for i in fragments]
    return fragments
# generate a 5Mb genome
seq = random_sequence(5000000)
# digest the genome at every HindIII site
fragments = restriction_digest(seq, "AAGCTT", 1)
# print headers
print("Restriction recognition site: A^AGCTT")
print("Expected: [overhang-AGCTT][sequence][overhang-A]")
# check the beginning and end of the first 10 fragments
for i in range(10):
    print(fragments[i][:5], fragments[i][5:10], '...', fragments[i][-1:])
Restriction recognition site: A^AGCTT
Expected: [overhang-AGCTT][sequence][overhang-A]
AGCTT TACAA ... A
AGCTT AATGG ... A
AGCTT CCGTT ... A
AGCTT TCCCC ... A
AGCTT AGCGA ... A
AGCTT GGATT ... A
AGCTT ATATA ... A
...
import toyplot

# get fragment lengths binned
flens = np.histogram([len(i) for i in fragments], bins=50)
# plot distribution of fragment lengths
toyplot.bars(
    flens,
    width=400,
    height=300,
    xlabel="fragment size",
    ylabel="number of fragments",
);
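As a rough check on the fragment sizes, a 6-bp recognition site is expected about once every 4^6 = 4096 bp in a random sequence. The sketch below compares that expectation with an observed count. It assumes equal base frequencies and independent positions, which real genomes do not strictly follow, and it uses a smaller 1 Mb sequence with an arbitrary seed so that it runs quickly.

```python
import numpy as np

# simulate a random 1 Mb sequence (the seed value is arbitrary)
rng = np.random.default_rng(123)
genome = "".join(rng.choice(list("ACGT"), size=1_000_000))

site = "AAGCTT"
# with equal base frequencies a given 6-mer occurs about once per 4**6 bases
expected_spacing = 4 ** len(site)
expected_sites = len(genome) / expected_spacing

# AAGCTT cannot overlap itself, so str.count finds every occurrence
observed_sites = genome.count(site)
mean_fragment_length = len(genome) / (observed_sites + 1)

print("expected spacing between sites:", expected_spacing)
print("expected number of sites:", round(expected_sites))
print("observed number of sites:", observed_sites)
print("mean fragment length:", round(mean_fragment_length))
```

The observed count varies around the expectation from one random sequence to the next, which is why the fragment lengths form a broad distribution rather than clustering at a single size.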
Action 1: Repeat for PstI enzyme: CTGCA^G
# digest the genome at every PstI site
fragments = restriction_digest(seq, "CTGCAG", 5)
# print headers
print("Expected: [overhang-G][sequence][overhang-CTGCA]")
# check the beginning and end of the first 10 fragments
for i in range(10):
    # show the 1-base leading overhang, a sample of the sequence (positions 5-9),
    # and the 5-base trailing overhang
    print(fragments[i][:1], '...', fragments[i][5:10], fragments[i][-5:])
Expected: [overhang-G][sequence][overhang-CTGCA]
G ... TTAGC CTGCA
G ... TTCAG CTGCA
G ... CCTTA CTGCA
G ... CTTTA CTGCA
G ... GAGCT CTGCA
G ... CCCGG CTGCA
G ... ATCAC CTGCA
G ... TATGT CTGCA
G ... AATAC CTGCA
G ... TCACG CTGCA
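The two digests differ only in the recognition sequence and the cut index, and both can be read directly from the caret notation. The helper below, `parse_cut_site`, is a hypothetical convenience function (not part of the original notebook) that converts a string such as "A^AGCTT" into the two arguments that `restriction_digest` expects.

```python
def parse_cut_site(notation):
    "split e.g. 'A^AGCTT' into the recognition sequence and the cut index"
    # the caret position equals the number of bases left of the cut
    cut = notation.index("^")
    recognition = notation.replace("^", "")
    return recognition, cut

for name, notation in [("HindIII", "A^AGCTT"), ("PstI", "CTGCA^G")]:
    recognition, cut = parse_cut_site(notation)
    # bases left of the cut end one fragment; bases right of it start the next
    print(name, recognition, cut, recognition[:cut], recognition[cut:])
```

For HindIII this gives a cut index of 1 (A | AGCTT) and for PstI a cut index of 5 (CTGCA | G), matching the values passed to the digests above.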
Visit https://eaton-lab.org/data/ to view two genome reports from Dovetail Inc. assemblies.
It's an exciting time for phylogenetics...
Background on the method and application of RAD-seq:
Applied phylogenomics example: