EEEB GU4055
        
        1. Review notebook assignments: Hi-C.
        
        2. Discuss the assigned readings: Composite assembly.
        
        3. Introduce new topic: reduced-representation sequencing.
        
    
        
        1. DNA sequencing review; and intro to Jupyter/Python.
        
        2. Python bootcamp I: Basic objects.
        
        3. Python bootcamp II: Scientific libraries.
        
        4. Homology/BLAST/GFF: Genome structure.
        
        5. Phylogenetics I: Sanger sequences to trees.
        
        6. Recombination and Meiosis.
        
        7. Inheritance and pedigrees.
        
        8. Intro to Illumina and read mapping.
        
        9. Intro to long-read technologies and read mapping.
        
        10. Genome Assembly in theory.
        
        11. Genome Assembly in practice.
        
        12. Scaffolding.
        
        13. Phylogenetics II: RAD-seq.
        
        14. Phylogenetics II: SNPs, gene trees and species trees.
    
        Chromosome conformation capture (3C) describes the three-dimensional
        organization of the genome within a cell. It can be more informative
        than microscopy because it tells us how close together (and so
        potentially interacting) regions of the genome are, such as
        promoters and enhancers.
        
        Hi-C: A high-throughput version of 3C. It is based on a library
        preparation that builds chimeric reads, followed by short-read
        paired-end sequencing. The result is a contact map in which the
        frequency of interactions is correlated with spatial proximity.
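
        To make the idea of a contact map concrete, here is a minimal sketch that bins read-pair positions into a matrix. The read pairs are simulated, not real Hi-C data: mate distances are drawn from an exponential distribution so that nearby loci pair up more often, which is an assumption made only for illustration. The helper names `simulate_pairs` and `contact_map` are hypothetical and do not come from any Hi-C toolkit.

```python
import numpy as np

def simulate_pairs(genome_len, npairs, scale, seed=123):
    """Toy read pairs: distance between mates is exponential (an assumption)."""
    rng = np.random.default_rng(seed)
    pos1 = rng.integers(0, genome_len, size=npairs)
    offset = rng.exponential(scale, size=npairs).astype(int)
    sign = rng.choice([-1, 1], size=npairs)
    # keep the second mate inside the genome
    pos2 = np.clip(pos1 + sign * offset, 0, genome_len - 1)
    return pos1, pos2

def contact_map(pos1, pos2, genome_len, binsize):
    """Count read pairs linking each pair of genomic bins."""
    nbins = int(np.ceil(genome_len / binsize))
    mat = np.zeros((nbins, nbins), dtype=int)
    b1 = pos1 // binsize
    b2 = pos2 // binsize
    # count each pair at (i, j) and (j, i) so the matrix is symmetric;
    # pairs that fall within one bin therefore add 2 to the diagonal
    np.add.at(mat, (b1, b2), 1)
    np.add.at(mat, (b2, b1), 1)
    return mat

pos1, pos2 = simulate_pairs(genome_len=1_000_000, npairs=50_000, scale=50_000)
cmap = contact_map(pos1, pos2, genome_len=1_000_000, binsize=100_000)
print(cmap.shape)
```

        In this toy matrix most counts sit on or near the diagonal, because the simulated mates are usually close together in the linear genome.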
        
Restriction digestion; streptavidin bead extraction; paired-seq.
    When a genome is digested with a restriction enzyme it is broken into smaller fragments. Each fragment begins and ends with the characteristic overhang left by the enzyme. For the restriction enzyme HindIII, the recognition site is AAGCTT and the cut occurs between the two A's at the 5' end of the site (A^AGCTT), so it leaves one A at the end of one fragment and AGCTT at the beginning of the next fragment. Let's see what this looks like:
   
  import numpy as np
  import toyplot

  def random_sequence(length):
      "return a random sequence of DNA"
      return "".join(np.random.choice(list("ACGT"), size=length))

  def restriction_digest(sequence, recognition, cut):
      """
      restriction digest a genome sequence at the given (recognition) site and
      split the site at the given position (cut) to leave overhangs. 
      """
      # cut sequence at every occurrence of the recognition site
      fragments = sequence.split(recognition)
      
      # add the overhangs that result from cutting within the recognition site
      # (for simplicity this also adds them to the outer ends of the first and
      # last fragments, which were not actually cut)
      fragments = [recognition[cut:] + i + recognition[:cut] for i in fragments]
      return fragments
        
    
  # generate a 5Mb genome
  seq = random_sequence(5000000)
  # digest the genome at every HindIII site
  fragments = restriction_digest(seq, "AAGCTT", 1)
  # print headers
  print("Restriction recognition site: A^AGCTT")
  print("Expected: [overhang-AGCTT][sequence][overhang-A]")
  # check the beginning and end of the first 10 fragments
  for i in range(10):
      print(fragments[i][:5], fragments[i][5:10], '...', fragments[i][-1:])
      
        
Restriction recognition site: A^AGCTT
Expected: [overhang-AGCTT][sequence][overhang-A]
AGCTT TACAA ... A
AGCTT AATGG ... A
AGCTT CCGTT ... A
AGCTT TCCCC ... A
AGCTT AGCGA ... A
AGCTT GGATT ... A
AGCTT ATATA ... A
...
    
# bin fragment lengths into 50 bins; returns (counts, bin edges)
flens = np.histogram([len(i) for i in fragments], bins=50)
# plot distribution of fragment lengths
toyplot.bars(
    flens,
    width=400,
    height=300,
    xlabel="fragment size", 
    ylabel="number of fragments",
);
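
As a sanity check on the histogram, the number of fragments can be compared with a simple expectation. In a random genome with equal base frequencies, a given 6-bp site starts at any position with probability 1/4^6, so cut sites should be about 4096 bp apart on average. The sketch below is self-contained and uses a smaller genome and a fixed seed, both assumptions made here for speed and repeatability. The variable names are introduced for illustration only.

```python
import numpy as np

# smaller random genome than in the slides, with a fixed seed (assumptions)
rng = np.random.default_rng(42)
genome = "".join(rng.choice(list("ACGT"), size=2_000_000))

site = "AAGCTT"                     # HindIII recognition site
expected_spacing = 4 ** len(site)   # 4096 bp between sites on average

# split at every site; overhangs are ignored since only counts matter here
pieces = genome.split(site)
observed_mean = np.mean([len(p) for p in pieces])
expected_pieces = len(genome) / expected_spacing

print("expected number of fragments (approx):", round(expected_pieces))
print("observed number of fragments:", len(pieces))
print("observed mean fragment length:", round(observed_mean))
```

The observed values will vary around the expectation from one random genome to the next, and real genomes can deviate much more because their base composition is not uniform.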
      
    
    Action 1: Repeat for PstI enzyme: CTGCA^G
# digest the genome at every PstI site
fragments = restriction_digest(seq, "CTGCAG", 5)
# print headers
print("Expected: [overhang-G][sequence][overhang-CTGCA]")
# check the beginning and end of the first 10 fragments
for i in range(10):
    print(fragments[i][:1], '...', fragments[i][5:10], fragments[i][-5:])
        
        
Expected: [overhang-G][sequence][overhang-CTGCA]
G ... TTAGC CTGCA
G ... TTCAG CTGCA
G ... CCTTA CTGCA
G ... CTTTA CTGCA
G ... GAGCT CTGCA
G ... CCCGG CTGCA
G ... ATCAC CTGCA
G ... TATGT CTGCA
G ... AATAC CTGCA
G ... TCACG CTGCA
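
The digest above connects to reduced-representation sequencing: if only fragments in a certain size range are kept for sequencing, a small and repeatable subset of the genome is sampled. The sketch below illustrates that idea under stated assumptions. The 300 to 600 bp window is a hypothetical choice, the genome is a small random one with a fixed seed, and this is not a description of any specific RAD-seq protocol.

```python
import numpy as np

# small random genome with a fixed seed (assumptions for a quick demo)
rng = np.random.default_rng(7)
genome = "".join(rng.choice(list("ACGT"), size=2_000_000))

# digest at PstI sites (CTGCAG); overhangs are ignored here for simplicity
pieces = genome.split("CTGCAG")
lengths = np.array([len(p) for p in pieces])

# hypothetical size-selection window in bp
min_len, max_len = 300, 600
keep = (lengths >= min_len) & (lengths <= max_len)

# fraction of the genome contained in the retained fragments
fraction = lengths[keep].sum() / len(genome)
print(f"{keep.sum()} of {len(pieces)} fragments kept")
print(f"{fraction:.1%} of the genome retained")
```

Changing the enzyme or the size window changes how many loci are retained, which is the main lever for tuning how much of the genome gets sequenced.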
        
    
    
    
Visit https://eaton-lab.org/data/ to view two genome reports from Dovetail Inc. assemblies.
It's an exciting time for phylogenetics...
Background on the method and application of RAD-seq:
Applied phylogenomics example: