EEEB GU4055
1. Review notebook assignments.
2. Discuss the assigned reading.
3. Introduction to Python.
"I didn't really understand it, can I just move on without learning
the subject from this unit?"
No. Most assignments build on earlier lessons, so if you do not
learn the basics now, you will get lost later on.
Please attend office hours to seek help if you fall behind.
We completed this in class, but did you have any difficulty working on your own, either with Jupyter itself or with Jupyter in Codio?
Lines starting with a hash (#) are comments
# This is the general format of unix command line tools
$ program -option1 -option2 target
An example command line program
# e.g., the 'pwd' program with no option or target prints your current directory
$ pwd
/home/deren/
# The echo command prints text to the screen (stdout)
$ echo "hello\tworld"
hello\tworld
An example with option -e
# The -e option renders special characters as well (e.g., tab)
$ echo -e "hello\tworld"
hello world
Use the `%%bash` header to execute an entire Jupyter cell as bash code:
%%bash
echo -e "hello\tworld"
hello world
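The same distinction exists in Python itself. The sketch below is only an illustration (the variable names `rendered` and `literal` are made up here): a regular string interprets `\t` as a tab, while a raw string keeps the backslash as written, which roughly parallels `echo -e` versus plain `echo`.

```python
# A regular string literal interprets \t as a single tab character,
# much like `echo -e` does in bash.
rendered = "hello\tworld"

# A raw string (prefix r) keeps the backslash and the t as two
# separate characters, much like plain `echo` in the example above.
literal = r"hello\tworld"

print(rendered)
print(literal)

# The raw string is one character longer than the rendered one.
print(len(rendered), len(literal))
```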
When the Python interpreter detects an error, it prints a message
to the cell output with a hint about the cause.
For example, if we try to execute bash code in a Python cell:
# we forgot the %%bash header, the code below is not valid Python.
echo -e "hello\tworld"
File "ipython-input-458-239334a501c4", line 1
echo -e "hello\tworld"
^
SyntaxError: invalid syntax
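One way to see the same error without crashing a cell is to hand the bash line to Python's built-in `compile()` and catch the exception. This is a minimal sketch: the names `bash_line`, `error_name`, and `error_line` are illustrative, and the exact wording of the message can differ between Python versions.

```python
# The bash command from above, stored as ordinary text.
bash_line = 'echo -e "hello\\tworld"'

error_name = None
error_line = None

try:
    # compile() parses the text as Python code without running it.
    compile(bash_line, "<cell>", "exec")
except SyntaxError as err:
    # The exception object carries the hint shown in the cell output.
    error_name = type(err).__name__
    error_line = err.lineno
    print(error_name, "-", err.msg)
    print("line", error_line)
```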
What is a reference genome? A representative genome sequence for a species, typically stored as a FASTA file.
>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTG
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCA
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCAT
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCT
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTc
attgtataaCTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACT
AATATTACAGAAAAATCCCCACAAAAATCacctaaacataaaaatattctacttttcaacaataataCATAAAC
GCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAATATTGCAATTTGCTTGAACGGATG
CAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTATTCA
AATAATACGGTAGTGGCTCAAACTCATGCGGGTGCTATGATACAATTATATCTTATTTCCATTCCCATATGCTA
ATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGT
GATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATgtcaaataattttacgGTAATATAACTTAT
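The layout above (a header line starting with `>` followed by wrapped sequence lines) can be read with a few lines of Python. The function below is a hedged sketch, not a full FASTA parser: `parse_fasta` is a hypothetical helper written for this example, and it is applied only to the first two sequence lines of the record shown above.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; remember its header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are wrapped, so join them back together.
            records[header] += line
    return records


# Header and first two sequence lines copied from the record above.
example = """>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTG
"""

records = parse_fasta(example)
for header, seq in records.items():
    seqid = header.split()[0]  # the first word of the header is the sequence ID
    print(seqid, len(seq))
```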
Read a compressed FASTA file (e.g., a genome) and pipe the output to `head` to view just a portion of it. The head command is extremely useful for "peeking" at large files, and it is fast because it stops reading after the requested number of lines.
# pass the result of zcat to the program head
$ zcat genomes/virus.fna.gz | head -n 10
>NC_037667.1 Pandoravirus quercus, complete genome
CCGGTACAGTGAGCGGTTCACGGCCTGGCCACGGTCGACGGAGTGCCGTGCGATGCCATCGGCGACGGCCG
CGCGGGCATTCGCACGTGCGACCACAGCCGTCAGTGGTACTGGCGGGACGAGGCCGTCGGGGTGACGGACG
ACCTGCTCGATGCCATCACACGATGCGCCGAGTACGCGCACGATACCATCAGGGCGCCGTTGGCGAGCAAA
GAGATTATGGAGTTCAGCGTCCGTTGCACCCGCCAGGCGGCGGCCGGAGGCGACGACGTCACGGACCCCAT
GGACGCGAGGCCAGGCGCACGTGGCGCGCCTATCGCATGCACGCGCGCGTGTTCAGCGCCATCGCGTTGCT
ACCGCTGAGCATGATGGCGACGGCGGGTCTGCCCTTCTATGACGTGCGCCGGTACGCGCTGGTGGCGGCCC
GCCGCGCCGAACGCGCGTCGAGCCTGCTCCCAACACGCGTGCGACCAGACACCCTTGCGCACGAGGTGATG
GGCGATGGGCGTCTTCCGCGGCGCTCAATCGCGCACAGCCTCTTTGCAAGTTGGTTCGAACGCAATTACGC
CTACGAGGACGCCAGCGGCATCGACGCCGTGTGGTACGACCATCTCGGTCAAGAGGGCACCCACGAGACCG
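For comparison, roughly the same "peek" can be done in Python with the standard `gzip` and `itertools` modules. This is a sketch under assumptions: `head_gz` is a hypothetical helper, and the file it reads is a small stand-in written to a temporary directory (with placeholder contents) so the code runs without the course data.

```python
import gzip
import itertools
import os
import tempfile


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Build a tiny stand-in file (placeholder contents, not a real genome).
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.fna.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">demo_record placeholder header\n")
    for _ in range(50):
        out.write("ACGTACGTAC\n")

first_lines = head_gz(demo_path, n=10)
for line in first_lines:
    print(line)
```

With the course files, the call would presumably look like `head_gz("genomes/virus.fna.gz")`.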
We will revisit this file format (GFF) in the next assignment; it describes how genomic features are related (e.g., gene -> mRNA transcript -> exon -> CDS). For now, we are using it to practice reading and parsing a tab-delimited file.
Return a tab-delimited table with the positions of all of the telomeres in the yeast genome. Each line should have the following information: seqid, type, start, stop.
# read file | drop lines starting with # | keep fields 1,3,4,5 | keep only lines with 'telomere'
zcat genomes/yeast.gff.gz | \
grep -v "^#" | \
cut -f 1,3-5 | \
grep -w 'telomere'
NC_001133.9 telomere 1 801
NC_001133.9 telomere 229411 230218
NC_001134.8 telomere 1 6608
NC_001134.8 telomere 812379 813184
NC_001135.5 telomere 1 1098
NC_001135.5 telomere 315783 316620
NC_001136.10 telomere 1 904
NC_001136.10 telomere 1524625 1531933
NC_001137.3 telomere 1 6473
...
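The same filter can be sketched in Python. Everything below is illustrative: `select_features` is a hypothetical helper, and `sample_gff` is a hand-made three-record stand-in. Only the two telomere coordinates are taken from the output above; the source, strand, and attribute columns are placeholders, and the gene line is invented for contrast. Unlike `grep -w`, this version matches the type column exactly.

```python
def select_features(gff_text, feature_type):
    """Return (seqid, type, start, stop) tuples for one feature type."""
    rows = []
    for line in gff_text.splitlines():
        # Skip blank lines and comment/header lines, like grep -v "^#".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        # Columns 1, 3, 4, 5 in cut's numbering are indexes 0, 2, 3, 4 here.
        if fields[2] == feature_type:
            rows.append((fields[0], fields[2], int(fields[3]), int(fields[4])))
    return rows


# Stand-in data: placeholder values except the telomere coordinates.
sample_gff = "\n".join([
    "##gff-version 3",
    "\t".join(["NC_001133.9", "src", "telomere", "1", "801", ".", "-", ".", "ID=placeholder1"]),
    "\t".join(["NC_001133.9", "src", "gene", "1807", "2169", ".", "-", ".", "ID=placeholder2"]),
    "\t".join(["NC_001133.9", "src", "telomere", "229411", "230218", ".", "+", ".", "ID=placeholder3"]),
])

telomeres = select_features(sample_gff, "telomere")
for row in telomeres:
    print("\t".join(str(value) for value in row))
```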
The refseq/ directory contains annotated genomes, meaning they
include files with information about where genes and other
genomic features (e.g., telomeres) are located. We hope to learn
what all of these features are eventually, but you don't need to
know yet. For now, they are simply labeled regions of the genome,
each marked with a start and stop position.
Python has been around since 1991, so there are many resources for
learning Python online. For extra help I recommend looking for *modern*
tutorials. That means tutorials that use IPython/Jupyter and that teach
Python 3 as opposed to Python 2.
The tutorials we have selected and created aim for modern Python use.
Learn by doing. Run the pre-written code in the assignment notebooks,
modify it, see how it changes, learn from it. *Try* to solve the
assigned problems on your own before you seek help.
Search for answers when you are stuck. If you get an error, type that
error into Google to learn what it means. If you want to learn how to
create a list of lists in Python, search for "python create a list
of lists".
Don't stress out. You can't learn programming all at once; it takes
time and *practice*. You'll pick it up through repetition, reading code,
and *trying* to solve problems with code. That is the purpose of our
exercises.
Take advantage of the *interactive* nature of Python in Jupyter. Use [tab]-completion to view the attributes and functions of an object, and use shift+tab to view the documentation for a function.
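As a small illustration of the tips above, here is the kind of code worth typing into a cell and modifying. It creates a list of lists (the search example) and then inspects it with the built-in `dir()` function and a docstring, which give roughly the information that tab-completion and shift+tab surface in Jupyter. The variable names are made up for this sketch.

```python
# A list of lists: each inner list is one row of a small table.
table = [[1, 2, 3], [4, 5, 6]]
table.append([7, 8, 9])

# Index the outer list first (row), then the inner list (item).
print(table[1][2])

# Build rows with a comprehension so each row is an independent list.
grid = [[0] * 3 for _ in range(2)]
grid[0][0] = 1
print(grid)

# dir() lists an object's attribute names, similar to tab-completion.
methods = [name for name in dir(table) if not name.startswith("_")]
print(methods)

# A function's docstring is the text that shift+tab displays.
print(list.append.__doc__)
```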