# Notebook 1.1: Introduction to bash¶

### Learning objectives¶

This notebook will demonstrate several introductory bash scripting skills that are useful for bioinformatics, but applied in the context of learning about genomic data. By the end of this exercise you will:

1. Recognize that the genome can be described using text data.
2. Be familiar with the fasta and GFF/GTF genomic file formats.
3. Know how to execute bash code in jupyter notebooks.
4. Recognize there are many useful unix tools for manipulating text/genomic data.

### Prior knowledge¶

I am sure that some of you may already have a lot of experience with bash scripting, while others may have none at all. If you are in the latter group, do not worry! You do not need to have any prior bioinformatics experience to complete the assignments in this class.

After completing this notebook, if you want to seek a more thorough introduction to bash there are many resources online. Two that I recommend include:

1. The first 4 modules in the Linux Command Line Interface Tutorial on Codio, which you can access with your University UNI. This can be found in codio by clicking on the Courses tab on the left side, and then the Recommended tab on the top bar.
1. This short and [simple introductory bash tutorial].(https://opensource.com/article/19/10/programming-bash-syntax-tools)

### What to do when you get stuck?¶

Throughout this course you will surely at some point try to write code to answer a problem and it will return an error message instead of the correct answer. This is expected. It happens all the time. And it is a great learning opportunity. When you see an error message be sure to read it carefully. Some parts of the message may look very confusing, these likely represent a stack trace, which attempts to tell you which part of the code is causing a problem. In addition to this, you will usually get an error message. The message tells you the type of error.

To learn how to solve a problem that is causing an error you can first try modifying your code to see whether there is an obvious fix, such as a typo or syntax error. If you still can't solve the problem, try googling the error message to find out what it means, and to see how other people have encountered and solved similar problems.

Every error has happened before. If you encountered it, then millions of other people have surely encountered the same problem as well. You will be surprised how often you can find the answer you are looking for online. We will return to this topic frequently, about how to find answers to bioinformatics questions. The best trick in bioinformatics is learning how to ask google the right questions to find your answers.

Reminder: Questions that require you to respond using Markdown (text) or Actions that require you to edit and execute code to produce new results will be highlighted on a green background.

### Topic: reference genomes¶

The human genome sequencing project completed in 2003 was a major breakthrough in biology that heralded a new era of genome-centric research. Most people are now familiar with the term genome, and the fact that a genome represents the full genetic sequence of an individual. But how familiar are you with the concept of the genome as a data object? How is a genome represented? Is it in one file, or many? Where is it published and how can you find it? Does it contain every DNA base-pair, or every gene in a genome, or can some be missing?

The short answer is No; even the most complete genome assemblies usually have some amount of missing information. For small genomes, like viruses, we can easily sequence every base, however, there remains variation among individuals and populations that is not represented in a single reference genome. For larger genomes, like humans, some parts of the genome are difficult to assemble because they are highly repetitive (e.g., much of the Y chromosome), and these are often missing or error-prone.

Many genome assemblies are published only in draft form, meaning that they are still a work in progress. The best genomes are assembled into complete chromosomes, whereas many others are still represented as thousands of assembled fragments which we don't yet know how to piece together correctly. New technologies for improving the contiguity of genome assemblies will be a major topic in this class, and it is something that is evolving rapidly. But for now, let's start by looking at a few small complete genomes.

### DNA as text¶

To look at genomes means to look at text files. That's right, genomic data can be represented as simple text. The genome sequence can be represented using the letters A, C, G, and T to refer to the nucleotides (molecules) that make up DNA. Pretty convenient, huh? In addition to the sequence of a genome a published reference assembly usually contains several additional files containing meta information about the genome, such as an annotation of where genes are located. We'll discuss many of those file later as well.

Because genomic data files are typically large, analyzing or even just looking at genomic data requires learning to use efficient software tools for working with large text files. It may surprise you to learn that many of the best tools for this are actually the basic command line tools developed for unix based computers. Computer file systems are also written in text, and so many of the most common and basic programming tools are designed specifically to be able to very efficiently read, edit, or analyze text.

### Jupyter notebooks¶

Jupyter notebooks are a powerful tool for combining code and written text together to create documents for sharing your work. In science we refer to this as reproducibility -- creating documents that others can use to reproduce your work entirely. In this course, we will use notebooks to share code with you for completing assignments; you can run the code, learn from it, and even edit the code or text. Upon completing each notebook, you will have entered in new values and produced new results, which you can then save, download and submit as your completed assignment for us to grade. If you need a refresher on how to use jupyter notebooks look back at the first notebook for this unit.

### Bash and Python¶

The primary coding language used in Jupyter notebooks is Python. However, other languages can be used as well. In this class we will primarily use just two languages: Python and Bash. (We will cover bash briefly first and then learn Python in more depth.)

Bash) is a language for interacting with a terminal (it is a command line language). We will primarily use bash to run executable computer programs in a terminal. Those program themselves may be written in a variety of computer languages, often C, fortran, or even Python. There are several languages similar to bash for running commands in a terminal, of which bash is the most commonly used.

### Running bash in jupyter¶

In jupyter you can open a terminal at any time to execute bash code directly in it if you wish. We demonstrated this in class. However, for the purpose of your assignments we will have you instead execute bash code inside of jupyter notebooks. Again, this is for the sake of reproducibility, notebooks keep a record of the code that you execute, so it is easy to share, and for our purposes, to hand in as an assignment.

To say that we are going to "run a command line program using bash inside of a jupyter notebook" is describing a multi-layered process: jupyter actually uses Python to connect to a temporary bash terminal, which then runs the command line program, captures the output, and returns it to jupyter, which displays the output in your notebook. This sounds complicated, but it's all happening behind the scenes. To run bash code in the notebook all you have to do is append %%bash to the top of a code cell.

## Unix command line tools¶

Most of the tools we are going to use in this tutorial will be available on any *nix based computer system (e,g., Linux, MacOSX), although the specific version that comes pre-installed on your system may vary slightly (e.g., you would replace zcat on linux with gunzip -c on Mac).

In this notebook we will explore the following unix programs. There are many more that we do not mention here, and there are many tutorials available on-line to learn more about them.

1. pwd: print working directory
2. mkdir: make a new directory
3. ls: list contents in a directory
4. wget: download from a URL (download or sub for curl on Mac)
5. head: print first N lines of a file
6. cat/zcat: stream/read through a file line by line
7. grep: select lines from a file based on matching characters
8. cut: select elements from text based on a delimiter/separator
9. wc: count words, lines, or bytes in a file
10. awk: match a pattern and perform action on conditional clause
In [1]:
%%bash
# pwd: 'print working directory'
pwd

/home/deren/Documents/genomics/1-introduction/notebooks


### Atomized functions in *nix command-line programs¶

Please watch the following youtube video for an introduction to the philosophy behind the design of *nix-based command line programs that form the backbone of the bash shell: https://www.youtube.com/watch?v=tc4ROCJYbm0. The core idea is that many small programs with simple functions can be combined together to accomplish very complex tasks.

### The format of command line tools¶

As we discussed in class, almost all command line programs follow a similar syntax (set of rules) for how to enter arguments to them to make them run. You always start with a program name, and then optionally with a set or arguments (flags) that start with a dash. Finally, some programs require input (target) that they process to return a result. Below are some examples of this syntax. Next we will try it out with some interactive examples.

# program followed by optional flags and optional target
<program name> [-option1] [-option2] [target]

# example, to call the program ls (list)
ls

# to call the program with a target (location to list)
ls ../notebooks

# to call the program with an option and target
ls -l ../notebooks


### Create a new directory to store downloaded genome files¶

Here we will use the mkdir command to create a new directory (we'll add the -p option so that it doesn't matter if the directory already exists.) In this code block we also use the ls command with the option -l to show the contents of our current directory as a list. We can run as many bash commands as we want on separate lines of the same code block. However, when creating notebooks of your own it is generally good practice to break up code into atomized steps. From the output of ls below you can see that the new directory we created (genomes/) is now located inside of our current directory (called notebooks/), and is alongside several notebook files.

In [2]:
%%bash
# make new directory called genomes in our current directory
mkdir -p ./genomes/

# show everything in my current directory as a list
ls -l

total 100
drwxr-xr-x 2 deren deren  4096 Jan 22 10:21 genomes
-rw-r--r-- 1 deren deren 11094 Jan 21 18:12 nb-1.0-jupyter.ipynb
-rw-r--r-- 1 deren deren 75287 Jan 22 10:44 nb-1.1-bash-genomes.ipynb
-rw-r--r-- 1 deren deren  7623 Jan 21 17:47 nb-1.2-genome-databases.ipynb


### Where do we get genomic data?¶

There are several standardized databases for storing genomes and genomic data. We will use NCBI the most, since it is hosted by the US National Institutes of Health (NIH). We will cover genomic databases in more depth next week.

Open this link in a new tab https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/ to view an NCBI FTP site where the reference Human genome is hosted. As you can see in the URL we are visiting a site that is structured like a set of folders on a computer system, and in which we are jumping to vertebrate_mammalian/Homo_sapiens, which suggests that there are many other vertebrate mammalian genomes on this site as well. At the new page click on the link for GCF_000001405.39_GRCh38.p13/. This is the latest published human reference genome. That's right, assemblies are continually being updated as new data and new software tools become available, or as new individuals are sequenced.

As you can see there are many files containing different aspects of genome information. Some describe how the genome was assembled, others how many genes there are and where they are located. The assembled genome sequence itself is in a fasta file. This is the file we are interested in right now. Do not click to download yet, we will instead use a command line program to download files. This is generally good form, since you'll leave a record in your notebook of exactly where you got your data from (i.e., it is good for reproducibility).

### Download a few (small) reference genome fasta files from public URLs¶

We will use the unix command wget to retrieve data from a URL, with the option -O (capital o, not a zero) to designate the location where we want the file to be saved (for simplicity we also rename the files to a simpler name: virus, yeast, and chicken). Let's also add the option -q (quiet) to repress progress messages from being printed during the downloads (unless your download fails, then remove the -q to see the error message). This command will take somewhere between 30 seconds and several minutes depending on your internet speed.

In [3]:
%%bash
# to make this code more readable we will save the URLs as variables
url1="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/Pandoravirus_quercus/latest_assembly_versions/GCF_003233895.1_ASM323389v1/GCF_003233895.1_ASM323389v1_genomic.fna.gz"
url2="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz"
url3="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gallus_gallus/representative/GCF_000002315.6_GRCg6a/GCF_000002315.6_GRCg6a_genomic.fna.gz"

# now we can run the wget commands by referring to these variables
wget $url1 -q -O ./genomes/virus.fna.gz wget$url2 -q -O ./genomes/yeast.fna.gz


### Format of the GFF file¶

Let's take a look at the file. It is tab-delimited, and contains several columns with information about where genomic regions start and stop, and what information is known about those regions. The fields listed on each line of the GFF file are as follows:

seqid - name of the chromosome or scaffold;
source - name of the program, database, or project that generated this feature;
type - type of feature. Must be a term or accession from the SOFA sequence ontology;
start - Start position of the feature, with sequence numbering starting at 1;
end - End position of the feature, with sequence numbering starting at 1;
score - A floating point value;
strand - defined as + (forward) or - (reverse);
phase - One of '0', '1' or '2': which base in the first base of a codon;
attributes - A semicolon-separated list of tag-value pairs of additional info;

Let's look at the first few lines of the file using head. The first few lines start with #, these are comment lines that make up the header of the file. This is for storing metadata; information about this file and how it was made. The contents of the file are GFF table are then listed below.

In [14]:
%%bash
# look at a few lines of the GFF
zcat genomes/yeast.gff.gz | head -n 15

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-2-1
##sequence-region NC_001133.9 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
NC_001133.9	RefSeq	region	1	230218	.	+	.	ID=NC_001133.9:1..230218;Dbxref=taxon:559292;Name=I;chromosome=I;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288C
NC_001133.9	RefSeq	telomere	1	801	.	-	.	ID=id-NC_001133.9:1..801;Dbxref=SGD:S000028862;Note=TEL01L%3B Telomeric region on the left arm of Chromosome I%3B composed of an X element core sequence%2C X element combinatorial repeats%2C and a short terminal stretch of telomeric repeats;gbkey=telomere
NC_001133.9	RefSeq	origin_of_replication	707	776	.	+	.	ID=id-NC_001133.9:707..776;Dbxref=SGD:S000121252;Note=ARS102~Autonomously Replicating Sequence;gbkey=rep_origin
NC_001133.9	RefSeq	gene	1807	2169	.	-	.	ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;end_range=2169,.;gbkey=Gene;gene=PAU8;gene_biotype=protein_coding;locus_tag=YAL068C;partial=true;start_range=.,1807
NC_001133.9	RefSeq	mRNA	1807	2169	.	-	.	ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxref=GeneID:851229,Genbank:NM_001180043.1;Name=NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
NC_001133.9	RefSeq	exon	1807	2169	.	-	.	ID=exon-NM_001180043.1-1;Parent=rna-NM_001180043.1;Dbxref=GeneID:851229,Genbank:NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
NC_001133.9	RefSeq	CDS	1807	2169	.	-	0	ID=cds-NP_009332.1;Parent=rna-NM_001180043.1;Dbxref=SGD:S000002142,GeneID:851229,Genbank:NP_009332.1;Name=NP_009332.1;Note=hypothetical protein%3B member of the seripauperin multigene family encoded mainly in subtelomeric regions;experiment=EXISTENCE:mutant phenotype:GO:0030437 ascospore formation [PMID:12586695];gbkey=CDS;gene=PAU8;locus_tag=YAL068C;product=seripauperin PAU8;protein_id=NP_009332.1


### Selecting fields from a delimited file¶

As you can see it is kind of hard to read the GFF file using a terminal, especially without turning on line-wrapping for long-lines (which unfortunately is not easy to do within a jupyter notebook). The basic unix tools are great for quickly searching for elements in the GFF file, because again, these files can sometimes be very large. The Yeast genome is considered quite small, and its GFF file is still more than 26K lines long, so it would still be hard to make sense of if you tried to interact with it using excel.

In the next class we will learn other convenient ways for examining and working with tabular data using Python. But for now let's explore some the fast and powerful ways to work with data tables from a terminal using bash tools.

### Using cut¶

The program cut) is useful for selecting delimited text (text with a separating character). Below we will combine cut with zcat and grep to select elements from the GFF file. Let's start by selecting just the first few columns so that the results are a bit more readable. We will also use grep reverse matching (-v) which tells it to exclude (as opposed to include) the first few lines of metadata that match our query.

This is our first complex chain of commands in bash, we are piping together 4 bash programs! Each plays a small role and together they can accomplish a rather specific and complex task of showing the first few rows and columns of this data table.

In [15]:
%%bash
# read file | get lines not starting with # | get fields 1-5 | show first 10 lines
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | head -n 10

NC_001133.9	RefSeq	region	1	230218
NC_001133.9	RefSeq	telomere	1	801
NC_001133.9	RefSeq	origin_of_replication	707	776
NC_001133.9	RefSeq	gene	1807	2169
NC_001133.9	RefSeq	mRNA	1807	2169
NC_001133.9	RefSeq	exon	1807	2169
NC_001133.9	RefSeq	CDS	1807	2169
NC_001133.9	RefSeq	gene	2480	2707
NC_001133.9	RefSeq	mRNA	2480	2707
NC_001133.9	RefSeq	exon	2480	2707


### Nested features¶

As you can see from the few features in the GFF file above, the first element is defined as a region and this spans from position 1-230219. The next feature is defined as a telomere which stretches from positions 1-801. How can both of these features start at position 1? It is because the second is nested within the first. It is a telomere located in the first region. In this case, region 1 is referring to the first chromosome. Let's confirm this by looking at all of the region elements in the file. Below you can see that there are many distinct regions, each starting from position 1 (on different chromosomes) and stretching to a different end position (depending on the length of the chromosome).

In [16]:
%%bash
# read file | exclude lines starting with # | get fields 1-5 | keep lines with word 'region'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region'

NC_001133.9	RefSeq	region	1	230218
NC_001134.8	RefSeq	region	1	813184
NC_001135.5	RefSeq	region	1	316620
NC_001136.10	RefSeq	region	1	1531933
NC_001137.3	RefSeq	region	1	576874
NC_001138.5	RefSeq	region	1	270161
NC_001139.9	RefSeq	region	1	1090940
NC_001140.6	RefSeq	region	1	562643
NC_001141.2	RefSeq	region	1	439888
NC_001142.9	RefSeq	region	1	745751
NC_001143.9	RefSeq	region	1	666816
NC_001144.5	RefSeq	region	1	1078177
NC_001145.3	RefSeq	region	1	924431
NC_001146.8	RefSeq	region	1	784333
NC_001147.6	RefSeq	region	1	1091291
NC_001148.4	RefSeq	region	1	948066
NC_001224.1	RefSeq	region	1	85779


Something to take note of here: each region here has a different seqid (the first column in the data). This is because the seqid refers to chromosomes/scaffolds and each region here is a chromosome. All of the other elements nested inside of the regions have the same seqid, but different start and stop positions. We can see this by running the same command as above, but adding the -A argument to grep so that it also prints 1 additional line after each match. Here you can see that the telomere that starts each chromosome has the same seqid as the chromosome, and stretches from 1 to some length.

In [17]:
%%bash
# read file | not lines start w/ # | only fields 1-5 | only w/ 'region' & print next line
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region' -A1

NC_001133.9	RefSeq	region	1	230218
NC_001133.9	RefSeq	telomere	1	801
--
NC_001134.8	RefSeq	region	1	813184
NC_001134.8	RefSeq	telomere	1	6608
--
NC_001135.5	RefSeq	region	1	316620
NC_001135.5	RefSeq	telomere	1	1098
--
NC_001136.10	RefSeq	region	1	1531933
NC_001136.10	RefSeq	telomere	1	904
--
NC_001137.3	RefSeq	region	1	576874
NC_001137.3	RefSeq	telomere	1	6473
--
NC_001138.5	RefSeq	region	1	270161
NC_001138.5	RefSeq	telomere	1	5530
--
NC_001139.9	RefSeq	region	1	1090940
NC_001139.9	RefSeq	telomere	1	781
--
NC_001140.6	RefSeq	region	1	562643
NC_001140.6	RefSeq	telomere	1	5505
--
NC_001141.2	RefSeq	region	1	439888
NC_001141.2	RefSeq	telomere	1	7784
--
NC_001142.9	RefSeq	region	1	745751
NC_001142.9	RefSeq	telomere	1	7767
--
NC_001143.9	RefSeq	region	1	666816
NC_001143.9	RefSeq	telomere	1	807
--
NC_001144.5	RefSeq	region	1	1078177
NC_001144.5	RefSeq	telomere	1	12085
--
NC_001145.3	RefSeq	region	1	924431
NC_001145.3	RefSeq	telomere	1	6344
--
NC_001146.8	RefSeq	region	1	784333
NC_001146.8	RefSeq	telomere	1	7428
--
NC_001147.6	RefSeq	region	1	1091291
NC_001147.6	RefSeq	telomere	1	847
--
NC_001148.4	RefSeq	region	1	948066
NC_001148.4	RefSeq	telomere	1	7223
--
NC_001224.1	RefSeq	region	1	85779
NC_001224.1	RefSeq	gene	731	802


### Counting genes in the GFF file¶

One type of feature listed in the GFF file are genes. As we said earlier, this term is used to describe a particular type of feature according to a set of definitions (an ontology) for how to define genomic elements (more on this next week). Let's search for how many genes are in the Yeast GFF file by grepping the word "gene". Take note, we are not searching the whole file for just the term gene, because it actually appears quite commonly (for example, in the words genetic or geneID). Instead, we first pull out the column of features so that we are only searching this, and then search for word matches (-w option) with grep. Finally, we can use the -c option of grep to count the number of matching lines rather than return them.

In [18]:
%%bash
# read file | get 3rd field | grep -w = match to word 'gene', -c count number of times
zcat genomes/yeast.gff.gz | cut -f 3 | grep -wc "gene"

6427

Question: Using google search to find how many genes are estimated to be in the yeast (Saccharomyces cereviseae) genome. Is it similar to what you found in the grep command above? Why do you think the results might vary? List the online source you used for the number of genes in yeast.

### Response:

This number matches closely to an estimate online. The number probably changes among different annotated assemblies because they are *estimates*. Many sources available, just google.

### Pipes: from simple to complex¶

Using pipes you can combine many unix commands to accomplish complex tasks. Although the function of each individual program may seem limited, as you learn more and more of these atomic tasks you can combine them to do just about aything with text, and often quite fast. Here we combine several commands to count the top ten features that occur most in the Yeast GFF file and order and print them.

Don't worry, we do not expect you to memorize all of these commands. It takes time to learn and remember which types of bash tools exist, and what they can do. Simply being exposed to each is the first step though.

In [19]:
%%bash
# extend a command over multiple lines
zcat genomes/yeast.gff.gz | grep -v '^#' | cut -sf 3 | sort | uniq -c | sort -rn | head

   6760 exon
6427 gene
6366 CDS
5995 mRNA
383 long_terminal_repeat
360 origin_of_replication
299 tRNA
76 snoRNA
64 centromere
50 mobile_genetic_element


#### Splitting commands over multiple lines¶

Sometimes long commands can become difficult to read. You can split a command over multiple lines by ending each line with a forward slash character. This makes it easier to read. This is the same command as above, where you can clearly see each line starts with the name of a bash tool that is being called, which includes 7 in total.

In [20]:
%%bash
# extend a command over multiple lines
zcat genomes/yeast.gff.gz | \
grep -v '^#' | \
cut -s -f 3 | \
sort | \
uniq -c | \
sort -rn | \
head

   6760 exon
6427 gene
6366 CDS
5995 mRNA
383 long_terminal_repeat
360 origin_of_replication
299 tRNA
76 snoRNA
64 centromere
50 mobile_genetic_element


### Parsing attribute information from the GFF¶

The final column of the GFF file, the attributes, is a bit more difficult to read and comprehend, but this is where the real details about each genetic feature are found.

For example, you can find gene names, and descriptions of their functions in the attributes section, as well as get ontology IDs that can be used to learn more about a type of feature. Below I use cut to select the 9th (last) column of the GFF, then grep to select only rows that include the text "ID=gene", and then head to print only the first match. The result is the attribute text of the first gene in the file.

In [21]:
%%bash
# select the attributes (last column) of the first gene
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | head -n 1

ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;end_range=2169,.;gbkey=Gene;gene=PAU8;gene_biotype=protein_coding;locus_tag=YAL068C;partial=true;start_range=.,1807


As you can see, the information in the attribute field is one long string of text that is delimited by semicolons. So to parse it we need to split elements on this character. Below is an example using a combination of grep and cut commands like we used above, but now we print the first and third items in the attributes string, and apply this to the first 10 genes in the file.

Again, I realize this string of commands looks complex. Do not be overwhelmed by it. Try to read it part by part to understand what each element is doing. You can practice by taking it apart from beginning to end. We will try examples of this in the next notebook.

In [22]:
%%bash
# print the first 10 genes and the gene names
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | cut -d';' -f 1,3 | head -n 10

ID=gene-YAL068C;Name=PAU8
ID=gene-YAL067W-A;Name=YAL067W-A
ID=gene-YAL067C;Name=SEO1
ID=gene-YAL065C;Name=YAL065C
ID=gene-YAL064W-B;Name=YAL064W-B
ID=gene-YAL064C-A;Name=TDA8
ID=gene-YAL064W;Name=YAL064W
ID=gene-YAL063C-A;Name=YAL063C-A
ID=gene-YAL063C;Name=FLO9
ID=gene-YAL062W;Name=GDH3


### Using awk to match text¶

Another great tool that can perform similar tasks to this is called awk. The basic format of an awk command is in the form of a string argument stating "x == y {do z}".

It is similar to cut in that it is designed to work on delimited fields of text (e.g., columns of the data file). Below we print the attributes field ($9) if the type field ($3) matches 'gene'. In other words, we are asking it to print the 9th column if the word 'gene' is in the 3rd column. Using awk we can accomplish similar tasks to what we did above using grep and cut.

NB: Awk is a complex tool that can take a long time to master. We do not expect you to master it today. Just try it out. By learning that it exists and what it can do you can more easily google later to find an awk command that will accomplish the task you are attempting. But don't worry if you find this last section difficult.

In [23]:
%%bash
# if field 3 is gene then return field 9
zcat genomes/yeast.gff.gz | awk '$3 == "gene" {print$9}' | head -n 5

ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;end_range=2169,.;gbkey=Gene;gene=PAU8;gene_biotype=protein_coding;locus_tag=YAL068C;partial=true;start_range=.,1807
ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=YAL067W-A;end_range=2707,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YAL067W-A;partial=true;start_range=.,2480
ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1;end_range=9016,.;gbkey=Gene;gene=SEO1;gene_biotype=protein_coding;locus_tag=YAL067C;partial=true;start_range=.,7235
ID=gene-YAL065C;Dbxref=GeneID:851232;Name=YAL065C;end_range=11951,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YAL065C;partial=true;start_range=.,11565
ID=gene-YAL064W-B;Dbxref=GeneID:851233;Name=YAL064W-B;end_range=12426,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=YAL064W-B;partial=true;start_range=.,12046


### Awk, pipe, and repeat¶

We can combine multiple awk calls so that we can parse the results from one command with another. For example, we could select elements of each line by splitting on tabs using the -F (field) argument, and then parse those results by splitting them with a different delimiter like a semicolon by using -F ;. This can be useful for parsing a result like in the example above. Let's try this to select a few items (1, 2, and 3) from the attributes field (field 9) of lines in the GFF file describing genes.

In [24]:
%%bash
# if field 3 is gene then return field 9, then split on ; and select first 3 items
zcat genomes/yeast.gff.gz | \
awk -F '\t' '$3 == "gene" {print$9}' | \
awk -F ';' '{print $1,$2,$3}' | \ head -n 10  ID=gene-YAL068C Dbxref=GeneID:851229 Name=PAU8 ID=gene-YAL067W-A Dbxref=GeneID:1466426 Name=YAL067W-A ID=gene-YAL067C Dbxref=GeneID:851230 Name=SEO1 ID=gene-YAL065C Dbxref=GeneID:851232 Name=YAL065C ID=gene-YAL064W-B Dbxref=GeneID:851233 Name=YAL064W-B ID=gene-YAL064C-A Dbxref=GeneID:851234 Name=TDA8 ID=gene-YAL064W Dbxref=GeneID:851235 Name=YAL064W ID=gene-YAL063C-A Dbxref=GeneID:1466427 Name=YAL063C-A ID=gene-YAL063C Dbxref=GeneID:851236 Name=FLO9 ID=gene-YAL062W Dbxref=GeneID:851237 Name=GDH3  ### Formatting string results¶ We can also format the returned results from an awk command to make the results prettier or easier to parse by the next command in a pipe. We can do this using the OFS (output field separator) argument to awk. Below we output the result of the first awk command as semicolon separated values, and then parse the next string on semicolons. Read the comment lines above each line of code below to follow along with what it is doing. Try modifying part of the code to see how it changes the output. Remember you can look back at this cell to see what the GFF file looks like that we are parsing. In [25]: %%bash # decompress and stream gff file and pipe to next command zcat genomes/yeast.gff.gz | \ # split on tab; if field 3 is gene then return (start;stop;attributes) awk -F '\t' '{OFS=";"}$3 == "gene" {print $4,$5,$9} ' | \ # split on semicolon; get attributes 1 2 5 return as \t separated awk -F ';' '{OFS="\t"} {print$1,$2,$5}' | \

# show only first 10 results
head -n 10

1807	2169	Name=PAU8
2480	2707	Name=YAL067W-A
7235	9016	Name=SEO1
11565	11951	Name=YAL065C
12046	12426	Name=YAL064W-B
13363	13743	Name=TDA8
21566	21850	Name=YAL064W
22395	22685	Name=YAL063C-A
24000	27968	Name=FLO9
31567	32940	Name=GDH3


### Get the seqid, start, and end positions of all telomeres¶

How many telomeres do you expect to find on each chromosome, and where should they be located? Use this information to make sure your results make sense for the exercise below. Note: not all genomes are likely to have telomeres accurately assembled, since these regions are often highly repetitive. In smaller genomes like Yeast they are easier to assemble.

Action: Use any of the tools we have discussed so far. Return a tab-delimited table with the positions of all of the telomeres in the Yeast genome. Each line should have the following information: (seqid) (feature type) (start) (stop). Write your code in the cell block below. **Hint**: consider combining several commands together using pipes that accomplish the tasks as listed below. Look at the examples above in the notebook for how to to each of these steps individually.  read file | exclude if starts w/ # | get fields 1,3,4,5 | include if has 'telomere'
In [5]:
%%bash

# read file | exclude lines starting with # | get fields 1-5 | keep lines with word 'region'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1,3,4,5 | grep -w 'telomere'

NC_001133.9	telomere	1	801
NC_001133.9	telomere	229411	230218
NC_001134.8	telomere	1	6608
NC_001134.8	telomere	812379	813184
NC_001135.5	telomere	1	1098
NC_001135.5	telomere	315783	316620
NC_001136.10	telomere	1	904
NC_001136.10	telomere	1524625	1531933
NC_001137.3	telomere	1	6473
NC_001137.3	telomere	569599	576874
NC_001138.5	telomere	1	5530
NC_001138.5	telomere	269731	270161
NC_001139.9	telomere	1	781
NC_001139.9	telomere	1083635	1090940
NC_001140.6	telomere	1	5505
NC_001140.6	telomere	556105	562643
NC_001141.2	telomere	1	7784
NC_001141.2	telomere	439068	439888
NC_001142.9	telomere	1	7767
NC_001142.9	telomere	744902	745751
NC_001143.9	telomere	1	807
NC_001143.9	telomere	665904	666816
NC_001144.5	telomere	1	12085
NC_001144.5	telomere	1064281	1078177
NC_001145.3	telomere	1	6344
NC_001145.3	telomere	923541	924431
NC_001146.8	telomere	1	7428
NC_001146.8	telomere	783278	784333
NC_001147.6	telomere	1	847
NC_001147.6	telomere	1083922	1091291
NC_001148.4	telomere	1	7223
NC_001148.4	telomere	942396	948010

Action: When completed save and download this notebook in HTML format and submit it along with the other notebooks for this unit to courseworks.