Notebook 1.1: Introduction to bash

Learning objectives

This notebook will demonstrate several introductory bash scripting skills that are useful for bioinformatics, but applied in the context of learning about genomic data. By the end of this exercise you will:

  1. Recognize that the genome can be described using text data.
  2. Be familiar with the fasta and GFF/GTF genomic file formats.
  3. Know how to execute bash code in jupyter notebooks.
  4. Recognize there are many useful unix tools for manipulating text/genomic data.

Prior knowledge

I am sure that some of you may already have a lot of experience with bash scripting, while others may have none at all. If you are in the latter group, do not worry! You do not need to have any prior bioinformatics experience to complete the assignments in this class.

After completing this notebook, if you want to seek a more thorough introduction to bash there are many resources online. Two that I recommend include:

  1. The first 4 modules in the Linux Command Line Interface Tutorial on Codio, which you can access with your University UNI. This can be found in codio by clicking on the Courses tab on the left side, and then the Recommended tab on the top bar.
  1. This short and [simple introductory bash tutorial].(

What to do when you get stuck?

Throughout this course you will surely at some point try to write code to answer a problem and it will return an error message instead of the correct answer. This is expected. It happens all the time. And it is a great learning opportunity. When you see an error message be sure to read it carefully. Some parts of the message may look very confusing, these likely represent a stack trace, which attempts to tell you which part of the code is causing a problem. In addition to this, you will usually get an error message. The message tells you the type of error.

To learn how to solve a problem that is causing an error you can first try modifying your code to see whether there is an obvious fix, such as a typo or syntax error. If you still can't solve the problem, try googling the error message to find out what it means, and to see how other people have encountered and solved similar problems.

Every error has happened before. If you encountered it, then millions of other people have surely encountered the same problem as well. You will be surprised how often you can find the answer you are looking for online. We will return to this topic frequently, about how to find answers to bioinformatics questions. The best trick in bioinformatics is learning how to ask google the right questions to find your answers.

Reminder: Questions that require you to respond using Markdown (text) or Actions that require you to edit and execute code to produce new results will be highlighted on a green background.

Topic: reference genomes

The human genome sequencing project completed in 2003 was a major breakthrough in biology that heralded a new era of genome-centric research. Most people are now familiar with the term genome, and the fact that a genome represents the full genetic sequence of an individual. But how familiar are you with the concept of the genome as a data object? How is a genome represented? Is it in one file, or many? Where is it published and how can you find it? Does it contain every DNA base-pair, or every gene in a genome, or can some be missing?

The short answer is No; even the most complete genome assemblies usually have some amount of missing information. For small genomes, like viruses, we can easily sequence every base, however, there remains variation among individuals and populations that is not represented in a single reference genome. For larger genomes, like humans, some parts of the genome are difficult to assemble because they are highly repetitive (e.g., much of the Y chromosome), and these are often missing or error-prone.

Many genome assemblies are published only in draft form, meaning that they are still a work in progress. The best genomes are assembled into complete chromosomes, whereas many others are still represented as thousands of assembled fragments which we don't yet know how to piece together correctly. New technologies for improving the contiguity of genome assemblies will be a major topic in this class, and it is something that is evolving rapidly. But for now, let's start by looking at a few small complete genomes.

DNA as text

To look at genomes means to look at text files. That's right, genomic data can be represented as simple text. The genome sequence can be represented using the letters A, C, G, and T to refer to the nucleotides (molecules) that make up DNA. Pretty convenient, huh? In addition to the sequence of a genome a published reference assembly usually contains several additional files containing meta information about the genome, such as an annotation of where genes are located. We'll discuss many of those file later as well.

Because genomic data files are typically large, analyzing or even just looking at genomic data requires learning to use efficient software tools for working with large text files. It may surprise you to learn that many of the best tools for this are actually the basic command line tools developed for unix based computers. Computer file systems are also written in text, and so many of the most common and basic programming tools are designed specifically to be able to very efficiently read, edit, or analyze text.

Jupyter notebooks

Jupyter notebooks are a powerful tool for combining code and written text together to create documents for sharing your work. In science we refer to this as reproducibility -- creating documents that others can use to reproduce your work entirely. In this course, we will use notebooks to share code with you for completing assignments; you can run the code, learn from it, and even edit the code or text. Upon completing each notebook, you will have entered in new values and produced new results, which you can then save, download and submit as your completed assignment for us to grade. If you need a refresher on how to use jupyter notebooks look back at the first notebook for this unit.

Bash and Python

The primary coding language used in Jupyter notebooks is Python. However, other languages can be used as well. In this class we will primarily use just two languages: Python and Bash. (We will cover bash briefly first and then learn Python in more depth.)

Bash) is a language for interacting with a terminal (it is a command line language). We will primarily use bash to run executable computer programs in a terminal. Those program themselves may be written in a variety of computer languages, often C, fortran, or even Python. There are several languages similar to bash for running commands in a terminal, of which bash is the most commonly used.

Running bash in jupyter

In jupyter you can open a terminal at any time to execute bash code directly in it if you wish. We demonstrated this in class. However, for the purpose of your assignments we will have you instead execute bash code inside of jupyter notebooks. Again, this is for the sake of reproducibility, notebooks keep a record of the code that you execute, so it is easy to share, and for our purposes, to hand in as an assignment.

To say that we are going to "run a command line program using bash inside of a jupyter notebook" is describing a multi-layered process: jupyter actually uses Python to connect to a temporary bash terminal, which then runs the command line program, captures the output, and returns it to jupyter, which displays the output in your notebook. This sounds complicated, but it's all happening behind the scenes. To run bash code in the notebook all you have to do is append %%bash to the top of a code cell.

Unix command line tools

Most of the tools we are going to use in this tutorial will be available on any *nix based computer system (e,g., Linux, MacOSX), although the specific version that comes pre-installed on your system may vary slightly (e.g., you would replace zcat on linux with gunzip -c on Mac).

In this notebook we will explore the following unix programs. There are many more that we do not mention here, and there are many tutorials available on-line to learn more about them.

  1. pwd: print working directory
  2. mkdir: make a new directory
  3. ls: list contents in a directory
  4. wget: download from a URL (download or sub for curl on Mac)
  5. head: print first N lines of a file
  6. cat/zcat: stream/read through a file line by line
  7. grep: select lines from a file based on matching characters
  8. cut: select elements from text based on a delimiter/separator
  9. wc: count words, lines, or bytes in a file
  10. awk: match a pattern and perform action on conditional clause
In [1]:
# pwd: 'print working directory'

Atomized functions in *nix command-line programs

Please watch the following youtube video for an introduction to the philosophy behind the design of *nix-based command line programs that form the backbone of the bash shell: The core idea is that many small programs with simple functions can be combined together to accomplish very complex tasks.

The format of command line tools

As we discussed in class, almost all command line programs follow a similar syntax (set of rules) for how to enter arguments to them to make them run. You always start with a program name, and then optionally with a set or arguments (flags) that start with a dash. Finally, some programs require input (target) that they process to return a result. Below are some examples of this syntax. Next we will try it out with some interactive examples.

# program followed by optional flags and optional target
<program name> [-option1] [-option2] [target]

# example, to call the program ls (list)

# to call the program with a target (location to list)
ls ../notebooks

# to call the program with an option and target
ls -l ../notebooks

Create a new directory to store downloaded genome files

Here we will use the mkdir command to create a new directory (we'll add the -p option so that it doesn't matter if the directory already exists.) In this code block we also use the ls command with the option -l to show the contents of our current directory as a list. We can run as many bash commands as we want on separate lines of the same code block. However, when creating notebooks of your own it is generally good practice to break up code into atomized steps. From the output of ls below you can see that the new directory we created (genomes/) is now located inside of our current directory (called notebooks/), and is alongside several notebook files.

In [2]:
# make new directory called genomes in our current directory
mkdir -p ./genomes/

# show everything in my current directory as a list
ls -l
total 100
drwxr-xr-x 2 deren deren  4096 Jan 22 10:21 genomes
-rw-r--r-- 1 deren deren 11094 Jan 21 18:12 nb-1.0-jupyter.ipynb
-rw-r--r-- 1 deren deren 75287 Jan 22 10:44 nb-1.1-bash-genomes.ipynb
-rw-r--r-- 1 deren deren  7623 Jan 21 17:47 nb-1.2-genome-databases.ipynb

Where do we get genomic data?

There are several standardized databases for storing genomes and genomic data. We will use NCBI the most, since it is hosted by the US National Institutes of Health (NIH). We will cover genomic databases in more depth next week.

Open this link in a new tab to view an NCBI FTP site where the reference Human genome is hosted. As you can see in the URL we are visiting a site that is structured like a set of folders on a computer system, and in which we are jumping to vertebrate_mammalian/Homo_sapiens, which suggests that there are many other vertebrate mammalian genomes on this site as well. At the new page click on the link for GCF_000001405.39_GRCh38.p13/. This is the latest published human reference genome. That's right, assemblies are continually being updated as new data and new software tools become available, or as new individuals are sequenced.

As you can see there are many files containing different aspects of genome information. Some describe how the genome was assembled, others how many genes there are and where they are located. The assembled genome sequence itself is in a fasta file. This is the file we are interested in right now. Do not click to download yet, we will instead use a command line program to download files. This is generally good form, since you'll leave a record in your notebook of exactly where you got your data from (i.e., it is good for reproducibility).

Download a few (small) reference genome fasta files from public URLs

We will use the unix command wget to retrieve data from a URL, with the option -O (capital o, not a zero) to designate the location where we want the file to be saved (for simplicity we also rename the files to a simpler name: virus, yeast, and chicken). Let's also add the option -q (quiet) to repress progress messages from being printed during the downloads (unless your download fails, then remove the -q to see the error message). This command will take somewhere between 30 seconds and several minutes depending on your internet speed.

In [3]:
# to make this code more readable we will save the URLs as variables

# now we can run the wget commands by referring to these variables
wget $url1 -q -O ./genomes/virus.fna.gz
wget $url2 -q -O ./genomes/yeast.fna.gz
wget $url3 -q -O ./genomes/chicken.fna.gz

The format of the genome files

We just downloaded three fasta files. Such files will usually have a suffix like .fna or .fasta. You can see that all of our files have the ending .fna.gz, meaning that they are in fasta format, but they are also gzip compressed (.gz). Compression is used to store large files more efficiently. When files are compressed we sometimes need to use additional tools to decompress them before reading the contents. Many genomic software tools are designed to work with compressed files so that we can just enter the compressed file into the program directly. Here we will leave the files compressed to save disk space (e.g. the compressed Chicken genome is about 0.3 Gb) and simply decompress their contents on the fly as we read/stream it.

In [4]:
# show the contents of the genomes directory
ls -lh genomes
total 320M
-rw-r--r-- 1 deren deren 314M Feb 22  2019 chicken.fna.gz
-rw-r--r-- 1 deren deren 588K Jun 24  2018 virus.fna.gz
-rw-r--r-- 1 deren deren 3.7M Apr  9  2018 yeast.fna.gz
-rw-r--r-- 1 deren deren 1.6M Dec 22 12:30 yeast.gff.gz

Look at the first couple lines of the compressed file

A convenient tool for looking at just a few lines from a large file is the command head. We can pass this an additional argument to tell it how many lines to show us from the top of the file. Let's start by looking at the first 2 lines of one of the compressed genome files...

It looks like gibberish, right? Well, that's what compressed files look like if you don't decompress them.

In [5]:
# read a few lines of the compressed virus genome (it will look ugly)
head -n 2 genomes/virus.fna.gz
U�s�B{����.�]֣ZC�E�M-�]���YiŃN�tTc� X�I��]�",�tn�n�G�CiL�a���g}�p�5�,|ޒ�pޱ,~�Jp��PG-4�y�;M�ѹ�ə���*����j.T���q��LK�x���t㌎g�eWGҴ��j&r:�d����l�9S����4��_�Hhv��L_<�ߞ'{��Z�9�

cat / zcat to read/stream the entire contents of a file

Here we are using two tricks: first, the command zcat can be used as a replacement for cat, the normal command we would use to stream a file to stdout (the output stream). The second trick here is that we are using a pipe. As you learned in your linux tutorial, pipes are used to send the output of one program directly into the input of another. Here we pipe the streamed output of zcat directly into the program head. This way we do not read the entire file into memory at once, which would be too much text for a jupyter notebook to display comfortably, and instead capture only the first N lines to print to stdout. As you can see, now that we are using the zcat command to decompress the stream of text it looks like sensible DNA data.

In [6]:
# run zcat and PIPE the output into head
zcat genomes/virus.fna.gz | head -n 10
>NC_037667.1 Pandoravirus quercus, complete genome
Action: In the code cell below write and run a bash command like the one above to stream the contents of a genome file using zcat and pipe it into the head command. In your code block, select the *yeast genome file* instead of the virus, and change the -n option of head to print 20 lines instead of 10.
In [1]:
# write code here to run zcat and PIPE the output into head
zcat genomes/yeast.fna.gz | head -n 10
>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence

The fasta format

The fasta format is a simple file format used to list contiguous DNA sequences with identifiers. The sequences in a fasta file could be genes, or non-coding regions, or transposable elements, etc., or it could be all of those things ordered as a large contiguous sequence (a contig), or a series of contigs (a scaffold), or even as a complete chromosome, or multiple chromosomes. A small example is below:

> sequence 1 name and other notes about sequence 1
> sequence 2 name and other notes about sequence 2

In the virus genome file we can see that the name of the first sequence in the file is labeled "... complete genome", this is becuase the entire genome of this virus is contained on a single assembled chromosome.

Searching text (grep)

The names of each sequence in a fasta file begin with the character >. When a character is used to separate elements in a file like this we refer to it as a delimiter. The > character in a fasta file delimits where one sequence stops and the next one begins.

Knowing this, we can easily extract the names of all of the sequences in a fasta file by using the command line tool grep to extract lines that match a pattern we are searching for. You can find a quick grep tutorial here and in many other places online. We will demonstrate some usage below.

Let's add an additional pipe (remember, this connects the output from one program to be the input to another) to our command from above to now extract the sequence names in the Yeast genome file as we read/stream though it using the cat command. As you can see, the only lines of the file that are returned are those which matched the grep search pattern (i.e., contained a > character).

In [8]:
# pipe zcat output to grep
zcat genomes/yeast.fna.gz | grep ">"
>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
>NC_001134.8 Saccharomyces cerevisiae S288C chromosome II, complete sequence
>NC_001135.5 Saccharomyces cerevisiae S288C chromosome III, complete sequence
>NC_001136.10 Saccharomyces cerevisiae S288C chromosome IV, complete sequence
>NC_001137.3 Saccharomyces cerevisiae S288C chromosome V, complete sequence
>NC_001138.5 Saccharomyces cerevisiae S288C chromosome VI, complete sequence
>NC_001139.9 Saccharomyces cerevisiae S288C chromosome VII, complete sequence
>NC_001140.6 Saccharomyces cerevisiae S288C chromosome VIII, complete sequence
>NC_001141.2 Saccharomyces cerevisiae S288C chromosome IX, complete sequence
>NC_001142.9 Saccharomyces cerevisiae S288C chromosome X, complete sequence
>NC_001143.9 Saccharomyces cerevisiae S288C chromosome XI, complete sequence
>NC_001144.5 Saccharomyces cerevisiae S288C chromosome XII, complete sequence
>NC_001145.3 Saccharomyces cerevisiae S288C chromosome XIII, complete sequence
>NC_001146.8 Saccharomyces cerevisiae S288C chromosome XIV, complete sequence
>NC_001147.6 Saccharomyces cerevisiae S288C chromosome XV, complete sequence
>NC_001148.4 Saccharomyces cerevisiae S288C chromosome XVI, complete sequence
>NC_001224.1 Saccharomyces cerevisiae S288c mitochondrion, complete genome

Sequence names/labels

We can see from above that all of the sequences in the Yeast genome are complete chromosomes. This is not always the case, though. Below we run a grep search to get all lines of the Chicken genome file that contain "chromosome 1" in their names. As you can see, there are many scaffolds that match our search term, meaning there are many scaffolds in this assembly that are thought to be located somewhere on chromosome 1.

In [9]:
zcat genomes/chicken.fna.gz | grep "chromosome 1" | head -n 10
>NC_006088.5 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1, GRCg6a
>NW_020109737.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_134_RANDOM
>NW_020109738.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_360_RANDOM
>NW_020109739.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_123_RANDOM
>NW_020109740.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_132_RANDOM
>NW_020109741.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_330_RANDOM
>NW_020109742.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_247_RANDOM
>NW_020109743.1 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 1 unlocalized genomic scaffold, GRCg6a CHR1_347_RANDOM
>NC_006097.5 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 10, GRCg6a
>NC_006098.5 Gallus gallus breed Red Jungle Fowl isolate RJF #256 chromosome 11, GRCg6a
Question: Why do you think the Chicken genome assembly contains many scaffolds with Chromosome 1 in the name whereas the Yeast genome has only one scaffold labeled chromosome I? Why does it take longer to search the Chicken genome? Answer in the markdown cell below.


The chromosome 1 is assembled completely in Yeast, but in Chicken there are "unlocalized" regions that cannot be placed yet. It takes longer to search the chicken genome because it is much larger.

Genome structure: mitochondrial genomes

We can use grep to match lines of a file in order to answer basic questions about its contents. For example, we can search for the term "mitochond" in each genome to see if there is a sequence that is labeled as being from a mitochondrian.

In [10]:

zcat genomes/virus.fna.gz | grep mitochond
zcat genomes/yeast.fna.gz | grep mitochond
zcat genomes/chicken.fna.gz | grep mitochond
>NC_001224.1 Saccharomyces cerevisiae S288c mitochondrion, complete genome
>NC_040902.1 Gallus gallus spadiceus isolate YP19903 breed Red jungle fowl mitochondrion, complete genome
Question: Which one of the genomes does not have a chromosome with "mitochond" in its name (as we would expect if it has a mitochondrial genome)? Why would this genome not have a mitochondrian? Answer in the Markdown cell below.


Only eukaryotes have a mitochondrian so it makes sense that a virus lacks mitochondrial DNA.

How large is each genome?

Organisms differ widely in their genome sizes, but how do we know how large genomes are? (read here)

We can estimate genome sizes from several types of measurements. The most common way is to use flow cytometry to estimate genome size from the mass of DNA in a cell. However, in addition, to the extent that we believe that our genome is reasonably accurately assembled (a topic we will discuss later), we can also estimate the genome size by simply asking how many characters are in the fasta file of the assembled genome.

For this, we can use the program wc, which stands for word count. We will also pass the option -m to return the result as a count of characters (as opposed to lines or bytes or the other things wc can count). Remember, we still need to decompress the file when running this command, since as we saw above the contents of compressed files look very different from the decompressed version. Below we use zcat to pipe a stream of decompressed text into wc.

In [11]:
# count characters in each file
zcat genomes/virus.fna.gz | wc -m 
zcat genomes/yeast.fna.gz | wc -m
zcat genomes/chicken.fna.gz | wc -m
In [12]:
# more accurately, we should exclude seq name lines using grep (line starting with '>')
zcat genomes/virus.fna.gz | grep -v '^>' | wc -m 
zcat genomes/yeast.fna.gz | grep -v '^>' | wc -m
zcat genomes/chicken.fna.gz | grep -v '^>' | wc -m
Question: Using google search to find the size of the baker's yeast genome (Saccharomyces cereviseae). Does the number of characters in the genome file match closely to the estimated genome size? List the source you used for the yeast genome size.


Yes it matches closely. 12 Mb ([Organism]&cmd=DetailsSearch)

Genome annotations: Features (exons, introns, repeats, etc.)

Once a genome has been assembled the next task is typically to try to describe the contents of the genome in terms of the function of its different parts. This is termed genome annotatation. Where are the genes, how many are there, and within each how many exons or introns are there? Where are repetitive elements, and how much of the genome do they compose? How such features are identified will be the subject of next week's assignment, but assuming they have been identified correctly, we will look now at how these results are saved and how we can access that information.

General feature format (GFF/GTF)

There are several formats for storing identified genome features called GFF or GTF or GFF3 files. In this tabular formatted file feature names are mapped to coordinates of the genome in terms of the scaffold or chromosome names and their start:stop positions. Features are named according to a specific naming system (ontology) that we will discuss more in the future. But for now, take note that it is a hierarchical system (subunits nested within higher level units), as represented by the figure below. All of the elements below gene1 are parts that make up the unit we are calling gene 1, which includes messenger RNAs, exons, and coding sequences (exons - introns - UTRs).

    gene1            ================================    ID=gene1  
    mRNA1            ================================    ID=mRNA1;Parent=gene1  
    five_prime_UTR   ==                                  Parent=mRNA1  
    CDS1               ==....=====...........==          Parent=mRNA1 (3 rows)  
    three_prime_UTR                            ======    Parent=mRNA1
    mRNA2            ================================    ID=mRNA2;Parent=gene1
    exon             ====                                Parent=mRNA2
    CDS2               ==....................==          Parent=mRNA2 (2 rows)
    exon                                     ========    Parent=mRNA2

Download the GFF file for the Yeast genome assembly.

As before we will use the wget command to download the file which is available from the same FTP location as the genome fasta file that we downloaded earlier. The command for wget takes the URL and we tell it to write the file to the genomes directory.

In [13]:
# download GFF file for Yeast assembly from URL
wget $url -q -O ./genomes/yeast.gff.gz

Format of the GFF file

Let's take a look at the file. It is tab-delimited, and contains several columns with information about where genomic regions start and stop, and what information is known about those regions. The fields listed on each line of the GFF file are as follows:

seqid - name of the chromosome or scaffold;
source - name of the program, database, or project that generated this feature;
type - type of feature. Must be a term or accession from the SOFA sequence ontology;
start - Start position of the feature, with sequence numbering starting at 1;
end - End position of the feature, with sequence numbering starting at 1;
score - A floating point value;
strand - defined as + (forward) or - (reverse);
phase - One of '0', '1' or '2': which base in the first base of a codon;
attributes - A semicolon-separated list of tag-value pairs of additional info;

Let's look at the first few lines of the file using head. The first few lines start with #, these are comment lines that make up the header of the file. This is for storing metadata; information about this file and how it was made. The contents of the file are GFF table are then listed below.

In [14]:
# look at a few lines of the GFF
zcat genomes/yeast.gff.gz | head -n 15
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-2-1
##sequence-region NC_001133.9 1 230218
NC_001133.9	RefSeq	region	1	230218	.	+	.	ID=NC_001133.9:1..230218;Dbxref=taxon:559292;Name=I;chromosome=I;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=S288C
NC_001133.9	RefSeq	telomere	1	801	.	-	.	ID=id-NC_001133.9:1..801;Dbxref=SGD:S000028862;Note=TEL01L%3B Telomeric region on the left arm of Chromosome I%3B composed of an X element core sequence%2C X element combinatorial repeats%2C and a short terminal stretch of telomeric repeats;gbkey=telomere
NC_001133.9	RefSeq	origin_of_replication	707	776	.	+	.	ID=id-NC_001133.9:707..776;Dbxref=SGD:S000121252;Note=ARS102~Autonomously Replicating Sequence;gbkey=rep_origin
NC_001133.9	RefSeq	gene	1807	2169	.	-	.	ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;end_range=2169,.;gbkey=Gene;gene=PAU8;gene_biotype=protein_coding;locus_tag=YAL068C;partial=true;start_range=.,1807
NC_001133.9	RefSeq	mRNA	1807	2169	.	-	.	ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxref=GeneID:851229,Genbank:NM_001180043.1;Name=NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
NC_001133.9	RefSeq	exon	1807	2169	.	-	.	ID=exon-NM_001180043.1-1;Parent=rna-NM_001180043.1;Dbxref=GeneID:851229,Genbank:NM_001180043.1;end_range=2169,.;gbkey=mRNA;gene=PAU8;locus_tag=YAL068C;partial=true;product=seripauperin PAU8;start_range=.,1807;transcript_id=NM_001180043.1
NC_001133.9	RefSeq	CDS	1807	2169	.	-	0	ID=cds-NP_009332.1;Parent=rna-NM_001180043.1;Dbxref=SGD:S000002142,GeneID:851229,Genbank:NP_009332.1;Name=NP_009332.1;Note=hypothetical protein%3B member of the seripauperin multigene family encoded mainly in subtelomeric regions;experiment=EXISTENCE:mutant phenotype:GO:0030437 ascospore formation [PMID:12586695];gbkey=CDS;gene=PAU8;locus_tag=YAL068C;product=seripauperin PAU8;protein_id=NP_009332.1

Selecting fields from a delimited file

As you can see it is kind of hard to read the GFF file using a terminal, especially without turning on line-wrapping for long-lines (which unfortunately is not easy to do within a jupyter notebook). The basic unix tools are great for quickly searching for elements in the GFF file, because again, these files can sometimes be very large. The Yeast genome is considered quite small, and its GFF file is still more than 26K lines long, so it would still be hard to make sense of if you tried to interact with it using excel.

In the next class we will learn other convenient ways for examining and working with tabular data using Python. But for now let's explore some the fast and powerful ways to work with data tables from a terminal using bash tools.

Using cut

The program cut) is useful for selecting delimited text (text with a separating character). Below we will combine cut with zcat and grep to select elements from the GFF file. Let's start by selecting just the first few columns so that the results are a bit more readable. We will also use grep reverse matching (-v) which tells it to exclude (as opposed to include) the first few lines of metadata that match our query.

This is our first complex chain of commands in bash, we are piping together 4 bash programs! Each plays a small role and together they can accomplish a rather specific and complex task of showing the first few rows and columns of this data table.

In [15]:
# read file | get lines not starting with # | get fields 1-5 | show first 10 lines
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | head -n 10
NC_001133.9	RefSeq	region	1	230218
NC_001133.9	RefSeq	telomere	1	801
NC_001133.9	RefSeq	origin_of_replication	707	776
NC_001133.9	RefSeq	gene	1807	2169
NC_001133.9	RefSeq	mRNA	1807	2169
NC_001133.9	RefSeq	exon	1807	2169
NC_001133.9	RefSeq	CDS	1807	2169
NC_001133.9	RefSeq	gene	2480	2707
NC_001133.9	RefSeq	mRNA	2480	2707
NC_001133.9	RefSeq	exon	2480	2707

Nested features

As you can see from the few features in the GFF file above, the first element is defined as a region and this spans from position 1-230219. The next feature is defined as a telomere which stretches from positions 1-801. How can both of these features start at position 1? It is because the second is nested within the first. It is a telomere located in the first region. In this case, region 1 is referring to the first chromosome. Let's confirm this by looking at all of the region elements in the file. Below you can see that there are many distinct regions, each starting from position 1 (on different chromosomes) and stretching to a different end position (depending on the length of the chromosome).

In [16]:
# read file | exclude lines starting with # | get fields 1-5 | keep lines with word 'region'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region'
NC_001133.9	RefSeq	region	1	230218
NC_001134.8	RefSeq	region	1	813184
NC_001135.5	RefSeq	region	1	316620
NC_001136.10	RefSeq	region	1	1531933
NC_001137.3	RefSeq	region	1	576874
NC_001138.5	RefSeq	region	1	270161
NC_001139.9	RefSeq	region	1	1090940
NC_001140.6	RefSeq	region	1	562643
NC_001141.2	RefSeq	region	1	439888
NC_001142.9	RefSeq	region	1	745751
NC_001143.9	RefSeq	region	1	666816
NC_001144.5	RefSeq	region	1	1078177
NC_001145.3	RefSeq	region	1	924431
NC_001146.8	RefSeq	region	1	784333
NC_001147.6	RefSeq	region	1	1091291
NC_001148.4	RefSeq	region	1	948066
NC_001224.1	RefSeq	region	1	85779

Something to take note of here: each region here has a different seqid (the first column in the data). This is because the seqid refers to chromosomes/scaffolds and each region here is a chromosome. All of the other elements nested inside of the regions have the same seqid, but different start and stop positions. We can see this by running the same command as above, but adding the -A argument to grep so that it also prints 1 additional line after each match. Here you can see that the telomere that starts each chromosome has the same seqid as the chromosome, and stretches from 1 to some length.

In [17]:
# read file | not lines start w/ # | only fields 1-5 | only w/ 'region' & print next line
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region' -A1
NC_001133.9	RefSeq	region	1	230218
NC_001133.9	RefSeq	telomere	1	801
NC_001134.8	RefSeq	region	1	813184
NC_001134.8	RefSeq	telomere	1	6608
NC_001135.5	RefSeq	region	1	316620
NC_001135.5	RefSeq	telomere	1	1098
NC_001136.10	RefSeq	region	1	1531933
NC_001136.10	RefSeq	telomere	1	904
NC_001137.3	RefSeq	region	1	576874
NC_001137.3	RefSeq	telomere	1	6473
NC_001138.5	RefSeq	region	1	270161
NC_001138.5	RefSeq	telomere	1	5530
NC_001139.9	RefSeq	region	1	1090940
NC_001139.9	RefSeq	telomere	1	781
NC_001140.6	RefSeq	region	1	562643
NC_001140.6	RefSeq	telomere	1	5505
NC_001141.2	RefSeq	region	1	439888
NC_001141.2	RefSeq	telomere	1	7784
NC_001142.9	RefSeq	region	1	745751
NC_001142.9	RefSeq	telomere	1	7767
NC_001143.9	RefSeq	region	1	666816
NC_001143.9	RefSeq	telomere	1	807
NC_001144.5	RefSeq	region	1	1078177
NC_001144.5	RefSeq	telomere	1	12085
NC_001145.3	RefSeq	region	1	924431
NC_001145.3	RefSeq	telomere	1	6344
NC_001146.8	RefSeq	region	1	784333
NC_001146.8	RefSeq	telomere	1	7428
NC_001147.6	RefSeq	region	1	1091291
NC_001147.6	RefSeq	telomere	1	847
NC_001148.4	RefSeq	region	1	948066
NC_001148.4	RefSeq	telomere	1	7223
NC_001224.1	RefSeq	region	1	85779
NC_001224.1	RefSeq	gene	731	802

Counting genes in the GFF file

One type of feature listed in the GFF file are genes. As we said earlier, this term is used to describe a particular type of feature according to a set of definitions (an ontology) for how to define genomic elements (more on this next week). Let's search for how many genes are in the Yeast GFF file by grepping the word "gene". Take note, we are not searching the whole file for just the term gene, because it actually appears quite commonly (for example, in the words genetic or geneID). Instead, we first pull out the column of features so that we are only searching this, and then search for word matches (-w option) with grep. Finally, we can use the -c option of grep to count the number of matching lines rather than return them.

In [18]:
# read file | get 3rd field | grep -w = match to word 'gene', -c count number of times
zcat genomes/yeast.gff.gz | cut -f 3 | grep -wc "gene"
Question: Using google search to find how many genes are estimated to be in the yeast (Saccharomyces cereviseae) genome. Is it similar to what you found in the grep command above? Why do you think the results might vary? List the online source you used for the number of genes in yeast.


This number matches closely to an estimate online. The number probably changes among different annotated assemblies because they are *estimates*. Many sources available, just google.

Pipes: from simple to complex

Using pipes you can combine many unix commands to accomplish complex tasks. Although the function of each individual program may seem limited, as you learn more and more of these atomic tasks you can combine them to do just about aything with text, and often quite fast. Here we combine several commands to count the top ten features that occur most in the Yeast GFF file and order and print them.

Don't worry, we do not expect you to memorize all of these commands. It takes time to learn and remember which types of bash tools exist, and what they can do. Simply being exposed to each is the first step though.

In [19]:
# extend a command over multiple lines
zcat genomes/yeast.gff.gz | grep -v '^#' | cut -sf 3 | sort | uniq -c | sort -rn | head
   6760 exon
   6427 gene
   6366 CDS
   5995 mRNA
    383 long_terminal_repeat
    360 origin_of_replication
    299 tRNA
     76 snoRNA
     64 centromere
     50 mobile_genetic_element

Splitting commands over multiple lines

Sometimes long commands can become difficult to read. You can split a command over multiple lines by ending each line with a forward slash character. This makes it easier to read. This is the same command as above, where you can clearly see each line starts with the name of a bash tool that is being called, which includes 7 in total.

In [20]:
# extend a command over multiple lines
zcat genomes/yeast.gff.gz | \
    grep -v '^#' | \
    cut -s -f 3 | \
    sort | \
    uniq -c | \
    sort -rn | \
   6760 exon
   6427 gene
   6366 CDS
   5995 mRNA
    383 long_terminal_repeat
    360 origin_of_replication
    299 tRNA
     76 snoRNA
     64 centromere
     50 mobile_genetic_element

Parsing attribute information from the GFF

The final column of the GFF file, the attributes, is a bit more difficult to read and comprehend, but this is where the real details about each genetic feature are found.

For example, you can find gene names, and descriptions of their functions in the attributes section, as well as get ontology IDs that can be used to learn more about a type of feature. Below I use cut to select the 9th (last) column of the GFF, then grep to select only rows that include the text "ID=gene", and then head to print only the first match. The result is the attribute text of the first gene in the file.

In [21]:
# select the attributes (last column) of the first gene
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | head -n 1

As you can see, the information in the attribute field is one long string of text that is delimited by semicolons. So to parse it we need to split elements on this character. Below is an example using a combination of grep and cut commands like we used above, but now we print the first and third items in the attributes string, and apply this to the first 10 genes in the file.

Again, I realize this string of commands looks complex. Do not be overwhelmed by it. Try to read it part by part to understand what each element is doing. You can practice by taking it apart from beginning to end. We will try examples of this in the next notebook.

In [22]:
# print the first 10 genes and the gene names
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | cut -d';' -f 1,3 | head -n 10

Using awk to match text

Another great tool that can perform similar tasks to this is called awk. The basic format of an awk command is in the form of a string argument stating "x == y {do z}".

It is similar to cut in that it is designed to work on delimited fields of text (e.g., columns of the data file). Below we print the attributes field ($9) if the type field ($3) matches 'gene'. In other words, we are asking it to print the 9th column if the word 'gene' is in the 3rd column. Using awk we can accomplish similar tasks to what we did above using grep and cut.

NB: Awk is a complex tool that can take a long time to master. We do not expect you to master it today. Just try it out. By learning that it exists and what it can do you can more easily google later to find an awk command that will accomplish the task you are attempting. But don't worry if you find this last section difficult.

In [23]:
# if field 3 is gene then return field 9
zcat genomes/yeast.gff.gz | awk '$3 == "gene" {print $9}' | head -n 5

Awk, pipe, and repeat

We can combine multiple awk calls so that we can parse the results from one command with another. For example, we could select elements of each line by splitting on tabs using the -F (field) argument, and then parse those results by splitting them with a different delimiter like a semicolon by using -F ;. This can be useful for parsing a result like in the example above. Let's try this to select a few items (1, 2, and 3) from the attributes field (field 9) of lines in the GFF file describing genes.

In [24]:
# if field 3 is gene then return field 9, then split on ; and select first 3 items
zcat genomes/yeast.gff.gz | \
    awk -F '\t' '$3 == "gene" {print $9}' | \
    awk -F ';' '{print $1,$2,$3}' | \
    head -n 10
ID=gene-YAL068C Dbxref=GeneID:851229 Name=PAU8
ID=gene-YAL067W-A Dbxref=GeneID:1466426 Name=YAL067W-A
ID=gene-YAL067C Dbxref=GeneID:851230 Name=SEO1
ID=gene-YAL065C Dbxref=GeneID:851232 Name=YAL065C
ID=gene-YAL064W-B Dbxref=GeneID:851233 Name=YAL064W-B
ID=gene-YAL064C-A Dbxref=GeneID:851234 Name=TDA8
ID=gene-YAL064W Dbxref=GeneID:851235 Name=YAL064W
ID=gene-YAL063C-A Dbxref=GeneID:1466427 Name=YAL063C-A
ID=gene-YAL063C Dbxref=GeneID:851236 Name=FLO9
ID=gene-YAL062W Dbxref=GeneID:851237 Name=GDH3

Formatting string results

We can also format the returned results from an awk command to make the results prettier or easier to parse by the next command in a pipe. We can do this using the OFS (output field separator) argument to awk. Below we output the result of the first awk command as semicolon separated values, and then parse the next string on semicolons. Read the comment lines above each line of code below to follow along with what it is doing. Try modifying part of the code to see how it changes the output. Remember you can look back at this cell to see what the GFF file looks like that we are parsing.

In [25]:

# decompress and stream gff file and pipe to next command
zcat genomes/yeast.gff.gz | \
    # split on tab; if field 3 is gene then return (start;stop;attributes)
    awk -F '\t' '{OFS=";"}  $3 == "gene" {print $4,$5,$9} ' | \
    # split on semicolon; get attributes 1 2 5 return as \t separated
    awk -F ';'  '{OFS="\t"} {print $1,$2,$5}' | \
    # show only first 10 results
    head -n 10
1807	2169	Name=PAU8
2480	2707	Name=YAL067W-A
7235	9016	Name=SEO1
11565	11951	Name=YAL065C
12046	12426	Name=YAL064W-B
13363	13743	Name=TDA8
21566	21850	Name=YAL064W
22395	22685	Name=YAL063C-A
24000	27968	Name=FLO9
31567	32940	Name=GDH3

Get the seqid, start, and end positions of all telomeres

How many telomeres do you expect to find on each chromosome, and where should they be located? Use this information to make sure your results make sense for the exercise below. Note: not all genomes are likely to have telomeres accurately assembled, since these regions are often highly repetitive. In smaller genomes like Yeast they are easier to assemble.

Action: Use any of the tools we have discussed so far. Return a tab-delimited table with the positions of all of the telomeres in the Yeast genome. Each line should have the following information: (seqid) (feature type) (start) (stop). Write your code in the cell block below. **Hint**: consider combining several commands together using pipes that accomplish the tasks as listed below. Look at the examples above in the notebook for how to to each of these steps individually. ` read file | exclude if starts w/ # | get fields 1,3,4,5 | include if has 'telomere' `
In [5]:

# read file | exclude lines starting with # | get fields 1-5 | keep lines with word 'region'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1,3,4,5 | grep -w 'telomere'
NC_001133.9	telomere	1	801
NC_001133.9	telomere	229411	230218
NC_001134.8	telomere	1	6608
NC_001134.8	telomere	812379	813184
NC_001135.5	telomere	1	1098
NC_001135.5	telomere	315783	316620
NC_001136.10	telomere	1	904
NC_001136.10	telomere	1524625	1531933
NC_001137.3	telomere	1	6473
NC_001137.3	telomere	569599	576874
NC_001138.5	telomere	1	5530
NC_001138.5	telomere	269731	270161
NC_001139.9	telomere	1	781
NC_001139.9	telomere	1083635	1090940
NC_001140.6	telomere	1	5505
NC_001140.6	telomere	556105	562643
NC_001141.2	telomere	1	7784
NC_001141.2	telomere	439068	439888
NC_001142.9	telomere	1	7767
NC_001142.9	telomere	744902	745751
NC_001143.9	telomere	1	807
NC_001143.9	telomere	665904	666816
NC_001144.5	telomere	1	12085
NC_001144.5	telomere	1064281	1078177
NC_001145.3	telomere	1	6344
NC_001145.3	telomere	923541	924431
NC_001146.8	telomere	1	7428
NC_001146.8	telomere	783278	784333
NC_001147.6	telomere	1	847
NC_001147.6	telomere	1083922	1091291
NC_001148.4	telomere	1	7223
NC_001148.4	telomere	942396	948010
Action: When completed save and download this notebook in HTML format and submit it along with the other notebooks for this unit to courseworks.