This notebook will demonstrate several introductory bash scripting skills that are useful for bioinformatics, but applied in the context of learning about genomic data. By the end of this exercise you will:
I am sure that some of you may already have a lot of experience with bash scripting, while others may have none at all. If you are in the latter group, do not worry! You do not need to have any prior bioinformatics experience to complete the assignments in this class.
After completing this notebook, if you want to seek a more thorough introduction to bash there are many resources online. Two that I recommend include:
Throughout this course you will at some point write code to answer a problem and get an error message instead of the correct answer. This is expected. It happens all the time. And it is a great learning opportunity. When you see an error message, be sure to read it carefully. Some parts of the message may look very confusing; these likely represent a stack trace, which attempts to tell you which part of the code is causing the problem. At the end you will usually find the error message itself, which tells you the type of error.
To learn how to solve a problem that is causing an error you can first try modifying your code to see whether there is an obvious fix, such as a typo or syntax error. If you still can't solve the problem, try googling the error message to find out what it means, and to see how other people have encountered and solved similar problems.
Every error has happened before. If you encountered it, then many other people have surely encountered the same problem as well. You will be surprised how often you can find the answer you are looking for online. We will return frequently to this topic of how to find answers to bioinformatics questions. The best trick in bioinformatics is learning how to ask google the right questions to find your answers.
The Human Genome Project, completed in 2003, was a major breakthrough in biology that heralded a new era of genome-centric research. Most people are now familiar with the term genome, and the fact that a genome represents the full genetic sequence of an individual. But how familiar are you with the concept of the genome as a data object? How is a genome represented? Is it in one file, or many? Where is it published and how can you find it? Does it contain every DNA base-pair, or every gene in a genome, or can some be missing?
The short answer is No; even the most complete genome assemblies usually have some amount of missing information. For small genomes, like viruses, we can easily sequence every base, however, there remains variation among individuals and populations that is not represented in a single reference genome. For larger genomes, like humans, some parts of the genome are difficult to assemble because they are highly repetitive (e.g., much of the Y chromosome), and these are often missing or error-prone.
Many genome assemblies are published only in draft form, meaning that they are still a work in progress. The best genomes are assembled into complete chromosomes, whereas many others are still represented as thousands of assembled fragments which we don't yet know how to piece together correctly. New technologies for improving the contiguity of genome assemblies will be a major topic in this class, and it is something that is evolving rapidly. But for now, let's start by looking at a few small complete genomes.
To look at genomes means to look at text files. That's right, genomic data can be represented as simple text. The genome sequence can be represented using the letters A, C, G, and T to refer to the nucleotides (molecules) that make up DNA. Pretty convenient, huh? In addition to the sequence itself, a published reference assembly usually contains several additional files with meta-information about the genome, such as an annotation of where genes are located. We'll discuss many of those files later as well.
Because genomic data files are typically large, analyzing or even just looking at genomic data requires learning to use efficient software tools for working with large text files. It may surprise you to learn that many of the best tools for this are the basic command line tools developed for unix-based computers. Much of what a computer stores, from configuration files to source code, is also plain text, and so many of the most common and basic programming tools are designed specifically to read, edit, or analyze text very efficiently.
Jupyter notebooks are a powerful tool for combining code and written text together to create documents for sharing your work. In science we refer to this as reproducibility -- creating documents that others can use to reproduce your work entirely. In this course, we will use notebooks to share code with you for completing assignments; you can run the code, learn from it, and even edit the code or text. Upon completing each notebook, you will have entered in new values and produced new results, which you can then save, download and submit as your completed assignment for us to grade. If you need a refresher on how to use jupyter notebooks look back at the first notebook for this unit.
The primary coding language used in Jupyter notebooks is Python. However, other languages can be used as well. In this class we will primarily use just two languages: Python and Bash. (We will cover bash briefly first and then learn Python in more depth.)
Bash is a language for interacting with a terminal (it is a command line language). We will primarily use bash to run executable computer programs in a terminal. Those programs themselves may be written in a variety of computer languages, often C, Fortran, or even Python. There are several languages similar to bash for running commands in a terminal, of which bash is the most commonly used.
In jupyter you can open a terminal at any time to execute bash code directly in it if you wish. We demonstrated this in class. However, for the purpose of your assignments we will have you instead execute bash code inside of jupyter notebooks. Again, this is for the sake of reproducibility: notebooks keep a record of the code that you execute, so it is easy to share and, for our purposes, to hand in as an assignment.
To say that we are going to "run a command line program using bash inside of a jupyter notebook" describes a multi-layered process: jupyter actually uses Python to connect to a temporary bash terminal, which then runs the command line program, captures the output, and returns it to jupyter, which displays the output in your notebook. This sounds complicated, but it all happens behind the scenes. To run bash code in the notebook all you have to do is add `%%bash` to the top of a code cell.
Most of the tools we are going to use in this tutorial are available on any *nix based computer system (e.g., Linux, MacOSX), although the specific version that comes pre-installed on your system may vary slightly (e.g., you would replace `zcat` on Linux with `gunzip -c` on Mac).
In this notebook we will explore the following unix programs. There are many more that we do not mention here, and there are many tutorials available online to learn more about them.
- `pwd`: print working directory
- `mkdir`: make a new directory
- `ls`: list contents of a directory
- `wget`: download from a URL (or substitute `curl` on Mac)
- `head`: print the first N lines of a file
- `cat`/`zcat`: stream/read through a file line by line
- `grep`: select lines from a file based on matching characters
- `cut`: select elements from text based on a delimiter/separator
- `wc`: count words, lines, or bytes in a file
- `awk`: match a pattern and perform an action on a conditional clause

%%bash
# pwd: 'print working directory'
pwd
Please watch the following youtube video for an introduction to the philosophy behind the design of *nix-based command line programs that form the backbone of the bash shell: https://www.youtube.com/watch?v=tc4ROCJYbm0. The core idea is that many small programs with simple functions can be combined together to accomplish very complex tasks.
As we discussed in class, almost all command line programs follow a similar syntax (set of rules) for how to enter arguments to make them run. You always start with a program name, then optionally a set of arguments (flags) that start with a dash. Finally, some programs require an input (target) that they process to return a result. Below are some examples of this syntax. Next we will try it out with some interactive examples.
# program followed by optional flags and optional target
<program name> [-option1] [-option2] [target]
# example, to call the program ls (list)
ls
# to call the program with a target (location to list)
ls ../notebooks
# to call the program with an option and target
ls -l ../notebooks
Here we will use the `mkdir` command to create a new directory (we'll add the `-p` option so that it doesn't matter if the directory already exists). In this code block we also use the `ls` command with the option `-l` to show the contents of our current directory as a list. We can run as many bash commands as we want on separate lines of the same code block. However, when creating notebooks of your own it is generally good practice to break up code into atomized steps. From the output of `ls` below you can see that the new directory we created (genomes/) is now located inside of our current directory (called notebooks/), alongside several notebook files.
%%bash
# make new directory called genomes in our current directory
mkdir -p ./genomes/
# show everything in my current directory as a list
ls -l
There are several standardized databases for storing genomes and genomic data. We will use NCBI the most, since it is hosted by the US National Institutes of Health (NIH). We will cover genomic databases in more depth next week.
Open this link in a new tab https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/ to view an NCBI FTP site where the reference human genome is hosted. As you can see in the URL, we are visiting a site that is structured like a set of folders on a computer system, in which we are jumping to vertebrate_mammalian/Homo_sapiens, which suggests that there are many other vertebrate mammalian genomes on this site as well. At the new page click on the link for GCF_000001405.39_GRCh38.p13/. This is the latest published human reference genome. That's right, assemblies are continually being updated as new data and new software tools become available, or as new individuals are sequenced.
As you can see there are many files containing different aspects of genome information. Some describe how the genome was assembled, others how many genes there are and where they are located. The assembled genome sequence itself is in a fasta file. This is the file we are interested in right now. Do not click to download yet, we will instead use a command line program to download files. This is generally good form, since you'll leave a record in your notebook of exactly where you got your data from (i.e., it is good for reproducibility).
We will use the unix command `wget` to retrieve data from a URL, with the option `-O` (capital o, not a zero) to designate the location where we want the file to be saved (for simplicity we also rename the files to simpler names: virus, yeast, and chicken). Let's also add the option `-q` (quiet) to suppress progress messages from being printed during the downloads (if your download fails, remove the `-q` to see the error message). This command will take somewhere between 30 seconds and several minutes depending on your internet speed.
%%bash
# to make this code more readable we will save the URLs as variables
url1="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/Pandoravirus_quercus/latest_assembly_versions/GCF_003233895.1_ASM323389v1/GCF_003233895.1_ASM323389v1_genomic.fna.gz"
url2="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz"
url3="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gallus_gallus/representative/GCF_000002315.6_GRCg6a/GCF_000002315.6_GRCg6a_genomic.fna.gz"
# now we can run the wget commands by referring to these variables
wget $url1 -q -O ./genomes/virus.fna.gz
wget $url2 -q -O ./genomes/yeast.fna.gz
wget $url3 -q -O ./genomes/chicken.fna.gz
We just downloaded three fasta files. Such files will usually have a suffix like `.fna` or `.fasta`. You can see that all of our files have the ending `.fna.gz`, meaning that they are in fasta format but are also gzip compressed (.gz). Compression is used to store large files more efficiently. When files are compressed we sometimes need to use additional tools to decompress them before reading the contents. Many genomic software tools are designed to work with compressed files so that we can enter the compressed file into the program directly. Here we will leave the files compressed to save disk space (e.g., the compressed chicken genome is about 0.3 GB) and simply decompress their contents on the fly as we read/stream them.
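As a minimal sketch of how this round-trips, the commands below create a tiny fasta file (the file name `demo.fna` is made up for illustration), compress it with `gzip`, and stream it back with `zcat` without ever writing a decompressed copy to disk (run this in a `%%bash` cell):

```shell
# make a tiny fasta file, compress it, then read it back on the fly
printf '>seq1 demo sequence\nACGTACGT\n' > demo.fna
gzip -f demo.fna      # replaces demo.fna with compressed demo.fna.gz
zcat demo.fna.gz      # decompresses to stdout; the .gz file stays on disk
```

On Mac, use `gunzip -c demo.fna.gz` in place of `zcat` if it complains about the file suffix.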
%%bash
# show the contents of the genomes directory
ls -lh genomes
A convenient tool for looking at just a few lines from a large file is the command `head`. We can pass it an additional argument to tell it how many lines to show from the top of the file. Let's start by looking at the first 2 lines of one of the compressed genome files...
It looks like gibberish, right? Well, that's what compressed files look like if you don't decompress them.
%%bash
# read a few lines of the compressed virus genome (it will look ugly)
head -n 2 genomes/virus.fna.gz
### `cat`/`zcat` to read/stream the entire contents of a file

Here we are using two tricks: first, the command `zcat` can be used as a replacement for `cat`, the normal command we would use to stream a file to stdout (the output stream). The second trick is that we are using a pipe. As you learned in your linux tutorial, pipes are used to send the output of one program directly into the input of another. Here we pipe the streamed output of `zcat` directly into the program `head`. This way we do not read the entire file into memory at once, which would be too much text for a jupyter notebook to display comfortably; instead we capture only the first N lines to print to stdout. As you can see, now that we are using `zcat` to decompress the stream of text, it looks like sensible DNA data.
%%bash
# run zcat and PIPE the output into head
zcat genomes/virus.fna.gz | head -n 10
%%bash
# write code here to run zcat and PIPE the output into head
zcat genomes/yeast.fna.gz | head -n 10
The fasta format is a simple file format used to list contiguous DNA sequences with identifiers. The sequences in a fasta file could be genes, or non-coding regions, or transposable elements, etc., or it could be all of those things ordered as a large contiguous sequence (a contig), or a series of contigs (a scaffold), or even as a complete chromosome, or multiple chromosomes. A small example is below:
>sequence1 name and other notes about sequence 1
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAA
>sequence2 name and other notes about sequence 2
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAAAAAAATTTTTTTTCCCCCCCCCCCTTTTTTTTTTTTTGGGGGGGGGGG
AAAAAAAAAAAAAA
In the virus genome file we can see that the name of the first sequence is labeled "... complete genome"; this is because the entire genome of this virus is contained on a single assembled chromosome.
The names of each sequence in a fasta file begin with the character `>`. When a character is used to separate elements in a file like this we refer to it as a delimiter. The `>` character in a fasta file delimits where one sequence stops and the next one begins.
Knowing this, we can easily extract the names of all of the sequences in a fasta file by using the command line tool `grep` to extract lines that match a pattern we are searching for. You can find a quick grep tutorial here and in many other places online. We will demonstrate some usage below.

Let's add an additional pipe (remember, this connects the output from one program to the input of another) to our command from above to extract the sequence names in the yeast genome file as we read/stream through it using the `zcat` command. As you can see, the only lines of the file that are returned are those which matched the `grep` search pattern (i.e., contained a `>` character).
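Before running this on the real file, here is a self-contained sketch of the idea on a tiny made-up fasta (the sequence names are invented), runnable in a `%%bash` cell:

```shell
# write a two-sequence fasta inline, then pull out only the header lines
printf '>chrA tiny test\nACGT\n>chrB tiny test\nTTAA\n' > mini.fna
grep '>' mini.fna    # returns just the two lines containing '>'
```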
%%bash
# pipe zcat output to grep
zcat genomes/yeast.fna.gz | grep ">"
We can see from above that all of the sequences in the yeast genome are complete chromosomes. This is not always the case, though. Below we run a `grep` search to get all lines of the chicken genome file that contain "chromosome 1" in their names. As you can see, there are many scaffolds that match our search term, meaning there are many scaffolds in this assembly that are thought to be located somewhere on chromosome 1.
%%bash
zcat genomes/chicken.fna.gz | grep "chromosome 1" | head -n 10
We can use `grep` to match lines of a file in order to answer basic questions about its contents. For example, we can search for the term "mitochond" in each genome to see if there is a sequence that is labeled as being from a mitochondrion.
%%bash
zcat genomes/virus.fna.gz | grep mitochond
zcat genomes/yeast.fna.gz | grep mitochond
zcat genomes/chicken.fna.gz | grep mitochond
Organisms differ widely in their genome sizes, but how do we know how large genomes are? (read here)
We can estimate genome sizes from several types of measurements. The most common is to use flow cytometry to estimate genome size from the mass of DNA in a cell. However, to the extent that we believe our genome is reasonably accurately assembled (a topic we will discuss later), we can also estimate the genome size by simply asking how many characters are in the fasta file of the assembled genome.
For this, we can use the program `wc`, which stands for word count. We will also pass the option `-m` to return the result as a count of characters (as opposed to lines or bytes or the other things `wc` can count). Remember, we still need to decompress the file when running this command, since as we saw above the contents of compressed files look very different from the decompressed version. Below we use `zcat` to pipe a stream of decompressed text into `wc`.
%%bash
# count characters in each file
zcat genomes/virus.fna.gz | wc -m
zcat genomes/yeast.fna.gz | wc -m
zcat genomes/chicken.fna.gz | wc -m
%%bash
# more accurately, we should exclude seq name lines using grep (line starting with '>')
zcat genomes/virus.fna.gz | grep -v '^>' | wc -m
zcat genomes/yeast.fna.gz | grep -v '^>' | wc -m
zcat genomes/chicken.fna.gz | grep -v '^>' | wc -m
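One subtlety worth knowing: `wc -m` counts newline characters too, so even the filtered count slightly overstates the number of bases. A minimal sketch of the effect on a made-up one-sequence file (`tiny.fna` is an invented name), runnable in a `%%bash` cell:

```shell
# one header line plus 8 bases of sequence
printf '>seq1\nACGTACGT\n' > tiny.fna
# 8 bases, but wc -m reports 9 because the trailing newline is a character
grep -v '^>' tiny.fna | wc -m
# deleting newlines with tr before counting gives the true base count of 8
grep -v '^>' tiny.fna | tr -d '\n' | wc -m
```

For a multi-megabase genome the difference is tiny (one character per line), but it is good to know what the number actually measures.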
Once a genome has been assembled the next task is typically to try to describe the contents of the genome in terms of the function of its different parts. This is termed genome annotation. Where are the genes, how many are there, and within each, how many exons or introns are there? Where are repetitive elements, and how much of the genome do they compose? How such features are identified will be the subject of next week's assignment, but assuming they have been identified correctly, we will look now at how these results are saved and how we can access that information.
There are several formats for storing identified genome features, such as GFF, GTF, and GFF3 files. In these tabular files, feature names are mapped to coordinates of the genome in terms of the scaffold or chromosome names and their start:stop positions. Features are named according to a specific naming system (ontology) that we will discuss more in the future. But for now, take note that it is a hierarchical system (subunits nested within higher level units), as represented by the figure below. All of the elements below gene1 are parts that make up the unit we are calling gene1, which includes messenger RNAs, exons, and coding sequences (exons - introns - UTRs).
gene1               ================================   ID=gene1
  mRNA1             ================================   ID=mRNA1;Parent=gene1
    five_prime_UTR  ==                                 Parent=mRNA1
    CDS1              ==....=====...........==         Parent=mRNA1 (3 rows)
    three_prime_UTR                          ======    Parent=mRNA1
  mRNA2             ================================   ID=mRNA2;Parent=gene1
    exon            ====                               Parent=mRNA2
    CDS2              ==....................==         Parent=mRNA2 (2 rows)
    exon                                   ========    Parent=mRNA2
As before we will use the `wget` command to download the file, which is available from the same FTP location as the genome fasta file that we downloaded earlier. The command takes the URL, and we tell it to write the file to the genomes directory.
%%bash
# download GFF file for Yeast assembly from URL
url="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz"
wget $url -q -O ./genomes/yeast.gff.gz
Let's take a look at the file. It is tab-delimited, and contains several columns with information about where genomic regions start and stop, and what information is known about those regions. The fields listed on each line of the GFF file are as follows:
- `seqid`: name of the chromosome or scaffold
- `source`: name of the program, database, or project that generated this feature
- `type`: type of feature; must be a term or accession from the SOFA sequence ontology
- `start`: start position of the feature, with sequence numbering starting at 1
- `end`: end position of the feature, with sequence numbering starting at 1
- `score`: a floating point value
- `strand`: defined as + (forward) or - (reverse)
- `phase`: one of '0', '1', or '2', indicating which base of the feature is the first base of a codon
- `attributes`: a semicolon-separated list of tag-value pairs of additional info
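To make the nine fields concrete, the sketch below writes a single GFF-style line (all values are invented for illustration) and displays it; note that the fields must be separated by literal tab characters (run in a `%%bash` cell):

```shell
# one made-up GFF line: seqid, source, type, start, end, score, strand, phase, attributes
printf 'chrI\texample\tgene\t100\t500\t.\t+\t.\tID=gene1;Name=geneA\n' > one_line.gff
cat one_line.gff
```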
Let's look at the first few lines of the file using `head`. The first few lines start with `#`; these are comment lines that make up the header of the file. This is for storing metadata: information about this file and how it was made. The rows of the GFF table are then listed below.
%%bash
# look at a few lines of the GFF
zcat genomes/yeast.gff.gz | head -n 15
As you can see, it is kind of hard to read the GFF file in a terminal, especially without turning on line-wrapping for long lines (which unfortunately is not easy to do within a jupyter notebook). The basic unix tools are great for quickly searching for elements in the GFF file because, again, these files can sometimes be very large. The yeast genome is considered quite small, and its GFF file is still more than 26K lines long, so it would be hard to make sense of it if you tried to interact with it using excel.
In the next class we will learn other convenient ways of examining and working with tabular data using Python. But for now let's explore some of the fast and powerful ways to work with data tables from a terminal using bash tools.
### `cut`

The program `cut` is useful for selecting delimited text (text with a separating character). Below we will combine `cut` with `zcat` and `grep` to select elements from the GFF file. Let's start by selecting just the first few columns so that the results are a bit more readable. We will also use grep reverse matching (`-v`), which tells it to exclude (as opposed to include) the first few lines of metadata that match our query.
This is our first complex chain of commands in bash: we are piping together 4 bash programs! Each plays a small role, and together they accomplish the rather specific and complex task of showing the first few rows and columns of this data table.
%%bash
# read file | get lines not starting with # | get fields 1-5 | show first 10 lines
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | head -n 10
As you can see from the few features in the GFF file above, the first element is defined as a `region` spanning positions 1-230219. The next feature is defined as a telomere stretching from positions 1-801. How can both of these features start at position 1? It is because the second is nested within the first: it is a telomere located in the first region. In this case, the first region refers to the first chromosome. Let's confirm this by looking at all of the `region` elements in the file. Below you can see that there are many distinct regions, each starting from position 1 (on different chromosomes) and stretching to a different end position (depending on the length of the chromosome).
%%bash
# read file | exclude lines starting with # | get fields 1-5 | keep lines with word 'region'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region'
Something to take note of here: each region has a different `seqid` (the first column in the data). This is because the `seqid` refers to chromosomes/scaffolds, and each region here is a chromosome. All of the other elements nested inside of the regions have the same `seqid`, but different start and stop positions. We can see this by running the same command as above, but adding the `-A` argument to `grep` so that it also prints 1 additional line after each match. Here you can see that the telomere that starts each chromosome has the same `seqid` as the chromosome, and stretches from 1 to some length.
%%bash
# read file | not lines start w/ # | only fields 1-5 | only w/ 'region' & print next line
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1-5 | grep -w 'region' -A1
One type of feature listed in the GFF file is the gene. As we said earlier, this term describes a particular type of feature according to a set of definitions (an ontology) for how to define genomic elements (more on this next week). Let's count how many genes are in the yeast GFF file by grepping the word "gene". Take note, we are not searching the whole file for the term gene, because it actually appears quite commonly (for example, in the words genetic or geneID). Instead, we first pull out the column of feature types so that we are only searching this, and then search for word matches (`-w` option) with grep. Finally, we can use the `-c` option of grep to count the number of matching lines rather than return them.
%%bash
# read file | get 3rd field | grep -w = match to word 'gene', -c count number of times
zcat genomes/yeast.gff.gz | cut -f 3 | grep -wc "gene"
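To see why the `-w` flag matters here, consider this self-contained sketch with an invented feature column (run in a `%%bash` cell): the word 'pseudogene' contains the substring 'gene' but is not the whole word 'gene'.

```shell
# four made-up feature-type lines
printf 'gene\npseudogene\ngene\nmRNA\n' > feats.txt
grep -c 'gene' feats.txt    # counts 3: 'pseudogene' contains the substring
grep -wc 'gene' feats.txt   # counts 2: only whole-word matches
```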
Using pipes you can combine many unix commands to accomplish complex tasks. Although the function of each individual program may seem limited, as you learn more and more of these atomic tasks you can combine them to do just about anything with text, and often quite fast. Here we combine several commands to count the ten feature types that occur most often in the yeast GFF file, order them, and print them.
Don't worry, we do not expect you to memorize all of these commands. It takes time to learn and remember which types of bash tools exist, and what they can do. Simply being exposed to each is the first step though.
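Two of the programs in the chain below, `sort` and `uniq`, have not appeared yet. A minimal self-contained sketch of the counting idiom they provide (the file `types.txt` and its contents are invented), runnable in a `%%bash` cell:

```shell
# count occurrences of each line: sort groups identical lines together,
# uniq -c counts each group, and sort -rn orders from most to least common
printf 'gene\nexon\nexon\nCDS\nexon\n' > types.txt
sort types.txt | uniq -c | sort -rn
```

The most common line ('exon', appearing 3 times) comes out on top.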
%%bash
# count the top ten most common feature types
zcat genomes/yeast.gff.gz | grep -v '^#' | cut -sf 3 | sort | uniq -c | sort -rn | head
Sometimes long commands can become difficult to read. You can split a command over multiple lines by ending each line with a backslash character (\). This makes it easier to read. Below is the same command as above, where you can clearly see that each line starts with the name of a bash tool being called, 7 in total.
%%bash
# extend a command over multiple lines
zcat genomes/yeast.gff.gz | \
grep -v '^#' | \
cut -s -f 3 | \
sort | \
uniq -c | \
sort -rn | \
head
The final column of the GFF file, the attributes, is a bit more difficult to read and comprehend, but this is where the real details about each genetic feature are found.
For example, you can find gene names and descriptions of their functions in the attributes section, as well as ontology IDs that can be used to learn more about a type of feature. Below I use `cut` to select the 9th (last) column of the GFF, then `grep` to select only rows that include the text "ID=gene", and then `head` to print only the first match. The result is the attribute text of the first gene in the file.
%%bash
# select the attributes (last column) of the first gene
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | head -n 1
As you can see, the information in the `attributes` field is one long string of text that is delimited by semicolons. So to parse it we need to split elements on this character. Below is an example using a combination of `grep` and `cut` commands like we used above, but now we print the first and third items in the attributes string, and apply this to the first 10 genes in the file.
Again, I realize this string of commands looks complex. Do not be overwhelmed by it. Try to read it part by part to understand what each element is doing. You can practice by taking it apart from beginning to end. We will try examples of this in the next notebook.
%%bash
# print the first 10 genes and the gene names
zcat genomes/yeast.gff.gz | cut -f 9 | grep "ID=gene" | cut -d';' -f 1,3 | head -n 10
### `awk` to match text

Another great tool that can perform similar tasks is called `awk`. The basic format of an `awk` command is a string argument of the form `"x == y {do z}"`.
It is similar to cut in that it is designed to work on delimited fields of text (e.g., columns of the data file). Below we print the attributes field (`$9`) if the type field (`$3`) matches 'gene'. In other words, we are asking it to print the 9th column if the word 'gene' is in the 3rd column. Using awk we can accomplish similar tasks to what we did above using grep and cut.
NB: Awk is a complex tool that can take a long time to master. We do not expect you to master it today. Just try it out. By learning that it exists and what it can do you can more easily google later to find an awk command that will accomplish the task you are attempting. But don't worry if you find this last section difficult.
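Here is a self-contained sketch of this match-and-print pattern on an invented two-column table (the identifiers are made up), runnable in a `%%bash` cell:

```shell
# print field 2 only on lines where field 1 equals 'gene';
# -F '\t' tells awk the fields are tab-separated
printf 'gene\tYAL001C\nexon\tYAL001C.1\ngene\tYAL002W\n' > tab.txt
awk -F '\t' '$1 == "gene" {print $2}' tab.txt
```

Only the two 'gene' rows survive, and only their second field is printed.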
%%bash
# if field 3 is gene then return field 9
zcat genomes/yeast.gff.gz | awk '$3 == "gene" {print $9}' | head -n 5
We can combine multiple `awk` calls so that we can parse the results from one command with another. For example, we could select elements of each line by splitting on tabs using the `-F` (field separator) argument, and then parse those results by splitting them on a different delimiter, like a semicolon, using `-F ';'`. This can be useful for parsing a result like in the example above. Let's try this to select a few items (1, 2, and 3) from the attributes field (field 9) of lines in the GFF file describing genes.
%%bash
# if field 3 is gene then return field 9, then split on ; and select first 3 items
zcat genomes/yeast.gff.gz | \
awk -F '\t' '$3 == "gene" {print $9}' | \
awk -F ';' '{print $1,$2,$3}' | \
head -n 10
We can also format the results returned from an awk command to make them prettier or easier to parse by the next command in a pipe. We can do this using the OFS (output field separator) variable in `awk`. Below we output the result of the first awk command as semicolon-separated values, and then parse the next string on semicolons.
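A one-line sketch of what OFS does, independent of the GFF file (runnable in a `%%bash` cell):

```shell
# input fields are tab-separated (-F '\t'); setting OFS=";" makes
# print join the comma-separated fields with semicolons instead of spaces
printf 'a\tb\tc\n' | awk -F '\t' '{OFS=";"} {print $1, $2, $3}'
```

The output is `a;b;c`: the three tab-separated inputs are re-emitted joined by the new output separator.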
Read the comment lines above each line of code below to follow along with what it is doing. Try modifying part of the code to see how it changes the output. Remember you can look back at this cell to see what the GFF file looks like that we are parsing.
%%bash
# decompress and stream gff file and pipe to next command
zcat genomes/yeast.gff.gz | \
# split on tab; if field 3 is gene then return (start;stop;attributes)
awk -F '\t' '{OFS=";"} $3 == "gene" {print $4,$5,$9} ' | \
# split on semicolon; print items 1, 2, and 5 (start, stop, and the third attribute), tab-separated
awk -F ';' '{OFS="\t"} {print $1,$2,$5}' | \
# show only first 10 results
head -n 10
How many telomeres do you expect to find on each chromosome, and where should they be located? Use this information to make sure your results make sense for the exercise below. Note: not all genomes are likely to have telomeres accurately assembled, since these regions are often highly repetitive. In smaller genomes like Yeast they are easier to assemble.
%%bash
# read file | exclude lines starting with # | get fields 1,3,4,5 | keep lines with word 'telomere'
zcat genomes/yeast.gff.gz | grep -v "^#" | cut -f 1,3,4,5 | grep -w 'telomere'