This notebook will correspond with chapter 7 in the official Python tutorial https://docs.python.org/3/tutorial/.
By the end of this exercise you will:
Python is a very modular language, meaning that much of its functionality lives in separate packages that must be loaded before their utilities can be used. This keeps the base language lightweight, since none of these extras are loaded unless we ask for them. To load a package that is installed on our system we use the import
statement, as below. Here we also load a package that is not part of the standard library but was installed separately, called requests
, which is used to download data from the web.
import os
import gzip
import requests
Last week we learned about using the bash language to run code in a terminal. One of the commands we used several times was the program wget
, which is used to get data from the web, i.e., to download data from a URL. Run the bash script below to create a new folder and download two files into that folder. Notice the %%bash header which makes this code cell execute bash code instead of Python code.
%%bash
# creates a new directory
mkdir -p datafiles/
# downloads two files into the new directory
wget https://eaton-lab.org/data/40578.fastq.gz -q -O datafiles/40578.fastq.gz
wget https://eaton-lab.org/data/iris-data-dirty.csv -q -O datafiles/iris-data-dirty.csv
It turns out we can also perform the same task using Python. Here we will again create a new directory and download two files into it. We'll name the new directory datafiles2 to differentiate it from the first one, which was just called datafiles. In this case the Python version of the code looks quite a bit more complicated than the bash script. This isn't always the case; indeed, Python code is often much simpler to read. Throughout this notebook we will work through the Python code in the cell below piece by piece. By the end of this notebook you should be able to understand the code entirely.
# make a new directory
os.makedirs("datafiles2", exist_ok=True)
# the URL to file 1
url1 = "https://eaton-lab.org/data/40578.fastq.gz"
# open a file for writing and write the content to it
with open("./datafiles2/40578.fastq.gz", 'wb') as ffile:
ffile.write(requests.get(url1).content)
# the URL to file 2
url2 = "https://eaton-lab.org/data/iris-data-dirty.csv"
# open a file for writing and write the content to it
with open("./datafiles2/iris-data-dirty.csv", 'wb') as ffile:
ffile.write(requests.get(url2).content)
Another common tool that we learned in bash was the ls
command, which is used to look at the contents of a location in the filesystem.
%%bash
# the ls bash command shows the contents of the folder
ls ./datafiles/
There is an equivalent command in Python that can be called from the os
package of the standard library. Because we imported the os
package at the top of this notebook all of its functions are available for us to use. All of the functions that are associated with the os
package can be called from the os
variable, like below. In this example we call the .listdir()
function which is similar to the ls
command in bash. An important difference between the two is that while the bash command above simply printed the results, the listdir()
function below returns the results of the command as a Python list. Having the results in a list allows us to more easily perform further analyses on the contents of this location.
# the listdir() function from the os package is similar to 'ls'
os.listdir("datafiles2/")
The os
package has many functions but we will be using just a small part of it today, primarily the path
submodule. Just like everything else in Python, packages are objects, so we can access all of the functions in this package using tab completion. Put your cursor after the period in the cell below and press <tab>
to see available options in os
. There are many!
## use tab-completion after the '.' to see available options in os
os.
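If tab-completion isn't available, the built-in dir() function lists the same attribute names programmatically. A quick sketch:

```python
import os

# dir() returns a list of attribute names on any object, including modules;
# names starting with '_' are internal, so we filter them out
names = [n for n in dir(os) if not n.startswith("_")]

# functions used in this notebook appear in the list
print("listdir" in names, "makedirs" in names)  # True True

# show a few of the available names
print(names[:5])
```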
The os package

A filepath refers to a location on your computer's filesystem. For example, /home/deren/Documents/homework.docx could be the full path to a document on my computer.
Writing code to work with file paths can be difficult and error prone. If the string representation of a filepath contains even a single typo, the path will not work. This becomes extra tricky when a program needs to access filepaths on different types of computers, since filepaths look different on a Mac and a PC. Here the filesystem hierarchy that we learned in lesson 1 becomes important. Fortunately the os.path
package makes it easy to write code for filepaths that will work seamlessly across different computers.
os.path

The os.path
submodule is used to format filepaths. It can expand special characters in path names that are used as shortcuts (like the dot or double dot), it can join together multiple paths, and it can find special directories like $HOME or the current directory.
Essentially, the os.path
package has many similar functions to those we learned in bash scripting last week, such as pwd
to show your current directory, or ~
as a shorthand for your home directory. Here we can access those filepaths as string variables and work with them very easily.
NB: The goal here is not for you to master the os
package, but to understand that many such packages exist in the Python standard library and that you can use tab-completion, google search, and other sources to find them and learn to use them.
The two code cells below will print a different result depending on what your username is on your computer, and depending on where you opened this notebook from, respectively.
# return my $HOME directory
os.path.expanduser("~")
# convert relative path to a full path
os.path.abspath('./')
The function above takes a relative path (the path "./"
means here, where I am located), and it expands it into a full path, meaning that it explicitly shows the entire path from the root (/
) to the file or directory.
path = "./datafiles/iris-data-dirty.csv"
os.path.abspath(path)
# assign my current location to a variable called 'curdir'
curdir = os.path.abspath('.')
curdir
# get the lowest level directory in 'curdir'
os.path.basename(curdir)
# get the directory containing 'curdir'
os.path.dirname(curdir)
Because it can be hard to keep track of the "/" characters between directories and filenames, it is useful to use the .join
function of the os.path
module to join path names together. Here we will create a string variable with a new pathname for a file that doesn't yet exist in our current directory. You can see in the three examples below that it doesn't matter whether we include a "/" after a directory name or not; the join
function figures it out for us.
# see how os.path.join handles '/' characters in path names
print(os.path.join("/home/", "folder1/", "folder2", "newfile.txt"))
print(os.path.join("/home/", "folder1", "folder2", "newfile.txt"))
print(os.path.join("/home/", "folder1/", "folder2/", "newfile.txt"))
The os.path.join()
function can take an unlimited number of arguments and it will join them together ensuring that there is the correct number of "/" characters between folders. Simple but effective.
# get the full path name to a newfile in our current directory
newfile = os.path.join(curdir, "newfile.txt")
newfile
A key thing to be aware of when working with filepaths is that just because you've written a path does not mean that it actually exists. If it doesn't exist and you try to use the path to write or read a file it will raise an error.
Here again the os.path
package has a convenient function for asking exactly this question: os.path.exists()
. Below I show an example where the filepath does exist, and it returns True, and another where the filepath has a typo, and it returns False.
# this path is correct
print(os.path.exists("./datafiles/iris-data-dirty.csv"))
# this path is incorrect, it is missing an 's' in the dir name
print(os.path.exists("./datafile/iris-data-dirty.csv"))
The function open()
is used to interact with a file, either to read data from it, write data to it, or to do both. The syntax for using the open()
function is to provide two arguments like the following: open(filename, mode)
.
The filename is the name (path) to the file, and the mode is a one letter descriptor of how you plan to use the file. Options include w
for 'write', r
for 'read', or a
for 'append'. If you do not provide a mode argument the default mode 'r' is used, but it is generally good practice to be explicit with arguments when opening files; it helps prevent you from accidentally overwriting a file.
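As a quick illustration of the difference between 'w' and 'a', the sketch below writes a file and then appends to it; reopening in 'w' mode instead would have erased the first line. (It uses a temporary directory so it runs anywhere; the filename is just an example.)

```python
import os
import tempfile

# build an example path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or overwrites it if it already exists)
with open(path, 'w') as f:
    f.write("first line\n")

# 'a' appends to the end instead of overwriting
with open(path, 'a') as f:
    f.write("second line\n")

# 'r' reads the contents back; both lines are present
with open(path, 'r') as f:
    print(f.read())
```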
Below we will use the mode w
to write. The 'w' mode has the special property that the filepath you provide does not need to exist ahead of time: by providing a filename in 'w' mode you are asking Python to create a new file with this name. Below we create a file and then return the object so that a description of it is shown in stdout. As you can see, the object returned by the open()
function is an io.TextIOWrapper
object. Remember, everything in Python is an object. Read more on this below.
# get an open file object
ofile = open("./datafiles/helloworld.txt", 'w')
# return the file object
ofile
As with other objects, this variable ofile
has attributes and functions that we can access and see by using tab-completion. Move your cursor to the end of the object below after the period and use tab to see some of the options.
## use tab to see options associated with open file objects
ofile.
With a file object open in 'w' mode the most common thing to do next is to write data to the file. For this you use the .write()
function and provide it with a string to be written to the file. Below we write the words "Hello world" to the ofile object, which, remember, is open for writing to the filepath "./datafiles/helloworld.txt". Finally, after calling the write function we need to close the file; this tells the system that no more data will be written to it for now.
# write a string to the file; it returns the number of characters written
ofile.write("Hello world")
# when we are done writing to the file use .close()
ofile.close()
To read data from a file we need to first open a file object, just like when we wrote to a file, but now we use the mode flag r
. You can now access the data in the file using the .read()
function. Below when the .read()
function is called it returns the contents of the file as a string, which is stored to a variable called idata
. Finally, just like when writing, it is good practice to close a file when you are done reading the contents from it.
# open a file object for reading
ifile = open("./datafiles/iris-data-dirty.csv", 'r')
# read all contents of the file as a string
idata = ifile.read()
# close the file object
ifile.close()
Now that we've stored the contents of the file in the variable idata
we can interact with it just like any other string, since it is now a string object.
## print the first 100 characters of idata
print(idata[:100])
Gzip compression is easily handled in Python using the standard library. The gzip
module has an open()
function that acts just like the regular open
function to create a file object. You just use the gzip
version instead of the regular open
function to open and read a gzipped file properly. Let's try it out on the compressed fastq file we downloaded earlier. We'll also practice using os.path
to find the full filepath of the 40578.fastq.gz
file.
Note: there is one extra function below that we won't discuss much for now, but which we covered a bit in class. This is the .decode()
function, which is necessary in Python 3 to convert data from bytes to a string. Bytes are a more efficient way to store data, and the gzip file object returns bytes instead of strings by default. Because strings are easier to work with we simply convert the data to a string type.
# get full path to the file in our current directory
gzfile = os.path.abspath("./datafiles/40578.fastq.gz")
# open a gzip file
ffile = gzip.open(gzfile, 'r')
# read compressed data from this file and decode it
fdata = ffile.read().decode()
# close the file
ffile.close()
# print part of the string 'fdata'
print(fdata[:500])
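The bytes-versus-string distinction that makes .decode() necessary can be seen on its own, without any file involved:

```python
# a bytes object: raw data, like what gzip.open in 'r' mode returns
raw = b"ACGT"
print(type(raw))              # <class 'bytes'>

# .decode() converts bytes to a string
text = raw.decode()
print(type(text))             # <class 'str'>

# .encode() goes the other way, back to bytes
print(text.encode() == raw)   # True
```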
The .read() function

The read()
function is nice for reading in a large chunk of text, but it then requires us to parse that text using string processing, because all of the data is loaded as one big chunk. We usually then need to split
this text using some kind of delimiter. Let's try splitting the fastq data on newline characters ("\n"
) by using the string function .split()
. The result is stored below as a new variable, a list with each line of the file as an element.
# split fastq string data on newline characters to return a list
fastqlines = fdata.split("\n")
# print the first 10 list elements
fastqlines[:10]
The .readline() function

When we learned about the bash command cat I explained that it is very efficient to use because the data in a file are streamed to the output instead of all loaded at once. This is also possible with Python. In fact, the default behavior of a file object opened in 'read' mode is to act as a generator, which means that you can iterate over the object and it will return the next line one at a time until the end of the file. Below is an example where we iterate through lines of the file to count how many there are, without loading the entire file at once.
# open a gzip file
ffile = gzip.open(gzfile, 'r')
# an integer counter starting at 0
nlines = 0
# the file object is iterable, each element a line
for i in ffile:
nlines += 1
# print how many lines in the file
print(nlines, "lines in the file.")
# close the file
ffile.close()
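The same streaming idea works for any file-like object, and the manual counter can be written more compactly as a sum over a generator. A minimal sketch using an in-memory file (io.StringIO), so it runs without any file on disk:

```python
import io

# an in-memory file-like object holding three lines of text
fake_file = io.StringIO("line1\nline2\nline3\n")

# iterating yields one line at a time; sum(1 for ...) counts them
nlines = sum(1 for line in fake_file)
print(nlines, "lines in the file.")   # 3 lines in the file.
```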
More explicitly, instead of iterating over the file object you can call its .readline()
function to return one line of data at a time. For example, the code below reads just the first four lines.
# open a gzip file
ffile = gzip.open(gzfile, 'r')
print(ffile.readline().decode())
print(ffile.readline().decode())
print(ffile.readline().decode())
print(ffile.readline().decode())
# close it
ffile.close()
Let's take a side quest now and read some details of the fastq file format here. This is a file format for next-generation sequence data that we will use frequently throughout this course. We will discuss it in detail several times. Fastq files can be very large, often multiple gigabytes (Gb) in size. The example fastq file that we downloaded in the beginning of this notebook is quite small to make it easy to work with in this tutorial.
As the link above describes, the fastq format stores labeled sequence data in groups of four lines at a time, meaning that one sequenced read (a length of DNA information) is written over four lines. The first line labels the read with unique identifying information. The second line contains the sequence data. The third line is a spacer or can contain optional information. The fourth line contains quality scores for each base in the read.
Let's reuse the list object fastqlines
that we created above by reading in the entire file and splitting on newline breaks. Because it is a list we can easily use indexing to select just one or more lines at a time.
# the first line: identifier
fastqlines[0]
# the second line: sequence data
fastqlines[1]
# the third line: spacer/repeat
fastqlines[2]
# the fourth line: quality scores
fastqlines[3]
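Since the format is strictly four lines per read, we can also group the flat list of lines into one chunk per read by slicing in steps of four. A small sketch with made-up example data (the identifiers and sequences below are invented):

```python
# invented example data: two fastq reads, four lines each
fastqlines = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "TTGGCCAA", "+", "JJJJJJJJ",
]

# slice the list into chunks of four lines, one chunk per read
reads = [fastqlines[i:i + 4] for i in range(0, len(fastqlines), 4)]

print(len(reads))       # 2
print(reads[0][0])      # @read1
print(reads[1][1])      # TTGGCCAA
```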
The quality scores in the fastq sequence format are stored using an ASCII encoding, which is a way of representing a number using a single character of text. This data was generated on a modern Illumina machine, so each score Q is stored as the ASCII character whose numeric code is Q + 33 (a relatively arbitrary convention that has been widely adopted). Python has the function ord()
to convert a character to its integer code, and chr()
to convert an integer to its ASCII character.
## convert string to int
ord("A")
## convert int to str
chr(65)
## get first 10 phred scores from a line from the fastq file
phreds = fastqlines[3][:10]
print(phreds)
## get ASCII for a string of phred scores
print([ord(i) for i in phreds])
## subtract the built-in offset amount from each number
phred33 = [ord(i) - 33 for i in phreds]
print(phred33)
# convert to probabilities of being wrong
print([10 ** (-i / 10) for i in phred33])
From looking at the fastq file data we can see that the reads can also be separated by splitting on the two-character string "\n@"
, since the identifier on the first line of each read always starts with a "@" character. Splitting the file into string objects that each represent a separate read, instead of separate lines, can make the file easier to parse. Let's try this now by parsing the file and counting the number of reads.
## split the fdata string on each occurrence of "\n@"
freads = fdata.split("\n@")
## print the first element in the list
print("The first read: \n{}".format(freads[0]))
## print the last element in the list
print("\nThe last read: \n{}".format(freads[-1]))
## print the number of reads in the file
print("\nN reads in the file = {}".format(len(freads)))
In Python there is a special keyword called with
that is used to create a context: everything indented under the with statement runs inside that context, and cleanup happens automatically when the block ends. This is most often used for opening file objects, because a file opened in a with
statement is automatically closed when the with
block ends. In other words, this is a shortcut that makes your code a little shorter and safer, by avoiding having to call .close()
on every file. When execution leaves the indentation under the with statement the file object is closed for you.
You can see the similarity between the standard syntax and the simplified syntax using a with
statement. Both are shown below for comparison.
# standard method for reading data
infile = open("./datafiles/iris-data-dirty.csv", 'r')
data = infile.read()
infile.close()
# simplified method that will automatically close the file.
with open("./datafiles/iris-data-dirty.csv", 'r') as infile:
data = infile.read()
print(data[:100])
The standard way to use the requests
library is to make a GET request to a URL, which is a request to read the data from that page. This returns a response
object which we can then inspect for information. The response
object will contain an error message if the URL is invalid or blocked, and it will contain the text of the webpage if the request is successful.
We used this method to download data at the top of this notebook. Now we'll look at it in a bit more detail.
# store urls as strings
url1 = "https://eaton-lab.org/data/40578.fastq.gz"
url2 = "https://eaton-lab.org/data/iris-data-dirty.csv"
The requests.get()
function returns a new variable 'response', which is a Python object just like the other object types we've learned about. We can access functions of this object using tab completion.
# see the response object (200 means successful GET)
response = requests.get(url2)
response
# show the first 50 characters of data
response.text[:50]
# split the string of text on each newline character
lines = response.text.split("\n")[:10]
lines
That is all we need to know about the requests
library for now. It is simple to use and convenient for fetching data from the web.
It can be useful to split a string into separate elements as a list. We've done this several times already by calling .split()
on a string. There is also a reverse type of function that can take many elements in a list and combine them together into a single string. This function is called .join()
.
The trick to using .join()
is to remember that although it operates on a list or tuple of inputs, it is actually a function of a string object. That is because you call .join()
from the string that you want placed in between the elements of the list. Some examples are below.
# a list of string items
elist = ["dogs", "cats", "bats", "elephants"]
# call .join from the string you want to combine them
"--".join(elist)
The example above shows that a list of animal names can be combined by a text separator, in this case "--". OK, that doesn't seem all that useful yet though, does it? Well, the real strength of the .join()
function is usually for combining items with a text delimiter composed of spaces, tabs, or newlines. Below are some examples.
print(" ".join(elist))
print("\t".join(elist))
print("\n".join(elist))
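Because .split() and .join() are inverses of one another, joining with the same delimiter you split on reconstructs the original string. A quick sketch using an invented CSV-style line:

```python
# an invented comma-separated line of column names
line = "sepal_length,sepal_width,petal_length,petal_width,species"

# split on commas to get a list of fields
fields = line.split(",")
print(fields)

# joining the fields with the same delimiter restores the original
print(",".join(fields) == line)   # True
```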
import random
def random_dna(length):
return "".join(random.choice("ACGT") for i in range(length))
def write_fasta(name, sequence):
# combine name and sequence into a string
fasta = ">" + name + "\n" + sequence
# write to file
with open("datafiles/sequence.fasta", 'w') as out:
out.write(fasta)
# test writing
write_fasta("tester", random_dna(20))
# read output to make sure it worked
with open("./datafiles/sequence.fasta") as indata:
print(indata.read())
def fastq_to_fasta(infastq, outfasta):
"""
Takes an input fastq.gz file and writes converted fasta.
"""
# read in the fastq data and split on lines
with gzip.open(infastq, 'rt') as indata:
lines = indata.read().strip().split("\n")
# convert to fasta
fasta = []
    for idx in range(0, len(lines), 4):
        # the identifier is the first line of each four-line read; stepping
        # by four avoids matching a quality-score line that starts with '@'
        fasta.append(">" + lines[idx][1:] + "\n" + lines[idx + 1])
# join fasta lines by newline
fastastring = "\n".join(fasta)
# write to newfile
with open(outfasta, 'w') as out:
out.write(fastastring)
# test calling the function
fastq_to_fasta("./datafiles/40578.fastq.gz", "./datafiles/40578.fasta")
# read new fasta file to test it
with open("./datafiles/40578.fasta", 'r') as indata:
print(indata.read())