Notebook 3.0: Files I/O

This notebook corresponds to chapter 7 (Input and Output) of the official Python tutorial: https://docs.python.org/3/tutorial/.

Learning objectives:

By the end of this exercise you will:

  1. Understand the concept of importing libraries.
  2. Be familiar with reading and writing data to files.
  3. Be introduced to the fastq genomic data file format.

Importing a package

Python is a very modular language: most of its utilities live in individual packages (also called modules or libraries) that must be loaded before we can use them. This keeps Python lightweight, since the base language does not load these extra utilities unless we ask it to. To load a package that is installed on our system we use the import statement, as shown below. Two of the packages we load here, os and gzip, are part of the standard library. The third, requests, is not part of the standard library and was installed separately; it is used to download data from the web.

In [1]:
import os
import gzip
import requests
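
The import statement comes in a few common forms. The short sketch below uses only standard-library packages; the alias gz and the variable names are arbitrary choices made for illustration.

```python
# import a whole package and reach its functions through the package name
import os

# import a package under a shorter alias (the alias 'gz' is arbitrary)
import gzip as gz

# import a single function from a submodule
from os.path import basename

# both calls run the same function, reached in two different ways
name_a = os.path.basename("datafiles/iris-data-dirty.csv")
name_b = basename("datafiles/iris-data-dirty.csv")
print(name_a, name_b)

# the alias points at the gzip package, so gz.open is gzip's open function
print(gz.open)
```

Which form to use is mostly a matter of readability. Importing the whole package, as this notebook does, makes it obvious where each function comes from.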

Download data files for this notebook

Last week we learned about using the bash language to run code in a terminal. One of the commands we used several times was the program wget, which is used to get data from the web, i.e., to download data from a URL. Run the bash script below to create a new folder and download two files into that folder. Notice the %%bash header which makes this code cell execute bash code instead of Python code.

In [2]:
%%bash

# creates a new directory
mkdir -p datafiles/

# downloads two files into the new directory
wget https://eaton-lab.org/data/40578.fastq.gz -q -O datafiles/40578.fastq.gz
wget https://eaton-lab.org/data/iris-data-dirty.csv -q -O datafiles/iris-data-dirty.csv

It turns out we can also perform the same task using Python. Here we will again create a new directory and download two files into it. We'll name the new directory datafiles2 to distinguish it from the first one, datafiles. In this case the Python version of the code looks quite a bit more complicated than the bash script. This isn't always the case; indeed, Python code is often much simpler to read. Throughout this notebook we will learn about the Python code in the cell below piece by piece, and by the end you should be able to understand it entirely.

In [3]:
# make a new directory
os.makedirs("datafiles2", exist_ok=True)

# the URL to file 1
url1 = "https://eaton-lab.org/data/40578.fastq.gz"

# open a file for writing and write the content to it
with open("./datafiles2/40578.fastq.gz", 'wb') as ffile:
    ffile.write(requests.get(url1).content)

# the URL to file 2
url2 = "https://eaton-lab.org/data/iris-data-dirty.csv"

# open a file for writing and write the content to it
with open("./datafiles2/iris-data-dirty.csv", 'wb') as ffile:
    ffile.write(requests.get(url2).content)

List directories

Another common tool that we learned in bash was the ls command, which is used to look at the contents of a location in the filesystem.

In [4]:
%%bash

# the ls bash command shows the contents of the folder
ls ./datafiles/
40578.fastq.gz
iris-data-dirty.csv

There is an equivalent command in Python that can be called from the os package of the standard library. Because we imported the os package at the top of this notebook all of its functions are available for us to use. All of the functions that are associated with the os package can be called from the os variable, like below. In this example we call the .listdir() function which is similar to the ls command in bash. An important difference between the two is that while the bash command above simply printed the results, the listdir() function below returns the results of the command as a Python list. Having the results in a list allows us to more easily perform further analyses on the contents of this location.

In [5]:
# the listdir() function from the os package is similar to 'ls'
os.listdir("datafiles2/")
Out[5]:
['40578.fastq.gz', 'iris-data-dirty.csv']
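
Because listdir() returns a list, the result can be filtered or counted like any other list. The sketch below is a minimal illustration that builds its own temporary directory with made-up file names, so it does not depend on the files downloaded above.

```python
import os
import tempfile

# make a throwaway directory holding three empty files (hypothetical names)
tmpdir = tempfile.mkdtemp()
for name in ["a.fastq.gz", "b.csv", "c.fastq.gz"]:
    open(os.path.join(tmpdir, name), 'w').close()

# listdir returns names in arbitrary order, so sort them
names = sorted(os.listdir(tmpdir))

# keep only the names that end with '.gz'
gz_files = [name for name in names if name.endswith(".gz")]

print(len(names), "files in total")
print(gz_files)
```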

Using packages

The os package has many functions, but we will be using just a small part of it today, primarily the path submodule. Just like everything else in Python, packages are objects, and so we can see all of the functions in this package using tab completion. Put your cursor after the period in the cell below and press <tab> to see the available options in os. There are many!

In [ ]:
## use tab-completion after the '.' to see available options in os
os.

Filepath operations with the os package

A filepath refers to a location on your computer's filesystem. For example, /home/deren/Documents/homework.docx could be the full path to a document on my computer.

Writing code that works with filepaths can be error prone. If the string representation of a filepath is off by even a single character then the path will not work. This becomes extra tricky when a program needs to access filepaths on different types of computers, since filepaths look different on a Mac and a PC. Here understanding the filesystem hierarchy that we learned in lesson 1 becomes important. Fortunately the os.path submodule makes it easy to write code for filepaths that will work seamlessly across different computers.

Using os.path

The os.path submodule is used to format filepaths. It can expand special characters in path names that are used as shortcuts (like the dot or double dot), it can join together multiple paths, and it can search for special directories like $HOME, or the current directory.

Essentially, the os.path submodule offers many of the same tools that we learned in bash scripting last week, such as pwd to show your current directory, or ~ as a shorthand for your home directory. Here we can access those filepaths as string variables and work with them very easily.

NB: The goal here is not for you to master the os package, but to understand that many such packages exist in the Python standard library and that you can use tab-completion, google search, and other sources to find them and learn to use them.

Absolute and relative paths

The two code cells below will print a different result depending on what your username is on your computer, and depending on where you opened this notebook from, respectively.

In [7]:
# return my $HOME directory
os.path.expanduser("~")
Out[7]:
'/home/deren'
In [8]:
# convert relative path to a full path
os.path.abspath('./')
Out[8]:
'/home/deren/Documents/genomics/3-python-advanced/notebooks'

The function above takes a relative path (the path "./" means here, where I am located), and it expands it into a full path, meaning that it explicitly shows the entire path from the root (/) to the file or directory.

Action: In the cell below write a *relative* path to the file called "iris-data-dirty.csv" that we downloaded earlier. It is located in a directory called "datafiles". Then use the function 'os.path.abspath()' and enter the relative path as an argument to expand it to print the full path.
In [9]:
path = "./datafiles/iris-data-dirty.csv"
os.path.abspath(path)
Out[9]:
'/home/deren/Documents/genomics/3-python-advanced/notebooks/datafiles/iris-data-dirty.csv'

Operations on filepaths

In [10]:
# assign my current location to a variable called 'curdir'
curdir = os.path.abspath('.')
curdir
Out[10]:
'/home/deren/Documents/genomics/3-python-advanced/notebooks'
In [11]:
# get the lowest level directory in 'curdir'
os.path.basename(curdir)
Out[11]:
'notebooks'
In [12]:
# get the directory containing 'curdir'
os.path.dirname(curdir)
Out[12]:
'/home/deren/Documents/genomics/3-python-advanced'
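
Two related functions, os.path.split() and os.path.splitext(), pull a path apart in a single step. The sketch below applies them to the example path from earlier in this section; no file needs to exist at that path because these functions only operate on the string.

```python
import os

# the example path used earlier in this notebook
path = "/home/deren/Documents/homework.docx"

# split into (directory, filename): dirname and basename in one call
directory, filename = os.path.split(path)

# split the filename into (name, extension)
stem, extension = os.path.splitext(filename)

print(directory)
print(filename)
print(stem, extension)
```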

Joining filepaths

Because it can be hard to keep track of the "/" characters between directories and filenames, it is useful to use the join() function of the os.path module to join together path names. Here we will create a string variable with a new pathname for a file that doesn't yet exist in our current directory. You can see in the three examples below that it doesn't matter whether we include a "/" after a directory name or not; the join function figures it out for us.

In [13]:
# see how os.path.join handles '/' characters in path names
print(os.path.join("/home/", "folder1/", "folder2", "newfile.txt"))
print(os.path.join("/home/", "folder1", "folder2", "newfile.txt"))
print(os.path.join("/home/", "folder1/", "folder2/", "newfile.txt"))
/home/folder1/folder2/newfile.txt
/home/folder1/folder2/newfile.txt
/home/folder1/folder2/newfile.txt

The os.path.join() function can take an unlimited number of arguments and it will join them together ensuring that there is the correct number of "/" characters between folders. Simple but effective.

In [14]:
# get the full path name to a newfile in our current directory
newfile = os.path.join(curdir, "newfile.txt")
newfile
Out[14]:
'/home/deren/Documents/genomics/3-python-advanced/notebooks/newfile.txt'
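
One behavior of join() is worth knowing about: a trailing "/" on a component is harmless, but a leading "/" marks an absolute path and causes everything before it to be discarded. The sketch below uses posixpath, which is the module that os.path refers to on Linux and macOS, so that the printed results are the same on any computer.

```python
import posixpath

# a trailing "/" is handled for us
a = posixpath.join("/home/", "folder1", "newfile.txt")
print(a)    # /home/folder1/newfile.txt

# a leading "/" starts a new absolute path: "/home/" is dropped
b = posixpath.join("/home/", "/folder1", "newfile.txt")
print(b)    # /folder1/newfile.txt

# normpath resolves the "." and ".." shortcuts inside a path
c = posixpath.normpath("/home/folder1/../folder2/./newfile.txt")
print(c)    # /home/folder2/newfile.txt
```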

Finding files

A key thing to be aware of when working with filepaths is that just because you've written a path does not mean that it actually exists. If you try to read a file from a path that doesn't exist, or to write a file into a directory that doesn't exist, Python will raise an error.

Here again the os.path module has some convenient functions for checking whether a path exists, specifically os.path.exists(). Below I show an example where the filepath does exist, and it returns True, and another example where the filepath has a typo, and it returns False.

In [15]:
# this path is correct
print(os.path.exists("./datafiles/iris-data-dirty.csv"))

# this path is incorrect, it is missing an 's' in the dir name
print(os.path.exists("./datafile/iris-data-dirty.csv"))
True
False
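
The related functions os.path.isfile() and os.path.isdir() additionally tell us what kind of thing a path points to. The sketch below is a minimal illustration using a temporary directory and made-up file names; it also shows the error that is raised when we try to open a path that does not exist.

```python
import os
import tempfile

# a throwaway directory containing one real file (hypothetical names)
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "data.csv")
bad = os.path.join(tmpdir, "dta.csv")    # deliberate typo, never created

ofile = open(good, 'w')
ofile.write("5.1,3.5,1.4,0.2,Iris-setosa\n")
ofile.close()

# columns: name, exists?, is it a file?, is it a directory?
for path in [good, bad, tmpdir]:
    print(os.path.basename(path), os.path.exists(path),
          os.path.isfile(path), os.path.isdir(path))

# opening a missing file for reading raises FileNotFoundError
message = None
try:
    open(bad, 'r')
except FileNotFoundError:
    message = "could not open " + os.path.basename(bad)
print(message)
```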

Writing files

The function open() is used to interact with a file, either to read data from it, write data to it, or to do both. The syntax for using the open() function is to provide two arguments like the following: open(filename, mode).

The filename is the name (path) of the file, and the mode is a short string describing how you plan to use the file. Options include w for 'write', r for 'read', or a for 'append'. If you do not enter a mode argument it will use the default mode, which is 'r', but it is generally good practice to be explicit with arguments when opening files. Being explicit helps prevent you from accidentally overwriting a file.

Below we will use the mode w to write. The 'w' mode has the special property that the file you name does not need to exist ahead of time (although the directory that will contain it does). By providing a filename in 'w' mode you are asking Python to create a new file with this name; if a file with this name already exists its contents are erased. Below we create a file and then return the object so that a description of it is shown as the cell output. As you can see, the object returned by calling the open() function is a type of object called an io.TextIOWrapper. Remember, everything in Python is an object. Read more on this below.

In [16]:
# get an open file object
ofile = open("./datafiles/helloworld.txt", 'w')

# return the file object
ofile
Out[16]:
<_io.TextIOWrapper name='./datafiles/helloworld.txt' mode='w' encoding='UTF-8'>

File objects

As with other objects, this variable ofile has attributes and functions that we can access and see by using tab-completion. Move your cursor to the end of the object below after the period and use tab to see some of the options.

In [ ]:
## use tab to see options associated with open file objects
ofile.

With a file object open in 'w' mode the most common thing to do next is to write data to the file. For this you use the .write() function and provide it with a string to be written to the file. Below we write the words "Hello world" to the ofile object, which, remember, was opened to write to the filepath "./datafiles/helloworld.txt". Finally, after calling the write function we need to close the file. This tells the system that no more data will be written to the file for now, and ensures that the data we wrote are actually saved to disk.

In [19]:
# It returns the number of characters written.
ofile.write("Hello world")
Out[19]:
11
In [20]:
# when we are done writing to the file use .close()
ofile.close()
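
To see how the 'w' and 'a' modes differ, the sketch below writes to the same file three times and then reads it back. The file lives in a temporary directory and its name is made up, so nothing in datafiles/ is touched.

```python
import os
import tempfile

# a throwaway file path so that nothing real is overwritten
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'w' creates the file
ofile = open(path, 'w')
ofile.write("first\n")
ofile.close()

# 'w' again: the existing contents are erased before writing
ofile = open(path, 'w')
ofile.write("second\n")
ofile.close()

# 'a' keeps the existing contents and adds to the end
ofile = open(path, 'a')
ofile.write("third\n")
ofile.close()

# read the file back to see what survived
ifile = open(path, 'r')
contents = ifile.read()
ifile.close()
print(contents)
```

Only "second" and "third" remain: the second 'w' call erased "first", while the 'a' call added to what was already there.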

Reading files

To read data from a file we need to first open a file object, just like when we wrote to a file, but now we use the mode flag r. You can now access the data in the file using the .read() function. Below when the .read() function is called it returns the contents of the file as a string, which is stored to a variable called idata. Finally, just like when writing, it is good practice to close a file when you are done reading the contents from it.

In [21]:
# open a file object for reading
ifile = open("./datafiles/iris-data-dirty.csv", 'r')

# read all contents of the file as a string
idata = ifile.read()

# close the file object
ifile.close()

Now that we've stored the contents of the file in the variable idata we can interact with it just like it is any other string, since it is now a string object.

In [22]:
## print the first 100 characters of idata
print(idata[:100])
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,
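
Once the contents are a string, the usual next step is to break each line into its fields. The sketch below does this for a two-line sample copied from the output above rather than for the whole file. The real file is named "dirty" and may contain entries that float() cannot convert, so treat this as a starting point only.

```python
# two lines in the same layout as the iris file
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

rows = []
for line in sample.strip().split("\n"):

    # split the line on commas to get the five fields
    fields = line.split(",")

    # the first four fields are measurements, the last is the species name
    measurements = [float(value) for value in fields[:4]]
    species = fields[4]
    rows.append((measurements, species))

print(rows[0])
print(len(rows), "rows parsed")
```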

Gzip compressed files

Gzip compression is easily handled in Python using the standard library. The gzip module has an open() function that acts just like the regular open function to create a file object. You just use the gzip version instead of the regular open function to open and read a gzipped file properly. Let's try it out on the compressed fastq file we downloaded earlier. We'll also practice using os.path to find the full filepath of the 40578.fastq.gz file.

Note: there is one extra function below that we won't discuss much for now, but which we covered a bit in class. This is the .decode() function. It is necessary in Python 3 to convert data from bytes to a string. Bytes are the raw form in which data are stored in a file, and gzip.open() opens files in binary mode by default, so .read() returns bytes instead of a string. Because strings are easier to work with we convert the result to a string.

In [23]:
# get full path to the file in our current directory
gzfile = os.path.abspath("./datafiles/40578.fastq.gz")
In [24]:
# open a gzip file
ffile = gzip.open(gzfile, 'r')

# read compressed data from this file and decode it
fdata = ffile.read().decode()

# close the file
ffile.close()
In [25]:
# print part of the string 'fdata'
print(fdata[:500])
@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN
+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################
@40578_rex.2 GRC13_0027_FC:4:1:13011:1181 length=74
TGCAGAGTCTACCCAAAGGTTCAGGCCGNNNNNNNNNNNNNNGTTNATACGTNTNNNTATTTCTATGAGAANCN
+40578_rex.2 GRC13_0027_FC:4:1:13011:1181 length=74
GGGGHHHHHHHHHHHHHHHEBG<G?;??#######################################
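
An alternative to calling .decode() is to open the gzip file in text mode by adding a "t" to the mode string, in which case .read() returns a string directly. The sketch below compares the two modes on a tiny made-up fastq file written to a temporary directory.

```python
import gzip
import os
import tempfile

# write a tiny gzip file to a throwaway location ('wt' = write text)
path = os.path.join(tempfile.mkdtemp(), "tiny.fastq.gz")
with gzip.open(path, 'wt') as out:
    out.write("@read1\nACGT\n+\nIIII\n")

# binary mode: read() returns bytes
with gzip.open(path, 'rb') as infile:
    as_bytes = infile.read()

# text mode: read() returns a string, no decode needed
with gzip.open(path, 'rt') as infile:
    as_text = infile.read()

print(type(as_bytes))
print(type(as_text))
print(as_text.split("\n")[0])
```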

Reading and parsing data with the .read() function

The read() function is nice for reading in a large chunk of text, but it then requires us to parse that text using string processing. This is because all of the data is loaded as one big chunk of text. It then usually requires us to split this text using some kind of delimiter. Let's try splitting the fastq data on newline characters ("\n") by using the string function .split(). The result is stored as a new variable below as a list, with each line of the file as an element in the list.

In [26]:
# split fastq string data on newline characters to return a list
fastqlines = fdata.split("\n")

# print the first 10 list elements
fastqlines[:10]
Out[26]:
['@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74',
 'TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN',
 '+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74',
 'IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################',
 '@40578_rex.2 GRC13_0027_FC:4:1:13011:1181 length=74',
 'TGCAGAGTCTACCCAAAGGTTCAGGCCGNNNNNNNNNNNNNNGTTNATACGTNTNNNTATTTCTATGAGAANCN',
 '+40578_rex.2 GRC13_0027_FC:4:1:13011:1181 length=74',
 'GGGGHHHHHHHHHHHHHHHEBG<G?;??##############################################',
 '@40578_rex.3 GRC13_0027_FC:4:1:15237:1184 length=74',
 'TGCAGAGTCCTAAATCTATTTCCTCTTCNNNNGNNNNNNNATGCATGCAACCTCCNNTCGCCACCTGTACGNAN']

Reading and parsing data with the .readline() function

When we learned about the bash command cat I explained that it is very efficient to use because the data in a file are streamed to the output instead of all loaded at once. This is also possible with Python. In fact, a file object opened in 'read' mode is an iterator, which means that you can loop over the object and it will return one line at a time until the end of the file. Below is an example where we iterate through the lines of the file to count how many there are. This was done without loading the entire file at once.

In [27]:
# open a gzip file
ffile = gzip.open(gzfile, 'r')

# an integer counter starting at 0
nlines = 0

# the file object is iterable, each element a line
for i in ffile:
    nlines += 1
    
# print how many lines in the file
print(nlines, "lines in the file.")

# close the file
ffile.close()
500 lines in the file.

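Instead of iterating over the file object you can also call its .readline() function explicitly to return one line of data at a time. For example, the code below reads just the first four lines. Each line ends with a newline character and print() adds another, which is why a blank line appears after each line in the output.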

In [28]:
# open a gzip file
ffile = gzip.open(gzfile, 'r')

print(ffile.readline().decode())
print(ffile.readline().decode())
print(ffile.readline().decode())
print(ffile.readline().decode())

# close it
ffile.close()
@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74

TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN

+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74

IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################
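
Iterating over a file object also makes it easy to stop early and to remove the trailing newline from each line with .strip(). The sketch below is a minimal illustration on a small made-up text file; it collects only the first three lines and never reads the rest.

```python
import os
import tempfile

# write a small five-line text file to a throwaway location
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, 'w') as out:
    out.write("line one\nline two\nline three\nline four\nline five\n")

first_lines = []
with open(path, 'r') as infile:

    # enumerate counts the lines as we iterate over them
    for idx, line in enumerate(infile):

        # stop once three lines have been collected
        if idx == 3:
            break

        # strip removes the newline character from the end of the line
        first_lines.append(line.strip())

print(first_lines)
```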

The fastq file format

Let's take a side quest now and read some details of the fastq file format here. This is a file format for next-generation sequence data that we will use frequently throughout this course. We will discuss it in detail several times. Fastq files can be very large, often multiple gigabytes (GB) in size. The example fastq file that we downloaded at the beginning of this notebook is quite small to make it easy to work with in this tutorial.

As the link above describes, the fastq format stores labeled sequence data in groups of four lines, meaning that one sequenced read (a length of DNA information) is written over four lines. The first line labels the read with unique identifying information. The second line contains the sequence data. The third line is a spacer or can contain optional information. And the fourth line contains quality scores for each base in the read.

Let's reuse the list object fastqlines that we created above by reading in the entire file and splitting on newline breaks. Because it is a list we can easily use indexing to select just one or more lines at a time.

In [29]:
# the first line: identifier
fastqlines[0]
Out[29]:
'@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74'
In [30]:
# the second line: sequence data
fastqlines[1]
Out[30]:
'TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN'
In [31]:
# the third line: spacer/repeat
fastqlines[2]
Out[31]:
'+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74'
In [32]:
# the fourth line: quality scores
fastqlines[3]
Out[32]:
'IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################'
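
The four-line layout combines naturally with .readline(): calling it four times in a row returns exactly one read. The sketch below streams through a tiny made-up fastq file one read at a time, which is the approach that scales to files too large to load at once. The file name and the two reads are invented for illustration.

```python
import gzip
import os
import tempfile

# write a tiny two-read fastq file to a throwaway location
path = os.path.join(tempfile.mkdtemp(), "tiny.fastq.gz")
with gzip.open(path, 'wt') as out:
    out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIHH\n")

reads = []
with gzip.open(path, 'rt') as infile:
    while True:

        # readline returns an empty string at the end of the file
        header = infile.readline()
        if not header:
            break

        # the next three lines belong to the same read
        sequence = infile.readline()
        spacer = infile.readline()
        quality = infile.readline()

        # store the name (without '@'), the sequence, and the qualities
        reads.append((header.strip()[1:], sequence.strip(), quality.strip()))

print(len(reads), "reads")
print(reads[0])
```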

Phred quality scores

The quality scores in the fastq sequence format are stored using an ASCII encoding, which is a way of representing a number using a single character of text. This data was generated on a modern Illumina machine, and so each score is stored as the ASCII character whose numeric code is the score plus 33 (the offset of 33 is a convention; it ensures that every score maps to a printable character). To recover a score we therefore convert the character to its numeric code and subtract 33. Python has the function ord() to convert a single-character string to its integer code, and chr() to convert an integer code back to a character.

In [33]:
## convert string to int
ord("A")
Out[33]:
65
In [34]:
## convert int to str
chr(65)
Out[34]:
'A'
In [35]:
## get first 10 phred scores from a line from the fastq file
phreds = fastqlines[3][:10]
print(phreds)
IIIIIIHIII
In [36]:
## get ASCII for a string of phred scores
print([ord(i) for i in phreds])
[73, 73, 73, 73, 73, 73, 72, 73, 73, 73]
In [37]:
## subtract the built-in offset amount from each number 
phred33 = [ord(i) - 33 for i in phreds]
print(phred33)
[40, 40, 40, 40, 40, 40, 39, 40, 40, 40]
In [38]:
# convert to probabilities of being wrong
print([10 ** (-i / 10) for i in phred33])
[0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.00012589254117941674, 0.0001, 0.0001, 0.0001]
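
The steps above can be collected into two small functions. A Phred score Q and the probability P that the base call is wrong are related by Q = -10 * log10(P), so P = 10 ** (-Q / 10). The function names below are made up for this sketch, and the offset of 33 is assumed as described above.

```python
def phred_scores(quality, offset=33):
    """Convert a fastq quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probabilities(quality, offset=33):
    """Convert a fastq quality string to per-base error probabilities."""
    return [10 ** (-score / 10) for score in phred_scores(quality, offset)]


# the first ten quality characters of the first read
qual = "IIIIIIHIII"
scores = phred_scores(qual)
print(scores)

# the average quality score over these ten bases
mean_score = sum(scores) / len(scores)
print(mean_score)

# the '#' character that fills the ends of several reads is a very low score
print(phred_scores("#"))
```

A score of 40 corresponds to a 1 in 10,000 chance of error, while the score of 2 encoded by "#" corresponds to an error probability above one half.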

Parsing (splitting) text on different characters

From looking at the fastq file data we can see that each four-line element could also be separated by splitting on the string "\n@" (a newline followed by "@"). This is because the identifier in the first line of each read always starts with a "@" character. Splitting the file into string objects that represent separate reads, instead of just lines, can make it easier to parse and read the file. Let's try this now by parsing the file and counting the number of reads.

In [39]:
## split the fdata string on each occurrence of "\n@"
freads = fdata.split("\n@")

## print the first element in the list
print("The first read: \n{}".format(freads[0]))

## print the last element in the list
print("\nThe last read: \n{}".format(freads[-1]))

## print the number of reads in the file
print("\nN reads in the file = {}".format(len(freads)))
The first read: 
@40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN
+40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
IIIIIIHIIIIIIIIIGIIIHIIIBB:B##############################################

The last read: 
40578_rex.125 GRC13_0027_FC:4:1:2571:1496 length=74
TGCAGCTCACGGTCGTGAGGGTGAGCTTATTTTTTTGTGAACTGTCTCAACTGCTCGTGAGGGTCCTCACGATT
+40578_rex.125 GRC13_0027_FC:4:1:2571:1496 length=74
IIIIIGHIIIIIHIIIIFIIIDIHGIIIBGIIFIDIDIHHIDIHEIHIIIEEEIHIIE>CEEE:DDBDDFECC8


N reads in the file = 125
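
A word of caution about splitting on "\n@": the "@" character is also a valid quality character (its ASCII code is 64, which encodes a score of 31), so a quality line can begin with "@" too. That did not happen in our file, since 125 reads matches the 500 lines we counted earlier, but it can happen in other files. The sketch below uses two made-up reads to show the problem and a count based on line numbers that avoids it.

```python
# two made-up reads; the quality line of the first read starts with "@"
fastq = (
    "@read1\nACGT\n+\n@III\n"
    "@read2\nTTGA\n+\nIIHH\n"
)

# splitting on "\n@" also splits in front of the quality line
pieces = fastq.strip().split("\n@")
print(len(pieces), "pieces from split")

# counting lines and dividing by four does not depend on the characters
lines = fastq.strip().split("\n")
nreads = len(lines) // 4
print(nreads, "reads from counting lines")
```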

Using context to automatically open & close files

In Python there is a special keyword called with that opens a block of code inside a context. The object created in the with statement (here, a file object) is available to everything indented beneath it, and it is cleaned up automatically when the indented block ends. This is most often used for opening file objects: a file opened in a with statement is closed automatically when you leave the indented block, even if an error occurs inside it. In other words, this is a shortcut that makes your code a little shorter and safer, since you avoid having to call .close() on every file.

You can see the similarity between the standard syntax and the simplified syntax using a with statement. Both are shown below for comparison.

In [40]:
# standard method for reading data
infile = open("./datafiles/iris-data-dirty.csv", 'r')
data = infile.read()
infile.close()
In [41]:
# simplified method that will automatically close the file.
with open("./datafiles/iris-data-dirty.csv", 'r') as infile:
    data = infile.read()
In [42]:
print(data[:100])
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,
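
File objects have a .closed attribute that lets us check this behavior directly. The sketch below, which writes to a made-up file in a temporary directory, shows that the file is open inside the with block, closed after it, and also closed when an error interrupts the block.

```python
import os
import tempfile

# a throwaway file path
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, 'w') as out:
    out.write("Hello world")
    inside = out.closed          # False: still open inside the block

after = out.closed               # True: closed when the block ended
print(inside, after)

# the file is closed even if an error is raised inside the block
try:
    with open(path, 'r') as infile:
        raise ValueError("something went wrong")
except ValueError:
    closed_after_error = infile.closed
print(closed_after_error)
```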

Downloading data from the web in Python

The standard way to use the requests library is to make a GET request to a URL, which is a request to read the data at that address. This returns a response object that we can then inspect. The response object stores a status code that tells us whether the request succeeded (200 means success, whereas a code such as 404 means the page was not found), and if the request was successful it also stores the content that was returned, such as the HTML text of a webpage or, as in our case, the contents of a data file.

We used this method to download data at the top of this notebook. Now we'll look at it in a bit more detail.

In [43]:
# store urls as strings
url1 = "https://eaton-lab.org/data/40578.fastq.gz"
url2 = "https://eaton-lab.org/data/iris-data-dirty.csv"

The requests.get() function returns a response object, which we store in a new variable called 'response'. It is a Python object just like the other object types we've learned about, and we can explore its functions using tab completion.

In [44]:
# see the response object (200 means successful GET)
response = requests.get(url2)
response
Out[44]:
<Response [200]>
In [45]:
# show the first 50 characters of data
response.text[:50]
Out[45]:
'5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-s'
In [46]:
# split the string of text on each newline character and keep the first 10 lines
lines = response.text.split("\n")[:10]
lines
Out[46]:
['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa']

That is all we need to know about the requests library for now. It is simple to use and convenient for fetching data from the web.

Join: combine multiple string elements into a single string.

It can be useful to split a string into separate elements as a list. We've done this several times already by calling .split() on a string. There is also a reverse type of function that can take many elements in a list and combine them together into a single string. This function is called .join().

The trick to using .join() is to remember that although it operates on a list or tuple of inputs, it is actually a function of a string object. You call .join() from the string that you want placed between the elements of the list, and pass the list as the argument. Some examples are below.

In [47]:
# a list of string items
elist = ["dogs", "cats", "bats", "elephants"]
In [48]:
# call .join from the string you want to combine them
"--".join(elist)
Out[48]:
'dogs--cats--bats--elephants'

The example above shows that a list of animal names can be combined by a text separator, in this case "--". OK, that doesn't seem all that useful yet though, does it? Well, the real strength of the .join() function is usually for combining items with a text delimiter composed of spaces, tabs, or newlines. Below are some examples.

In [49]:
print(" ".join(elist))
dogs cats bats elephants
In [50]:
print("\t".join(elist))
dogs	cats	bats	elephants
In [51]:
print("\n".join(elist))
dogs
cats
bats
elephants
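
Used together, .split() and .join() make it easy to change the delimiter of a line of data. The sketch below converts one comma-separated line from the iris file into a tab-separated line. It also shows that .join() only accepts strings, so numbers must be converted with str() first.

```python
# one line from the iris data file
line = "5.1,3.5,1.4,0.2,Iris-setosa"

# split on commas, then join the pieces back together with tabs
fields = line.split(",")
tab_line = "\t".join(fields)
print(tab_line)

# join only works on strings, so convert numbers with str() first
numbers = [5.1, 3.5, 1.4, 0.2]
number_line = ",".join(str(value) for value in numbers)
print(number_line)
```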

Challenges

Action: This challenge starts with a repeat from your last notebook. Write a function below that takes an integer argument to generate a random sequence of DNA of a given length that will be returned as a string. Hint: You will need to import the random library.
In [52]:
import random

def random_dna(length):
    return "".join(random.choice("ACGT") for i in range(length))
Action: This challenge should use your function from the above challenge as part of your answer. Write code below to combine a fasta header (e.g., "> sequence name") and a random sequence of DNA to create a valid fasta data string. Then write the data to a file and save it as "datafiles/sequence.fasta". Hint: You can organize your code into a function if you want, and then call it, or you can do this by just writing a few lines of code. If you do not remember the format of fasta files then use google or look back at the lecture slides.
In [57]:
def write_fasta(name, sequence):
    
    # combine name and sequence into a string
    fasta = ">" + name + "\n" + sequence
    
    # write to file
    with open("datafiles/sequence.fasta", 'w') as out:
        out.write(fasta)
    
# test writing
write_fasta("tester", random_dna(20))

# read output to make sure it worked
with open("./datafiles/sequence.fasta") as indata:
    print(indata.read())
>tester
GAGTCTTGGTCACGGGGATA
Action (Hard Challenge): You have now learned about two sequence file formats, fasta and fastq. Fastq contains more information than fasta since it also stores quality information for each base. Your challenge here is to write a function to convert a fastq file into fasta format. All of the code you need is composed in snippets in examples above. Feel free to use google or the chatroom to seek further help if needed. Your answer must: (1) Write a function; (2) The function must read the 'datafiles/40578.fastq.gz' file from disk; (3) It must convert the data to fasta format; and (4) It must write the result to a file "datafiles/40578.fasta". Be sure you look at your fasta file after you write it to check that it looks how you expect. If not, modify your code and try again. This is an advanced level challenge, do not get discouraged if you find it difficult. But do try your best to solve the problem and seek help if needed.
In [64]:
def fastq_to_fasta(infastq, outfasta):
    """
    Takes an input fastq.gz file and writes converted fasta.
    """
    # read in the fastq data and split on lines
    with gzip.open(infastq, 'rt') as indata:
        lines = indata.read().strip().split("\n")
        
    # convert to fasta
    fasta = []
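    # each read spans four lines, so step through the list four at a time;
    # testing for a leading "@" is not reliable because a quality line can
    # also begin with "@"
    for idx in range(0, len(lines), 4):
        
        # save the header line (without its "@") and the sequence line
        fasta.append(">" + lines[idx][1:] + "\n" + lines[idx + 1])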
    
    # join fasta lines by newline
    fastastring = "\n".join(fasta)
    
    # write to newfile
    with open(outfasta, 'w') as out:
        out.write(fastastring)
    

# test calling the function
fastq_to_fasta("./datafiles/40578.fastq.gz", "./datafiles/40578.fasta")

# read new fasta file to test it
with open("./datafiles/40578.fasta", 'r') as indata:
    print(indata.read())
>40578_rex.1 GRC13_0027_FC:4:1:10524:1181 length=74
TGCAGCATAGCATAGATAATACAAGGTTNNNNNNNNNNNNNNTTTNCACAGTNTNNNATTAAACCCGGTAGNTN
>40578_rex.2 GRC13_0027_FC:4:1:13011:1181 length=74
TGCAGAGTCTACCCAAAGGTTCAGGCCGNNNNNNNNNNNNNNGTTNATACGTNTNNNTATTTCTATGAGAANCN
>40578_rex.3 GRC13_0027_FC:4:1:15237:1184 length=74
TGCAGAGTCCTAAATCTATTTCCTCTTCNNNNGNNNNNNNATGCATGCAACCTCCNNTCGCCACCTGTACGNAN
>40578_rex.4 GRC13_0027_FC:4:1:4657:1192 length=74
TGCAGGGTATAAATGTTTATTAGAAGATTAAGANNNNGCTGCACAAAAACCATATGACATTAAAAGAAACTCAC
>40578_rex.5 GRC13_0027_FC:4:1:6218:1191 length=74
TGCAGTATAGGTGCTAAAATACATCATTAACAANNNNCTTTCTTATAATTATTTAATGTTTCATAGCATTTAAN
>40578_rex.6 GRC13_0027_FC:4:1:11872:1189 length=74
TGCAGGCAAATTATGGCAGTTGAAATGAAGAAANNNNNNTAAAATGACTGCTAATTTTTTGTTAAAATGTAATN
>40578_rex.7 GRC13_0027_FC:4:1:15437:1199 length=74
TGCAGTGTTTATTCTTTTGTTTGACACAAATTAANTCCTTTAGTTGGTGAACGACCAAACTCGACCAAACTCAA
>40578_rex.8 GRC13_0027_FC:4:1:17455:1193 length=74
TGCAGAGCAAATAATTCTGCTAAATCTACTGAANNNNTTCTTGTTTGAGAACCCGATTAGCCTGGGCTTGCTTN
>40578_rex.9 GRC13_0027_FC:4:1:2960:1206 length=74
TGCAGGAGTTGGGGCATGATCTGGCTCAGCTCTGCGCGAAAGCGTCCAGCTGAGGGCGGGAAGGAGTCTTGTAG
>40578_rex.10 GRC13_0027_FC:4:1:7907:1211 length=74
TGCAGAAGTAGGAATAATGGCACCCGAGATAATATTGTTTCCATAAAGTAGAGATCCAGAAACAGGTTCACGAA
>40578_rex.11 GRC13_0027_FC:4:1:9051:1207 length=74
TGCAGAGCAAATTACAAGTTTGATCCTTTTGAAGAGTGCATCCATATTTGCATGATTCACTGTCTTCCCTTGCC
>40578_rex.12 GRC13_0027_FC:4:1:14147:1201 length=74
TGCAGAATAATTTTGGGGTTGGGAGAAAATGGAAACTGATATTGAGCATTAAACAATTTTAGTCATTTTTCTTT
>40578_rex.13 GRC13_0027_FC:4:1:2000:1223 length=74
TGCAGTTTAAGTAGTCAAACTATAGTTCAACTATAGTTAACTATAATTAAACTGTAGTCGGCCTACAGGAAAAC
>40578_rex.14 GRC13_0027_FC:4:1:3648:1214 length=74
TGCAGTCGTCTTTGTGTTGGCTTGACCTCCGCACACGCTAGGCCAAACAGTTTCGCTGCCAAGATGGCGCGCCT
>40578_rex.15 GRC13_0027_FC:4:1:11699:1218 length=74
TGCAGCGAGTTATTACTGCTCGTCTGAGGGTCTGCGGCTTCTCTGGTAAGCCATTTATGGATGTGGGGACCCAA
>40578_rex.16 GRC13_0027_FC:4:1:14339:1212 length=74
TGCAGAGAGTAAAACAAACAAGATTTGGGAAATTGGGAGCATTGAGCTAATTAAATTCGGTTATACCTTCAAAA
>40578_rex.17 GRC13_0027_FC:4:1:14649:1217 length=74
TGCAGACACGTCACCTTAGCGGTAGGTTAGCGGTAGGTCTACCACCGCTAAGTCACCGGTAACGGGCCCAAAAT
>40578_rex.18 GRC13_0027_FC:4:1:4276:1233 length=74
TGCAGTCCAGTTCAAGACCCGCTCCCCCGGTTTCAAGTACACACCCGTATCATGTGGCAGCTTACGAGCGAGCC
>40578_rex.19 GRC13_0027_FC:4:1:4871:1229 length=74
TGCAGGTCCACCTCACCATTCATGATGTTATATGAAGATGACGAGGTTGTTGTTGTTGTTGTTGCCGCCTCCAT
>40578_rex.20 GRC13_0027_FC:4:1:10403:1234 length=74
TGCAGGTTGTCAGTGGCGAAGGCAAACACGATGACAGACACACACAGCAGTGGTAGCAGTAACAGAGAGAAGCA
>40578_rex.21 GRC13_0027_FC:4:1:13557:1230 length=74
TGCAGGCCTAACATCTGAAACCCTGATCAGGTCATCAAAACGCCTCTGAAATGGCCAAAGCCTCAAATCCTCCC
>40578_rex.22 GRC13_0027_FC:4:1:13721:1226 length=74
TGCAGTACTGTAGCTGAAATGAATCTCTAAGAAGGGGAATACACAAAATTATACTCCCTCCGTCCAAAAATAAT
>40578_rex.23 GRC13_0027_FC:4:1:16132:1232 length=74
TGCAGGTTTTTTTATAACCCAACATTTTCCTATTGTAGGTCCCGCCAAACTCACGGGAAATATCAAATACCCAA
>40578_rex.24 GRC13_0027_FC:4:1:2473:1246 length=74
TGCAGCTCGTGGGGCCCCGGCGGCGGAATCACCGGCGGCGTTTCGTTGCTCTTCTTCGTCACTATCCCCATTGT
>40578_rex.25 GRC13_0027_FC:4:1:4508:1240 length=74
TGCAGATAGATGTGAAGTGTGGACCATTAACATCAATTCTCAGTTGACTCACTTTTTTTTTTCATCATACTAGG
>40578_rex.26 GRC13_0027_FC:4:1:7925:1245 length=74
TGCAGCATCTAAGAATGTCCCTCTGTGGGAACTTGTGTGGTTCTCCTTTGTCGCCAATATGTCCATCACCGGAA
>40578_rex.27 GRC13_0027_FC:4:1:9503:1238 length=74
TGCAGTGAAATGTCCGGTTATTTTATACGTTCTTGTTCCAGATCTACCCAAAAGGTAACCTACGATGCTTATGA
>40578_rex.28 GRC13_0027_FC:4:1:9591:1246 length=74
TGCAGCCCCTGCTTCTTCAGGCGGAACTCCAGGTTGAGGAGTTACTCGGAATGCTGCCAAGATATCAGTATCTT
>40578_rex.29 GRC13_0027_FC:4:1:19167:1247 length=74
TGCAGAGAATTCCAATATGGTGGAGATGGTACTATTGGGCAAACCCCGTTGCTTGGAGTCTGTATGGCCTTGTT
>40578_rex.30 GRC13_0027_FC:4:1:4916:1256 length=74
TGCAGACACGAGTCATGGGCCTTGTACAGAAGACACGAGGCATTAGTCTTTCATTCTTAGGGAGTTATTTTATC
>40578_rex.31 GRC13_0027_FC:4:1:8950:1250 length=74
TGCAGGATTGACCCAGCTAAGATCGTCAATTGAAGATGTATCTCCCACTGGTGGCAGAAACTCCGGCTGGTTGC
>40578_rex.32 GRC13_0027_FC:4:1:13649:1253 length=74
TGCAGTTGCTAAAAATTCTGGGGCCATTCTGCAATTTCTGGATGTTTTTGAGTCAAAACACAGAATCTTGAAAT
>40578_rex.33 GRC13_0027_FC:4:1:15558:1251 length=74
TGCAGGGGCAAACTTCCAATTCGAATCTTATCTCGTTGATTCCGATTCTGTTCCTACGGGACGGGCCAAAGTCG
>40578_rex.34 GRC13_0027_FC:4:1:8196:1260 length=74
TGCAGCCAAACAACAAACAAAACCTATACATTAGCTACAGCGATTGCTTGTTGACCGTCTTTGATGACTTGGGC
>40578_rex.35 GRC13_0027_FC:4:1:1966:1271 length=74
TGCAGAAAAAATTACAGAAACAAGAAATTCGTAAAATAATAAAGTTTTTTTTTTTTTTTTTAACCTTTGTCTTT
>40578_rex.36 GRC13_0027_FC:4:1:13586:1280 length=74
TGCAGTTTCTGTAAAAAGCATTTTCCAAGTCAGGCATCTAGCCATACGTGGTTTAAATGTCCTAAAATTAAGGA
>40578_rex.37 GRC13_0027_FC:4:1:15948:1277 length=74
TGCAGCTAGAATAAAGGACTCTGCCCCCTCTCCTCCTGTCGCAGCACGTCGGTATATCAGCAGTAGCTGCAATT
>40578_rex.38 GRC13_0027_FC:4:1:17000:1275 length=74
TGCAGGTGCTTAGTTAGTGTTTTTTTTTATAATTTCCGTACTATTAATTTCACAAAAAAACTTTGACAGCTGGG
>40578_rex.39 GRC13_0027_FC:4:1:18271:1275 length=74
TGCAGTACCTCGACGTGACATGAGCGTGAAAGGGGTTTAAGAATCAGTTTTCTTTTTATAAGGGCTAAAATTAC
>40578_rex.40 GRC13_0027_FC:4:1:18527:1277 length=74
TGCAGCTGCACGAGCCCTTCCCGCATGCCACAAATGACCTACGAATAAGAAGAATCCTAGAACAAAATGAGAGG
>40578_rex.41 GRC13_0027_FC:4:1:19204:1275 length=74
TGCAGATTTTGCTTTCTGAAAGCTGAGTGGGTTGTTCACTAACTAAGGGGCGGGAATTTGCAGGTGATGGATAT
>40578_rex.42 GRC13_0027_FC:4:1:1501:1289 length=74
TGCAGCGACTGTTGCACCTTTGGAATCCCGGTGGACCTAGTGACGAGCACGACATCATCGATACTGCTCTTGTC
>40578_rex.43 GRC13_0027_FC:4:1:16935:1287 length=74
TGCAGATGACATGTCGTTGAGTAGACCACATACAGTTACAGACTTACAGGCACCTTCTCCTAAGCACTAGTATT
>40578_rex.44 GRC13_0027_FC:4:1:4483:1295 length=74
TGCAGCCTGCAAGTTATGTTTCCACAAATAAAACTGAATTCCATTTTCATAATTTTTAACATTACGAGAAGTTG
>40578_rex.45 GRC13_0027_FC:4:1:7074:1303 length=74
TGCAGCTATCGGTTTGCACTTTTACCCAATCTGGGAAGCAGCATCCGTTGATGAATGGTTATACAATGGCGGTC
>40578_rex.46 GRC13_0027_FC:4:1:17012:1303 length=74
TGCAGGGACCAATCATGCTAGTTGGGCGATATGCTTAACACATACAAGCCCAACGTAGTTTTCTTGGAAACGCC
>40578_rex.47 GRC13_0027_FC:4:1:17086:1298 length=74
TGCAGGCCGCCACTGCTTTGACGGAGGCCCGCAAGGATCCCATGGAGGCAGCGGTTTCAGGCAGAGACACAGTT
>40578_rex.48 GRC13_0027_FC:4:1:18179:1301 length=74
TGCAGAGCTCCTTGCATGCATTTGTTGACTCCTTGCCCCCAACAACTAGACTAGGAATTGTGCTTTATGGCCGC
>40578_rex.49 GRC13_0027_FC:4:1:2586:1309 length=74
TGCAGAAGAAAAAACAGCAAAATCCGATCCAATTTATCGTAATCGATTAGTTAACATGTTGGTTAACCGTATTC
>40578_rex.50 GRC13_0027_FC:4:1:7992:1311 length=74
TGCAGATACCACATTAACCTGGGCACCCTTTGCAACAATGACTGCCCATTGGTGGGAGTTGATGTGGGGGGCGA
>40578_rex.51 GRC13_0027_FC:4:1:15473:1313 length=74
TGCAGTGAGCTTCACTCTGTCATTTACAGGAAGAAAAAACAAAAATTAAATTAAGTTCTCGAATAAAACTATTC
>40578_rex.52 GRC13_0027_FC:4:1:15875:1309 length=74
TGCAGATCCGCTCTTTTTCCTATTCAAAGATCAGCCCCCTGGCTCTGTGTTTTCACATCGAGAATTATTTGCAG
>40578_rex.53 GRC13_0027_FC:4:1:16770:1311 length=74
TGCAGATTAGGAAGGATCGTAGATTTTGGCAACTCAACGGCTGAAATGGGGTAGGGATTCAGAGAATCTCGCTA
>40578_rex.54 GRC13_0027_FC:4:1:4790:1319 length=74
TGCAGGGTGTGCACTGCCCACAGCTCTCGTCCTTGTAGAAATGCGATAGCCTGGAGATCGCATCGACAACGTCC
>40578_rex.55 GRC13_0027_FC:4:1:12857:1323 length=74
TGCAGCTTTGACGCGCAGCTTTACTGATCTAACAGTCTACCCTGTGAGCTCTACCGACGACATCGACCGTACCG
>40578_rex.56 GRC13_0027_FC:4:1:14088:1323 length=74
TGCAGTACCTCGACGTGACATGAGCGTGAAAGGGGTTTAAGAATCAGTTTTCTTTTTATAAGGGCTAAAATTAC
>40578_rex.57 GRC13_0027_FC:4:1:15887:1324 length=74
TGCAGATCCGCTCTTTTTCCTATTCAAAGATCAGCCCCCTGGCTCTGTGTTTTCACATCGAGAATTATTTGCAG
>40578_rex.58 GRC13_0027_FC:4:1:16748:1326 length=74
TGCAGATTAGGAAGGATCGTAGATTTTGGCAACTCAACGGCTGAAATGGGGTAGGGATTCAGAGAATCTCGCTA
>40578_rex.59 GRC13_0027_FC:4:1:4129:1330 length=74
TGCAGCTTTTGGATCAGACAGGTACAACAAACACTTCAACAAAAATAAAGCATTGAATTCCATGCGTTTGTCTT
>40578_rex.60 GRC13_0027_FC:4:1:5903:1331 length=74
TGCAGATTTGAATCTGGACAAAACATCAACATCCAATAAAACTATAATTTACTAAACAAATTAATTCCCCTTAC
>40578_rex.61 GRC13_0027_FC:4:1:15310:1335 length=74
TGCAGCAAATGAAATATGTGAGAAAGTGGCTGTGGTTGGGTTTAGTACAAACATGATAAGCTACTTAACAATCC
>40578_rex.62 GRC13_0027_FC:4:1:6490:1341 length=74
TGCAGCCCTTCTTCCAGCCCGCGGATCAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATG
>40578_rex.63 GRC13_0027_FC:4:1:14479:1348 length=74
TGCAGTCTCTTCTTCATACCCTCCAAGTCCTACAAAATTAATACACCCAAATAATTAGCAACATAGCCTCATCT
>40578_rex.64 GRC13_0027_FC:4:1:16245:1345 length=74
TGCAGTACCTCAGCAATAGGTTTGGAAGTTTTATCTAATTCCGATGTTGTATACTATGTTCGTGTTTGAGTGTT
>40578_rex.65 GRC13_0027_FC:4:1:17142:1344 length=74
TGCAGCGACCGCTATATCCCTACTGGAGCTCTCTGTCCGCGAATAGCGGTGGCATAGCGGTCAGAAGTGACCGC
>40578_rex.66 GRC13_0027_FC:4:1:4698:1353 length=74
TGCAGTACACAACCCAACACACGACAGCTTCATAATGATGAAAAATGCCATATGCTGGAATTATATACCGTTTA
>40578_rex.67 GRC13_0027_FC:4:1:16511:1362 length=74
TGCAGCATTTAGTCTGCTTTCCACATAAAGAATGAATGGGGAGATAAAATTTACTAAAAAATCAGCAATGTAAA
>40578_rex.68 GRC13_0027_FC:4:1:3908:1370 length=74
TGCAGAAGATGGACATTGGATTTTAACTAGACCATTTGAACCTGTGTGTAAGCCTGGGGGTCATGGTGTGATAT
>40578_rex.69 GRC13_0027_FC:4:1:8298:1367 length=74
TGCAGCTGTTGGTGGACGGATCGAGTCGTGCGAGCAGTAGAAACGGGCTTTCTGTGGAGACGCGTGCCGAACGG
>40578_rex.70 GRC13_0027_FC:4:1:8771:1374 length=74
TGCAGAAGTAGGAATAATGGCACCCGAGATAATATTGTTTCCATAAAGTAGAGATCCAGAAACAGGTTCACGAA
>40578_rex.71 GRC13_0027_FC:4:1:10712:1368 length=74
TGCAGTTTGAGAACACTTCGAGCACAATTTACCGAACTGTAGCCCAAATGGATATATAGTTCTGAATTAGATGC
>40578_rex.72 GRC13_0027_FC:4:1:12949:1366 length=74
TGCAGGCACGCAGTCATGGTGCAAATAGGCGAGCCCTTGGGCGGTCCCCAGTGCGATGGCGAACCGTTTAGGCC
>40578_rex.73 GRC13_0027_FC:4:1:10140:1382 length=74
TGCAGCCTCGTGACCAAGGATGCATGGGAAGAGACCTTCAGGATCCTGCACGAAAGATAAATTCAAAATTATTT
>40578_rex.74 GRC13_0027_FC:4:1:11474:1385 length=74
TGCAGACAGACCAACTAGCTGATTCTTGCAAATTAATTGATTGTGTTCTTTCTTGGGTTTTGATATCTGATTGA
>40578_rex.75 GRC13_0027_FC:4:1:13132:1385 length=74
TGCAGCCAACCTTCACGTTTGTATCCAAATGTTCGAAAAATAGCAGCATGTGGTTCATGCTTCATTGTTGGCTT
>40578_rex.76 GRC13_0027_FC:4:1:13412:1388 length=74
TGCAGCCTGTGGCCACCCCTTCTGCACCGATTGCTGGAAAACTTATGTTCGCCTTTCCATCCAAGACGGGCCGG
>40578_rex.77 GRC13_0027_FC:4:1:14834:1379 length=74
TGCAGACATCAACATATCCTTCCACGAATAGACGCACGGTGTGCTGGAATTAATAATCTTTCATGTTTATAGCC
>40578_rex.78 GRC13_0027_FC:4:1:6313:1398 length=74
TGCAGGTCGGTGATTTTTGTGATGGACCGGCTGGTCGCGGGTGGCGAAGCCAAGCGATAGGGGACAAACGGCAG
>40578_rex.79 GRC13_0027_FC:4:1:11517:1398 length=74
TGCAGGAATCGAGGTACTTCACACGTTGACAAATCACACAATTGACTTTTTTTGGAACAAAGCGTGCCTAATTG
>40578_rex.80 GRC13_0027_FC:4:1:13356:1391 length=74
TGCAGGACGGCAATTGTTCTGGGCACTGCTATTCCACTGGTTTTATTCCTTGTCTGGAATGGTGTCATTCTTGG
>40578_rex.81 GRC13_0027_FC:4:1:16941:1395 length=74
TGCAGCTCTCCACGTTGTGTTCGCGCCTCTTTTAGACTCTTGACTGCCACTAAACTACCATCTGCTAACCGCCC
>40578_rex.82 GRC13_0027_FC:4:1:18999:1388 length=74
TGCAGCACTGGCCTTCCCATGTTCCTTATTTTTGAAGAGCCAACTTGTGGAAGACCCCCCACCACTGGCTGCGG
>40578_rex.83 GRC13_0027_FC:4:1:19512:1399 length=74
TGCAGCAGTCACCCTCGGCCCGTTTCAGATAGTATGGCCTGCTTCTTCACCTCAGCCCGAACAACAGCAGCGTA
>40578_rex.84 GRC13_0027_FC:4:1:9548:1409 length=74
TGCAGGCTATATGTTGGTTTGTTACGATTCTTTAGTGATTGGTGACTATGTCTATTTTACCCTTTTGTGAACAA
>40578_rex.85 GRC13_0027_FC:4:1:14030:1410 length=74
TGCAGAGTCACATCGAATGCTTGTCGAGGCATATGGTGATCATGCTCTATCAGAAGCAACATGCAAAAGATGGT
>40578_rex.86 GRC13_0027_FC:4:1:14096:1405 length=74
TGCAGCTCTAGCGGTCGGCTCGGTTCTTTCAAATTGTTTCTCATTATTGAGAAAAGGTAACAAAGATAAAATAC
>40578_rex.87 GRC13_0027_FC:4:1:15071:1408 length=74
TGCAGTTCAGCCGAATATAAAAACACACAAAAGGGTCTAATTCCAATTAAGAATCTCCACATCCTTCTTCTCCT
>40578_rex.88 GRC13_0027_FC:4:1:2896:1418 length=74
TGCAGGGAATGGTAATTCATTCGGGCCGAGATATGATAAGGATCGCTTTTACGGGCTCGGGAAAGCCGTCGGTT
>40578_rex.89 GRC13_0027_FC:4:1:1967:1435 length=74
TGCAGACCTCCGAAAAAAAGACGTGCGTAGAAATGAACAATGAAGAAATTTCAAAAAAAAACAAGATAAATAAA
>40578_rex.90 GRC13_0027_FC:4:1:12507:1428 length=74
TGCAGGAAATTCAAAAGGAGATATGGATGTTCAAAGATGATATTGGAGATAGCAATGTAAACCAAGCTAGAAAA
>40578_rex.91 GRC13_0027_FC:4:1:17243:1431 length=74
TGCAGTACGAGTAGACTGCCGCAGACACTGCTTCGTGCATAGCCTCTGCTGACGACGCGTCTTGTACGGAGGAA
>40578_rex.92 GRC13_0027_FC:4:1:19026:1434 length=74
TGCAGCCTCCAAGGAGTATAAAGCGGTTATACCATTTGCTTCTGGTAGACACACAAAGAGAACGTGTTACATAT
>40578_rex.93 GRC13_0027_FC:4:1:1075:1444 length=74
TGCAGATCCGCTCTTTTTCCTATTCAAAGATCAGCCCCCTGGCTCTGTGTTTTCACATCGAGAATTATTTGCAG
>40578_rex.94 GRC13_0027_FC:4:1:1368:1438 length=74
TGCAGCATGGGTCACCGAAGCGACGACTGTGGTCTTATACCGGAGTCGATGAAAAGAGTATGTAAGGAATTCCT
>40578_rex.95 GRC13_0027_FC:4:1:3842:1440 length=74
TGCAGCTGCACGAGCCCTTCCCGCATGCCACAAATGACCTACGAATAAGAAGAATCCTAGAACAAAATGAGAGG
>40578_rex.96 GRC13_0027_FC:4:1:6119:1440 length=74
TGCAGGGCTGATGCCGCGCAGTTCATCCTCATAGAGGGCGGCAGCTCCGCCATTCTTCTCCTCCTCACCGCCCT
>40578_rex.97 GRC13_0027_FC:4:1:6371:1442 length=74
TGCAGAAACGCTACATCGATTGTTAGTGTCTCTGTAAGTGCACCTGGGCTTACCCGGTATTTTTTTTTTTTTGG
>40578_rex.98 GRC13_0027_FC:4:1:8730:1435 length=74
TGCAGATTGCCACTGCATATTTTCAGGAGTTACAAATTAACCCCCATAATTTCTAACATCTGTGAATTGACCAC
>40578_rex.99 GRC13_0027_FC:4:1:16326:1446 length=74
TGCAGCAGACAGCAGTTGAAGAGAAAAAAGATCAAACTAAATCAAACATATCATGCATAAGATTTTTCCAAAGA
>40578_rex.100 GRC13_0027_FC:4:1:1967:1453 length=74
TGCAGACCTCCGAAAAAAAGACGTGCGTAGAAATGAACAATGAAGAAATTTCAAAAAAAAACAAGATAAATAAA
>40578_rex.101 GRC13_0027_FC:4:1:4252:1458 length=74
TGCAGGCCTTGAGATGGCAAGTCAACGGTCAGCATTAGGGCCAGTTAGTGCACGAGTCGCGGTGTGGTCCGGAC
>40578_rex.102 GRC13_0027_FC:4:1:6163:1456 length=74
TGCAGATCCGCTCTTTTTCCTATTCAAAGATCAGCCCCCTGGCTCTGTGTTTTCACATCGAGAATTATTTGCAG
>40578_rex.103 GRC13_0027_FC:4:1:6919:1450 length=74
TGCAGCACTCTCGTCCCGCCACCACACTTGGCAGTCATCCCCAGTCGACCAGGTCCTTTCCATAGCCGTGTCAT
>40578_rex.104 GRC13_0027_FC:4:1:13290:1448 length=74
TGCAGAAAAATGGTGGAAAAATATTTAAAGCAAAAACAAAAATAATTAACTCAAGACATGTGGAATTTGCTCTG
>40578_rex.105 GRC13_0027_FC:4:1:15179:1448 length=74
TGCAGATCCGCTCTTTTTCCTATTCAAAGATCAGCCCCCTGGCTCTGTGTTTTCACATCGAGAATTATTTGCAG
>40578_rex.106 GRC13_0027_FC:4:1:15966:1448 length=74
TGCAGTCTTTGGAATTCTTTGAAGAGGATGAATGGAAGTTCTCCATCCACTGCCAACTTTTTAAGGTATGTCTG
>40578_rex.107 GRC13_0027_FC:4:1:18482:1449 length=74
TGCAGCATCTTATCCTCCAAACTCCTCATGGACGCGGTGCAAAACCCGAGAATGCCCCTCCGGGTCGTTGTCCA
>40578_rex.108 GRC13_0027_FC:4:1:2223:1463 length=74
TGCAGAGCCAACACAGGCCTCCAGCTTTGAACTGGCCATTAAAGTGATCTTCTGAATTGACTCCAGTGCCTTGG
>40578_rex.109 GRC13_0027_FC:4:1:7720:1467 length=74
TGCAGCCTTGGTCCCCACAATTTCTGTAAAATGTATTTTCAATTTCTTGATGTTGGGAATTCTTTTAACCACCC
>40578_rex.110 GRC13_0027_FC:4:1:8625:1460 length=74
TGCAGTAGCTGCCGAATCTTCTACTGGTACATGGACAACTGTGTGGACCGATGGGCTTACTAGCCTTGATCGTT
>40578_rex.111 GRC13_0027_FC:4:1:10638:1466 length=74
TGCAGACCATGAAGTTGACAAACATTCAATGTAATAGGGTGGAGGGAGTTAAAGACGTTGTACATCATGTGTGT
>40578_rex.112 GRC13_0027_FC:4:1:10885:1463 length=74
TGCAGTTTCGATCATTCTAAACACTTAGTTTTGTTTGCTCTCAAGTGTAGATTGGTATTGATGTGGAAGAAAAG
>40578_rex.113 GRC13_0027_FC:4:1:11673:1459 length=74
TGCAGGTGATGAAGACGCTGGCGGGGGAGGGGATGACCATGCTGGTGGTCAGCCATGAGATGATGTTCGTGCGG
>40578_rex.114 GRC13_0027_FC:4:1:13629:1463 length=74
TGCAGAACTCCTCATCTCTAATCTCCTACCAAATAAACAACAACCAAATCTCTTTTTATTTCTTCCTAAAAAAT
>40578_rex.115 GRC13_0027_FC:4:1:18957:1469 length=74
TGCAGGTCTTGCTAGTGGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCT
>40578_rex.116 GRC13_0027_FC:4:1:19383:1460 length=74
TGCAGTTTGAGCAGCAAAGGGTGTCCCTCTTCTCGTACCCTTGAATCCACAAGTACCGGCCGAGGACCAGGAAA
>40578_rex.117 GRC13_0027_FC:4:1:1696:1472 length=74
TGCAGTAGCACAGCCTAGAGGCGGCCGTGTTTATGTGACCCGTGTAAAAAATTCATTTTTTTATGTACATATCA
>40578_rex.118 GRC13_0027_FC:4:1:4799:1471 length=74
TGCAGCATCATGCTTTTCTCCTCCTGCTTCTGTTTCTTCCATACATTCTTCTTTTGCAACATCCTCATCCTCCA
>40578_rex.119 GRC13_0027_FC:4:1:6713:1478 length=74
TGCAGTGTTTATTCTTTTGTTTGACACAAATTAAGTCCTTTAGTTGGTGAACGACCAAACTCGACCAAACTCAA
>40578_rex.120 GRC13_0027_FC:4:1:9419:1481 length=74
TGCAGCGAATCACAATGTCGTGGACGCTGCCCGAGTCACATATTCGGGCCCTGGTCGAGTACAGGCCCCGGCCC
>40578_rex.121 GRC13_0027_FC:4:1:9790:1474 length=74
TGCAGCAACACAATCATGGGCTTCAACTTGATTCAACGTATTCTGTGTGAAGTTTCCCATAGCTTTTAACTTGC
>40578_rex.122 GRC13_0027_FC:4:1:3054:1492 length=74
TGCAGCCTGCACTCATTTCATCATGTTCAACAAAATTAATTATTATTTTATTTTTGAAGTAAAAATGCTACAAA
>40578_rex.123 GRC13_0027_FC:4:1:3992:1487 length=74
TGCAGTTGTGATTGATCAAGAAGGAAATCCAAAAGGAACTCGCATTTTTGGTGCAATCCCGCGGGAATTGCGAC
>40578_rex.124 GRC13_0027_FC:4:1:1922:1505 length=74
TGCAGGCGCGGCGGCGGCTTCAGCGGCTTGATAAAGCCCCCGGCGAAAAGCCCCACCGCTGGCAAGGAGGCCGG
>40578_rex.125 GRC13_0027_FC:4:1:2571:1496 length=74
TGCAGCTCACGGTCGTGAGGGTGAGCTTATTTTTTTGTGAACTGTCTCAACTGCTCGTGAGGGTCCTCACGATT
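One way to sanity-check output like the above is to confirm that each record is a header line starting with `>` followed by one sequence line, and that the sequence length matches the `length=` value in the header. The cell below is only a minimal sketch under those assumptions; the function name `check_fasta` is hypothetical and not part of the exercise, and it expects the FASTA text to already be loaded into a string.

```python
def check_fasta(text):
    """Return (n_records, problems) for two-line FASTA records in `text`."""
    lines = text.strip().split("\n")
    problems = []

    # walk through the lines two at a time: header, then sequence
    for idx in range(0, len(lines), 2):
        header = lines[idx]
        seq = lines[idx + 1] if idx + 1 < len(lines) else ""

        # every FASTA header should begin with '>'
        if not header.startswith(">"):
            problems.append((idx, "header does not start with '>'"))
            continue

        # the declared length is the number after 'length=' in the header
        if "length=" in header:
            declared = int(header.split("length=")[-1])
            if declared != len(seq):
                problems.append((idx, "sequence length differs from header"))

    return len(lines) // 2, problems


# one record copied from the output above
example = (
    ">40578_rex.9 GRC13_0027_FC:4:1:2960:1206 length=74\n"
    "TGCAGGAGTTGGGGCATGATCTGGCTCAGCTCTGCGCGAAAGCGTCCAGCTGAGGGCGGGAAGGAGTCTTGTAG\n"
)
nrecords, problems = check_fasta(example)
print(nrecords, problems)
```

An empty `problems` list means every record passed both checks. Each entry otherwise gives the line index (counting from 0) of the header that failed and a short reason.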
Question: Describe each step of your function above in words; that is, explain how and why it works. Also describe any parts that gave you trouble and how you found a solution. Enter your answer below using Markdown.

Response:

Write your response here.
Action: Save your notebook, then download it as HTML and upload it to courseworks.