Notebook 5.1: Python bootcamp review

This notebook tests your comprehension of several coding concepts that you will need going forward to write and interpret Python code. If you find a topic challenging, revisit earlier notebooks and lecture slides, or search for help online. It is up to you to import any libraries required to accomplish the requested tasks.

All cells in this notebook are 'challenges' that you should try to complete as part of the assignment.

The core Python object types

In [1]:
# create a string object and store it to a variable x
x = "hello world"
In [2]:
# create a string object with a newline character in it and print it.
y = "hello\nworld"
print(y)
In [4]:
# create a list made up of a mixture of integers and floats
mix = [1, 2, 3.0, 6.6, 2e3]
In [5]:
# create a dictionary 
mydict = {'a': 3, 'b': 4, 'c': 10}
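As a quick check on the cells above, the built-in type() function reports the class of each object. The sketch below repeats the same variable definitions so that it runs on its own; the name elem_types is introduced here only for illustration.

```python
# the objects from the cells above, redefined so this cell runs alone
x = "hello world"
y = "hello\nworld"
mix = [1, 2, 3.0, 6.6, 2e3]
mydict = {'a': 3, 'b': 4, 'c': 10}

# print the type name of each object
for obj in (x, y, mix, mydict):
    print(type(obj).__name__)

# a list can hold mixed types; 2e3 is written in scientific
# notation and is a float (2000.0), not an int
elem_types = [type(item).__name__ for item in mix]
print(elem_types)
```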

Coding routines

In [7]:
# 1. Create a dictionary object with several key-value pairs
# 2. Iterate over the dict printing the key and value of each item
mydict = {'a': 3, 'b': 4, 'c': 10}
for key in mydict:
    print(key, mydict[key])
a 3
b 4
c 10
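An equivalent way to write the loop above, shown here as one possible alternative, is to iterate over mydict.items(), which yields each key together with its value. The variable pairs is added only to show what .items() contains.

```python
mydict = {'a': 3, 'b': 4, 'c': 10}

# .items() yields (key, value) tuples that are unpacked in the loop
for key, value in mydict.items():
    print(key, value)

# the same pairs collected into a list
pairs = list(mydict.items())
```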
In [18]:
# 1. Create a list of 25 randomly sampled integers.
# 2. Use list-comprehension to create a new list where another
#    random integer is added to each item in the first list.
import random
intlist = [random.randint(0, 10) for _ in range(25)]
sumlist = [item + random.randint(0, 10) for item in intlist]
In [21]:
# 1. Create a numpy array of 25 randomly sampled integers.
# 2. Create another array where a random integer is added to every 
#    item in the first array. 
import numpy as np
arr1 = np.random.randint(0, 10, 25)
arrsum = arr1 + np.random.randint(0, 10, 25)
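Note that np.random.randint(0, 10, 25) excludes the upper bound (values 0 to 9), whereas random.randint(0, 10) in the previous cell includes it (values 0 to 10). A minimal sketch of the same task using numpy's Generator interface follows; the seed value 123 is arbitrary and chosen here only to make the result repeatable.

```python
import numpy as np

# a seeded random generator gives the same values on every run
rng = np.random.default_rng(seed=123)

# 25 integers sampled from 0 to 9 (the upper bound is excluded)
arr1 = rng.integers(0, 10, size=25)

# element-wise addition of a second random array, no loop required
arrsum = arr1 + rng.integers(0, 10, size=25)

# the amount added to each item, for checking
diff = arrsum - arr1
```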
In [22]:
# 1. Create a list with several string objects in it.
# 2. Iterate over the items in the list.
# 3. For each item, use a conditional statement that will select
#    some of the items but not others. 
# 4. print the items that return True to the conditional statement.

slist = ['a', 'b', 'c', 'd']
for item in slist:
    if item == 'b':
        print(item)
b
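The loop-plus-conditional pattern above can also be written as a list comprehension with an if clause, which collects the selected items instead of printing them. The list contents and the startswith() condition below are hypothetical examples.

```python
# a hypothetical list of strings
slist = ['apple', 'banana', 'cherry', 'avocado']

# keep only the items for which the condition is True
selected = [item for item in slist if item.startswith('a')]
print(selected)
```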
In [23]:
# use bash or Python code to download the file from the following URL
URL = "https://eaton-lab.org/data/40578.fastq.gz"
In [24]:
%%bash
URL="https://eaton-lab.org/data/40578.fastq.gz"
wget "$URL" -q
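The same download can be sketched in pure Python with the standard library. The helper names local_name and download are hypothetical, and the final call is left commented out because it requires network access.

```python
import os
import urllib.request
from urllib.parse import urlparse

URL = "https://eaton-lab.org/data/40578.fastq.gz"

def local_name(url):
    """Return the file name at the end of a URL path."""
    return os.path.basename(urlparse(url).path)

def download(url, outdir="."):
    """Download url into outdir and return the path of the saved file."""
    outpath = os.path.join(outdir, local_name(url))
    urllib.request.urlretrieve(url, outpath)
    return outpath

# download(URL)   # uncomment to fetch the file (requires network access)
```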

Writing, interpreting, and using functions

In [27]:
# add comments after '#' in the following function.
# feel free to take apart and test parts of the function 
# to learn what each part is doing.

def reverse_complement(dnastring):
    """
    Returns the reverse-complement of a string of DNA.
    """
    # a dictionary to convert bases to their complement
    compdict = {
        'A': 'T',
        'C': 'G',
        'G': 'C',
        'T': 'A',
    }
    
    # list comprehension to get the complement of each base
    complist = [compdict[i] for i in dnastring]
    
    # string join command to convert list to a string
    compstring = "".join(complist)
    
    # reverse the order of items in the string
    revcompstring = compstring[::-1]
    
    # return the reverse complemented string
    return revcompstring
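For comparison, here is a sketch of an alternative approach that uses a string translation table instead of a dictionary and list comprehension. The names COMPLEMENT and reverse_complement_translate are introduced here for illustration. One difference in behavior: translate() leaves unmapped characters (such as 'N') unchanged, whereas the dictionary lookup above raises a KeyError for them.

```python
# translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(dnastring):
    """
    Returns the reverse-complement of a string of DNA.
    """
    # complement every base, then reverse the string with a slice
    return dnastring.translate(COMPLEMENT)[::-1]

result = reverse_complement_translate("AACG")
print(result)
```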
In [33]:
# write a function to read the 40578.fastq.gz file and *return*
# the length of the *first sequenced read* in the file. Then
# apply the function to the file to return the result.

import gzip

def get_fastq_length(infile):
    
    with gzip.open(infile, 'rt') as indata:
        # read first line (header)
        indata.readline()
        # read second line (first sequence)
        return len(indata.readline())


# test it
get_fastq_length("40578.fastq.gz")
Out[33]:
75
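One detail worth checking: readline() keeps the trailing newline character, so len() of the raw line counts it as one extra character. The sketch below strips whitespace before measuring. It uses a tiny made-up FASTQ record written to a temporary file (not the course data), and the function name get_first_read_length is hypothetical.

```python
import gzip
import os
import tempfile

def get_first_read_length(infile):
    """Return the number of bases in the first read of a gzipped FASTQ."""
    with gzip.open(infile, 'rt') as indata:
        # skip the first line (header)
        indata.readline()
        # strip the trailing newline before measuring the sequence
        return len(indata.readline().strip())

# write a small made-up FASTQ record to a temporary gzipped file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, 'wt') as out:
    out.write("@read1\nACGTACGT\n+\nIIIIIIII\n")

# the demo read has 8 bases; the raw line would have 9 characters
print(get_first_read_length(demo_path))
```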

Scientific Python

Please revisit your assignments, the lecture slides, and/or the documentation for tips on using pandas and NumPy.

In [59]:
# Create a pandas dataframe with three columns composed of 
# randomly generated data created using the numpy.random 
# library. The first column should be random integers, 
# the second random float values, and the last one
# should be a random string of A, C, G, or T. Create 100 
# rows of data in total. 

import pandas as pd

df = pd.DataFrame({
    "integers": np.random.randint(0, 50, 100),
    "floats": np.random.normal(0, 1, 100),
    "strings": np.random.choice(list("ACGT"), 100),
})

# show 
df.head()
Out[59]:
integers floats strings
0 1 0.730890 C
1 22 0.353199 C
2 28 -1.272836 G
3 41 -0.344499 A
4 30 -0.370315 T
In [60]:
# Select all rows of the dataframe from above where the
# third column is equal to "A". 

# select rows
selectA = df[df["strings"] == "A"]

# show result head
selectA.head()
Out[60]:
integers floats strings
3 41 -0.344499 A
12 18 0.966228 A
14 41 0.895676 A
26 3 0.446839 A
32 6 2.146710 A
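Boolean masks like the one above can be combined. The sketch below rebuilds a similar random dataframe so that it runs on its own (the seed 0 is arbitrary), then selects rows using two conditions joined with &, and rows matching any of several values with isin().

```python
import numpy as np
import pandas as pd

# rebuild a dataframe like the one above with a seeded generator
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "integers": rng.integers(0, 50, size=100),
    "floats": rng.normal(0, 1, size=100),
    "strings": rng.choice(list("ACGT"), size=100),
})

# each condition needs its own parentheses; & is element-wise AND
mask = (df["strings"] == "A") & (df["floats"] > 0)
subset = df[mask]

# isin() selects rows whose value is any member of a list
purines = df[df["strings"].isin(["A", "G"])]
```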
In [65]:
# Select all rows of the dataframe from above where the
# third column is equal to "T" and change the values to "U".

# select ROWS and COL to change, use .loc or .iloc
df.loc[df["strings"] == "T", "strings"] = "U"

# show result head
df.head(20)
Out[65]:
integers floats strings
0 1 0.730890 C
1 22 0.353199 C
2 28 -1.272836 G
3 41 -0.344499 A
4 30 -0.370315 U
5 4 -1.030631 U
6 36 -0.675709 G
7 29 -1.702446 U
8 7 1.854808 U
9 2 -0.247225 C
10 15 0.285567 C
11 0 0.787572 C
12 18 0.966228 A
13 2 0.880997 C
14 41 0.895676 A
15 2 0.569704 G
16 6 -0.640945 U
17 7 0.391457 G
18 16 -0.262894 G
19 43 1.271503 C
In [85]:
# Load a CSV file using the pandas library from the following URL
# and set appropriate names for the columns.
URL = "https://eaton-lab.org/data/iris-data-dirty.csv"

# load and set names
iris = pd.read_csv(URL, names=["trait1", "trait2", "trait3", "trait4", "taxon"])

# show 
iris.head()
Out[85]:
trait1 trait2 trait3 trait4 taxon
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [86]:
# print all *unique* values in the column with taxon names
iris["taxon"].unique()
Out[86]:
array(['Iris-setosa', 'Iris-setsa', 'Iris-versicolour', 'Iris-versicolor',
       'Iris-virginica'], dtype=object)
In [88]:
# replace the names that appear to be typos in the names column
iris.loc[iris["taxon"] == "Iris-setsa", "taxon"] = "Iris-setosa"
iris.loc[iris["taxon"] == "Iris-versicolour", "taxon"] = "Iris-versicolor"
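When there are several typos to fix, one alternative is to pass a dictionary of old-to-new names to the replace() method. The sketch below applies this to a small stand-in Series built from the unique values shown above rather than to the full iris data; the names taxa, fixes, and cleaned are hypothetical.

```python
import pandas as pd

# a stand-in for the taxon column, using the unique values seen above
taxa = pd.Series([
    "Iris-setosa", "Iris-setsa", "Iris-versicolour",
    "Iris-versicolor", "Iris-virginica",
])

# map each misspelled name to its corrected form
fixes = {
    "Iris-setsa": "Iris-setosa",
    "Iris-versicolour": "Iris-versicolor",
}

# replace() returns a new Series; values not in the dict are unchanged
cleaned = taxa.replace(fixes)
unique_names = sorted(cleaned.unique())
print(unique_names)
```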
In [89]:
# write code here to save the new revised dataframe as a CSV file.

iris.to_csv("./iris-revised.csv")
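By default, to_csv() also writes the row index as an unnamed first column, which shows up as an extra "Unnamed: 0" column if the file is later loaded with read_csv(). If that is not wanted, passing index=False is one option. The sketch below demonstrates this on a small made-up dataframe written to a temporary directory.

```python
import os
import tempfile
import pandas as pd

# a small made-up dataframe standing in for the iris data
demo = pd.DataFrame({
    "trait1": [5.1, 4.9],
    "taxon": ["Iris-setosa", "Iris-setosa"],
})

# write without the row index, then read the file back in
outpath = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(outpath, index=False)
reloaded = pd.read_csv(outpath)

# only the original columns are present after the round trip
print(list(reloaded.columns))
```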

Reading and writing files

In [90]:
# Write the contents of the string object below to a new 
# file called "hello-world.txt".
mystring = "hello world"


with open("hello-world.txt", 'w') as out:
    out.write(mystring)
In [91]:
# open the file "hello-world.txt" by writing the *full path*
# to the location of the file on your filesystem. Read 
# the contents of the file and print it.

# write full path or get it using os.path
import os
fullpath = os.path.abspath("./hello-world.txt")

# read from full path
with open(fullpath, 'r') as indat:
    print(indat.read())
hello world
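The same write-then-read round trip can be sketched with the pathlib module, which bundles path handling and file reading and writing in one object. The file is written to a temporary directory here so that nothing in your working directory is overwritten.

```python
import tempfile
from pathlib import Path

# build a path inside a temporary directory
outdir = Path(tempfile.mkdtemp())
path = outdir / "hello-world.txt"

# write the string to the file
path.write_text("hello world")

# resolve() returns the full (absolute) path to the file
fullpath = path.resolve()

# read the contents back from the full path
contents = fullpath.read_text()
print(contents)
```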