This notebook is associated with the following assigned reading: Chapter 2 of the Data Science Handbook. By the end of this module you should be able to use numpy.

This notebook should be completed while reading Chapter 2 of the Data Science Handbook. The numpy library is a third-party Python library, meaning that it is not distributed by default with every Python installation. It is easy to install, however, and provides a huge suite of tools for scientific computing. I think that the assigned chapter introduces numpy very well, so this notebook will mostly consist of exercises to test your comprehension of the reading.
# start by importing numpy
import numpy as np
There are many ways to create a numpy array. Numpy has several built-in functions for generating arrays composed entirely of one value, or of a range of values, using .zeros(), .ones(), or .arange(); or, we can generate an array by passing in a list, as in the last two examples below.
# create an array with ten items in it that are all zeros
np.zeros(10)
# create an array with ten items that are all zeros as integers
np.zeros(10, dtype=int)
# create an array with a range of values from 0-10
np.arange(0, 10)
# create an array from a list
np.array([0, 3, 4, 10, 2, 2, 2, 2])
# create an array that is made of 0 and 1 alternating
np.array([0, 1] * 5)
When you create a numpy array from a list it tries to infer the datatype from the contents of the list. Above, when we created a list of all int elements it created an int array. However, when we pass it a list below where some elements are ints and some are strings, it converts everything to strings. This is because numpy works most efficiently by storing all data in an array as a single datatype. You can create arrays with a mixed datatype, but you lose much of the efficiency of numpy when you do so.
# mixed type lists will be converted to a single dtype array
np.array([0, 1, "apple", "orange"])
Numpy arrays can be indexed just like Python list objects to select particular elements from them. In addition to the one dimension in which lists can be indexed, however, arrays can be indexed in multiple dimensions to select both rows and columns, and functions can be applied over these indices as well. If you need a refresher on how to index and slice objects in Python, look back at our reading from last week (link), or google "python indexing".
# create a 2-dimensional array
np.zeros((4, 4))
# create a 3-dimensional array
np.zeros((5, 3, 3))
# index the third element in a 1-d array
np.arange(100)[3]
# slice the third through tenth elements in a 1-d array
np.arange(100)[3:10]
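The two examples above index in one dimension. As a quick sketch of the multi-dimensional indexing mentioned earlier, numpy's row, column syntax lets you select along each axis:

```python
import numpy as np

# a 3x4 array: rows [0-3], [4-7], [8-11]
arr = np.arange(12).reshape((3, 4))

arr[1, 2]     # row 1, column 2 -> 6
arr[:, 0]     # all rows, first column -> array([0, 4, 8])
arr[1:, 2:]   # rows 1 onward, columns 2 onward (a 2x2 subarray)
```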
# one way
arr = np.zeros((3, 5), dtype=int)
arr[:] = 35
arr
# another way
arr = np.empty((3, 5), dtype=int)
arr[:] = 35
arr
# another way
arr = np.empty((3, 5), dtype=int)
arr.fill(35)
arr
# another way
arr = np.full((3, 5), 35, dtype=int)
arr
# create a 3x5 array of random integers from 0 up to (but not including) 10
np.random.randint(0, 10, (3, 5))
# slice the elements at indices 24 through 26 of a 1-d array
arr = np.arange(30)
arr[24:27]
# reshape a range into 12 rows x 2 columns, then select
# the first column of rows 7 onward
arr = np.arange(24).reshape((12, 2))
arr[7:, 0]
When learning about lists and dictionaries we saw that a convenient way to use these objects is to first create an empty list or dictionary and then fill it with values as you iterate over an object, appending items to the list or adding new key/val pairs to the dictionary. In numpy you have to do things a bit differently: you start by creating an array that is the full size you plan to work with, initialized with some null value like zeros, and then you update the values with new data.

The difference between these two approaches may seem subtle, but it can lead to huge speed improvements when done correctly in numpy. This is because numpy reserves space in your computer's memory for the full size of the array, whereas a list that changes size as you extend it needs to keep updating the object in memory. As explained in your reading, this allows numpy to perform calculations using compiled functions in faster languages like C and Fortran; it's a workaround that allows us to write pretty Python code but still benefit from super fast speed.

These details only begin to matter when you do pretty high-level computing, but it's worth learning the motivation for why we use numpy, and why the code looks the way that it does.
# common with lists: start with an empty list and fill it as you iterate
empty = []
for i in range(100):
    empty.append(i)
# filling an empty array doesn't work like it does with lists
empty = np.array([])
for i in range(100):
    # ... arrays cannot be extended in place; functions like
    # ... np.append() build and return a new array on every call
    pass
# instead, you create the full sized array with null values and update them by indexing
empty = np.zeros(100)
for i in range(100):
    empty[i] = i
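To see why this matters, here is a rough timing sketch comparing the two approaches. Note that np.append() returns a brand-new array on every call; the loop size and the use of time.perf_counter here are just illustrative choices, not part of the assignment:

```python
import time
import numpy as np

n = 5000

# grow-by-"append": np.append copies the whole array on every call
start = time.perf_counter()
grown = np.array([], dtype=int)
for i in range(n):
    grown = np.append(grown, i)
t_append = time.perf_counter() - start

# preallocate once, then assign by index
start = time.perf_counter()
filled = np.zeros(n, dtype=int)
for i in range(n):
    filled[i] = i
t_fill = time.perf_counter() - start

# both approaches produce the same array,
# but the preallocated version avoids repeated copying
```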
# run this first so that we all start from the same random seed
import numpy as np
np.random.seed(12345)
# 1. comment: generate 50 random integers between 1K-50K
contig_sizes = np.random.randint(1000, 50000, size=50)
contig_sizes
# 2. comment: sort integer array by size
contig_sizes.sort()
contig_sizes
# 3. comment: reverse the order of the array
contig_sizes = contig_sizes[::-1]
contig_sizes
# 4. comment: get the total sum (e.g., assembly size)
total_len = contig_sizes.sum()
total_len
# 5. comment: get half of the total size
half_total_len = total_len / 2
half_total_len
# 6. comment: make an array of zeros
contig_sum_lens = np.zeros(contig_sizes.size, dtype=int)
contig_sum_lens
# 7. comment: iterate over the items in the empty array and fill
# each index with the sum of all contigs from that index to the end.
for i in range(contig_sizes.size):
    contig_sum_lens[i] = contig_sizes[i:].sum()
contig_sum_lens
# 8. comment: calculate which values are > half the total length
which_contigs_longer_than_half = contig_sum_lens > half_total_len
which_contigs_longer_than_half
# 9. comment: keep the values that answered True above
contigs_longer_than_half = contig_sizes[which_contigs_longer_than_half]
contigs_longer_than_half
# 10. comment: keep the lowest value from above, which is the N50
n50 = contigs_longer_than_half.min()
n50
def calculate_N50_from_contig_sizes(contig_sizes):
    # 2. comment: sort integer array by size (note: .sort() modifies
    # the input array in place)
    contig_sizes.sort()
    # 3. comment: reverse the order of the array
    contig_sizes = contig_sizes[::-1]
    # 4. comment: get the total sum (e.g., assembly size)
    total_len = contig_sizes.sum()
    # 5. comment: get half of the total size
    half_total_len = total_len / 2
    # 6. comment: make an array of zeros
    contig_sum_lens = np.zeros(contig_sizes.size, dtype=int)
    # 7. comment: iterate over the items in the empty array and fill
    # each index with the sum of all contigs from that index to the end.
    for i in range(contig_sizes.size):
        contig_sum_lens[i] = contig_sizes[i:].sum()
    # 8. comment: calculate which values are > half the total length
    which_contigs_longer_than_half = contig_sum_lens > half_total_len
    # 9. comment: keep the values that answered True above
    contigs_longer_than_half = contig_sizes[which_contigs_longer_than_half]
    # 10. comment: keep the lowest value from above, which is the N50
    n50 = contigs_longer_than_half.min()
    return n50
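As an aside (not part of the assignment), the loop in step 7 can be replaced with np.cumsum to avoid summing the same elements repeatedly. The sketch below uses a name of my own, n50_vectorized, and sorts a copy so the input array is not modified:

```python
import numpy as np

def n50_vectorized(sizes):
    # sort a copy in descending order so the input is not modified
    s = np.sort(sizes)[::-1]
    # cumulative assembly length after adding each contig
    csum = np.cumsum(s)
    half = csum[-1] / 2
    # N50 is the size of the first contig at which the cumulative
    # sum reaches half of the total assembly length
    return s[np.searchsorted(csum, half)]
```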
# run this first so that we all start from the same random seed
import numpy as np
np.random.seed(12345)
contig_sizes = np.random.randint(1000, 50000, size=50)
# run the new function
calculate_N50_from_contig_sizes(contig_sizes)