Notebook 4.1: Introduction to Numpy

This notebook is associated with the following accompanied reading:

  • Yandell, Mark, and Daniel Ence. 2012. “A Beginner’s Guide to Eukaryotic Genome Annotation.” Nature Reviews Genetics 13 (5): 329–42. https://doi.org/10.1038/nrg3174.

Learning objectives:

By the end of this module you should be able to:

  1. Find and use scientific and numerical functions in numpy.
  2. Generate arrays of data and compute values on them.
  3. Understand the difference between lists and numpy arrays.
  4. Calculate N50 contig size.

Introduction to numpy

This notebook should be completed while reading Chapter 2 of the Data Science Handbook that was assigned for your reading. The numpy library is a third party Python library, meaning that it is not distributed by default with every Python installation. It can be easily installed however, and provides a huge suite of tools for scientific computing. I think that the assigned Chapter introduces numpy very well, so this notebook will mostly consist of exercises to test your comprehension of the reading.

In [1]:
# start by importing numpy 
import numpy as np

Create a numpy array

There are many ways to create a numpy array. Numpy has several built-in functions for generating arrays that are composed entirely of one value, or a range of values using .zeros(), .ones(), or .arange(); or, we can also generate an array by passing in a list like in the last two examples below.

In [2]:
# create an array with ten items in it that are all zeros
np.zeros(10)
Out[2]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [3]:
# create an array with ten items that are all zeros as integers
np.zeros(10, dtype=int)
Out[3]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [4]:
# create an array with a range of values from 0-10
np.arange(0, 10)
Out[4]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [5]:
# create an array from a list
np.array([0, 3, 4, 10, 2, 2, 2, 2])
Out[5]:
array([ 0,  3,  4, 10,  2,  2,  2,  2])
In [6]:
# create an array that is made of 0 and 1 alternating
np.array([0, 1] * 5)
Out[6]:
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

However, the datatype is important...

When you create a numpy array from a list it tries to infer the datatype from contents of the list. Above, when we created a list of all int elements it created an int array. However, when we pass it a list below where some elements are ints and some are strings, it converts everything to strings. This is because numpy works most efficiently by storing all data in an array as a single datatype. You can create arrays with a mixed datatype but you lose much of the efficiency of numpy when you do so.

In [7]:
# mixed type lists will be converted to a single dtype array
np.array([0, 1, "apple", "orange"])
Out[7]:
array(['0', '1', 'apple', 'orange'], dtype='<U21')

Dimensions and indexing

Numpy arrays can be indexed just like Python list objects to select particular elements from them. In addition to the one dimension in which lists can be indexed, however, arrays can be indexed in multiple dimensions to select both rows and columns, and they can apply functions over these indices as well. If you need a refresher on how to index and slice objects in Python look back at our reading from last week (link), or google "python indexing".

In [8]:
# create a 2-dimensional array 
np.zeros((4, 4))
Out[8]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
In [9]:
# create a 3-dimensional array
np.zeros((5, 3, 3))
Out[9]:
array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])
In [10]:
# index the third element in a 1-d array
np.arange(100)[3]
Out[10]:
3
In [11]:
# slice the third through tenth elements in a 1-d array
np.arange(100)[3:10]
Out[11]:
array([3, 4, 5, 6, 7, 8, 9])

Testing your skills

Action [5]: Create a 2-dimensional array with 3 rows and 5 columns that is composed entirely of cells with the integer value 35.
In [14]:
# one way
arr = np.zeros((3, 5), dtype=int)
arr[:] = 35
arr
Out[14]:
array([[35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35]])
In [17]:
# another way
arr = np.empty((3, 5), dtype=int)
arr[:] = 35
arr
Out[17]:
array([[35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35]])
In [19]:
# another way
arr = np.empty((3, 5), dtype=int)
arr.fill(35)
arr
Out[19]:
array([[35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35]])
In [21]:
# another way
arr = np.full((3, 5), 35, dtype=int)
arr
Out[21]:
array([[35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35],
       [35, 35, 35, 35, 35]])
Action [6]: Create a 2-dimensional array of size (3, 5) that is composed of random integers generated by numpy.
In [29]:
np.random.randint(0, 10, (3, 5))
Out[29]:
array([[8, 1, 3, 8, 8],
       [9, 2, 9, 0, 5],
       [9, 3, 2, 2, 7]])
Action [7]: Return the values [24, 25, 26] from the array below by using slicing.
In [39]:
arr = np.arange(30)
In [40]:
arr[24:27]
Out[40]:
array([24, 25, 26])
Action [8]: Return the values [14, 16, 18, 20, 22] from the array below by slicing.
In [41]:
arr = np.arange(24).reshape((12, 2))
In [47]:
arr[7:, 0]
Out[47]:
array([14, 16, 18, 20, 22])

Filling an array with data

When learning about lists and dictionaries we learned that a convenient way to use these objects is to first create an empty list or dictionary and then fill it with values as you iterate over an object and then append items to the list, or add new key/val pairs to the dictionary. In numpy you have to do things a bit differently. You start by creating an array that is the full size that you plan to work with, initialized with some null value like zeros, and then you update the values with new data.

The difference between these two approaches may seem subtle, but it can lead to huge speed improvements when done correctly in numpy. This is because numpy essentially reserves space in your computers memory for the size of the array, whereas a list that changes size as you extend it will need to keep updating the object in memory. As explained in your reading, this allows numpy to perform calculations using compiled functions in faster languages like C and fortran; it's a workaround that allows us to write pretty Python code but still benefit from super fast speed.

These details only begin to matter when you do pretty high level computing, but it's worth learning the motivation for why we use numpy, and why the code looks the way that it does.

In [48]:
# common with lists: start with an empty list and fill it as you iterate
empty = []
for i in range(100):
    empty.append(i)
In [49]:
# filling an empty array doesn't work like with lists
empty = np.array([])
for i in range(100):
    # ... a function does not exist to extend the size of arrays.
    # ... only to generate new arrays of a given size
    pass
In [50]:
# instead, you create the full sized array with null values and update them by indexing
empty = np.zeros(100)
for i in range(100):
    empty[i] = i

Biological examples

Question [9]: In your reading by Yandell and Ence they describe the use of summary statistics for describing the completeness and contiguity of a genome assembly. In Box 1 this includes a description of the N50 statistic. Using Markdown explain what an N50 contig size, and an N50 scaffold size. Is bigger or smaller N50 better?

Response:

N50 is the length of the smallest scaffold or contig which, along with all larger contigs or scaffolds, contains half of the sequence of the genome. Bigger is better.

Computing N50 with numpy

Action [10]: Insert short comments at the indicated positions in the 10 code cells below to describe what each chunk of code is doing. Hint: Look at the document string of each function interactively by putting your cursor inside of the parentheses and holding shift and pressing tab; Look in your assigned Chapter 2 reading; google the function name; and try to infer based on the returned output and the name of the function.
In [51]:
# run this first so that we all start from the same random seed
import numpy as np
np.random.seed(12345)
In [52]:
# 1. comment: generate 50 random integers between 1K-50K
contig_sizes = np.random.randint(1000, 50000, size=50)
contig_sizes
Out[52]:
array([21962, 12749,  3177, 20876, 45457,  5094, 21862, 29005, 17930,
       41477, 48873, 28766, 41251,  7798, 23633, 20547, 24287, 22878,
       16568, 42842, 14001, 46323, 49205,  6842, 31983, 37907, 17615,
        4511, 10503, 30025, 18239, 27660,  3624, 38251,  3812, 34916,
       36983, 49499, 42297, 40019, 19821, 23562,  3419, 39917, 13514,
        5204, 44269, 24611, 44091, 12788])
In [54]:
# 2. comment: sort integer array by size
contig_sizes.sort()
contig_sizes
Out[54]:
array([ 3177,  3419,  3624,  3812,  4511,  5094,  5204,  6842,  7798,
       10503, 12749, 12788, 13514, 14001, 16568, 17615, 17930, 18239,
       19821, 20547, 20876, 21862, 21962, 22878, 23562, 23633, 24287,
       24611, 27660, 28766, 29005, 30025, 31983, 34916, 36983, 37907,
       38251, 39917, 40019, 41251, 41477, 42297, 42842, 44091, 44269,
       45457, 46323, 48873, 49205, 49499])
In [55]:
# 3. comment: reverse the order of the array
contig_sizes = contig_sizes[::-1]
contig_sizes
Out[55]:
array([49499, 49205, 48873, 46323, 45457, 44269, 44091, 42842, 42297,
       41477, 41251, 40019, 39917, 38251, 37907, 36983, 34916, 31983,
       30025, 29005, 28766, 27660, 24611, 24287, 23633, 23562, 22878,
       21962, 21862, 20876, 20547, 19821, 18239, 17930, 17615, 16568,
       14001, 13514, 12788, 12749, 10503,  7798,  6842,  5204,  5094,
        4511,  3812,  3624,  3419,  3177])
In [57]:
# 4. comment: get the total sum (e.g., assembly size)
total_len = contig_sizes.sum()
total_len
Out[57]:
1272443
In [58]:
# 5. comment: get half of the total size
half_total_len = total_len / 2
half_total_len
Out[58]:
636221.5
In [59]:
# 6. comment: make an array of zeros
contig_sum_lens = np.zeros(contig_sizes.size, dtype=int)
contig_sum_lens
Out[59]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
In [60]:
# 7. comment: interate over the items in the empty array and
#             fill each index with the sum of all up to that point. 
for i in range(contig_sizes.size):
    contig_sum_lens[i] = contig_sizes[i:].sum()
contig_sum_lens
Out[60]:
array([1272443, 1222944, 1173739, 1124866, 1078543, 1033086,  988817,
        944726,  901884,  859587,  818110,  776859,  736840,  696923,
        658672,  620765,  583782,  548866,  516883,  486858,  457853,
        429087,  401427,  376816,  352529,  328896,  305334,  282456,
        260494,  238632,  217756,  197209,  177388,  159149,  141219,
        123604,  107036,   93035,   79521,   66733,   53984,   43481,
         35683,   28841,   23637,   18543,   14032,   10220,    6596,
          3177])
In [61]:
# 8. comment: calculate which values are > half the total length
which_contigs_longer_than_half = contig_sum_lens > half_total_len
which_contigs_longer_than_half
Out[61]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False])
In [62]:
# 9. comment: keep the values that answered True above
contigs_longer_than_half = contig_sizes[which_contigs_longer_than_half]
contigs_longer_than_half
Out[62]:
array([49499, 49205, 48873, 46323, 45457, 44269, 44091, 42842, 42297,
       41477, 41251, 40019, 39917, 38251, 37907])
In [63]:
# 10. comment: keep the lowest value from above, which is the N50
n50 = contigs_longer_than_half.min()
n50
Out[63]:
37907
Question [11]: Summarize in Markdown below how the ten lines of code above accomplish the task of computing N50 starting with a list of contig lengths.

Response:

We created a random array of integers between 1K and 50K and sought to calculate the N50. This was done by sorting them by order of length and then creating a new array that calculated the ordered sum of contigs starting with the largest and working towards the smallest. A mask was then created to select only those sums that are greater than half of the genome size, and the smallest is the N50.
Action [12]: Bonus challenge (optional/not-graded): combine the lines of code above together into a function that takes the list of contig lengths as an input argument (something you would typically have after completing a genome assembly) and returns the N50 statistic. Test your function to see if it works. If you get stuck on this do not worry.
In [ ]:
 
In [66]:
def calculate_N50_from_contig_sizes(contig_sizes):

    # 2. comment: sort integer array by size
    contig_sizes.sort()
    contig_sizes

    # 3. comment: reverse the order of the array
    contig_sizes = contig_sizes[::-1]
    contig_sizes

    # 4. comment: get the total sum (e.g., assembly size)
    total_len = contig_sizes.sum()
    total_len

    # 5. comment: get half of the total size
    half_total_len = total_len / 2
    half_total_len

    # 6. comment: make an array of zeros
    contig_sum_lens = np.zeros(contig_sizes.size, dtype=int)
    contig_sum_lens

    # 7. comment: interate over the items in the empty array and
    #             fill each index with the sum of all up to that point. 
    for i in range(contig_sizes.size):
        contig_sum_lens[i] = contig_sizes[i:].sum()
    contig_sum_lens

    # 8. comment: calculate which values are > half the total length
    which_contigs_longer_than_half = contig_sum_lens > half_total_len
    which_contigs_longer_than_half

    # 9. comment: keep the values that answered True above
    contigs_longer_than_half = contig_sizes[which_contigs_longer_than_half]
    contigs_longer_than_half

    # 10. comment: keep the lowest value from above, which is the N50
    n50 = contigs_longer_than_half.min()
    
    return n50
In [68]:
# run this first so that we all start from the same random seed
import numpy as np
np.random.seed(12345)
contig_sizes = np.random.randint(1000, 50000, size=50)

# run the new function
calculate_N50_from_contig_sizes(contig_sizes)
Out[68]:
37907
Action: Save and Download this notebook as HTML to upload to courseworks.