Notebook 4.2: Pandas and Gene Ontology

This notebook is associated with the following accompanied reading:

  • Yandell, Mark, and Daniel Ence. 2012. “A Beginner’s Guide to Eukaryotic Genome Annotation.” Nature Reviews Genetics 13 (5): 329–42. https://doi.org/10.1038/nrg3174.
  • NB: You do not need to read Chapter 4 of the Data Science Handbook this week.

Learning objectives:

By the end of this module you should be able to:

  1. Create and load Pandas DataFrames.
  2. Index/select elements from a Pandas DataFrame.
  3. Understand the difference between numpy arrays and pandas dataframes.

Introduction to Pandas DataFrames

The pandas library is similar to numpy in that it contains a core object type, the DataFrame, and a set of functions for computing over axes of this object. Generally, DataFrames are easier to learn and understand than numpy arrays because they have labels, and so it can be easier to keep track of what you're doing, and because you can combine different datatypes (e.g., ints, floats, strings) together in a single DataFrame, whereas in numpy you usually work with just one dtype at a time.

One way to think about Pandas DataFrames is that they are essentially a pretty wrapper around one or more numpy arrays, which put labels on the rows and columns, and allow you to index using row and column names instead of just indices. But Pandas does more than that as well, and provides many very useful functions for working with data in a table.

If you have used the R programming language before then Pandas may feel familiar, it is similar to the data frame object in R, and is intended for holding data for doing statistics operations. Your reading provides many examples of interesting uses for Pandas. Again, this notebook is primarily meant to test your knowledge after finishing the reading.

Loading pandas

Numpy and Pandas are often used together.

In [1]:
import pandas as pd
import numpy as np

Creating DataFrames

You can create a Pandas DataFrame from scratch in a variety of ways. I find that the easiest way is to use a Python dictionary to enter key/value pairs to the pd.DataFrame() function. Remember from our earlier notebook that you can create a dictionary using curly brackets. Here we create a DataFrame by providing it a dictionary where the key is a string (this will become the column name) and the value is a numpy array of randomly generated values (this will become the data). Each key/pair in the dictionary is written on a separate line to make it easier to read, but they are all surrounded by a single set of curly brackets, and so interpreted as a single dictionary. It is important that all of the data we enter is of the same length, since this will become the column data of the table. But, as you can see, it is OK that the data in each column is a different datatype, here we combine floats, strings, and ints. Pandas is also nice because it returns the table data in the output cell as a nicely formatted HTML table. Scroll your cursor over the table and you can see that cells light up as you highlight them. It's nice.

In [2]:
# create an array with randomly generated values 
df = pd.DataFrame({
    "column1": np.random.normal(0, 1, 10),
    "column2": np.random.choice(list("ACGT"), 10), 
    "column3": np.random.randint(0, 5, 10),
})

# return the dataframe
df
Out[2]:
column1 column2 column3
0 -0.738636 T 3
1 -1.487751 C 3
2 0.640317 G 1
3 1.263995 G 2
4 -0.134444 C 0
5 0.821132 C 4
6 1.291565 T 1
7 0.513995 C 1
8 0.952997 C 2
9 1.604751 A 3

Indexing DataFrames

You can index DataFrames by their column or row names (row names are called the index in pandas), or by using numeric indices like in numpy. The four options below accomplish the same thing, they select all data in column1.

In [3]:
# select columns by name in a dictionary-like way
df["column1"]
Out[3]:
0   -0.738636
1   -1.487751
2    0.640317
3    1.263995
4   -0.134444
5    0.821132
6    1.291565
7    0.513995
8    0.952997
9    1.604751
Name: column1, dtype: float64
In [4]:
# select column by name in a simple object-like way
df.column1
Out[4]:
0   -0.738636
1   -1.487751
2    0.640317
3    1.263995
4   -0.134444
5    0.821132
6    1.291565
7    0.513995
8    0.952997
9    1.604751
Name: column1, dtype: float64
In [5]:
# select column by name using fancy indexing
df.loc[:, "column1"]
Out[5]:
0   -0.738636
1   -1.487751
2    0.640317
3    1.263995
4   -0.134444
5    0.821132
6    1.291565
7    0.513995
8    0.952997
9    1.604751
Name: column1, dtype: float64
In [6]:
# select column by index using fancy indexing
df.iloc[:, 0]
Out[6]:
0   -0.738636
1   -1.487751
2    0.640317
3    1.263995
4   -0.134444
5    0.821132
6    1.291565
7    0.513995
8    0.952997
9    1.604751
Name: column1, dtype: float64

Why so many ways to index?

  • The first dictionary-like way is nice since if you're familiar with dictionary objects then you can substitute dataframes in for these objects easily.

  • The second object-like mode is fast to type and can be selected using tab-completion, so it's easy to view all the column choices in an object without having to print it; however in this mode you cannot select column names with odd characters in them like spaces since the column name is not wrapped in quotes. This syntax is not ideal.

  • The third way of indexing, with .loc[] can combine selecting by both row and column names into a single call, so this very powerful and is generally the preferred method.

  • The final way of indexing with .iloc[] is a nice alternative, you just need to keep track that you are indexing the col or row that you intended by knowing their index.

Loading data files

Pandas has several functions for loading data from a variety of formats, but the most commonly used ones are .read_csv() or read_table(). You can provide a number of arguments to each function to tell it details of the file, such as what the delimiter is (e.g., tabs, spaces, commas), whether to skip the first N rows, whether to use the first row and column names, etc. You can use this function to load data files from a file path location on your hard disk, or, if you're connected to the internet, you can even just provide it a url of the location of a datafile that is publicly hosted online.

We will practice by loading a GFF file, which we have seen before. We can load the yeast genome GFF file directly from the NCBI FTP site where it is hosted. For a quick reminder of the GFF file format visit here. We provide the argument comment="#" below so it will skip the first few lines of the file that are not data but just comments, and we will enter in the column names by hand because it is easier than trying to parse them from the file in this case.

The DataFrame object is saved as the variable gff, which we then show the first few lines of with the function of the dataframe object called .head(). This should sound familiar. Remember when we were learning bash scripting there was a function called head to print the first few lines of a file. Pandas is implementing that same idea as a function of dataframes. We want to just peek at the first few lines to make sure the data were loaded/parsed correctly.

In [7]:
# url to the GFF file on the NCBI FTP site
url = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz"

# read csv function with arguments to parse the file
gff = pd.read_csv(
    url, 
    sep="\t", 
    header=None,
    comment="#", 
    names=("seqname", "source", "feature", "start",
           "end", "score", "strand", "frame", "attribute"),
)

# return the dataframe
gff.head()
Out[7]:
seqname source feature start end score strand frame attribute
0 NC_001133.9 RefSeq region 1 230218 . + . ID=NC_001133.9:1..230218;Dbxref=taxon:559292;N...
1 NC_001133.9 RefSeq telomere 1 801 . - . ID=id-NC_001133.9:1..801;Dbxref=SGD:S000028862...
2 NC_001133.9 RefSeq origin_of_replication 707 776 . + . ID=id-NC_001133.9:707..776;Dbxref=SGD:S0001212...
3 NC_001133.9 RefSeq gene 1807 2169 . - . ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8...
4 NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxr...

Fancy indexing

Let's repeat the exercise we implemented in our first session of trying to select rows and columns from a table of data that fit some filtering criterion that we're interested in. Here, we will try to select all of the features in the GFF that are telomeres so we can extract their start and stop positions. This is much easier and nicer looking in Pandas than it is by using bash scripting.

Here we will also introduce the concept of a boolean mask. A convenient way to filter or subselect part of a data set is to use a mask. You can create a boolean array to act as a mask for an entire column or row by asking a question with a True or False answer and broadcasting it across the row or column. For example, below we ask whether the feature is a telomere, which returns a True or False for every element in the 'feature' column.

In [8]:
# create a boolean mask of the feature column asking 'is it a telomere?'
mask = gff.feature == "telomere"

# show the first 10 values
mask.head(10)
Out[8]:
0    False
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: feature, dtype: bool

We can now apply the mask as an index or with .loc or .iloc to the gff dataframe and it will select all rows that have True, and not the ones that have False. The index (row name) on the right is retained from the entire dataframe, thus you can see that by selecting only telomeres we get element 1, 432, 434, 2257, 2259, etc.

In [9]:
# use this mask to select only rows that are telomeres
telomeres = gff[mask]

# return the first 10 value
telomeres.head()
Out[9]:
seqname source feature start end score strand frame attribute
1 NC_001133.9 RefSeq telomere 1 801 . - . ID=id-NC_001133.9:1..801;Dbxref=SGD:S000028862...
432 NC_001133.9 RefSeq telomere 229411 230218 . + . ID=id-NC_001133.9:229411..230218;Dbxref=SGD:S0...
434 NC_001134.8 RefSeq telomere 1 6608 . - . ID=id-NC_001134.8:1..6608;Dbxref=SGD:S00002886...
2257 NC_001134.8 RefSeq telomere 812379 813184 . + . ID=id-NC_001134.8:812379..813184;Dbxref=SGD:S0...
2259 NC_001135.5 RefSeq telomere 1 1098 . - . ID=id-NC_001135.5:1..1098;Dbxref=SGD:S00002887...
In [10]:
# or, we could have written the above code more concisely as a single line:
gff.loc[gff.feature == "telomere"].head()
Out[10]:
seqname source feature start end score strand frame attribute
1 NC_001133.9 RefSeq telomere 1 801 . - . ID=id-NC_001133.9:1..801;Dbxref=SGD:S000028862...
432 NC_001133.9 RefSeq telomere 229411 230218 . + . ID=id-NC_001133.9:229411..230218;Dbxref=SGD:S0...
434 NC_001134.8 RefSeq telomere 1 6608 . - . ID=id-NC_001134.8:1..6608;Dbxref=SGD:S00002886...
2257 NC_001134.8 RefSeq telomere 812379 813184 . + . ID=id-NC_001134.8:812379..813184;Dbxref=SGD:S0...
2259 NC_001135.5 RefSeq telomere 1 1098 . - . ID=id-NC_001135.5:1..1098;Dbxref=SGD:S00002887...
Action [13]: Using pandas count the number of genes in the yeast GFF DataFrame in code cell below.
In [13]:
gff.loc[gff.feature == "gene"].shape[0]
Out[13]:
6427

Parsing attributes from the GFF

As before, much of the data of the GFF file is in the final column called attributes. We can parse this information as with any type of string, by splitting on the delimiter in the text which is a semicolon. Let's try extracting the first ten genes and see what kind of information is in their attributes.

In the code cells below I am just returning the filtered dataframe rather than saving it to variable so you can easily see the result of the operation. As you can see it is easy to explore the effect of our operations by filtering in various ways and calling the .head() function to show a small part of the result.

In [14]:
# select only gene rows
gff[gff.feature == "gene"].head(5)
Out[14]:
seqname source feature start end score strand frame attribute
3 NC_001133.9 RefSeq gene 1807 2169 . - . ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8...
7 NC_001133.9 RefSeq gene 2480 2707 . + . ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=Y...
11 NC_001133.9 RefSeq gene 7235 9016 . - . ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1...
16 NC_001133.9 RefSeq gene 11565 11951 . - . ID=gene-YAL065C;Dbxref=GeneID:851232;Name=YAL0...
20 NC_001133.9 RefSeq gene 12046 12426 . + . ID=gene-YAL064W-B;Dbxref=GeneID:851233;Name=YA...
In [15]:
# select attributes of those rows
gff[gff.feature == "gene"]["attribute"].head(5)
Out[15]:
3     ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8...
7     ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=Y...
11    ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1...
16    ID=gene-YAL065C;Dbxref=GeneID:851232;Name=YAL0...
20    ID=gene-YAL064W-B;Dbxref=GeneID:851233;Name=YA...
Name: attribute, dtype: object

DataFrames are iterable, as are their columns or rows. Below we iterate over the elements in the 'attribute' column of the rows where feature=='gene'. The iteration is done using a list-comprehension. We can do something which each selected element as it is created during the for-loop that is happening within the list-comprehension. Here, because the selected elements are strings, we perform a string operation, .split(), to split the string every time a semi-colon appears. The result is a list of lists of the items in the string that were split.

That sounds pretty complicated right? We operated on a dataframe column full of strings and got in return a list of lists of strings! This is why it is important to understand the object types in Python, since it allows you to understand how and why this was possible.

In [16]:
# split the string objects of those attribute rows
[i.split(";") for i in gff[gff.feature == "gene"]["attribute"].head(5)]
Out[16]:
[['ID=gene-YAL068C',
  'Dbxref=GeneID:851229',
  'Name=PAU8',
  'end_range=2169,.',
  'gbkey=Gene',
  'gene=PAU8',
  'gene_biotype=protein_coding',
  'locus_tag=YAL068C',
  'partial=true',
  'start_range=.,1807'],
 ['ID=gene-YAL067W-A',
  'Dbxref=GeneID:1466426',
  'Name=YAL067W-A',
  'end_range=2707,.',
  'gbkey=Gene',
  'gene_biotype=protein_coding',
  'locus_tag=YAL067W-A',
  'partial=true',
  'start_range=.,2480'],
 ['ID=gene-YAL067C',
  'Dbxref=GeneID:851230',
  'Name=SEO1',
  'end_range=9016,.',
  'gbkey=Gene',
  'gene=SEO1',
  'gene_biotype=protein_coding',
  'locus_tag=YAL067C',
  'partial=true',
  'start_range=.,7235'],
 ['ID=gene-YAL065C',
  'Dbxref=GeneID:851232',
  'Name=YAL065C',
  'end_range=11951,.',
  'gbkey=Gene',
  'gene_biotype=protein_coding',
  'locus_tag=YAL065C',
  'partial=true',
  'start_range=.,11565'],
 ['ID=gene-YAL064W-B',
  'Dbxref=GeneID:851233',
  'Name=YAL064W-B',
  'end_range=12426,.',
  'gbkey=Gene',
  'gene_biotype=protein_coding',
  'locus_tag=YAL064W-B',
  'partial=true',
  'start_range=.,12046']]

Select the gene name from each attribute

The example above selected the 'attribute' element for a subset of rows and then parsed it into a list. Now we can go further by selecting elements out of those lists of results. For example, we can select the gene names of the rows where feature=='gene'.

In [17]:
# save 'attributes' as a list for the first ten genes, like we did above.
attrs_lists = [i.split(";") for i in gff[gff.feature == "gene"].attribute.head(10)]

# iterate over each list in the list of lists
for attrs_list in attrs_lists:
    
    # iterate over each string in the list
    for attr in attrs_list:
    
        # if the first four string elements spell Name
        if attr[:4] == "Name":
            print(attr)
Name=PAU8
Name=YAL067W-A
Name=SEO1
Name=YAL065C
Name=YAL064W-B
Name=TDA8
Name=YAL064W
Name=YAL063C-A
Name=FLO9
Name=GDH3
Action [14]: Copy the code above into the cell below and modify it to try to select the "gene_biotype" of each of the first 20 genes in the DataFrame instead of the "Name".
In [33]:
# save 'attributes' as a list for the first ten genes, like we did above.
attrs_lists = [i.split(";") for i in gff[gff.feature == "gene"].attribute.head(20)]

# iterate over each list in the list of lists
for attrs_list in attrs_lists:
    
    # iterate over each string in the list
    for attr in attrs_list:
    
        # if the first four string elements spell Name
        if "gene_biotype" in attr:
            print(attr)
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
gene_biotype=protein_coding
Save and download this notebook as HTML to upload to courseworks.