Skip to content

data science

Python Data Science

Numpy and Pandas

The numpy and pandas libraries are the core of data science in Python. The numpy ndarray and pandas DataFrame class objects are custom classes that can be used to store, visualize, and operate on structured data. Here we will explore how you can use these custom classes within your own code to leverage the power of these modules to improve efficiency and add power and flexibility.

Recap

The past few sessions you have been developing a micro-project in the wf.ipynb notebook. The goal of this was to create simple functions to implement the Wright-Fisher process of genetic drift, and then to wrap these functions in a class. At the end of the in-class exercise we saw a bit about performance evaluation, and how small changes can have big effects on runtime. Here

Numpy Challenge: Add data science to your micro-project

Your challenge today is to return to your program and re-implement the internal data structures and code to use numpy arrays and vectorized functions, instead of python native lists and standard library functions. You will then do some performance evaluation to see how large a difference this makes.

To complete this assignment you will need to do the following:

  • Add a new parameter to your __init__ method called with_np and set the default to False. Store this value as a class attribute.
  • Add a new conditional branch inside __init__ to test for this flag, and if True you will implement the self.pop attribute using numpy arrays.
  • Modify your step() method to test the with_np attribute, and if True use numpy vectorized functions to replace random.choice().
  • In a new cell and instantiate two new objects of type Population, one called pop and the other pop_np, set the population size to 100 and set the with_np flag appropriately given the name of these objects.
  • Now you will create two new cells to use %%timeit to test the performance of each of these. In each cell you will have one call to step() initially setting ngens=1000, with one cell for pop and the other for pop_np.
  • Evaluate the performance of the numpy code across a range of population sizes and numbers of generations. In a new cell use markdown to record and document your findings. What did you find and what do you conclude from this?

Pandas Challenge: Revisiting the Iris data exploration

Now we will revisit the Iris data again, whereas before we used bash, and then pure python to manipulate this data, now we will use pandas and the DataFrame class.

  • Create a new notebook called iris_pd.ipynb in the same directory as wf.ipynb (it should be in hack-5-python/notebooks).
  • Import any modules you man need in a new cell at the top of this notebook.
  • Fetch the iris data using the requests module.
  • Load the iris data into a pandas DataFrame. This version of the data does not have a 'header' to indicate the meanings of the columns, so you will need to know that the 5 columns in the dataset are sepal length and width, petal length and width, and species ID. (Hint: use the names parameter of pd.read_csv).
  • First, again fix the rows with the misspelled species IDs and remove the rows with NA.
  • Use the describe() method to show summary statistics over the numerical valued columns. This is cool, but maybe not very intereting because it doesn't show us anything about differences among species.
  • We can split the dataframe into 'groups' based on shared column values using the groupby method. groupby returns an iterable over tuples in the form of (group label, group data). Combine groupby and describe to print summary stats for each of the three species. (Hint: There are many ways to do this, you might use a for loop and unpack the tuples with indexing inside the loop).
  • What conclusions do you draw from an informal inspection of these values?
  • Challenge: Perform formal tests for differences in group means using ttest_ind from the scipy.stats module (you may need to conda install -c conda-forge scipy first. Hint: You can access the data for a specific group using the get_group() method of the groupby object, e.g. if I called my groupby object gb I could say `gb.get_group("Iris-setosa") and it would return a DataFrame for the data for only these samples).

Assignment

Commit and push updates to your repo before Monday.