# Python Data Science

## Numpy and Pandas
The numpy and pandas libraries are the core of data science in Python.
The numpy `ndarray` and pandas `DataFrame` are custom classes used to store,
visualize, and operate on structured data. Here we will explore how you can
use these classes within your own code to leverage these libraries for
greater efficiency, power, and flexibility.
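
As a quick, minimal illustration (not part of the assignment), the toy example below creates one of each object type; the array contents and column names are arbitrary:

```python
import numpy as np
import pandas as pd

# a numpy ndarray: a fast, fixed-type, N-dimensional array
arr = np.arange(12).reshape(3, 4)
print(arr.mean(axis=0))          # column means, computed in compiled code

# a pandas DataFrame: labeled tabular data built on top of numpy arrays
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})
print(df.groupby("species").mean())   # group-wise summary in one call
```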
## Recap
Over the past few sessions you have been developing a micro-project in the `wf.ipynb`
notebook. The goal was to create simple functions to implement
the Wright-Fisher process of genetic drift, and then to wrap these functions
in a class. At the end of the in-class exercise we saw a bit about performance
evaluation, and how small changes can have big effects on runtime. Here
we will build on that work using numpy and pandas.
## Numpy Challenge: Add data science to your micro-project
Your challenge today is to return to your program and re-implement the internal data structures and code to use numpy arrays and vectorized functions instead of native Python lists and standard-library functions. You will then do some performance evaluation to see how large a difference this makes.
To complete this assignment you will need to do the following:
- Add a new parameter to your `__init__` method called `with_np` and set the default to False. Store this value as a class attribute (a rough sketch of one possible layout is shown after this list).
- Add a new conditional branch inside `__init__` to test for this flag, and if True implement the `self.pop` attribute using numpy arrays.
- Modify your `step()` method to test the `with_np` attribute, and if True use numpy vectorized functions to replace `random.choice()`.
- In a new cell, instantiate two new objects of type `Population`, one called `pop` and the other `pop_np`; set the population size to 100 and set the `with_np` flag appropriately given the name of each object.
- Create two new cells that use `%%timeit` to test the performance of each object. Each cell should contain one call to `step()`, initially setting `ngens=1000`, with one cell for `pop` and the other for `pop_np`.
- Evaluate the performance of the numpy code across a range of population sizes and numbers of generations. In a new cell, use markdown to record and document your findings. What did you find, and what do you conclude from this?
## Pandas Challenge: Revisiting the Iris data exploration
Now we will revisit the Iris data. Whereas before we used bash,
and then pure Python, to manipulate this data, now we will use pandas
and the DataFrame class.
- Create a new notebook called `iris_pd.ipynb` in the same directory as `wf.ipynb` (it should be in `hack-5-python/notebooks`).
- Import any modules you may need in a new cell at the top of this notebook.
- Fetch the iris data using the `requests` module.
- Load the iris data into a pandas DataFrame. This version of the data does not have a 'header' to indicate the meanings of the columns, so you will need to know that the 5 columns in the dataset are sepal length and width, petal length and width, and species ID. (Hint: use the `names` parameter of `pd.read_csv`.)
- First, again fix the rows with the misspelled species IDs and remove the rows with `NA`.
- Use the `describe()` method to show summary statistics over the numerical-valued columns. This is cool, but maybe not very interesting because it doesn't show us anything about differences among species.
- We can split the dataframe into 'groups' based on shared column values using the `groupby` method. `groupby` returns an iterable over tuples in the form (group label, group data). Combine `groupby` and `describe` to print summary stats for each of the three species. (Hint: there are many ways to do this; you might use a `for` loop and unpack the tuples with indexing inside the loop.)
- What conclusions do you draw from an informal inspection of these values?
- Challenge: Perform formal tests for differences in group means using `ttest_ind` from the `scipy.stats` module (you may need to `conda install -c conda-forge scipy` first). Hint: you can access the data for a specific group using the `get_group()` method of the groupby object, e.g., if I called my groupby object `gb` I could say `gb.get_group("Iris-setosa")` and it would return a DataFrame with the data for only these samples. (A sketch of one possible workflow is shown after this list.)
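
The sketch below shows one possible end-to-end workflow under a few assumptions: the URL is a placeholder for the iris data link used in the earlier sessions, the column names are illustrative, and the misspelling fix shows only a single example replacement (match it to the typos you actually find in the data).

```python
import requests
import pandas as pd
from io import StringIO
from scipy.stats import ttest_ind

# placeholder URL: substitute the iris data link from the earlier sessions
URL = "http://example.com/iris-data-dirty.csv"
csv_text = requests.get(URL).text

# the file has no header row, so supply column names explicitly
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(StringIO(csv_text), names=names)

# fix misspelled species IDs (example replacement only) and drop NA rows
df["species"] = df["species"].replace({"Iris-setsa": "Iris-setosa"})
df = df.dropna()

# summary stats for the whole dataset, then per species via groupby
print(df.describe())
for label, group in df.groupby("species"):
    print(label)
    print(group.describe())

# formal test for a difference in mean sepal length between two species
gb = df.groupby("species")
setosa = gb.get_group("Iris-setosa")["sepal_length"]
versicolor = gb.get_group("Iris-versicolor")["sepal_length"]
print(ttest_ind(setosa, versicolor))
```

The same `ttest_ind` call can be repeated for other column and species pairs to test each trait separately.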