Python Data Science
Numpy and Pandas
The `numpy` and `pandas` libraries are the core of data science in Python. The numpy `ndarray` and pandas `DataFrame` are custom classes that can be used to store, visualize, and operate on structured data. Here we will explore how you can use these classes within your own code to improve efficiency and add power and flexibility.
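As a quick orientation, the following minimal sketch shows the two classes side by side. The variable names, column names, and values are invented purely for illustration and are not part of the course materials.

```python
import numpy as np
import pandas as pd

# A numpy ndarray stores values of a single type and supports vectorized math.
lengths = np.array([5.1, 4.9, 6.3, 5.8])
print(lengths.mean())   # mean of all values, no explicit loop needed
print(lengths * 10)     # elementwise multiplication

# A pandas DataFrame stores labeled columns, which may differ in type.
# The column names and values here are hypothetical.
df = pd.DataFrame({
    "length": lengths,
    "group": ["a", "a", "b", "b"],
})
print(df[df["group"] == "b"])   # select rows with a boolean mask
print(df["length"].max())       # column-wise summary
```

Both objects expose operations that act on the whole container at once, which is the style the challenges below ask you to adopt.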
Recap
Over the past few sessions you have been developing a micro-project in the `wf.ipynb` notebook. The goal was to create simple functions to implement the Wright-Fisher process of genetic drift, and then to wrap these functions in a class. At the end of the in-class exercise we saw a bit about performance evaluation, and how small changes can have big effects on runtime.
Numpy Challenge: Add data science to your micro-project
Your challenge today is to return to your program and re-implement the internal data structures and code to use numpy arrays and vectorized functions, instead of native Python lists and standard library functions. You will then do some performance evaluation to see how large a difference this makes.
To complete this assignment you will need to do the following:
- Add a new parameter to your `__init__` method called `with_np` and set the default to `False`. Store this value as an attribute on the instance (e.g. `self.with_np`).
- Add a new conditional branch inside `__init__` to test for this flag, and if `True`, implement the `self.pop` attribute using numpy arrays.
- Modify your `step()` method to test the `with_np` attribute, and if `True`, use numpy vectorized functions to replace `random.choice()`.
- In a new cell, instantiate two new objects of type `Population`, one called `pop` and the other `pop_np`. Set the population size to 100 and set the `with_np` flag appropriately given the name of each object.
- Create two new cells that use `%%timeit` to test the performance of each of these. Each cell should contain one call to `step()`, initially setting `ngens=1000`, with one cell for `pop` and the other for `pop_np`.
- Evaluate the performance of the numpy code across a range of population sizes and numbers of generations. In a new cell use markdown to record and document your findings. What did you find and what do you conclude from this?
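The steps above might look something like the following sketch. It is only one possible shape: the constructor arguments `N` and `f`, the 0/1 allele encoding, and the body of `step()` are assumptions made for illustration, and your own `Population` class from `wf.ipynb` will likely differ. Because `%%timeit` only works inside a notebook, the standard-library `timeit` module stands in for it here.

```python
import random
import timeit

import numpy as np


class Population:
    """Hypothetical Wright-Fisher population; your own class may differ."""

    def __init__(self, N=10, f=0.5, with_np=False):
        self.N = N
        self.with_np = with_np
        # Start with round(N * f) copies of allele 1; the rest are allele 0.
        n_derived = round(N * f)
        alleles = [1] * n_derived + [0] * (N - n_derived)
        if self.with_np:
            self.pop = np.array(alleles)
        else:
            self.pop = alleles

    def step(self, ngens=1):
        """Advance ngens generations by resampling N alleles with replacement."""
        for _ in range(ngens):
            if self.with_np:
                # One vectorized call draws the whole next generation.
                self.pop = np.random.choice(self.pop, size=self.N, replace=True)
            else:
                # One standard-library call per sampled allele.
                self.pop = [random.choice(self.pop) for _ in range(self.N)]


pop = Population(N=100, with_np=False)
pop_np = Population(N=100, with_np=True)

# Time three repeats of 1000 generations for each implementation.
t_list = timeit.timeit(lambda: pop.step(ngens=1000), number=3)
t_np = timeit.timeit(lambda: pop_np.step(ngens=1000), number=3)
print(f"lists: {t_list:.3f} s   numpy: {t_np:.3f} s")
```

In a notebook, the timing lines would instead become two cells, each starting with `%%timeit` and containing a single `step(ngens=1000)` call. Repeating the comparison over several values of `N` and `ngens` gives the data for the final write-up.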
Pandas Challenge: Revisiting the Iris data exploration
Now we will revisit the Iris data once more. Whereas before we used bash, and then pure Python, to manipulate this data, now we will use pandas and the `DataFrame` class.
- Create a new notebook called `iris_pd.ipynb` in the same directory as `wf.ipynb` (it should be in `hack-5-python/notebooks`).
- Import any modules you may need in a new cell at the top of this notebook.
- Fetch the iris data using the `requests` module.
- Load the iris data into a pandas DataFrame. This version of the data does not have a 'header' to indicate the meanings of the columns, so you will need to know that the 5 columns in the dataset are sepal length and width, petal length and width, and species ID. (Hint: use the `names` parameter of `pd.read_csv`.)
- First, as before, fix the rows with the misspelled species IDs and remove the rows with `NA`.
- Use the `describe()` method to show summary statistics over the numerical-valued columns. This is cool, but maybe not very interesting because it doesn't show us anything about differences among species.
- We can split the dataframe into 'groups' based on shared column values using the `groupby` method. `groupby` returns an iterable over tuples in the form of (group label, group data). Combine `groupby` and `describe` to print summary stats for each of the three species. (Hint: There are many ways to do this; you might use a `for` loop and unpack the tuples with indexing inside the loop.)
- What conclusions do you draw from an informal inspection of these values?
- Challenge: Perform formal tests for differences in group means using `ttest_ind` from the `scipy.stats` module (you may need to `conda install -c conda-forge scipy` first). Hint: You can access the data for a specific group using the `get_group()` method of the groupby object, e.g. if I called my groupby object `gb` I could say `gb.get_group("Iris-setosa")` and it would return a DataFrame with the data for only these samples.
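The workflow above can be sketched on a tiny stand-in dataset. Everything specific here is an assumption for illustration: the `raw` string replaces the text you would fetch with `requests`, the column names are one possible choice, and the two misspellings are hypothetical placeholders for whatever errors appear in the real file. With only a few rows, the test result is not meaningful; the point is the sequence of calls.

```python
import io

import pandas as pd
from scipy.stats import ttest_ind

# Stand-in for the downloaded text (a few rows in the same headerless format).
# The misspelled species IDs and the NA value are placed here deliberately.
raw = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setsa
NA,3.1,1.5,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolour
"""

# The file has no header row, so supply column names explicitly.
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

# Fix the (hypothetical) misspellings, then drop rows with missing values.
df["species"] = df["species"].replace({
    "Iris-setsa": "Iris-setosa",
    "Iris-versicolour": "Iris-versicolor",
})
df = df.dropna()

# Iterating over a groupby object yields (group label, group data) tuples.
gb = df.groupby("species")
for label, group in gb:
    print(label)
    print(group.describe())

# Two-sample t-test on sepal length between two of the species.
setosa = gb.get_group("Iris-setosa")
versicolor = gb.get_group("Iris-versicolor")
result = ttest_ind(setosa["sepal_length"], versicolor["sepal_length"])
print(result.statistic, result.pvalue)
```

The same pattern extends to the other measurement columns and to the remaining pairs of species once the full dataset is loaded.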