Programming and Data Science for Biology
(EEEB G4050)

Lecture 8: Classes & scientific python (numpy/pandas/scipy)

Outline:

  • Class skeletons
  • Scientific Python packages

Python Classes

Object-oriented programming design principle. Keep attributes and functions associated with the data they are intended to operate on. Many packages are organized around a few class objects.


                      # a simple class with an init function
                      class Simple:
                          def __init__(self, name):
                              self.name = name

                      # an instance of the class
                      inst = Simple("deren")
                      print(inst.name)
                    

Python class objects

A design feature for writing cleaner code.


                            class Analysis:
                                def __init__(self, data):
                                    self.data = data

                                def subsampled_linear_regression(self, seed):
                                    ...
                                    return x, y

                                def kmeans_clustering(self, column_name):
                                    ...
                                    return clusttable

                            # init an Analysis object once
                            analysis = Analysis("./data.csv")

                            # use object repeatedly; same data, but different arguments or functions.
                            res1 = analysis.subsampled_linear_regression(seed=123)
                            res2 = analysis.subsampled_linear_regression(seed=321)
                            res3 = analysis.kmeans_clustering("leaf_size")
                        

Classes: start w/ a skeleton


                            class Simulator:
                                def __init__(self, arg1, arg2):
                                    # store input args and check they are valid
                                    self.arg1 = arg1
                                    self.arg2 = arg2
                                    self.check_args()

                                    # a dict to store results
                                    self.results = {}

                                def check_args(self):
                                    "checks that args stored to self are valid."
                                    pass

                                def setup_model(self):
                                    "organizes args into a list of strings for entering to subprocess"
                                    pass

                                def run_simulation(self):
                                    "uses subprocess to call a simulator tool on model list args"
                                    pass

                                def parse_and_format_results(self):
                                    "fills self.results with values from simulations"
                                    pass

                                def run(self):
                                    "run complete simulation procedure"
                                    self.setup_model()
                                    self.run_simulation()
                                    self.parse_and_format_results()

                            # init an Simulator object with a set of arguments
                            sim = Simulator(arg1=5, arg2=0.555)

                            # the run function here calls many other functions
                            sim.run()

                            # access results
                            print(sim.results)
                        

Class: skeleton expanded example


                        class Simulator:
                            def __init__(self, arg1, arg2):
                                # store input args and check they are valid
                                self.arg1 = arg1
                                self.arg2 = arg2
                                self.check_args()

                                # storage objects with values filled during .run()
                                self.model = []
                                self.out = ""
                                self.results = {}

                            def check_args(self):
                                "checks that args stored to self are valid."
                                if self.arg1 > 10:
                                    raise ValueError("arg1 cannot be > 10")

                            def setup_model(self):
                                "organizes args into a list of strings for entering to subprocess"
                                self.model = ["simulator", "-c", str(self.arg1), "-k", str(self.arg2)]

                            def run_simulation(self):
                                "uses subprocess to call a simulator tool"
                                proc = subprocess.run(self.model, stdout=subprocess.PIPE, check=True)
                                self.out = proc.stdout.decode()

                            def parse_and_format_results(self):
                                "fills self.results with values from simulations"
                                self.results['mean'] = float(self.out.split()[0])

                            def run(self):
                                "run complete simulation procedure"
                                self.setup_model()
                                self.run_simulation()
                                self.parse_and_format_results()

                        # init an Simulator object with a set of arguments
                        sim = Simulator(arg1=5, arg2=0.555)

                        # the run function here calls many other functions
                        sim.run()

                        # access results
                        print(sim.results)

                    

Python Data Science

Numpy is the core of Python Data Science. Composed of custom class objects
for storing array data for numerical processing using compiled functions.



Numpy basics

Numpy arrays are like lists (but very different)


                                import numpy as np

                                # init an array object
                                arr = np.array([0, 1, 2, 3])

                                # you can index or slice an array like a list
                                print(arr[0])
                                print(arr[1:3])

                                # they are also mutable
                                arr[0] = 100
                                print(arr)
                            

                                0
                                [1 2]
                                [100   1   2   3]
                            

Numpy basics

Unlike lists, their operations are broadcast


                                # adding two lists concatenates the list objects
                                list1 = [0, 1, 2]
                                list2 = [2, 3, 4]
                                print(list1 + list2)

                                # adding two arrays sums their contents (if numeric types)
                                arr1 = np.array([0, 1, 2])
                                arr2 = np.array([2, 3, 4])
                                print(arr1 + arr2)
                            

                                [0, 1, 2, 2, 3, 4]
                                [2 4 6]
                            

Numpy basics

Numpy arrays can be multi-dimensional


                                # create an array full of zeros that is 3 x 4 x 5
                                arr = np.zeros(shape=(3, 4, 5))
                                print(arr[0])
                            

                                [[0. 0. 0. 0. 0.]
                                 [0. 0. 0. 0. 0.]
                                 [0. 0. 0. 0. 0.]
                                 [0. 0. 0. 0. 0.]]
                            

Numpy basics

Subpackages provide further operations (e.g., linear algebra, random sampling)


                                # outcome of 3 coin flips, repeated 10 times:
                                arr = np.random.binomial(n=1, p=0.5, size=(10, 3))
                                print(arr)
                            

                                [[1 0 0]
                                 [1 1 1]
                                 [0 0 0]
                                 [0 1 0]
                                 [0 1 1]
                                 [1 0 0]
                                 [0 0 1]
                                 [1 1 1]
                                 [0 1 0]
                                 [1 1 1]]
                            

Numpy basics

Functions of arrays include many statistical operations. These can
be broadcast over specific dimensions.


                                # calculate mean by row, or by column.
                                arr = np.random.binomial(n=1, p=0.5, size=(10, 3))
                                print(arr.mean(axis=0))
                                print(arr.mean(axis=1))
                            

                                [0.7 0.4 0.5]
                                [0.33333333 0.66666667 0.66666667 0.33333333 0.33333333 0.33333333
                                 0.66666667 0.         1.         1.        ]
                            

Numpy routines

Create an empty array and then fill it. Unlike lists, appending is not efficient.


                                arr = np.zeros(shape=(3, 10, 10))
                                for i in range(3):
                                    arr[i] = i
                                print(arr)
                            

                                [[[0. 0.]
                                  [0. 0.]]

                                 [[1. 1.]
                                  [1. 1.]]

                                 [[2. 2.]
                                  [2. 2.]]]
                            

Numpy routines

Arrays have attributes including size, shape, dtype.


                                arr = np.arange(10, dtype=int)
                                print(arr.size)    # n cells
                                print(arr.shape)   # dimensions
                                print(arr.dtype)   # data type (e.g., int8 uses less memory than int16)
                                print(arr.nbytes)  # space in memory
                            

                                10
                                (10,)
                                int64
                                80
                            

Numpy routines

Arrays have functions for operating on all or part.


                                arr = np.arange(20)
                                print(arr)             # [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
                                print(arr.min())       # 0  : lowest value
                                print(arr.argmin())    # 0  : index of the lowest value
                                print(arr.max())       # 19 : highest value
                                print(arr.argmax())    # 19 : index of highest value
                                print(arr.mean())      # 9.5: mean
                                print(arr.std())       # 5.766281297335398: std
                            

                                [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
                                0
                                0
                                19
                                19
                                9.5
                                5.766281297335398
                            

Pandas basics

Pandas provides a wrapper around numpy arrays that provides row (index) and column names
as well as many additional statistical operations.


                                import pandas as pd

                                # create a Pandas DataFrame by taking a numpy array as input
                                print(pd.DataFrame(arr))
                            

                                   0  1  2
                                0  1  0  0
                                1  1  0  1
                                2  1  1  0
                                3  0  0  1
                                4  0  0  1
                                5  1  0  0
                                6  1  1  0
                                7  0  0  0
                                8  1  1  1
                                9  1  1  1
                            

Pandas basics

Data can be accessed more easily from dataframes by using indexing and slicing
not only on indices but also on label names (many options).


                                # create a Pandas DataFrame by taking a numpy array as input
                                df = pd.DataFrame(arr, columns=['a', 'b', 'c'])
                            

                                   a  b  c
                                0  1  0  0
                                1  1  0  1
                                2  1  1  0
                                3  0  0  1
                                4  0  0  1
                                5  1  0  0
                                6  1  1  0
                                7  0  0  0
                                8  1  1  1
                                9  1  1  1
                              

Pandas basics

Selecting the first column ('a'). Multiple options for 'getting' values.


                                # select the column dict-style
                                df['a']

                                # loc: indexing by names
                                df.loc[:, 'a']

                                # iloc: indexing by index (like an array)
                                df.iloc[:, 0]
                            

Pandas basics

Selecting the first column ('a'). Fewer options for 'setting' values: use loc or iloc.


                                # set all values in column 'a' to be 0
                                df.loc[:, 'a'] = 0

                                # or, set values in column 'a' to be the following 10 values.
                                df.loc[:, 'a'] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

                                # set all values in column 0 to be 0
                                df.iloc[:, 0] = 0

                                # or, set values in column 0 to be the following 10 values.
                                df.iloc[:, 0] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

                                # don't do this: setting on a 'get' operation raises a warning
                                df['a'] = 0

                                # don't do this: setting with wrong size causes an error (size=3 here instead of 10)
                                df.loc[:, 0] = [0, 1, 0]
                            

Numpy & Pandas (why and when to use)

Generally much faster than standard Python code. If your data is array-like
then it almost always makes sense to use numpy or pandas instead of standard
Python objects like lists. For data analyses you should mostly use standard Python
code only to get your data into an array-like format.

Numpy v Pandas (when to use each)

If your data is of a single type (all integers or all floats) then numpy is simpler, faster, and uses less memory. It is bare metal speed. But if your data is composed of mixed types (e.g., a CSV data file with strings, integers and floats mixed) then pandas is much better for keeping track of your dtypes and operating on them appropriately.

tldr;

CSV data = use pandas;
speed-intensive operations within your program = use numpy;
doing data science = use both;