A core design principle of object-oriented programming: keep attributes and functions together with the data they are intended to operate on. Many packages are organized around a few class objects.
# a simple class with an init function
class Simple:
    def __init__(self, name):
        self.name = name

# an instance of the class
inst = Simple("deren")
print(inst.name)
Classes are a design feature for writing cleaner code.
class Analysis:
    def __init__(self, data):
        self.data = data

    def subsampled_linear_regression(self, seed):
        ...
        return x, y

    def kmeans_clustering(self, column_name):
        ...
        return clusttable
# init an Analysis object once
analysis = Analysis("./data.csv")
# use the object repeatedly: same data, different arguments or functions
res1 = analysis.subsampled_linear_regression(seed=123)
res2 = analysis.subsampled_linear_regression(seed=321)
res3 = analysis.kmeans_clustering("leaf_size")
class Simulator:
    def __init__(self, arg1, arg2):
        # store input args and check they are valid
        self.arg1 = arg1
        self.arg2 = arg2
        self.check_args()

        # a dict to store results
        self.results = {}

    def check_args(self):
        "checks that args stored to self are valid."
        pass

    def setup_model(self):
        "organizes args into a list of strings for entering to subprocess"
        pass

    def run_simulation(self):
        "uses subprocess to call a simulator tool on model list args"
        pass

    def parse_and_format_results(self):
        "fills self.results with values from simulations"
        pass

    def run(self):
        "run complete simulation procedure"
        self.setup_model()
        self.run_simulation()
        self.parse_and_format_results()
# init a Simulator object with a set of arguments
sim = Simulator(arg1=5, arg2=0.555)
# the run function here calls many other functions
sim.run()
# access results
print(sim.results)
The same Simulator class again, now with each function body implemented. Note that it requires the subprocess module, and assumes a command-line program named "simulator" is installed:
import subprocess

class Simulator:
    def __init__(self, arg1, arg2):
        # store input args and check they are valid
        self.arg1 = arg1
        self.arg2 = arg2
        self.check_args()

        # storage objects with values filled during .run()
        self.model = []
        self.out = ""
        self.results = {}

    def check_args(self):
        "checks that args stored to self are valid."
        if self.arg1 > 10:
            raise ValueError("arg1 cannot be > 10")

    def setup_model(self):
        "organizes args into a list of strings for entering to subprocess"
        self.model = ["simulator", "-c", str(self.arg1), "-k", str(self.arg2)]

    def run_simulation(self):
        "uses subprocess to call a simulator tool"
        proc = subprocess.run(self.model, stdout=subprocess.PIPE, check=True)
        self.out = proc.stdout.decode()

    def parse_and_format_results(self):
        "fills self.results with values from simulations"
        self.results['mean'] = float(self.out.split()[0])

    def run(self):
        "run complete simulation procedure"
        self.setup_model()
        self.run_simulation()
        self.parse_and_format_results()
# init a Simulator object with a set of arguments
sim = Simulator(arg1=5, arg2=0.555)
# the run function here calls many other functions
sim.run()
# access results
print(sim.results)
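Storing the argument check on the class means invalid inputs fail fast, at init time. A small usage sketch of the validation above:
# constructing with an invalid argument raises the error from check_args
try:
    sim = Simulator(arg1=15, arg2=0.555)
except ValueError as err:
    print(err)  # arg1 cannot be > 10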
Numpy is the core of Python data science. It is composed of custom class objects for storing array data that is numerically processed by fast compiled functions.
Numpy arrays are like lists (but very different)
import numpy as np
# init an array object
arr = np.array([0, 1, 2, 3])
# you can index or slice an array like a list
print(arr[0])
print(arr[1:3])
# they are also mutable
arr[0] = 100
print(arr)
0
[1 2]
[100 1 2 3]
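Slices are assignable too; a single value is broadcast into every selected cell (a small sketch):
# assign one value into a slice of the array
arr = np.array([0, 1, 2, 3])
arr[1:3] = 0
print(arr)  # [0 0 2 3]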
Unlike lists, operations on arrays are broadcast element-wise
# adding two lists concatenates the list objects
list1 = [0, 1, 2]
list2 = [2, 3, 4]
print(list1 + list2)
# adding two arrays sums their contents (if numeric types)
arr1 = np.array([0, 1, 2])
arr2 = np.array([2, 3, 4])
print(arr1 + arr2)
[0, 1, 2, 2, 3, 4]
[2 4 6]
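Broadcasting also applies between an array and a single scalar, which is applied to every element:
# scalar operations are broadcast across every element
arr = np.array([0, 1, 2])
print(arr + 10)  # [10 11 12]
print(arr * 2)   # [0 2 4]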
Numpy arrays can be multi-dimensional
# create an array full of zeros that is 3 x 4 x 5
arr = np.zeros(shape=(3, 4, 5))
print(arr[0])
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
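An array's dimensions can also be changed after creation, for example by reshaping a 1-dimensional array into a 2-dimensional one:
# reshape a 1-d array of 12 values into a 3 x 4 array
arr = np.arange(12).reshape((3, 4))
print(arr.shape)  # (3, 4)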
Subpackages provide further operations (e.g., linear algebra, random sampling)
# outcome of 3 coin flips, repeated 10 times:
arr = np.random.binomial(n=1, p=0.5, size=(10, 3))
print(arr)
[[1 0 0]
[1 1 1]
[0 0 0]
[0 1 0]
[0 1 1]
[1 0 0]
[0 0 1]
[1 1 1]
[0 1 0]
[1 1 1]]
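Because these draws are random, results differ between runs; setting a seed first makes them reproducible (a small sketch using the legacy np.random.seed that matches the np.random.binomial call above):
# set a seed so that random draws are reproducible
np.random.seed(123)
arr = np.random.binomial(n=1, p=0.5, size=(10, 3))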
Array functions include many statistical operations, which can be applied over specific dimensions (axes).
# calculate the mean of each column (axis=0), or of each row (axis=1)
arr = np.random.binomial(n=1, p=0.5, size=(10, 3))
print(arr.mean(axis=0))
print(arr.mean(axis=1))
[0.7 0.4 0.5]
[0.33333333 0.66666667 0.66666667 0.33333333 0.33333333 0.33333333
0.66666667 0. 1. 1. ]
Create an array at full size and then fill it; unlike lists, appending to an array is not efficient.
arr = np.zeros(shape=(3, 2, 2))
for i in range(3):
    arr[i] = i
print(arr)
[[[0. 0.]
[0. 0.]]
[[1. 1.]
[1. 1.]]
[[2. 2.]
[2. 2.]]]
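Appending is inefficient because np.append returns a new copy of the entire array on every call, unlike list.append which modifies the list in place; a small sketch:
# np.append copies the whole array and returns a new one (slow in a loop)
arr = np.array([0, 1, 2])
arr = np.append(arr, 3)
print(arr)  # [0 1 2 3]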
Arrays have attributes including size, shape, dtype.
arr = np.arange(10, dtype=int)
print(arr.size) # n cells
print(arr.shape) # dimensions
print(arr.dtype) # data type (e.g., int8 uses less memory than int16)
print(arr.nbytes) # space in memory
10
(10,)
int64
80
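Choosing a smaller dtype directly reduces memory use. Converting the array above with astype (a small sketch, assuming the 8-byte default int shown above):
# convert to a smaller dtype to reduce memory use
small = arr.astype(np.int8)
print(small.nbytes)  # 10 (1 byte per cell instead of 8)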
Arrays have functions for operating on all or part.
arr = np.arange(20)
print(arr) # [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
print(arr.min()) # 0 : lowest value
print(arr.argmin()) # 0 : index of the lowest value
print(arr.max()) # 19 : highest value
print(arr.argmax()) # 19 : index of highest value
print(arr.mean()) # 9.5: mean
print(arr.std()) # 5.766281297335398: std
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
0
0
19
19
9.5
5.766281297335398
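These functions can also be applied to just part of an array by slicing first:
# apply a function to only part of the array
print(arr[:10].mean())  # 4.5 (mean of values 0-9)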
Pandas wraps numpy arrays to add row (index) and column labels, along with many
additional statistical operations.
import pandas as pd
# create a Pandas DataFrame from a numpy array (here, a 10 x 3 binomial array)
arr = np.random.binomial(n=1, p=0.5, size=(10, 3))
print(pd.DataFrame(arr))
0 1 2
0 1 0 0
1 1 0 1
2 1 1 0
3 0 0 1
4 0 0 1
5 1 0 0
6 1 1 0
7 0 0 0
8 1 1 1
9 1 1 1
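The statistical operations carry over, but return labeled results; for example, column means come back as a Series indexed by column name:
# pandas versions of statistical operations return labeled results
print(pd.DataFrame(arr).mean())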
Data can be accessed from dataframes by indexing and slicing not only on integer
positions but also on label names (many options).
# create a Pandas DataFrame with named columns
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])
print(df)
a b c
0 1 0 0
1 1 0 1
2 1 1 0
3 0 0 1
4 0 0 1
5 1 0 0
6 1 1 0
7 0 0 0
8 1 1 1
9 1 1 1
Selecting the first column ('a'). Multiple options for 'getting' values.
# select the column dict-style
df['a']
# loc: indexing by names
df.loc[:, 'a']
# iloc: indexing by index (like an array)
df.iloc[:, 0]
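All three approaches return the same column as a pandas Series:
# all three selections are equivalent
print(df['a'].equals(df.loc[:, 'a']))  # True
print(df['a'].equals(df.iloc[:, 0]))   # True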
Setting values in the first column ('a'). Fewer options for 'setting' values: use loc or iloc.
# set all values in column 'a' to be 0
df.loc[:, 'a'] = 0
# or, set values in column 'a' to be the following 10 values.
df.loc[:, 'a'] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
# set all values in column 0 to be 0
df.iloc[:, 0] = 0
# or, set values in column 0 to be the following 10 values.
df.iloc[:, 0] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
# don't do this: chained indexing (setting on a 'get' operation) raises a SettingWithCopyWarning
df['a'][0] = 0
# don't do this: setting with the wrong size raises an error (3 values here instead of 10)
df.loc[:, 'a'] = [0, 1, 0]
Numpy and pandas are generally much faster than standard Python code. If your data is array-like
then it almost always makes sense to use numpy or pandas instead of standard
Python objects like lists. For data analyses you should mostly use standard Python
code only to get your data into an array-like format.
If your data is of a single type (all integers or all floats) then numpy is simpler, faster, and uses less memory; it runs at close to bare-metal speed. But if your data is composed of mixed types (e.g., a CSV data file with strings, integers, and floats mixed) then pandas is much better at keeping track of your dtypes and operating on them appropriately.
CSV data = use pandas;
speed-intensive operations within your program = use numpy;
doing data science = use both;
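As a minimal sketch of the CSV case (the file path here is hypothetical), pandas infers an appropriate dtype for each column when reading mixed-type data:
# read a mixed-type CSV into a DataFrame; pandas infers a dtype per column
df = pd.read_csv("./data.csv")  # hypothetical file path
print(df.dtypes)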