Programming and Data Science for Biology
(EEEB G4050)

Lecture 1: Introduction

What is scientific programming?

In the strict sense, it is writing a logical sequence of instructions that can be interpreted by a computer to perform a desired task.

More broadly, however, successful scientific programming involves a larger set of tasks, including: (1) envisioning a useful end product; (2) designing a user-friendly interface; (3) writing the source code; (4) annotating code for readability and comprehension; (5) making the tool installable across platforms; (6) writing documentation; (7) collaborating with a closed or open source community; (8) publishing.

What are programming languages?

They are dialects we can use to provide instructions to a computer. Programming languages change over time, diverge from each other, and borrow from each other, just like spoken languages. The popularity of different languages has varied through time and among different communities of users.

In this class you will be introduced to several languages, including bash, HTML, Markdown, SVG, JavaScript, and Python, with the greatest emphasis on bash and Python.

Our goals in this class

(1) To learn to design, code, and publish a software tool that can accomplish a useful task in biology; and (2) To learn technical and social skills to contribute to the biological data science community.

How is this class geared towards biology?

Primarily through our interactions and discussions.

When biological information is encoded as data, it becomes no different from any other type of data -- it is just text. The skills that we will focus on here will generally apply to any type of data. But, through our discussions in this class, among a community of people developing software projects to solve biological problems, you will be exposed to numerous computational approaches for solving biological problems.

We will also spend some time learning about particular biological data types and structures, including genomic and geospatial data.
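To illustrate the idea that encoded biological data is just text, here is a minimal sketch that treats DNA sequences like any other text file. The file name and the two sequences are made up for illustration, and FASTA is used only as one common example of a genomic text format. The `$` prompt is omitted so the lines can be pasted into a terminal as-is.

```shell
# Work in a throwaway directory so nothing on your system is touched
workdir=$(mktemp -d)

# Write a tiny, made-up FASTA file: header lines start with ">"
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > "$workdir/example.fasta"

# Count sequences by counting the header lines
n_seqs=$(grep -c '^>' "$workdir/example.fasta")

# Count bases: drop header lines, delete newlines, count remaining characters
n_bases=$(grep -v '^>' "$workdir/example.fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "sequences: $n_seqs"
echo "bases: $n_bases"
```

None of these programs (`grep`, `tr`, `wc`) know anything about biology; they work here only because the sequences are stored as plain text.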

Is my laptop appropriate for this class?

Short answer: yes.
Scientific computing is overwhelmingly performed on operating systems based on the Unix architecture (e.g., Linux, macOS), sometimes termed *nix systems. As research computing continues to move into the cloud, these systems, which scale better to many users and many simultaneous tasks, are likely to remain popular and even grow in popularity. Because research tools are built primarily for *nix systems, Windows has historically been very difficult to use in research computing.

Fortunately, this has recently changed. As Microsoft has gained interest in cloud computing, it has embraced Linux, which it had attacked as a competitor for decades. Linux can now be installed and run on Windows as a self-contained additional operating system (Windows Subsystem for Linux).

So, you will all be learning a bit of Linux.
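If you are unsure which kind of system your terminal is running on, a short check like the following can help. This is a rough sketch, not an official detection method: it assumes that Windows Subsystem for Linux kernels mention "microsoft" in `/proc/version`, which is common but not guaranteed.

```shell
# Report the kernel name: typically "Linux" on Linux and WSL, "Darwin" on macOS
kernel=$(uname -s)
echo "kernel: $kernel"

# Assumption: WSL kernels usually include "microsoft" in /proc/version
if [ -r /proc/version ] && grep -qi microsoft /proc/version; then
    echo "this looks like Windows Subsystem for Linux"
else
    echo "this does not look like WSL"
fi
```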

Our inspiration

The 1995 hit movie Hackers, about a group of rough-around-the-edges youths who apply their hacking skills to take down a global financial conspiracy that is being blamed on one of them after a harmless hacking prank. Although we will not learn "hacking" in the strict sense in this class, the ethos behind hackers (in the sense of a community of coders working together to solve problems) stands as the inspiration for this class.

The shell or terminal (e.g., bash, zsh)

A terminal is a command-line interface (CLI) that allows users to interact with their computer’s operating system by typing text commands. Strictly speaking, the terminal is the window that accepts typed input, and the shell (e.g., bash, zsh) is the program running inside it that interprets the commands, but the two terms are often used interchangeably. The shell serves as a gateway for executing programs, managing files, and controlling system processes without the need for a graphical user interface (GUI). It interprets commands entered by the user and communicates with the operating system to carry out tasks.
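The three kinds of tasks mentioned above (executing programs, managing files, controlling processes) can each be tried with a single line. This is only an illustrative sketch: the file name `notes.txt` is hypothetical, and everything happens in a temporary directory. The `$` prompt is omitted so the lines can be pasted directly.

```shell
# Execute a program: echo prints its arguments back to the terminal
echo "hello from the shell"

# Manage files: create an empty file in a temporary directory, then list it
workdir=$(mktemp -d)
touch "$workdir/notes.txt"
ls "$workdir"

# Control processes: start a program in the background, then wait for it
sleep 1 &
wait $!
status=$?

# An exit status of 0 conventionally means the program succeeded
echo "background job finished with exit status $status"
```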

 
deren@linux ~ $

Command-line tools (programs) called from terminal

A common syntax to call shell programs

 
# the full syntax, but some arguments can be left empty
$ [program-name] -[options...] [target]

# no options or targets: shows files in the current dir
$ ls

# option -l means long format: one file per line, with permissions, owner, size, and date
$ ls -l

# target is a filepath (show that dir instead of current dir)
$ ls /home/

# options and target can be provided together
$ ls -la /home/deren
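The same program, options, target pattern applies to most other command-line tools, not just `ls`. The sketch below applies it to `wc` and `head` on a small made-up file (the name `lines.txt` is hypothetical). The `$` prompt is omitted so the lines can be pasted directly.

```shell
# Create a three-line practice file in a temporary directory
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/lines.txt"

# program: wc, option: -l (count lines), target: the file
wc -l "$workdir/lines.txt"

# program: head, option: -n 2 (show the first two lines), target: the file
head -n 2 "$workdir/lines.txt"

# program: ls, options: -l and -a combined as -la, target: the directory
ls -la "$workdir"
```

Note that some options take a value of their own, as `-n 2` does here, while others such as `-l` and `-a` are simple switches that can be combined.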
          

Filesystem paths

In a terminal you should always know two things: where am I, and where are my files?
Unix provides programs that answer these questions.


# print working directory (current location)
$ pwd

# change directory (move yourself to another target location)
$ cd /home/deren

# list the files in this directory
$ ls 

# show the filepath to the program `ls`
$ which ls
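A path given to `cd` can be absolute (starting from the root, `/`) or relative (starting from wherever you currently are). The following sketch practices both inside a temporary directory; the `project/data` folder names are made up for illustration. The `$` prompt is omitted so the lines can be pasted directly.

```shell
# Make a small practice tree inside a temporary directory
base=$(mktemp -d)
mkdir -p "$base/project/data"

# Absolute path: starts at the root (/) and works from any location
cd "$base/project/data"
pwd

# Relative path: ".." means the parent of the current directory
cd ..
here=$(pwd)
echo "$here"

# Relative path: "data" is looked up inside the current directory
cd data
pwd

# "~" is shorthand for your home directory
echo ~
```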
	        

The Unix filesystem

It is essential to understand the hierarchical structure of the filesystem on your computer. Directories (folders) are nested within directories, starting from the root (/).

Your first assignment will focus on learning the file hierarchy. Many of the errors you will encounter while learning to code will be mistakes in writing filepaths. Learning to traverse the filesystem quickly in a terminal will make you a much more efficient coder.
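One way to get a feel for the nested structure is to build a small hierarchy and then list everything beneath its top directory. This is a sketch under stated assumptions: the `course/week1/notes` layout is hypothetical, and it lives in a temporary directory rather than your real files. The `$` prompt is omitted so the lines can be pasted directly.

```shell
# The root directory (/) is the top of the whole hierarchy
ls /

# Build a small nested hierarchy (hypothetical names) in a temporary directory
base=$(mktemp -d)
mkdir -p "$base/course/week1/notes" "$base/course/week2"
touch "$base/course/week1/notes/lecture1.txt"

# find prints every directory and file beneath a starting point
find "$base/course" | sort

# Count the entries: course, week1, notes, lecture1.txt, and week2
n_entries=$(find "$base/course" | wc -l | tr -d ' ')
echo "entries: $n_entries"
```

Each line printed by `find` is a full filepath, so reading the output from left to right traces the route from the starting directory down to that file or folder.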