Programming and Data Science for Biology
(EEEB G4050)

Lecture 2: Program design

Lecture outline

  • review terminal basics and introduce new ones
  • overview of types of software

Bash review: Filesystem

The hierarchical unix filesystem

Where are my files? Where am I?


              ls
            

              data/    docs/    file.txt
            

The hierarchical unix filesystem

Where am I?


              pwd
            

              /home/deren/
            

The hierarchical unix filesystem

What files are in that other directory (using the full path)?


              ls /home/deren/Documents/
            

              file1.txt    file2.txt    file3.txt
            

The hierarchical unix filesystem

What files are in that other directory (using the relative path)?


              ls ./Documents/
            

              file1.txt    file2.txt    file3.txt
            

The hierarchical unix filesystem

What files are in that other directory? (move and then check)


              cd /home/deren/Documents/
              ls
            

              file1.txt    file2.txt    file3.txt
            

The hierarchical unix filesystem

move up one directory (using relative path)


              cd ..
              pwd
            

              /home/deren/
            

The hierarchical unix filesystem

list contents of current directory (explicitly)


              # three different ways to specify the same thing
              ls       # . is the default target
              ls .     # . means current dir
              ls ./    # . is a dir, so ./ means the same thing
            

              Documents/
            

Bash review: command -options

program -options target

use `man` to view documentation of system commands


              man ls
            

              LS(1)                                    User Commands                                   LS(1)

              NAME
                     ls - list directory contents

              SYNOPSIS
                     ls [OPTION]... [FILE]...

              DESCRIPTION
                     List  information about the FILEs (the current directory by default).  Sort entries al‐
                     phabetically if none of -cftuvSUX nor --sort is specified.

                     Mandatory arguments to long options are mandatory for short options too.

                     -a, --all
                            do not ignore entries starting with .

                     -A, --almost-all
                            do not list implied . and ..

                     --author
                            with -l, print the author of each file

                     -b, --escape
                            print C-style escapes for nongraphic characters

                     --block-size=SIZE
                            with -l, scale sizes by SIZE when printing  them;  e.g.,  '--block-size=M';  see
                            SIZE format below

                     -B, --ignore-backups
                            do not list implied entries ending with ~

                     -c     with  -lt: sort by, and show, ctime (time of last change of file status informa‐
                            tion); with -l: show ctime and sort by name; otherwise: sort  by  ctime,  newest
                            first

                     -C     list entries by columns

                     --color[=WHEN]
                            color the output WHEN; more info below

                     -d, --directory
                            list directories themselves, not their contents

                     -D, --dired
                            generate output designed for Emacs' dired mode

                     -f     list all entries in directory order

                     -F, --classify[=WHEN]
                            append indicator (one of */=>@|) to entries WHEN

                     --file-type
                            likewise, except do not append '*'

                     --format=WORD
                            across -x, commas -m, horizontal -x, long -l, single-column -1, verbose -l, ver‐
                            tical -C

                     --full-time
                            like -l --time-style=full-iso

                     -g     like -l, but do not list owner

                     --group-directories-first
                            group  directories  before files; can be augmented with a --sort option, but any
                            use of --sort=none (-U) disables grouping

                     -G, --no-group
                            in a long listing, don't print group names

                     -h, --human-readable
                            with -l and -s, print sizes like 1K 234M 2G etc.

                     --si   likewise, but use powers of 1000 not 1024

                     -H, --dereference-command-line
                            follow symbolic links listed on the command line

                     --dereference-command-line-symlink-to-dir
                            follow each command line symbolic link that points to a directory

                     --hide=PATTERN
                            do not list implied entries matching shell PATTERN (overridden by -a or -A)

                     --hyperlink[=WHEN]
                            hyperlink file names WHEN

                     --indicator-style=WORD
                            append indicator with style WORD to entry names:  none  (default),  slash  (-p),
                            file-type (--file-type), classify (-F)

                     -i, --inode
                            print the index number of each file

                     -I, --ignore=PATTERN
                            do not list implied entries matching shell PATTERN

                     -k, --kibibytes
                    ...
            

program -options target

list contents using 'long list' format (shows time, owner, size...)


              ls -l 
            

              -rw-rw-r-- 1 deren deren     5 Jan 22 22:05 file1.txt
              -rw-rw-r-- 1 deren deren 11141 Jan 22 22:06 file2.txt
              -rw-rw-r-- 1 deren deren  3280 Jan 22 22:07 file3.txt
            

program -options target

list contents in long format, sorted by time, with human-readable sizes.


              ls -l -t -h  # three options
              ls -lth      # same thing written differently
            

              -rw-rw-r-- 1 deren deren 3.3K Jan 22 22:07 file3.txt
              -rw-rw-r-- 1 deren deren  11K Jan 22 22:06 file2.txt
              -rw-rw-r-- 1 deren deren    5 Jan 22 22:05 file1.txt
            

program -options target

list the contents of another directory in long format, reverse-sorted by modification time


              # program -options target
              ls -ltr /home/deren/work
            

              # ls -ltr 'lists' contents of the specified directory in long
              # format, sorted by modification time with oldest first
              drwxr-xr-x  7 deren deren  4096 Dec 28 03:37 data/
              drwxr-xr-x  7 deren deren  4096 Dec 28 03:37 docs/
              -rw-rw-r--  1 deren deren  4096 Dec 28 03:37 file.txt
            

Bash tricks: efficiency/hotkeys

efficiency/hotkeys

ctrl-a returns cursor to the start of a line


              # after writing a long command you notice a typo at the beginning
              lx -ltr /home/deren/work▌
            

              # ctrl-a brings cursor back to start.
              ▌lx -ltr /home/deren/work
            

efficiency/hotkeys

ctrl-d deletes the character at the cursor position


              # cursor is at beginning of the line
              ▌lx -ltr /home/deren/work
            

              # press ctrl-d twice to delete the mistyped "lx"
              ▌ -ltr /home/deren/work
            

efficiency/hotkeys

ctrl-k kills (cuts) the text from the cursor to the end of the line


              # cursor is at beginning of the line
              ▌ls -ltr /home/deren/work
            

              # ctrl-k from the start of the line clears the whole line
              ▌
            

efficiency/hotkeys

ctrl-e sends cursor to end of a line


              # cursor is at beginning of the line
              ▌ls -ltr /home/deren/work
            

              # ctrl-e sends it to the end
              ls -ltr /home/deren/work▌
            

efficiency/hotkeys

ctrl-f and ctrl-b move cursor forward or back one char


              # cursor is at beginning of the line
              ▌ls -ltr /home/deren/work
            

              # press ctrl-f twice
              ls▌ -ltr /home/deren/work
            

efficiency/hotkeys

the up arrow retrieves previously executed commands; the down arrow cycles forward again.


              # press up twice to walk back through your command history
              pwd
            

              # press down once to cycle forward again
              ls -ltr /home/deren/work
            

efficiency/hotkeys

ctrl-p (previous) and ctrl-n (next) are synonymous with up and down arrows, respectively


              # press ctrl-p twice (move back to a previous command)
              pwd
            

              # press ctrl-n once (move forward to the next command)
              ls -ltr /home/deren/work
            

efficiency/hotkeys

ctrl-r lets you type text to interactively search previously executed commands


              # ctrl-r, then start typing a previous command: "m"
              m
            

              # it will autocomplete and cycle among options
              man ls
            

efficiency/hotkeys

This set of hotkeys is known as the "readline" hotkeys (a subset of the emacs-style keybindings).


              # ctrl-f: move forward
              a▌aaaa
              # ctrl-b: move back
              ▌aaaaa
              # ctrl-e: move to end
              aaaaa▌
              # ctrl-a: move to start
              ▌aaaaa            
              # ctrl-k: clear text after cursor
              ▌
              # ctrl-p: up / previous command 
              man ls
              # ctrl-n: down / next command 
              ▌
            

Bash tricks: read, write, pipe, redirect

bash tricks: read, write, pipe, redirect

`echo` displays/prints a line of text. Useful for debugging and for checking variables.


              echo "hello world"
            

              hello world
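
`echo` is also a quick way to inspect a shell variable (here the standard $HOME environment variable):


              # print the value of the HOME environment variable
              echo $HOME


              /home/deren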
            

bash tricks: read, write, pipe, redirect

With the `-e` option, `echo` can interpret line breaks and other special characters.


              echo -e "hello\nworld"
            

              hello
              world
            

bash tricks: read, write, pipe, redirect

text can be written to files using redirection (> or >>)


              # using > writes so that the file will contain only this text
              echo "hello world" > file.txt
              cat file.txt
            

              hello world
            


              # using >> appends so that the file will add this text to the end
              echo "hello world" >> file.txt
            

              hello world
              hello world
            

bash tricks: read, write, pipe, redirect

`cat` streams (reads) data from a file to the terminal output (stdout)


              cat file.txt
            

              hello world
              hello world
            

bash tricks: read, write, pipe, redirect

`cat` concatenates (stitches together) data from multiple files


              cat file1.txt file2.txt file3.txt > files_123.txt
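
To confirm the result, `wc -l` counts the lines in each file (assuming the three input files from the earlier slides exist):


              # count lines in the inputs and in the combined output
              wc -l file1.txt file2.txt file3.txt files_123.txt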
            

bash tricks: read, write, pipe, redirect

`less` shows a single page of text at a time, allowing interactive viewing that is much more convenient for large files. Press `q` to quit the interactive viewing session.


              less very-large-file.txt
            

bash tricks: read, write, pipe, redirect

`grep` is used to perform pattern matching on lines of text.


              # find all lines in this file containing the word "error"
              grep "error" /var/log/auth.log
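
The `-c` option makes grep count the matching lines instead of printing them, for a quick summary:


              # count how many lines contain the word "error"
              grep -c "error" /var/log/auth.log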
            

bash tricks: read, write, pipe, redirect

A pipe (|) passes the output of one command to the input of another command. More specifically, the stdout of one command becomes the stdin of the next.


              echo -e "a\nb\nc" > abc.txt
              cat abc.txt
            

              a
              b
              c
            

Here we pipe the output of cat to grep, to search for lines containing 'b'


              cat abc.txt | grep b
            

              b
            

Types of software

Cathedral versus Bazaar

Is it better to have one grand monolithic program that does everything (The Cathedral), or numerous independent microservices that each perform one simple thing well (The Bazaar)? This is a classic debate!

Examples of Cathedrals in Comp. Biology

SLiM: Evolutionary Simulation Framework

RevBayes: Bayesian Phylogenetic Modeling

  • A framework that _can_ accomplish many tasks
  • Include their own computational languages/syntax
  • Huge learning curve & barrier to entry
  • Developers require specialized domain expertise
  • SLiM Manual is 857 pages
  • RevBayes requires an expert tutorial for every analysis

Examples of Bazaars in Comp. Biology

Bedtools: genome arithmetic operations

Samtools: high-throughput sequence data manipulation

  • Easy to install and run in seconds.
  • Composed of many micro-utilities, similar in design to the unix utilities
  • Bedtools: extract part of a genome; find overlaps between genome segments; etc.
  • Samtools: read/write/filter/edit SAM/BAM files of genomic data coordinates/stats
  • Simple instructions and examples: one-liners
  • Pipe multiple commands in different orders to accomplish complex tasks (see the example below)
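
For example, a hypothetical samtools one-liner ('input.bam' is a placeholder file name; the flags are standard samtools options) composes two micro-utilities with a pipe:


              # keep reads with mapping quality >= 30, sort them, and index the result
              # ('-' tells samtools sort to read its input from the pipe)
              samtools view -b -q 30 input.bam | samtools sort -o sorted.bam -
              samtools index sorted.bam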

Running a Cathedral Program

Call command on a script or config file: easy


              program-name script
            

The hard part is designing the script/instructions. Finding the right options/code/model/instructions to set up the program correctly *is the goal*. You will probably only run this once and then be done with it.

Running a Bazaar Program

Call many simple commands, piping output from one to the next; each performs one simple task


              # perform simple task 'a'
              prog -a data

              # perform simple task 'b'
              prog -b data

              # or, perform complex task by combining commands
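              # ('-' tells each program to read its input from stdin)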
              prog -a data | prog -b - | prog -c - > output.csv
            

Each program is easy to understand and debug, and different combinations of programs can accomplish complex tasks. It can be hard to know everything that is possible, though, and how to accomplish it.

Dev tips 1: design for the style of use

Will the user only interact with the program once? (e.g., genome assembly)

  • Checkpointing to restart if interrupted (see the sketch after this list).
  • Progress bar for indication.
  • Log file of the parameters/settings used.
  • Designed to run in terminal (e.g., on a server).
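
A minimal sketch of the checkpointing idea in bash (step1.sh, step2.sh, and step3.sh are hypothetical step scripts): each completed step leaves a marker file, so a restarted run skips the finished steps.


              # re-runnable pipeline: skip any step that already completed
              for step in step1 step2 step3; do
                  if [ ! -f "${step}.done" ]; then
                      bash "${step}.sh" && touch "${step}.done"
                  fi
              done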

Or will they use the program interactively, tweaking options? (e.g., plotting tools)

  • Fast and responsive.
  • Auto-completion (tab) support.
  • Designed to run in interactive env (e.g., jupyter, RStudio).

CLI vs API

The design of a program can involve tradeoffs among efficiency, reproducibility, accessibility, and ‘activation energy’. CLI (command line interface) vs API (application programming interface) designs provide a nice example of this.

CLI

We are now familiar with the format of command-line tools


              prog -b 10 -c 4 --mode "fast" data > out.txt
            
  • Pro: Fast to execute
  • Pro: Can often interact (e.g., pipe) with other unix tools (see the example after this list)
  • Con: Slow and burdensome to generate and view visualizations
  • Con: Extra work to further analyze results in Python or R
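
To illustrate the piping pro above (using the same hypothetical `prog`): standard unix tools can post-process CLI output directly.


              # tabulate the unique values in the first column of the output
              prog -b 10 -c 4 --mode "fast" data | cut -f 1 | sort | uniq -c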

API

An API is implemented within a programming language (e.g., Python or R) and allows easy interaction between the specific software program and the general capabilities of that language.


              # First open an interactive Python session
              ipython
            

              >>> # Then within Python the code can be run
              >>> import toytree
              >>> tree = toytree.rtree.bdtree(ntips=10)
              >>> print(tree.ntips)
            
  • Pro: Can combine general scripting with the specific API commands
  • Pro: Visual tools/libraries are highly developed and integrated (e.g., jupyter)
  • Con: Slower to access the program, requiring writing a script or opening an interactive session
  • Con: To benefit from an API requires some knowledge of the language

CLI vs API example on the same tool

Many tools are developed as both a CLI and an API. The API gives the developer and users a convenient framework for accessing the utilities of the software, while the CLI provides easy access to a few simple and commonly used commands.


              # call the program `ms` from the command line to simulate coalescent trees
              $ ms 15 1000 -t 6.4 -G 6.93 -eG 0.2 0.0 -eN 0.3 0.5
            

              >>> # use the msprime API in Python to setup the same simulation
              >>> import msprime
              >>> d = msprime.Demography()
              >>> d.add_population(name="A", initial_size=20000, growth_rate=6.4)
              >>> d.add_population_parameters_change(time=16000, growth_rate=0.0)
              >>> d.add_population_parameters_change(time=24000, initial_size=10000)
              >>> ts = msprime.sim_ancestry(samples={"A": 15}, demography=d)
            

Dev tips 2: build upon existing tools

Most problems have some existing tool designed to address them. When thinking about starting a new project ask yourself the following questions:

  • Is there need for a newer or better tool?
  • Should I rewrite or extend the existing tool?
  • Should I contribute to that tool, or write a "wrapper/pipeline" that uses it?
  • Who is the community that will use these tools?

Wrapper tools

Sometimes a tool accomplishes a single task well, but is poorly designed for a related task. A wrapper tool is a program that calls the first program in a way that is new and different enough to be worth packaging as a separate tool.

Example: easySFS, developed by Isaac Overcast, is a Python-based wrapper that calls the program dadi iteratively -- as opposed to just once -- to explore a broad parameter space and find the settings that maximize the data quality. Because users may want to perform this iterative process in different ways, easySFS provides its own user interface:


              $ ./easySFS.py -i input.vcf -p pops_file.txt --preview
            
            
              Pop1
              (2, 45.0)   (3, 59.0)   (4, 58.0)   (5, 49.0)   (6, 41.0)   (7, 35.0)   (8, 27.0)   (9, 20.0)   (10, 13.0)  (11, 8.0)   (12, 8.0)   (13, 5.0)   (14, 2.0)   (15, 2.0)   (16, 1.0)   

              pop2
              (2, 68.0)   (3, 96.0)   (4, 106.0)  (5, 110.0)  (6, 108.0)  (7, 89.0)   (8, 76.0)   (9, 66.0)   (10, 56.0)  (11, 49.0)  (12, 42.0)  (13, 39.0)  (14, 34.0)  (15, 29.0)  (16, 27.0)  (17, 26.0)  (18, 24.0)  (19, 23.0)  (20, 21.0)  (21, 22.0)  (22, 20.0)  (23, 19.0)  (24, 16.0)  (25, 16.0)  (26, 15.0)  (27, 15.0)  (28, 13.0)  (29, 13.0)  (30, 14.0)  (31, 14.0)  (32, 14.0)  (33, 13.0)  (34, 12.0)  (35, 9.0)   (36, 9.0)   (37, 8.0)   (38, 8.0)   (39, 8.0)   (40, 6.0)   (41, 6.0)   (42, 6.0)   (43, 5.0)   (44, 5.0)   (45, 5.0)   (46, 4.0)   (47, 4.0)   (48, 4.0)   (49, 3.0)   (50, 3.0)   (51, 3.0)   (52, 3.0)   (53, 3.0) 
            

Wrapper tools

Wrappers are a great place to start out in programming because they reduce the complexity of the problem to one of iteration, aggregation, and evaluation of results, instead of starting with trying to develop novel algorithms. Often they are very useful tools for your own research. If you think they will be helpful to others, share them on GitHub!

We will learn the building blocks necessary to write a wrapper tool within the first week of learning Python.
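
As a preview, here is a minimal sketch of the wrapper pattern in bash (`prog`, its `--clusters` option, and data.csv are all hypothetical); the Python version will follow the same iterate, aggregate, and evaluate structure.


              # iterate: run the wrapped program once per parameter setting
              for k in 2 3 4 5; do
                  prog --clusters "$k" data.csv > "result_k${k}.txt"
              done

              # aggregate and evaluate: compare the score reported by each run
              grep "score" result_k*.txt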

Dev tips 3: design and organize

Avoid the "scientific software as a pile of scripts" trap. Many scientific software tools are distributed as simply a collection of one-off scripts. This is very different from organizing a cohesive Bazaar-style software tool with atomic functions.


              # run the five steps of scientific pipeline X
              # 1. make the input files
              $ convertIM.pl infile.list
              # 2. Observed summary statistics vector
              $ obsSumStats.pl -T obsSS.table batch.masterIn.fromIM > obsSS.txt
              # 3. Simulating the prior
              $ msbayes.pl -c batch.masterIn.fromIM  -r 1000000 -o priorfile
              # 4. Getting the posterior
              $ acceptRej.pl -t 0.001 -p outfig.pdf obsSS.txt priorfile > modeEstimatesOut.txt
              # 5. Create plots in R
              $ Rscript make_plots_of_model_outputs_1.R modeEstimatesOut.txt
              $ Rscript make_plots_of_model_outputs_2.R modeEstimatesOut.txt              
            
  • Pro: Easy to develop
  • Pro: Gets the job done (at least once)
  • Con: Fragile (dependent on many other tools and languages that will likely change)
  • Con: Poor documentation and readability
  • Con: Unlikely to be used by others

Dev tips 3: design and organize

Why don't more software tools have a graphical interface?

  • Pro: User friendly
  • Pro: "wow" factor of visual results
  • Pro: Supports exploratory and educational analysis
  • Con: Doesn't scale well to super large data
  • Con: Difficult to run on HPC clusters
  • Con: Not reproducible
  • Con: Difficult to implement (many steps)

Jupyter Notebook and RStudio-type tools provide a happy medium.

  • Pro: User friendly
  • Pro: "wow" factor of visual results
  • Pro: Supports exploratory and educational analysis
  • Pro: Can scale to super large data and run on HPCs
  • Pro: Reproducible
  • Pro: Developers do not need to learn complicated GUI frameworks

Example: design motivation behind ipyrad

Genomic assembly software for RAD-seq type data. Two tools took fairly different design approaches. Stacks follows the design of a unix-type program, requiring the user to enter all arguments individually to several atomized sub-programs. The authors provide bash script templates (e.g., for-loops) to run the entire workflow by composing many steps. It is easy for a typo or bug in the script to crash the whole workflow.

ipyrad is also designed as a command line tool, but takes a single params file as input, in which the user can enter all parameters. ipyrad then creates the full workflow, managing all files from one atomic subprogram to the next.

The Stacks bash script requires the user to enter every sample name by hand, write for-loops, and call several tools just to run part of the workflow!


            #!/bin/bash

            src=$HOME/research/project

            files="sample_01
            sample_02
            sample_03"

            #
            # Build loci de novo in each sample for the single-end reads only. If paired-end reads are available, 
            # they will be integrated in a later stage (tsv2bam stage).
            # This loop will run ustacks on each sample, e.g.
            #   ustacks -f ./samples/sample_01.1.fq.gz -o ./stacks -i 1 --name sample_01 -M 4 -p 8
            #
            id=1
            for sample in $files
            do
                ustacks -f $src/samples/${sample}.1.fq.gz -o $src/stacks -i $id --name $sample -M 4 -p 8
                let "id+=1"
            done

            # 
            # Build the catalog of loci available in the metapopulation from the samples contained
            # in the population map. To build the catalog from a subset of individuals, supply
            # a separate population map only containing those samples.
            #
            cstacks -n 6 -P $src/stacks/ -M $src/popmaps/popmap -p 8

            #
            # Run sstacks. Match all samples supplied in the population map against the catalog.
            #
            sstacks -P $src/stacks/ -M $src/popmaps/popmap -p 8

            #
            # Run tsv2bam to transpose the data so it is stored by locus, instead of by sample. We will include
            # paired-end reads using tsv2bam. tsv2bam expects the paired read files to be in the samples
            # directory and they should be named consistently with the single-end reads,
            # e.g. sample_01.1.fq.gz and sample_01.2.fq.gz, which is how process_radtags will output them.
            #
            tsv2bam -P $src/stacks/ -M $src/popmaps/popmap --pe-reads-dir $src/samples -t 8

            #
            # Run gstacks: build a paired-end contig from the metapopulation data (if paired-reads provided),
            # align reads per sample, call variant sites in the population, genotypes in each individual.
            #
            gstacks -P $src/stacks/ -M $src/popmaps/popmap -t 8

            #
            # Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics
            # export several output files.
            #
            populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --structure --fstats --hwe -t 8
          

By contrast, ipyrad allows users to set all parameters in a single file and to run the entire assembly in one command.


            # create a default parameter file
            ipyrad -n test

            # run all steps of the assembly using the settings in params-test.txt
            ipyrad -p params-test.txt -s 1234567
          

And if the user wants to test different parameter settings, they can create a branch at a checkpoint (steps 1-7) and continue from that point with different parameters.


            # create a new branch of the assembly named 'newparams'
            ipyrad -p params-test.txt -b newparams

            # run only steps 6 and 7 using the settings in params-newparams.txt
            ipyrad -p params-newparams.txt -s 67
          

CLI and API

For downstream (and visual) analyses, we created a Python API: the ipyrad-analysis toolkit. This allows users to install one package that supports both long-running CLI commands and fast, interactive downstream analyses.


            # run a structure analysis wrapper over 10 reps and visualize results
            >>> import ipyrad.analysis as ipa
            >>> tool = ipa.structure(data="./data.vcf", min_maf=0.1, min_cov=10)
            >>> tool.run(reps=10)
            >>> tool.draw_barplot()