Where are my files? Where am I?
ls
data/ docs/ Documents/ file.txt
Where am I?
pwd
/home/deren/
what files are in that other directory (using the full path)?
ls /home/deren/Documents/
file1.txt file2.txt file3.txt
what files are in that other directory (using the relative path)?
ls ./Documents/
file1.txt file2.txt file3.txt
what files are in that other directory? (move and then check)
cd /home/deren/Documents/
ls
file1.txt file2.txt file3.txt
move up one directory (using relative path)
cd ..
pwd
/home/deren/
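A related shortcut: `cd` with no target always returns you to your home directory, from anywhere.
cd
pwd
/home/deren/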
list contents of current directory (explicitly)
# three different ways to specify the same target
ls # . is the default target
ls . # . means current dir
ls ./ # . is a dir, so ./ means the same thing
data/ docs/ Documents/ file.txt
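One more equivalent shortcut: the shell expands the tilde (~) to your home directory, so this works from any location:
ls ~
data/ docs/ Documents/ file.txt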
use `man` to view documentation of system commands
man ls
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default). Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
Mandatory arguments to long options are mandatory for short options too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
--author
with -l, print the author of each file
-b, --escape
print C-style escapes for nongraphic characters
--block-size=SIZE
with -l, scale sizes by SIZE when printing them; e.g., '--block-size=M'; see
SIZE format below
-B, --ignore-backups
do not list implied entries ending with ~
-c with -lt: sort by, and show, ctime (time of last change of file status information); with -l: show ctime and sort by name; otherwise: sort by ctime, newest first
-C list entries by columns
--color[=WHEN]
color the output WHEN; more info below
-d, --directory
list directories themselves, not their contents
-D, --dired
generate output designed for Emacs' dired mode
-f list all entries in directory order
-F, --classify[=WHEN]
append indicator (one of */=>@|) to entries WHEN
--file-type
likewise, except do not append '*'
--format=WORD
across -x, commas -m, horizontal -x, long -l, single-column -1, verbose -l, vertical -C
--full-time
like -l --time-style=full-iso
-g like -l, but do not list owner
--group-directories-first
group directories before files; can be augmented with a --sort option, but any
use of --sort=none (-U) disables grouping
-G, --no-group
in a long listing, don't print group names
-h, --human-readable
with -l and -s, print sizes like 1K 234M 2G etc.
--si likewise, but use powers of 1000 not 1024
-H, --dereference-command-line
follow symbolic links listed on the command line
--dereference-command-line-symlink-to-dir
follow each command line symbolic link that points to a directory
--hide=PATTERN
do not list implied entries matching shell PATTERN (overridden by -a or -A)
--hyperlink[=WHEN]
hyperlink file names WHEN
--indicator-style=WORD
append indicator with style WORD to entry names: none (default), slash (-p),
file-type (--file-type), classify (-F)
-i, --inode
print the index number of each file
-I, --ignore=PATTERN
do not list implied entries matching shell PATTERN
-k, --kibibytes
...
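For example, the -a option documented above reveals hidden entries (names starting with .). A sketch of what this might look like; the .bashrc dotfile shown here is hypothetical:
ls -a
./  ../  .bashrc  data/  docs/  Documents/  file.txt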
list contents using 'long list' format (shows time, owner, size...)
ls -l
-rw-rw-r-- 1 deren deren 5 Jan 22 22:05 file1.txt
-rw-rw-r-- 1 deren deren 11141 Jan 22 22:06 file2.txt
-rw-rw-r-- 1 deren deren 3280 Jan 22 22:07 file3.txt
list contents long, sorted by time, using human-readable size.
ls -l -t -h # three options
ls -lth # same thing written differently
-rw-rw-r-- 1 deren deren 3.3K Jan 22 22:07 file3.txt
-rw-rw-r-- 1 deren deren 11K Jan 22 22:06 file2.txt
-rw-rw-r-- 1 deren deren 5 Jan 22 22:05 file1.txt
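Adding -r reverses the sort order (oldest first); this combination appears again below.
ls -lthr
-rw-rw-r-- 1 deren deren    5 Jan 22 22:05 file1.txt
-rw-rw-r-- 1 deren deren  11K Jan 22 22:06 file2.txt
-rw-rw-r-- 1 deren deren 3.3K Jan 22 22:07 file3.txt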
list files in another directory (general format: program -options target)
# program -option target
ls -ltr /home/deren/work
# echo prints the target string to the terminal output (stdout)
echo "hello world"
hello world
# ls -l 'lists' contents in the specified directory
ls -l /home/deren/
drwxr-xr-x 7 deren deren 4096 Dec 28 03:37 data/
drwxr-xr-x 7 deren deren 4096 Dec 28 03:37 docs/
drwxr-xr-x 7 deren deren 4096 Dec 28 03:37 Documents/
-rw-rw-r-- 1 deren deren 4096 Dec 28 03:37 file.txt
ctrl-a returns cursor to the start of a line
# after writing a long command you notice a typo at the beginning
lx -ltr /home/deren/work▌
# ctrl-a brings cursor back to start.
▌lx -ltr /home/deren/work
ctrl-d deletes the character at the cursor position
# cursor is at beginning of the line
▌lx -ltr /home/deren/work
# press ctrl-d twice to delete the two mistyped characters
▌ -ltr /home/deren/work
ctrl-k kills (cuts) text from the cursor to the end of the line
# cursor is at beginning of the line
▌ls -ltr /home/deren/work
# with the cursor at the start, ctrl-k clears the whole line
▌
ctrl-e sends cursor to end of a line
# cursor is at beginning of the line
▌ls -ltr /home/deren/work
# ctrl-e sends it to the end
ls -ltr /home/deren/work▌
ctrl-f and ctrl-b move cursor forward or back one char
# cursor is at beginning of the line
▌ls -ltr /home/deren/work
# execute ctrl-f two times
ls▌ -ltr /home/deren/work
the up arrow retrieves previously executed commands; the down arrow cycles forward again
# press the up arrow twice to reach an earlier command
pwd
# press the down arrow to cycle forward to the more recent command
ls -ltr /home/deren/work
ctrl-p (previous) and ctrl-n (next) are synonymous with up and down arrows, respectively
# ctrl-p twice (move up through previous commands)
pwd
# ctrl-n (move back down to the more recent command)
ls -ltr /home/deren/work
ctrl-r allows typing text to interactively search previously executed commands
# ctrl-r, then start typing part of a previous command, e.g. "m"
m
# matches appear as you type; press ctrl-r again to cycle among them
man ls
This set of hotkeys is known as the "readline" hotkeys (a subset of the Emacs-style keybindings).
# ctrl-f: move forward
a▌aaaa
# ctrl-b: move back
▌aaaaa
# ctrl-e: move to end
aaaaa▌
# ctrl-a: move to start
▌aaaaa
# ctrl-k: clear text after cursor
▌
# ctrl-p: up / previous command
man ls
# ctrl-n: down / next command
▌
`echo` displays (prints) a line of text. Useful for debugging and for checking variables.
echo "hello world"
hello world
`echo -e` can interpret line breaks (\n) and other special characters.
echo -e "hello\nworld"
hello
world
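Other backslash escapes work the same way; for example, \t inserts a tab:
echo -e "hello\tworld"
hello	world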
text can be written to files using redirection (> or >>)
# using > writes so that the file will contain only this text
echo "hello world" > file.txt
cat file.txt
hello world
# using >> appends so that the text is added to the end of the file
echo "hello world" >> file.txt
cat file.txt
hello world
hello world
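Note that > always truncates: redirecting to the same file twice with > leaves only the last write. A small sketch using a hypothetical file other.txt:
echo "first line" > other.txt
echo "second line" > other.txt
cat other.txt
second line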
`cat` streams (reads) data from a file to the terminal output (stdout)
cat file.txt
hello world
hello world
`cat` concatenates (stitches together) data from multiple files
cat file1.txt file2.txt file3.txt > files_123.txt
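As a quick sanity check (a sketch assuming the file sizes listed earlier; the timestamp is hypothetical), the concatenated file's size should equal the sum of its parts: 5 + 11141 + 3280 = 14426 bytes.
ls -l files_123.txt
-rw-rw-r-- 1 deren deren 14426 Jan 22 22:10 files_123.txt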
`less` shows a single page of text at a time, allowing interactive viewing that is much more convenient for large files. Press `q` to quit out of the interactive viewing session.
less very-large-file.txt
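Inside `less`, a few keys cover most needs:
# space = forward one page; b = back one page
# /pattern = search forward; n = jump to next match
# q = quit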
`grep` is used to perform pattern matching on lines of text.
# find all lines in this file containing the word "error"
grep "error" /var/log/auth.log
A pipe (|) passes the output of one command to the input of another. More specifically, the stdout of one command becomes the stdin of the next.
echo -e "a\nb\nc" > abc.txt
cat abc.txt
a
b
c
Here we pipe the output of cat to grep, to search for lines containing 'b'
cat abc.txt | grep b
b
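The output of a pipe can feed any command; for example, `grep -c` reports only the count of matching lines:
cat abc.txt | grep -c b
1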
Is it better to have one grand monolithic program that does everything (The Cathedral), or numerous independent microservices that each perform one simple thing well (The Bazaar)? This is a classic debate!
Calling a command on a script or config file is easy:
program-name script
The hard part is designing the script/instructions. Finding the right options/code/model/instructions to set up the program correctly *is the goal*. You will probably only run this once and then be done with it.
Calling many simple commands, with output piped from one to the next, each performing one simple task:
# perform simple task 'a'
prog -a data
# perform simple task 'b'
prog -b data
# or, perform complex task by combining commands
prog -a data | prog -b - | prog -c - > output.csv
Each program is easy to understand and debug, and in different combinations they can accomplish complex tasks. The tradeoff is that it can be hard to know everything that is possible and how to accomplish it.
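For instance, the same composition pattern with standard Unix tools (`sort` and `uniq` are core utilities not covered above; the log file is the same hypothetical one used earlier):
# count repeated matching lines, most frequent first
grep "error" /var/log/auth.log | sort | uniq -c | sort -rn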
Will the user only interact with the program once? (e.g., genome assembly)
Or will they use the program interactively, tweaking options? (e.g., plotting tools)
The design of a program involves tradeoffs among efficiency, reproducibility, accessibility, and 'activation energy'. CLIs (command line interfaces) versus APIs (application programming interfaces) provide a nice example of this.
We are now familiar with the format of command-line tools
prog -b 10 -c 4 --mode "fast" data > out.txt
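Broken into its parts (annotations added here for clarity):
# prog           the program name
# -b 10 -c 4     short options, each with a value
# --mode "fast"  a long option with a value
# data           the target (input) argument
# > out.txt      redirect stdout into the file out.txt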
An API is implemented within a programming language (e.g., Python or R) and allows easy interaction between a specific software package and the general tools of that language.
# First open an interactive Python session
ipython
>>> # Then, within Python, the code can be run
>>> import toytree
>>> tree = toytree.rtree.bdtree(ntips=10)
>>> print(tree.ntips)
10
Many tools are developed as both a CLI and an API. The API gives developers and users a convenient framework for accessing the software's utilities, while the CLI provides easy access to a few simple and commonly used commands.
# call the program `ms` from the command line to simulate coalescent trees
$ ms 15 1000 -t 6.4 -G 6.93 -eG 0.2 0.0 -eN 0.3 0.5
>>> # use the msprime API in Python to setup the same simulation
>>> import msprime
>>> d = msprime.Demography()
>>> d.add_population(name="A", initial_size=20000, growth_rate=6.4)
>>> d.add_population_parameters_change(time=16000, growth_rate=0.0)
>>> d.add_population_parameters_change(time=24000, initial_size=10000)
>>> ts = msprime.sim_ancestry(samples={"A": 15}, demography=d)
Most problems have some existing tool designed to address them. Before starting a new project, ask yourself whether an existing tool already solves the problem, or could be adapted to do so.
Sometimes a tool accomplishes a single task well, but is poorly designed for a related
task. A wrapper tool is a program that calls the first program in a new and different
enough way that it is worth packaging as a separate tool.
Example: easySFS, developed by Isaac Overcast, is a Python-based wrapper that calls the program dadi iteratively -- as opposed to just once -- allowing users to explore a broad parameter space to find the settings that maximize data quality. Because users may want to perform this iterative process in different ways, easySFS provides its own user interface:
$ ./easySFS.py -i input.vcf -p pops_file.txt --preview
Pop1
(2, 45.0) (3, 59.0) (4, 58.0) (5, 49.0) (6, 41.0) (7, 35.0) (8, 27.0) (9, 20.0) (10, 13.0) (11, 8.0) (12, 8.0) (13, 5.0) (14, 2.0) (15, 2.0) (16, 1.0)
pop2
(2, 68.0) (3, 96.0) (4, 106.0) (5, 110.0) (6, 108.0) (7, 89.0) (8, 76.0) (9, 66.0) (10, 56.0) (11, 49.0) (12, 42.0) (13, 39.0) (14, 34.0) (15, 29.0) (16, 27.0) (17, 26.0) (18, 24.0) (19, 23.0) (20, 21.0) (21, 22.0) (22, 20.0) (23, 19.0) (24, 16.0) (25, 16.0) (26, 15.0) (27, 15.0) (28, 13.0) (29, 13.0) (30, 14.0) (31, 14.0) (32, 14.0) (33, 13.0) (34, 12.0) (35, 9.0) (36, 9.0) (37, 8.0) (38, 8.0) (39, 8.0) (40, 6.0) (41, 6.0) (42, 6.0) (43, 5.0) (44, 5.0) (45, 5.0) (46, 4.0) (47, 4.0) (48, 4.0) (49, 3.0) (50, 3.0) (51, 3.0) (52, 3.0) (53, 3.0)
Wrappers are a great place to start out in programming because they reduce the complexity of the problem to one of iteration, aggregation, and evaluation of results, rather than requiring you to develop novel algorithms. Often they are very useful tools for your own research. If you think they will be helpful to others, share them on GitHub!
We will learn the building blocks necessary to write a wrapper tool within the first week
of learning Python.
Avoid the "scientific software as a pile of scripts" trap. Many scientific software tools are distributed as simply a collection of one-off scripts. This is very different from organizing a cohesive Bazaar-style software tool with atomic functions.
# run the five steps of scientific pipeline X
# 1. make the input files
$ convertIM.pl infile.list
# 2. Observed summary statistics vector
$ obsSumStats.pl -T obsSS.table batch.masterIn.fromIM > obsSS.txt
# 3. Simulating the prior
$ msbayes.pl -c batch.masterIn.fromIM -r 1000000 -o priorfile
# 4. Getting the posterior
$ acceptRej.pl -t 0.001 -p outfig.pdf obsSS.txt priorfile > modeEstimatesOut.txt
# 5. Create plots in R
$ Rscript make_plots_of_model_outputs_1.R modeEstimatesOut.txt
$ Rscript make_plots_of_model_outputs_2.R modeEstimatesOut.txt
Why don't more software tools have a graphical interface?
Jupyter Notebook and RStudio-type tools provide a happy medium.
Consider genome assembly software for RAD-seq-type data: two tools took fairly different design approaches.
Stacks follows the design of a Unix-style program, requiring the
user to enter all arguments individually to several atomized sub-programs.
The authors provide bash script templates (e.g., for-loops) to run the
entire workflow, composing many steps. It is easy for a typo or bug in
the script to crash the whole workflow.
ipyrad is also designed as a command line tool, but takes a single params
file as input, in which the user can enter all parameters. ipyrad then
creates the full workflow, managing all files from one atomic
subprogram to the next.
The Stacks bash script requires the user to enter every sample name by hand, to write for-loops, and to call several tools, just to run part of the workflow.
#!/bin/bash
src=$HOME/research/project
files="sample_01
sample_02
sample_03"
#
# Build loci de novo in each sample for the single-end reads only. If paired-end reads are available,
# they will be integrated in a later stage (tsv2bam stage).
# This loop will run ustacks on each sample, e.g.
# ustacks -f ./samples/sample_01.1.fq.gz -o ./stacks -i 1 --name sample_01 -M 4 -p 8
#
id=1
for sample in $files
do
    ustacks -f $src/samples/${sample}.1.fq.gz -o $src/stacks -i $id --name $sample -M 4 -p 8
    let "id+=1"
done
#
# Build the catalog of loci available in the metapopulation from the samples contained
# in the population map. To build the catalog from a subset of individuals, supply
# a separate population map only containing those samples.
#
cstacks -n 6 -P $src/stacks/ -M $src/popmaps/popmap -p 8
#
# Run sstacks. Match all samples supplied in the population map against the catalog.
#
sstacks -P $src/stacks/ -M $src/popmaps/popmap -p 8
#
# Run tsv2bam to transpose the data so it is stored by locus, instead of by sample. We will include
# paired-end reads using tsv2bam. tsv2bam expects the paired read files to be in the samples
# directory and they should be named consistently with the single-end reads,
# e.g. sample_01.1.fq.gz and sample_01.2.fq.gz, which is how process_radtags will output them.
#
tsv2bam -P $src/stacks/ -M $src/popmaps/popmap --pe-reads-dir $src/samples -t 8
#
# Run gstacks: build a paired-end contig from the metapopulation data (if paired-reads provided),
# align reads per sample, call variant sites in the population, genotypes in each individual.
#
gstacks -P $src/stacks/ -M $src/popmaps/popmap -t 8
#
# Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics
# export several output files.
#
populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --structure --fstats --hwe -t 8
By contrast, ipyrad allows users to set all parameters in a single file and run the entire assembly in one command.
# create a default parameter file
ipyrad -n test
# run all steps of the assembly using the settings in params-test.txt
ipyrad -p params-test.txt -s 1234567
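You can also run any subset of steps by listing their numbers; for example, only the first three:
# run only steps 1-3 of the assembly
ipyrad -p params-test.txt -s 123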
And if you want to test different parameter settings, you can create a branch at a checkpoint (after any of steps 1-7) and continue from that point with different parameters.
# create a new branch of the assembly named 'newparams'
ipyrad -p params-test.txt -b newparams
# edit params-newparams.txt, then re-run only steps 6 and 7 with the new settings
ipyrad -p params-newparams.txt -s 67
For downstream (and visual) analyses, we created a Python API: the ipyrad-analysis toolkit. This allows users to install one package that supports both long-running CLI commands and fast, interactive downstream analyses.
# run a structure analysis wrapper over 10 reps and visualize results
>>> import ipyrad.analysis as ipa
>>> tool = ipa.structure(data="./data.vcf", min_maf=0.1, min_cov=10)
>>> tool.run(reps=10)
>>> tool.draw_barplot()