Notebook 12.1 Reading VCF files¶
This notebook will introduce you to the VCF file format, and methods for reading and manipulating variant call data (SNPs).
Learning objectives¶
- Understand the VCF file format
- Read VCF files as pandas DataFrames
- Perform basic manipulations of VCF data
Variant Call Format (VCF) files¶
VCF is a very common file format for storing DNA sequence variation. It is most often used for single-nucleotide polymorphism (SNP) data, i.e., only the sites that vary within a population or sample.
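To make the layout concrete, here is a minimal sketch that sorts the lines of a tiny, made-up VCF (not the sparrow data) into the three kinds of lines the format uses: meta-information lines starting with ##, a single column-header line starting with #CHROM, and one tab-separated record per variant site. The sample names (s1, s2) and genotypes are invented for illustration.

```python
# Toy VCF content, invented for illustration (not the sparrow data).
toy_lines = [
    "##fileformat=VCFv4.2",
    "##source=toy_example",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "s1", "s2"]),
    "\t".join(["chr1", "101", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1"]),
    "\t".join(["chr1", "250", ".", "C", "T", ".", "PASS", ".", "GT", "./.", "1/1"]),
]

# Meta-information lines start with "##".
meta_lines = [line for line in toy_lines if line.startswith("##")]

# The column-header line starts with a single "#" followed by CHROM.
header_line = next(line for line in toy_lines if line.startswith("#CHROM"))

# Everything else is one variant site per line, with tab-separated fields.
records = [line.split("\t") for line in toy_lines if not line.startswith("#")]

column_names = header_line.lstrip("#").split("\t")
# Columns 1-8 are fixed; FORMAT is the 9th when genotypes are present,
# so per-sample genotype columns start at index 9.
sample_names = column_names[9:]

print(len(meta_lines), "meta lines")
print("samples:", sample_names)
for rec in records:
    site = dict(zip(column_names, rec))
    print(site["CHROM"], site["POS"], site["REF"], ">", site["ALT"])
```

Note that the second record carries a ./. genotype for sample s1, which is how a missing call usually appears.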
Required software¶
Install the following with conda before running this notebook (if you haven't already done so).
# conda install pandas requests toyplot -c conda-forge
import toyplot
import requests
import pandas as pd
Fetch an example vcf file¶
This is an in-class challenge activity, where I will give you prompts and motivations and then you'll have to figure out how to do what you need to do on your own (or with a partner).
First, use the requests module to download the VCF file from the URL below. Save the returned data to a file called wcs.vcf (this data is from a white-crowned sparrow study).
vcf_url = "https://raw.githubusercontent.com/isaacovercast/easySFS/refs/heads/master/example_files/wcs_1200.vcf"
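If you get stuck, one possible approach is sketched here. The helper name download_vcf and the 30-second timeout are assumptions made for illustration, not part of the exercise; the calls themselves (requests.get, raise_for_status, and the content attribute) are standard requests usage.

```python
import requests


def download_vcf(url, path, timeout=30):
    """Download `url` and write the raw bytes to `path`.

    Hypothetical helper for this exercise; the name and the default
    timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on HTTP failures such as 404
    with open(path, "wb") as outfile:
        outfile.write(response.content)
    return path
```

Calling download_vcf(vcf_url, "wcs.vcf") in the next cell should then leave a wcs.vcf file in the working directory.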
Converting to a pandas DataFrame¶
Try loading the wcs.vcf file into a pandas DataFrame using pd.read_csv. What happens?
Pandas expects a file to be formatted as tabular data, which a VCF actually is once you skip the extra header information at the top. read_csv() takes an optional argument called header, to which you can pass the row number of the line that contains the column names. Because VCF columns are separated by tabs rather than commas, you will also need to set the sep argument. See if you can figure out a way to identify the header row in this VCF file, and then load it into a DataFrame using the header parameter.
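As a hint, here is a minimal sketch of the idea on a tiny, made-up VCF held in a string (the sample names and genotypes are invented). It scans for the line beginning with #CHROM, records its 0-based position, and passes that position to read_csv together with a tab separator. This is one way to do it, not the only one.

```python
import io

import pandas as pd

# Toy VCF content, invented for illustration (not the sparrow data).
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "##source=toy_example",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "s1", "s2", "s3", "s4"]),
    "\t".join(["chr1", "101", ".", "A", "G", ".", "PASS", ".", "GT",
               "0/0", "0/1", "./.", "1/1"]),
    "\t".join(["chr1", "250", ".", "C", "T", ".", "PASS", ".", "GT",
               "./.", "0/0", "0/1", "0/0"]),
    "\t".join(["chr2", "77", ".", "G", "A", ".", "PASS", ".", "GT",
               "0/1", "./.", "0/0", "0/0"]),
]) + "\n"

# Find the 0-based index of the column-header line (it starts with "#CHROM").
header_row = None
for i, line in enumerate(io.StringIO(toy_vcf)):
    if line.startswith("#CHROM"):
        header_row = i
        break

# Lines before the header row are skipped; the header row supplies column names.
df = pd.read_csv(io.StringIO(toy_vcf), sep="\t", header=header_row)

print("header row:", header_row)
print(df.columns.tolist())
print(df.shape)
```

For the real file you would loop over open("wcs.vcf") instead of the StringIO object and pass the file name to read_csv.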
Challenge question: Calculate % missing data¶
Missing data is often useful to quantify in a VCF file. In most cases a missing genotype is written as './.', indicating that no genotype was called for that sample at that SNP. Calculate the percentage of genotype calls that are missing in this dataset. Ask for hints about how to proceed if necessary.
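One hedged way to approach this is sketched below on a made-up toy table (invented sample names and genotypes, not the sparrow data). It assumes that the sample columns start at the tenth column, that the genotype is the first colon-separated subfield of each entry, and that missing calls are written exactly as './.'; other encodings such as '.' or '.|.' would not be counted by this sketch.

```python
import io

import pandas as pd

# Toy VCF content, invented for illustration (not the sparrow data).
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "##source=toy_example",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "s1", "s2", "s3", "s4"]),
    "\t".join(["chr1", "101", ".", "A", "G", ".", "PASS", ".", "GT",
               "0/0", "0/1", "./.", "1/1"]),
    "\t".join(["chr1", "250", ".", "C", "T", ".", "PASS", ".", "GT",
               "./.", "0/0", "0/1", "0/0"]),
    "\t".join(["chr2", "77", ".", "G", "A", ".", "PASS", ".", "GT",
               "0/1", "./.", "0/0", "0/0"]),
]) + "\n"

# In this toy string the column-header line is the third line (index 2).
df = pd.read_csv(io.StringIO(toy_vcf), sep="\t", header=2)

# Assumption: the first 9 columns are site annotations, the rest are samples.
samples = df.iloc[:, 9:]

# Keep only the genotype subfield in case entries look like "0/1:12:30".
genotypes = samples.apply(lambda col: col.str.split(":").str[0])

# Boolean table: True wherever a genotype call is missing.
is_missing = genotypes == "./."

# Overall percentage across all sites and samples, then per sample.
pct_missing = 100 * is_missing.to_numpy().mean()
pct_missing_per_sample = 100 * is_missing.mean()

print("overall % missing:", pct_missing)
print(pct_missing_per_sample)
```

The per-sample breakdown is often as informative as the overall figure, since a few poorly sequenced samples can account for most of the missing calls.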