Notebook 12.1 Reading VCF files

This notebook introduces the VCF file format and methods for reading and manipulating variant call data (SNPs).

Learning objectives

Understand the VCF file format
Read VCF files as pandas data frames
Perform basic manipulations of VCF data

Variant Call Format (VCF) files

VCF is a very common plain-text file format for storing and retrieving DNA sequence variation data. It is most often used for single-nucleotide polymorphism (SNP) data, i.e., only the sites that are variable within a population or sample.
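To make the format concrete, the sketch below takes a single made-up VCF record and pairs each tab-separated field with its column name. The eight fixed columns plus FORMAT follow the VCF specification; the sample names (`sample_A`, etc.) and the record itself are invented for illustration and do not come from the sparrow dataset.

```python
# The 8 fixed VCF columns, followed by FORMAT and then one column per sample.
fixed_columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
sample_names = ["sample_A", "sample_B", "sample_C"]  # hypothetical sample names

# One made-up record: a SNP at position 101 with reference base A and alternate base G.
record_line = "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t./."

# Pair every field in the record with the name of its column.
record = dict(zip(fixed_columns + sample_names, record_line.split("\t")))

for name, value in record.items():
    print(f"{name:>8}: {value}")
```

Genotypes are coded as allele indices: `0` is the REF allele, `1` is the first ALT allele, and `.` is a missing call, so `0/1` is a heterozygote and `./.` is a missing genotype.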

Required software

Install the following with conda before running this notebook (if you haven't already done so).

# conda install pandas requests toyplot -c conda-forge

import toyplot
import requests
import pandas as pd

Fetch an example VCF file

This is an in-class challenge activity: I will give you prompts and motivation, and you'll have to figure out how to accomplish each task on your own (or with a partner).

First, use the requests module to download the VCF file from the URL below. Save the returned data to a file called wcs.vcf (the data are from a White-crowned sparrow study).

vcf_url = "https://raw.githubusercontent.com/isaacovercast/easySFS/refs/heads/master/example_files/wcs_1200.vcf"
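One possible way to approach the download, offered as a sketch rather than the expected answer: `requests.get` fetches the URL, `raise_for_status()` stops on an HTTP error, and the response text is written to disk. The helper name `download_text` and its `timeout` default are made up for this example.

```python
import requests


def download_text(url, path, timeout=30):
    """Download a text file from `url` and save it to `path` (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    with open(path, "w") as out:
        out.write(response.text)
    return path
```

Calling `download_text(vcf_url, "wcs.vcf")` with the URL defined above would then write the file into the current working directory.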

Converting to a pandas data frame

Try loading the wcs.vcf file into a pandas dataframe using pd.read_csv. What happens?

pandas expects a file to be formatted as tabular data, which a VCF actually is once you get past the extra header information at the top of the file. read_csv() takes an optional argument called header, which is the (0-based) row number of the line that contains the column names. VCF columns are tab-separated, so you will also need to pass sep="\t". See if you can figure out a way to identify the header row in this VCF file, and then load it into a data frame using the header parameter.
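The idea can be tried on a tiny made-up VCF before tackling the real file. In this sketch the meta-information lines start with `##` and the column header line starts with `#CHROM`, as in the VCF specification; the contents of `toy_vcf` and the sample names are invented, and this is only one of several ways to locate the header row.

```python
import io
import pandas as pd

# A tiny made-up VCF: '##' meta-information lines, one '#CHROM' header line, then records.
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "##source=toy_example\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_A\tsample_B\n"
    "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\n"
    "chr1\t250\t.\tC\tT\t50\tPASS\t.\tGT\t./.\t1/1\n"
)

# Find the 0-based index of the first line that does not start with '##'.
header_row = next(
    i for i, line in enumerate(toy_vcf.splitlines()) if not line.startswith("##")
)

# VCF columns are tab-separated, so sep="\t" is needed alongside header=.
df = pd.read_csv(io.StringIO(toy_vcf), sep="\t", header=header_row)
print(header_row)
print(df.columns.tolist())
```

Here `header_row` is 3 because three `##` lines precede the `#CHROM` line. For a file on disk, the same search could run over the lines of `open("wcs.vcf")`, assuming the file was saved under that name.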

Challenge question: Calculate % missing data

Missing data is often useful to quantify in a VCF file. In most cases a missing genotype is written as './.', indicating that no genotype was called for a sample at this SNP. Please calculate the percentage of genotype calls (across all samples and sites) that are missing in this dataset. Ask for hints about how to proceed if necessary.
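As a hint, here is a minimal sketch on a made-up data frame shaped like a parsed VCF. It assumes the first nine columns describe the site and every later column is a sample, and that missing diploid calls are written `./.`; the sample names and genotypes are invented, and phased missing calls (`.|.`) are not handled.

```python
import pandas as pd

# A made-up data frame shaped like a parsed VCF; sample names are hypothetical.
df = pd.DataFrame({
    "#CHROM": ["chr1", "chr1", "chr2"],
    "POS": [101, 250, 77],
    "ID": [".", ".", "."],
    "REF": ["A", "C", "G"],
    "ALT": ["G", "T", "A"],
    "QUAL": [50, 50, 50],
    "FILTER": ["PASS"] * 3,
    "INFO": ["."] * 3,
    "FORMAT": ["GT"] * 3,
    "sample_A": ["0/0", "./.", "0/1"],
    "sample_B": ["0/1", "1/1", "./."],
    "sample_C": ["./.", "./.", "0/0"],
    "sample_D": ["0/0", "0/1", "1/1"],
})

# The first 9 columns describe the site; the remaining columns are samples.
genotypes = df.iloc[:, 9:]

# Keep only the GT subfield (the text before the first ':'), in case other
# per-sample fields follow it, then flag the missing calls.
gt_only = genotypes.apply(lambda col: col.str.split(":").str[0])
is_missing = gt_only == "./."

# Missing calls divided by the total number of sample-by-site cells.
pct_missing = 100 * is_missing.sum().sum() / is_missing.size
print(f"{pct_missing:.1f}% of genotype calls are missing")
```

In this toy table 4 of the 12 genotype calls are missing. Replacing `.sum().sum()` with `is_missing.mean()` would instead give the missing fraction per sample, which is often just as informative.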