BIO00087H Workshop 7: Population genomics
The history of every organism is stored in its DNA.
Student services requested that we change this to avoid students having a deadline during an exam.
Try module --ignore-cache spider <name>
1 Learning objectives
- To learn about the data, data analysis tools, and methods of population genomics
- To explore an example of population genomics data
2 Introduction
2.1 Population genomics
At the end of workshop 3 we used diamnod to search for proteins that were conserved between Leishmania and yeast. This is an example of a comparative genomics analysis - comparing genomes between species. Population genomics is the study of genetic variation within species.
Typically we conduct population genomics by:
- Obtaining many biological samples from many individuals within a species or population
- Extracting DNA from each individual
- Sequencing the DNA with short reads (eg: Illumina)
- Using bioinformatics tools to identify genetic variants
- These are regions of the genome that differ between individuals within the species
- eg: single nucleotide polymorphisms (SNPs)
- Analysing the genetic variants to discover aspects of the biology of the species
2.1.1 What can we discover from population genomics?
There are many reasons why we may wish to study genetic and phenotypic variation within species.
- We can uncover genotypes that affect phenotypes, using genome wide association studies (GWAS)
- We can detect population structuring and other population demographic features, such as between species hybrids, and population size changes
- We can discover signals of recent adaptation within populations, such as selective sweeps
Samples/individuals/strains: genetically different individuals within the species we are studying. Terminology differs between species.
Genetic variant: A location in the genome where individuals within a species, or within a population differ from one another.
SNP: single nucleotide polymorphism. A location within the genome where individuals differ by one nucleotide (eg: A or G).
Alleles: The two (or more) states that a genetic variant can be, eg: A or G
Allele frequency: The frequency of an allele in a population.
Population: A group of individuals that are more closely related to each other than another population.
Read coverage: the average number of reads aligned within a particular region of the genome.
Somy: The number of chromosome copies within a particular chromosome. Most diploid chromosomes are disomic. Trisomy is the state where a chromosome has three copies.
2.2 Population genomics methods
Typically, the bioinformatics part of population genomics proceeds in these steps:
- genomic sequencing of multiple samples (makes fastq files)
- sequence data quality control
- mapping (alignment) of reads to a reference genome (generates a bam file)
- identifying genetic variations (SNP calling) from bam files (generates a VCF file)
- data analysis (many tools)
A more detailed summary is below, showing some of the tools that are used.
Figure 1.Population genomics methods. Examples of software tools that are used to conduct each step are shown in red. Note that, after read mapping and read alignment, large structural variants like copy number variants and inversions, are discovered using different software to small variants like SNPs and indels (small insertion/deletion variants).
3 Exercises
Today we will look at some population genomics data and begin to characterise populations of the Leishmania donovani species complex, which includes Leishmania donovani and Leishmania infantum.
3.1 Setting up
3.1.1 Directory set up
Start up the PC in Linux as last time, log in and open the Terminal window.
You should already have your own directory in
/shared/biology/bioldata1/bl-00087h/students/
Change to your directory with:
cd /shared/biology/bioldata1/bl-00087h/students/${USER}
Make a new directory for this workshop with:
mkdir workshop7
Change to this directory with:
cd workshop7
3.1.2 File set up
Today we will use files from the directory
/shared/biology/bioldata1/bl-00087h/data/VL
To place symbolic links for all files in this directory within your
workshop7
directory, do this from within yourworkshop7
directory:
ln -sf /shared/biology/bioldata1/bl-00087h/data/VL/* .
Don’t forget the dot (.
) at the end of the line.
This contains:
bra.sample1.bam
: a bam format file for sample bra.sample1 (sorted by chromosome and position)bra.sample1.bam.bai
: the index for this bam file (created with samtools index)Ldonovani.samples.txt
: a list of the samples in the VCF file. Each prefix describes which population the sample comes from (eg: ea1 is from East Africa). This includes the mystery samples.
That is all you will need to use today. It also contains some for those that will conduct your report on population genomics.
Ldonovani.vcf
: a VCF file containing 86,697 SNPs from 20 individuals from the LDSC, and five mystery samples. The ’mystery samples are called, newcastle.sample0, liverpool.sample0, manchester.sample0, leeds.sample0 and york.sample0.Ldon.chromosomes.bed
: a browser extensible format (bed) fileA symbolic link to a directory called
bams
that contains many other .bam files.
All this data will be described below.
3.2 Getting to know the data: read mapping and bam files
3.2.1 Read mapping
Population genomics analysis usually does not create de novo assemblies of reads. Instead, we sequence many individuals with short reads, with low at (5-10x) coverage, and then map (align) reads to a reference genome.
A reference genome is merely one individual from a species that is well-studied, well assembled and well-annotated.
We then use the alignments of all the reads to locate genetic variants. The figure below, which was generated by the Integrative Genomics Viewer (IGV) illustrates how we can identify genetic variants from read alignments.
The most common mapping tool is the Burrows-Wheeler Aligner (BWA).
3.2.2 BAM files
The standard output for all read mapping software is sequence alignment map (SAM) file. These are usually encoded in binary format to create a binary sequence alignment map (BAM) format file. bam files contain information about each where each read aligned to the reference genome, and if there are any differences between the read, and the reference genome.
A visual representation of a BAM is shown below.
Figure 2. IGV view of read alignments. This shows the alignment of reads onto a reference genome. The position in the reference genome is indicated at the top. We are looking at 68 bp of the human genome in this case. The Coverage line show the read coverage at each position in the genome (how many reads are aligned here). At the very bottom the Sequence line show the sequence of the reference genome in this location. The grey bars in the largest section indicate where reads are aligned to the genome.
Where the read differs from the reference genome IGV shows the different base. In this site we have just one SNP (a G/T polymorphism). This is a heterozygous SNP, because about half the reads have each allele. We can also observe base call errors, where one read has a different base.
3.2.2.1 BAM files
BAM files are the substrate for SNP-calling. They contain information about:
- Where each read maps to the reference genome (chromosome, position)
- How reliable the alignment is (mapping quality)
- DNA sequence differences between the read and the reference genome
This information is used by SNP-calling software to identify SNPs. BAM files are binary encoded (not text), so to view them we need to use samtools software, which is able to read, process and extract information from bams.
To view a bam:
- Use
module spider samtools
to find the relevant samtools installation - Load the latest version with
module load [samtools version]
- Run:
samtools view bra.sample1.bam | cut -f 1,3-5,10 | less
, which will show you:
- the read name (column 1)
- the chromosome where is maps to (column 2)
- the position on the chromosome where is maps to (column 3)
- the mapping score (phred scaled)(column 4)
- the aligned sequence (column 1)
A more intuitive view from samtools can be gained using samtools tview
. In this view
-
.......
means a read aligned on one strand (forward), with the same sequence as the reference -
,,,,,,,
means a read aligned on the other strand (reverse), with the same sequence as the reference - letter
AGCT
oragct
, means a difference form the reference with high qualty (uppercase) or lower quality (lowercase) -
*
means a deletion or insertion
If you scroll along with the arrow keys you will certainly locations where all reads identify a different base in the strain bra.sample1
from the reference genome. These are SNPs. To exit this view, press q
.
To start samtools tview
, do:
samtools tview bra.sample1.bam TriTrypDB-46_LdonovaniBPK282A1_Genome.fasta
3.2.3 SNP-calling
SNP calling is the process of identifying reliable SNPs within BAM files, and disguising SNPs from read errors. Most SNP-calling software also identify short insertion/deletion mutations (indels).
SNP calling can be carried out using on BAM (from one individual), but is usually performed with BAM from multiple individuals from a species.
3.2.4 Genetic variants in VCF format
The most common format for describing genetic variants is variant call format (VCF)
VCF files are essentially a tab-delimited table, where each row describes the genetic variation at a specific site in the genome, and columns are either:
information about that site in the genome (chromosome, position, the alleles that are present, the quality of the variant (phred scaled)
information about the samples (individuals, strains) (which alleles this individual has, and what data supports these allele calls)
You can see a simplified example of a VCF file here. And by looking at the file Lguyanensis-simplified.vcf
, with:
less Lguyanensis-simplified.vcf
It has a header line: #CHROM POS REF ALT
, that describes for each genetic variation:
- the chromosome where the variation is located
- the variant’s position in the chromosome
- the reference genome’s allele at this position (eg: A)
- the alternative s allele at this position (eg: G)
After this, we have the genotypes for the strains that are in the VCF (four in this case). They are coded as 0 for the reference allele (eg: A), 1 for the second allele (eg: G), so for example:
- AA is coded as 0/0
- AG is coded as 0/1
- GG is coded as 1/1
Lguyanensis-simplified.vcf
is a simplified display of a VCF file - not a genuine VCF. The genuine VCF that we will use today is the file: Ldonovani.vcf
. Have a look at this using less
. You will see that it contains a lot more information, that describes the chromosomes, the SNP quality etc.
An important point is the file contains many lines that start with hash ##
. These ‘comment lines’ contain important information about the format of the ‘data lines’ that come next.
You do not need to understand all this information today. But can learn more about the what is in VCF files later here, if you are curious.
3.3 Data analysis: subsetting and analysis of genetic diversity
Now we will conduct some analysis of the genetic diversity of the populations in our data. In our Ldonovani.vcf
VCF file we have samples from four populations:
- East Africa, population 1 (samples ea1.sample1, ea1.sample2, etc)
- East Africa, population 2 (samples ea2.sample2.sample2, etc)
- The Indian subcontinent, population 1 (samples isc1.sample1, etc)
- The Indian subcontinent, population 2 (samples isc2.sample1, etc)
- Brazil (samples bra.sample1, etc)
We will focus on the ea1 (East Africa) and bra (Brazil) populations here, but in your project you may analyse any/all of the samples.
3.3.1 Subsetting
Before we calculate the genetic diversity of each population, we have to extract some ‘columns’ (samples) from the VCF. We need a text file with the samples we want listed (one per row), so make these with:
grep bra Ldonovani.samples.txt > bra.samples.txt
grep ea1 Ldonovani.samples.txt > ea1.samples.txt
To check to see how many lines these file have do: wc -l *txt
bra.samples.txt
and ea1.samples.txt
should have 4 lines each.
Then, load the vcftools module with:
module load VCFtools/0.1.16-GCC-12.2.0
module spider [keyword]
Will find modules and the current version.
Then, use vcf-subset
(part of vcftools) to extract the samples from each population. Be patient, this may take a minute or two:
cat Ldonovani.vcf | vcf-subset -c bra.samples.txt -e -a > bra.vcf &
cat Ldonovani.vcf | vcf-subset -c ea1.samples.txt -e -a > ea1.vcf &
Using these methods, you could also make subsets of the SNP data corresponding to populations: ea2, isc1 or isc2, or an other set of strains.
3.3.2 Nucleotide diversity
Nucleotide diversity (usually denoted with the Greek letter π
) is a measure of the genetic differences contained in a set of samples. It is calculated as the average number of SNP differences between all pairs of strains, divided by the length of the region compared. A good thing about π is that it is independent of the sample size.
π is proportional to the long term population size (the number of individuals). So by sequencing multiple individuals from one more more populations, we can learn about a population in the natural environment, that would be difficult or impossible to estimate using any other method.
We will calculate π along windows of the genome, using the vcftools
software.
Then calculate π for windows of the genome (in the case 50 kb), like so:
vcftools --vcf ea1.vcf --window-pi 50000 --out ea1.data &
vcftools --vcf bra.vcf --window-pi 50000 --out bra.data &
The output files will be: bra.data.windowed.pi
and ea1.data.windowed.pi
. Later we will these load into R Studio for some analysis.
3.3.3 Tajima’s D
Tajima’s D (D
) is another metric that describes the genetic diversity within a population. Essentially, D measures the proportion of SNPs with low allele frequencies (rare SNPs). The proportion of rare SNPs is increased when a population is expanding (and decreased when it is shrinking), so we can interpret the genome-wide mean of D like so:
- Mean Tajima’s D positive: population is shrinking
- Mean Tajima’s D around zero: population is stable in size
- Mean Tajima’s D negative: population is expanding
So let’s Calculate D for the two populations we are examining:
vcftools --vcf ea1.vcf --TajimaD 50000 --out ea1.data &
vcftools --vcf bra.vcf --TajimaD 50000 --out bra.data &
To check that the Nucleotide diversity and Tajima’s D calculations ran as we expect, do:
wc -l *.data.*
The files ending in .Tajima.D
and windowed.pi
should have 570-665 lines. Also have a quick look at some of them with less
. You’ll see that these are nicely formatted lines that can be read into R for analysis.
3.4 Discovering population structure using principal components
3.4.1 Principal components analysis (PCA)
With complex data, such as SNP data, there can be many signals. One signal is the genetic differences between each pair of strains. In our sample of 20 strains this would be possible to visualize, but with a sample size of 100 or 1000, this rapidly becomes unfeasible.
Principal components analysis is a statistical technique for reducing the dimensionality of the data. For population genomics, this enables us to capture the strongest signals of population clustering. This is a standard method in population genomics.
3.4.2 Calculating principal components from SNP data
We will use PLINK2
to calculate principal components. First load the PLINK module:
module load PLINK/2.00a3.6-GCC-11.3.0
Then make a file that can be read by PLINK:
plink2 --vcf Ldonovani.vcf --make-bed \
--autosome-num 36 --out Ldonovani.plink \
--allow-extra-chr --vcf-half-call r
Then use PLINK to calculate PCs:
plink2 --pca --bfile Ldonovani.plink \
--autosome-num 36 --allow-extra-chr \
--read-freq Ldonovani.plink.afreq --out Ldonovani.plink
The main output file that you need will be: Ldonovani.plink.eigenvec
. This contains each sample (IID
) and then the principal component coordinates for PC1, PC2, etc.
3.5 Analysis of data
The last part of this workshop will be conducted in R Studio. The nucleotide diversity, Tajima’s D and principal component data are all easily analysed in R Studio.
Within R Studio, you can open files, directly from the shared drive /shared/biology/bioldata1/bl-00087h/
. You can also save files such as plots on that drive space.
To view plots made in R Studio, open the file browser (as above), use pwd
in the Terminal app to
You won’t need this for today, but we describe how to move files off the /shared/biology/bioldata1
network drive here.
Figure 3. Ubuntu Linux desktop.
3.5.1 R Studio work
To start:
- open the R Studio app, using the Show Apps icon at the bottom left of the screen
- create a new R script (with a descriptive name), and save it
- work though the R code below, copying the code into our script
First load libraries and set your working directory:
3.5.2 Nucleotide diversity data analysis
Next read the nucleotide diversity (π) data into R. Then add a population
column using the R code below:
#load EA1 data
ea1pi<-as_tibble(read.table("ea1.data.windowed.pi",h=T))
#add a population column
ea1pi$population='EA1'
#load Brazil data
brapi<-as_tibble(read.table("bra.data.windowed.pi",h=T))
#add a population column
brapi$population='BRA'
#merge the two tibbles
pi<- bind_rows(ea1pi,brapi)
#check
summary(pi)
glimpse(pi)
Now make a box and whiskers plot showing the nucleotide diversity between populations:
## nucleotide diversity plot
ggplot(pi, aes(x=population,y=PI))+
geom_boxplot()+
stat_compare_means()+
theme_classic2()
#save jpeg of plot
ggsave("pi.boxplot.jpeg",width=5,height=7)
What do these plots show?
What does the difference in nucleotide diversity between populations tell you?
3.5.3 Tajima’s D data analysis
First, read in the data:
#read Tajima's D data
ea1d<-read.table("ea1.data.Tajima.D",h=T)
ea1d$population='EA1'
brad<-read.table("bra.data.Tajima.D",h=T)
brad$population='BRA'
#merge
taj<- bind_rows(ea1d,brad)
#check
summary(taj)
glimpse(taj)
Now make some plots of Tajima’s D:
# Tajima's D boxplot
ggplot(taj, aes(x=population,y=TajimaD))+
geom_boxplot()+
stat_compare_means()+
geom_hline(yintercept = 0,col=2,linetype=2)+
theme_classic2()
#save jpeg of plot
ggsave("taj.boxplot.jpeg",width=5,height=7)
# Tajima's D histograms
ggplot(taj, aes(x=TajimaD))+
geom_histogram()+
geom_vline(xintercept = 0,col=2,linetype=2)+
facet_wrap(~population)+
theme_classic2()
#save jpeg of plot
ggsave("taj.hist.jpeg",width=7,height=5)
You should have various jpeg files in your working directory.
What do the Tajima’s D plots show?
What does the difference in Tajima’s D between populations tell you? What can we infer from both the nucleotide diversity and Tajima’s D data?
3.5.4 Using principal components to detect populations
Now we will load up and plot the principal component coordinates we calculated with PLINK2
, from the file Ldonovani.plink.eigenvec
.
First load the data:
#read in PC values from Ldonovani.plink.eigenvec
pc<-read_tsv("Ldonovani.plink.eigenvec")
#rename the first column, for ease of use
names(pc)[1]='FID'
#add a population column, using grep
pc$population = "unknown"
pc[grep("bra",pc$IID),]$population = "Brazil"
pc[grep("ea1",pc$IID),]$population = "East Africa 1"
pc[grep("ea2",pc$IID),]$population = "East Africa 2"
pc[grep("isc1",pc$IID),]$population = "India 1"
pc[grep("isc2",pc$IID),]$population = "India 2"
#check
summary(pc)
view(pc)
Then make the plot showing where each strain is placed on the principal component coordinates:
#make the plot of principal components 1 and 2
ggplot(pc, aes(x=PC1,y=PC2,colour=population))+
geom_jitter(height=0.01,width=0.01,size=5,pch=1)+
theme_classic()
#save the plot
ggsave("my.PCA.plot.jpeg",width=7, height=7)
#NB: you can also plot other principal components using:
ggplot(pc, aes(x=PC3,y=PC4,colour=population))+
geom_jitter(height=0.01,width=0.01,size=5,pch=1)+
theme_classic()
You should have a file called my.PCA.plot.jpeg
in your working directory.
Principal components analysis of genetic diversity shows an approximate genetic distance between strains. Note that we used geom_jitter
to spread out the coordinates of the strains. In reality the strains within a population are much closer. Change the geom_jitter
to zero to see how close, with:
geom_jitter(height=0,width=0,size=5,pch=1)
What do these plots show about the genetic distance of Leishmania strains within one country?
What do they show about the relative genetic distance within vs between countries?
This is the information we can obtain about these populations, from the genomes of only 20 strains:
- estimates of population size (from π)
- estimates of population growth/decline (from Tajima’s D)
- understanding of relative genetic distance within vs between countries (from PCA)
4 End
That is all for today. Today we learned these aspects of population genomics research:
- What we can learn population genomics
- Some population genomics terminology
- Bioinformatics methods used in population genomics
- Read mapping, bam files and SNP-calling
- Population genomics data in VCF files
- Data analysis: Nucleotide diversity, Tajima’s D
- Data analysis: using principal components to detect populations
5 After the workshop
5.1 Some detail on software and population genomics
- Usually we use the BWA software to map reads for diversity analysis. There are many others, that are designed for other uses (such as RNAseq).
- There are many SNP calling tools. My preference (DJ) is to call with both Octopus and Freebayes, and keep only SNPs and indels called by both.
- Read mapping takes time, particularly if one has many samples. However, it is what’s called an embarrassingly parallel process, because the reads from each sample can be mapped separately using a server like Viking which has 134 compute nodes, each with 96 CPU cores per node!
5.2 Population genomics for your report
You do not need to be an expert on evolutionary genetics or population genetics to conduct this analysis. The focus of the report should be simple data analysis and simplistic interpretations of the data. The information on this page and the video below by Mohammed Noor explains nucleotide diversity and Tajima’s D sufficiently well to interpret the data.
There are two data sets that you can analyse. It also OK to analyse either, or both.
5.2.1 Leishmania donovani data
This is an extension of the L. donovani data file that you analysed in the workshop.
Note that the VCF file contains samples from many populations. These are listed in the file Ldonovani.samples.txt
. Included in these are the samples:
- leeds.sample0
- liverpool.sample0
- manchester.sample0
- newcastle.sample0
- york.sample0
Each of these comes from one of the other populations (Brazil, EA1, EA2, ISC1 or ISC2). Part of your project work is to find out which population each comes from, and include them in your nucleotide diversity and Tajima’s D analysis.
There may be other metrics that vcftools can calculate in addition to nucleotide diversity and Tajima’s D.
You can download the and modify script BIO00087H-workshop7-analysis-2024-10-28.R to produce the analysis and plots.
5.2.2 Leishmania guyanensis data
The VCF file and the plink .afreq
file you will need is here /shared/biology/bioldata1/bl-00087h/data/Lguyanensis/
This is a small population genomics data set of Leishmania guyanensis. All samples were obtained from a patients in a medical center in the city of Manaus, and they live in the region of Manaus.
Carrying out the same analysis with PCA coordinates, nucleotide diversity (π) and Tajima’s D is a good idea, as for the L. donovani data. Calculating an Fst estimate, using the -weir-fst-pop
flag with vcftools
might also be interesting. The vcftools manual page explains how to do this (and google will help).
Hint: Calculate PCA coordinates first, before calculating nucleotide diversity and Tajima’s D in case there are several populations. Nucleotide diversity and Tajima’s D only make sense to calculate within populations.