BIO00087H Workshop 7: Population genomics

The history of every organism is stored in its DNA.

Author

Daniel Jeffares

Published

October 16, 2024

The deadline for prject submission is now Wednesday 22nd January at 12:00 noon

Student services requested that we change this to avoid students having a deadline during an exam.

Modules not loading?

Try module --ignore-cache spider <name>

1 Learning objectives

To learn about the data, data analysis tools, and methods of population genomics
To explore an example of population genomics data

2 Introduction

2.1 Population genomics

At the end of workshop 3 we used diamnod to search for proteins that were conserved between Leishmania and yeast. This is an example of a comparative genomics analysis - comparing genomes between species. Population genomics is the study of genetic variation within species.

Typically we conduct population genomics by:

Obtaining many biological samples from many individuals within a species or population
Extracting DNA from each individual
Sequencing the DNA with short reads (eg: Illumina)
Using bioinformatics tools to identify genetic variants
- These are regions of the genome that differ between individuals within the species
- eg: single nucleotide polymorphisms (SNPs)
Analysing the genetic variants to discover aspects of the biology of the species

2.1.1 What can we discover from population genomics?

There are many reasons why we may wish to study genetic and phenotypic variation within species.

We can uncover genotypes that affect phenotypes, using genome wide association studies (GWAS)
We can detect population structuring and other population demographic features, such as between species hybrids, and population size changes
We can discover signals of recent adaptation within populations, such as selective sweeps

Population genomics terminology

Samples/individuals/strains: genetically different individuals within the species we are studying. Terminology differs between species.
Genetic variant: A location in the genome where individuals within a species, or within a population differ from one another.
SNP: single nucleotide polymorphism. A location within the genome where individuals differ by one nucleotide (eg: A or G).
Alleles: The two (or more) states that a genetic variant can be, eg: A or G
Allele frequency: The frequency of an allele in a population.
Population: A group of individuals that are more closely related to each other than another population.
Read coverage: the average number of reads aligned within a particular region of the genome.
Somy: The number of chromosome copies within a particular chromosome. Most diploid chromosomes are disomic. Trisomy is the state where a chromosome has three copies.

2.2 Population genomics methods

Typically, the bioinformatics part of population genomics proceeds in these steps:

genomic sequencing of multiple samples (makes fastq files)
sequence data quality control
mapping (alignment) of reads to a reference genome (generates a bam file)
identifying genetic variations (SNP calling) from bam files (generates a VCF file)
data analysis (many tools)

A more detailed summary is below, showing some of the tools that are used.

Figure 1.Population genomics methods. Examples of software tools that are used to conduct each step are shown in red. Note that, after read mapping and read alignment, large structural variants like copy number variants and inversions, are discovered using different software to small variants like SNPs and indels (small insertion/deletion variants).

3 Exercises

Today we will look at some population genomics data and begin to characterise populations of the Leishmania donovani species complex, which includes Leishmania donovani and Leishmania infantum.

3.1 Setting up

3.1.1 Directory set up

Start up the PC in Linux as last time, log in and open the Terminal window.
You should already have your own directory in
/shared/biology/bioldata1/bl-00087h/students/
Change to your directory with:
cd /shared/biology/bioldata1/bl-00087h/students/${USER}
Make a new directory for this workshop with: mkdir workshop7
Change to this directory with: cd workshop7

3.1.2 File set up

Today we will use files from the directory /shared/biology/bioldata1/bl-00087h/data/VL
To place symbolic links for all files in this directory within your workshop7 directory, do this from within your workshop7 directory:

ln -sf /shared/biology/bioldata1/bl-00087h/data/VL/* .

Don’t forget the dot (.) at the end of the line.

This contains:

bra.sample1.bam: a bam format file for sample bra.sample1 (sorted by chromosome and position)
bra.sample1.bam.bai: the index for this bam file (created with samtools index)
Ldonovani.samples.txt: a list of the samples in the VCF file. Each prefix describes which population the sample comes from (eg: ea1 is from East Africa). This includes the mystery samples.

That is all you will need to use today. It also contains some for those that will conduct your report on population genomics.

Ldonovani.vcf: a VCF file containing 86,697 SNPs from 20 individuals from the LDSC, and five mystery samples. The ’mystery samples are called, newcastle.sample0, liverpool.sample0, manchester.sample0, leeds.sample0 and york.sample0.
Ldon.chromosomes.bed: a browser extensible format (bed) file
A symbolic link to a directory called bams that contains many other .bam files.

All this data will be described below.

3.2 Getting to know the data: read mapping and bam files

3.2.1 Read mapping

Population genomics analysis usually does not create de novo assemblies of reads. Instead, we sequence many individuals with short reads, with low at (5-10x) coverage, and then map (align) reads to a reference genome.

A reference genome is merely one individual from a species that is well-studied, well assembled and well-annotated.

We then use the alignments of all the reads to locate genetic variants. The figure below, which was generated by the Integrative Genomics Viewer (IGV) illustrates how we can identify genetic variants from read alignments.

The most common mapping tool is the Burrows-Wheeler Aligner (BWA).

3.2.2 BAM files

The standard output for all read mapping software is sequence alignment map (SAM) file. These are usually encoded in binary format to create a binary sequence alignment map (BAM) format file. bam files contain information about each where each read aligned to the reference genome, and if there are any differences between the read, and the reference genome.

A visual representation of a BAM is shown below.

Figure 2. IGV view of read alignments. This shows the alignment of reads onto a reference genome. The position in the reference genome is indicated at the top. We are looking at 68 bp of the human genome in this case. The Coverage line show the read coverage at each position in the genome (how many reads are aligned here). At the very bottom the Sequence line show the sequence of the reference genome in this location. The grey bars in the largest section indicate where reads are aligned to the genome.

Where the read differs from the reference genome IGV shows the different base. In this site we have just one SNP (a G/T polymorphism). This is a heterozygous SNP, because about half the reads have each allele. We can also observe base call errors, where one read has a different base.

3.2.2.1 BAM files

BAM files are the substrate for SNP-calling. They contain information about:

Where each read maps to the reference genome (chromosome, position)
How reliable the alignment is (mapping quality)
DNA sequence differences between the read and the reference genome

This information is used by SNP-calling software to identify SNPs. BAM files are binary encoded (not text), so to view them we need to use samtools software, which is able to read, process and extract information from bams.

To view a bam:

Use module spider samtools to find the relevant samtools installation
Load the latest version with module load [samtools version]
Run: samtools view bra.sample1.bam | cut -f 1,3-5,10 | less, which will show you:

the read name (column 1)
the chromosome where is maps to (column 2)
the position on the chromosome where is maps to (column 3)
the mapping score (phred scaled)(column 4)
the aligned sequence (column 1)

A more intuitive view from samtools can be gained using samtools tview. In this view

....... means a read aligned on one strand (forward), with the same sequence as the reference
,,,,,,, means a read aligned on the other strand (reverse), with the same sequence as the reference
letter AGCT or agct, means a difference form the reference with high qualty (uppercase) or lower quality (lowercase)
* means a deletion or insertion

If you scroll along with the arrow keys you will certainly locations where all reads identify a different base in the strain bra.sample1 from the reference genome. These are SNPs. To exit this view, press q.

To start samtools tview, do:

samtools tview bra.sample1.bam TriTrypDB-46_LdonovaniBPK282A1_Genome.fasta

3.2.3 SNP-calling

SNP calling is the process of identifying reliable SNPs within BAM files, and disguising SNPs from read errors. Most SNP-calling software also identify short insertion/deletion mutations (indels).

SNP calling can be carried out using on BAM (from one individual), but is usually performed with BAM from multiple individuals from a species.

3.2.4 Genetic variants in VCF format

The most common format for describing genetic variants is variant call format (VCF)

VCF files are essentially a tab-delimited table, where each row describes the genetic variation at a specific site in the genome, and columns are either:

information about that site in the genome (chromosome, position, the alleles that are present, the quality of the variant (phred scaled)
information about the samples (individuals, strains) (which alleles this individual has, and what data supports these allele calls)

You can see a simplified example of a VCF file here. And by looking at the file Lguyanensis-simplified.vcf, with:

less Lguyanensis-simplified.vcf

It has a header line: #CHROM POS REF ALT, that describes for each genetic variation:

the chromosome where the variation is located
the variant’s position in the chromosome
the reference genome’s allele at this position (eg: A)
the alternative s allele at this position (eg: G)

After this, we have the genotypes for the strains that are in the VCF (four in this case). They are coded as 0 for the reference allele (eg: A), 1 for the second allele (eg: G), so for example:

AA is coded as 0/0
AG is coded as 0/1
GG is coded as 1/1

Lguyanensis-simplified.vcf is a simplified display of a VCF file - not a genuine VCF. The genuine VCF that we will use today is the file: Ldonovani.vcf. Have a look at this using less. You will see that it contains a lot more information, that describes the chromosomes, the SNP quality etc.

An important point is the file contains many lines that start with hash ##. These ‘comment lines’ contain important information about the format of the ‘data lines’ that come next.

You do not need to understand all this information today. But can learn more about the what is in VCF files later here, if you are curious.

3.3 Data analysis: subsetting and analysis of genetic diversity

Now we will conduct some analysis of the genetic diversity of the populations in our data. In our Ldonovani.vcf VCF file we have samples from four populations:

East Africa, population 1 (samples ea1.sample1, ea1.sample2, etc)
East Africa, population 2 (samples ea2.sample2.sample2, etc)
The Indian subcontinent, population 1 (samples isc1.sample1, etc)
The Indian subcontinent, population 2 (samples isc2.sample1, etc)
Brazil (samples bra.sample1, etc)

We will focus on the ea1 (East Africa) and bra (Brazil) populations here, but in your project you may analyse any/all of the samples.

3.3.1 Subsetting

Before we calculate the genetic diversity of each population, we have to extract some ‘columns’ (samples) from the VCF. We need a text file with the samples we want listed (one per row), so make these with:

grep bra Ldonovani.samples.txt > bra.samples.txt
grep ea1 Ldonovani.samples.txt > ea1.samples.txt

To check to see how many lines these file have do: wc -l *txt
bra.samples.txt and ea1.samples.txt should have 4 lines each.

Then, load the vcftools module with:

module load VCFtools/0.1.16-GCC-12.2.0

Remember: module spider [keyword]

Will find modules and the current version.

Then, use vcf-subset (part of vcftools) to extract the samples from each population. Be patient, this may take a minute or two:

cat Ldonovani.vcf | vcf-subset -c bra.samples.txt -e -a > bra.vcf &
cat Ldonovani.vcf | vcf-subset -c ea1.samples.txt -e -a > ea1.vcf &

Note

Using these methods, you could also make subsets of the SNP data corresponding to populations: ea2, isc1 or isc2, or an other set of strains.

3.3.2 Nucleotide diversity

Nucleotide diversity (usually denoted with the Greek letter π) is a measure of the genetic differences contained in a set of samples. It is calculated as the average number of SNP differences between all pairs of strains, divided by the length of the region compared. A good thing about π is that it is independent of the sample size.

π is proportional to the long term population size (the number of individuals). So by sequencing multiple individuals from one more more populations, we can learn about a population in the natural environment, that would be difficult or impossible to estimate using any other method.

We will calculate π along windows of the genome, using the vcftools software.

Then calculate π for windows of the genome (in the case 50 kb), like so:

vcftools --vcf ea1.vcf --window-pi 50000 --out ea1.data &
vcftools --vcf bra.vcf --window-pi 50000 --out bra.data &

The output files will be: bra.data.windowed.pi and ea1.data.windowed.pi. Later we will these load into R Studio for some analysis.

3.3.3 Tajima’s D

Tajima’s D (D) is another metric that describes the genetic diversity within a population. Essentially, D measures the proportion of SNPs with low allele frequencies (rare SNPs). The proportion of rare SNPs is increased when a population is expanding (and decreased when it is shrinking), so we can interpret the genome-wide mean of D like so:

Mean Tajima’s D positive: population is shrinking
Mean Tajima’s D around zero: population is stable in size
Mean Tajima’s D negative: population is expanding

So let’s Calculate D for the two populations we are examining:

vcftools --vcf ea1.vcf --TajimaD 50000 --out ea1.data &
vcftools --vcf bra.vcf --TajimaD 50000 --out bra.data &

To check that the Nucleotide diversity and Tajima’s D calculations ran as we expect, do:

wc -l *.data.*

The files ending in .Tajima.D and windowed.pi should have 570-665 lines. Also have a quick look at some of them with less. You’ll see that these are nicely formatted lines that can be read into R for analysis.

3.4 Discovering population structure using principal components

3.4.1 Principal components analysis (PCA)

With complex data, such as SNP data, there can be many signals. One signal is the genetic differences between each pair of strains. In our sample of 20 strains this would be possible to visualize, but with a sample size of 100 or 1000, this rapidly becomes unfeasible.

Principal components analysis is a statistical technique for reducing the dimensionality of the data. For population genomics, this enables us to capture the strongest signals of population clustering. This is a standard method in population genomics.

3.4.2 Calculating principal components from SNP data

We will use PLINK2 to calculate principal components. First load the PLINK module:

module load PLINK/2.00a3.6-GCC-11.3.0

Then make a file that can be read by PLINK:

plink2 --vcf Ldonovani.vcf --make-bed \
--autosome-num 36 --out Ldonovani.plink \
--allow-extra-chr --vcf-half-call r

Then use PLINK to calculate PCs:

plink2 --pca --bfile Ldonovani.plink \
--autosome-num 36 --allow-extra-chr \
--read-freq Ldonovani.plink.afreq --out Ldonovani.plink

The main output file that you need will be: Ldonovani.plink.eigenvec. This contains each sample (IID) and then the principal component coordinates for PC1, PC2, etc.

3.5 Analysis of data

The last part of this workshop will be conducted in R Studio. The nucleotide diversity, Tajima’s D and principal component data are all easily analysed in R Studio.

Managing and saving files in Linux

Within R Studio, you can open files, directly from the shared drive /shared/biology/bioldata1/bl-00087h/. You can also save files such as plots on that drive space.
To view plots made in R Studio, open the file browser (as above), use pwd in the Terminal app to
You won’t need this for today, but we describe how to move files off the /shared/biology/bioldata1 network drive here.

Figure 3. Ubuntu Linux desktop.

3.5.1 R Studio work

To start:

open the R Studio app, using the Show Apps icon at the bottom left of the screen
create a new R script (with a descriptive name), and save it
work though the R code below, copying the code into our script

First load libraries and set your working directory:

#load libraries
library(tidyverse)
library(ggpubr)

#set working directory to the directory on the /shared/biology/bioldata1 drive
#your path will be different
setwd("/shared/biology/bioldata1/bl-00087h/students/username/workshop7")

3.5.2 Nucleotide diversity data analysis

Next read the nucleotide diversity (π) data into R. Then add a population column using the R code below:

#load EA1 data
ea1pi<-as_tibble(read.table("ea1.data.windowed.pi",h=T))
#add a population column
ea1pi$population='EA1'

#load Brazil data
brapi<-as_tibble(read.table("bra.data.windowed.pi",h=T))
#add a population column
brapi$population='BRA'

#merge the two tibbles
pi<- bind_rows(ea1pi,brapi)

#check 
summary(pi)
glimpse(pi)

Now make a box and whiskers plot showing the nucleotide diversity between populations:

## nucleotide diversity plot
ggplot(pi, aes(x=population,y=PI))+
    geom_boxplot()+
    stat_compare_means()+
    theme_classic2()

#save jpeg of plot
ggsave("pi.boxplot.jpeg",width=5,height=7)

Think about the data

What do these plots show?
What does the difference in nucleotide diversity between populations tell you?

3.5.3 Tajima’s D data analysis

First, read in the data:

#read Tajima's D data
ea1d<-read.table("ea1.data.Tajima.D",h=T)
ea1d$population='EA1'
brad<-read.table("bra.data.Tajima.D",h=T)
brad$population='BRA'

#merge
taj<- bind_rows(ea1d,brad)

#check 
summary(taj)
glimpse(taj)

Now make some plots of Tajima’s D:

# Tajima's D boxplot
ggplot(taj, aes(x=population,y=TajimaD))+
  geom_boxplot()+
  stat_compare_means()+
  geom_hline(yintercept = 0,col=2,linetype=2)+
  theme_classic2()

#save jpeg of plot
ggsave("taj.boxplot.jpeg",width=5,height=7)

# Tajima's D histograms
ggplot(taj, aes(x=TajimaD))+
  geom_histogram()+
  geom_vline(xintercept = 0,col=2,linetype=2)+
  facet_wrap(~population)+
  theme_classic2()

#save jpeg of plot
ggsave("taj.hist.jpeg",width=7,height=5)

You should have various jpeg files in your working directory.

Think about the data again

What do the Tajima’s D plots show?
What does the difference in Tajima’s D between populations tell you? What can we infer from both the nucleotide diversity and Tajima’s D data?

3.5.4 Using principal components to detect populations

Now we will load up and plot the principal component coordinates we calculated with PLINK2, from the file Ldonovani.plink.eigenvec.

First load the data:

#read in PC values from Ldonovani.plink.eigenvec
pc<-read_tsv("Ldonovani.plink.eigenvec")

#rename the first column, for ease of use
names(pc)[1]='FID'

#add a population column, using grep
pc$population = "unknown"
pc[grep("bra",pc$IID),]$population = "Brazil"
pc[grep("ea1",pc$IID),]$population = "East Africa 1"
pc[grep("ea2",pc$IID),]$population = "East Africa 2"
pc[grep("isc1",pc$IID),]$population = "India 1"
pc[grep("isc2",pc$IID),]$population = "India 2"

#check
summary(pc)
view(pc)

Then make the plot showing where each strain is placed on the principal component coordinates:

#make the plot of principal components 1 and 2
ggplot(pc, aes(x=PC1,y=PC2,colour=population))+
    geom_jitter(height=0.01,width=0.01,size=5,pch=1)+
    theme_classic()

#save the plot
ggsave("my.PCA.plot.jpeg",width=7, height=7)

#NB: you can also plot other principal components using:
ggplot(pc, aes(x=PC3,y=PC4,colour=population))+
  geom_jitter(height=0.01,width=0.01,size=5,pch=1)+
  theme_classic()

You should have a file called my.PCA.plot.jpeg in your working directory.

Principal components analysis of genetic diversity shows an approximate genetic distance between strains. Note that we used geom_jitter to spread out the coordinates of the strains. In reality the strains within a population are much closer. Change the geom_jitter to zero to see how close, with:

geom_jitter(height=0,width=0,size=5,pch=1)

Think about the principal components data

What do these plots show about the genetic distance of Leishmania strains within one country?
What do they show about the relative genetic distance within vs between countries?

Think about all the data

This is the information we can obtain about these populations, from the genomes of only 20 strains:

estimates of population size (from π)
estimates of population growth/decline (from Tajima’s D)
understanding of relative genetic distance within vs between countries (from PCA)

4 End

That is all for today. Today we learned these aspects of population genomics research:

What we can learn population genomics
Some population genomics terminology
Bioinformatics methods used in population genomics
Read mapping, bam files and SNP-calling
Population genomics data in VCF files
Data analysis: Nucleotide diversity, Tajima’s D
Data analysis: using principal components to detect populations

5 After the workshop

See examples for Phd positions in genomics

At the Earlham Institute in Norwich

5.1 Some detail on software and population genomics

Usually we use the BWA software to map reads for diversity analysis. There are many others, that are designed for other uses (such as RNAseq).
There are many SNP calling tools. My preference (DJ) is to call with both Octopus and Freebayes, and keep only SNPs and indels called by both.
Read mapping takes time, particularly if one has many samples. However, it is what’s called an embarrassingly parallel process, because the reads from each sample can be mapped separately using a server like Viking which has 134 compute nodes, each with 96 CPU cores per node!

5.2 Population genomics for your report

You do not need to be an expert on evolutionary genetics or population genetics to conduct this analysis. The focus of the report should be simple data analysis and simplistic interpretations of the data. The information on this page and the video below by Mohammed Noor explains nucleotide diversity and Tajima’s D sufficiently well to interpret the data.

There are two data sets that you can analyse. It also OK to analyse either, or both.

5.2.1 Leishmania donovani data

This is an extension of the L. donovani data file that you analysed in the workshop.

Note that the VCF file contains samples from many populations. These are listed in the file Ldonovani.samples.txt. Included in these are the samples:

leeds.sample0
liverpool.sample0
manchester.sample0
newcastle.sample0
york.sample0

Each of these comes from one of the other populations (Brazil, EA1, EA2, ISC1 or ISC2). Part of your project work is to find out which population each comes from, and include them in your nucleotide diversity and Tajima’s D analysis.

There may be other metrics that vcftools can calculate in addition to nucleotide diversity and Tajima’s D.

You can download the and modify script BIO00087H-workshop7-analysis-2024-10-28.R to produce the analysis and plots.

5.2.2 Leishmania guyanensis data

The VCF file and the plink .afreq file you will need is here /shared/biology/bioldata1/bl-00087h/data/Lguyanensis/

This is a small population genomics data set of Leishmania guyanensis. All samples were obtained from a patients in a medical center in the city of Manaus, and they live in the region of Manaus.

Carrying out the same analysis with PCA coordinates, nucleotide diversity (π) and Tajima’s D is a good idea, as for the L. donovani data. Calculating an Fst estimate, using the -weir-fst-pop flag with vcftools might also be interesting. The vcftools manual page explains how to do this (and google will help).

Note

Hint: Calculate PCA coordinates first, before calculating nucleotide diversity and Tajima’s D in case there are several populations. Nucleotide diversity and Tajima’s D only make sense to calculate within populations.

--- title: "BIO00087H Workshop 7: Population genomics" subtitle: "The history of every organism is stored in its DNA." author: Daniel Jeffares date: 2024-10-16 format: html: code-copy: true code-tools: true code-link: true toc: true number-sections: true toc-location: right toc-depth: 2 highlight-style: github editor: source --- :::callout-important ## The deadline for prject submission is now Wednesday 22nd January at 12:00 noon Student services requested that we change this to avoid students having a deadline during an exam. ::: :::callout-important ## Modules not loading? Try `module --ignore-cache spider <name>` ::: # Learning objectives * To learn about the data, data analysis tools, and methods of population genomics * To explore an example of population genomics data # Introduction ## Population genomics At the end of [workshop 3](https://djeffares.github.io/BIO87H/workshop3.html#running-diamond) we used diamnod to search for proteins that were conserved between *Leishmania* and yeast. This is an example of a **comparative genomics** analysis - comparing genomes ***between*** species. Population genomics is the study of genetic variation ***within*** species. Typically we conduct population genomics by: * Obtaining many biological samples from many individuals within a species or population * Extracting DNA from each individual * Sequencing the DNA with short reads (eg: Illumina) * Using bioinformatics tools to identify genetic variants * These are regions of the genome that differ between individuals within the species * eg: single nucleotide polymorphisms (SNPs) * Analysing the genetic variants to discover aspects of the biology of the species ### What can we discover from population genomics? There are many reasons why we may wish to study genetic and phenotypic variation within species. * We can uncover genotypes that affect phenotypes, using [genome wide association studies](https://en.wikipedia.org/wiki/Genome-wide_association_study) (GWAS) * We can detect [population structuring](https://en.wikipedia.org/wiki/Population_structure_(genetics)) and other population demographic features, such as between species hybrids, and population size changes * We can discover signals of recent adaptation within populations, such as [selective sweeps](https://en.wikipedia.org/wiki/Selective_sweep) ::: callout-important ## Population genomics terminology * **Samples/individuals/strains**: genetically different individuals within the species we are studying. Terminology differs between species. * **Genetic variant**: A location in the genome where individuals within a species, or within a population differ from one another. * **SNP**: single nucleotide polymorphism. A location within the genome where individuals differ by one nucleotide (eg: A or G). * **Alleles**: The two (or more) states that a genetic variant can be, eg: A or G * **Allele frequency**: The frequency of an allele in a population. * **Population**: A group of individuals that are more closely related to each other than another population. * **Read coverage**: the average number of reads aligned within a particular region of the genome. * **Somy**: The number of chromosome copies within a particular chromosome. Most diploid chromosomes are disomic. Trisomy is the state where a chromosome has three copies. ::: ## Population genomics methods Typically, the bioinformatics part of population genomics proceeds in these steps: 1. genomic sequencing of multiple samples (makes fastq files) 2. sequence data quality control 3. mapping (alignment) of reads to a reference genome (generates a bam file) 4. identifying genetic variations (SNP calling) from bam files (generates a VCF file) 5. data analysis (many tools) A more detailed summary is below, showing some of the tools that are used. ![](popgen-methods.jpg){width=100%} **Figure 1.Population genomics methods**. Examples of software tools that are used to conduct each step are shown in red. Note that, after read mapping and read alignment, large *structural variants* like copy number variants and inversions, are discovered using different software to small variants like SNPs and indels (small insertion/deletion variants). # Exercises Today we will look at some population genomics data and begin to characterise populations of the *Leishmania donovani* species complex, which includes *Leishmania donovani* and *Leishmania infantum*. ## Setting up ### Directory set up 1. Start up the PC in Linux as last time, log in and open the **Terminal window**. 2. You should already have your own directory in \ `/shared/biology/bioldata1/bl-00087h/students/` 3. Change to your directory with:\ `cd /shared/biology/bioldata1/bl-00087h/students/${USER}` 4. Make a new directory for this workshop with: `mkdir workshop7` 5. Change to this directory with: `cd workshop7` ### File set up 1. Today we will use files from the directory `/shared/biology/bioldata1/bl-00087h/data/VL` 2. To place symbolic links for all files in this directory within your `workshop7` directory, do this **from within your `workshop7` directory**: `ln -sf /shared/biology/bioldata1/bl-00087h/data/VL/* .` Don't forget the dot (**`.`**) at the end of the line. This contains: * `bra.sample1.bam`: a [bam format file](https://en.wikipedia.org/wiki/SAM_(file_format)) for sample bra.sample1 (sorted by chromosome and position) * `bra.sample1.bam.bai`: the index for this bam file (created with samtools index) * `Ldonovani.samples.txt`: a list of the samples in the VCF file. Each prefix describes which population the sample comes from (eg: ea1 is from East Africa). This includes the mystery samples. That is all you will need to use today. It also contains some for those that will conduct your report on population genomics. * `Ldonovani.vcf`: a VCF file containing 86,697 SNPs from 20 individuals from the LDSC, and five *mystery samples*. The ‘mystery samples are called, newcastle.sample0, liverpool.sample0, manchester.sample0, leeds.sample0 and york.sample0. * `Ldon.chromosomes.bed`: a [browser extensible format](https://en.wikipedia.org/wiki/BED_(file_format)) (bed) file * A symbolic link to a directory called `bams` that contains many other .bam files. All this data will be described below. ## Getting to know the data: read mapping and bam files ### Read mapping Population genomics analysis usually does not create *de novo* assemblies of reads. Instead, we sequence many individuals with short reads, with low at (5-10x) coverage, and then map (align) reads to a *reference genome*. A reference genome is merely one individual from a species that is well-studied, well assembled and well-annotated. We then use the alignments of all the reads to locate genetic variants. The figure below, which was generated by the Integrative Genomics Viewer (IGV) illustrates how we can identify genetic variants from read alignments. The most common mapping tool is the [Burrows-Wheeler Aligner (BWA)](https://pubmed.ncbi.nlm.nih.gov/19451168/). ### BAM files The standard output for all read mapping software is sequence alignment map (SAM) file. These are usually encoded in binary format to create a binary sequence alignment map (BAM) format file. bam files contain information about each where each read aligned to the reference genome, and if there are any differences between the read, and the reference genome. A visual representation of a BAM is shown below. \ ![](igv.jpg){width=100%} \ **Figure 2. IGV view of read alignments.** This shows the alignment of reads onto a reference genome. The position in the reference genome is indicated at the top. We are looking at 68 bp of the human genome in this case. The **Coverage** line show the read coverage at each position in the genome (how many reads are aligned here). At the very bottom the **Sequence** line show the sequence of the reference genome in this location. The grey bars in the largest section indicate where reads are aligned to the genome. **Where the read differs from the reference genome** IGV shows the different base. In this site we have just one SNP (a G/T polymorphism). This is a *heterozygous* SNP, because about half the reads have each allele. We can also observe base call errors, where one read has a different base. #### BAM files BAM files are the substrate for SNP-calling. They contain information about: * Where each read maps to the reference genome (chromosome, position) * How reliable the alignment is (mapping quality) * DNA sequence differences between **the read** and **the reference genome** This information is used by SNP-calling software to identify SNPs. BAM files are binary encoded (not text), so to view them we need to use **samtools** software, which is able to read, process and extract information from bams. To view a bam: 1. Use `module spider samtools` to find the relevant samtools installation 2. Load the latest version with `module load [samtools version]` 3. Run: `samtools view bra.sample1.bam | cut -f 1,3-5,10 | less`, which will show you: * the read name (column 1) * the chromosome where is maps to (column 2) * the position on the chromosome where is maps to (column 3) * the mapping score (phred scaled)(column 4) * the aligned sequence (column 1) A more intuitive view from samtools can be gained using `samtools tview`. In this view * `.......` means a read aligned on one strand (forward), with the same sequence as the reference * `,,,,,,,` means a read aligned on the other strand (reverse), with the same sequence as the reference * letter `AGCT` or `agct`, means a difference form the reference with high qualty (uppercase) or lower quality (lowercase) * `*` means a deletion or insertion If you scroll along with the arrow keys you will certainly locations where all reads identify a different base in the strain ` bra.sample1` from the reference genome. These are SNPs. To exit this view, press `q`. To start `samtools tview`, do: `samtools tview bra.sample1.bam TriTrypDB-46_LdonovaniBPK282A1_Genome.fasta` ### SNP-calling SNP calling is the process of identifying reliable SNPs within BAM files, and disguising SNPs from read errors. Most SNP-calling software also identify short insertion/deletion mutations (indels). SNP calling can be carried out using on BAM (from one individual), but is usually performed with BAM from multiple individuals from a species. ### Genetic variants in VCF format The most common format for describing genetic variants is [variant call format (VCF)](https://en.wikipedia.org/wiki/Variant_Call_Format) VCF files are essentially a tab-delimited table, where each row describes the genetic variation at a specific site in the genome, and columns are either: * information about that site in the genome (chromosome, position, the alleles that are present, the quality of the variant (phred scaled) * information about the samples (individuals, strains) (which alleles this individual has, and what data supports these allele calls) You can see a simplified example of [a VCF file here](https://docs.google.com/spreadsheets/d/1lsstwAHEeFe0X0lyfcnw5f3Pz7YttsLNMrFQM1YW6gU/edit?usp=sharing). And by looking at the file `Lguyanensis-simplified.vcf`, with: `less Lguyanensis-simplified.vcf` It has a header line: `#CHROM POS REF ALT`, that describes for each genetic variation: * the chromosome where the variation is located * the variant's position in the chromosome * the reference genome's allele at this position (eg: A) * the alternative s allele at this position (eg: G) After this, we have the genotypes for the strains that are in the VCF (four in this case). They are coded as 0 for the reference allele (eg: A), 1 for the second allele (eg: G), so for example: * AA is coded as 0/0 * AG is coded as 0/1 * GG is coded as 1/1 `Lguyanensis-simplified.vcf` is a simplified display of a VCF file - not a genuine VCF. The genuine VCF that we will use today is the file: `Ldonovani.vcf`. Have a look at this using `less`. You will see that it contains a **lot** more information, that describes the chromosomes, the SNP quality etc. An important point is the file contains many lines that start with hash `##`. These 'comment lines' contain important information about the format of the 'data lines' that come next. You do not need to understand all this information today. But can learn more about the what is in VCF files later [here](https://en.wikipedia.org/wiki/Variant_Call_Format), if you are curious. ## Data analysis: subsetting and analysis of genetic diversity Now we will conduct some analysis of the genetic diversity of the populations in our data. In our `Ldonovani.vcf` VCF file we have samples from four populations: * **East Africa, population 1** (samples ea**1**.sample1, ea**1**.sample2, *etc*) * **East Africa, population 2** (samples ea**2**.sample**2**.sample2, *etc*) * **The Indian subcontinent, population 1** (samples isc1.sample1, *etc*) * **The Indian subcontinent, population 2** (samples isc2.sample1, *etc*) * **Brazil** (samples bra.sample1, *etc*) We will focus on the **ea1** (East Africa) and bra (Brazil) populations here, but in your project you may analyse any/all of the samples. ### Subsetting Before we calculate the genetic diversity of each population, we have to extract some ‘columns’ (samples) from the VCF. We need a text file with the samples we want listed (one per row), so make these with: `grep bra Ldonovani.samples.txt > bra.samples.txt ` \ `grep ea1 Ldonovani.samples.txt > ea1.samples.txt ` To check to see how many lines these file have do: `wc -l *txt`\ `bra.samples.txt` and `ea1.samples.txt` should have 4 lines each. Then, load the vcftools module with: `module load VCFtools/0.1.16-GCC-12.2.0` :::callout-note # Remember: `module spider [keyword]` Will find modules and the current version. ::: Then, use `vcf-subset` (part of vcftools) to extract the samples from each population. Be patient, this may take a minute or two: `cat Ldonovani.vcf | vcf-subset -c bra.samples.txt -e -a > bra.vcf &` \ `cat Ldonovani.vcf | vcf-subset -c ea1.samples.txt -e -a > ea1.vcf &` :::callout-note Using these methods, you could also make subsets of the SNP data corresponding to populations: **ea2**, **isc1** or **isc2**, or an other set of strains. ::: ### Nucleotide diversity Nucleotide diversity (usually denoted with the Greek letter **`π`**) is a measure of the genetic differences contained in a set of samples. It is calculated as the average number of SNP differences between all pairs of strains, divided by the length of the region compared. A good thing about π is that it is independent of the sample size. π is proportional to the long term population size (the number of individuals). So by sequencing multiple individuals from one more more populations, we can learn about a population in the natural environment, that would be difficult or impossible to estimate using any other method. We will calculate π along windows of the genome, using the `vcftools` software. Then calculate π for windows of the genome (in the case 50 kb), like so: `vcftools --vcf ea1.vcf --window-pi 50000 --out ea1.data &` \ `vcftools --vcf bra.vcf --window-pi 50000 --out bra.data &` The output files will be: `bra.data.windowed.pi` and `ea1.data.windowed.pi`. Later we will these load into R Studio for some analysis. ### Tajima’s D Tajima’s D (**`D`**) is another metric that describes the genetic diversity within a population. Essentially, D measures the proportion of SNPs with low allele frequencies (rare SNPs). The proportion of rare SNPs is increased when a population is expanding (and decreased when it is shrinking), so we can interpret the genome-wide mean of D like so: * Mean Tajima’s D positive: population is shrinking * Mean Tajima’s D around zero: population is stable in size * Mean Tajima’s D negative: population is expanding So let's Calculate D for the two populations we are examining: `vcftools --vcf ea1.vcf --TajimaD 50000 --out ea1.data &` \ `vcftools --vcf bra.vcf --TajimaD 50000 --out bra.data &` To check that the Nucleotide diversity and Tajima’s D calculations ran as we expect, do: `wc -l *.data.*` The files ending in `.Tajima.D` and `windowed.pi` should have 570-665 lines. Also have a quick look at some of them with `less`. You'll see that these are nicely formatted lines that can be read into R for analysis. ## Discovering population structure using principal components ### Principal components analysis (PCA) With complex data, such as SNP data, there can be many signals. One signal is the genetic differences between each pair of strains. In our sample of 20 strains this would be possible to visualize, but with a sample size of 100 or 1000, this rapidly becomes unfeasible. Principal components analysis is a statistical technique for reducing the dimensionality of the data. For population genomics, this enables us to capture the strongest signals of population clustering. This is a standard method in population genomics. ### Calculating principal components from SNP data We will use `PLINK2` to calculate principal components. First load the PLINK module: `module load PLINK/2.00a3.6-GCC-11.3.0` Then make a file that can be read by PLINK: `plink2 --vcf Ldonovani.vcf --make-bed \` \ `--autosome-num 36 --out Ldonovani.plink \` \ `--allow-extra-chr --vcf-half-call r` Then use PLINK to calculate PCs: `plink2 --pca --bfile Ldonovani.plink \` \ `--autosome-num 36 --allow-extra-chr \` \ `--read-freq Ldonovani.plink.afreq --out Ldonovani.plink` The main output file that you need will be: `Ldonovani.plink.eigenvec`. This contains each sample (`IID`) and then the principal component coordinates for PC1, PC2, etc. ## Analysis of data The last part of this workshop will be conducted in R Studio. The nucleotide diversity, Tajima's D and principal component data are all easily analysed in R Studio. :::callout-note # Managing and saving files in Linux Within R Studio, you can open files, directly from the shared drive `/shared/biology/bioldata1/bl-00087h/`. You can also save files such as plots on that drive space. \ To view plots made in R Studio, open the file browser (as above), use `pwd` in the Terminal app to \ You won't need this for today, but we describe how to move files off the `/shared/biology/bioldata1` network drive [here](https://djeffares.github.io/BIO87H/). ::: ![](ubuntu-desktop.jpg){width=100%} **Figure 3. Ubuntu Linux desktop.** ### R Studio work To start: * open the R Studio app, using the Show Apps icon at the bottom left of the screen * create a new R script (with a descriptive name), and save it * work though the R code below, copying the code into our script First load libraries and set your working directory: ```{r} #| echo: true #| warning: false #| eval: false #load libraries library(tidyverse) library(ggpubr) #set working directory to the directory on the /shared/biology/bioldata1 drive #your path will be different setwd("/shared/biology/bioldata1/bl-00087h/students/username/workshop7") ``` ### Nucleotide diversity data analysis Next read the nucleotide diversity (π) data into R. Then add a `population` column using the R code below: ```{r} #| echo: true #| warning: false #| eval: false #load EA1 data ea1pi<-as_tibble(read.table("ea1.data.windowed.pi",h=T)) #add a population column ea1pi$population='EA1' #load Brazil data brapi<-as_tibble(read.table("bra.data.windowed.pi",h=T)) #add a population column brapi$population='BRA' #merge the two tibbles pi<- bind_rows(ea1pi,brapi) #check summary(pi) glimpse(pi) ``` \ Now make a box and whiskers plot showing the nucleotide diversity between populations: ```{r} #| echo: true #| warning: false #| eval: false ## nucleotide diversity plot ggplot(pi, aes(x=population,y=PI))+ geom_boxplot()+ stat_compare_means()+ theme_classic2() #save jpeg of plot ggsave("pi.boxplot.jpeg",width=5,height=7) ``` \ :::callout-important # Think about the data What do these plots show? \ What does the difference in nucleotide diversity between populations tell you? ::: \ ### Tajima's D data analysis First, read in the data: ```{r} #| echo: true #| warning: false #| eval: false #read Tajima's D data ea1d<-read.table("ea1.data.Tajima.D",h=T) ea1d$population='EA1' brad<-read.table("bra.data.Tajima.D",h=T) brad$population='BRA' #merge taj<- bind_rows(ea1d,brad) #check summary(taj) glimpse(taj) ``` Now make some plots of Tajima's D: ```{r} #| echo: true #| warning: false #| eval: false # Tajima's D boxplot ggplot(taj, aes(x=population,y=TajimaD))+ geom_boxplot()+ stat_compare_means()+ geom_hline(yintercept = 0,col=2,linetype=2)+ theme_classic2() #save jpeg of plot ggsave("taj.boxplot.jpeg",width=5,height=7) # Tajima's D histograms ggplot(taj, aes(x=TajimaD))+ geom_histogram()+ geom_vline(xintercept = 0,col=2,linetype=2)+ facet_wrap(~population)+ theme_classic2() #save jpeg of plot ggsave("taj.hist.jpeg",width=7,height=5) ``` \ You should have various jpeg files in your working directory. \ :::callout-important # Think about the data again What do the Tajima's D plots show? \ What does the difference in Tajima's D between populations tell you? What can we infer from *both* the nucleotide diversity and Tajima's D data? ::: \ ### Using principal components to detect populations Now we will load up and plot the principal component coordinates we calculated with `PLINK2`, from the file `Ldonovani.plink.eigenvec`. \ First load the data: ```{r} #| echo: true #| warning: false #| eval: false #read in PC values from Ldonovani.plink.eigenvec pc<-read_tsv("Ldonovani.plink.eigenvec") #rename the first column, for ease of use names(pc)[1]='FID' #add a population column, using grep pc$population = "unknown" pc[grep("bra",pc$IID),]$population = "Brazil" pc[grep("ea1",pc$IID),]$population = "East Africa 1" pc[grep("ea2",pc$IID),]$population = "East Africa 2" pc[grep("isc1",pc$IID),]$population = "India 1" pc[grep("isc2",pc$IID),]$population = "India 2" #check summary(pc) view(pc) ``` \ Then make the plot showing where each strain is placed on the principal component coordinates: ```{r} #| echo: true #| warning: false #| eval: false #make the plot of principal components 1 and 2 ggplot(pc, aes(x=PC1,y=PC2,colour=population))+ geom_jitter(height=0.01,width=0.01,size=5,pch=1)+ theme_classic() #save the plot ggsave("my.PCA.plot.jpeg",width=7, height=7) #NB: you can also plot other principal components using: ggplot(pc, aes(x=PC3,y=PC4,colour=population))+ geom_jitter(height=0.01,width=0.01,size=5,pch=1)+ theme_classic() ``` \ You should have a file called `my.PCA.plot.jpeg` in your working directory. **Principal components analysis of genetic diversity** shows an *approximate* genetic distance between strains. Note that we used `geom_jitter` to spread out the coordinates of the strains. In reality the strains within a population are much closer. Change the `geom_jitter` to zero to see how close, with: ```{r} #| echo: true #| warning: false #| eval: false geom_jitter(height=0,width=0,size=5,pch=1) ``` \ :::callout-important # Think about the principal components data What do these plots show about the genetic distance of *Leishmania* strains within one country?\ What do they show about the relative genetic distance within vs between countries? ::: \ :::callout-important # Think about **all** the data This is the information we can obtain about these populations, from the genomes of only 20 strains: * estimates of population size (from π) * estimates of population growth/decline (from Tajima's D) * understanding of relative genetic distance within vs between countries (from PCA) ::: # End That is all for today. Today we learned these aspects of population genomics research: * What we can learn population genomics * Some population genomics terminology * Bioinformatics methods used in population genomics * Read mapping, bam files and SNP-calling * Population genomics data in VCF files * Data analysis: Nucleotide diversity, Tajima’s D * Data analysis: using principal components to detect populations # After the workshop :::callout-note ## See examples for Phd positions in genomics [At the Earlham Institute in Norwich](https://www.earlham.ac.uk/studentship-vacancies?utm_source=Earlham+Institute+Digital+Updates&utm_campaign=066ee2fbe3-Mailchimp_COPY_01&utm_medium=email&utm_term=0_3c2e0987cf-066ee2fbe3-454837657) ::: ## Some detail on software and population genomics * Usually we use the [BWA](https://github.com/lh3/bwa) software to map reads for diversity analysis. There are many others, that are designed for other uses (such as RNAseq). * There are many SNP calling tools. My preference (DJ) is to call with both [Octopus](https://luntergroup.github.io/octopus/) and [Freebayes](https://github.com/freebayes/freebayes), and keep only SNPs and indels called by both. * Read mapping takes time, particularly if one has many samples. However, it is what's called an [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) process, because the reads from each sample can be mapped separately using a server like [Viking](https://www.york.ac.uk/it-services/tools/viking/) which has 134 compute nodes, each with 96 CPU cores per node! ## Population genomics for your report You **do not** need to be an expert on evolutionary genetics or population genetics to conduct this analysis. The focus of the report should be simple data analysis and simplistic interpretations of the data. The information on this page and the video below by Mohammed Noor explains nucleotide diversity and Tajima’s D sufficiently well to interpret the data. {{<video https://youtu.be/wiyay4YMq2A?si=PK-CxJD5IyoWszra>}} --- There are two data sets that you can analyse. It also OK to analyse either, or both. ### Leishmania donovani data This is an extension of the *L. donovani* data file that you analysed in the workshop. Note that the VCF file contains samples from many populations. These are listed in the file `Ldonovani.samples.txt`. Included in these are the samples: * leeds.sample0 * liverpool.sample0 * manchester.sample0 * newcastle.sample0 * york.sample0 Each of these comes from one of the other populations (Brazil, EA1, EA2, ISC1 or ISC2). Part of your project work is to find out which population each comes from, and include them in your nucleotide diversity and Tajima’s D analysis. There may be other metrics that [vcftools](https://vcftools.github.io/man_latest.html) can calculate in addition to nucleotide diversity and Tajima’s D. You can download the and modify script [BIO00087H-workshop7-analysis-2024-10-28.R](https://drive.google.com/file/d/19mmwOKSdmklt2WNcYqp7XSLgfWiLWp-S/view?usp=sharing) to produce the analysis and plots. ### Leishmania guyanensis data The VCF file and the plink `.afreq` file you will need is here `/shared/biology/bioldata1/bl-00087h/data/Lguyanensis/` This is a small population genomics data set of *Leishmania guyanensis*. All samples were obtained from a patients in a medical center in the city of Manaus, and they live in the region of Manaus. Carrying out the same analysis with PCA coordinates, nucleotide diversity (π) and Tajima's D is a good idea, as for the *L. donovani* data. Calculating an Fst estimate, using the `-weir-fst-pop` flag with `vcftools` might also be interesting. The [vcftools manual page](https://vcftools.github.io/man_latest.html) explains how to do this (and google will help). :::callout-note **Hint:** Calculate PCA coordinates *first*, before calculating nucleotide diversity and Tajima's D in case there are several populations. Nucleotide diversity and Tajima's D only make sense to calculate *within* populations. :::