BIO00087H Workshop 3: Genome assembly, gene finding and BLAST

Making a genome from raw reads

Author

Daniel Jeffares

Published

July 17, 2024

The deadline for prject submission is now Wednesday 22nd January at 12:00 noon

Student services requested that we change this to avoid students having a deadline during an exam.

Modules not loading?

Try module --ignore-cache spider <name>

1 Learning objectives

The objectives of this workshop are:

To gain skill and confidence with running software in Linux
To learn concepts and terminology of genome assembly
To learn about genome annotation pipelines
To learn about BLAST (Basic Local Alignment Search Tool) and diamond

How to check that your analysis is running.

Method 1: jobs -l
Method 2: top -u username

Try them both out some time today.

2 Introduction

2.1 From reads to chromosomes

NGS and 3rd generation sequencing machines produce reads, not joined-up chromosomes. If we want to know the complete genome sequence of a new species (all chromosomes, end to end), oursteps after sequencing will be:

Check the quality of our read data
Join the reads together to form contigs
Join the contigs to form scaffolds (aiming for full chromosomes)
Polish (correct) the contigs/scaffolds sequences
Locate the protein-coding gene coordinates; starts, stops, exons and intron positions
Locate non-coding RNA genes
Annotate the genes (identify probable biochemical/cellular functions)

This genome can then be used for many different kinds of genome-scale analysis and other non-genomic research - like molecular cell biology. Today we will illustrate steps 1-7.

Genome assembly terminology

Contigs: a contiguous sequence generated from determining the non-redundant path along an order set of component sequences. A contig should contain no gaps.
Scaffolds: an ordered and oriented set of contigs. A scaffold will contain gaps.
N50: The length of the shortest contig, in the set of contigs that contain 50% of the total assembly length. The image below will help to explain this.
N90: As above but for 90% of the total assembly length.
L50: The number of contigs that make up 50% of the genome.

Figure 1. Assembly concepts explained Reads are the initial DNA sequences produced fom a sequencing machine. These may be short as shown here* (eg: Illumina technology), or long (eg:ONT/PacBIO). Reads are assembled into conigs. Contigs are ordered into scaffolds.
* Short reads are typically 150 nt long (not as short as shown here).

Figure 2. N50 The N50 is an assembly metric. In this example, the total assembly length is 100 kb, and the N50 is 8 kb. So 50% of our assembly is in contigs of the N50 value or larger. To calculate the N50, we order the contigs from the longest to the shortest. Then we add up their length, until the total length makes up at least 50% of our total assembly length. The L50 is the number of contigs that add up to 50% of the assembly length. Both N50 and L50 are simple, useful metrics.

3 Exercises

3.1 Setting up

3.1.1 Directory set up

Start up the PC in Linux as last time, log in and open the Terminal window.
You should already have your own directory in
/shared/biology/bioldata1/bl-00087h/students/

If not, make one as described here.
Change to your directory with:
cd /shared/biology/bioldata1/bl-00087h/students/[username]
Make a new directory for this workshop with: mkdir workshop3
Change to this directory with: cd workshop3

3.1.2 File set up

Today we will use files from this directory:
/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/
To place symbolic links for all files in this directory within your workshop3 directory, do this:

ln -sf /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/* .

Don’t forget the dot . at the end!
Check that everything worked by listing the files with ls
You should now have symbolic links to these files (and some others):

Lbraz-subset.fastq.gz
Lbraz-subset.necat.config.txt
Lbraz-subset.readlist.txt

Lbraz-subset.fastq.gz contains some Oxford Nanopore Technology (ONT) long read data from the protozoan parasite Leishmania braziliensis. The other files are required to run NECAT, a genome assembly software.

Dots and spaces are important in Linux commands.

3.2 Assessing read data quality

It is important to assess the quality of your data before you use it. We will use the fastp tool for this. Fastp can also filter fastQ files to remove reads that you prefer not to use. You won’t need it today, but the fastp manual is here.

To run fastp:

Load the module

module load fastp/0.23.4-GCC-13.2.0

Run fastp

fastp --in1 Lbraz-subset.fastq.gz --out1 Lbraz-subset.filtered.fastq.gz --html Lbraz-subset.fastp.html --reads_to_process 1000 &> Lbraz-subset.fastp.output.txt &

Note all the flags that we use:

--in1 specifies the input file
--out1 specifies the output file name for the filtered reads
--html specifies the html format report file name

We capture the output using the &> symbols. This captures text output that would usually appear on the screen, and saves it to a file called Lbraz-subset.fastp.output.txt.

Here we use --reads_to_process 1000 to process only 1,000 reads, so the software runs quickly. If you run fastp for your project, omit this option to process all the reads.

To look at the fastp results, view the file Lbraz-subset.fastp.output.txt using:

less Lbraz-subset.fastp.output.txt

You will see various statistics about the reads:

The total bases tells us how many nucleotides of sequence data are in this file.
Since this is 1000 reads, what is the average read length?
Q20 bases and Q30 bases described how many bases are above Phred scaled quality scores 20 and 30 respectively. Q20 means one error error per 100 nucleotides.

3.3 Genome assembly

Now we will assemble these ONT reads using the NECAT assembly software.

NECAT runs in three stages:

necat correct corrects the raw reads
necat assemble generates contigs
necat bridge bridges (joins) contigs to create scaffolds and polishes (corrects base call errors) in the scaffolds.

We will run only correct and assemble today, because with so few reads, bridge doesn’t improve the assembly. Note that bridge does improve the assembly if we have more reads.

The manual for NECAT is here, and the original article is Chen 2021. You don’t need to read these today.

3.3.1 NECAT files

NECAT requires two files to run.

A config file. This contains details about what the software should do.
A read list file. This contains a list of all the sequence reads to assemble.

You should have these files your working directory:

Lbraz-subset.necat.config.txt
Lbraz-subset.readlist.txt

3.3.2 Running NECAT correct

Load the NECAT module: module load NECAT/0.0.1-GCCcore-12.3.0
Run NECAT to correct the raw reads:
necat.pl correct Lbraz-subset.necat.config.txt &> Lbraz-subset.correct.log & This will take 5-10 minutes to complete.
Using ls -F will show you what files you have, and mark directories with a forward slash (/) after the directory name. You should now have a directory called Lbraz-subset.assembly. All the assembly output file swill go here.

Wait for NECAT correct to complete before running the next step.

Check that your analysis is running.

Method 1: jobs -l
Method 2: top -u username

You might look at the contents of the file Lbraz-subset.assemble.log, or read ahead while you wait. Or talk to us about genomics!

3.3.3 Running NECAT assemble and brige

Once NECAT correct is complete, run NECAT assemble:
necat.pl assemble Lbraz-subset.necat.config.txt &> Lbraz-subset.assemble.log &
When assemble is complete, run NECAT bridge:
necat.pl bridge Lbraz-subset.necat.config.txt &> Lbraz-subset.bridge.log &

Your assembly is complete. Well done. You now have some long ONT reads, assembled into contigs, and then into scaffolds.

3.3.4 Assembly quality assessment

There are many ways to assess the quality an assembly. A simple way is to look at the metrics that NECAT itself provides. These are at the end the log files such as Lbraz-subset.bridge.log. To view the last 11 lines of this file do:

tail -n 11 Lbraz-subset.bridge.log

You will see information something like this (but not exactly), telling us the number of contigs (counts), their total, maximum and minimum length, and the N and L values for 25%, 50% and 75% of the assembly lengths.

Count: 2
Total: 133018
Max: 124310
Min: 8708
N25: 124310
L25: 1
N50: 124310
L50: 1
N75: 124310
L75: 1

You can also check how many contigs your assembly has by looking at the final output file in your directory Lbraz-subset.assembly/6-bridge_contigs/polished_contigs.fasta. Use either less or grep -c '>' to have quick look.

3.3.5 This is a terrible assemby.

Think about the result

This assembly has just one contig! That can’t be right, because other Leishmania species have 35 or 36 chromosomes.

It is bad because we used very few reads. To save time in the assembly process, we provided only 10% of the reads we had available for this species.

See Appendix 1 if you want to make an assembly of all the reads for your report.

3.4 Genome annotation

Once we have an assembled genome, the next step is to annotate the genome, to locate the genes and other features within the genome. Protein-coding and non-coding (eg: tRNA) gene-finding is a complex process, which utilities a variety of information, including:

gene start and stop codons
transcript data from processed mRNAs (cDNAs)
homology between predicted proteins in the new genome and related genomes
signals of non-coding RNA secondary structures

3.4.1 Annotation with AUGUSTUS

There are many tools to identify protein coding and non-coding genes. Today, we will use AUGUSTUS, which is simple and runs rapidly.

We will not get much from annotating the L. braziliensis assembly because it is such a terrible assembly. Instead, let’s annotate another Leishmania species assembly, Leishmania lainsoni.

A link the the assembly file Llainsoni_NMT.fasta would already be in your workshop3 directory. Check this with ls *fasta.

3.4.1.1 Running AUGUSTUS

To run AUGUSTUS, load the module:

module load AUGUSTUS/3.5.0-foss-2023b

Then run:

augustus --sample=10 --species=leishmania_tarentolae Llainsoni_NMT.fasta > Llainsoni_NMT.fasta.gff &

This will take about 2-5 minutes to run. Note that we use --species=leishmania_tarentolae. This is because a pre-constructed database for this species is available, but there is not one for Leishmania lainsoni.

You can safely ignore this warning

Warning: Number of sample iterations ‘sample’=10 is low. Posterior probabilities will be only rough estimates.

Once the previous command is done, the output file is in the gff (general feature format) file Llainsoni_NMT.fasta.gff. Have a quick look at it using less.

You can extract the just the protein sequences from the gff output file, with:

getAnnoFasta.pl Llainsoni_NMT.fasta.gff

This will produce the file Llainsoni_NMT.fasta.aa. Use less to have a quick look at this file.

To count how many proteins have been predicted in this genome do
grep -c '>' Llainsoni_NMT.fasta.aa.

Note

Detailed instructors on running AUGUSTUS are here - but you won’t need these for today.

3.5 BLAST

BLAST (the Basic Local Alignment Search Tool) is one of the most used tools in bioinformatics.

BLAST finds regions of similarity between DNA or protein sequences. Each search we perform compares a nucleotide or protein sequence query to a sequence database (which can be DNa or protein). BLAST calculates how matches we should expect, and reports this as the E-value (the ‘expect value’).

Uses for BLAST

Gene function prediction. Once we have found new proteins in our genome, we want to know what their biochemical functions are. By identifying similar proteins, whose functions we know from experimental data, we can infer that our ‘query’ protein has similar function to the ‘hit’ protein.
Identifying homologous genes for evolutionary analysis. Evolutionary biologists use phylogenetic trees to understand many processes. BLAST helps to locate proteins to build these trees.

To run a BLAST search you need:

a query: a protein or nucleotide sequence that you want to find something similar to
a database of protein or nucleotide sequences. This is where we look for the similar proteins/nucleotide sequences. This is just a list of protein/nucleotide sequences re-formatted for BLAST.

Proteins in the database that match the query are usually called hits

3.6 Diamond: faster than BLAST

BLAST is good (and very flexible), but today we will use diamond, which is more than 100 times faster than BLAST for protein-protein searches. We will use diamond to find which proteins in Leishmania lainsoni are very similar to proteins in budding yeast Saccharomyces cerevisiae.

Optional information: The source of the yeast data

http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans_all.fasta.gz

3.6.1 Running diamond

You will have links to the files S.cerevisae.S288C.proteins.fasta and a pre-built diamond database of these yeast proteins in your workshop3 directory (S.cerevisae.S288C.proteins.dmnd).

We will now search every Leishmania lainsoni protein against every Saccharomyces cerevisiae protein to find potential orthologs (proteins that are similar by evolutionary descent). Here is how:

Check you have the yeast data with ls S*.
Load the diamond module:

module load DIAMOND/2.1.9-GCC-13.2.0

Run the search. Here we use --id 70 to specify that we want to see only hits that are ≥ 70% identical. If you omit this flag, diamond locates all protein matches.

diamond blastp --query Llainsoni_NMT.fasta.aa --db S.cerevisae.S288C.proteins.dmnd --out Llainsoni-v-yeast.ident70.tsv --id 70

The output file will be: Llainsoni-v-yeast.ident70.tsv

3.6.2 Evaluating diamond results

First, look at your output file with:

less Llainsoni-v-yeast.ident70.tsv

And count how many hits we got with:

wc -l Llainsoni-v-yeast.ident70.tsv

Think about the result.

Use grep -c '>' *.fasta to find out:

How many L. lainsoni proteins we used as queries?
How many yeast proteins we searched against?

Why did we get so few hits?

The information in the output file are below:

column	field	meaning
1	qseqid	query or source (gene) sequence id
2	sseqid	subject or target (reference genome) sequence id
3	pident	percentage of identical positions (in the alignment)
4	length	alignment length (sequence overlap)
5	mismatch	number of mismatches (in alignment)
6	gapopen	number of gap openings (in alignment)
7	qstart	start of alignment in query
8	qend	end of alignment in query
9	sstart	start of alignment in subject
10	send	end of alignment in subject
11	evalue	the ‘expect value’ (see below)
12	bitscore	the ‘bit score’ (see below)

Table 1. Diamond standard output format. The bit score (bitscore) gives an indication of how good the alignment is; the higher the score, the better the alignment. The **expect value* (E-value) is a probability for how many alignments of this quality you would expect in a database of this size. The lower the E-value, the better the hit. E-values are similar to, but not the same as, p-values. Because outputs are simple tab-delimited files, it is easy to read these into R and analyse them.

There are many things you can do with BLAST or diamond searches. See the After the workshop section for one example.

4 End

That is all for today. Today we learned:

About genome assemblies
How to run NECAT, a genome assembly tool
How to run AGUSTUS a genome annotation tool
About using BLAST and diamond to search for proteins in one species (Leishmania) that are similar to another species (Saccharomyces), or a larger data set (such as all proteins ever discovered).

The power of such BLAST/diamond searches is that may proteins in Saccharomyces have been experimentally characterised. If we find a Leishmania protein that has a similar protein sequence to an experimentally characterised Saccharomyces protein, we have an excellent prediction for the Leishmania protein’s function.

5 After the workshop

5.1 Evaluating diamond results

Note

Can you guess what kinds of proteins have not changed much since the > hundred million years since Leishmania and yeast diverged?

Let’s find out what kinds of proteins have not changed much since Leishmania and yeast diverged.

First extract the yeast gene names from your diamond output with:

cat Llainsoni-v-yeast.ident70.tsv | cut -f 2

Then go to the yeast genome database gene list query page.

Copy and paste the yeast genes from the diamond output into the box, and click CONTINUE.

That’s it!.

6 Planning for your report

The assessment for this module is a 2000 word written report. The bioinformatic data analysis section involves replicating and extending one of the data analysis workshops we teach during the module. The workshops are:

Workshop 3: Genome assembly, gene finding and BLAST
Workshop 4: RNAseq analysis
Workshop 5: Single cell RNAseq
Workshop 6: Population genomics

The guide to writing your report is here.

7 Appendix: Assembly, annotation and/or BLAST for your project

7.1 Appendix 1. Making an assembly with all the ONT reads

To save time for this workshop, we only provided 10% of the reads we had for L. braziliensis. If you want to run genome assembly and annotation for your report, the full set of reads are available here:

/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome

This directory contain these files, that you should copy into (or make symbolic links) to your report directory.

Lbraz-allreads.necat.config.txt
Lbraz-allreads.readlist.txt
Lbraz-minion.fastq.gz
Lbraz-promethion.fastq.gz

7.1.1 How to edit config files, filter reads and run an assembly

We now provide an example of the commands to run fastp, filter reads, calculate read coverage, run and assembly, and annotate that assembly here: https://djeffares.github.io/make-gald-assembly-2024-12-07.txt

We also provide a filtered We also provide a much smaller set of reads that have been filtered to keep only reads that are at least 10,000 nt long, and have a mean base quality score ≥ 20. These are ere:

/shared/biology/bioldata1/bl-00087h/data/G.sulphuraria/genome/ont-reads/gald.filt.len10k.avQ20.fastq.gz

Read coverage

‘Read coverage’ is how many times we have sequenced the genome. For example, if each region of the genome is sequenced by 10 reads we have ten times (10x) coverage. We usually need ≥ 40 x coverage for a reasonable assembly, and increasing this to ~100 x can improve it. After 100x, we often don’t see much gain.

7.2 Appendix 2. Assessing assembly quality with QUAST

Problems running software?

Here’s what the experts do:

Check the module is loaded with module list
Use -h (or --help). For example, typing quast -h will show you the help page for QUAST.
Read the manual: Googling quast manual will usually find the manual.
Most command line software we use on Linux has a --help or -h flag.

For project work, we advise you to use QUAST (Quality Assessment Tool) to assess the quality of your assembly. Here’s how to do it:

Load the module: module load QUAST/5.2.0-foss-2022a (or use module spider quast if this doesn’t work)
Run QUAST: quast --output-dir quast-results <assemby_contigs.fasta>

Here <assemby_contigs.fasta> is the name of your assembly file. For example, if you are using the L. braziliensis assembly, this would be quast --output-dir quast-results Lbraz-subset.assembly/6-bridge_contigs/polished_contigs.fasta.

The directory quast-results contains the results of QUAST. We find that the quast-results/report.tsv file is the simplest to read.

7.3 Appendix 3. Annotating your assembly with AUGUSTUS

Load the AUGUSTUS module and run on your final assembly as above.

7.3.1 How accurate are the predictions?

AUGUSTUS reports the posterior probabilities of its predictions for genes. This is the likelihood that the prediction is correct, given the data.

The posterior probabilities are estimated using a sampling algorithm. The parameter –sample==n adjusts the number of sampling iterations.You may choose to improve the accuracy of the likelihood estimates by increasing the sampling, like so:

augustus --sample=100 --species=leishmania_tarentolae my.fasta > my.fasta.sample100.gff

More information about AUGUSTUS is here.

You can extract the probabilities of the CDS (coding sequences) from the gff format output file like so:

cat my.gff | grep -v '#' | grep CDS > my.CDS.gff

And the GFF file could be loaded into RStudio for analysis. The posterior probabilities are in the 6th column of this file. Plotting and comparing posterior probabilities between assemblies or different samplings could make for an interesting project.

7.4 Appendix 4. BLAST/Diamond work

When you have predicted protein sequences from your assembly, running diamond will give you some information about how the protein sequences from the L. braziliensis genome compare to another species.

You can obtain FASTA format protein data sets for 66 different Trypanosomatid species here:

/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/leishmania-protein-data/trypanomatid_protein_sets

Identifying the most and least conserved proteins between L. braziliensis and one selected other species would be sufficient for the BLAST-searching component of a report.

--- title: "BIO00087H Workshop 3: Genome assembly, gene finding and BLAST" subtitle: "Making a genome from raw reads" author: Daniel Jeffares date: 2024-07-17 format: html: code-copy: true code-tools: true code-link: true toc: true number-sections: true toc-location: right toc-depth: 2 highlight-style: github editor: source --- :::callout-note ## The deadline for prject submission is now Wednesday 22nd January at 12:00 noon Student services requested that we change this to avoid students having a deadline during an exam. ::: :::callout-important ## Modules not loading? Try `module --ignore-cache spider <name>` ::: # Learning objectives The objectives of this workshop are: 1. To gain skill and confidence with running software in Linux 2. To learn concepts and terminology of genome assembly 3. To learn about genome annotation pipelines 4. To learn about BLAST (Basic Local Alignment Search Tool) and diamond ::: callout-tip # How to check that your analysis is running. * Method 1: `jobs -l` * Method 2: `top -u username` \ Try them both out some time today. ::: # Introduction ## From reads to chromosomes NGS and 3rd generation sequencing machines produce reads, not joined-up chromosomes. If we want to know the **complete** genome sequence of a new species (all chromosomes, end to end), oursteps after sequencing will be: 1. Check the quality of our read data 2. Join the reads together to form **contigs** 3. Join the contigs to form **scaffolds** (aiming for full chromosomes) 4. Polish (correct) the contigs/scaffolds sequences 5. Locate the protein-coding gene coordinates; starts, stops, exons and intron positions 6. Locate non-coding RNA genes 7. Annotate the genes (identify probable biochemical/cellular functions) This genome can then be used for many different kinds of genome-scale analysis and other non-genomic research - like molecular cell biology. Today we will illustrate steps 1-7. ::: callout-important ## Genome assembly terminology * **Contigs**: a **contiguous** sequence generated from determining the non-redundant path along an order set of component sequences. A contig should contain no gaps. * **Scaffolds**: an ordered and oriented set of contigs. A scaffold will contain gaps. * **N50**: The length of the shortest contig, in the set of contigs that contain 50% of the total assembly length. The image below will help to explain this. * **N90**: As above but for 90% of the total assembly length. * **L50**: The **number** of contigs that make up 50% of the genome. ::: \ ![](fig1-assembly.png){width=100%} \ **Figure 1. Assembly concepts explained** Reads are the initial DNA sequences produced fom a sequencing machine. These may be short as shown here* (eg: Illumina technology), or long (eg:ONT/PacBIO). Reads are assembled into conigs. Contigs are ordered into scaffolds. \ * Short reads are typically 150 nt long (not as short as shown here). \ ![](fig2-metrics.png){width=100%} \ **Figure 2. N50** The N50 is an assembly metric. In this example, the total assembly length is 100 kb, and the N50 is 8 kb. So **50% of our assembly is in contigs of the N50 value or larger**. To calculate the N50, we order the contigs from the longest to the shortest. Then we add up their length, until the total length makes up at least 50% of our total assembly length. The **L50** is the *number* of contigs that add up to 50% of the assembly length. Both N50 and L50 are simple, useful metrics. # Exercises ## Setting up ### Directory set up 1. Start up the PC in Linux as last time, log in and open the **Terminal window**. 2. You should already have your own directory in \ `/shared/biology/bioldata1/bl-00087h/students/` \ \ If not, make one [as described here](https://djeffares.github.io/BIO87H/workshop2.html#create-your-own-directory). 3. Change to your directory with:\ `cd /shared/biology/bioldata1/bl-00087h/students/[username]` 4. Make a new directory for this workshop with: `mkdir workshop3` 5. Change to this directory with: `cd workshop3` ### File set up 1. Today we will use files from this directory:\ ``/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/`` 2. To place **symbolic links** for all files in this directory within your `workshop3` directory, do this:\ ``ln -sf /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/* .`` 3. Don't forget the dot **`.`** at the end! 4. Check that everything worked by listing the files with ``ls`` 5. You should now have symbolic links to these files (and some others): * Lbraz-subset.fastq.gz * Lbraz-subset.necat.config.txt * Lbraz-subset.readlist.txt `Lbraz-subset.fastq.gz` contains some Oxford Nanopore Technology (ONT) long read data from the protozoan parasite *Leishmania braziliensis*. The other files are required to run NECAT, a genome assembly software. ::: callout-important ## Dots and spaces are important in Linux commands. ::: ## Assessing read data quality It is important to assess the quality of your data before you use it. We will use the `fastp` tool for this. Fastp can also filter fastQ files to remove reads that you prefer not to use. You won't need it today, but the `fastp` manual is [here](https://github.com/OpenGene/fastp). To run fastp: 1. Load the module `module load fastp/0.23.4-GCC-13.2.0` 2. Run fastp ``fastp --in1 Lbraz-subset.fastq.gz --out1 Lbraz-subset.filtered.fastq.gz --html Lbraz-subset.fastp.html --reads_to_process 1000 &> Lbraz-subset.fastp.output.txt &`` 3. Note all the **flags** that we use: * `--in1` specifies the input file * `--out1` specifies the output file name for the filtered reads * `--html` specifies the html format report file name We capture the output using the `&>` symbols. This captures text output that would usually appear on the screen, and saves it to a file called **Lbraz-subset.fastp.output.txt**. Here we use `--reads_to_process 1000` to process only 1,000 reads, so the software runs quickly. If you run fastp for your project, omit this option to process all the reads. To look at the fastp results, view the file `Lbraz-subset.fastp.output.txt` using: `less Lbraz-subset.fastp.output.txt` You will see various statistics about the reads: * The **total bases** tells us how many nucleotides of sequence data are in this file. * Since this is 1000 reads, what is the average read length? * Q20 bases and Q30 bases described how many bases are above [Phred scaled quality scores](https://en.wikipedia.org/wiki/Phred_quality_score) 20 and 30 respectively. **Q20** means one error error per 100 nucleotides. ## Genome assembly Now we will assemble these ONT reads using the NECAT assembly software. NECAT runs in three stages: 1. `necat correct` corrects the raw reads 2. `necat assemble` generates contigs 3. `necat bridge` bridges (joins) contigs to create scaffolds and polishes (corrects base call errors) in the scaffolds. We will run only `correct` and `assemble` today, because with so few reads, `bridge` doesn't improve the assembly. Note that `bridge` **does** improve the assembly if we have more reads. The manual for NECAT is [here](https://github.com/xiaochuanle/NECAT), and the original article is [Chen 2021](https://doi.org/10.1038/s41467-020-20236-7). You don't need to read these today. ### NECAT files NECAT requires two files to run. 1. A **config file**. This contains details about what the software should do. 2. A **read list file**. This contains a list of all the sequence reads to assemble. You should have these files your **working directory**: * Lbraz-subset.necat.config.txt * Lbraz-subset.readlist.txt ### Running NECAT correct 1. Load the NECAT module: `module load NECAT/0.0.1-GCCcore-12.3.0` 2. Run NECAT to correct the raw reads: \ `necat.pl correct Lbraz-subset.necat.config.txt &> Lbraz-subset.correct.log &` This will take 5-10 minutes to complete. 3. Using `ls -F` will show you what files you have, and mark **directories** with a forward slash (``/``) after the directory name. You should now have a directory called `Lbraz-subset.assembly`. All the assembly output file swill go here.\ ___ **Wait for `NECAT correct` to complete before running the next step.** ::: callout-tip # Check that your analysis is running. * Method 1: `jobs -l` * Method 2: `top -u username` \ ::: You might look at the contents of the file `Lbraz-subset.assemble.log`, or read ahead while you wait. Or talk to us about genomics! ___ ### Running NECAT assemble and brige 5. Once NECAT correct is complete, run NECAT assemble:\ `necat.pl assemble Lbraz-subset.necat.config.txt &> Lbraz-subset.assemble.log &` 6. When assemble is complete, run NECAT bridge: \ `necat.pl bridge Lbraz-subset.necat.config.txt &> Lbraz-subset.bridge.log &` Your assembly is **complete**. Well done. You now have some long ONT reads, assembled into [contigs](https://www.genome.gov/genetics-glossary/Contig#:~:text=00%3A00,representation%20of%20a%20genomic%20region.), and then into scaffolds. \ ### Assembly quality assessment There are many ways to assess the quality an assembly. A simple way is to look at the metrics that NECAT itself provides. These are at the end the log files such as `Lbraz-subset.bridge.log`. To view the last 11 lines of this file do: `tail -n 11 Lbraz-subset.bridge.log` You will see information something like this (but not exactly), telling us the number of contigs (counts), their total, maximum and minimum length, and the **N** and **L** values for 25%, 50% and 75% of the assembly lengths. Count: 2\ Total: 133018\ Max: 124310\ Min: 8708\ N25: 124310\ L25: 1\ N50: 124310\ L50: 1\ N75: 124310\ L75: 1\ You can also check how many contigs your assembly has by looking at the final output file in your directory `Lbraz-subset.assembly/6-bridge_contigs/polished_contigs.fasta`. Use either `less` or `grep -c '>'` to have quick look. ### This is a terrible assemby. :::callout-warning # Think about the result This assembly has just **one** contig! That can't be right, because other *Leishmania* species have 35 or 36 chromosomes. It is bad because we used very few reads. To save time in the assembly process, we provided only 10% of the reads we had available for this species. See [Appendix 1](https://djeffares.github.io/BIO87H/workshop3.html#appendix-1.-making-an-assembly-with-all-the-ont-reads) if you want to make an assembly of all the reads for your report. ::: ## Genome annotation Once we have an assembled genome, the next step is to **annotate** the genome, to locate the genes and other features within the genome. Protein-coding and non-coding (eg: tRNA) gene-finding is a complex process, which utilities a variety of information, including: * gene start and stop codons * transcript data from processed mRNAs (cDNAs) * homology between predicted proteins in the new genome and related genomes * signals of non-coding RNA secondary structures ### Annotation with AUGUSTUS There are many tools to identify protein coding and non-coding genes. Today, we will use AUGUSTUS, which is simple and runs rapidly. We will not get much from annotating the *L. braziliensis* assembly because it is such a terrible assembly. Instead, let's annotate another *Leishmania* species assembly, *Leishmania lainsoni*. A link the the assembly file `Llainsoni_NMT.fasta` would already be in your workshop3 directory. Check this with `ls *fasta`. #### Running AUGUSTUS To run AUGUSTUS, load the module: `module load AUGUSTUS/3.5.0-foss-2023b` Then run: `augustus --sample=10 --species=leishmania_tarentolae Llainsoni_NMT.fasta > Llainsoni_NMT.fasta.gff &` This will take about **2-5 minutes** to run. Note that we use `--species=leishmania_tarentolae`. This is because a pre-constructed database for this species is available, but there is not one for *Leishmania lainsoni*. ::: callout-tip # You can safely **ignore** this warning *Warning: Number of sample iterations 'sample'=10 is low. Posterior probabilities will be only rough estimates.* ::: Once the previous command is done, the output file is in the gff (general feature format) file ` Llainsoni_NMT.fasta.gff`. Have a quick look at it using `less`. You can extract the just the **protein** sequences from the gff output file, with: `getAnnoFasta.pl Llainsoni_NMT.fasta.gff` This will produce the file `Llainsoni_NMT.fasta.aa`. Use `less` to have a quick look at this file. To count how many proteins have been predicted in this genome do \ `grep -c '>' Llainsoni_NMT.fasta.aa`. ::: callout-note Detailed instructors on running AUGUSTUS are [here](hhttps://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md#basic-usage) - but you won't need these for today. ::: ## BLAST BLAST (the Basic Local Alignment Search Tool) is one of the most used tools in bioinformatics. BLAST finds regions of similarity between DNA or protein sequences. Each search we perform compares a nucleotide or protein sequence *query* to a sequence *database* (which can be DNa or protein). BLAST calculates how matches we should expect, and reports this as the *E-value* (the 'expect value'). ::: callout-note ## Uses for BLAST 1. **Gene function prediction.** Once we have found new proteins in our genome, we want to know what their biochemical functions are. By identifying similar proteins, whose functions we know from experimental data, we can infer that our 'query' protein has similar function to the 'hit' protein. 2. **Identifying homologous genes for evolutionary analysis.** Evolutionary biologists use phylogenetic trees to understand many processes. BLAST helps to locate proteins to build these trees. ::: To run a BLAST search you need: * a **query**: a protein or nucleotide sequence that you want to find something similar to * **a database** of protein or nucleotide sequences. This is where we look for the similar proteins/nucleotide sequences. This is just a list of protein/nucleotide sequences re-formatted for BLAST. Proteins in the database that match the query are usually called **hits** ## Diamond: faster than BLAST BLAST is good (and very flexible), but today we will use [diamond](https://github.com/bbuchfink/diamond), which is more than **100 times faster** than BLAST for protein-protein searches. We will use diamond to find which proteins in *Leishmania lainsoni* are very similar to proteins in budding yeast *Saccharomyces cerevisiae*. :::{.callout-note collapse="true"} # Optional information: The source of the yeast data `http://sgd-archive.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans_all.fasta.gz` ::: ### Running diamond You will have links to the files `S.cerevisae.S288C.proteins.fasta` and a pre-built diamond database of these yeast proteins in your workshop3 directory (`S.cerevisae.S288C.proteins.dmnd`). We will now search every *Leishmania lainsoni* protein against every *Saccharomyces cerevisiae* protein to find potential **orthologs** (proteins that are similar by evolutionary descent). Here is how: 1. Check you have the yeast data with `ls S*`. 2. Load the diamond module: `module load DIAMOND/2.1.9-GCC-13.2.0` 3. Run the search. Here we use ``--id 70`` to specify that we want to see only hits that are ≥ 70% identical. If you omit this flag, diamond locates all protein matches. `diamond blastp --query Llainsoni_NMT.fasta.aa --db S.cerevisae.S288C.proteins.dmnd --out Llainsoni-v-yeast.ident70.tsv --id 70` The output file will be: `Llainsoni-v-yeast.ident70.tsv` ### Evaluating diamond results First, look at your output file with: `less Llainsoni-v-yeast.ident70.tsv` And count how many **hits** we got with: `wc -l Llainsoni-v-yeast.ident70.tsv` :::callout-warning # Think about the result. Use `grep -c '>' *.fasta` to find out: * How many *L. lainsoni* proteins we used as queries? * How many yeast proteins we searched against? **Why did we get so few hits?** ::: The information in the output file are below: | column | field | meaning | |------|------|------| | 1 | qseqid | query or source (gene) sequence id | | 2 | sseqid | subject or target (reference genome) sequence id | | 3 | pident | percentage of identical positions (in the alignment) | | 4 | length | alignment length (sequence overlap) | | 5 | mismatch | number of mismatches (in alignment) | | 6 | gapopen | number of gap openings (in alignment) | | 7 | qstart | start of alignment in query | | 8 | qend | end of alignment in query | | 9 | sstart | start of alignment in subject | | 10 | send | end of alignment in subject | | 11 | evalue | the 'expect value' (see below) | | 12 | bitscore | the 'bit score' (see below) | **Table 1. Diamond standard output format. ** The **bit score** (bitscore) gives an indication of how good the alignment is; the higher the score, the better the alignment. The **expect value* (E-value) is a probability for how many alignments of this quality you would expect in a database of this size. The lower the E-value, the better the hit. E-values are similar to, but not the same as, p-values. Because outputs are simple tab-delimited files, it is easy to read these into R and analyse them. There are many things you can do with BLAST or diamond searches. See the **After the workshop** section for one example. # End That is all for today. Today we learned: * About genome assemblies * How to run NECAT, a genome *assembly* tool * How to run AGUSTUS a genome *annotation* tool * About using BLAST and diamond to search for proteins in one species (*Leishmania*) that are similar to another species (*Saccharomyces*), or a larger data set (such as all proteins ever discovered). **The power of such BLAST/diamond searches** is that may proteins in *Saccharomyces* have been experimentally characterised. If we find a *Leishmania* protein that has a similar protein sequence to an experimentally characterised *Saccharomyces* protein, we have an excellent prediction for the *Leishmania* protein's function. \ \ \ # After the workshop ## Evaluating diamond results ::: callout-note Can you guess what kinds of proteins have not changed much since the > hundred million years since *Leishmania* and yeast diverged? ::: Let's find out what kinds of proteins have not changed much since *Leishmania* and yeast diverged. First extract the yeast gene names from your diamond output with: `cat Llainsoni-v-yeast.ident70.tsv | cut -f 2` Then go to the [yeast genome database gene list query page](https://www.alliancegenome.org/bluegenes/alliancemine/upload/input). Copy and paste the yeast genes from the diamond output into the box, and click **CONTINUE**. That's it!. # Planning for your report The assessment for this module is a 2000 word written report. The bioinformatic data analysis section involves replicating and extending one of the data analysis workshops we teach during the module. The workshops are: * Workshop 3: Genome assembly, gene finding and BLAST * Workshop 4: RNAseq analysis * Workshop 5: Single cell RNAseq * Workshop 6: Population genomics The guide to writing your report is [here](https://docs.google.com/document/d/19fI94usu-mPuBSbQf5j22jP1yhvExZE8EXied95IaQ0/edit?usp=sharing). # Appendix: Assembly, annotation and/or BLAST for your project ## Appendix 1. Making an assembly with all the ONT reads To save time for this workshop, we only provided **10% of the reads** we had for *L. braziliensis*. If you want to run genome assembly and annotation for your report, the full set of reads are available here: ``/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome`` This directory contain these files, that you should copy into (or make symbolic links) to your ``report`` directory. Lbraz-allreads.necat.config.txt \ Lbraz-allreads.readlist.txt \ Lbraz-minion.fastq.gz \ Lbraz-promethion.fastq.gz \ ### How to edit config files, filter reads and run an assembly We now provide an example of the commands to run fastp, filter reads, calculate *read coverage*, run and assembly, and annotate that assembly here: [https://djeffares.github.io/make-gald-assembly-2024-12-07.txt](https://djeffares.github.io/make-gald-assembly-2024-12-07.txt) We also provide a filtered We also provide a much smaller set of reads that have been filtered to keep only reads that are at least 10,000 nt long, and have a mean base quality score ≥ 20. These are ere: `/shared/biology/bioldata1/bl-00087h/data/G.sulphuraria/genome/ont-reads/gald.filt.len10k.avQ20.fastq.gz` :::callout-note # Read coverage 'Read coverage' is how many times we have sequenced the genome. For example, if each region of the genome is sequenced by **10 reads** we have ten times (10x) coverage. We usually need ≥ 40 x coverage for a reasonable assembly, and increasing this to ~100 x can improve it. After 100x, we often don't see much gain. ::: ## Appendix 2. Assessing assembly quality with QUAST ::: callout-warning # Problems running software? Here's what the experts do: - Check the module is loaded with `module list` - Use `-h` (or `--help`). For example, typing `quast -h` will show you the help page for QUAST. - Read the manual: Googling `quast manual` will usually find the manual. - Most command line software we use on Linux has a `--help` or `-h` flag. ::: For project work, we advise you to use [QUAST](https://github.com/ablab/quast) (Quality Assessment Tool) to assess the quality of your assembly. Here's how to do it: 1. Load the module: `module load QUAST/5.2.0-foss-2022a` (or use `module spider quast` if this doesn't work) 2. Run QUAST: `quast --output-dir quast-results <assemby_contigs.fasta>` Here `<assemby_contigs.fasta>` is the name of your assembly file. For example, if you are using the *L. braziliensis* assembly, this would be `quast --output-dir quast-results Lbraz-subset.assembly/6-bridge_contigs/polished_contigs.fasta`. 3. The directory `quast-results` contains the results of QUAST. We find that the `quast-results/report.tsv` file is the simplest to read. ## Appendix 3. Annotating your assembly with AUGUSTUS Load the AUGUSTUS module and run on your final assembly as above. ### How accurate are the predictions? AUGUSTUS reports the posterior probabilities of its predictions for genes. This is the likelihood that the prediction is correct, given the data. The posterior probabilities are estimated using a sampling algorithm. The parameter --sample==n adjusts the number of sampling iterations.You may choose to improve the accuracy of the likelihood estimates by increasing the sampling, like so: `augustus --sample=100 --species=leishmania_tarentolae my.fasta > my.fasta.sample100.gff` More information about AUGUSTUS is [here](https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md#posterior-probabilities). You can extract the probabilities of the CDS (coding sequences) from the [gff format](https://en.wikipedia.org/wiki/General_feature_format) output file like so: `cat my.gff | grep -v '#' | grep CDS > my.CDS.gff` And the GFF file could be loaded into RStudio for analysis. The posterior probabilities are in the 6th column of this file. Plotting and comparing posterior probabilities between assemblies or different samplings could make for an interesting project. ## Appendix 4. BLAST/Diamond work When you have predicted protein sequences from your assembly, running [diamond](https://github.com/bbuchfink/diamond) will give you some information about how the protein sequences from the *L. braziliensis* genome compare to another species. You can obtain FASTA format protein data sets for 66 different [Trypanosomatid species](https://en.wikipedia.org/wiki/Trypanosomatida) here: `/shared/biology/bioldata1/bl-00087h/data/L.braziliensis/leishmania-protein-data/trypanomatid_protein_sets` Identifying the most and least conserved proteins between *L. braziliensis* and **one** selected other species would be sufficient for the BLAST-searching component of a report.