BIO00087H Workshop 2: Linux IV

Linux skills development

Author

Daniel Jeffares

Published

July 17, 2024

DataCamp Classroom

During Semester 1 (16 September 2024 – 31 January 2025) you can join the BIO00087H Genomics 2024 ‘DataCamp Classroom’ using this link. Sign in with your University of York Google account. This will provide free access to the entire Introduction to Shell course.

1 Learning objectives

2 Introduction

Philosophy

It’s OK to make mistakes. It’s OK to need help.

there are no stupid questions (ask us anything!)
don’t worry about making mistakes (we all make mistakes)

2.1 How to get the most from this workshop

We suggest that

You sit with your poster group, so you get to know them
You make a plain text ‘lab book’ file in which you record your work
Ask questions if you are at all confused!

It is essential to document bioinformatic analysis

We always keep good records in our lab books when we are in the lab, or out in the field.
In the same way, it is essential to document bioinformatic analysis.
Start this now, by creating a plain text file using Notepad, Notepad++ or some other plain text editor. Record all your work in this document, like a lab book.

2.1.1 Instructions and code

In these workshops, instructions are in this font and code is in this font.

This workshop must be run on university PCs, not your own laptop. This is because you will need a Linux machine that has the correct software installed. Because there is such a diversity of laptops, it is not feasible to support software installation for all of you!

3 Exercises

3.1 Starting PCs in Linux

To start the PC in Linux, make sure you are not logged in.
Then re-start the PC and wait.
When the option to boot into Linux comes up, use the arrow keys to select the Linux option (rather than Microsoft Windows).
Log in with your usual University of York username and password.
Start up the Terminal window, the text editor and the Firefox web browser.

In these workshops, we will work almost exclusively in the Linux command line using the Terminal window.

Figure 1. Linux screen. Clicking on the ‘nine dots’ symbol at the lower left of the screen will show the apps that are available. The Text Editor, Terminal Window and Firefox (web browser) are all you will need. The On/Off button is at the very top right of the screen. Don’t forget to log off at the end of workshops.

Figure 2. Adding the Text Editor to favorites. We suggest that you add the Text Editor app to your favorites, so it appears on the left hand apps panel.

3.2 First commands to orient yourself

To get started with Linux, open the Terminal window.

To copy and paste text in the terminal window:

Use the mouse to select text
Use Right click to copy or paste.

Try typing these commands in the terminal window:

Find out which user you are with whoami
Check who else is logged in with who
Find out what directory you are in with pwd (print working directory)
List files in this directory with ls
List files in this directory in long format, including hidden files, sorted by time (oldest first), with ls -latr

Be patient with yourself.

It will take you a little while to get used to working on the command line in Linux. Be patient with yourself. This entire workshop is devoted to getting used to Linux.

Every Linux command has a manual

The manual explains the options for the command. To see the manual, type man [command], for example man ls . Manuals show useful ‘flags’ (options) for commands. For example ls -ltr does a long list (l), sorted by time (t) in reverse (r).
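As a minimal sketch of how flags combine, the lines below build a throwaway directory and list it in both time orders. The names scratch, older.txt and newer.txt are invented for this illustration; they are not part of the module data.

```shell
# Throwaway directory, so nothing in your real work is touched
scratch=$(mktemp -d)

# Create two empty files with different timestamps
touch -t 202001010000 "$scratch/older.txt"   # dated 1 January 2020
touch "$scratch/newer.txt"                   # dated now

# -t sorts by modification time, newest first
newest_first=$(ls -t "$scratch")

# adding -r reverses the order, so the oldest file comes first
oldest_first=$(ls -tr "$scratch")

echo "ls -t : $(echo $newest_first)"
echo "ls -tr: $(echo $oldest_first)"

# -l adds the long format (permissions, owner, size, date)
ls -ltr "$scratch"

rm -r "$scratch"
```

With -ltr the most recently changed file is always the last line, which is handy in a directory with many files.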

3.3 Create your own directory

The module directory, where all the data and your work will be stored is: /shared/biology/bioldata1/bl-00087h/

You should do your work within the students directory: /shared/biology/bioldata1/bl-00087h/students/.

Within the students directory, create a directory of your own, using your username. You will be the only person who can create or delete files in this directory.

Here’s how:

Change directory (cd) to the students directory: cd /shared/biology/bioldata1/bl-00087h/students
Make your directory: mkdir [username]
Change the permissions of your directory, so that only you can delete files from the directory: chmod -R g-w [username]
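If you would like to see what chmod g-w actually changes, here is a small sketch that can be run safely in a throwaway location. The directory name demo_user is a stand-in for your own username, and mktemp -d is used only so that the real students directory is not touched.

```shell
# Practise in a throwaway location rather than the real students directory
practice=$(mktemp -d)
cd "$practice"

mkdir demo_user            # stands in for: mkdir [username]
chmod g+w demo_user        # give the group write permission, so the change is visible

# The first 10 characters of ls -ld are the permission string, e.g. drwxrwxr-x
before=$(ls -ld demo_user | cut -c1-10)

chmod -R g-w demo_user     # remove write permission from the group
after=$(ls -ld demo_user | cut -c1-10)

echo "before: $before"
echo "after:  $after"

cd /
rm -r "$practice"
```

The sixth character of the permission string is the group write flag: w before the change, - after it.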

Being responsible with the drive space

Please don’t create files in any other directory. This will help us to keep track of how much disc space everyone is using. Misplaced files may be deleted. Do not copy files from the data directory to your own. Instead use symbolic links (see below).

3.4 Important Linux command line elements

3.4.1 Basics

You should be familiar with the commands below from the DataCamp Introduction to Shell course. If not, here is a cheat sheet to remind you.

ls (list files)
cd (change working directory)
mv (move a file, files, or directories)
cp (copy a file, files, or directories)
rm (remove a file or files; add -r to remove directories)
pwd (print working directory)

3.4.2 Creating symbolic links (soft links)

It’s not a good idea to copy large files around, because it takes up server space. There is no need to, because we can use symbolic links (sometimes called soft links). A soft link is not a standard file, but a special file that points to an existing file. To create a soft link, use the ln -s command and the following syntax:

ln -s [file path you want to point to] [link file name]

Soft links are very useful, because we can use them as input for any tool, just like we could a normal file. If we delete the link, the original file remains!
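The claim that deleting a link leaves the original untouched can be checked with a toy example. This is only a sketch: original.fasta and my.link.fasta are invented names, created in a temporary directory.

```shell
# Work in a throwaway directory
workdir=$(mktemp -d)
cd "$workdir"

# A tiny stand-in for a real sequence file
echo ">toy_sequence" > original.fasta
echo "ACGT" >> original.fasta

# ln -s [file path you want to point to] [link file name]
ln -s "$workdir/original.fasta" my.link.fasta

# The link can be read just like a normal file
link_contents=$(cat my.link.fasta)
echo "$link_contents"

# ls -l shows where the link points
ls -l my.link.fasta

# Deleting the link removes only the link, not the file it points to
rm my.link.fasta
if [ -f original.fasta ]; then original_survives=yes; else original_survives=no; fi
echo "original still present: $original_survives"

cd /
rm -r "$workdir"
```

Note that the reverse is not true: if the original file is deleted, the link remains but points to nothing.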

Try this out. Create a link to this genome assembly file: Lbraz.M2904.fasta, which is located in this directory (the path to the file): /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome .

First, cd to your directory:

cd /shared/biology/bioldata1/bl-00087h/students/[username]

Then make the soft link:

ln -s /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/Lbraz.M2904.fasta \
soft.link.fasta

To check that it has worked do: ls -lF . You should see:

soft.link.fasta -> /shared/biology/(and so on)

We can name the link file anything we like. To name the link bananas, do:

ln -s /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/Lbraz.M2904.fasta bananas

If we omit the name and use a dot (.) instead, the link will have the same name as the original file. This is simpler and often very helpful:

ln -s /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/Lbraz.M2904.fasta .

Don’t forget the dot (.) at the end of this command!

We can also make soft links for many files at once using wildcard symbols like *. Do this now, so that you have a collection of links in your directory.

ln -s /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/* .

3.4.3 Wildcards

Sometimes you want to run a Linux command on many files at once. We use two symbols to specify a collection of files: * means any set of characters (including no characters!), and ? means any single character. For example, to list all the files that contain the text fast in their file name, we do:

ls *fast*

To list any files that start with L, or end with .fast we do:

ls L*

ls *.fast

To delete all the files in the current directory that have the pattern alpha.?.beta (where ? means any one character), do:

rm *alpha.?.beta*

Don’t do this! rm *.*

It will remove every file in the current directory that has a dot in its name, which is probably all of your files!
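One cautious habit is to preview what a wildcard pattern matches before handing it to rm. The sketch below does this with echo in a throwaway directory; the file names are invented for illustration.

```shell
# Throwaway directory with a few empty files
sandbox=$(mktemp -d)
cd "$sandbox"
touch alpha.1.beta alpha.2.beta alpha.10.beta notes.txt

# echo prints what the shell expands the pattern to, without deleting anything.
# ? matches exactly one character, so alpha.10.beta is NOT matched.
would_delete=$(echo alpha.?.beta)
echo "would delete: $would_delete"

# Happy with the preview? Then run the same pattern with rm.
rm alpha.?.beta

remaining=$(ls)
echo "left over:"
echo "$remaining"

cd /
rm -r "$sandbox"
```

Replacing echo with ls gives the same preview, one file per line.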

3.4.4 Using `\` , `&` and `&>`

The \ symbol allows you to compose and type long (multi line) commands, without the shell running the command. Effectively \ means, I’m not done yet, don’t run this code. For example, the code below will not run until we get to soft.link.fasta:

ln -s /shared/biology/bioldata1/bl-00087h/data/L.braziliensis/genome/Lbraz.M2904.fasta \
soft.link.fasta
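The same behaviour can be seen with a harmless command. In this small sketch (the variable name message is just for illustration), the three lines are joined into a single echo command.

```shell
# Each \ tells the shell that the command continues on the next line.
# The \ must be the very last character on its line (no spaces after it).
message=$(echo one \
  two \
  three)

echo "$message"
```

If a space follows the backslash, the shell treats the line as finished and runs it, which is a common cause of confusing errors with long commands.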

3.4.5 Loading modules

The software installed on these PCs and servers at York is made available through the module system Lmod. This allows multiple versions of the same software to be installed without conflicting or interfering with each other. To load a module we run a command such as:

module load NECAT/0.0.1-GCCcore-12.3.0

Do this now, so you have the NECAT software available for the next step.

3.4.6 The ampersand symbol: `&`

Adding & at the end of a command allows you to run a command ‘in the background’. This means that you can set a process running but still type other commands into the shell, without waiting for it to complete. For example, some commands will take half an hour to complete:

#correct some reads with NECAT
necat.pl correct Lbraz-subset.necat.config.txt &> Lbraz.assembly.log &

Since we put & at the end of this last command, we can do other things while we wait.
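To try background jobs without waiting half an hour, the sketch below uses sleep 2 as a stand-in for a long-running program such as NECAT. The variable names bg_pid and bg_status are invented for this example.

```shell
# sleep 2 stands in for a long-running program
sleep 2 &
bg_pid=$!          # $! holds the process ID of the most recent background job

echo "started job $bg_pid; the shell is free for other commands"
jobs               # list the background jobs started from this shell

wait "$bg_pid"     # pause here until the background job has finished
bg_status=$?       # exit status of the job: 0 means it succeeded
echo "job finished with exit status $bg_status"
```

In everyday use you would rarely need wait; jobs alone is enough to check whether something is still running.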

3.4.7 A special kind of pipe: `&>`

There are two kinds of outputs in Linux:

standard error (STDERR, warnings and other info)
standard output (STDOUT)

While > redirects only STDOUT, &> redirects both STDERR and STDOUT. This can be handy if > alone is not capturing everything you want. The command below captures all the information that would usually be ‘printed’ onto the screen, and stores it in the file Lbraz.assembly.log.

necat.pl correct Lbraz_config.txt &> Lbraz.assembly.log &
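The difference between the two streams is easier to see with a toy command. In this sketch, noisy is a made-up shell function (not a real tool) that writes one line to each stream. The code uses > file 2>&1, which is the longer, portable spelling of what &> file does in bash.

```shell
logdir=$(mktemp -d)
cd "$logdir"

# A made-up command that writes to both output streams
noisy() {
  echo "result line"          # goes to STDOUT
  echo "warning line" >&2     # goes to STDERR
}

# > captures STDOUT only; here STDERR is thrown away to keep the screen clean
noisy > stdout_only.log 2> /dev/null

# Capture both streams in one file (in bash, the same as: noisy &> both.log)
noisy > both.log 2>&1

stdout_lines=$(wc -l < stdout_only.log)
both_lines=$(wc -l < both.log)
echo "lines captured with > only:        $stdout_lines"
echo "lines captured with both streams:  $both_lines"

cd /
rm -r "$logdir"
```

Many bioinformatics tools write their progress messages to STDERR, which is why a log made with > alone can turn out to be empty.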

3.4.8 grep, the grabbing tool

grep is a command used to search files for the occurrence of a string of characters that matches a specified pattern. It’s easy to use and very powerful. grep extracts lines of a file that match a pattern. Or lines that don’t match a pattern. Or counts the lines that match a pattern. The syntax is:

grep [OPTION...] PATTERNS [FILE...]

Imagine we have a file that contains the symbol >, such as a FASTA format sequence file (eg: Lbraz.M2904.fasta). We want to see only the lines in that file that contain this symbol.

First take a look at the file Lbraz.M2904.fasta using:

less Lbraz.M2904.fasta

You can use the arrow keys to scroll up and down through this file. Use the key q to quit from less.

To extract out all lines with the > character, we do:

grep '>' Lbraz.M2904.fasta

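The quotes (') are important here because the > symbol has a special meaning to the shell (output redirection); without the quotes, the shell would interpret > itself instead of passing it to grep as the pattern. For most other patterns, quotes are not needed.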

If we want to count all the lines that have this symbol we use the -c flag of grep.

grep -c '>' Lbraz.M2904.fasta

To see all the lines that do not contain the > character, we can use the -v flag:

grep -v '>' Lbraz.M2904.fasta
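To check that you understand what these three grep commands return, you can run them on a file small enough to count by eye. This is a sketch only; toy.fasta and its contents are invented and have nothing to do with the real assembly.

```shell
fasta_dir=$(mktemp -d)
cd "$fasta_dir"

# A tiny FASTA file: two sequences, three lines of sequence in total
cat > toy.fasta <<'EOF'
>chr1
ACGTACGT
GGCC
>chr2
TTAA
EOF

# Show the header lines
grep '>' toy.fasta

# Count the header lines, i.e. the number of sequences
n_headers=$(grep -c '>' toy.fasta)

# Count the lines that are NOT headers (-v inverts the match, -c counts)
n_sequence_lines=$(grep -vc '>' toy.fasta)

echo "sequences: $n_headers"
echo "sequence lines: $n_sequence_lines"

cd /
rm -r "$fasta_dir"
```

On a real genome assembly, grep -c '>' is a quick way to find out how many contigs or chromosomes the file contains.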

3.4.9 Pipes and output redirection

Sometimes you want to feed the output from one command into another, like feeding the output of ls directly into grep. The | symbol (pipe) does this. It’s like connecting a pipe from one machine, so that what comes out of it is fed directly into another machine! You can also chain pipes together, eg:

ls -R data | grep fastq | wc -l

This command lists all the files in the data directory (ls -R data), then greps out those that are fastq files, then uses wordcount (wc) to count the number of lines (which equals the number of files).

You can also redirect outputs to files, so you can keep the information. We use the > symbol for this. So to capture the list of fastq files in the data directory, we can do:

ls -R /shared/biology/bioldata1/bl-00087h/data | grep fastq > the.fastq.files.txt

We’ll now have a file called the.fastq.files.txt
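Here is the same pipeline on a toy directory tree, so that the result is easy to verify. All directory and file names below (run1, run2, a.fastq and so on) are invented for this sketch.

```shell
data_dir=$(mktemp -d)
cd "$data_dir"

# A small directory tree: three fastq files and one other file
mkdir run1 run2
touch run1/a.fastq run1/b.fastq run2/c.fastq run2/readme.txt

# List everything recursively, keep lines containing fastq, count them
n_fastq=$(ls -R . | grep fastq | wc -l)
echo "number of fastq files: $n_fastq"

# Same pipeline, but keep the list in a file instead of counting
ls -R . | grep fastq > sequence_files.txt
n_listed=$(wc -l < sequence_files.txt)
cat sequence_files.txt

cd /
rm -r "$data_dir"
```

The output file is deliberately given a name that does not contain fastq: a file created inside the directory being listed could otherwise match the pattern and end up in its own list.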

3.4.10 history, grep and running commands again

To see the commands you have run before do: history .

This is very useful if you can’t remember what you last did!

If you want to run a command that you have run before, press the up arrow in Linux. Keep pressing it for previous commands. You can also edit a command, by moving through it with the left and right arrows, and adding or removing text. Press [enter] when you are done, to run the command.

If you know you have run a certain type of command, but can’t find it, you can grep it from your history like this. Here we use grep to search for gzip commands:

history | grep gzip

3.5 Tidy up

With so many files, it is important to keep your directory tidy. Tidy up by making a directory called workshop3, and moving all the files in there. We’ll use these files next time. (mv will complain that it cannot move workshop3 into a subdirectory of itself; this is harmless.)

mkdir workshop3
mv * workshop3

4 Reflection

You now know how to navigate around directories in Linux. You should have some understanding of these concepts:

directories in Linux
working directory
symbolic links
pipes
wild cards

5 The end

6 After the workshop

6.1 Consolidation exercises

It will take a while to get used to Linux. This cheat sheet should help. You may wish to print it out and bring it with you next time.

6.2 Planning for your report

The only summative assessment for BIO00087H is a 2000 word report. Your report should have two parts:

An introduction (up to 1000 words*). This should be a literature review, covering material that is relevant to the data analysis you chose to do. Don’t attempt a general genomics introduction!
A bioinformatic data analysis section (up to 1000 words*). During the module, we will show you how to carry out these analyses on various test data sets:

Genome assembly annotation and BLAST (Workshop 3)
Bulk RNAseq analysis (Workshop 4)
Single cell RNAseq analysis (Workshop 5)
Population genomics (Workshop 7)

For your report, you should choose one of these analysis methods, and repeat it on a larger data set that we provide. All data sets are in /shared/biology/bioldata1/bl-00087h/data.

*These word limits are maximums, not targets. You can achieve a very high grade with a substantially shorter report.

Detailed information is available in the BIO00087H Genomics report guide.
