BIO00066I Workshop 2

Cell Biology Data Analysis Workshop 2

Author

Daniel Jeffares

Published

January 25, 2024

Note

Workshop 1 for this module is here

Important

The data you will work on in these workshops is the complete data set for the report. This is not example data.

1 Learning objectives

In this workshop, you will learn technical skills and data science concepts.

Technical skills

RStudio skills to manage, view and analyse large data sets
How to create plots in RStudio to assist with exploring and interpreting data
How to make your data analysis reproducible and well-explained

Thinking like a data scientist

How to interpret data with care
How to integrate knowledge from laboratory work and computational analysis

The Module learning outcomes can be seen here.

2 Introduction

Philosophy

Workshops are not a test. It’s OK to make mistakes. It’s OK to need help. You should be familiar with independent study content before the workshop, but you don’t need to understand every detail. The workshop will hlpe you to build and consolidate your understanding. Tips:

there are no stupid questions (ask us anything!)
don’t worry about making mistakes (we all make mistakes)
discuss code with your neighbours and with the staff
outside of workshop times, google is a good resource

2.1 The biology

Today we will look at some data from mesenchymal stem cells (MSCs, sometimes called stromal cells). MCSs are are multipotent cells, than can differentiate into many kinds of mesenchymal cells, including bone, cartilage and fat cells. They are not pluripotent, in that they do have limits about how they differentiate.

MSCs are also:

immuno-modulatory (they make cytokines)
heterogeneous (they have subpopulations that are different from each other)

The MSCs we are looking at are from bone marrow femoral head tissues (from the hip bone). These MSCs were immortalised (by transforming them with a telomerase gene), and then cloned by limiting dilution. We will compare the data on cell shape and size from just two clones, both obtained from one person’s femoral head tissues. Each clone was derived from a single cell from the person. We call these two clones clone A and clone B. Today, we will use our data analysis skills to explore how these cell line differ. Remember, these two cell lines were both obtained from one person. You probably have these kinds of MSCs too.

2.2 Research questions

What parameters do we have in the data obtained from the Livecyte machine/method?
Which parameters differ between the clones?
Which parameters are correlated?

2.3 The data

The data we have today are derived from live cell imagining of mesenchymal stem cells with a Livecyte microscope and the software it has. Each clones was cultured and the microscope captured cell shape and size information for thousands of cells from each clone (clone A and clone B). For each clone, we measured three biological replicates.

The Livecyte tracked each cell every 23 minutes, giving each data point a tracking ID (which tracks a unique cell ‘object’) and a lineage ID, which recalls the lineage of cells as they divide. For example, if cell 1, divided into cell 1A and cell 1B, they are all recorded by the Livecyte as being part of lineage ID 1.

This video shows what Livecyte can do (perhaps turn the sound down, or watch later).

Biological replicates and technical replicates

A biological replicate repeats an experiment from different cell types, tissue types, or organisms to see if similar results can be observed.

Technical replicates are repeat measurements of the same biological material, which help to show how much variation comes from the equipment or different methods (rather from the biology).

Our replicates are almost biological replicates, because we grew the cells three times.

Thinking like a data scientist

Data science can be challenging when you first start. But data science does have core concepts, like any other science.

Consider the motivation or scientific question before drawing a plot
Reflect on how the plot or analysis addresses the scientific question
Use each plots to adapt, to inspire new research questions and new inquiries
Ensure that your scripts are reproducible and clearly commented
Consider what the results tell you about the biology

3 Exercises

Develop your R script, carefully.

As we run though the exercises, copy the code to your script. You can cop the code by clicking on the ‘clipboard’ symbol at the top tight of the grey code chunks.

#Comments In R scripts, lines that start with the hash symbol (#) are comments. Add comments so you can remember what this code does. Commenting code makes it readable, for you and anyone else. Good data science includes clearly commented code.

3.1 Getting started

Start RStudio from the Start menu
Open your RStudio project using the dropdown menu at the very top right of the RStudio window.
Make a new script then save it with a sensible name that will help you to know later what is in this file. BIO00066I-workshop2.R would work.
Add a comment to the script so you know what it is about, for example

#Data Analysis 2: Cell Biology
#date 2024-30-01

Clear all the previous data, and load tidyverse package by adding these lines to your script:

#clear previous data
rm(list=ls())

#load the tidyverse
library(tidyverse)

Finally load another library that allows us to make multi-part plots (this can be useful sometimes). #we need this to make pretty plots with the ‘ggarrange’ package

#load the ggpubr package for multi-part plots
library(ggpubr)

Important

The data you will work on in these workshops is the complete data set for the report. This is not example data.

Make sure all these lines of code are in your script, with comments.
Save the script.

3.2 Loading the data

In workshop 1, we showed you how to import data from files. Today, we load a tab-separated value (TSV) file from a website:

cells <-read_tsv(url("https://djeffares.github.io/BIO66I/all-cell-data-FFT.filtered.2024-02-22.tsv"),
    col_types = cols(
        clone = col_factor(),
        replicate = col_factor(),
        tracking.id=col_factor(),
        lineage.id=col_factor()
    )
)

About read_tsv

We use the read_tsv function to read the tab-separated value file. Clicking on the link in the read_tsv takes you to a website about this function. All the code chunks in these workshops have links like this. The clone = col_factor(),replicate = col_factor() etc let’s R know that we want these columns to be factors, rather than numeric values. Factors are used to represent categorical data.

3.3 Exploring the data

It is important to know what data you have. How many rows and columns etc. There are many ways to do this in R. Here are some of our favourites. Copy and past these into your R script, and try them out.

#look at the data, like an excel table:
view(cells)

#what are the names of the columns?
names(cells)

#how many rows and columns you we have?
nrow(cells)
ncol(cells)
dim(cells)

#other ways to peek at data:
summary(cells)
glimpse(cells)

You may have noticed that we have two clones (cloneA and cloneB). For each clone, we have three replicates. For each replicate, we have many readings of cell widths, cell volume, cell sphericity, and so on.

Which method of do you like the best?

Why not take a note of this, and use it all the time?

3.4 Cleaning up

When you ran names(cells) you will have seen many metrics from the Livecyte software. We don’t need all these, and the movement metrics them are unreliable, so let’s remove them. It is simple to remove columns of data, using the select command, like so:

#see what we have
names(cells)

#remove all the movement metrics (which are not reliable)
cells <- select(cells, 
                -position.x,
                -position.y, 
                -pixel.position.x, 
                -pixel.position.y,
                -displacement,
                -instantaneous.velocity,
                -instantaneous.velocity.x,
                -instantaneous.velocity.y,
                -track.length
)

#check that our data is simpler
names(cells)

In the select function, using the minus sign - before a column name removes it. We can also also use select to define which columns to keep, by omitting the -. For example cells <- select(cells, clone, replicate, volume) will only keep the columns we listed.

3.5 Save your work regularly

Now save your data and save your script. This command will save all the variables you have loaded, or created so far:

#save all my stuff
save.image("BIO00066I-workshop2.Rda")

You can load all the data again with:

#load all my stuff from last time
load("BIO00066I-workshop2.Rda")

3.6 Summarise with dplyr

The view(cells) command shows that we have many cell shape metrics. We saw a summary of our cell shape metrics with the summary(cells) command above.

We can do even better, with tools from the dplyr package (part of the tidyverse). These allow us to make summaries of this data quickly. These methods will work for any data frames of any kind of data.

dplyr is wonderful

dplyr is like a set of pliers, helping us to ‘bend’ or ‘reshape’ our data.

Here is the command. I will explain it below.

summary.table <- cells |> 
    group_by(clone, replicate) |> 
    summarise(
        volume=median(volume),
        mean.thickness=median(mean.thickness),
        radius=median(radius),
        area=median(area),
        sphericity=median(sphericity),
        length=median(length),
        dry.mass=median(dry.mass),
        length.to.width=median(length.to.width)
)

3.6.1 What this code does

The cells |> part takes the data from the cells data frame, and ‘pipes’ it into the group_by function. The |> symbol as means put this data into the next bit.
group_by(clone, replicate) means that make groups of data, according to which clone they are, and which replicate culture they were from.
summarise calculates some summaries of all the data rows (for each clone and replicate). The part volume=median(volume) creates a header called volume and fills this with the median cell volume for each clone and replicate
Right at the top, summary.table <- stores the results of all the piping in an object called summary.table.

Have a look at the information we generated with:

view(summary.table)

Do you see any patterns in this data?

What does this table tell you about clone A and clone B?
What does this tell you about mesenchymal stem cells?

3.7 Making plots

It looks like many of the cell shape metrics from above might differ between clones. So let’s look deeper, starting with cell width. We will make a box and whisker plot. We use a small ‘trick’ here: by including fill=replicate we force R to make different plots for each replicate.

ggplot(cells,aes(x=clone,y=width,fill=replicate))+
    geom_boxplot()

Making plots with ggplot

No matter what shape plot you want, ggplot uses the same syntax.

Start by telling R what data you want to use like this: ggplot(some_data_frame,aes(x=x_axis_data, y=y_axis_data) +

In this line, x_axis_data and y_axis_data are columns of some_data_frame.

Then define what the shape plot you want (the ‘geometry’): geom_boxplot(), geom_histogram() etc.
Add extra things to customise your plot, eg: xlab("my x axis label")

If you are unsure what to do, it’s OK to ask! Googling “ggpplot how to make a boxplot” (or any other plot) will help.

It does look like clone A and clone B differ consistently in width, with all repeats. To prove this, we would like to do a statistical test. A Student’s t-Test t.test would work. But t-Test’s assume that the data are normally distributed (like a bell curve).

We can test this approximately by plotting data, as below. It doesn’t look perfectly ‘bell shaped’, so let’s play is safe and use a nonparametric test.

ggplot(cells, aes(x = width)) +
  geom_density()

The nonparametric equivalent of a t-Test is a Wilcoxon rank sum test. There are two ways to run this test in R. We can do wilcox.test(vectorA, vectorB), where vectorA and vectorB contain the numeric values we want to test.

But does does not suit tidyverse data frames very well. So we will use the wilcox.test(numeric_values ~ category_name, data = some_data_frame) method, like so:

#Wilcoxon rank sum test
#To test if cloneA and cloneB have statistically different widths
wilcox.test(width ~ clone, data = cells)


    Wilcoxon rank sum test with continuity correction

data:  width by clone
W = 657964994, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Because the p-value is very low (< 2.2e-16), this means the null hypothesis (that both sets of numbers come from the same population) is very unlikely to be true.

Non-parametric tests are safe

Parametric tests are based on assumptions about the distribution of the real data. Nonparametric statistics are not based on these assumptions. So they are safer. Many biological metrics need nonparametric tests.

3.7.1 An easier way

Fortunately, you can add the results of nonparametric tests to ggplot, by adding stat_compare_means() to our plot. Note that below we do not force R to split up the replicates as we did above.

ggplot(cells,aes(x=clone,y=width))+
    geom_boxplot()+
    stat_compare_means()

4 Next step: modify the code

If there is something you don’t uderstand, it’s OK to ask us!

To increase your understanding, run through all the plots in section 3.5 with some other cell shape or size metric. To start, find what metrics are present with:

names(cells)

We suggest using one of these, in place of width:

volume
mean.thickness
radius
area
sphericity
length
dry.mass

5 Analysis with small data

So far, we have used all the rows in our data frame. But perhaps this isn’t wise, because the Livecyte tracking followed each cell over time. So each measurement of cell shape is not a completely independent measurement. It is more of a technical replicate than a biological replicate.

It is also valuable to learn how to make plots and analyse small data sets, as well as large ones. So not we will create a very small subset of the data, and investigate this.

First, read in the small data set we prepared, and have a quick look, to check what we have.

#read in some data from a website
small.tracking.data <-read_tsv(url("https://djeffares.github.io/BIO66I/trackingid.small.data.tsv"),
    col_types = cols(
        clone = col_factor(),
        replicate = col_factor(),
        tracking.id=col_factor(),
        lineage.id=col_factor()
    )
)

#check what we have
glimpse(small.tracking.data)
summary(small.tracking.data)

5.1 Plotting a small data set

Previously we used geom_boxplot() to compare clone A and clone B. Box and whisker plots (geom_boxplot) and violin plots (geom_violin, which we describe below) are good ways of displaying large amounts of data, but they can be misleading for small data sets. For small data sets it is better to simpler plot all the data points, so the audience has all the information. Let’s compare boxplot and a ggplot method to plot all the data (geom_jitter).

ggplot(small.tracking.data, aes(x=clone, y=width))+
    geom_boxplot(fill=NA)+
    geom_jitter(width = 0.2,size=3,pch=1)+
    theme_minimal()

names(small.tracking.data)

if you have time

Try two alternations of this plot.

Remove the geom_boxplot(fill=NA)+ line. Does it display the data more clearly?
Replace width in y=width with some other cell shape/size parameter. Using names(small.tracking.data) will show what metrics we have.

5.2 Statistical tests with small data and large sets

We use statistical tests like the Wilcoxon rank sum test (wilcox.test) to estimate how likely to be drawn from the same ‘population’ (or distribution). If the P-value is very low (eg: one chance in 10,000 or 0.0001), then we can conclude that the two data sets are indeed different. But the power of such a test depends on the number of data points.

We can prove that to ourselves by comparing a Wilcox test results from the small.tracking.data and the much larger cells data set:

#test with small.tracking.data
wilcox.test(width ~ clone, data = small.tracking.data)


    Wilcoxon rank sum exact test

data:  width by clone
W = 75, p-value = 0.1261
alternative hypothesis: true location shift is not equal to 0

#test with 'cells' data frame - a *much* larger data set  
wilcox.test(width ~ clone, data = cells)


    Wilcoxon rank sum test with continuity correction

data:  width by clone
W = 657964994, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

P-values can be misleading

P-values are not the biology. Use them with caution, and your own judgment. Sometimes very small differences, or very weak correlations can be statistically significant, but have little relevance for the biology.

5.3 Comparing plots side by side

Now, lets look plotting the small and large data sets side by side. Sometimes, a plot can give us a better sense of the data that a P-value.

#plot with the large data
#storing the plot in an object called large.data.plot
large.data.plot <- ggplot(cells, aes(x=clone, y=width))+
    geom_boxplot()+
    geom_jitter(width = 0.2,size=3,pch=1)+
    stat_compare_means()+
    ylim(0,200)+
    ggtitle("large data")

#plot with the small data
#storing the plot in an object called small.data.plot
small.data.plot <- ggplot(small.tracking.data, aes(x=clone, y=width))+
    geom_boxplot()+
    geom_jitter(width = 0.2,size=3,pch=1)+
    stat_compare_means()+
    ylim(0,200)+
    ggtitle("small data")

#create a two panel plot, with large.data.plot and small.data.plot
ggarrange(large.data.plot,small.data.plot)

The geom_jitter with the large data set looks terrible! This is not a good way to show this data. So remove this part of the plot, and try again. You may also want ot try using geom_violin instead of geom_boxplot.

You will also notice that we set the same y axis range for both plots, with ylim(0,200). We did this so we can compare the plots side by side.

6 Reflection

After loading the data, what did we do first?
What did making the plots do that summary statistics (liek average, or median) cannot do?
What did we learn about the two clones?
How do we plot and consider P-values for small and large data sets?

7 The end

This is the end of the workshop for today. We provide some consolidation exercises, extra material and advice about the R Studio project below.

8 After the workshop

Look at your script again. Add some comments, so that the next time you look at it, it will make more sense to you. Also, have a go at the consolidation exercises below.

8.1 Consolidation exercises

8.1.1 Violin plots `geom_violin`

We used geom_boxplot()to show the differences between categories (clone A and clone B). Try geom_violin instead. Which one do you prefer?

ggplot(cells,aes(x=clone,y=width,fill=replicate))+
    geom_violin()

Now adjust the title with ggtitle(label), and the axis labels with xlab(label) and ylab(label), and the theme (the general style of the plot), with theme_classic(). There are many themes for ggplot, which you can see here. Use whatever theme you feel makes the data clear and simple.

ggplot(cells,aes(x=clone,y=length,fill=replicate))+
    geom_violin()+
    ggtitle("put your title here!")+
    xlab("my X axis label")+
    ylab("my Y axis label")+
    theme_classic()

8.1.2 Making the small data set

This is how we made the small data set trackingid.small.data. Reading through and/or trying out this code will enhance your skills.

Use the dplyr tools group_by and summarise to calculate the average of all the cell shape measurements for each cell. Each cell is marked by the Livecyte software with a tracking id (tracking.id), so we group cells by this. The code group_by(clone, replicate, tracking.id) will group each measurement by the clone, replicate and tracking id, then the summarise part calculates the mean values for all our metrics:

#take the mean of each unique tracking id
trackingid.summary.table<-cells |>
    group_by(clone, replicate, tracking.id) |>
    summarise(
        volume=median(volume),
        mean.thickness=median(mean.thickness),
        radius=median(radius),
        area=median(area),
        sphericity=median(sphericity),
        length=median(length),
        width=median(width),
        dry.mass=median(dry.mass),
        length.to.width=median(length.to.width)
    )

`summarise()` has grouped output by 'clone', 'replicate'. You can override
using the `.groups` argument.

Then we use the dplyr function sample_n, which samples a number of rows from a data frame at random. If the data frame is grouped (which our data is),the number we set applies to each group:

#collect a random subset, of 5 rows for each tracking id
trackingid.small.data <- sample_n(trackingid.summary.table, 5)

#check that we have
nrow(trackingid.small.data)
glimpse(trackingid.small.data)

We write the output to a file:

#output trackingid.small.data as a tab-separated value (tsv) file
write_tsv(trackingid.small.data, file ="/Users/dj757/gd/modules/BIO66I/data/trackingid.small.data.tsv")

Look at the trackingid.small.data data frame, so we know what we have:

#how many rows do we have?
nrow(trackingid.small.data)

#what does the data look like?
view(trackingid.small.data)

8.2 Planning for your report

For this module, 30% of the grade is for an R Studio project. The script that you are creating now is the start of this. Here is some information that will help you to prepare to submit your project.

For the R Studio project

Create a logical folder structure for your analysis. Your submission should include the script you used for your work, any accessory functions and the data itself. Your script should be well-commented, well-organised and follow good practice in use of spacing, indentation and variable naming. It should include all the code required to reproduce data import and formatting as well as the summary information, analyses and figures in your report.

1 Learning objectives

2 Introduction

2.1 The biology

2.2 Research questions

2.3 The data

3 Exercises

3.1 Getting started

3.2 Loading the data

3.3 Exploring the data

3.4 Cleaning up

3.5 Save your work regularly

3.6 Summarise with dplyr

3.6.1 What this code does

3.7 Making plots

3.7.1 An easier way

4 Next step: modify the code

5 Analysis with small data

5.1 Plotting a small data set

5.2 Statistical tests with small data and large sets

5.3 Comparing plots side by side

6 Reflection

7 The end

8 After the workshop

8.1 Consolidation exercises

8.1.1 Violin plots geom_violin

8.1.2 Making the small data set

8.2 Planning for your report

8.1.1 Violin plots `geom_violin`