BIO00066I Workshop 3

Cell Biology Data Analysis Workshop 3

Author

Daniel Jeffares

Published

February 18, 2026

1 Learning objectives

Philosophy

Workshops are not a test. You may make a lot of mistakes - that is fine. It’s OK to need help. The staff are here to help you, so don’t be afraid to ask us anything.

How to cite packages

print(citation("package_name"), style = "text")

1.1 Technical skills

Today we will work with cell movement data. We’ll use cell movement metrics that were automatically data collected in the Livecyte microscope and also manually tracked data. We will use R to explore these data, and to test our intuitions about how the two clones of mesenchymal stromal cells move.

We will continue to use plotting and data summarising skills we developed in workshop 2:

  • creating simple plots with ggplot2
  • making a summary table, showing which numeric metrics differ betwen clones A and B
  • making a principle component analysis (PCA) plot

And we will learn some new R plotting skills

  • facet_wrap a method to plot one variable split up over different treatments (or chromosomes, or days, or replicates)
  • how to create a correlation data frame using the correlate function. This allows us to explore many correlations between different metrics. Typically, these metrics are stored in the columns of our data frames.

1.2 Thinking like a data scientist

We encourage you to develop your data handling skills. By:

  • keeping our script clear, simple, and well-annotated
  • developing your own habits to keep your awareness on what data the data contains
Know your data!

To understand the biology captured in the data, you need to know what is in the data. At each step, be sure you know the rows, the columns and what they mean. Keep notes about this in your script.

2 Introduction

2.1 The biology

Today, we will continue our quest to understand mesenchymal stromal cells (MSCs). Remember, the two clones we are studying were both obtained from one person, and they have been transformed with telomerase to make them immortal (so they don’t age). Then, they have been cultured in a lab for years.

In workshop 2, we saw that these two clones were different shapes. This time, we examine how they move.

2.2 The data

In this workshop, we first examine cell movement data that was collected in an automated way by the Livecyte microscope. This shows something interesting, but it is not very reliable, because the Livecyte is not perfect at tracking individual cells.

People are better at tracking individual cells. So we then look at some manual tracking data. We will see such metrics as euclidean.distance (how far the cells have moved), mean.speed (how fast they go), and meandering.index (how much they meander and change their minds about where they are going!).

While we could guess how the cells differ from looking down the microscope, we can use our data science skills in two ways to enrich our perception, by:

  • showing metrics with plots - this will enhance our intuition
  • using statistical tests to test our intuitions
  • the tests will determine which metrics (if any) are significantly different between the clones

This work transforms intuitions into evidence.

2.3 Research questions

  • Do the two mesenchymal stromal cell clones move differently?
  • What data set(s) are most reliable?
  • What are the best parameters to distinguish the clones?
  • What is th best way to summarise this complex data to distinguish the clones?

3 Exercises

3.1 Setting up

  • Start up R Studio, and open your Project.
  • Open the script your worked on in workshop 2

We will keep working on this script. We advise you to mark clearly where workshop 1, workshop 2 and workshop 3 are. Something like this will help:

# WORKSHOP 3: CELL MOVEMENT DATA ----

Notice that placing at least four dashes ---- after the line of text creates a section in RStudio. This is a good way to keep your script organised, and to find your way around it later.

Save your R script often!

On a PC, use Ctrl+S to save your script. On a Mac, use Cmd+S.

First we will install the corrr package. To do this, we do:

install.packages("corrr")
install.packages is like installing an app on your phone.

R obtains the package from the The Comprehensive R Archive Network (CRAN), which is like the ‘app store’ for R. You need an internet connection to use it, but once you have installed a package once, you never need to install it again. Unless you are using a new computer.

Note that in install.packages command we need to put quotes " around the library name (corrr in this case), whereas when we load a library with library(corrr).

Then clear the previous work, and load the libraries we need:

# SET UP AND LIBRARY LOADING ----

#clear previous data
rm(list=ls())

#load the tidyverse
library(tidyverse)

#load the corr library: this is for examining correlations between many metrics
library(corrr)

#we need this to make pretty plots with the 'ggarrange' package
library(ggpubr)

3.2 Automated Livecyte cell movement data

Now read in the data from the file all-cell-data-FFT.filtered.2024-02-22.tsv. This contains the cell shape data we looked at lasst time, and some cell moment data.

# LOADING LIVECYTE DATA ----

# Read the automated Livecyte data
cells <-read_tsv(url("https://djeffares.github.io/BIO66I/data/all-cell-data-FFT.filtered.2024-02-22.tsv"),
                 col_types = cols(
                   clone = col_factor(),
                   replicate = col_factor(),
                   tracking.id=col_factor(),
                   lineage.id=col_factor()
                 )
)

Lets see what data we have:

names(cells)
Keep it simple

In data science, we can easily get confused. keep the data as simple as you can (but no simpler).

In the spirit of keeping it simple, let’s retain only the columns in this data frame that we need, using the select function. The cells data frame still has all the data, so it is not lost.

#examine what we have in the data frame
names(cells)
 [1] "clone"                    "replicate"               
 [3] "frame"                    "tracking.id"             
 [5] "lineage.id"               "position.x"              
 [7] "position.y"               "pixel.position.x"        
 [9] "pixel.position.y"         "volume"                  
[11] "mean.thickness"           "radius"                  
[13] "area"                     "sphericity"              
[15] "length"                   "width"                   
[17] "orientation"              "dry.mass"                
[19] "displacement"             "instantaneous.velocity"  
[21] "instantaneous.velocity.x" "instantaneous.velocity.y"
[23] "track.length"             "length.to.width"         
#select only the columns we need
cell.movement.data <- select(cells,
        clone,
        replicate,
        displacement, 
        track.length, 
        instantaneous.velocity
)

And then check what we have in cell.movement.data now:

#check that we have
names(cell.movement.data)

#get a simple summary, using summary and also glimpse
summary(cell.movement.data)
glimpse(cell.movement.data)
Save your R script often!

On a PC, use Ctrl+S to save your script. On a Mac, use Cmd+S.

3.2.1 Making plots with Livecyte movement data

let’s see how clone A and clone B move. First, we’ll plot the instantaneous.velocity:

# PLOTTING LIVECYTE DATA ----

#instantaneous.velocity - geom_violin
ggplot(cell.movement.data,aes(x=clone,y=instantaneous.velocity,colour=clone))+
    geom_violin(alpha=0.5)+
    stat_compare_means()

This plot isn’t very revealing is it? That is because most of the instantaneous.velocity values are very low, and the data are certainly not normally distributed. Biological data is often like this. Plotting metrics on a log scale is often the solution. Log2 or log10 scales are commonly used.

Adjust this plot yourself

So that you plot y=log10(instantaneous.velocity). Does it look better?

Enhance the plot with facet_wrap

This time, show the repeats by adding a new line to the plot code:

facet_wrap(~replicate)+

What is facet_wrap?. Our first categorical value that we split up the data into was clone A and clone B. facet_wrap splits the data up again into a second categorical value, and plots each category. In this case our second category was ~replicate.

If these adjustments to the code worked, you will end up with a plot like this. What does this tell you about clone movement?.

Optional

Use your plot code to explore, displacement, track.length and/or instantaneous.velocity.

3.2.2 Summarising the Livecyte data

In the last workshop, we made a summary table to show which cell shape metrics were different between the two clones. We can do the same thing for the cell movement data. This will give us a clear summary of which metrics are different between the clones, and how different they are.

We start by making clone A and clone B data frames:

# SUMMARY OF LIVECYTE DATA ----

# make clone A and clone B data frames
cloneA.data <- cell.movement.data |> filter(clone == "cloneA")
cloneB.data <- cell.movement.data |> filter(clone == "cloneB")

And then collecting a list of the numeric columns in the data frame, so we can loop through them and compare them between the clones.

#get the names of the numeric columns
numeric.columns <- cell.movement.data |> 
  select(where(is.numeric)) |> 
  names()

#see what we have
numeric.columns
[1] "displacement"           "track.length"           "instantaneous.velocity"

Then we create an empty data frame tos ave the results in:

#create empty data frame
clone.comparisons <- data.frame(
  variable = character(),
  cloneA.median = numeric(),
  cloneB.median = numeric(),
  median.ratio = numeric(),
  p.value = numeric()
)

Finally, we run this loop to compare the clones for each numeric variable, and save the results in the clone.comparisons data frame. Again it is important to run the loop code all at once.

#loop through each numeric column
for(column.name in numeric.columns) {
  
  #calculate median for clone A
  cloneA.median <- median(cloneA.data[[column.name]], na.rm = TRUE)
  
  #calculate median for clone B  
  cloneB.median <- median(cloneB.data[[column.name]], na.rm = TRUE)
  
  #calculate the median ratio (cloneA.median / cloneB.median)
  ratio <- cloneA.median / cloneB.median
  
  #run wilcox test comparing the two clones for this variable
  test.result <- wilcox.test(cells[[column.name]] ~ cells$clone)
  
  #extract the p-value
  p.val <- test.result$p.value
  
  #add this row to our results table
  new.row <- data.frame(
    variable = column.name,
    cloneA.median = signif(cloneA.median,2),
    cloneB.median = signif(cloneB.median,2),
    median.ratio = signif(ratio,2),
    p.value = p.val
  )
  
  #add the new.row to the clone.comparisons data frame
  clone.comparisons <- rbind(clone.comparisons, new.row)
}

When this is done, examine the output data frame with view(clone.comparisons). This

Tip

In your report, you could make one table for cell shape and movement data, or keep the tables separate. One table would be a more comprehensive summary. All you need to do to acheive this, is to run the loop code above with a data frame that contains both the cell shape and movement data.

3.2.3 PCA with Livecyte movement data

We saw in Workshop 2 that PCA is a powerful way to summarise complex data, and to show how the clones differ in a single plot.

So let’s try a PCA plot with only the cell movement data, displacement, track.length and instantaneous.velocity

We’ll prepare the data for PCA and then calculate PCA coordinates with the prcomp() function:

# PCA WITH LIVECYTE DATA ----

# prepare data for PCA

cell.data.for.pca <- cell.movement.data |>
  drop_na() # remove rows with NA values

# calculate PCA coordinates 
pca.result <- cell.data.for.pca |>
  select(-clone, -replicate) |>   # removes the group categories
  prcomp()                        # performs the PCA

To examine a summary of PCA results, we can use the summary(pca.result), this will show us how much of the variance (range of values) in the data is explained by each PCA coordinate. We can also use screeplot(pca.result) to visualise this. You will see that almost all the variance is explained by the first PCA coordinate (PC1). Note that there are only 3 possible PCA coordinates with this data.

Then we extract PCA scores and combine with clone and replicate info, and down sample the data so we can make a clearer plot. You may chose to show more or less of the data in the PCA plot. YOU can do this by changing this line of code, that currently selects 300 rows of the fata at random slice_sample(n = 300).

# extract PCA scores and combine with clone and replicate info
pca.scores <- as.data.frame(pca.result$x) |>
  bind_cols(cell.data.for.pca |> select(clone, replicate))

# downsample the PCA scores to 3000 cells per clone
pca.scores.sample <- pca.scores |>
  group_by(clone) |>
  slice_sample(n = 300) |>
  ungroup()

Finally, generate the PCA plot:

# make the plot
pca.plot <- ggplot(pca.scores.sample, 
  aes(x = PC1, y = PC2, color = clone, fill = clone)) +
  geom_point(alpha = 0.1, size = 4)+
  stat_ellipse(geom = "polygon", alpha = 0.1, linewidth = 1.2) +
  scale_color_manual(values = c("cloneA" = "red", "cloneB" = "blue"))+
  facet_wrap(~replicate)+
  theme_classic2()


# view the plot
pca.plot

At this stage, we could save this plot with ggsave("BIO00066I-workshop3-movement-pca-plot.png", width = 6, height = 4) if we like, but this PCA does not distingish clone A from clone B well at all.

Note
  • Why do you think this PCA does not work so well?
  • Should we run a PCA analysis with only 3 movement metrics?
  • Perhaps we should run PCA with cell movement and cell shape data together?

3.3 Manual tracking data

Sometimes, the automated measurements are not the best quality. In this case, we know that the Livecyte microscope is not very good at tracking cells, so the cell movement metrics are not as good as we would like.

So Amanda spent many hours manually tracking cells. These results were processed into manual tracking data. First, load the data and of course examine what you have.

# LOAD MANUAL TRACKING DATA ----

#load the manual tracking data
track <-read_tsv(url("https://djeffares.github.io/BIO66I/data/A1-and-B2-tracking.data.2025-02-27.tsv"))

As usual, it is wise to check what is in our data frame with names() and/or glimpse() or summary(track).

In this data, TID is the Livecyte tracking ID, a unique number that the microscope gives to each ‘object’ it can identify. LID is the Livecyte lineage ID. This keeps track of the cell lineage (ie: the initial cells, and the subsequent daughter cells that are derived from it as it divides). When a cell divides, both ‘daughter cells’ keep the same lineage ID, but each is assigned a new unique tracking ID.

Different names, same metric!

Note that different analysis software can use different terminologies to define their metrics. How confusing! In our case instantaneous.velocity from the cells data frame is the same as mean.speed in the tracking data. Both of these metrics are a measure of distance traveled/time.

Figure 1. The tracking ID (TID) is a unique ID that the microscope assigned to each object (cell), as it tracks it through time. The lineage ID (LID) is a number assigned to the ‘family’ of cells that derived by cell division during the the experiment. When a cell divides, each daughter cell is given a new tracking ID, but they keep their lineage ID, because they are still part of the same family.

3.3.1 Plotting manual tracking data

Let’s start by looking at the speed that each cell moves at during the experiment. This is recorded as mean.speed in this data set (the same as the instantenous.velocity in the cells data set). We can compare the clones (cell.line) like so:

# PLOTTING MANUAL TRACKING DATA ----

#compare mean.speed between cell lines
ggplot(track, aes(x=cell.line,y=mean.speed))+
  geom_boxplot()+
  stat_compare_means()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_compare_means()`).

Improve the plot

Try these things to improve this plot:

  • use geom_violin instead of geom_boxplot
  • adding theme_classic
  • making the x-axis and y-axis look better with xlab("something") and ylab("something")
  • giving the plot a title with ggtitle("top title", subtitle ="sub text")

After the workshop, we advise you to plot some other movement metrics from the tracking data.

3.3.2 Correlations abound!

With biological data (and data from many other sources), different measurements of the same set of ‘things’ are often correlated. For example, human height and weight are strongly correlated. So it is with cells.

We could examine each correlation one by one, like so:

#examine whether track.length and mean.speed are correlated
cor.test(track$track.length,track$mean.speed,method="spearman")

    Spearman's rank correlation rho

data:  track$track.length and track$mean.speed
S = 701096, p-value = 1.92e-13
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4741649 
A Non parametric warning

Here we use the non parametric test method="spearman" to calculate correlations, because we know from looking at the data that the track.length and probably mean.speed are not normally distributed. In a Spearman rank correlation, rho is the correlation coefficient (sometimes rho is written using the Greek symbol \(\rho\)).

But since there are 11 cell movement metrics, this would quickly become boring. And (just as importantly) the data would be a challenge to understand.

Fortunately, there is an easier way. First we use the correlate function to calculate all pairwise correlations. We save the correlation coefficients in the track.correlations data frame.

# CORRELATIONS WITHIN MANUAL TRACKING DATA ----

#calculate all pairwise correlations
track.correlations <- 
  track |>
  correlate(method="spearman")
Non-numeric variables removed from input: `track.present.at.start.or.end`, `track.never.divides`, and `cell.line`
Correlation computed with
• Method: 'spearman'
• Missing treated using: 'pairwise.complete.obs'
#see what we have
head(track.correlations)
# A tibble: 6 × 8
  term                   LID    TID track.duration track.length meandering.index
  <chr>                <dbl>  <dbl>          <dbl>        <dbl>            <dbl>
1 LID               NA       -0.148         0.0411       0.0346           0.0281
2 TID               -0.148   NA            -0.255       -0.268            0.221 
3 track.duration     0.0411  -0.255        NA            0.840           -0.477 
4 track.length       0.0346  -0.268         0.840       NA               -0.345 
5 meandering.index   0.0281   0.221        -0.477       -0.345           NA     
6 euclidean.distan…  0.00902 -0.133         0.413        0.626            0.407 
# ℹ 2 more variables: euclidean.distance <dbl>, mean.speed <dbl>

Then we use pivot_longer ass we did in BIO00066I core workshop 1, to simplify the data format.

Figure 2. How pivot_longer reshapes data. Notice that no data is lost.

We will use pivot_longer this way:

#Adjust the name of the first column to "Variable1"
names(track.correlations)[1]="Variable1"

#simplify the data with pivot_longer
track.correlations.pivot <- 
  track.correlations |> 
  pivot_longer(-Variable1, names_to = "Variable2", values_to = "corr.coeff")

#examine what we have
head(track.correlations.pivot)
# A tibble: 6 × 3
  Variable1 Variable2          corr.coeff
  <chr>     <chr>                   <dbl>
1 LID       LID                  NA      
2 LID       TID                  -0.148  
3 LID       track.duration        0.0411 
4 LID       track.length          0.0346 
5 LID       meandering.index      0.0281 
6 LID       euclidean.distance    0.00902

Finally, we can show all the correlation coefficients in one plot. This gives us a unique and revealing sense of the data.

#plot data in the track.correlations.pivot table
#using geom_tile (for coloured boxes) and geom_text (to show the correlation coefficient values) 
ggplot(track.correlations.pivot, aes(Variable1, Variable2)) +
  geom_tile(aes(fill = corr.coeff)) +
  geom_text(aes(label = round(corr.coeff, 1))) +
  scale_fill_gradient(low = "white", high = "red")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_text()`).

3.3.2.1 Removing unhelpful columns

OK, this is a useful plot. If you have got this far in the workshop, you are doing well!

But it is a bit cluttered. We could improve it by removing the TID and LID columns, because these are arbitrary numbers that don’t tell us anything about the biology.

Luckily, we don’t have to do all our analysis again. We can just remove these columns from the track.correlations.pivot data frame, before we make the plot. Here is how:

#pipe track.correlations.pivot data frame into filter
track.correlations.pivot |>

  #filter out the TID and LID columns
  filter(Variable1 != 'LID') |>
  filter(Variable2 != 'LID') |>
  filter(Variable1 != 'TID') |>
  filter(Variable2 != 'TID') |>
  
  #now we put the plotting code here
  ggplot(aes(Variable1, Variable2)) +
  geom_tile(aes(fill = corr.coeff)) +
  geom_text(aes(label = round(corr.coeff, 1))) +
  scale_fill_gradient(low = "white", high = "red")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).

#and voila! no more TID and LID columns

3.3.3 Relate the maths to the biology

  • Positive correlations mean that when one metric is high, the other is high too. For example, track.length and mean.speed are positively correlated. This makes sense, because a cell that has managed to travel a long way, is probably a fast mover.

  • Negative correlations mean that when one metric is high, the other is low. For example, track.length and meandering.index are negatively correlated. This makes sense, because a cell that meanders about (doesn’t go a in straight line) probably doesn’t have a long total track length?

  • Correlations near zero mean that the two metrics are not strongly related.

Some points to note in this plot:
  • round(corr.coeff, 1) reduces the number of decimal places to 1.
  • try the plot with only: geom_text(aes(label = corr.coeff))+ (eek!)
  • element_text(angle = 90, vjust = 0.5, hjust=1) changes the direction of the x axis labels
  • try omitting this line (again, eeek!)
  • the geom_tile method can be used to plot any pairwise interaction data

3.3.4 PCA with manual tracking data

As with the Livecyte cell shape and cell movement data, we can also run a PCA analysis with the manual tracking data. This will show us how well the clones are separated by the PCA coordinates, and which metrics contribute to this separation.

This time, we can bundle all the code to one block, because it is not very long, and it is easier to understand if we see it all together. You can run this code block all at once, and then examine the PCA plot that it produces.

# prepare data for PCA
track.data.for.pca <- track |>
  #select the numeric columns
  select(cell.line, track.duration, track.length, euclidean.distance, meandering.index, mean.speed) |>
  # remove rows with NA values
  drop_na()


# calculate PCA coordinates
pca.result.track <- track.data.for.pca |>
  select(-cell.line) |>  # remove the group categories
  prcomp()               # performs the PCA

# extract PCA scores and combine with clone and replicate info
pca.scores.track <- as.data.frame(pca.result.track$x) |>
  bind_cols(track.data.for.pca |> select(cell.line))

# make the plot
pca.plot.track <- ggplot(pca.scores.track, 
  aes(x = PC1, y = PC2, color = cell.line, fill = cell.line)) +
  geom_point(size = 4)+
  stat_ellipse(geom = "polygon", alpha = 0.1, linewidth = 1.2) +
  theme_classic2()

# view and then save the plot
pca.plot.track

ggsave("BIO00066I-workshop3-track-pca-plot.png", width = 6, height = 4)

In this case, the elipses arund the points are not too helpful in my opinion. So you can remove these by removing the line stat_ellipse(geom = "polygon", alpha = 0.1, linewidth = 1.2) +.

4 Reflection

Today you have learned some data handling and plotting skills, and explored many facets of these mesenchymal stromal cells. You have learned:

  • How to use facet_wrap to split plots by a second category
  • How to plot values on a log10() scale
  • How to ‘lengthen’ and simplify data using pivot_longer
  • How to calculate and display numerous correlations in a grid with geom_tile
  • That PCA analysis lacks explanatory power if we have few metrics, but is powerful when we have many metrics
Some things to think about
  • What do we know about the clones now?
  • What do you know about how they move?
  • Which metrics are correlated?
  • What do correlations with negative \(\rho\) mean?

5 The end

That is the end for today.

Download Code

You can download all the R code from this workshop as a single script file: workshop3.R

6 After the workshop

6.1 Consolodation exercises

6.1.1 Explore the data

There are various ways you could explore this data further for your report. here, I will show how to make multi-part plots with ggarrange, show how to make a PCA plot with both cell shape and cell movement data together, and show how to extract more information from PCA results.

6.1.1.1 Making multi-part plots

You can use the ggarrange function from the ggpubr package to arrange multiple ggplot objects into a single figure. This is useful for comparing different plots side by side.

To do this you’ll need to save each plot in an object, like so:

# multi part livecyte data plot
# plot 1: displacement
displacement.plot <- cells |> 
  ggplot(aes(x=clone,y=log10(displacement)))+
  geom_violin(alpha=0.5)+
  stat_compare_means()

# plot 2: track.length
track.length.plot <- cells |> 
  ggplot(aes(x=clone,y=log10(track.length)))+
  geom_violin(alpha=0.5)+
  stat_compare_means()

# plot 3: width
width.plot <- cells |> 
  ggplot(aes(x=clone,y=log10(width)))+
  geom_violin(alpha=0.5)+
  stat_compare_means()

# plot 4: sphericity
sphericity.plot <- cells |> 
  ggplot(aes(x=clone,y=log10(sphericity)))+
  geom_violin(alpha=0.5)+
  stat_compare_means()

# arrange the plots into a 2x2 grid
combined.livecyte.plot <- ggarrange(
  displacement.plot, 
  track.length.plot, 
  width.plot, 
  sphericity.plot, 
  ncol=2, nrow=2,
  labels = c("A", "B", "C", "D"))

# save the combined plot
ggsave("BIO00066I-workshop3-combined-livecyte-plot.png", 
       combined.livecyte.plot, width = 8, height = 6)

6.1.1.2 PCA with Livecyte cell shape and cell movement data together

Very briefly, you can run PCA with both cell shape and cell movement data together, by including all the relevant columns in the data frame that you use for PCA. For example, you could use this code to prepare the data for PCA:

# Read the automated Livecyte data
cells <-read_tsv(url("https://djeffares.github.io/BIO66I/data/all-cell-data-FFT.filtered.2024-02-22.tsv"),
                 col_types = cols(
                   clone = col_factor(),
                   replicate = col_factor(),
                   tracking.id=col_factor(),
                   lineage.id=col_factor()
                 )
)

# prepare data for PCA with both cell shape and movement data
cells.for.pca.alldata <- cells |>
  #select the numeric columns, including the cell shape data and also the cell movement data
  select(clone, replicate, 
      # shape data
      volume, mean.thickness, radius, area, sphericity, length, width, orientation, dry.mass, 
      # movement data
      length.to.width, displacement, track.length, instantaneous.velocity) |>
  # remove rows with NA values
  drop_na()


#run PCA
pca.result <- cells.for.pca.alldata  |>
  select(-clone, -replicate) |>  # remove the group categories
  prcomp()                        # performs the PCA

# extract PCA scores and combine with clone and replicate info
pca.scores <- as.data.frame(pca.result$x) |>
  bind_cols(cells.for.pca.alldata |> select(clone, replicate))

# down sample the PCA scores to 3000 cells per clone
pca.scores.sample <- pca.scores |>
  group_by(clone) |>
  slice_sample(n = 3000) |>
  ungroup()

# create PCA1, PC2 plot
pca.plot <- ggplot(pca.scores.sample, 
  aes(x = PC1, y = PC2, color = clone, fill = clone)) +
  geom_point(alpha = 0.5, size = 2)+
  stat_ellipse(geom = "polygon", alpha = 0.05, linewidth = 1.2)+
  theme_classic2()

# view and save the plot  
pca.plot
ggsave("BIO00066I-workshop3-pca-alldata-plot.png", width = 6, height = 4)

6.1.1.3 Exploring PCA results

There is more to PCA than just plotting the PCA scores. It can be interesting to know how much of total variance in the data is explained by each principal component. This is often shown in a scree plot, but we can also calculate it manually from the PCA results. Here is the code to do both:

# make a scree plot from pca.result
screeplot(pca.result.track, type = "lines")

# Calculate manually from standard deviations
variance <- pca.result.track$sdev^2
percent.variance <- variance / sum(variance) * 100

# Or create a nice data frame
variance.explained <- data.frame(
  PC = paste0("PC", 1:length(percent.variance)),
  Variance = variance,
  Percent = percent.variance,
  Cumulative = cumsum(percent.variance)
)

# Visualize with a scree plot
barplot(percent.variance, 
        names.arg = paste0("PC", 1:length(percent.variance)),
        xlab = "Principal Component",
        ylab = "% Variance Explained")

6.2 Planning for your report

The RStudio project is worth 20% of this module. The submission of your RStudio project should:

  • have a logical folder structure for your analysis
  • contain a well-commented script
  • be well-organised and follow good practice in use of spacing, indentation and variable naming.

The details are described in this document.

The aim of all this is to make the code and the data analysis clear and readable for someone else.

So spending just 10 minutes tidying your code now will make your future life easier. For example, look at your variable names, and make them clearer.