Statistical Analysis｜ドクターフント(Dr. Hund)

[R beginners] Drawing ROC curve

brainblog — Sat, 25 Mar 2023 07:00:00 +0000

This post shows how to draw an ROC curve in R, which takes only a few lines of code. Several packages can draw ROC curves; here I use the pROC package, which is widely used and easy to work with.

Creating Sample Data

This section is about creating sample data. If you don’t need it, please skip to the next section. Let’s consider how well we can detect people with the disease using two assays, “assay1” and “assay2”.

set.seed(3)
condition <- rep(c("healthy", "disease"), each = 50)
assay1 <- c(rnorm(40, mean = 1, sd = 2), 
            rnorm(10, mean = 3, sd = 1), 
            rnorm(40, mean = 5, sd = 2), 
            rnorm(10, mean = 7, sd = 1))
assay2 <- c(rnorm(30, mean = 1, sd = 3), 
            rnorm(20, mean = 2, sd = 2), 
            rnorm(30, mean = 3, sd = 3), 
            rnorm(20, mean = 4, sd = 2))

df <- data.frame(condition, assay1, assay2)

These distributions look like this:

library(beeswarm)
beeswarm(data = df, assay1 ~ condition, 
         col = c("red","blue"), pch=19)
beeswarm(data = df, assay2 ~ condition,
         col = c("red", "blue"), pch=19)

Assay1

Assay2

Use pROC

Install pROC package

First, you need to install and load the pROC package to use it in R.

install.packages("pROC")
library(pROC)

To draw a single ROC curve

The data frame looks like this:

> head(df)
  condition     assay1     assay2
1   healthy -0.9238668  2.8520501
2   healthy  0.4149486 -0.2152325
3   healthy  1.5175764  4.1593113
4   healthy -1.3042638  2.8068527
5   healthy  1.3915657  4.0523835
6   healthy  1.0602479  2.8245020

There are several ways to proceed from here, but first let’s use roc() to store the result in an object.

roc1 <- roc(condition, assay1, data=df, ci=TRUE,
            levels=c("healthy", "disease"))

As an alternative, you can also use the following notation with “~”:

roc1 <- roc(df$condition ~ df$assay1, ci=TRUE, 
            levels=c("healthy", "disease"))

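The condition column can also be coded as 0 and 1, since it is a binary variable. levels gives the control label first and the case label second, and ci = TRUE requests a confidence interval. Because condition and assay1 were also created above as standalone vectors, roc() can find them directly; if your variables exist only as columns of a data frame, the formula form roc(condition ~ assay1, data = df) is the safer choice.

You can use the plot() function to draw the ROC curve.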

plot(roc1)

Assay1

Convert to Percentage Display

You can customize the axis to display in percentage format by adding percent=TRUE.

roc1 <- roc(condition, assay1, data=df, percent=TRUE,
            levels=c("healthy", "disease"))
plot(roc1)

Display Confidence Interval of Sensitivity

You can display the confidence interval of the sensitivity. To calculate the confidence interval of sensitivity, use ci.se(). col is used to specify the color.

roc2 <- roc(condition, assay2, data=df, ci=TRUE,
            levels=c("healthy", "disease"))
plot(roc2)
rocCI <- ci.se(roc2)
plot(rocCI, type="shape", col="lightblue")
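To give a feel for where such an interval comes from, here is a minimal base R sketch of a percentile bootstrap. It is a simplification, not what ci.se() does internally: ci.se() evaluates sensitivity at fixed specificities, whereas this sketch fixes a single cutoff. The data, the cutoff, and all variable names are hypothetical.

```r
set.seed(42)
cases  <- rnorm(50, mean = 5, sd = 2)  # hypothetical assay values in diseased subjects
cutoff <- 3.5                          # hypothetical fixed cutoff

# Observed sensitivity: share of cases above the cutoff
sens_obs <- mean(cases > cutoff)

# Resample the cases with replacement and recompute sensitivity each time
boot_sens <- replicate(2000, mean(sample(cases, replace = TRUE) > cutoff))

# Percentile 95% confidence interval
ci <- quantile(boot_sens, probs = c(0.025, 0.975))
print(sens_obs)
print(ci)
```

Because the interval comes from random resampling, it changes slightly from run to run unless you fix the seed.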

Display Optimal Cutoff Point

One important aspect of the ROC curve is determining the optimal cut-off point. While determining the optimal cut-off point is a big topic on its own, here we introduce two methods: the Youden Index and top left.

To explain briefly, the Youden Index is calculated as “sensitivity + specificity – 1”, and the point with the highest value is selected as the optimal cut-off point. On the other hand, the top left method extracts the point closest to the upper left corner of the ROC curve.
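The two criteria can be sketched by hand in base R. This is only an illustration under simple assumptions (a value above the threshold counts as positive, and thresholds are midpoints between observed values); the toy data and variable names are made up for the example.

```r
set.seed(1)
controls <- rnorm(50, mean = 1, sd = 2)  # hypothetical healthy subjects
cases    <- rnorm(50, mean = 5, sd = 2)  # hypothetical diseased subjects

# Candidate thresholds: midpoints between consecutive sorted values
values     <- sort(unique(c(controls, cases)))
thresholds <- (head(values, -1) + tail(values, -1)) / 2

# A value above the threshold is called "positive"
sens <- sapply(thresholds, function(t) mean(cases > t))
spec <- sapply(thresholds, function(t) mean(controls <= t))

# Youden index: maximize sensitivity + specificity - 1
youden      <- sens + spec - 1
best_youden <- which.max(youden)

# Top left: minimize the squared distance to the corner (1, 1)
dist2        <- (1 - sens)^2 + (1 - spec)^2
best_topleft <- which.min(dist2)

result <- data.frame(
  method      = c("youden", "closest.topleft"),
  threshold   = thresholds[c(best_youden, best_topleft)],
  sensitivity = sens[c(best_youden, best_topleft)],
  specificity = spec[c(best_youden, best_topleft)]
)
print(result)
```

The two methods can pick different thresholds, because one maximizes a sum and the other minimizes a distance.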

Display Optimal Cutoff Based on Youden Index

You can add the optimal cutoff to the plot by modifying the plot() function call. Setting print.thres = "best" and print.thres.best.method = "youden" will display the optimal point. Setting legacy.axes = TRUE will make the x-axis display as 1-Specificity.

plot(roc1, main = "ROC Curve",      
     identity = TRUE,
     print.thres = "best",
     print.thres.best.method = "youden",
     legacy.axes = TRUE)

Optimal cutoff using the Youden Index (J = Sensitivity + Specificity – 1)

The optimal cut-off point in this case is 3.912 with sensitivity of 0.820 and specificity of 0.960.

Display Optimal Cutoff using Top Left Method

For the top left method, set print.thres.best.method = "closest.topleft" in the plot() call.

plot(roc1, main = "ROC Curve",      
     identity = TRUE,
     print.thres = "best",
     print.thres.best.method = "closest.topleft",
     legacy.axes = TRUE)

The optimal cutoff based on the top left approach

Note that in this case, the optimal cutoff is 3.365, with a sensitivity of 0.860 and specificity of 0.880.

Extracting Necessary Values

When presenting or publishing the values of AUC (area under the curve), sensitivity, specificity, etc., you can display them as follows.

> auc(roc1)
Area under the curve: 0.9536

# youden
> coords(roc1, "best", ret=c("threshold", "sens", "spec", "ppv", "npv"))
          threshold sensitivity specificity       ppv       npv
threshold  3.912407        0.82        0.96 0.9534884 0.8421053

# topleft
> coords(roc1, "best", best.method="closest.topleft", ret=c("threshold", "sens", "spec", "ppv", "npv"))
          threshold sensitivity specificity      ppv       npv
threshold  3.364778        0.86        0.88 0.877551 0.8627451
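As background on what the AUC measures: it can be read as the probability that a randomly chosen diseased subject has a higher value than a randomly chosen healthy one, with ties counted as one half. The following base R sketch checks this on hypothetical toy data; it does not use pROC, and the variable names are illustrative.

```r
set.seed(1)
controls <- rnorm(50, mean = 1, sd = 2)  # hypothetical healthy subjects
cases    <- rnorm(50, mean = 5, sd = 2)  # hypothetical diseased subjects

# Compare every case with every control
greater    <- outer(cases, controls, ">")
ties       <- outer(cases, controls, "==")
auc_manual <- mean(greater + 0.5 * ties)

# Same quantity from the Wilcoxon rank-sum statistic: W / (n_cases * n_controls)
w          <- unname(wilcox.test(cases, controls)$statistic)
auc_wilcox <- w / (length(cases) * length(controls))

print(c(manual = auc_manual, wilcoxon = auc_wilcox))
```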

Comparing two ROC curves by overlapping them

We have created sample data assuming two tests. Let’s see which test is better by comparing the two ROC curves.

Overlay Two ROC Curves

When overlapping two ROC curves, you can use the lines() function for the one that you want to add later, or simply use plot(..., add=TRUE) for the second curve. You can specify the color using col.

# Use roc() to create objects for the two assays
roc1 <- roc(condition, assay1, data=df, ci=TRUE,
            levels=c("healthy", "disease"))
roc2 <- roc(condition, assay2, data=df, ci=TRUE, 
            levels=c("healthy", "disease"))

# Use lines()
obj1 <- plot(roc1,
             col="red")
obj2 <- lines(roc2,
              col="blue")

# Use add = TRUE
obj1 <- plot(roc1,
             col="red")
obj2 <- plot(roc2,
             col="blue",
             add=TRUE)

Compare Two Assays

To compare two assays, use roc.test(). By default, it compares the AUCs (area under the curve) using the DeLong method, but you can switch to the bootstrap method by specifying method = "bootstrap". In the code below, text() writes the p-value in the center of the plot, and main= sets the title.

roc1 <- roc(condition, assay1, data=df, ci=TRUE,
            levels=c("healthy", "disease"))
roc2 <- roc(condition, assay2, data=df, ci=TRUE, 
            levels=c("healthy", "disease"))

obj1 <- plot(roc1,
             main="Comparison",
             col="red")
obj2 <- lines(roc2,
              col="blue")
obj <- roc.test(obj1, obj2)
text(.5, .5, labels=paste("p-value =", format.pval(obj$p.value, 3)), 
     adj=c(0, .5))

If you just want to see the comparison results, roc.test() will provide you with the results.

> roc.test(roc1, roc2)

	DeLong's test for two correlated ROC curves

data:  roc1 and roc2
Z = 3.7621, p-value = 0.0001685
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
 0.0969536 0.3078464
sample estimates:
AUC of roc1 AUC of roc2 
     0.9536      0.7512

Finally, let’s add a legend.

roc1 <- roc(condition, assay1, data=df, ci=TRUE,
            levels=c("healthy", "disease"))
roc2 <- roc(condition, assay2, data=df, ci=TRUE, 
            levels=c("healthy", "disease"))

obj1 <- plot(roc1,
             main="Comparison",
             col="red")
obj2 <- lines(roc2,
              col="blue")

obj <- roc.test(obj1, obj2)

# p values in the graph
text(.5, .5, labels=paste("p-value =", format.pval(obj$p.value, 3)), 
     adj=c(0, .5))

#legend
legend("bottomright", legend=c("Assay1", "Assay2"),
       col=c("red", "blue"), lty=1, lwd=2)

That covers the basics of drawing ROC curves in R with the pROC package. I hope this was helpful.

[R for beginners] Creating Error Bars for Bar Graphs and Line Graphs in ggplot2

brainblog — Fri, 24 Mar 2023 13:13:56 +0000

The ‘T’ symbol on bar graphs and line graphs represents error bars that extend above and below the data point, indicating standard error or standard deviation. Adding error bars in R’s ggplot2 is easy. Here’s a step-by-step guide:

Preparing the Data

We’ll be using tidyverse for this. If you have your own data, skip ahead.

# Set working directory
setwd("~/Rpractice/")

# Load tidyverse
library(tidyverse)

# Generate random data
dat <- list(
  X = rnorm(50, 30, 10),
  Y = rnorm(50, 50, 5),
  Z = rnorm(50, 40, 15)
)
df <- data.frame(matrix(unlist(dat), nrow=50))
colnames(df) <- c("A","B","C")

# Transform the data from wide to long
df.long <- pivot_longer(df, cols = A:C, 
                        names_to = "Categories", 
                        values_to = "Values")

Now we have data that looks like this:

> head(df.long)
# A tibble: 6 x 2
  Categories Values
  <chr>       <dbl>
1 A            26.8
2 B            53.7
3 C            27.3
4 A            31.0
5 B            58.7
6 C            56.2

We used pivot_longer to transform the data from wide to long format. To add error bars to bar graphs and line graphs, you also need the mean and the standard error (or standard deviation) of the data.

Calculating the Mean, Standard Deviation, and Standard Error of the Data

To add error bars, we need to calculate the mean, standard deviation, and standard error. If you already have this data, you can skip this section.
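As a reminder of the quantities involved, the standard error of the mean is the standard deviation divided by the square root of the sample size. A minimal base R sketch on hypothetical values (variable names are illustrative):

```r
set.seed(1)
values <- rnorm(50, mean = 30, sd = 10)  # hypothetical measurements

n  <- length(values)
m  <- mean(values)
s  <- sd(values)      # sample standard deviation
se <- s / sqrt(n)     # standard error of the mean

print(c(mean = m, sd = s, se = se))
```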

Calculating these values is easy with the dplyr package in tidyverse.

Use group_by and summarise_all functions in dplyr to Calculate

a <- group_by(df.long, Categories) %>% 
  summarise_all(list(mean = ~mean(.), 
                     sd = ~sd(.), 
                     se = ~sd(.)/sqrt(length(.))))

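We can simplify this code by passing mean and sd as bare function names. The standard error still needs the formula form, because it is an expression rather than a single function:

a <- group_by(df.long, Categories) %>% 
  summarise_all(list(mean = mean, 
                     sd = sd, 
                     se = ~sd(.)/sqrt(length(.))))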

Let’s take a look at the data for ‘a’

> a
# A tibble: 3 x 4
  Categories  mean    sd    se
  <chr>      <dbl> <dbl> <dbl>
1 A           30.1  9.93 1.40 
2 B           49.2  5.00 0.708
3 C           43.9 15.5  2.20

These are the summary statistics of the generated data. Because the data were generated randomly, your values will differ somewhat if you follow the same steps.
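If you prefer to spell out the column being summarised, the same table can also be sketched with plain summarise() and n(), which returns the group size. This is an alternative formulation, not the approach used above; the toy data frame and the name a2 are hypothetical and only mirror the column names of df.long.

```r
library(dplyr)

set.seed(1)
# Hypothetical long-format data with the same column names as df.long
toy <- data.frame(
  Categories = rep(c("A", "B", "C"), each = 50),
  Values     = c(rnorm(50, 30, 10), rnorm(50, 50, 5), rnorm(50, 40, 15))
)

a2 <- toy %>%
  group_by(Categories) %>%
  summarise(
    mean = mean(Values),
    sd   = sd(Values),
    se   = sd(Values) / sqrt(n())  # n() is the number of rows in the group
  )
print(a2)
```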

Drawing Error Bars in Bar Graphs

We will use the calculated data to create a bar graph and specify the error bars using geom_errorbar(). First, let’s specify only ymin and ymax in geom_errorbar() and take a look at the graph.

ggplot(a, aes(x = Categories, y = mean, fill = Categories))+
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se))

The width is too wide. Let’s adjust it using the width argument. We will narrow both the bar graph and error bars.

ggplot(a, aes(x = Categories, y = mean, fill = Categories))+
  geom_bar(stat = "identity", width = 0.6) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = .1)

It’s a good idea to adjust the width according to the size of the output image.

Drawing Error Bars on a Line Graph

The process for adding error bars to a line graph is the same as above. First, draw the line graph and then add geom_errorbar().

ggplot(a, aes(x = Categories, y = mean)) +
  geom_line(group = 1) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = .1)

Adding Error Bars to Grouped Data

Let’s generate grouped data.

# Add ID column
data <- df.long %>% 
  tibble::rownames_to_column(var = "ID")

# Convert ID from character to numeric
data$ID <- as.numeric(data$ID)

# Assign 1 to ID <= 75 and 0 to ID >= 76 to create the "group" column
data <- mutate(data, group = ifelse(ID < 76, 1, 0))

# Calculate Mean, SD, and SE
b <- group_by(data, group, Categories) %>% 
  summarise_at(vars(Values), list(mean = ~mean(.), 
                     sd = ~sd(.), 
                     se = ~sd(.)/sqrt(length(.))))

# Convert group column values from numeric to character
b$group <- as.character(b$group)

# Check the data included in b
> head(b)
# A tibble: 6 x 5
# Groups:   group [2]
  group Categories  mean    sd    se
  <chr> <chr>      <dbl> <dbl> <dbl>
1 0     A           31.4 11.1  2.21 
2 0     B           51.0  4.31 0.862
3 0     C           33.9 13.5  2.70 
4 1     A           29.5 10.4  2.08 
5 1     B           49.5  4.02 0.804
6 1     C           36.5 15.8  3.16 

Now that we have grouped data, let’s draw a bar graph with error bars.

Adding Error Bars to Grouped Bar Graphs

By specifying position = "dodge", you can create a grouped bar chart like this.

ggplot(b, aes(x = Categories, y = mean, fill = group))+
  geom_bar(stat = "identity", width = 0.6, position = "dodge")

Let’s add error bars to this plot.

ggplot(b, aes(x = Categories, y = mean, fill = group))+
  geom_bar(stat = "identity", width = 0.6, position = "dodge")+
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), 
                    position = "dodge", width = .1)

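The error bars are bunched toward the center of each category instead of sitting on their own bars. This happens because position = "dodge" takes the dodge distance from the error bar width (0.1) rather than from the bar width (0.6). You can fix this by passing the bar width explicitly with position = position_dodge(0.6).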

ggplot(b, aes(x = Categories, y = mean, fill = group)) +
  geom_bar(stat = "identity", width = 0.6, position = "dodge") +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = position_dodge(0.6), width = .1)

You’ve created a nice bar graph!

Drawing Error Bars on a Grouped Line Graph

The same can be done for a grouped line graph as well.

ggplot(b, aes(x = Categories, y = mean, group = group, color = group)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = .1, color = "black")

The error bars are on top and overlapping, making it difficult to see. Let’s move the error bars to the back first.

ggplot(b, aes(x = Categories, y = mean, group = group, color = group)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = .1, color = "black") +
  geom_line() +
  geom_point(size = 3)

By shifting the position to the left or right using position_dodge(), overlapping of error bars can be avoided.

ggplot(b, aes(x = Categories, y = mean, group = group, color = group)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = .1, color = "black", position = position_dodge(.2)) +
  geom_line(position = position_dodge(.2)) +
  geom_point(size = 3, position = position_dodge(.2))

Now you can add error bars to line graphs. I hope this was helpful.

Violin Plot in R – How to Draw It –

brainblog — Fri, 24 Mar 2023 10:33:25 +0000

There are several ways to visualize data, and one of them is the violin plot! This method is useful for comparing multiple sets of data, and it has an appealing appearance.

First, you’ll need to prepare your data. While you’re at it, you can also check the data distribution using a dot plot.

Data Generation and Distribution Checking

Now, let’s create some data. If you have your own data, feel free to use that instead.

# Set working directory
setwd("~/Rpractice/")
# load tidyverse and ggbeeswarm
library(tidyverse)
library(ggbeeswarm)

# generate sample data
dat <- list(
  X = rnorm(100, 5, 10),
  Y = rnorm(100, 20, 10),
  Z = rnorm(100, 15, 15)
)

# format the data into a more usable shape using pivot_longer
df <- data.frame(matrix(unlist(dat), nrow=100))
colnames(df) <- c("A","B","C")
df.long <- pivot_longer(df, cols = A:C, names_to = "Categories", values_to = "Values")

# draw dotplot
ggplot(df.long, aes(x = Categories, y = Values))+
  geom_beeswarm(aes(color = Categories),
                size = 2,
                cex = 2,
                alpha = .8)+
  theme_classic()+
  theme(legend.position = "none")

The dot plot displays the data distribution and can be used to confirm that the data is appropriate for creating a violin plot.

Draw a Simple Violin Plot

To draw a violin plot using ggplot2, you can utilize the geom_violin() function. To create a clean and simple plot, set the background color to white using the theme_classic() function.

ggplot(df.long, aes(x = Categories, y = Values))+
  geom_violin()+
  theme_classic()

The ends of the violin plot may appear cut off. Overlaying a dot plot on the violin plot shows where the data points actually lie. To do so, you can use either geom_dotplot() from ggplot2 or geom_beeswarm() from the ggbeeswarm package.

ggplot(df.long, aes(x = Categories, y = Values))+
  geom_violin()+
  geom_beeswarm(aes(color = Categories),
                size = 2,
                cex = 2,
                alpha = .8)+
  theme_classic()

You can see that the ends of this violin plot are cut off at the minimum and maximum values. If you don’t want to cut off the ends, you can use geom_violin(trim = FALSE) to specify this preference.

ggplot(df.long, aes(x = Categories, y = Values))+
  geom_violin(trim = FALSE)+
  geom_beeswarm(aes(color = Categories),
                size = 2,
                cex = 2,
                alpha = .8)+
  theme_classic()

With trim = FALSE, the violin extends above and below the observed range, into areas where there are no data points.
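The tails arise because a violin is drawn from a kernel density estimate, which spreads some density beyond the smallest and largest observations. The sketch below uses base R’s density() on hypothetical data to illustrate the idea; it assumes the violin behaves analogously and is not a reproduction of geom_violin() internals.

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 10)  # hypothetical values

# Default cut = 3: the evaluation grid extends 3 bandwidths past the data
d <- density(x)
print(range(x))
print(range(d$x))

# Restricting the grid to the observed range mimics a trimmed violin
d_trim <- density(x, from = min(x), to = max(x))
print(range(d_trim$x))
```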

Add Color to Violin Plot

Add Color to the Borders

To add color to the border of the violin plot, map a variable with aes(color = ), either in ggplot() as below or inside geom_violin().

ggplot(df.long, aes(x = Categories, y = Values, color = Categories))+
  geom_violin()+
  theme_classic()

Fill the interior of Violin Plot

If you want to fill the interior of the violin plot with color, map a variable with aes(fill = ), either in ggplot() as below or inside geom_violin().

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin()+
  theme_classic()

There are several ways to change the color of the violin plot. One way is to use the scale_fill_brewer() function to specify the color scheme.

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin()+
  scale_fill_brewer(palette = "Set2")+
  theme_classic()

Add Mean or Median to the Violin Plot

To add the mean or median to the violin plot, you can use the stat_summary() function.

Add Mean

ggplot(df.long, aes(x = Categories, y = Values, color = Categories))+
  geom_violin()+
  stat_summary(fun = mean, geom = "point", 
               shape = 16, size = 2, color = "red")+
  theme_classic()

The shape parameter in stat_summary() takes the same codes as pch in base R. For example, 16 is a filled circle and 3 is a plus sign.

Add Median

ggplot(df.long, aes(x = Categories, y = Values, color = Categories))+
  geom_violin()+
  stat_summary(fun = median, geom = "point", 
               shape = 3, size = 2, color = "red")+
  theme_classic()

Change the Degree of Smoothing

To change the degree of smoothing in the violin plot, you can use the adjust parameter inside the geom_violin() function. The default value for adjust is 1.

To decrease the degree of smoothing, you can set adjust to a smaller value. For example, to set adjust to 0.2, you can use the following code:

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin(adjust = .2)+
  theme_classic()

Conversely, if you want to increase the degree of smoothing in the violin plot, you can set the adjust parameter to a larger value.

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin(adjust = 2)+
  theme_classic()

If you increase the degree of smoothing too much, the violin plot can become overly smoothed and lose important details.
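The effect of adjust can be checked numerically with base R’s density(), where adjust multiplies the default bandwidth. This sketch uses hypothetical data and assumes that adjust in geom_violin() plays the same role as a bandwidth multiplier.

```r
set.seed(1)
x <- rnorm(100, mean = 20, sd = 10)  # hypothetical values

d_default <- density(x)                # adjust = 1
d_rough   <- density(x, adjust = 0.2)  # narrower bandwidth, bumpier curve
d_smooth  <- density(x, adjust = 2)    # wider bandwidth, smoother curve

print(c(rough = d_rough$bw, default = d_default$bw, smooth = d_smooth$bw))
```

A smaller bandwidth follows individual data points more closely, while a larger one averages over a wider neighborhood.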

Overlaying a Box Plot

Overlaying a violin plot with a box plot is a common technique in data visualization, and it can be a powerful way to display data.

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin()+
  geom_boxplot(width = .1, fill = "white")+
  theme_classic()

To hide the outliers in the box plot when overlaying it with a violin plot, you can use the outlier.color parameter inside the geom_boxplot() function.

ggplot(df.long, aes(x = Categories, y = Values, fill = Categories))+
  geom_violin()+
  geom_boxplot(width = .1, fill = "white", outlier.color = NA)+
  theme_classic()
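For reference, a box plot conventionally flags points lying more than 1.5 times the interquartile range beyond the box as outliers. The base R sketch below applies that rule to hypothetical data; the exact quartile method used by a given plotting function may differ slightly, so treat the limits as approximate.

```r
set.seed(1)
x <- c(rnorm(100, mean = 15, sd = 15), 90)  # hypothetical values plus one extreme point

q     <- quantile(x, probs = c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Points outside [lower, upper] would be drawn as individual outlier dots
outliers <- x[x < lower | x > upper]
print(outliers)
```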

You can fill the box plot with black color and add a white circle at the median value.

ggplot(df.long, aes(x = Categories, y = Values))+
  geom_violin()+
  geom_boxplot(width = .1, fill = "black", outlier.color = NA) +
  stat_summary(fun = median, geom = "point", fill = "white", shape = 21, size = 3) +
  theme_classic()

With the information provided, I believe you can now create a violin plot. I hope this guidance proves helpful!