Use mock data in tests

Wrap up vignette and paper
willgryan · Oct 6, 2023 · commit 32f0d56 · 1 parent 08da313

Showing 11 changed files with 232 additions and 114 deletions.
diff --git a/R/PAVER_theme_plot.R b/R/PAVER_theme_plot.R
@@ -14,9 +14,9 @@
 PAVER_theme_plot <- function(PAVER_result) {
 
   plot = PAVER_result$umap$layout %>%
-    tibble::as_tibble(rownames = NA) %>%
+    tibble::as_tibble(rownames = NA, .name_repair = "universal") %>%
     tibble::rownames_to_column("UniqueID") %>%
-    dplyr::rename(UMAP1 = "V1", UMAP2 = "V2") %>%
+    dplyr::rename_with(.cols = 2:3, ~ c("UMAP1", "UMAP2")) %>%
     dplyr::inner_join(PAVER_result$clustering %>%
                         dplyr::select(.data$UniqueID, .data$Group, .data$Cluster), by = "UniqueID") %>%
     ggplot2::ggplot(ggplot2::aes(x = .data$UMAP1,
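
The switch from renaming `V1`/`V2` by name to renaming by position can be illustrated with a minimal sketch. It assumes the UMAP layout is a two-column matrix with row names and no column names; the toy `layout` matrix and its values are made up for illustration and are not PAVER output. With `.name_repair = "universal"` the generated column names are no longer `V1` and `V2`, so positional renaming avoids depending on them.

```r
library(tibble)
library(dplyr)

# Stand-in for PAVER_result$umap$layout (assumption: a numeric matrix with
# row names but no column names).
layout <- matrix(c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3), ncol = 2)
rownames(layout) <- c("GO:0000001", "GO:0000002", "GO:0000003")

coords <- layout %>%
  as_tibble(rownames = NA, .name_repair = "universal") %>%
  rownames_to_column("UniqueID") %>%
  # Column 1 is now UniqueID; columns 2 and 3 hold the layout coordinates.
  # Renaming by position does not depend on the repaired names.
  rename_with(~ c("UMAP1", "UMAP2"), .cols = 2:3)

names(coords)
```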

diff --git a/joss/paper.md b/joss/paper.md
@@ -38,15 +38,15 @@ affiliations:
 
 # Summary
 
-Omics experiments are commonly used to predict changes in pathways underlying phenotypes. However, the results of these experiments are often long lists of pathways that are difficult to interpret. PAVER is an R package that automatically curates long lists of pathways into groups, identifies which pathway is most representative of each group, and provides publication-ready intuitive visualizations. PAVER makes it easy to integrate multiple pathway analyses, identify relevant biological insights and can work with any pathway database.
+Omics studies are commonly used to predict changes in biological pathways underlying phenotypes. However, the results of omics experiments can be long lists of pathways that are difficult to interpret. PAVER is an R package that automatically curates long lists of pathways into groups, identifies which pathway is most representative of each group, and provides publication-ready intuitive visualizations. PAVER makes it easy to integrate multiple pathway analyses, identify relevant biological insights and can work with any pathway database.
 
 # Statement of Need
 
-Multiomics is used extensively in biological research today. However, the development of omics technologies has vastly outpaced the expertise of researchers in its analysis, and the resulting “data deluge” now overwhelms the capacity of human cognition [@RN16; @RN20; @RN19]. Analysis of omics data is therefore the major bottleneck in most research projects today and its use in precision medicine remains limited accordingly [@RN26; @RN63]. Pathway analysis has since become ubiquitous to help interpret omics data and elucidate mechanisms of biological phenomena under study [@RN6]. Despite the last decade bringing a host of different computational tools to perform pathway analysis, they each generally result in lists of results too long to manually inspect and extract relevant targets for downstream wet lab validation without introducing biases [@RN5; @RN81]. Interpretation of results is accordingly the greatest expense in any omics project [@RN21]. With the total volume of omics data continuing to grow, novel ways of data management are needed [@RN22]. FAIR (Findable, Accessible, Interoperable, Reusable) scientific data principles necessitate automated interpretation of omics results [@RN25].
+Omics is used extensively in biological research today. However, the development of omics technologies has vastly outpaced the expertise of researchers in its analysis, and the resulting “data deluge” now overwhelms the capacity of human cognition [@RN16; @RN20; @RN19]. Analysis of omics data is therefore the major bottleneck in most research projects today and its use in precision medicine remains limited accordingly [@RN26; @RN63]. Pathway analysis has since become ubiquitous to help interpret omics data and elucidate mechanisms of biological phenomena under study [@RN6]. Despite the last decade bringing a host of different computational tools to perform pathway analysis, they each generally result in lists of results too long to manually inspect and extract relevant targets for downstream wet lab validation without introducing biases [@RN5; @RN81]. Interpretation of results is accordingly the greatest expense in any omics project [@RN21]. With the total volume of omics data continuing to grow, novel ways of data management are needed [@RN22]. FAIR (Findable, Accessible, Interoperable, Reusable) scientific data principles necessitate automated interpretation of omics results [@RN25].
 
 # Overview
 
-PAVER uses vector embeddings to help interpret pathway analyses. Embeddings encode the meaning of pathways into numerical representations which can then be clustered and visualized (\autoref{fig:overview}). To identify which pathway is most representative of a cluster, PAVER first takes the average embedding of all pathways in a cluster to capture its overall meaning into a single numerical representation [@RN49]. It then finds which pathway is most similar to the average embedding and labels the cluster with that pathway. This allows PAVER to automatically curate long lists of pathways into groups and identify which pathway is most representative of each group.
+PAVER uses vector embeddings to help interpret pathway analyses. Embeddings encode the meaning of pathways into numerical representations which can then be hierarchically clustered and visualized (\autoref{fig:overview}). To identify which pathway is most representative of a cluster, PAVER first takes the average embedding of all pathways in a cluster to capture its overall meaning into a single numerical representation [@RN49]. It then finds which pathway is most similar to the average embedding and labels the cluster with that pathway. This allows PAVER to automatically curate long lists of pathways into groups and identify which pathway is most representative of each group.
 
 ![PAVER uses numerical representations of pathways to find functionally related clusters.\label{fig:overview}](figures/overview.png)
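
The labeling step described in the Overview (average the embeddings in a cluster, then pick the pathway closest to that average) can be sketched in base R. This is an illustration under stated assumptions, not PAVER's implementation: the toy embeddings, the pathway names, the precomputed cluster assignments, and the choice of cosine similarity as the similarity measure are all assumptions, since the paper text does not specify them.

```r
# Toy embeddings: six hypothetical pathways in a 3-dimensional space.
embeddings <- rbind(
  pathway_a = c(1.0, 0.0, 0.0),
  pathway_b = c(0.8, 0.2, 0.0),
  pathway_c = c(0.6, 0.4, 0.0),
  pathway_d = c(0.0, 1.0, 0.0),
  pathway_e = c(0.0, 0.8, 0.2),
  pathway_f = c(0.0, 0.6, 0.4)
)

# Cluster assignments are taken as given here (one entry per row above).
clusters <- c(1, 1, 1, 2, 2, 2)

# Cosine similarity between two numeric vectors (assumed similarity measure).
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Average the member embeddings of a cluster, then return the name of the
# member most similar to that average.
representative <- function(cluster_id) {
  members <- embeddings[clusters == cluster_id, , drop = FALSE]
  centroid <- colMeans(members)
  sims <- apply(members, 1, cosine_similarity, y = centroid)
  names(which.max(sims))
}

labels <- vapply(sort(unique(clusters)), representative, character(1))
print(labels)
```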
 
@@ -60,4 +60,8 @@ The PAVER R package is licensed under the GNU General Public License v3.0. It ca
 
 This work was supported by NIH T32-G-RISE grant number 1T32GM144873-01.
 
+# Disclosure
+
+The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
+
 # References
diff --git a/tests/testthat/test-PAVER_combined_plot.R b/tests/testthat/test-PAVER_combined_plot.R
@@ -4,19 +4,32 @@ library(ggpubr)
 
 test_that("PAVER_combined_plot works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Run the function and catch the result
-  p <- PAVER_combined_plot(PAVER_result)
+  p <- PAVER_combined_plot(result)
 
   # Verify the function runs and produces a ggplot object
   expect_s3_class(p, "gg")

diff --git a/tests/testthat/test-PAVER_export.R b/tests/testthat/test-PAVER_export.R
@@ -5,19 +5,32 @@ library(tibble)
 
 test_that("PAVER_export works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Test PAVER_export function
-  export_result <- PAVER_export(PAVER_result)
+  export_result <- PAVER_export(result)
 
   # Verify the structure and content of the output
   expect_s3_class(export_result, "tbl_df")

diff --git a/tests/testthat/test-PAVER_hunter_plot.R b/tests/testthat/test-PAVER_hunter_plot.R
@@ -4,19 +4,32 @@ library(ggpubr)
 
 test_that("PAVER_hunter_plot works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Run the function and catch the result
-  p <- PAVER_hunter_plot(PAVER_result)
+  p <- PAVER_hunter_plot(result)
 
   # Verify the function runs and produces a HeatmapList object
   expect_s4_class(p, "HeatmapList")

diff --git a/tests/testthat/test-PAVER_interpretation_plot.R b/tests/testthat/test-PAVER_interpretation_plot.R
@@ -4,19 +4,32 @@ library(ggpubr)
 
 test_that("PAVER_interpretation_plot works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Run the function and catch the result
-  p <- PAVER_interpretation_plot(PAVER_result)
+  p <- PAVER_interpretation_plot(result)
 
   # Verify the function runs and produces a ggplot object
   expect_s3_class(p, "gg")

diff --git a/tests/testthat/test-PAVER_regulation_plot.R b/tests/testthat/test-PAVER_regulation_plot.R
@@ -4,19 +4,32 @@ library(ggpubr)
 
 test_that("PAVER_regulation_plot works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Run the function and catch the result
-  p <- PAVER_regulation_plot(PAVER_result)
+  p <- PAVER_regulation_plot(result)
 
   # Verify the function runs and produces a ggplot object
   expect_s3_class(p, "gg")

diff --git a/tests/testthat/test-PAVER_theme_plot.R b/tests/testthat/test-PAVER_theme_plot.R
@@ -4,19 +4,32 @@ library(ggpubr)
 
 test_that("PAVER_theme_plot works correctly", {
 
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  PAVER_result <- generate_themes(PAVER_result, minClusterSize = 40)
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
 
   # Run the function and catch the result
-  p <- PAVER_theme_plot(PAVER_result)
+  p <- PAVER_theme_plot(result)
 
   # Verify the function runs and produces a ggplot object
   expect_s3_class(p, "gg")

diff --git a/tests/testthat/test-generate_themes.R b/tests/testthat/test-generate_themes.R
@@ -7,27 +7,40 @@ library(umap)
 library(dynamicTreeCut)
 library(randomcoloR)
 
-test_that("generate_themes works correctly", {
-
-  #Use vignette example data
-  input = gsea_example
-
-  embeddings = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/embeddings_2023-03-06.RDS"))
-
-  term2name = readRDS(url("https://github.com/willgryan/PAVER_embeddings/raw/main/2023-03-06/term2name_2023-03-06.RDS"))
-
-  PAVER_result = prepare_data(input, embeddings, term2name)
-
-  # Test generate_themes function
-  result <- generate_themes(PAVER_result, minClusterSize = 40)
-
-  # Verify output structure
-  expect_type(result, "list")
-  expect_named(result, c("prepared_data", "embedding_mat", "umap", "goterms_df", "clustering", "avg_cluster_embeddings", "mds", "colors"))
-  expect_s3_class(result$clustering, "tbl_df")
-  expect_s3_class(result$avg_cluster_embeddings, "tbl_df")
-  expect_s3_class(result$mds, "smacof")
-  expect_type(result$colors, "character")
-
+# Mock unit test for the generate_themes function
+test_that("generate_themes function works with mock PAVER_result", {
+
+  #Mock input data
+  mock_input <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    GroupA = rnorm(250),
+    GroupB = rnorm(250),
+    GroupC = rnorm(250)
+  )
+
+  # Mock embeddings data
+  mock_embeddings <- matrix(rnorm(250 * 10), 250, 10)
+  rownames(mock_embeddings) <- paste0("GO:", sprintf("%07d", 1:250))
+
+  # Mock term2name data
+  mock_term2name <- data.frame(
+    GOID = paste0("GO:", sprintf("%07d", 1:250)),
+    TermName = paste0("Term ", 1:250)
+  )
+
+  # Generate the mock PAVER_result using the prepare_data function
+  mock_PAVER_result <- prepare_data(mock_input, mock_embeddings, mock_term2name)
+
+  # Run the generate_themes function with the mock PAVER_result
+  result <- generate_themes(mock_PAVER_result)
+
+  # Test that result is a list
+  expect_true(is.list(result))
+
+  # Test that the result contains expected elements
+  expect_true("clustering" %in% names(result))
+  expect_true("avg_cluster_embeddings" %in% names(result))
+  expect_true("mds" %in% names(result))
+  expect_true("colors" %in% names(result))
 })
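
The same mock setup is repeated in each test file in this commit. One way to centralize it is a shared helper; the sketch below is a suggestion, not part of the commit. The function name `make_mock_inputs()`, the file name `tests/testthat/helper-mock.R`, and the fixed seed are all hypothetical additions (testthat sources files in `tests/testthat` whose names start with `helper` before running tests).

```r
# Hypothetical helper, e.g. saved as tests/testthat/helper-mock.R.
# Builds the three mock inputs used by the tests in this commit.
make_mock_inputs <- function(n = 250, dims = 10, seed = 1) {
  # A fixed seed is an addition for reproducibility; the commit does not set one.
  set.seed(seed)

  ids <- paste0("GO:", sprintf("%07d", seq_len(n)))

  # Mock input data: one ID column plus three numeric group columns
  input <- data.frame(
    GOID = ids,
    GroupA = rnorm(n),
    GroupB = rnorm(n),
    GroupC = rnorm(n)
  )

  # Mock embeddings: one row per ID, `dims` random dimensions
  embeddings <- matrix(rnorm(n * dims), n, dims)
  rownames(embeddings) <- ids

  # Mock term2name mapping
  term2name <- data.frame(
    GOID = ids,
    TermName = paste0("Term ", seq_len(n))
  )

  list(input = input, embeddings = embeddings, term2name = term2name)
}

mock <- make_mock_inputs()
str(mock, max.level = 1)
```

Each test could then call `prepare_data(mock$input, mock$embeddings, mock$term2name)` followed by `generate_themes()`, as the tests above already do with their inline mock objects.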