Introduction
High-throughput technologies such as RNA-Seq and mass
spectrometry-based proteomics are increasingly being applied to large
sample cohorts, which creates vast amounts of quantitative data for genes
and proteins. Many algorithms, software tools, and pipelines have been
developed to analyze these data. However, selecting the optimal
algorithms, software, and parameters for analyzing a specific omics
dataset remains a significant challenge. To address this challenge, we
have developed an R package named OmicsEV, which is dedicated to
comparing and evaluating different data matrices generated from the
same omics dataset using different tools, algorithms, or parameter
settings. OmicsEV implements more than 15 evaluation metrics, and all
evaluation results are included in an HTML report for intuitive
browsing. OmicsEV is easy to install and use: only one function is
needed to perform the whole evaluation process. A GUI based on R Shiny
is also implemented.
Example data
A few examples can be downloaded at https://github.com/bzhanglab/OmicsEV. One of the examples contains six data matrices generated from the same RNA dataset using different normalization methods. A proteomics data matrix and a sample list are also included. The steps for running this example are shown below.
Running OmicsEV
Preparing inputs
The two major input files are the omics data tables and a sample annotation file. More details can be found below.
Running evaluation process
In OmicsEV, only one function (run_omics_evaluation) is needed to
perform the whole evaluation process. An example is shown below:
library(OmicsEV)
run_omics_evaluation(data_dir = "datasets/",
                     sample_list = "sample_list.tsv",
                     x2 = "protein.tsv",
                     cpu = 6,
                     data_type = "gene",
                     class_for_ml = "sample_ml.tsv")
In general, only a few parameters have to be set:
- data_dir: a folder containing the datasets (data tables) to be evaluated, in tsv format. All datasets must have the same format. In these files, the first column must be the gene ID or protein ID. The expression values of genes or proteins must be on a non-log scale. Missing values must be represented as “NA”. Even if there is only one data table, it must be placed in a folder. An example input dataset is shown below:
ID | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 | sample_6 |
---|---|---|---|---|---|---|
A1BG | 0.699 | 1.022 | 0.256 | 1.322 | 0.854 | 0.525 |
A2M | 3.085 | 0.392 | 0.681 | 0.540 | 0.930 | 0.757 |
A2ML1 | 1.398 | 1.106 | 0.981 | 0.954 | 1.869 | 0.790 |
A4GALT | 0.364 | 1.340 | 2.035 | 3.158 | 1.725 | 0.280 |
AAAS | 0.802 | 1.019 | 1.634 | 0.695 | 0.308 | 0.829 |
AACS | 0.689 | 0.505 | 0.420 | 1.069 | 0.266 | 0.333 |
AADAT | 1.312 | 2.429 | 2.344 | 1.491 | 0.983 | 0.467 |
AAED1 | 2.800 | 1.263 | 0.935 | 0.716 | 0.201 | 2.055 |
AAGAB | 0.230 | 1.149 | 0.634 | 0.599 | 1.753 | 0.810 |
AAK1 | 2.317 | 0.713 | 1.407 | 1.410 | 1.336 | 0.617 |
- sample_list: a file in tsv format that contains the sample ID, sample class, batch, and order for each sample. An example sample list file is shown below. If there is no batch design in the study, the batch for all samples can be set to 1. The order is typically the order in which the data were generated for the samples; it should start from 1, and its maximum is the number of samples in the table. The order is only used for sorting samples in some plots and is not used to calculate any quantitative metric, so if the data generation order is unknown, an arbitrary order can be assigned to the samples. If there are QC samples in the data table, their class should be set to “QC”, and OmicsEV will generate metrics based on these QC samples. QC samples are optional: if the data table contains none, QC-related metrics will not be generated. In the sample list file, each row is a sample with a unique sample ID, so no two rows may share the same sample ID.
sample | class | batch | order |
---|---|---|---|
sample_1 | T | 1 | 1 |
sample_3 | T | 1 | 2 |
sample_5 | C | 1 | 3 |
sample_2 | T | 2 | 4 |
sample_4 | C | 2 | 5 |
sample_6 | C | 2 | 6 |
- data_type: the quantification data type of the tables in folder data_dir, either protein or gene. The default is protein.
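The input rules above can be checked before starting an evaluation. The following is a minimal sketch in base R and is not part of OmicsEV: the helper name `check_omicsev_inputs`, the file names, and the toy values are hypothetical. The requirement that every listed sample appears as a column of the data table and the negative-value check for log-scale data are assumptions made for this sketch.

```r
# Hypothetical helper (not part of OmicsEV): checks one data table and a
# sample list against the input rules described above, using base R only.
check_omicsev_inputs <- function(data_file, sample_list_file) {
  dat <- read.delim(data_file, check.names = FALSE, stringsAsFactors = FALSE,
                    na.strings = "NA")
  sam <- read.delim(sample_list_file, check.names = FALSE,
                    stringsAsFactors = FALSE)
  problems <- character(0)

  # The sample list must provide these four columns.
  required <- c("sample", "class", "batch", "order")
  missing_cols <- setdiff(required, colnames(sam))
  if (length(missing_cols) > 0) {
    return(paste("sample list lacks column(s):",
                 paste(missing_cols, collapse = ", ")))
  }

  # Each row must be a sample with a unique sample ID.
  if (anyDuplicated(sam$sample) > 0) {
    problems <- c(problems, "duplicated sample IDs in sample list")
  }

  # The order should run from 1 to the number of samples.
  if (!setequal(sam$order, seq_len(nrow(sam)))) {
    problems <- c(problems, "order is not a permutation of 1..n")
  }

  # Assumption: every listed sample is a column of the data table
  # (the first column of the data table holds the gene or protein ID).
  absent <- setdiff(as.character(sam$sample), colnames(dat)[-1])
  if (length(absent) > 0) {
    problems <- c(problems, paste("samples missing from data table:",
                                  paste(absent, collapse = ", ")))
  }

  # Heuristic only: negative values suggest log-transformed data.
  values <- as.matrix(dat[, -1, drop = FALSE])
  if (!is.numeric(values)) {
    problems <- c(problems, "non-numeric expression values found")
  } else if (any(values < 0, na.rm = TRUE)) {
    problems <- c(problems, "negative values found; data may be on a log scale")
  }

  problems
}

# Demonstration with toy files written to a temporary folder.
demo_dir <- tempfile("omicsev_demo_")
dir.create(demo_dir)
data_file <- file.path(demo_dir, "d1.tsv")
list_file <- file.path(demo_dir, "sample_list.tsv")

dat <- data.frame(ID = c("A1BG", "A2M"),
                  sample_1 = c(0.699, 3.085),
                  sample_2 = c(1.022, NA))
sam <- data.frame(sample = c("sample_1", "sample_2"),
                  class = c("T", "C"), batch = c(1, 1), order = c(1, 2))

# Missing values are written as "NA", as required for the data tables.
write.table(dat, data_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")
write.table(sam, list_file, sep = "\t", quote = FALSE, row.names = FALSE)

problems <- check_omicsev_inputs(data_file, list_file)
if (length(problems) == 0) message("inputs look consistent") else print(problems)
```

An empty result only means that none of these particular checks failed; it does not guarantee that the evaluation will run.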
All other parameters are optional. When the input data tables for
parameter data_dir are protein expression data and users also have
gene expression data for the same samples, users can set parameter x2
to a file that contains the gene expression data in tsv format, and
vice versa. If parameter x2 is not NULL, sample-wise and gene-wise
correlation analyses will be performed. See ?run_omics_evaluation for
a more in-depth description of all its arguments.
The parameter class_for_ml is also set in the above example. This parameter specifies the class information for class prediction. Either a sample list file or a character vector such as class_for_ml=c("T","C") is supported. If it is a sample list file, it must have the same format as the file for parameter “sample_list”. This is useful when the class users want to predict differs from the one in the file for parameter “sample_list”. OmicsEV uses an R S3 data class object to store the data table and sample annotation data, so this file also needs batch and order columns as a format requirement, although order and batch are not used in class prediction. The file can be created from the file for parameter “sample_list” by updating only the class column to the classes users want to predict. If users want to predict classes present in the file for parameter “sample_list”, only a character vector specifying the class names is needed, such as class_for_ml=c("T","C"). If sample class prediction is not needed, leave the parameter class_for_ml unset.
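As a rough illustration of the file-based option, a class_for_ml file can be derived from an existing sample list by replacing only the class column. The sketch below uses base R only; the helper name `make_class_file`, the temporary file names, and the new class labels ("male", "female") are hypothetical and chosen purely for demonstration.

```r
# Hypothetical helper (not part of OmicsEV): derive a class_for_ml file from
# an existing sample list by changing only the class column.
make_class_file <- function(sample_list_file, new_class, out_file) {
  sam <- read.delim(sample_list_file, check.names = FALSE,
                    stringsAsFactors = FALSE)
  ids <- as.character(sam$sample)
  # new_class: named character vector mapping sample ID -> class to predict.
  stopifnot(all(ids %in% names(new_class)))
  sam$class <- unname(new_class[ids])
  # batch and order are kept unchanged to preserve the required format.
  write.table(sam, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(sam)
}

# Demonstration with a temporary sample list.
sample_list_file <- tempfile(fileext = ".tsv")
write.table(data.frame(sample = paste0("sample_", 1:4),
                       class = c("T", "T", "C", "C"),
                       batch = 1,
                       order = 1:4),
            sample_list_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical labels for the class that should be predicted instead.
new_class <- c(sample_1 = "male", sample_2 = "female",
               sample_3 = "male", sample_4 = "female")
ml_file <- tempfile(fileext = ".tsv")
ml <- make_class_file(sample_list_file, new_class, ml_file)
print(ml)
```

The path stored in `ml_file` could then be supplied as the class_for_ml argument of run_omics_evaluation, in the same way that "sample_ml.tsv" is used in the example call.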
When the function finishes successfully, an HTML-based report that contains the different evaluation metrics is generated. Example reports are available at https://github.com/bzhanglab/OmicsEV.
Evaluation metrics implemented in OmicsEV
So far, more than 15 evaluation metrics have been implemented in
OmicsEV, and the evaluation results are organized in the following
structure:
- Introduction
- Overview
- Data depth
- Study-wise (#identified features, #quantifiable features)
- Sample-wise
- Missing value distribution (Non-missing value percentage in the data table)
- Data normalization
- Boxplot (Data distribution similarity)
- Density plot
- Batch effect
- Silhouette width (silhouette width)
- PCA with batch annotation (pcRegscale)
- Correlation heatmap
- Biological signal
- Correlation among protein complex members (complex_ks)
- Gene function prediction (func_auc)
- Sample class prediction (class_auc)
- PCA with sample class annotation
- Unsupervised clustering
- Platform reproducibility (optional with QC sample)
- Coefficient of variation distribution (median CV)
- Multi-omics concordance (optional with two omics)
- Gene-wise mRNA-protein correlation (gene wise cor)
- Sample-wise mRNA-protein correlation (sample wise cor)
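To make a few of the listed metrics more concrete, the sketch below computes a non-missing value percentage, a median CV over QC samples, and a gene-wise correlation between two omics tables in base R. It only illustrates the general ideas: the function names and toy matrices are hypothetical, and the exact definitions used inside OmicsEV (for example the correlation method, assumed here to be Spearman) may differ.

```r
# Illustrative base-R versions of three metrics; not OmicsEV's own code.

# Percentage of non-missing values in a feature-by-sample matrix.
non_missing_pct <- function(x) 100 * mean(!is.na(x))

# Median coefficient of variation (sd / mean) per feature across QC samples.
median_cv <- function(x, qc_samples) {
  qc <- x[, qc_samples, drop = FALSE]
  cv <- apply(qc, 1, function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE))
  median(cv, na.rm = TRUE)
}

# Gene-wise correlation between two matrices that share genes and samples.
# Assumption: Spearman correlation on pairwise complete observations.
gene_wise_cor <- function(x, y) {
  genes <- intersect(rownames(x), rownames(y))
  samples <- intersect(colnames(x), colnames(y))
  sapply(genes, function(g) {
    cor(x[g, samples], y[g, samples], method = "spearman",
        use = "pairwise.complete.obs")
  })
}

# Toy feature-by-sample matrix with two QC samples and one missing value.
x <- matrix(c(1, 2, 3, 4,
              2, 2, 2, 2,
              4, 3, NA, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "qc1", "qc2")))

# Toy mRNA and protein matrices for the same genes and samples.
rna <- matrix(c(1, 2, 3, 4,
                4, 3, 2, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("g1", "g2"), paste0("s", 1:4)))
prot <- matrix(c(2, 4, 6, 9,
                 1, 2, 3, 5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("g1", "g2"), paste0("s", 1:4)))

pct <- non_missing_pct(x)
cv <- median_cv(x, c("qc1", "qc2"))
gw <- gene_wise_cor(rna, prot)
print(list(non_missing_pct = pct, median_cv = cv, gene_wise_cor = gw))
```

In this sketch a higher non-missing percentage and a lower median CV indicate more complete and more reproducible data, while the gene-wise correlations summarize agreement between the two omics layers.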
OmicsEV evaluation report
A few example evaluation reports are available at https://github.com/bzhanglab/OmicsEV.
Session information
All software and respective versions used to produce this document are listed below.
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] formattable_0.2.1 kableExtra_1.3.4 dplyr_1.0.9
## [4] R.utils_2.11.0 R.oo_1.24.0 R.methodsS3_1.8.1
## [7] OmicsEV_0.99 xcms_3.14.1 MSnbase_2.18.0
## [10] ProtGenerics_1.24.0 S4Vectors_0.30.2 mzR_2.26.1
## [13] Rcpp_1.0.8.3 Biobase_2.52.0 BiocGenerics_0.38.0
## [16] BiocParallel_1.26.2 BiocStyle_2.20.2
##
## loaded via a namespace (and not attached):
## [1] Hmisc_4.7-0 svglite_2.1.0
## [3] corpcor_1.6.10 class_7.3-20
## [5] DiffCorr_0.4.2 foreach_1.5.2
## [7] crayon_1.5.1 MASS_7.3-57
## [9] CAMERA_1.48.0 nlme_3.1-157
## [11] backports_1.4.1 sva_3.40.0
## [13] ellipse_0.4.2 impute_1.66.0
## [15] rlang_1.0.4 XVector_0.32.0
## [17] caret_6.0-92 ROCR_1.0-11
## [19] limma_3.48.3 filelock_1.0.2
## [21] xgboost_1.6.0.1 rjson_0.2.21
## [23] bit64_4.0.5 glue_1.6.2
## [25] pheatmap_1.0.12 rngtools_1.5.2
## [27] AnnotationDbi_1.54.1 vsn_3.60.0
## [29] R2HTML_2.3.2 coop_0.6-3
## [31] tidyselect_1.1.2 SummarizedExperiment_1.22.0
## [33] XML_3.99-0.9 tidyr_1.2.0
## [35] ggpubr_0.4.0 MassSpecWavelet_1.58.0
## [37] xtable_1.8-4 magrittr_2.0.3
## [39] MsCoreUtils_1.4.0 evaluate_0.15
## [41] ncdf4_1.19 ggplot2_3.3.5
## [43] cli_3.3.0 zlibbioc_1.38.0
## [45] rstudioapi_0.13 doRNG_1.8.2
## [47] bslib_0.3.1 rpart_4.1.16
## [49] pls_2.8-0 lambda.r_1.2.4
## [51] prettydoc_0.4.1 xfun_0.30
## [53] clue_0.3-60 multtest_2.48.0
## [55] cluster_2.1.3 metaX_2.0.0
## [57] TSP_1.2-0 pcaMethods_1.84.0
## [59] KEGGREST_1.32.0 tibble_3.1.6
## [61] ggrepel_0.9.1 ape_5.6-2
## [63] listenv_0.8.0 Biostrings_2.60.2
## [65] png_0.1-7 future_1.25.0
## [67] ipred_0.9-12 withr_2.5.0
## [69] bitops_1.0-7 RBGL_1.68.0
## [71] plyr_1.8.7 mzID_1.30.0
## [73] hardhat_0.2.0 e1071_1.7-9
## [75] pROC_1.18.0 pillar_1.7.0
## [77] GlobalOptions_0.1.2 cachem_1.0.6
## [79] flexmix_2.3-17 kernlab_0.9-30
## [81] scatterplot3d_0.3-41 GetoptLong_1.0.5
## [83] vctrs_0.4.1 ellipsis_0.3.2
## [85] generics_0.1.2 SSPA_2.30.0
## [87] lava_1.6.10 tools_4.1.1
## [89] foreign_0.8-82 faahKO_1.32.0
## [91] munsell_0.5.0 proxy_0.4-26
## [93] DelayedArray_0.18.0 abind_1.4-5
## [95] fastmap_1.1.0 compiler_4.1.1
## [97] plotly_4.10.0 BBmisc_1.12
## [99] GenomeInfoDbData_1.2.6 prodlim_2019.11.13
## [101] gridExtra_2.3 edgeR_3.34.1
## [103] lattice_0.20-45 utf8_1.2.2
## [105] BiocFileCache_2.0.0 recipes_0.2.0
## [107] jsonlite_1.8.0 affy_1.70.0
## [109] kBET_0.99.6 scales_1.2.0
## [111] graph_1.70.0 carData_3.0-5
## [113] lazyeval_0.2.2 genefilter_1.74.1
## [115] car_3.0-13 doParallel_1.0.17
## [117] latticeExtra_0.6-29 DiscriMiner_0.1-29
## [119] missForest_1.5 checkmate_2.1.0
## [121] rmarkdown_2.14 rARPACK_0.11-0
## [123] webshot_0.5.3 mixOmics_6.16.3
## [125] softImpute_1.4-1 igraph_1.3.1
## [127] survival_3.3-1 yaml_2.3.5
## [129] systemfonts_1.0.4 prabclus_2.3-2
## [131] htmltools_0.5.2 memoise_2.0.1
## [133] modeltools_0.2-23 locfit_1.5-9.5
## [135] seriation_1.3.5 IRanges_2.26.0
## [137] viridisLite_0.4.0 digest_0.6.29
## [139] assertthat_0.2.1 rappdirs_0.3.3
## [141] futile.options_1.0.1 registry_0.5-1
## [143] RSQLite_2.2.13 future.apply_1.9.0
## [145] VennDiagram_1.7.3 data.table_1.14.2
## [147] blob_1.2.3 futile.logger_1.4.3
## [149] preprocessCore_1.54.0 splines_4.1.1
## [151] Formula_1.2-4 fpc_2.2-9
## [153] Cairo_1.5-15 RCurl_1.98-1.6
## [155] broom_0.8.0 hms_1.1.1
## [157] colorspace_2.0-3 base64enc_0.1-3
## [159] BiocManager_1.30.17 GenomicRanges_1.44.0
## [161] shape_1.4.6 nnet_7.3-17
## [163] sass_0.4.1 mclust_5.4.9
## [165] RANN_2.6.1 circlize_0.4.14
## [167] ropls_1.24.0 fansi_1.0.3
## [169] tzdb_0.3.0 Nozzle.R1_1.1-1
## [171] parallelly_1.31.1 ModelMetrics_1.2.2.2
## [173] R6_2.5.1 grid_4.1.1
## [175] lifecycle_1.0.1 formatR_1.12
## [177] itertools_0.1-3 ggsignif_0.6.3
## [179] curl_4.3.2 affyio_1.62.0
## [181] jquerylib_0.1.4 robustbase_0.95-0
## [183] fastcluster_1.2.3 Matrix_1.4-1
## [185] qvalue_2.24.0 NetSAM_1.31.1
## [187] RColorBrewer_1.1-3 iterators_1.0.14
## [189] stringr_1.4.0 gower_1.0.0
## [191] htmlwidgets_1.5.4 biomaRt_2.48.3
## [193] purrr_0.3.4 rvest_1.0.2
## [195] ComplexHeatmap_2.8.0 MALDIquant_1.21
## [197] mgcv_1.8-40 globals_0.14.0
## [199] htmlTable_2.4.0 codetools_0.2-18
## [201] matrixStats_0.62.0 lubridate_1.8.0
## [203] GO.db_3.13.0 FNN_1.1.3
## [205] randomForest_4.7-1 prettyunits_1.1.1
## [207] dbplyr_2.1.1 RSpectra_0.16-1
## [209] GenomeInfoDb_1.28.4 gtable_0.3.0
## [211] DBI_1.1.2 dynamicTreeCut_1.63-1
## [213] highr_0.9 httr_1.4.2
## [215] stringi_1.7.6 progress_1.2.2
## [217] reshape2_1.4.4 diptest_0.76-0
## [219] annotate_1.70.0 fdrtool_1.2.17
## [221] timeDate_3043.102 xml2_1.3.3
## [223] boot_1.3-28 WGCNA_1.71
## [225] readr_2.1.2 DEoptimR_1.0-11
## [227] bit_4.0.4 jpeg_0.1-9
## [229] MatrixGenerics_1.4.3 pkgconfig_2.0.3
## [231] rstatix_0.7.0 bootstrap_2019.6
## [233] knitr_1.39