Introduction

High-throughput technologies such as RNA-Seq and mass spectrometry-based proteomics are increasingly being applied to large sample cohorts, generating vast amounts of quantitative data for genes and proteins. Many algorithms, software tools, and pipelines have been developed to analyze these data. However, selecting the optimal algorithms, software, and parameters for a specific omics dataset remains a significant challenge. To address this challenge, we have developed an R package named OmicsEV, which is dedicated to comparing and evaluating different data matrices generated from the same omics dataset using different tools, algorithms, or parameter settings. OmicsEV implements more than 15 evaluation metrics, and all evaluation results are included in an HTML report for intuitive browsing. OmicsEV is easy to install and use: only one function is needed to perform the whole evaluation process. A GUI based on R Shiny is also implemented.

Example data

A few examples can be downloaded at https://github.com/bzhanglab/OmicsEV. One of the examples contains 6 data matrices generated from the same RNA dataset using different normalization methods. In addition, a proteomics data matrix and a sample list are also included. How to run this example is shown below.

Running OmicsEV

Preparing inputs

The two major input files are the omics data tables and a sample annotation file. More details can be found below.

Running evaluation process

In OmicsEV, only one function (run_omics_evaluation) is needed to perform the whole evaluation process. An example is shown below:

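library(OmicsEV)
run_omics_evaluation(data_dir = "datasets/",            # folder with the data tables to compare
                     sample_list = "sample_list.tsv",   # sample class, batch and order
                     x2 = "protein.tsv",                # second omics table for the same samples
                     cpu = 6,
                     data_type = "gene",                # type of the tables in data_dir
                     class_for_ml = "sample_ml.tsv")    # classes used for class prediction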

In general, only a few parameters have to be set:

data_dir: a folder that contains the datasets (data tables) to be evaluated, in tsv format. All datasets must have the same format. In these files, the first column must be the gene ID or protein ID. The expression values of genes or proteins must be on a non-log scale. Missing values must be represented as "NA". Even if there is only one data table, it must be placed in a folder. An example input dataset is shown below:

An example of input dataset
ID	sample_1	sample_2	sample_3	sample_4	sample_5	sample_6
A1BG	0.699	1.022	0.256	1.322	0.854	0.525
A2M	3.085	0.392	0.681	0.540	0.930	0.757
A2ML1	1.398	1.106	0.981	0.954	1.869	0.790
A4GALT	0.364	1.340	2.035	3.158	1.725	0.280
AAAS	0.802	1.019	1.634	0.695	0.308	0.829
AACS	0.689	0.505	0.420	1.069	0.266	0.333
AADAT	1.312	2.429	2.344	1.491	0.983	0.467
AAED1	2.800	1.263	0.935	0.716	0.201	2.055
AAGAB	0.230	1.149	0.634	0.599	1.753	0.810
AAK1	2.317	0.713	1.407	1.410	1.336	0.617
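
The format requirements above can be checked before running the evaluation. The following is a minimal sketch in base R, not part of OmicsEV: the helper name check_data_table is hypothetical, and the test for negative values is only a heuristic for spotting log-transformed data. The demo file is adapted from the first rows of the example table, with one value set to NA for illustration.

```r
# Hypothetical helper (not part of OmicsEV): basic checks on one input table.
check_data_table <- function(path) {
  # Read the tab-separated table; the string "NA" is parsed as a missing value.
  x <- read.delim(path, header = TRUE, sep = "\t", na.strings = "NA",
                  check.names = FALSE, stringsAsFactors = FALSE)
  values <- x[, -1, drop = FALSE]  # every column after the ID column
  problems <- character(0)
  if (!all(vapply(values, is.numeric, logical(1)))) {
    # Any other missing-value code (e.g. "N/A") turns a column into text.
    problems <- c(problems, "non-numeric sample columns")
  } else if (any(as.matrix(values) < 0, na.rm = TRUE)) {
    # Heuristic only: negative values suggest log-transformed data.
    problems <- c(problems, "negative values (possibly log scale)")
  }
  problems
}

# Small demo file adapted from the example table (one value replaced by NA).
demo_file <- tempfile(fileext = ".tsv")
writeLines(c("ID\tsample_1\tsample_2\tsample_3",
             "A1BG\t0.699\t1.022\t0.256",
             "A2M\t3.085\tNA\t0.681"), demo_file)
problems <- check_data_table(demo_file)
print(problems)
```

An empty result means none of these checks failed; it does not guarantee that OmicsEV will accept the file.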

sample_list: a file in tsv format that contains the sample ID, sample class, batch and order information. An example sample list file is shown below. If there is no batch design in the study, the batch for all samples can be set to 1. The order is typically the order in which data were generated for the samples. It should start from 1, and its maximum is the number of samples in the table. The order is only used for sorting samples in some plots; it is not used to calculate any quantitative metric. If the data generation order is unknown, an arbitrary order can therefore be assigned to the samples. If there are QC samples in the data table, the class for these samples should be set to "QC", and OmicsEV will generate metrics based on them. QC samples are optional: if the data table contains none, the QC-related metrics will not be generated. In the sample list file, each row is a sample with a unique sample ID, so no two rows may share the same sample ID.

An example of sample list
sample	class	batch	order
sample_1	T	1	1
sample_3	T	1	2
sample_5	C	1	3
sample_2	T	2	4
sample_4	C	2	5
sample_6	C	2	6
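
The sample list rules above (required columns, unique sample IDs, order running from 1 to the number of samples) can be verified with a few lines of base R. This is a sketch under stated assumptions: check_sample_list is a hypothetical helper, not an OmicsEV function, and it only tests the constraints described in this document.

```r
# Hypothetical helper (not part of OmicsEV): checks a sample list data frame.
check_sample_list <- function(samples) {
  required <- c("sample", "class", "batch", "order")
  missing_cols <- setdiff(required, colnames(samples))
  if (length(missing_cols) > 0) {
    return(paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  problems <- character(0)
  # Each row must describe a sample with a unique ID.
  if (anyDuplicated(samples$sample) > 0) {
    problems <- c(problems, "duplicated sample IDs")
  }
  # The order should start at 1 and end at the number of samples.
  if (min(samples$order) != 1 || max(samples$order) != nrow(samples)) {
    problems <- c(problems, "order does not run from 1 to the number of samples")
  }
  problems
}

# The example sample list from this document, built as a data frame.
samples <- data.frame(
  sample = c("sample_1", "sample_3", "sample_5", "sample_2", "sample_4", "sample_6"),
  class  = c("T", "T", "C", "T", "C", "C"),
  batch  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6,
  stringsAsFactors = FALSE
)
sample_problems <- check_sample_list(samples)
print(sample_problems)
```

When reading a sample list from disk with read.delim, note that a class column containing only T and F would be parsed as logical values; setting colClasses to "character" for that column avoids this.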

data_type: the type of quantification data in the data_dir folder, either protein or gene. The default is protein.

All other parameters are optional. When the input data tables in data_dir are protein expression data and gene expression data are also available for the same samples, the parameter x2 can be set to a file in tsv format that contains the gene expression data, and vice versa. If the parameter x2 is not NULL, sample-wise and gene-wise correlation analysis will be performed. See ?run_omics_evaluation for a more in-depth description of all arguments.
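
To make the distinction between gene-wise and sample-wise correlation concrete, the sketch below computes both on two small simulated matrices in base R. It is an illustration only: the data are random, the variable names are hypothetical, and the use of Spearman correlation is an assumption rather than a statement about how OmicsEV computes these metrics.

```r
set.seed(1)
genes <- c("A1BG", "A2M", "AAAS", "AACS")
sample_ids <- paste0("sample_", 1:6)

# Simulated mRNA matrix, and a protein matrix derived from it with added noise.
rna <- matrix(rexp(24), nrow = 4, dimnames = list(genes, sample_ids))
protein <- rna * matrix(runif(24, 0.5, 1.5), nrow = 4)

# Restrict both tables to the genes and samples they have in common.
shared_genes <- intersect(rownames(rna), rownames(protein))
shared_samples <- intersect(colnames(rna), colnames(protein))
rna_s <- rna[shared_genes, shared_samples]
pro_s <- protein[shared_genes, shared_samples]

# Gene-wise: one correlation per gene, computed across samples.
gene_wise <- vapply(shared_genes, function(g) {
  cor(rna_s[g, ], pro_s[g, ], method = "spearman", use = "pairwise.complete.obs")
}, numeric(1))

# Sample-wise: one correlation per sample, computed across genes.
sample_wise <- vapply(shared_samples, function(s) {
  cor(rna_s[, s], pro_s[, s], method = "spearman", use = "pairwise.complete.obs")
}, numeric(1))

print(round(gene_wise, 2))
print(round(sample_wise, 2))
```

Gene-wise values describe how well each gene's mRNA and protein levels track each other across the cohort, while sample-wise values describe the agreement of the two profiles within a single sample.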

The parameter class_for_ml is also set in the example above. This parameter specifies the class information for class prediction. Either a sample list file or a character vector such as class_for_ml=c("T","C") is supported. If a sample list file is given, it must have the same format as the file for the parameter sample_list. This is useful when the class to be predicted is different from the one in the file for the parameter sample_list. OmicsEV uses an R S3 class object to store the data table and sample annotation data, so the file must also contain the batch and order columns as a format requirement, although batch and order are not used in class prediction. Such a file can be created from the file for the parameter sample_list by updating only the class column to the classes to be predicted. If the class to be predicted is already present in the file for the parameter sample_list, only a character vector of class names is needed, such as class_for_ml=c("T","C"). If sample class prediction is not needed, leave the parameter class_for_ml unset.
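
Creating a class_for_ml file from an existing sample list, as described above, can be sketched in base R as follows. The new class labels ("R" and "NR") and the mapping new_class are hypothetical and exist only for illustration; the output file name follows the example call in this document.

```r
# The sample list from this document (class T/C), built as a data frame.
sample_list <- data.frame(
  sample = c("sample_1", "sample_3", "sample_5", "sample_2", "sample_4", "sample_6"),
  class  = c("T", "T", "C", "T", "C", "C"),
  batch  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6,
  stringsAsFactors = FALSE
)

# Hypothetical labels to predict instead of T/C, keyed by sample ID.
new_class <- c(sample_1 = "R", sample_2 = "NR", sample_3 = "R",
               sample_4 = "NR", sample_5 = "R", sample_6 = "NR")

# Copy the sample list and replace only the class column;
# batch and order are kept because the file format requires them.
ml_list <- sample_list
ml_list$class <- unname(new_class[ml_list$sample])

# Write the result as a tab-separated file in a temporary directory.
ml_file <- file.path(tempdir(), "sample_ml.tsv")
write.table(ml_list, ml_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The path in ml_file can then be passed as class_for_ml in place of "sample_ml.tsv" in the example call.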

When the function is finished successfully, an HTML-based report that contains different evaluation metrics will be generated. Example reports are available at https://github.com/bzhanglab/OmicsEV.

Evaluation metrics implemented in OmicsEV

So far, more than 15 evaluation metrics have been implemented in OmicsEV, and the evaluation results are organized in the following structure:

Introduction
Overview
Data depth
   1. Study-wise (#identified features, #quantifiable features)
   2. Sample-wise
   3. Missing value distribution (Non-missing value percentage in the data table)
Data normalization
   1. Boxplot (Data distribution similarity)
   2. Density plot
Batch effect
   1. Silhouette width (silhouette width)
   2. PCA with batch annotation (pcRegscale)
   3. Correlation heatmap
Biological signal
   1. Correlation among protein complex members (complex_ks)
   2. Gene function prediction (func_auc)
   3. Sample class prediction (class_auc)
   4. PCA with sample class annotation
   5. Unsupervised clustering
Platform reproducibility (optional with QC sample)
   1. Coefficient of variation distribution (median CV)
Multi-omics concordance (optional with two omics)
   1. Gene-wise mRNA-protein correlation (gene wise cor)
   2. Sample-wise mRNA-protein correlation (sample wise cor)
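
As an illustration of two of the simpler metrics listed above, the sketch below computes the non-missing value percentage of a data table and the median coefficient of variation (CV) across QC samples on a small made-up matrix. The data and variable names are hypothetical, and the definitions used here (CV as standard deviation divided by mean, per feature) are common conventions that may differ in detail from what OmicsEV reports.

```r
# Made-up data table: 3 features x 6 samples, with two missing values.
x <- matrix(c(1.0, 2.0, 1.2, 0.5, 0.8, 1.5,
              2.0, NA,  2.0, 1.0, 2.0, 3.0,
              0.5, 0.7, NA,  0.9, 0.6, 0.4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:6)))

# Sample classes in column order; QC samples carry the class "QC".
sample_class <- c("QC", "T", "QC", "C", "QC", "T")

# Data depth: percentage of non-missing values in the whole table.
non_missing_pct <- 100 * mean(!is.na(x))

# Platform reproducibility: per-feature CV across the QC samples only.
qc <- x[, sample_class == "QC", drop = FALSE]
cv <- apply(qc, 1, function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE))
median_cv <- median(cv, na.rm = TRUE)

print(non_missing_pct)
print(median_cv)
```

A lower median CV across QC samples indicates better reproducibility, which is why this metric is only available when QC samples are present in the data table.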

OmicsEV evaluation report

A few example evaluation reports are available at https://github.com/bzhanglab/OmicsEV.

Session information

All software and respective versions used to produce this document are listed below.

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] formattable_0.2.1   kableExtra_1.3.4    dplyr_1.0.9        
##  [4] R.utils_2.11.0      R.oo_1.24.0         R.methodsS3_1.8.1  
##  [7] OmicsEV_0.99        xcms_3.14.1         MSnbase_2.18.0     
## [10] ProtGenerics_1.24.0 S4Vectors_0.30.2    mzR_2.26.1         
## [13] Rcpp_1.0.8.3        Biobase_2.52.0      BiocGenerics_0.38.0
## [16] BiocParallel_1.26.2 BiocStyle_2.20.2   
## 
## loaded via a namespace (and not attached):
##   [1] Hmisc_4.7-0                 svglite_2.1.0              
##   [3] corpcor_1.6.10              class_7.3-20               
##   [5] DiffCorr_0.4.2              foreach_1.5.2              
##   [7] crayon_1.5.1                MASS_7.3-57                
##   [9] CAMERA_1.48.0               nlme_3.1-157               
##  [11] backports_1.4.1             sva_3.40.0                 
##  [13] ellipse_0.4.2               impute_1.66.0              
##  [15] rlang_1.0.4                 XVector_0.32.0             
##  [17] caret_6.0-92                ROCR_1.0-11                
##  [19] limma_3.48.3                filelock_1.0.2             
##  [21] xgboost_1.6.0.1             rjson_0.2.21               
##  [23] bit64_4.0.5                 glue_1.6.2                 
##  [25] pheatmap_1.0.12             rngtools_1.5.2             
##  [27] AnnotationDbi_1.54.1        vsn_3.60.0                 
##  [29] R2HTML_2.3.2                coop_0.6-3                 
##  [31] tidyselect_1.1.2            SummarizedExperiment_1.22.0
##  [33] XML_3.99-0.9                tidyr_1.2.0                
##  [35] ggpubr_0.4.0                MassSpecWavelet_1.58.0     
##  [37] xtable_1.8-4                magrittr_2.0.3             
##  [39] MsCoreUtils_1.4.0           evaluate_0.15              
##  [41] ncdf4_1.19                  ggplot2_3.3.5              
##  [43] cli_3.3.0                   zlibbioc_1.38.0            
##  [45] rstudioapi_0.13             doRNG_1.8.2                
##  [47] bslib_0.3.1                 rpart_4.1.16               
##  [49] pls_2.8-0                   lambda.r_1.2.4             
##  [51] prettydoc_0.4.1             xfun_0.30                  
##  [53] clue_0.3-60                 multtest_2.48.0            
##  [55] cluster_2.1.3               metaX_2.0.0                
##  [57] TSP_1.2-0                   pcaMethods_1.84.0          
##  [59] KEGGREST_1.32.0             tibble_3.1.6               
##  [61] ggrepel_0.9.1               ape_5.6-2                  
##  [63] listenv_0.8.0               Biostrings_2.60.2          
##  [65] png_0.1-7                   future_1.25.0              
##  [67] ipred_0.9-12                withr_2.5.0                
##  [69] bitops_1.0-7                RBGL_1.68.0                
##  [71] plyr_1.8.7                  mzID_1.30.0                
##  [73] hardhat_0.2.0               e1071_1.7-9                
##  [75] pROC_1.18.0                 pillar_1.7.0               
##  [77] GlobalOptions_0.1.2         cachem_1.0.6               
##  [79] flexmix_2.3-17              kernlab_0.9-30             
##  [81] scatterplot3d_0.3-41        GetoptLong_1.0.5           
##  [83] vctrs_0.4.1                 ellipsis_0.3.2             
##  [85] generics_0.1.2              SSPA_2.30.0                
##  [87] lava_1.6.10                 tools_4.1.1                
##  [89] foreign_0.8-82              faahKO_1.32.0              
##  [91] munsell_0.5.0               proxy_0.4-26               
##  [93] DelayedArray_0.18.0         abind_1.4-5                
##  [95] fastmap_1.1.0               compiler_4.1.1             
##  [97] plotly_4.10.0               BBmisc_1.12                
##  [99] GenomeInfoDbData_1.2.6      prodlim_2019.11.13         
## [101] gridExtra_2.3               edgeR_3.34.1               
## [103] lattice_0.20-45             utf8_1.2.2                 
## [105] BiocFileCache_2.0.0         recipes_0.2.0              
## [107] jsonlite_1.8.0              affy_1.70.0                
## [109] kBET_0.99.6                 scales_1.2.0               
## [111] graph_1.70.0                carData_3.0-5              
## [113] lazyeval_0.2.2              genefilter_1.74.1          
## [115] car_3.0-13                  doParallel_1.0.17          
## [117] latticeExtra_0.6-29         DiscriMiner_0.1-29         
## [119] missForest_1.5              checkmate_2.1.0            
## [121] rmarkdown_2.14              rARPACK_0.11-0             
## [123] webshot_0.5.3               mixOmics_6.16.3            
## [125] softImpute_1.4-1            igraph_1.3.1               
## [127] survival_3.3-1              yaml_2.3.5                 
## [129] systemfonts_1.0.4           prabclus_2.3-2             
## [131] htmltools_0.5.2             memoise_2.0.1              
## [133] modeltools_0.2-23           locfit_1.5-9.5             
## [135] seriation_1.3.5             IRanges_2.26.0             
## [137] viridisLite_0.4.0           digest_0.6.29              
## [139] assertthat_0.2.1            rappdirs_0.3.3             
## [141] futile.options_1.0.1        registry_0.5-1             
## [143] RSQLite_2.2.13              future.apply_1.9.0         
## [145] VennDiagram_1.7.3           data.table_1.14.2          
## [147] blob_1.2.3                  futile.logger_1.4.3        
## [149] preprocessCore_1.54.0       splines_4.1.1              
## [151] Formula_1.2-4               fpc_2.2-9                  
## [153] Cairo_1.5-15                RCurl_1.98-1.6             
## [155] broom_0.8.0                 hms_1.1.1                  
## [157] colorspace_2.0-3            base64enc_0.1-3            
## [159] BiocManager_1.30.17         GenomicRanges_1.44.0       
## [161] shape_1.4.6                 nnet_7.3-17                
## [163] sass_0.4.1                  mclust_5.4.9               
## [165] RANN_2.6.1                  circlize_0.4.14            
## [167] ropls_1.24.0                fansi_1.0.3                
## [169] tzdb_0.3.0                  Nozzle.R1_1.1-1            
## [171] parallelly_1.31.1           ModelMetrics_1.2.2.2       
## [173] R6_2.5.1                    grid_4.1.1                 
## [175] lifecycle_1.0.1             formatR_1.12               
## [177] itertools_0.1-3             ggsignif_0.6.3             
## [179] curl_4.3.2                  affyio_1.62.0              
## [181] jquerylib_0.1.4             robustbase_0.95-0          
## [183] fastcluster_1.2.3           Matrix_1.4-1               
## [185] qvalue_2.24.0               NetSAM_1.31.1              
## [187] RColorBrewer_1.1-3          iterators_1.0.14           
## [189] stringr_1.4.0               gower_1.0.0                
## [191] htmlwidgets_1.5.4           biomaRt_2.48.3             
## [193] purrr_0.3.4                 rvest_1.0.2                
## [195] ComplexHeatmap_2.8.0        MALDIquant_1.21            
## [197] mgcv_1.8-40                 globals_0.14.0             
## [199] htmlTable_2.4.0             codetools_0.2-18           
## [201] matrixStats_0.62.0          lubridate_1.8.0            
## [203] GO.db_3.13.0                FNN_1.1.3                  
## [205] randomForest_4.7-1          prettyunits_1.1.1          
## [207] dbplyr_2.1.1                RSpectra_0.16-1            
## [209] GenomeInfoDb_1.28.4         gtable_0.3.0               
## [211] DBI_1.1.2                   dynamicTreeCut_1.63-1      
## [213] highr_0.9                   httr_1.4.2                 
## [215] stringi_1.7.6               progress_1.2.2             
## [217] reshape2_1.4.4              diptest_0.76-0             
## [219] annotate_1.70.0             fdrtool_1.2.17             
## [221] timeDate_3043.102           xml2_1.3.3                 
## [223] boot_1.3-28                 WGCNA_1.71                 
## [225] readr_2.1.2                 DEoptimR_1.0-11            
## [227] bit_4.0.4                   jpeg_0.1-9                 
## [229] MatrixGenerics_1.4.3        pkgconfig_2.0.3            
## [231] rstatix_0.7.0               bootstrap_2019.6           
## [233] knitr_1.39

OmicsEV: A tool for large scale omics data tables evaluation

Bo Wen

2022-09-15