Introduction

High-throughput technologies such as RNA-Seq and mass spectrometry-based proteomics are increasingly being applied to large sample cohorts, generating vast amounts of quantitative data for genes and proteins. Many algorithms, software tools, and pipelines have been developed to analyze these data. However, selecting the optimal algorithms, software, and parameters for a specific omics dataset remains a significant challenge. To address this challenge, we have developed an R package named OmicsEV, which is dedicated to comparing and evaluating different data matrices generated from the same omics dataset using different tools, algorithms, or parameter settings. OmicsEV implements more than 20 evaluation metrics, and all evaluation results are included in an HTML report for intuitive browsing. OmicsEV is easy to install and use: only one function is needed to perform the whole evaluation process. A GUI based on R Shiny is also implemented.

Example data

An example dataset can be downloaded at https://github.com/bzhanglab/OmicsEV. It contains 6 data matrices generated from the same proteomics dataset using different normalization methods. An RNA-Seq data matrix and a sample list are also included.

Running OmicsEV

Preparing inputs

Running the evaluation process

In OmicsEV, only one function (run_omics_evaluation) is needed to perform the whole evaluation process. An example is shown below:
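A minimal sketch of such a call, using only the three parameters described below. The folder name datasets and the file name sample_list.tsv are placeholder assumptions, and the call is guarded so that it runs only when OmicsEV is installed and both inputs exist.

```r
# Placeholder inputs (assumptions for illustration; use your own paths).
data_dir    <- "datasets"          # folder of TSV data matrices
sample_list <- "sample_list.tsv"   # TSV with sample, class, batch, order

# Run only if the inputs exist and OmicsEV is installed.
inputs_ready <- dir.exists(data_dir) && file.exists(sample_list)
if (inputs_ready && requireNamespace("OmicsEV", quietly = TRUE)) {
  OmicsEV::run_omics_evaluation(data_dir    = data_dir,
                                sample_list = sample_list,
                                data_type   = "protein")
}
```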

In general, only a few parameters have to be set:

  • data_dir: a folder that contains the datasets to be evaluated, in TSV format. All datasets must be in the same format. In these files, the first column must be the gene ID or protein ID. The gene or protein expression values must be in non-log scale. An example input dataset is shown below:
An example input dataset
ID sample_1 sample_2 sample_3 sample_4 sample_5 sample_6
A1BG 0.699 1.022 0.256 1.322 0.854 0.525
A2M 3.085 0.392 0.681 0.540 0.930 0.757
A2ML1 1.398 1.106 0.981 0.954 1.869 0.790
A4GALT 0.364 1.340 2.035 3.158 1.725 0.280
AAAS 0.802 1.019 1.634 0.695 0.308 0.829
AACS 0.689 0.505 0.420 1.069 0.266 0.333
AADAT 1.312 2.429 2.344 1.491 0.983 0.467
AAED1 2.800 1.263 0.935 0.716 0.201 2.055
AAGAB 0.230 1.149 0.634 0.599 1.753 0.810
AAK1 2.317 0.713 1.407 1.410 1.336 0.617
  • sample_list: a TSV file that describes the samples. This file must contain four columns: sample ID, sample class, batch, and order. An example sample list file is shown below:
An example sample list
sample class batch order
sample_1 T 1 1
sample_3 T 1 2
sample_5 C 1 3
sample_2 T 2 4
sample_4 C 2 5
sample_6 C 2 6
  • data_type: the type of quantification data in the folder data_dir, either protein or gene. The default is protein.
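The input layout described above can be sketched in base R as follows. The example writes a toy data matrix and a sample list as TSV files into a temporary folder and then checks them against the stated requirements. The file names and the helper check_inputs() are hypothetical and not part of OmicsEV; the non-negativity check is only a rough heuristic for spotting log-transformed values.

```r
# Toy data matrix: first column is the ID, remaining columns are samples,
# values in non-log scale (taken from the example table above).
expr <- data.frame(
  ID       = c("A1BG", "A2M", "A2ML1"),
  sample_1 = c(0.699, 3.085, 1.398),
  sample_2 = c(1.022, 0.392, 1.106),
  sample_3 = c(0.256, 0.681, 0.981),
  sample_4 = c(1.322, 0.540, 0.954),
  sample_5 = c(0.854, 0.930, 1.869),
  sample_6 = c(0.525, 0.757, 0.790)
)

# Sample list with the four required columns.
samples <- data.frame(
  sample = c("sample_1", "sample_3", "sample_5",
             "sample_2", "sample_4", "sample_6"),
  class  = c("T", "T", "C", "T", "C", "C"),
  batch  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6
)

# Write both tables as tab-separated files into a temporary folder.
data_dir <- file.path(tempdir(), "datasets")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
data_file   <- file.path(data_dir, "d1.tsv")             # hypothetical name
sample_file <- file.path(tempdir(), "sample_list.tsv")   # hypothetical name
write.table(expr, data_file, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(samples, sample_file, sep = "\t", quote = FALSE, row.names = FALSE)

# check_inputs() is a hypothetical helper, not an OmicsEV function.
check_inputs <- function(data_file, sample_file) {
  x <- read.delim(data_file, check.names = FALSE, stringsAsFactors = FALSE)
  s <- read.delim(sample_file, stringsAsFactors = FALSE)
  values <- as.matrix(x[, -1, drop = FALSE])
  c(
    sample_columns = all(c("sample", "class", "batch", "order") %in% names(s)),
    samples_match  = setequal(colnames(values), s$sample),
    numeric_values = is.numeric(values),
    # Rough heuristic: negative values hint at log-transformed data.
    non_negative   = all(values >= 0, na.rm = TRUE)
  )
}

checks <- check_inputs(data_file, sample_file)
print(checks)
```

If any element of checks is FALSE, the corresponding input file likely needs to be fixed before running the evaluation.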

All other parameters are optional. When the input datasets in data_dir are protein expression data and gene expression data are also available for the same samples, parameter x2 can be set to a TSV file that contains the gene expression data, and vice versa. If parameter x2 is not NULL, sample-wise and gene-wise correlation analyses will be performed. See ?run_omics_evaluation for a more in-depth description of all arguments.
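To make the two correlation analyses concrete, the sketch below computes gene-wise and sample-wise correlations between a protein matrix and an mRNA matrix in base R. The toy matrices, the choice of Spearman correlation, and the restriction to shared genes and samples are assumptions for illustration; this is not OmicsEV's internal implementation.

```r
set.seed(1)
genes   <- paste0("gene_", 1:5)
samples <- paste0("sample_", 1:6)

# Toy non-log expression matrices (random values, for illustration only).
protein <- matrix(rexp(30), nrow = 5, dimnames = list(genes, samples))
mrna    <- matrix(as.vector(protein) * runif(30, 0.5, 1.5), nrow = 5,
                  dimnames = list(genes, samples))

# Restrict both matrices to shared genes and samples, in the same order.
g <- intersect(rownames(protein), rownames(mrna))
s <- intersect(colnames(protein), colnames(mrna))
p <- protein[g, s]
m <- mrna[g, s]

# Gene-wise: one correlation per gene, computed across samples.
gene_cor <- sapply(g, function(i) cor(p[i, ], m[i, ], method = "spearman"))

# Sample-wise: one correlation per sample, computed across genes.
sample_cor <- sapply(s, function(j) cor(p[, j], m[, j], method = "spearman"))

print(round(gene_cor, 2))
print(round(sample_cor, 2))
```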

When the function is finished successfully, an HTML-based report that contains different evaluation metrics will be generated. An example report is available at https://github.com/bzhanglab/OmicsEV.

Evaluation metrics implemented in OmicsEV

So far, more than 20 evaluation metrics have been implemented in OmicsEV.

  1. Identified proteins/genes;
  2. Quantified proteins/genes;
  3. Overlapping proteins/genes across all datasets;
  4. Protein/gene number distribution across samples;
  5. Protein or gene expression distribution: boxplot;
  6. Protein or gene expression distribution: density plot;
  7. Sample correlation heatmap;
  8. Batch effect evaluation using kBET (Büttner et al. 2018);
  9. Batch effect evaluation using silhouette width;
  10. Batch effect evaluation based on PCA regression;
  11. Batch effect evaluation using PCA score plot;
  12. Protein or gene coefficient of variation (CV) distribution;
  13. Missing value distribution;
  14. Unsupervised analysis of samples: PCA;
  15. Unsupervised analysis of samples: cluster analysis;
  16. Correlation based on complexes;
  17. Correlation between mRNA and protein: gene wise;
  18. Correlation between mRNA and protein: sample wise;
  19. Phenotype prediction;
  20. Co-expression network based function prediction.
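
As an illustration of what two of the simpler metrics measure (the coefficient of variation in item 12 and missing values in item 13), the base R sketch below computes both on a toy matrix. The matrix and the exact formulas (standard deviation divided by mean per protein, fraction of NA values per sample) are assumptions for illustration and may differ from how OmicsEV computes these metrics.

```r
# Toy non-log expression matrix: rows are proteins, columns are samples.
x <- matrix(c(1.0, 2.0, 3.0,
              2.0, 2.0, 2.0,
              NA,  4.0, 6.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("P1", "P2", "P3"), c("s1", "s2", "s3")))

# Coefficient of variation per protein, ignoring missing values.
cv <- apply(x, 1, function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE))

# Fraction of missing values per sample.
missing_per_sample <- colMeans(is.na(x))

print(cv)
print(missing_per_sample)
```

A lower CV distribution across proteins and fewer missing values per sample are generally read as signs of a more consistent data matrix.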

OmicsEV evaluation report

A few example evaluation reports are available at https://github.com/bzhanglab/OmicsEV.

Session information

All software and respective versions used to produce this document are listed below.

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] formattable_0.2.0.1 kableExtra_1.1.0    dplyr_0.8.3        
##  [4] R.utils_2.9.0       R.oo_1.22.0         R.methodsS3_1.7.1  
##  [7] OmicsEV_0.99        xcms_3.6.1          MSnbase_2.10.1     
## [10] ProtGenerics_1.16.0 S4Vectors_0.22.0    mzR_2.18.0         
## [13] Rcpp_1.0.1          BiocParallel_1.18.0 Biobase_2.44.0     
## [16] BiocGenerics_0.30.0 BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##   [1] prabclus_2.3-1         ModelMetrics_1.2.2     tidyr_0.8.3           
##   [4] missForest_1.4         ggplot2_3.2.0          acepack_1.4.1         
##   [7] bit64_0.9-7            knitr_1.23             data.table_1.12.2     
##  [10] rpart_4.1-15           RCurl_1.95-4.12        doParallel_1.0.14     
##  [13] generics_0.0.2         preprocessCore_1.46.0  lambda.r_1.2.3        
##  [16] RSQLite_2.1.1          mixOmics_6.8.0         RANN_2.6.1            
##  [19] bit_1.1-14             httpuv_1.5.1           webshot_0.5.1         
##  [22] xml2_1.2.0             lubridate_1.7.4        assertthat_0.2.1      
##  [25] viridis_0.5.1          gower_0.2.1            xfun_0.8              
##  [28] hms_0.5.0              promises_1.0.1         evaluate_0.14         
##  [31] TSP_1.1-7              progress_1.2.2         DEoptimR_1.0-8        
##  [34] dendextend_1.12.0      caTools_1.17.1.2       igraph_1.2.4.1        
##  [37] DBI_1.0.0.9001         htmlwidgets_1.3        futile.logger_1.4.3   
##  [40] rARPACK_0.11-0         purrr_0.3.2            RSpectra_0.15-0       
##  [43] backports_1.1.4        annotate_1.62.0        biomaRt_2.40.3        
##  [46] vctrs_0.2.0            ROCR_1.0-7             prettydoc_0.3.0       
##  [49] caret_6.0-84           withr_2.1.2            kBET_0.99.5           
##  [52] itertools_0.1-3        robustbase_0.93-5      checkmate_1.9.4       
##  [55] gclus_1.3.2            prettyunits_1.0.2      fdrtool_1.2.15        
##  [58] mclust_5.4.5           ropls_1.16.0           softImpute_1.4        
##  [61] cluster_2.1.0          ape_5.3                lazyeval_0.2.2        
##  [64] crayon_1.3.4           ellipse_0.4.1          genefilter_1.66.0     
##  [67] edgeR_3.26.5           recipes_0.1.6          pkgconfig_2.0.2       
##  [70] nlme_3.1-140           seriation_1.2-7        nnet_7.3-12           
##  [73] rlang_0.4.0            diptest_0.75-7         pls_2.7-1             
##  [76] registry_0.5-1         DiscriMiner_0.1-29     affyio_1.54.0         
##  [79] MassSpecWavelet_1.50.0 VennDiagram_1.6.20     tcltk_3.6.0           
##  [82] randomForest_4.6-14    matrixStats_0.54.0     graph_1.62.0          
##  [85] Matrix_1.2-17          boot_1.3-23            base64enc_0.1-3       
##  [88] GlobalOptions_0.1.0    pheatmap_1.0.12        png_0.1-7             
##  [91] viridisLite_0.3.0      rjson_0.2.20           bootstrap_2019.6      
##  [94] bitops_1.0-6           KernSmooth_2.23-15     pROC_1.15.0           
##  [97] R2HTML_2.3.2           blob_1.2.0             shape_1.4.4           
## [100] stringr_1.4.0          qvalue_2.16.0          robust_0.4-18.1       
## [103] readr_1.3.1            faahKO_1.24.0          scales_1.0.0          
## [106] memoise_1.1.0          magrittr_1.5           plyr_1.8.4            
## [109] gplots_3.0.1.1         gdata_2.18.0           zlibbioc_1.30.0       
## [112] compiler_3.6.0         RColorBrewer_1.1-2     pcaMethods_1.76.0     
## [115] clue_0.3-57            rrcov_1.4-7            affy_1.62.0           
## [118] htmlTable_1.13.1       formatR_1.7            Formula_1.2-3         
## [121] WGCNA_1.68             MASS_7.3-51.4          mgcv_1.8-28           
## [124] tidyselect_0.2.5       vsn_3.52.0             stringi_1.4.3         
## [127] highr_0.8              yaml_2.2.0             locfit_1.5-9.1        
## [130] NetSAM_2.0.5           latticeExtra_0.6-28    MALDIquant_1.19.3     
## [133] grid_3.6.0             tools_3.6.0            circlize_0.4.6        
## [136] rstudioapi_0.10        foreach_1.4.4          foreign_0.8-71        
## [139] gridExtra_2.3          prodlim_2018.04.18     scatterplot3d_0.3-41  
## [142] mzID_1.22.0            digest_0.6.20          BiocManager_1.30.4    
## [145] shiny_1.3.2            FNN_1.1.3              lava_1.6.5            
## [148] fpc_2.2-3              later_0.8.0            ncdf4_1.16.1          
## [151] httr_1.4.0             Nozzle.R1_1.1-1        AnnotationDbi_1.46.0  
## [154] ComplexHeatmap_2.0.0   kernlab_0.9-27         colorspace_1.4-1      
## [157] rvest_0.3.4            XML_3.98-1.20          IRanges_2.18.1        
## [160] splines_3.6.0          RBGL_1.60.0            xgboost_0.82.1        
## [163] multtest_2.40.0        flexmix_2.3-15         plotly_4.9.0          
## [166] fit.models_0.5-14      xtable_1.8-4           jsonlite_1.6          
## [169] futile.options_1.0.1   BBmisc_1.11            dynamicTreeCut_1.63-1 
## [172] corpcor_1.6.9          timeDate_3043.102      UpSetR_1.3.3          
## [175] zeallot_0.1.0          modeltools_0.2-22      ipred_0.9-9           
## [178] R6_2.4.0               Hmisc_4.2-0            mime_0.7              
## [181] pillar_1.4.2           htmltools_0.3.6        glue_1.3.1            
## [184] class_7.3-15           codetools_0.2-16       DiffCorr_0.4.1        
## [187] pcaPP_1.9-73           mvtnorm_1.0-11         lattice_0.20-38       
## [190] tibble_2.1.3           sva_3.32.1             SSPA_2.24.0           
## [193] gtools_3.8.1           metaX_2.0.0            GO.db_3.8.2           
## [196] CAMERA_1.40.0          survival_2.44-1.1      limma_3.40.2          
## [199] rmarkdown_1.14         munsell_0.5.0          fastcluster_1.1.25    
## [202] e1071_1.7-2            GetoptLong_0.1.7       iterators_1.0.10      
## [205] impute_1.58.0          reshape2_1.4.3         gtable_0.3.0          
## [208] coop_0.6-2

References

Büttner, Maren, Zhichao Miao, F Alexander Wolf, Sarah A Teichmann, and Fabian J Theis. 2018. “A Test Metric for Assessing Single-Cell RNA-Seq Batch Correction.” Nature Methods. Nature Publishing Group, 1.