1 Introduction

This evaluation covers 1 data table. Evaluation metrics from the OmicsEV package for this data table are included in this report, beginning with a summary of the data. The sample distribution by class for each data table is shown in the table below.

class paper
Basal 11
Her2 8
LumA 12
LumB 18
None 14

Detailed information for each sample included in all data tables is shown below.

sample class batch order
TCGA.AO.A12D None 1 1
TCGA.C8.A131 Basal 1 2
TCGA.AO.A12B None 1 3
TCGA.E2.A10A LumA 1 4
TCGA.C8.A130 LumB 1 5
TCGA.C8.A138 Her2 1 6
TCGA.E2.A154 LumA 1 7
TCGA.A8.A09I LumB 1 8
TCGA.C8.A12L Her2 1 9
TCGA.A2.A0EX LumA 1 10
TCGA.AN.A04A None 1 11
TCGA.BH.A0AV Basal 1 12
TCGA.A2.A0D0 Basal 1 13
TCGA.C8.A12T Her2 1 14
TCGA.A8.A06Z LumB 1 15
TCGA.A2.A0D1 None 1 16
TCGA.A2.A0CM Basal 1 17
TCGA.A2.A0YI LumA 1 18
TCGA.A2.A0EQ Her2 1 19
TCGA.AR.A0TY LumB 1 20
TCGA.AR.A0U4 None 1 21
TCGA.BH.A0HP LumA 1 22
TCGA.BH.A0EE Her2 2 23
TCGA.AO.A0J9 None 2 24
TCGA.AN.A0FK LumA 2 25
TCGA.AO.A0J6 None 2 26
TCGA.A7.A13F LumB 2 27
TCGA.A7.A0CE Basal 2 28
TCGA.A2.A0YC LumA 2 29
TCGA.AO.A0JC None 2 30
TCGA.AR.A0TX Her2 2 31
TCGA.D8.A13Y LumB 2 32
TCGA.A8.A076 LumB 2 33
TCGA.AO.A126 None 2 34
TCGA.C8.A12P Her2 2 35
TCGA.BH.A0C1 LumA 2 36
TCGA.A2.A0EY LumB 2 37
TCGA.AR.A1AW LumB 2 38
TCGA.AR.A1AV LumA 2 39
TCGA.C8.A135 Her2 2 40
TCGA.A2.A0EV LumA 2 41
TCGA.AN.A0AM LumB 2 42
TCGA.D8.A142 Basal 2 43
TCGA.AN.A0FL Basal 3 44
TCGA.AN.A0AS LumA 3 45
TCGA.AR.A0TV LumB 3 46
TCGA.C8.A12Z Her2 3 47
TCGA.AO.A0JJ None 3 48
TCGA.AO.A0JE None 3 49
TCGA.A2.A0T2 Basal 3 50
TCGA.AN.A0AJ LumB 3 51
TCGA.A7.A0CJ LumB 3 52
TCGA.AO.A12F None 3 53
TCGA.A2.A0YL LumA 3 54
TCGA.A2.A0T7 LumA 3 55
TCGA.C8.A12Q Her2 3 56
TCGA.A8.A079 LumB 3 57
TCGA.E2.A159 Basal 3 58
TCGA.A2.A0T3 LumB 3 59
TCGA.A2.A0YD LumA 3 60
TCGA.AR.A0TR LumA 3 61
TCGA.AO.A03O None 3 62
TCGA.AO.A12E None 3 63
TCGA.A8.A06N LumB 3 64
TCGA.A2.A0T1 Her2 3 65
TCGA.A2.A0YG LumB 3 66
TCGA.E2.A150 Basal 3 67
TCGA.A7.A0CD LumA 4 68
TCGA.C8.A12W LumB 4 69
TCGA.AN.A0AL Basal 4 70
TCGA.A2.A0T6 LumA 4 71
TCGA.AO.A0JM None 4 72
TCGA.C8.A12V Basal 4 73
TCGA.A2.A0D2 Basal 4 74
TCGA.C8.A12U LumB 4 75
TCGA.A8.A09G Her2 4 76
TCGA.C8.A134 Basal 4 77
TCGA.A2.A0YF LumA 4 78
TCGA.BH.A0E9 LumA 4 79
TCGA.AR.A0TT LumB 4 80
TCGA.AR.A1AQ Basal 4 81
TCGA.A2.A0SW LumB 4 82
TCGA.AO.A0JL None 4 83
TCGA.A2.A0YM Basal 4 84
TCGA.BH.A0C7 LumB 4 85
TCGA.A2.A0SX Basal 4 86

2 Overview

The table below provides an overview of all the quantitative metrics generated in the evaluation. For each metric, the value of the best data table is highlighted in bold and red. The details for each metric can be found in the corresponding sections below.

metric paper
#identified features 10062 (0.4936)
#quantifiable features 9227 (0.4526)
non_missing_value_ratio 0.9397
data_dist_similarity 0.9188
silhouette_width -0.4237 (0.5763)
pcRegscale 0.0000 (1.0000)
complex_auc 0.7368
func_auc 0.8630
class_auc 0.7418
gene_wise_cor 0.3784
sample_wise_cor 0.1783

The radar plot below summarizes results from the overview table above. To generate the radar plot, each metric is rescaled where necessary to the range 0 to 1 so that higher values always indicate better data quality. Scaled values are in parentheses in the table.
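The scaled values in parentheses can be reproduced from the raw values. The following is a minimal sketch, assuming scaling rules inferred from the numbers in the overview table (feature counts divided by the 20386 features of the study species, and 1 minus the absolute value for the two batch-effect metrics). The function names are hypothetical and are not part of the OmicsEV API.

```python
# Sketch of the 0-1 scaling used for the radar plot. The rules below are
# inferred from the values in the overview table, not taken from OmicsEV code.

TOTAL_FEATURES = 20386  # total proteins/genes in the study species (Section 3.1)


def scale_feature_count(count, total=TOTAL_FEATURES):
    """Express a feature count as a fraction of all features in the species."""
    return count / total


def scale_silhouette(width):
    """A smaller |silhouette width| means less batch effect, so invert it."""
    return 1.0 - abs(width)


def scale_pc_regscale(value):
    """pcRegscale is a share of variance tied to batch, so invert it as well."""
    return 1.0 - value


scaled = {
    "#identified features": round(scale_feature_count(10062), 4),
    "#quantifiable features": round(scale_feature_count(9227), 4),
    "silhouette_width": round(scale_silhouette(-0.4237), 4),
    "pcRegscale": round(scale_pc_regscale(0.0), 4),
}
print(scaled)
```

Under these assumptions the sketch returns 0.4936, 0.4526, 0.5763 and 1.0, which agree with the values in parentheses.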

3 Data depth

3.1 Study-wise

The table below shows the number of identified and quantified proteins or genes for each data table. Identified proteins or genes are those with a measurement in any sample in a data table whereas quantified proteins or genes are those that remain after filtering out those with missing values in more than 50% of the samples in a data table. The values in parentheses are the percentage of proteins or genes identified or quantified based on the total number of proteins or genes (20386) in the study species.

data table #identified features #quantifiable features
paper 10062 (49.36%) 9227 (45.26%)
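The two definitions above can be sketched on a toy matrix. This is an illustration only: the matrix, the function name, and the use of None for missing values are assumptions, and a feature missing in exactly 50% of the samples is kept because the text filters only those missing in more than 50%.

```python
# Sketch: count identified and quantifiable features in a small toy matrix.
# Rows are features, columns are samples, None marks a missing value.

def count_features(matrix, max_missing_fraction=0.5):
    identified = 0
    quantifiable = 0
    for row in matrix:
        n_missing = sum(value is None for value in row)
        if n_missing < len(row):
            identified += 1  # measured in at least one sample
            if n_missing / len(row) <= max_missing_fraction:
                quantifiable += 1  # not missing in more than 50% of samples
    return identified, quantifiable


toy = [
    [1.2, 0.8, 1.1, 0.9],      # complete
    [None, 0.5, None, 0.7],    # 50% missing: still quantifiable
    [None, None, None, 0.3],   # 75% missing: identified only
    [None, None, None, None],  # never measured: not identified
]
identified, quantifiable = count_features(toy)
print(identified, quantifiable)
```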

3.2 Sample-wise

The figures below show the number of proteins or genes identified/quantified (non-missing values) in each sample. Samples from different batches are coded with different shapes, and samples from different classes are coded with different colors. A separate figure is shown for each data table.

[Figure: data table "paper"]

3.3 Missing value distribution

The missing value distribution provides an overview of the completeness of the data. The table below shows the proportion of non-missing values across all samples in each data table (a higher ratio means more complete data).

data table non_missing_value_ratio
paper 0.9397

The following barplots show missing value distributions for each data table as number (Y axis)/percentage (number above bar) of proteins or genes with missing values in each bin. Genes are binned by proportion of samples with missing values from 0.1 to 1 in increments of 0.1, where 0.1 indicates missing values in no more than 10% of the samples, and 1 indicates missing values in all samples.

[Figure: data table "paper"]
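The non-missing ratio and the binning described above can be sketched as follows. The toy matrix and function names are hypothetical, and counting fully observed features in the first bin ("no more than 10%") is an assumption about how the report handles them.

```python
from collections import Counter

# Sketch: overall non-missing ratio and per-feature missing-value bins.
# Rows are features, columns are samples, None marks a missing value.


def non_missing_ratio(matrix):
    cells = [value for row in matrix for value in row]
    return sum(value is not None for value in cells) / len(cells)


def missing_value_bins(matrix):
    bins = Counter()
    for row in matrix:
        n_missing = sum(value is None for value in row)
        # Integer ceiling of 10 * n_missing / n_samples (avoids float rounding);
        # the lowest bin (0.1) also receives features with no missing values.
        index = max(1, -(-10 * n_missing // len(row)))
        bins[index / 10] += 1
    return dict(bins)


toy = [
    [1.0] * 10,               # no missing values  -> bin 0.1
    [None] + [1.0] * 9,       # 10% missing        -> bin 0.1
    [None] * 3 + [1.0] * 7,   # 30% missing        -> bin 0.3
    [None] * 10,              # missing everywhere -> bin 1.0
]
ratio = non_missing_ratio(toy)
bins = missing_value_bins(toy)
print(ratio, bins)
```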

4 Data normalization

4.1 Boxplot

Normalized data is expected to be centered around a similar value and show similar distributions in all samples. The boxplots below show the protein or gene expression measurement distribution across samples in each data table, allowing for qualitative assessment of the normalized data. Samples in input order are indicated on the X axis. The Y axis shows log2 transformed protein or gene values. Samples from different classes are coded with different colors.

[Figure: data table "paper"]

To quantify the normalization effect, we tested how well the feature abundances can distinguish between each pair of samples. If two samples have similar distributions, the overall feature abundance (levels of all features in one sample versus the other) should not be sufficient to predict which sample is which. Therefore, for each pair of samples, an AUROC was computed to quantify the ability of feature abundance to distinguish the two samples, and a data_dist_similarity score was derived from it: 1-2*abs(AUROC-0.5). This score ranges from 0 to 1; a higher score indicates better normalization (no systematic difference between the two samples). The final metric for each data table is the median of the scores from all sample pairs. The column ‘n’ shows the total number of sample pairs in the analysis.

data table data_dist_similarity n
paper 0.9188 1953
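The score can be sketched in a few lines. This is a minimal illustration, not the OmicsEV implementation: the AUROC is computed by direct pairwise comparison, the toy samples are hypothetical, and missing values are assumed to be absent. With the 63 samples listed in the class table, the number of sample pairs is C(63, 2) = 1953, which matches the reported n.

```python
from itertools import combinations
from math import comb
from statistics import median

# Sketch of the data_dist_similarity score: 1 - 2 * abs(AUROC - 0.5).


def auroc(a, b):
    """Probability that a value from a exceeds a value from b (ties count 0.5)."""
    wins = 0.0
    for x in a:
        for y in b:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(a) * len(b))


def pair_similarity(a, b):
    return 1.0 - 2.0 * abs(auroc(a, b) - 0.5)


def data_dist_similarity(samples):
    """Median pairwise score and the number of sample pairs."""
    scores = [
        pair_similarity(samples[i], samples[j])
        for i, j in combinations(sorted(samples), 2)
    ]
    return median(scores), len(scores)


toy = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 3, 4],  # same distribution as s1 -> score 1
    "s3": [5, 6, 7, 8],  # fully shifted           -> score 0 against s1 and s2
}
identical = pair_similarity(toy["s1"], toy["s2"])
shifted = pair_similarity(toy["s1"], toy["s3"])
score, n_pairs = data_dist_similarity(toy)
pairs_for_63_samples = comb(63, 2)
```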

4.2 Density plot

The density plots below show the expression distributions for all samples (separate line) in each data table. The Y axis shows the density over the range of log2 transformed protein or gene expression values (X axis).

5 Batch effect

5.1 Silhouette width

The silhouette width s(i) ranges from -1 to 1, with s(i) -> 1 if two clusters are separate and s(i) -> -1 if two clusters overlap but have dissimilar variance. If s(i) -> 0, both clusters have roughly the same structure. Thus, we use the absolute value |s| as an indicator for the presence or absence of batch effects (the greater |s| is, the higher the batch effect is). This analysis is done using the function batch_sil from the R package kBET.

data table silhouette_width
paper -0.4237
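For orientation, the average silhouette width with batches as cluster labels can be sketched as below. This is a generic textbook implementation on hypothetical 2-D coordinates (for example, principal component scores) using Euclidean distance. It is not the kBET batch_sil code, and it assumes at least two batches and non-identical points.

```python
from math import dist
from statistics import mean

# Sketch: mean silhouette width where each batch is treated as one cluster.


def silhouette_width(points, labels):
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    widths = []
    for point, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same batch
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other batch
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in groups.items()
            if other != label
        )
        widths.append((b - a) / max(a, b))
    return mean(widths)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette_width(points, [1, 1, 2, 2])  # batches form two clusters
mixed = silhouette_width(points, [1, 2, 1, 2])      # batches are interleaved
print(separated, mixed)
```

In this toy example the first labelling gives a width close to 1 and the second gives a negative width.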

5.2 PCA with batch annotation

For each principal component (PC) from PCA, we calculate the Pearson’s correlation coefficient for that PC with batch covariate b:

$r_i = \mathrm{corr}(\mathrm{PC}_i, b)$

In a linear model with a single explanatory variable, as is the case here for the correlation of a given PC with the batch covariate, the coefficient of determination for batch b on PC i, $R^2$, is the squared Pearson’s correlation coefficient:

$R^2(\mathrm{PC}_i, b) = r_i^2$

The table below shows the R2 value for each of the first 10 PCs in each data table. The significance of the correlation was estimated either with a t-test or a one-way ANOVA. R2 values highlighted in red indicate significant correlation (p-value <= 0.05) between batch and the corresponding PC. This analysis is done using the function pcRegression from the R package kBET.

PC paper
1 0.007
2 0.044
3 0.004
4 0.018
5 0.001
6 0.053
7 0.034
8 0
9 0.001
10 0.009
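The identity between the squared Pearson correlation and the coefficient of determination of a one-predictor linear fit can be checked numerically. The sketch below uses hypothetical PC scores and a numerically coded batch variable. It illustrates the formula only and does not reproduce the pcRegression function from kBET.

```python
from math import sqrt

# Sketch: for a single predictor, R^2 of the least-squares fit equals r^2.


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_from_fit(x, y):
    """R^2 = 1 - SS_res / SS_tot for the least-squares line y ~ x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


batch = [1, 1, 2, 2]               # hypothetical batch covariate
pc_scores = [1.0, 2.0, 3.0, 4.0]   # hypothetical scores on one PC

r = pearson(batch, pc_scores)
r2_from_correlation = r ** 2
r2_from_fit = r_squared_from_fit(batch, pc_scores)
print(r2_from_correlation, r2_from_fit)
```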

The percentage of variance explained for each PC is shown in the table below:

PC paper
1 11.6
2 8.2
3 7.2
4 4.0
5 4.0
6 3.4
7 2.6
8 2.4
9 2.3
10 2.3

A batch effect is more likely to be substantial when a PC that explains a high percentage of variance shows significant correlation with the batch covariate. Therefore, we use the ‘Scaled PC regression’ metric (pcRegscale), i.e. the total variance of the PCs that correlate significantly with the batch covariate (FDR<0.05) scaled by the total variance of the first 10 PCs, to quantify the batch effect:

data table pcRegscale
paper 0
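The metric can be sketched from the explained-variance table above. The adjusted p-values in this example are hypothetical, since the report does not list them, and the function name is an assumption. A reported value of 0 is what the sketch returns when no PC passes the significance threshold.

```python
# Sketch of pcRegscale: variance of batch-associated PCs / variance of all PCs.


def pc_regscale(explained_variance, adjusted_p_values, alpha=0.05):
    significant = sum(
        variance
        for variance, q in zip(explained_variance, adjusted_p_values)
        if q < alpha
    )
    return significant / sum(explained_variance)


# Percentage of variance explained by PC1..PC10 (from the table above).
variance = [11.6, 8.2, 7.2, 4.0, 4.0, 3.4, 2.6, 2.4, 2.3, 2.3]

# Case 1: no PC is significantly associated with batch.
no_batch = pc_regscale(variance, [1.0] * 10)

# Case 2 (hypothetical): PC2 and PC6 are significantly associated with batch.
hypothetical_q = [1.0] * 10
hypothetical_q[1] = 0.01
hypothetical_q[5] = 0.02
with_batch = pc_regscale(variance, hypothetical_q)
print(no_batch, round(with_batch, 4))
```

In the hypothetical second case the result is (8.2 + 3.4) / 48.0, roughly 0.24.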

The figures below show the PCA score plots for the top three PCs for each data table. Samples from different batches are coded with different colors in the plots.

5.3 Correlation heatmap

Another way to qualitatively assess batch effect is to use heatmaps to compare correlations between samples from the same batch with correlations between samples from different batches. The following figures show Spearman correlation heatmaps for all pairs of samples (all samples included in both rows and columns) for each data table. The color indicates the correlation between samples. The samples are ordered by batch. A concentration of high correlation values (red color) among pairs of samples from the same batch, compared with pairs from different batches, indicates the presence of a batch effect.

[Figure: data table "paper"]

6 Biological signal

6.1 Correlation among protein complex members

Members of the same protein complex often show greater correlation in gene and protein expression (IntraComplex correlation) than genes or proteins that are in different complexes (InterComplex correlation). Thus, one way to evaluate the quality of the biological signal present in a data table is to compare IntraComplex correlation to InterComplex correlation. Furthermore, because of the need to preserve stoichiometry between protein complex members, the difference between IntraComplex correlation and InterComplex correlation is often greater at the protein level than at the RNA level. If both RNA and protein data tables are available, observing that this difference is more pronounced in the protein data table than in the RNA data table serves as an indicator of the quality of the protein data. We use the protein complexes from the CORUM database in this analysis.

The boxplots below show the distributions and ranges for pairwise correlations between genes or proteins from the same complex and for genes and proteins from different complexes for each data table.

The table below shows a summary of the evaluation. ‘diff’ is Cor(intra) - Cor(inter). ‘complex_auc’ is the AUROC for separating IntraComplex protein pairs from InterComplex protein pairs based on their correlation.

data table InterComplex IntraComplex diff complex_auc
paper 0.0156 0.2157 0.2002 0.7368
RNA 0.0188 0.1465 0.1277 0.6571
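A minimal sketch of the two summary columns follows. The pairwise correlations are hypothetical toy values, the AUROC is computed by direct comparison of IntraComplex and InterComplex pairs, and the use of the median as the per-group summary is an assumption (the report does not state which statistic is used).

```python
from statistics import median

# Sketch: 'diff' and 'complex_auc' from pairwise correlations.


def auroc(positives, negatives):
    """Probability that a positive scores above a negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical correlations of protein pairs.
intra = [0.6, 0.3, 0.2]          # both proteins in the same complex
inter = [0.1, 0.25, -0.1, 0.0]   # proteins in different complexes

diff = median(intra) - median(inter)  # assumption: median as group summary
complex_auc = auroc(intra, inter)
print(round(diff, 4), round(complex_auc, 4))
```

An AUROC of 0.5 would mean that correlation carries no information about complex membership, while values closer to 1 indicate stronger biological signal.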

6.2 Gene function prediction

Previous studies have shown that expression correlation is often higher for functionally related genes or proteins than for unrelated genes or proteins and that this correlation is greater when considering protein data than when considering RNA data (Wang, Jing, et al. Molecular & Cellular Proteomics 16.1 (2017): 121-134.). Therefore, we can also evaluate the biological signal present in a data table by evaluating functional category predictions made using a co-expression network generated from each data table.

In this evaluation, each data table was used to build a co-expression network. For a selected network and a selected functional category (such as a selected category from GO or KEGG), proteins/genes annotated to the category and also included in the network were defined as a positive protein/gene set, and other proteins/genes in the network constituted the negative protein/gene set for the category. For a selected functional category, a subset of the proteins/genes were used as seed proteins/genes for random walk through the network to calculate scores for other proteins/genes. A higher score for a protein/gene represents a closer relationship between the protein/gene and the seed proteins/genes. The table below shows AUROCs of the prediction performance using this score for each selected functional category.

category paper RNA
ABC transporters 0.762 0.735
Acute myeloid leukemia 0.802 0.507
Adherens junction 0.781 0.633
Adipocytokine signaling pathway 0.751 0.596
Alanine, aspartate and glutamate metabolism 0.66 0.603
Aldosterone-regulated sodium reabsorption 0.862 0.574
Allograft rejection 0.938 0.986
Alzheimers disease 0.781 0.719
Amino sugar and nucleotide sugar metabolism 0.744 0.619
Aminoacyl-tRNA biosynthesis 0.779 0.746
Amoebiasis 0.791 0.723
Amyotrophic lateral sclerosis (ALS) 0.662 0.622
Antigen processing and presentation 0.849 0.845
Arachidonic acid metabolism 0.705 0.601
Arginine and proline metabolism 0.648 0.604
Arrhythmogenic right ventricular cardiomyopathy (ARVC) 0.837 0.64
Autoimmune thyroid disease 0.922 0.986
Axon guidance 0.708 0.604
B cell receptor signaling pathway 0.735 0.543
Bacterial invasion of epithelial cells 0.756 0.588
Base excision repair 0.667 0.712
beta-Alanine metabolism 0.745 0.694
Bile secretion 0.811 0.658
Biosynthesis of unsaturated fatty acids 0.884 0.684
Bladder cancer 0.616 0.523
Butanoate metabolism 0.673 0.62
Calcium signaling pathway 0.729 0.552
Carbohydrate digestion and absorption 0.945 0.741
Cardiac muscle contraction 0.873 0.691
Cell adhesion molecules (CAMs) 0.805 0.796
Cell cycle 0.79 0.743
Chagas disease (American trypanosomiasis) 0.765 0.538
Chemokine signaling pathway 0.799 0.588
Chronic myeloid leukemia 0.62 0.612
Citrate cycle (TCA cycle) 0.937 0.817
Colorectal cancer 0.586 0.614
Complement and coagulation cascades 0.899 0.903
Cysteine and methionine metabolism 0.845 0.617
Cytokine-cytokine receptor interaction 0.686 0.762
Cytosolic DNA-sensing pathway 0.673 0.561
Dilated cardiomyopathy 0.785 0.594
DNA replication 0.711 0.84
Drug metabolism - cytochrome P450 0.756 0.748
Drug metabolism - other enzymes 0.674 0.673
ECM-receptor interaction 0.832 0.832
Endocytosis 0.619 0.526
Endometrial cancer 0.762 0.557
Epithelial cell signaling in Helicobacter pylori infection 0.624 0.591
ErbB signaling pathway 0.797 0.518
Ether lipid metabolism 0.647 0.657
Fatty acid elongation in mitochondria 0.925 0.756
Fatty acid metabolism 0.766 0.629
Fc epsilon RI signaling pathway 0.809 0.529
Fc gamma R-mediated phagocytosis 0.784 0.534
Focal adhesion 0.813 0.656
Fructose and mannose metabolism 0.903 0.595
Galactose metabolism 0.775 0.653
Gap junction 0.775 0.6
Gastric acid secretion 0.885 0.601
Glioma 0.695 0.615
Glutathione metabolism 0.748 0.608
Glycerolipid metabolism 0.593 0.726
Glycine, serine and threonine metabolism 0.799 0.623
Glycolysis / Gluconeogenesis 0.822 0.626
Glyoxylate and dicarboxylate metabolism 0.863 0.695
GnRH signaling pathway 0.693 0.603
Graft-versus-host disease 0.936 0.999
Hedgehog signaling pathway 0.787 0.57
Hematopoietic cell lineage 0.741 0.743
Hepatitis C 0.761 0.61
Histidine metabolism 0.658 0.592
Huntingtons disease 0.833 0.743
Hypertrophic cardiomyopathy (HCM) 0.785 0.595
Inositol phosphate metabolism 0.645 0.585
Insulin signaling pathway 0.779 0.582
Jak-STAT signaling pathway 0.693 0.615
Leishmaniasis 0.754 0.619
Leukocyte transendothelial migration 0.799 0.608
Long-term depression 0.79 0.618
Long-term potentiation 0.823 0.555
Lysine degradation 0.693 0.583
Lysosome 0.682 0.561
Malaria 0.725 0.74
MAPK signaling pathway 0.679 0.567
Melanogenesis 0.891 0.676
Melanoma 0.669 0.562
Metabolic pathways 0.713 0.603
Metabolism of xenobiotics by cytochrome P450 0.881 0.751
mTOR signaling pathway 0.76 0.568
N-Glycan biosynthesis 0.743 0.753
Natural killer cell mediated cytotoxicity 0.764 0.627
Neurotrophin signaling pathway 0.644 0.56
Nicotinate and nicotinamide metabolism 0.692 0.615
NOD-like receptor signaling pathway 0.678 0.562
Non-small cell lung cancer 0.759 0.562
Notch signaling pathway 0.637 0.592
One carbon pool by folate 0.616 0.708
Oocyte meiosis 0.769 0.549
Osteoclast differentiation 0.764 0.586
Oxidative phosphorylation 0.888 0.823
p53 signaling pathway 0.567 0.628
Pancreatic cancer 0.649 0.601
Pancreatic secretion 0.807 0.586
Parkinsons disease 0.895 0.801
Pathogenic Escherichia coli infection 0.785 0.627
Pathways in cancer 0.689 0.555
Pentose and glucuronate interconversions 0.78 0.665
Peroxisome 0.737 0.59
Phagosome 0.768 0.668
Phosphatidylinositol signaling system 0.677 0.607
Porphyrin and chlorophyll metabolism 0.618 0.599
PPAR signaling pathway 0.723 0.622
Primary immunodeficiency 0.881 0.83
Prion diseases 0.833 0.704
Progesterone-mediated oocyte maturation 0.794 0.614
Propanoate metabolism 0.924 0.649
Prostate cancer 0.611 0.55
Protein digestion and absorption 0.843 0.859
Protein export 0.933 0.845
Protein processing in endoplasmic reticulum 0.751 0.743
Pyrimidine metabolism 0.682 0.586
Pyruvate metabolism 0.865 0.611
Regulation of actin cytoskeleton 0.805 0.617
Renal cell carcinoma 0.735 0.623
Rheumatoid arthritis 0.75 0.652
Ribosome 0.978 0.834
RIG-I-like receptor signaling pathway 0.724 0.635
RNA transport 0.626 0.651
Salivary secretion 0.764 0.642
Shigellosis 0.756 0.511
Small cell lung cancer 0.575 0.626
SNARE interactions in vesicular transport 0.776 0.711
Sphingolipid metabolism 0.666 0.62
Staphylococcus aureus infection 0.923 0.922
Starch and sucrose metabolism 0.804 0.672
Systemic lupus erythematosus 0.953 0.812
T cell receptor signaling pathway 0.677 0.522
Terpenoid backbone biosynthesis 0.761 0.696
TGF-beta signaling pathway 0.684 0.64
Tight junction 0.842 0.55
Toll-like receptor signaling pathway 0.637 0.572
Toxoplasmosis 0.697 0.568
Tryptophan metabolism 0.738 0.612
Type I diabetes mellitus 0.904 0.954
Type II diabetes mellitus 0.635 0.648
Tyrosine metabolism 0.803 0.794
Ubiquitin mediated proteolysis 0.645 0.653
Valine, leucine and isoleucine degradation 0.766 0.721
Vascular smooth muscle contraction 0.853 0.59
Vasopressin-regulated water reabsorption 0.76 0.614
VEGF signaling pathway 0.712 0.533
Vibrio cholerae infection 0.676 0.587
Viral myocarditis 0.844 0.73
Wnt signaling pathway 0.707 0.602
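The network-based scoring described before the table can be illustrated with a random walk with restart on a toy network. This is a sketch under stated assumptions: the network, the restart probability of 0.5, the seed choice, and the function names are all hypothetical, and the report does not specify which random walk variant OmicsEV uses.

```python
# Sketch: random walk with restart on an undirected toy network, followed by
# an AUROC for separating annotated (positive) from other (negative) genes.


def random_walk_with_restart(network, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    nodes = sorted(network)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        updated = {}
        for node in nodes:
            # Each neighbour passes on its score split equally over its edges.
            inflow = sum(scores[nb] / len(network[nb]) for nb in network[node])
            updated[node] = (1.0 - restart) * inflow + restart * start[node]
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores


def auroc(positives, negatives):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical co-expression network: a triangle A-B-C with a tail C-D-E.
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = random_walk_with_restart(network, seeds={"A"})

# Suppose A, B and C are annotated to the category and A was used as the seed.
auc = auroc([scores["B"], scores["C"]], [scores["D"], scores["E"]])
print({node: round(score, 4) for node, score in scores.items()}, auc)
```

Genes close to the seed in the network receive higher scores, so annotated genes that were held out of the seed set rank above unrelated genes when the network carries functional signal.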

The rank boxplots below summarize the relative performance of the data tables in the functional prediction analysis. For each functional category, a rank is assigned to each data table based on its AUROC compared to the other data tables, where the best functional prediction rank is 1 and the poorest rank is the number of data tables.

Comparison of each protein (RNA) data table to a designated RNA (protein) data table is also summarized in the scatter plots below. For each point, the AUROC for a given category in the RNA data is plotted on the X-axis whereas the corresponding AUROC in the protein data table is plotted on the Y-axis. The number of categories for which the protein data table outperforms the RNA data table (AUROC(protein) > 1.1 * AUROC(RNA); red dots) and vice versa (AUROC(RNA) > 1.1 * AUROC(protein); blue dots) are also shown.

[Figure: data table "paper"]
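The 1.1-fold rule for the red and blue dots can be sketched with four rows taken from the AUROC table above, treating the "paper" column as the protein data table (an assumption). The function name is hypothetical.

```python
# Sketch: classify categories by the 1.1-fold AUROC rule.


def compare_auroc(protein, rna, factor=1.1):
    protein_better = [c for c in protein if protein[c] > factor * rna[c]]
    rna_better = [c for c in protein if rna[c] > factor * protein[c]]
    return protein_better, rna_better


# Four rows from the functional prediction table ("paper" vs "RNA").
protein = {
    "ABC transporters": 0.762,
    "Acute myeloid leukemia": 0.802,
    "Allograft rejection": 0.938,
    "DNA replication": 0.711,
}
rna = {
    "ABC transporters": 0.735,
    "Acute myeloid leukemia": 0.507,
    "Allograft rejection": 0.986,
    "DNA replication": 0.84,
}
protein_better, rna_better = compare_auroc(protein, rna)
print(protein_better, rna_better)
```

Categories where neither AUROC exceeds the other by more than 10%, such as "ABC transporters" and "Allograft rejection" here, fall into neither group.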

6.3 Sample class prediction

OmicsEV also allows for assessment of how well each data table can predict a user-specified class for each sample. For each data table, machine learning models are built to predict sample class: LumA, LumB. In OmicsEV, random forest models are built, and the models are evaluated using 5-fold cross-validation repeated 20 times. Note that, depending on the class specified, this metric may or may not provide an indication of data quality. The results of the AUROC analysis performed using the models are summarized in the table and boxplots below.

dataSet mean_ROC median_ROC sd_ROC
paper 0.7418 0.7442 0.0174
RNA 0.9894 0.9904 0.0037

6.4 PCA with sample class annotation

Another approach for assessing how well each data table can distinguish between classes is to determine how well each class can be separated by principal component analysis (PCA). In PCA score plots for each data table below, each point is a sample that is colored by class and that has a shape reflecting the batch. For a given sample, the PC2 score is plotted on the Y-axis whereas the PC1 score is plotted on the X-axis. Ellipses highlighting clusters of samples in each class are colored by corresponding class, and the separation between these ellipses indicates how well the variances captured by the first two PCs can distinguish between samples from different classes.

[Figure: data table "paper"]

6.5 Unsupervised clustering

Unsupervised hierarchical clustering can reveal patterns in the data (clusters of genes or samples that behave more similarly to each other than to other genes or samples). Each heatmap below shows the results of hierarchical clustering for a given data table using ComplexHeatmap. Genes/proteins are in rows, while samples are in columns and labeled with corresponding class to visualize any potential associations between classes and clusters.

[Figure: data table "paper"]

7 Multi-omics concordance

The concordance between the protein data and RNA data can be used to assess data quality when both RNA and protein data tables are available. Here, we evaluate gene- and sample-wise correlations between the protein and RNA data tables.

7.1 Gene-wise mRNA-protein correlation

The table below shows the number of genes with measurements (n) in each data table as well as the median of all gene-wise Spearman correlations between mRNA and protein measurements. The columns n5, n6, n7 and n8 show the number of genes with correlation greater than 0.5, 0.6, 0.7 and 0.8, respectively.

data table n n5 n6 n7 n8 gene_wise_cor
paper 8893 2824 1508 541 70 0.3784
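The columns of this table can be sketched as follows. The toy genes and values are hypothetical, the Spearman correlation is computed as the Pearson correlation of average ranks, and samples are assumed to be matched by position with no missing values.

```python
from math import sqrt
from statistics import median

# Sketch: gene-wise Spearman correlation between RNA and protein, with the
# counts above each threshold (n5..n8) and the median (gene_wise_cor).


def ranks(values):
    """1-based ranks, with ties receiving the average rank of their block."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def summarize(rna, protein, thresholds=(0.5, 0.6, 0.7, 0.8)):
    cors = [spearman(rna[g], protein[g]) for g in sorted(rna) if g in protein]
    counts = {t: sum(c > t for c in cors) for t in thresholds}
    return len(cors), counts, median(cors)


samples = [1, 2, 3, 4, 5, 6]
rna = {"g1": samples, "g2": samples, "g3": samples, "g4": samples}
protein = {
    "g1": [1, 2, 3, 4, 5, 6],  # rho = 1
    "g2": [3, 2, 1, 4, 5, 6],  # rho = 27/35
    "g3": [3, 2, 1, 6, 5, 4],  # rho = 19/35
    "g4": [6, 5, 4, 3, 2, 1],  # rho = -1
}
n, counts, gene_wise_cor = summarize(rna, protein)
print(n, counts, round(gene_wise_cor, 4))
```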

Spearman correlation results are also shown for each gene/protein in the boxplot below.

Another way to visualize the differences between the distributions of all gene-wise RNA-protein correlations is with the cumulative distribution function (CDF) plot shown below. Here each line shows the cumulative distribution for the gene-wise correlations. The further the distribution function is shifted to the right, the more highly correlated the RNA-protein data is.

The histograms below provide another way to visualize the distribution of correlations for each protein (or RNA) data table with the RNA (or protein) data. Here the bars showing binned frequencies of positive correlations are in red, while negative correlations are shown in the blue bins, and summary statistics are also provided.

[Figure: data table "paper"]

7.2 Sample-wise mRNA-protein correlation

Sample-wise RNA-protein correlations are summarized in the table below as the median of Spearman correlations for matched protein and RNA data from all pairs of samples for each data table, while the violin plots below show the distributions of these correlations for each data table.

data table sample_wise_cor
paper 0.1783