1 Introduction

In this evaluation, there are a total of 6 data tables. Evaluation metrics from the OmicsEV package for these data tables are included in this report, beginning with a summary of the data. The sample distribution by class for each data table is shown in the table below.

class	d1	d2	d3	d4	d5	d6
Basal	17	17	17	17	17	17
Her2	12	12	12	12	12	12
LumA	19	19	19	19	19	19
LumB	22	22	22	22	22	22
None	16	16	16	16	16	16

Detailed information for each sample included in all data tables is shown below.

sample	class	batch	order
TCGA.A2.A0CM	Basal	1	1
TCGA.A2.A0D0	Basal	1	2
TCGA.A2.A0D1	None	1	3
TCGA.A2.A0D2	Basal	1	4
TCGA.A2.A0EQ	Her2	1	5
TCGA.A2.A0EV	LumA	1	6
TCGA.A2.A0EX	LumA	1	7
TCGA.A2.A0EY	LumB	1	8
TCGA.A2.A0SW	LumB	1	9
TCGA.A2.A0SX	Basal	1	10
TCGA.A2.A0T1	Her2	1	11
TCGA.A2.A0T2	Basal	1	12
TCGA.A2.A0T6	LumA	1	13
TCGA.A2.A0T7	LumA	1	14
TCGA.A2.A0YC	LumA	1	15
TCGA.A2.A0YD	LumA	1	16
TCGA.A2.A0YF	LumA	1	17
TCGA.A2.A0YG	LumB	1	18
TCGA.A2.A0YI	LumA	1	19
TCGA.A2.A0YL	LumA	1	20
TCGA.A2.A0YM	Basal	1	21
TCGA.A7.A0CD	LumA	1	22
TCGA.A7.A0CE	Basal	1	23
TCGA.A7.A0CJ	LumB	1	24
TCGA.A8.A06N	LumB	1	25
TCGA.A8.A06Z	LumB	1	26
TCGA.A8.A076	LumB	1	27
TCGA.A8.A079	LumB	1	28
TCGA.A8.A09G	Her2	1	29
TCGA.A8.A09I	LumB	1	30
TCGA.AN.A04A	None	1	31
TCGA.AN.A0AJ	LumB	1	32
TCGA.AN.A0AL	Basal	1	33
TCGA.AN.A0AM	LumB	1	34
TCGA.AN.A0AS	LumA	1	35
TCGA.AN.A0FK	LumA	1	36
TCGA.AN.A0FL	Basal	1	37
TCGA.AO.A03O	None	1	38
TCGA.AO.A0J6	None	1	39
TCGA.AO.A0J9	None	1	40
TCGA.AO.A0JC	None	1	41
TCGA.AO.A0JE	None	1	42
TCGA.AO.A0JJ	None	1	43
TCGA.AO.A0JL	None	1	44
TCGA.AO.A0JM	None	1	45
TCGA.AO.A126	None	1	46
TCGA.AO.A12B	None	1	47
TCGA.AO.A12E	None	1	48
TCGA.AR.A0TR	LumA	1	49
TCGA.AR.A0TT	LumB	1	50
TCGA.AR.A0TV	LumB	1	51
TCGA.AR.A0TX	Her2	1	52
TCGA.AR.A0U4	None	1	53
TCGA.BH.A0EE	Her2	1	54
TCGA.BH.A0HP	LumA	1	55
TCGA.A2.A0T3	LumB	2	56
TCGA.A7.A13F	LumB	2	57
TCGA.AO.A12D	None	2	58
TCGA.AO.A12F	None	2	59
TCGA.AR.A0TY	LumB	2	60
TCGA.AR.A1AQ	Basal	2	61
TCGA.AR.A1AV	LumA	2	62
TCGA.AR.A1AW	LumB	2	63
TCGA.BH.A0AV	Basal	2	64
TCGA.BH.A0C1	LumA	2	65
TCGA.BH.A0C7	LumB	2	66
TCGA.BH.A0E9	LumA	2	67
TCGA.C8.A12L	Her2	2	68
TCGA.C8.A12P	Her2	2	69
TCGA.C8.A12Q	Her2	2	70
TCGA.C8.A12T	Her2	2	71
TCGA.C8.A12U	LumB	2	72
TCGA.C8.A12V	Basal	2	73
TCGA.C8.A12W	LumB	2	74
TCGA.C8.A12Z	Her2	2	75
TCGA.C8.A130	LumB	2	76
TCGA.C8.A131	Basal	2	77
TCGA.C8.A134	Basal	2	78
TCGA.C8.A135	Her2	2	79
TCGA.C8.A138	Her2	2	80
TCGA.D8.A13Y	LumB	2	81
TCGA.D8.A142	Basal	2	82
TCGA.E2.A10A	LumA	2	83
TCGA.E2.A150	Basal	2	84
TCGA.E2.A154	LumA	2	85
TCGA.E2.A159	Basal	2	86

2 Overview

The table below provides an overview of all the quantitative metrics generated in the evaluation. For each metric, the value of the best data table is highlighted in bold and red. The details for each metric can be found in the corresponding sections below.

metric	d1	d2	d3	d4	d5	d6
#identified features	18845 (0.9244)	18845 (0.9244)	18845 (0.9244)	18845 (0.9244)	18845 (0.9244)	18845 (0.9244)
#quantifiable features	17416 (0.8543)	17416 (0.8543)	17416 (0.8543)	17416 (0.8543)	17416 (0.8543)	17416 (0.8543)
non_missing_value_ratio	0.9780	0.9780	0.9780	0.9780	0.9780	0.9780
data_dist_similarity	0.9739	1.0000	0.9864	0.9863	0.9621	0.9766
silhouette_width	0.0145 (0.9855)	-0.0009 (0.9991)	0.0139 (0.9861)	0.0144 (0.9856)	0.0200 (0.9800)	0.0214 (0.9786)
pcRegscale	0.1682 (0.8318)	0.2771 (0.7229)	0.1720 (0.8280)	0.1731 (0.8269)	0.0954 (0.9046)	0.0000 (1.0000)
complex_auc	0.6520	0.6340	0.6815	0.6819	0.6320	0.6536
func_auc	0.7758	0.7740	0.7871	0.7984	0.8072	0.7918
class_auc	0.9900	0.9943	0.9887	0.9898	0.9913	0.9940
gene_wise_cor	0.3294	0.3354	0.3374	0.3384	0.3207	0.3283
sample_wise_cor	0.1421	0.1421	0.1421	0.1421	0.1421	0.1383

The radar plot below summarizes the results from the overview table above. To generate the radar plot, each metric is scaled to the range 0 to 1 and, where necessary, transformed so that higher values indicate better data quality. Scaled values are shown in parentheses in the table.
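The scaled values in parentheses in the overview table are consistent with two simple rules: feature counts divided by the species total (20386, from section 3.1), and 1 - |value| for metrics where values near zero are best. The sketch below reproduces a few table entries under that assumption. The rules are inferred from the numbers and are not confirmed OmicsEV internals, and the function names are hypothetical.

```python
# Total number of proteins or genes in the study species (section 3.1).
TOTAL_FEATURES = 20386

def scale_count(n_features, total=TOTAL_FEATURES):
    """Scale a feature count to [0, 1] as a fraction of the species total."""
    return n_features / total

def scale_lower_is_better(value):
    """ASSUMED rule for silhouette_width and pcRegscale: 1 - |value|."""
    return 1 - abs(value)

scaled = {
    "identified_d1": round(scale_count(18845), 4),
    "quantifiable_d1": round(scale_count(17416), 4),
    "silhouette_d2": round(scale_lower_is_better(-0.0009), 4),
    "pcRegscale_d2": round(scale_lower_is_better(0.2771), 4),
}
print(scaled)
```

The four results can be compared directly with the parenthesized entries for d1 and d2 in the overview table.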

3 Data depth

3.1 Study-wise

The table below shows the number of identified and quantified proteins or genes for each data table. Identified proteins or genes are those with a measurement in any sample in a data table whereas quantified proteins or genes are those that remain after filtering out those with missing values in more than 50% of the samples in a data table. The values in parentheses are the percentage of proteins or genes identified or quantified based on the total number of proteins or genes (20386) in the study species.

data table	#identified features	#quantifiable features
d1	18845 (92.44%)	17416 (85.43%)
d2	18845 (92.44%)	17416 (85.43%)
d3	18845 (92.44%)	17416 (85.43%)
d4	18845 (92.44%)	17416 (85.43%)
d5	18845 (92.44%)	17416 (85.43%)
d6	18845 (92.44%)	17416 (85.43%)
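The identified/quantifiable distinction can be illustrated with a small sketch. The feature names and values are made up, and `None` stands in for a missing value; only the 50% threshold comes from the text above.

```python
# Hypothetical feature-by-sample table; None marks a missing value.
table = {
    "geneA": [5.1, 4.8, 5.0, 5.3],
    "geneB": [None, 3.2, None, 3.0],     # 50% missing: still quantifiable
    "geneC": [None, None, None, 2.1],    # 75% missing: identified only
    "geneD": [None, None, None, None],   # never measured: not identified
}

def is_identified(values):
    """A feature is identified if it has a measurement in any sample."""
    return any(v is not None for v in values)

def is_quantifiable(values, max_missing=0.5):
    """A feature is dropped when more than 50% of its samples are missing."""
    missing_fraction = sum(v is None for v in values) / len(values)
    return missing_fraction <= max_missing

identified = [g for g, v in table.items() if is_identified(v)]
quantifiable = [g for g, v in table.items() if is_quantifiable(v)]
print(len(identified), len(quantifiable))
```

A feature with exactly 50% missing values is kept here, because the text filters out only features with missing values in more than 50% of samples.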

The upset chart below shows overlap between proteins or genes identified in each data table. Numbers of proteins or genes commonly identified in different combinations of data tables are indicated in the top bar chart, and the specific combinations of data tables containing those proteins or genes are indicated with solid points below the bar chart. Total identifications for each data table are indicated on the right as ‘Set size’.

3.2 Sample-wise

The figures below show the number of proteins or genes identified/quantified (non-missing values) in each sample. Samples from different batches are coded with different shapes, and samples from different classes are coded with different colors. A separate figure is shown for each data table.

[Figure panels: d1, d2, d3, d4, d5, d6]

3.3 Missing value distribution

The missing value distribution provides an overview of the completeness of the data. The table below shows the ratio of non-missing values across all samples in each data table.

data table	non_missing_value_ratio
d1	0.978
d2	0.978
d3	0.978
d4	0.978
d5	0.978
d6	0.978

The following barplots show missing value distributions for each data table as number (Y axis)/percentage (number above bar) of proteins or genes with missing values in each bin. Genes are binned by proportion of samples with missing values from 0.1 to 1 in increments of 0.1, where 0.1 indicates missing values in no more than 10% of the samples, and 1 indicates missing values in all samples.

[Figure panels: d1, d2, d3, d4, d5, d6]

4 Data normalization

4.1 Boxplot

Normalized data is expected to be centered around a similar value and show similar distributions in all samples. The boxplots below show the protein or gene expression measurement distribution across samples in each data table, allowing for qualitative assessment of the normalized data. Samples in input order are indicated on the X axis. The Y axis shows log2 transformed protein or gene values. Samples from different classes are coded with different colors.

[Figure panels: d1, d2, d3, d4, d5, d6]

To quantify the normalization effect, we tested how well overall feature abundance can distinguish between each pair of samples. If two samples have similar distributions, the abundance levels of all features in one sample versus the other should not be sufficient to predict which sample is which. Therefore, for each pair of samples, an AUROC was computed to quantify the ability of feature abundance to distinguish the two samples, and a data_dist_similarity score was derived from it: 1-2*abs(AUROC-0.5). This score ranges from 0 to 1, and a higher score indicates better normalization (no systematic difference between the two samples). The final metric for each data table is the median of the scores from all sample pairs. The column ‘n’ shows the total number of sample pairs in the analysis.

data table	data_dist_similarity	n
d1	0.9739	3655
d2	1.0000	3655
d3	0.9864	3655
d4	0.9863	3655
d5	0.9621	3655
d6	0.9766	3655
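As a rough illustration of the data_dist_similarity score, the sketch below computes a pairwise AUROC as the probability that a value from one sample exceeds a value from the other (ties count as one half), then applies the formula from the text. The toy samples and the helper names are hypothetical, and OmicsEV may compute the AUROC differently.

```python
from itertools import combinations
from statistics import median

def auroc(sample_a, sample_b):
    """P(value from a > value from b), counting ties as 0.5."""
    wins = sum((p > q) + 0.5 * (p == q) for p in sample_a for q in sample_b)
    return wins / (len(sample_a) * len(sample_b))

def dist_similarity(sample_a, sample_b):
    """Score from the text: 1 - 2 * |AUROC - 0.5|."""
    return 1 - 2 * abs(auroc(sample_a, sample_b) - 0.5)

# Hypothetical log2 abundances for three samples (not taken from the report).
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.5, 2.5, 3.5, 4.5],   # slightly shifted relative to s1
    "s3": [5.0, 6.0, 7.0, 8.0],   # completely shifted relative to s1
}

scores = [dist_similarity(samples[a], samples[b])
          for a, b in combinations(samples, 2)]
metric = median(scores)  # final metric: median over all sample pairs

# The 86 samples listed in the introduction give 86 * 85 / 2 pairs.
n_pairs = 86 * 85 // 2
print(n_pairs)
```

The pair count equals the value of n reported in the table above.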

4.2 Density plot

The density plots below show the expression distributions for all samples (one line per sample) in each data table. The Y axis shows the density over the range of log2 transformed protein or gene expression values (X axis).

5 Batch effect

5.1 Silhouette width

The silhouette width s(i) ranges from –1 to 1, with s(i) -> 1 if two clusters are separate and s(i) -> −1 if two clusters overlap but have dissimilar variance. If s(i) -> 0, both clusters have roughly the same structure. Thus, we use the absolute value |s| as an indicator for the presence or absence of batch effects (the greater |s| is, the higher the batch effect is). This analysis is done using the function batch_sil from the R package kBET.

data table	silhouette_width
d1	0.0145
d2	-0.0009
d3	0.0139
d4	0.0144
d5	0.0200
d6	0.0214
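For intuition, the average silhouette width with batch as the cluster label can be sketched as follows. This is a toy version that uses Euclidean distance on raw 2-D points and made-up coordinates; it is not the kBET implementation, and `silhouette_width` is a hypothetical helper.

```python
from math import dist
from statistics import mean

def silhouette_width(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points.

    Assumes every label occurs at least twice and at least two labels exist.
    """
    widths = []
    for i, p in enumerate(points):
        # a: mean distance from point i to the other points in its own batch.
        a = mean(dist(p, q) for j, q in enumerate(points)
                 if labels[j] == labels[i] and j != i)
        # b: lowest mean distance from point i to the points of another batch.
        b = min(
            mean(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            for other in set(labels) - {labels[i]}
        )
        widths.append((b - a) / max(a, b))
    return mean(widths)

# Two batches far apart: strong batch structure, width close to 1.
separated = silhouette_width([(0, 0), (0, 1), (10, 0), (10, 1)], [1, 1, 2, 2])
# Two batches interleaved on a unit square: width much closer to 0.
mixed = silhouette_width([(0, 0), (0, 1), (1, 0), (1, 1)], [1, 2, 2, 1])
print(round(separated, 3), round(mixed, 3))
```

The values in the table above are all within about 0.02 of zero, far from the well-separated case in this toy example.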

5.2 PCA with batch annotation

For each principal component (PC) from PCA, we calculate the Pearson’s correlation coefficient for that PC with batch covariate b:

r_i = corr(PC_i, b)

In a linear model with a single predictor, as is the case here when a given PC is regressed on the batch covariate, the coefficient of determination for batch b on PC_i, R², is the squared Pearson’s correlation coefficient:

R²(PC_i,b) = r_i²
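The identity between R² and the squared Pearson correlation can be checked numerically. The sketch below uses made-up PC scores and batch labels, fits a one-predictor least-squares line by hand, and compares 1 - SS_res/SS_tot with r². All names and values are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on one PC for six samples, with batch labels 1 or 2.
pc_scores = [-2.0, -1.0, -1.5, 1.0, 2.0, 1.5]
batch = [1, 1, 1, 2, 2, 2]

r = pearson(pc_scores, batch)
r_squared = r ** 2

# Least-squares fit of pc_scores on batch, then R² = 1 - SS_res / SS_tot.
n = len(batch)
mb, mp = sum(batch) / n, sum(pc_scores) / n
slope = (sum((b - mb) * (p - mp) for b, p in zip(batch, pc_scores))
         / sum((b - mb) ** 2 for b in batch))
intercept = mp - slope * mb
ss_res = sum((p - (intercept + slope * b)) ** 2
             for b, p in zip(batch, pc_scores))
ss_tot = sum((p - mp) ** 2 for p in pc_scores)
r2_regression = 1 - ss_res / ss_tot

print(round(r_squared, 4), round(r2_regression, 4))
```

Both quantities agree up to floating-point error, which is why R² for each PC can be reported directly from its correlation with batch.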

The table below shows the R² values for the first 10 PCs in each data table. The significance of each correlation was estimated with either a t-test or a one-way ANOVA. R² values highlighted in red indicate significant correlation (p-value <= 0.05) between batch and the corresponding PC. This analysis is done using the function pcRegression from the R package kBET.

PC	d1	d2	d3	d4	d5	d6
1	0.007	0.012	0.01	0.01	0	0.008
2	0.015	0	0.001	0.001	0.041	0.001
3	0.048	0.092	0.013	0.013	0.027	0.006
4	0.1	0.118	0.11	0.11	0.116	0.026
5	0	0.002	0.002	0.002	0.003	0.001
6	0.093	0.007	0.121	0.119	0.076	0.005
7	0.006	0.038	0.005	0.003	0.003	0.006
8	0.01	0.228	0.035	0.038	0.011	0
9	0.033	0.028	0.029	0.027	0.025	0.005
10	0.001	0.002	0.004	0.003	0.015	0.001

The percentage of variance explained for each PC is shown in the table below:

PC	d1	d2	d3	d4	d5	d6
1	10.8	10.8	10.9	11.0	11.2	11.0
2	7.9	6.7	7.3	7.3	9.1	7.8
3	5.4	5.1	4.9	4.9	5.7	5.3
4	4.5	4.7	4.6	4.6	4.4	4.3
5	4.1	4.0	4.0	4.0	3.9	4.2
6	3.0	2.9	2.9	3.0	3.1	2.9
7	2.8	2.7	2.8	2.9	2.7	2.8
8	2.2	2.2	2.2	2.2	2.1	2.2
9	2.0	2.1	2.1	2.1	2.0	2.0
10	1.9	2.1	1.9	1.9	1.9	1.9

Greater batch effect is more likely to be present when a PC that explains a higher percentage of variance shows significant correlation with the batch covariate. Therefore, we use the ‘Scaled PC regression’ metric (pcRegscale), i.e. the total variance of PCs which correlate significantly with batch covariate (FDR<0.05) scaled by the total variance of 10 PCs, to quantify the batch effect:

data table	pcRegscale
d1	0.1682
d2	0.2771
d3	0.1720
d4	0.1731
d5	0.0954
d6	0.0000
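As a worked example, the pcRegscale value for d2 can be approximately reproduced from the variance table above. Which PCs are significant is not recoverable from this text, so the sketch assumes PCs 3, 4 and 8 (the three largest R² values for d2). That choice is an assumption, and `pc_reg_scale` is a hypothetical helper, not the kBET function.

```python
# Percentage of variance explained by PC1..PC10 for d2 (table above).
var_explained_d2 = [10.8, 6.7, 5.1, 4.7, 4.0, 2.9, 2.7, 2.2, 2.1, 2.1]

# ASSUMPTION: these are the PCs significantly correlated with batch in d2.
significant_pcs_d2 = [3, 4, 8]

def pc_reg_scale(var_explained, significant_pcs):
    """Variance of batch-associated PCs divided by the variance of all listed PCs."""
    significant_var = sum(var_explained[pc - 1] for pc in significant_pcs)
    return significant_var / sum(var_explained)

pc_reg_scale_d2 = pc_reg_scale(var_explained_d2, significant_pcs_d2)
print(round(pc_reg_scale_d2, 3))
```

Under this assumption the result is about 0.277, in line with the d2 entry in the table; small deviations are expected because the variance percentages are rounded to one decimal. A data table with no significant PCs yields 0, as reported for d6.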

The figures below show the PCA score plots for the top three PCs for each data table. Samples from different batches are coded with different colors in the plots.

5.3 Correlation heatmap

Another way to qualitatively assess batch effect is to use heatmaps to compare correlations between samples from the same batch with correlations between samples from different batches. The following figures show Spearman correlation heatmaps for all pairs of samples (all samples included in both rows and columns) for each data table. The color indicates the correlation between samples. The samples are ordered by batch. A concentration of high correlation values (red color) among pairs of samples from the same batch, compared with pairs from different batches, indicates the presence of batch effect.

[Figure panels: d1, d2, d3, d4, d5, d6]

6 Biological signal

6.1 Correlation among protein complex members

Members of the same protein complex often show greater correlation in gene and protein expression (IntraComplex correlation) than genes or proteins that are in different complexes (InterComplex correlation). Thus, one way to evaluate the quality of the biological signal present in a data table is to compare IntraComplex correlation to InterComplex correlation. Furthermore, because of the need to preserve stoichiometry between protein complex members, the difference between IntraComplex correlation and InterComplex correlation is often greater at the protein level than at the RNA level. If both RNA and protein data tables are available, observing that this difference is more pronounced in the protein data table than in the RNA data table serves as an indicator of the quality of the protein data. We use the protein complexes from the CORUM database in this analysis.

The boxplots below show the distributions and ranges for pairwise correlations between genes or proteins from the same complex and for genes and proteins from different complexes for each data table.

The table below shows a summary of the evaluation. ‘diff’ is Cor(intra) - Cor(inter). ‘complex_auc’ is the AUROC for separating the two groups of protein pairs (IntraComplex vs. InterComplex) based on their correlation.

data table	InterComplex	IntraComplex	diff	complex_auc
d1	0.0413	0.1634	0.1220	0.6520
d2	0.0050	0.1064	0.1013	0.6340
d3	0.0219	0.1715	0.1496	0.6815
d4	0.0201	0.1697	0.1496	0.6819
d5	0.0712	0.1825	0.1112	0.6320
d6	0.0410	0.1621	0.1210	0.6536
Protein	0.0047	0.1722	0.1676	0.6426

6.2 Gene function prediction

Previous studies have shown that expression correlation is often higher for functionally related genes or proteins than for unrelated genes or proteins and that this correlation is greater when considering protein data than when considering RNA data (Wang, Jing, et al. Molecular & Cellular Proteomics 16.1 (2017): 121-134.). Therefore, we can also evaluate the biological signal present in a data table by evaluating functional category predictions made using a co-expression network generated from each data table.

In this evaluation, each data table was used to build a co-expression network. For a selected network and a selected functional category (such as a selected category from GO or KEGG), proteins/genes annotated to the category and also included in the network were defined as a positive protein/gene set, and other proteins/genes in the network constituted the negative protein/gene set for the category. For a selected functional category, a subset of the proteins/genes were used as seed proteins/genes for random walk through the network to calculate scores for other proteins/genes. A higher score for a protein/gene represents a closer relationship between the protein/gene and the seed proteins/genes. The table below shows AUROCs of the prediction performance using this score for each selected functional category.

category	d1	d2	d3	d4	d5	d6	Protein
Acute myeloid leukemia	0.543	0.587	0.641	0.595	0.546	0.543	0.588
Adherens junction	0.619	0.627	0.584	0.542	0.615	0.551	0.606
Adipocytokine signaling pathway	0.651	0.577	0.584	0.582	0.595	0.62	0.62
Alanine, aspartate and glutamate metabolism	0.651	0.58	0.573	0.6	0.702	0.658	0.679
Aldosterone-regulated sodium reabsorption	0.658	0.683	0.535	0.628	0.622	0.606	0.646
Allograft rejection	1	1	1	1	1	1	0.916
Alzheimers disease	0.708	0.646	0.699	0.72	0.724	0.684	0.733
Amino sugar and nucleotide sugar metabolism	0.645	0.641	0.604	0.601	0.66	0.608	0.638
Aminoacyl-tRNA biosynthesis	0.75	0.749	0.778	0.766	0.687	0.739	0.627
Amoebiasis	0.598	0.635	0.598	0.591	0.629	0.597	0.779
Amyotrophic lateral sclerosis (ALS)	0.661	0.607	0.7	0.652	0.648	0.609	0.58
Antigen processing and presentation	0.858	0.803	0.86	0.865	0.773	0.782	0.573
Apoptosis	0.615	0.596	0.584	0.563	0.559	0.63	0.563
Arachidonic acid metabolism	0.563	0.589	0.635	0.662	0.714	0.677	0.58
Arginine and proline metabolism	0.641	0.67	0.586	0.6	0.587	0.606	0.611
Arrhythmogenic right ventricular cardiomyopathy (ARVC)	0.637	0.63	0.638	0.592	0.625	0.676	0.693
Autoimmune thyroid disease	1	0.999	1	1	1	1	0.904
Axon guidance	0.667	0.629	0.604	0.59	0.633	0.615	0.56
B cell receptor signaling pathway	0.599	0.58	0.522	0.568	0.526	0.624	0.674
Bacterial invasion of epithelial cells	0.615	0.556	0.548	0.556	0.564	0.574	0.767
Basal transcription factors	0.627	0.487	0.519	0.522	0.588	0.556	0.669
Base excision repair	0.668	0.619	0.612	0.625	0.691	0.721	0.757
beta-Alanine metabolism	0.591	0.586	0.687	0.587	0.682	0.658	0.69
Bile secretion	0.609	0.593	0.594	0.602	0.684	0.587	0.646
Biosynthesis of unsaturated fatty acids	0.679	0.679	0.552	0.735	0.651	0.6	0.83
Bladder cancer	0.621	0.635	0.57	0.542	0.61	0.633	0.621
Butanoate metabolism	0.722	0.626	0.619	0.647	0.648	0.624	0.736
Calcium signaling pathway	0.609	0.535	0.585	0.613	0.606	0.6	0.61
Carbohydrate digestion and absorption	0.637	0.683	0.686	0.661	0.689	0.617	0.761
Cardiac muscle contraction	0.704	0.654	0.688	0.703	0.683	0.72	0.878
Cell adhesion molecules (CAMs)	0.796	0.786	0.788	0.815	0.812	0.795	0.77
Cell cycle	0.737	0.728	0.742	0.729	0.692	0.732	0.747
Chagas disease (American trypanosomiasis)	0.594	0.6	0.63	0.571	0.542	0.566	0.581
Chemokine signaling pathway	0.576	0.561	0.564	0.561	0.61	0.535	0.715
Citrate cycle (TCA cycle)	0.637	0.768	0.769	0.739	0.643	0.658	0.866
Collecting duct acid secretion	0.688	0.733	0.674	0.658	0.626	0.688	0.722
Colorectal cancer	0.547	0.528	0.576	0.573	0.589	0.596	0.646
Complement and coagulation cascades	0.867	0.842	0.78	0.867	0.838	0.874	0.907
Cysteine and methionine metabolism	0.61	0.615	0.593	0.653	0.705	0.712	0.622
Cytokine-cytokine receptor interaction	0.701	0.626	0.772	0.756	0.776	0.739	0.639
Cytosolic DNA-sensing pathway	0.65	0.617	0.687	0.727	0.673	0.702	0.583
Dilated cardiomyopathy	0.613	0.592	0.625	0.605	0.677	0.623	0.699
DNA replication	0.723	0.758	0.787	0.775	0.843	0.8	0.866
Drug metabolism - cytochrome P450	0.762	0.747	0.68	0.743	0.776	0.731	0.658
Drug metabolism - other enzymes	0.63	0.67	0.616	0.672	0.631	0.642	0.662
ECM-receptor interaction	0.857	0.851	0.82	0.816	0.883	0.866	0.916
Endocytosis	0.564	0.563	0.568	0.585	0.602	0.59	0.601
Endometrial cancer	0.593	0.539	0.596	0.551	0.57	0.531	0.631
Epithelial cell signaling in Helicobacter pylori infection	0.549	0.565	0.562	0.541	0.568	0.533	0.706
ErbB signaling pathway	0.579	0.6	0.62	0.591	0.576	0.553	0.535
Ether lipid metabolism	0.685	0.7	0.675	0.66	0.806	0.71	0.61
Fatty acid elongation in mitochondria	0.69	0.535	0.701	0.671	0.68	0.633	0.637
Fatty acid metabolism	0.652	0.642	0.6	0.628	0.684	0.543	0.717
Fc epsilon RI signaling pathway	0.555	0.564	0.636	0.533	0.567	0.616	0.68
Fc gamma R-mediated phagocytosis	0.584	0.671	0.65	0.659	0.61	0.638	0.74
Focal adhesion	0.669	0.632	0.666	0.671	0.672	0.686	0.75
Fructose and mannose metabolism	0.641	0.612	0.582	0.659	0.622	0.652	0.651
Galactose metabolism	0.623	0.605	0.642	0.599	0.639	0.554	0.641
Gap junction	0.595	0.584	0.565	0.619	0.551	0.622	0.706
Gastric acid secretion	0.596	0.686	0.572	0.635	0.608	0.579	0.618
Glioma	0.587	0.629	0.559	0.601	0.599	0.624	0.536
Glutathione metabolism	0.614	0.641	0.648	0.608	0.632	0.638	0.637
Glycerolipid metabolism	0.593	0.546	0.545	0.605	0.572	0.653	0.637
Glycerophospholipid metabolism	0.596	0.554	0.592	0.562	0.636	0.617	0.619
Glycine, serine and threonine metabolism	0.551	0.563	0.63	0.628	0.688	0.739	0.583
Glycolysis / Gluconeogenesis	0.602	0.565	0.622	0.637	0.595	0.624	0.743
Glyoxylate and dicarboxylate metabolism	0.797	0.677	0.648	0.698	0.694	0.641	0.677
GnRH signaling pathway	0.628	0.685	0.705	0.671	0.678	0.669	0.578
Graft-versus-host disease	1	1	1	1	1	1	0.885
Hematopoietic cell lineage	0.741	0.766	0.809	0.759	0.791	0.717	0.727
Hepatitis C	0.534	0.553	0.565	0.583	0.62	0.642	0.612
Huntingtons disease	0.717	0.67	0.72	0.745	0.741	0.717	0.796
Hypertrophic cardiomyopathy (HCM)	0.62	0.598	0.625	0.574	0.691	0.644	0.657
Inositol phosphate metabolism	0.587	0.633	0.543	0.632	0.573	0.566	0.578
Intestinal immune network for IgA production	0.759	0.764	0.906	0.902	0.898	0.813	0.79
Jak-STAT signaling pathway	0.681	0.703	0.617	0.641	0.674	0.619	0.577
Leishmaniasis	0.69	0.696	0.676	0.685	0.708	0.68	0.733
Leukocyte transendothelial migration	0.603	0.628	0.64	0.656	0.622	0.576	0.794
Long-term depression	0.553	0.553	0.586	0.567	0.727	0.616	0.539
Long-term potentiation	0.55	0.607	0.559	0.561	0.598	0.663	0.569
Lysine degradation	0.73	0.646	0.607	0.626	0.587	0.641	0.619
Lysosome	0.539	0.541	0.585	0.567	0.528	0.516	0.637
Malaria	0.769	0.73	0.774	0.798	0.822	0.806	0.825
Melanogenesis	0.605	0.643	0.674	0.686	0.607	0.608	0.59
Melanoma	0.53	0.533	0.596	0.599	0.547	0.642	0.643
Metabolic pathways	0.589	0.592	0.602	0.593	0.603	0.598	0.627
Metabolism of xenobiotics by cytochrome P450	0.754	0.731	0.787	0.79	0.81	0.758	0.682
Mismatch repair	0.73	0.721	0.755	0.741	0.748	0.746	0.862
mRNA surveillance pathway	0.566	0.56	0.511	0.567	0.636	0.576	0.75
mTOR signaling pathway	0.626	0.577	0.611	0.58	0.596	0.578	0.705
N-Glycan biosynthesis	0.749	0.678	0.761	0.797	0.772	0.62	0.702
Natural killer cell mediated cytotoxicity	0.607	0.53	0.644	0.573	0.563	0.577	0.788
NOD-like receptor signaling pathway	0.603	0.562	0.624	0.648	0.553	0.658	0.603
Non-small cell lung cancer	0.613	0.574	0.522	0.55	0.55	0.568	0.6
Notch signaling pathway	0.578	0.594	0.659	0.55	0.554	0.608	0.749
Nucleotide excision repair	0.61	0.589	0.588	0.665	0.689	0.665	0.743
Oocyte meiosis	0.628	0.613	0.606	0.643	0.622	0.656	0.537
Osteoclast differentiation	0.646	0.675	0.625	0.573	0.652	0.588	0.688
Oxidative phosphorylation	0.79	0.696	0.78	0.799	0.8	0.788	0.877
p53 signaling pathway	0.532	0.638	0.636	0.59	0.617	0.612	0.575
Pancreatic cancer	0.611	0.543	0.594	0.612	0.624	0.619	0.544
Pancreatic secretion	0.607	0.574	0.604	0.6	0.575	0.573	0.629
Parkinsons disease	0.782	0.712	0.781	0.775	0.771	0.784	0.883
Pathogenic Escherichia coli infection	0.602	0.647	0.658	0.62	0.598	0.566	0.686
Pentose phosphate pathway	0.64	0.669	0.682	0.623	0.628	0.728	0.705
Peroxisome	0.546	0.559	0.659	0.611	0.703	0.534	0.598
Phagosome	0.7	0.703	0.699	0.712	0.77	0.703	0.713
PPAR signaling pathway	0.558	0.651	0.629	0.611	0.618	0.572	0.572
Prion diseases	0.687	0.641	0.637	0.67	0.63	0.67	0.76
Progesterone-mediated oocyte maturation	0.611	0.635	0.629	0.582	0.614	0.61	0.637
Propanoate metabolism	0.782	0.626	0.702	0.642	0.629	0.753	0.681
Prostate cancer	0.511	0.587	0.582	0.584	0.532	0.571	0.614
Proteasome	0.86	0.794	0.86	0.848	0.816	0.857	0.808
Protein digestion and absorption	0.89	0.897	0.886	0.88	0.867	0.847	0.868
Protein processing in endoplasmic reticulum	0.722	0.698	0.742	0.742	0.727	0.722	0.633
Purine metabolism	0.607	0.594	0.575	0.621	0.549	0.574	0.619
Pyrimidine metabolism	0.545	0.534	0.552	0.599	0.61	0.571	0.633
Pyruvate metabolism	0.643	0.584	0.643	0.609	0.611	0.722	0.603
Regulation of actin cytoskeleton	0.636	0.616	0.63	0.621	0.621	0.618	0.652
Retinol metabolism	0.768	0.733	0.727	0.631	0.617	0.685	0.756
Rheumatoid arthritis	0.698	0.639	0.72	0.699	0.71	0.717	0.663
Ribosome	0.927	0.835	0.891	0.898	0.91	0.914	0.924
Ribosome biogenesis in eukaryotes	0.682	0.722	0.753	0.737	0.649	0.744	0.805
RIG-I-like receptor signaling pathway	0.562	0.475	0.625	0.598	0.652	0.65	0.627
RNA degradation	0.611	0.628	0.597	0.626	0.665	0.658	0.789
RNA polymerase	0.683	0.685	0.634	0.605	0.607	0.701	0.8
RNA transport	0.669	0.666	0.669	0.668	0.682	0.679	0.752
Salivary secretion	0.658	0.568	0.606	0.524	0.659	0.623	0.668
Shigellosis	0.574	0.596	0.589	0.592	0.639	0.545	0.673
Small cell lung cancer	0.541	0.6	0.556	0.549	0.574	0.606	0.67
SNARE interactions in vesicular transport	0.697	0.669	0.606	0.684	0.651	0.655	0.773
Sphingolipid metabolism	0.564	0.634	0.686	0.601	0.661	0.67	0.629
Spliceosome	0.749	0.743	0.76	0.78	0.733	0.754	0.763
Staphylococcus aureus infection	0.889	0.878	0.896	0.894	0.849	0.9	0.839
Starch and sucrose metabolism	0.643	0.656	0.676	0.692	0.634	0.602	0.65
Systemic lupus erythematosus	0.812	0.78	0.832	0.816	0.803	0.803	0.84
T cell receptor signaling pathway	0.542	0.582	0.55	0.57	0.577	0.594	0.606
Terpenoid backbone biosynthesis	0.64	0.672	0.734	0.73	0.655	0.69	0.84
TGF-beta signaling pathway	0.619	0.617	0.591	0.637	0.601	0.64	0.587
Tight junction	0.578	0.583	0.572	0.601	0.612	0.623	0.538
Toll-like receptor signaling pathway	0.585	0.575	0.58	0.529	0.596	0.561	0.665
Toxoplasmosis	0.581	0.594	0.547	0.603	0.565	0.599	0.626
Tryptophan metabolism	0.591	0.704	0.696	0.684	0.685	0.589	0.646
Type I diabetes mellitus	0.868	0.905	0.871	0.913	0.849	0.917	0.729
Tyrosine metabolism	0.762	0.768	0.748	0.742	0.717	0.711	0.61
Ubiquitin mediated proteolysis	0.691	0.664	0.634	0.676	0.649	0.622	0.608
Valine, leucine and isoleucine degradation	0.671	0.543	0.766	0.699	0.689	0.69	0.772
Vascular smooth muscle contraction	0.53	0.546	0.599	0.578	0.624	0.572	0.613
Vasopressin-regulated water reabsorption	0.639	0.681	0.529	0.605	0.613	0.623	0.634
VEGF signaling pathway	0.591	0.562	0.592	0.534	0.541	0.504	0.656
Vibrio cholerae infection	0.609	0.516	0.629	0.565	0.586	0.623	0.802
Viral myocarditis	0.698	0.71	0.742	0.764	0.747	0.745	0.673
Wnt signaling pathway	0.665	0.626	0.576	0.59	0.598	0.714	0.609

The rank boxplots below summarize the relative performance of the data tables in the functional prediction analysis. For each functional category, a rank is assigned to each data table based on its AUROC compared to the other data tables, where the best functional prediction rank is 1 and the poorest rank is the number of data tables.
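The ranking step can be sketched for a single category using the "Cell cycle" row of the table above. The sketch ranks only d1 to d6; whether the Protein column is included in the ranking and how ties are broken are not stated in the report, so both are left as assumptions here.

```python
# AUROCs for the "Cell cycle" category, copied from the table above.
cell_cycle = {
    "d1": 0.737, "d2": 0.728, "d3": 0.742,
    "d4": 0.729, "d5": 0.692, "d6": 0.732,
}

def rank_tables(aurocs):
    """Rank 1 goes to the highest AUROC; ties are not handled specially."""
    ordered = sorted(aurocs, key=aurocs.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

ranks = rank_tables(cell_cycle)
print(ranks)
```

Repeating this for every category and collecting the ranks per data table gives the distributions summarized in the rank boxplots.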

Comparison of each protein (RNA) data table to a designated RNA (protein) data table is also summarized in the scatter plots below. For each point, the AUROC for a given category in the RNA data table is plotted on the X-axis, whereas the corresponding AUROC in the protein data table is plotted on the Y-axis. The numbers of categories for which the protein data table outperforms the RNA data table (AUROC(protein) > 1.1 * AUROC(RNA); red dots) and vice versa (AUROC(RNA) > 1.1 * AUROC(protein); blue dots) are also shown.

[Figure panels: d1, d2, d3, d4, d5, d6]
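The 1.1-fold rule used to color the scatter plot points can be sketched with three rows from the AUROC table in this section. The sketch treats d1 as the RNA-level table and "Protein" as the designated protein table, which is an assumption based on the column names; `compare` is a hypothetical helper.

```python
# (category, AUROC in d1, AUROC in Protein), copied from the table above.
rows = [
    ("Amoebiasis", 0.598, 0.779),
    ("Antigen processing and presentation", 0.858, 0.573),
    ("Cell cycle", 0.737, 0.747),
]

def compare(rna_auc, protein_auc, factor=1.1):
    """Classify one category using the 1.1-fold rule from the text."""
    if protein_auc > factor * rna_auc:
        return "protein"   # red dot: protein data table outperforms RNA
    if rna_auc > factor * protein_auc:
        return "rna"       # blue dot: RNA data table outperforms protein
    return "neither"

calls = {name: compare(rna, protein) for name, rna, protein in rows}
n_protein_better = sum(call == "protein" for call in calls.values())
n_rna_better = sum(call == "rna" for call in calls.values())
print(calls)
```

Categories where the two AUROCs are within 10% of each other, such as "Cell cycle" here, count toward neither total.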

6.3 Sample class prediction

OmicsEV also allows for assessment of how well each data table can predict a user-specified class for each sample. For each data table, machine learning models are built to predict sample class (here, LumA vs. LumB). In OmicsEV, random forest models are built and evaluated using 5-fold cross-validation repeated 20 times. Please note that, depending on the class specified, this metric may or may not provide an indication of data quality. The results of the AUROC analysis performed using the models are summarized in the table and boxplots below.

dataSet	mean_ROC	median_ROC	sd_ROC
d1	0.9900	0.9910	0.0054
d2	0.9943	0.9952	0.0029
d3	0.9887	0.9880	0.0035
d4	0.9898	0.9904	0.0031
d5	0.9913	0.9922	0.0033
d6	0.9940	0.9952	0.0026
Protein	0.7949	0.7972	0.0136

6.4 PCA with sample class annotation

Another approach for assessing how well each data table can distinguish between classes is to determine how well each class can be separated by principal component analysis (PCA). In PCA score plots for each data table below, each point is a sample that is colored by class and that has a shape reflecting the batch. For a given sample, the PC2 score is plotted on the Y-axis whereas the PC1 score is plotted on the X-axis. Ellipses highlighting clusters of samples in each class are colored by corresponding class, and the separation between these ellipses indicates how well the variances captured by the first two PCs can distinguish between samples from different classes.

[Figure panels: d1, d2, d3, d4, d5, d6]

6.5 Unsupervised clustering

Unsupervised hierarchical clustering can reveal patterns in the data (clusters of genes or samples that behave more similarly to each other than to other genes or samples). Each heatmap below shows the results of hierarchical clustering for a given data table using ComplexHeatmap. Genes/proteins are in rows, while samples are in columns and labeled with corresponding class to visualize any potential associations between classes and clusters.

[Figure panels: d1, d2, d3, d4, d5, d6]

7 Multi-omics concordance

The concordance between the protein data and RNA data can be used to assess data quality when both RNA and protein data tables are available. Here, we evaluate gene- and sample-wise correlations between the protein and RNA data tables.

7.1 Gene-wise mRNA-protein correlation

The table below shows the number of genes with measurements (n) in each data table as well as the median of all gene-wise Spearman correlations between mRNA and protein measurements. The columns n5, n6, n7 and n8 show the number of genes with correlation greater than 0.5, 0.6, 0.7 and 0.8, respectively.

data table	n	n5	n6	n7	n8	gene_wise_cor
d1	9120	1911	773	210	21	0.3294
d2	9120	1970	827	222	23	0.3354
d3	9120	2010	827	227	25	0.3374
d4	9120	2016	837	224	27	0.3384
d5	9120	1763	696	185	20	0.3207
d6	9120	1932	763	205	20	0.3283
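The summary columns above can be illustrated with a small sketch that computes a Spearman correlation per gene (Pearson correlation of average ranks), then the median and the counts above each threshold. The gene names and values are made up, and all helpers are hypothetical rather than OmicsEV functions.

```python
from math import sqrt
from statistics import median

def average_ranks(values):
    """Ranks from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical mRNA and protein values for four genes across five samples.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = {
    "geneA": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneB": [5.0, 3.0, 4.0, 1.0, 2.0],
    "geneC": [2.0, 1.0, 3.0, 4.0, 5.0],
    "geneD": [4.0, 2.0, 1.0, 3.0, 5.0],
}

cors = {gene: spearman(mrna, values) for gene, values in protein.items()}
gene_wise_cor = median(cors.values())
# n5, n6, n7, n8: number of genes with correlation above each threshold.
counts = {t: sum(c > t for c in cors.values()) for t in (0.5, 0.6, 0.7, 0.8)}
print(round(gene_wise_cor, 4), counts)
```

In the report the same summary is computed over the 9120 genes measured in both the protein and the RNA data.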

Spearman correlation results are also shown for each gene/protein in the boxplot below.

Another way to visualize the differences between the distributions of all gene-wise RNA-protein correlations is with the cumulative distribution function (CDF) plot shown below. Here each line shows the cumulative distribution of the gene-wise correlations for one data table. The further the distribution function is shifted to the right, the more highly correlated the RNA and protein data are.

The histograms below provide another way to visualize the distribution of correlations for each protein (or RNA) data table with the RNA (or protein) data. Here the bars showing binned frequencies of positive correlations are in red, while those for negative correlations are in blue; summary statistics are also provided.

[Figure panels: d1, d2, d3, d4, d5, d6]

7.2 Sample-wise mRNA-protein correlation

Sample-wise RNA-protein correlations are summarized in the table below as the median of Spearman correlations for matched protein and RNA data from all pairs of samples for each data table, while the violin plots below show the distributions of these correlations for each data table.

data table	sample_wise_cor
d1	0.1421
d2	0.1421
d3	0.1421
d4	0.1421
d5	0.1421
d6	0.1383

Omics data tables evaluation report

2022-08-30
