Here we check this using one example to explore the effect. The annotation table functions.csv is directly from MetaLab (above 2.1, with the new function anotation databawe from EggNOG 5.0, see how it is generated )
In this example, we use KEGG.name column as an example. Check other columns/functions as you want. avaliable functions are:
“Gene_Ontology_id”
“Gene_Ontology_name”
“Gene_Ontology_namespace” “EC_id”
“EC_de”
“EC_an”
“EC_ca”
“KEGG_ko”
“KEGG.accession”
“KEGG.name”
“KEGG_Module”
“KEGG_Reaction”
“KEGG_rclass”
“BRITE”
“KEGG_TC”
“CAZy”
“BiGG_Reaction”
“COG.accession”
“COG.name”
“COG.category”
“NOG.accession”
“NOG.name”
“NOG.category”
“eggNOG_description”
functions.csv can be found here:
Psedo code: * readin the file, with the option of na.strings = "" to put NA for rows without matched function names * ony keep the selected function column and the intensity columns * remove rows without matched function names * split the funtion names, either keep all for the “flatten” data, or keep the first the “first” test * calculate the mean according to the function name
Some facts * using only the first one will loss quite a lot of function items * “first” data is a subset of the “flatten” data * the function items are listed without obvious orders
library(tidyverse)
library(reshape2)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.5
data_fun <- read.csv("functions.csv", header = TRUE, sep = ",",na.strings = "")
# data_fun <- data_fun[!duplicated(data_fun$Group_ID),] # if the table has multiple lines from teh same protein group,
target_column_name <- "KEGG.name" # choose the column to test, here we use KEGG.name for testing
df_func_intensity_flatten_mean <- data_fun %>%
select(c(target_column_name, starts_with("Intensity"))) %>%
drop_na(any_of(target_column_name)) %>%
tidytext::unnest_tokens(.,function_name, target_column_name, token = "regex", pattern = ",\\s*") %>%
select(-1) %>%
group_by(function_name) %>%
summarise_all(mean)
# only use the first one, aggregate
df_func_intensity_first_mean<- data_fun %>%
select(c(target_column_name, starts_with("Intensity"))) %>%
drop_na(any_of(target_column_name)) %>%
separate(target_column_name, "function_name", extra = 'drop',sep = ",") %>%
mutate(function_name = tolower(function_name)) %>%
group_by(function_name) %>%
summarise_all(mean)
Now we have the 2 data table
For better visualization/comparision of the consistency, log10 value is used and data is scaled by row
# check the overlap of the two method
func_compare_intensity <- inner_join(df_func_intensity_flatten_mean, df_func_intensity_first_mean, by = c("function_name"),suffix = c(".flatten", ".first"))
# prepare the matrix to plot
func_compare_intensity_plot <- func_compare_intensity %>%
column_to_rownames('function_name') %>%
rename_with(~gsub("Intensity.Experiment_","",.)) %>%
+1 %>%
log10() %>%
filter_all(any_vars(. != 0)) %>%
as.matrix()
# plot heatmap without column clustering
gplots::heatmap.2(func_compare_intensity_plot, scale = "row",Colv = FALSE, dendrogram = "row", trace = "none")
pheatmap::pheatmap(func_compare_intensity_plot,scale = "row", cluster_cols = FALSE)
# check what functions are only in flattern way
df_func_intensity_falltern_only <- df_func_intensity_flatten_mean[!df_func_intensity_flatten_mean$function_name %in% func_compare_intensity$function_name,]
df_func_intensity_falltern_only <- df_func_intensity_falltern_only %>%
column_to_rownames('function_name') %>%
rename_with(~gsub("Intensity.Experiment_","",.)) %>%
+1 %>%
#log10() %>%
filter_all(any_vars(. != 0)) %>%
as.matrix()
You will lose all the following flatten-only functions if you only use the first one for your analysis:
2-oxocarboxylic acid metabolism, acarbose and validamycin biosynthesis, aminobenzoate degradation, arachidonic acid metabolism, ascorbate and aldarate metabolism, aspartate and glutamate metabolism, atrazine degradation, base excision repair, biofilm formation - vibrio cholerae, biosynthesis of amino acids, biosynthesis of ansamycins, biosynthesis of secondary metabolites, biosynthesis of vancomycin group antibiotics, carbon fixation in photosynthetic organisms, carbon fixation pathways in prokaryotes, carbon metabolism, carotenoid biosynthesis, chloroalkane and chloroalkene degradation, d-arginine and d-ornithine metabolism, degradation of aromatic compounds, dioxin degradation, drug metabolism - cytochrome p450, drug metabolism - other enzymes, fatty acid metabolism, flavone and flavonol biosynthesis, fluorobenzoate degradation, glucosinolate biosynthesis, glycosphingolipid biosynthesis - ganglio series, glycosphingolipid biosynthesis - globo and isoglobo series, insect hormone biosynthesis, isoquinoline alkaloid biosynthesis, kanamycin and gentamicin biosynthesis, leucine and isoleucine biosynthesis, leucine and isoleucine degradation, limonene and pinene degradation, metabolism of xenobiotics by cytochrome p450, microbial metabolism in diverse environments, naphthalene degradation, neomycin, nitrogen metabolism, nonribosomal peptide structures, novobiocin biosynthesis, peptidoglycan biosynthesis, phenazine biosynthesis, phenylpropanoid biosynthesis, piperidine and pyridine alkaloid biosynthesis, prodigiosin biosynthesis, retinol metabolism, rna polymerase, secondary bile acid biosynthesis, serine and threonine metabolism, sphingolipid metabolism, sulfur relay system, tropane, type i polyketide structures, tyrosine and tryptophan biosynthesis, vancomycin resistance, various types of n-glycan biosynthesis, xylene degradation
Here the orignial intensity data is used to highlight the abundance, showing how many abuandant function items would be lost in this workflow
pheatmap::pheatmap(df_func_intensity_falltern_only, cluster_cols = FALSE)
# melt the aggretated matrix with intensity values
df_func_intensity_flatten_mean_melt <- melt(df_func_intensity_flatten_mean, id = "function_name")
df_func_intensity_first_mean_melt <- melt(df_func_intensity_first_mean, id = "function_name")
# join the two df
p <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>%
mutate(variable = as.factor(variable)) %>%
ggplot() +
geom_point(aes(x = value.flatten, y=value.first)) +
scale_x_continuous(trans='log10') +
scale_y_continuous(trans='log10') +
ggtitle("Abundance Correlation between first item and all")
p
# facet plot
p+facet_wrap(vars(variable))
Some dots on x axis (with value.first as zero skewed the plot. Let’s replot with remove all these values
# join the two df
p <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
filter(value.flatten>1, value.first >1) %>%
mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>%
mutate(variable = as.factor(variable)) %>%
ggplot() +
geom_point(aes(x = value.flatten, y=value.first)) +
scale_x_continuous(trans='log10') +
scale_y_continuous(trans='log10') +
ggtitle("Abundance Correlation between first item and all")
p
p+facet_wrap(vars(variable))
let’s print all the plots for each function check this pdf file here
# print out all KEGG correlation plot
corr_intensity <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>%
mutate(variable = as.factor(variable))
func_list <- corr_intensity$function_name %>% unique()
pdf("all_function_KEGG.pdf")
for(i in 1:length(func_list)){
p <- corr_intensity %>% filter(function_name == func_list[i]) %>% ggplot() +
geom_point(aes(x = value.flatten, y=value.first)) +
scale_x_continuous(trans='log10') +
scale_y_continuous(trans='log10')+
ggtitle(func_list[i])
print(p)
}
dev.off()
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidytext_0.3.1 reshape2_1.4.4 forcats_0.5.1 stringr_1.4.0
## [5] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
## [9] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lubridate_1.7.10 lattice_0.20-41 gtools_3.8.2
## [5] assertthat_0.2.1 digest_0.6.27 utf8_1.1.4 R6_2.5.0
## [9] cellranger_1.1.0 plyr_1.8.6 backports_1.2.1 reprex_1.0.0
## [13] evaluate_0.14 highr_0.8 httr_1.4.2 pillar_1.5.1
## [17] gplots_3.1.1 rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13
## [21] jquerylib_0.1.3 Matrix_1.3-2 rmarkdown_2.7 pheatmap_1.0.12
## [25] munsell_0.5.0 broom_0.7.5 compiler_4.0.4 janeaustenr_0.1.5
## [29] modelr_0.1.8 xfun_0.22 pkgconfig_2.0.3 htmltools_0.5.1.1
## [33] tidyselect_1.1.0 fansi_0.4.2 crayon_1.4.1 dbplyr_2.1.0
## [37] withr_2.4.1 bitops_1.0-6 SnowballC_0.7.0 grid_4.0.4
## [41] jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1
## [45] magrittr_2.0.1 scales_1.1.1 tokenizers_0.2.1 KernSmooth_2.23-18
## [49] cli_2.3.1 stringi_1.5.3 farver_2.1.0 fs_1.5.0
## [53] xml2_1.3.2 bslib_0.2.4 ellipsis_0.3.1 generics_0.1.0
## [57] vctrs_0.3.6 RColorBrewer_1.1-2 tools_4.0.4 glue_1.4.2
## [61] hms_1.0.0 yaml_2.2.1 colorspace_2.0-0 caTools_1.18.1
## [65] rvest_1.0.0 knitr_1.33 haven_2.3.1 sass_0.3.1