1 Question:

The way we do function anntation for micrbiome is basically doing blast agaist know fasta sequence with known functions.
Currently one of the good quality annotation database is EggNOG mappper.
MetaLab also adopted this blast strategy
One detrimental problem of this workflow is that one seuqnece mostly likely has multiple function annotation. This brings out the practical question: shall I use the first matched function or use all of them?
- The simplest way is using the first one
- Using all items is not impossible, but requires a lot of data transformation.
- which one is right/better? what is the relationship between these two?

Here we check this using one example to explore the effect. The annotation table functions.csv is directly from MetaLab (above 2.1, with the new function anotation databawe from EggNOG 5.0, see how it is generated )

In this example, we use KEGG.name column as an example. Check other columns/functions as you want. avaliable functions are:

“Gene_Ontology_id”
“Gene_Ontology_name”
“Gene_Ontology_namespace” “EC_id”
“EC_de”
“EC_an”
“EC_ca”
“KEGG_ko”
“KEGG.accession”
“KEGG.name”
“KEGG_Module”
“KEGG_Reaction”
“KEGG_rclass”
“BRITE”
“KEGG_TC”
“CAZy”
“BiGG_Reaction”
“COG.accession”
“COG.name”
“COG.category”
“NOG.accession”
“NOG.name”
“NOG.category”
“eggNOG_description”

2 Prepare data

functions.csv can be found here:

Psedo code: * readin the file, with the option of na.strings = "" to put NA for rows without matched function names * ony keep the selected function column and the intensity columns * remove rows without matched function names * split the funtion names, either keep all for the “flatten” data, or keep the first the “first” test * calculate the mean according to the function name

Some facts * using only the first one will loss quite a lot of function items * “first” data is a subset of the “flatten” data * the function items are listed without obvious orders

library(tidyverse)
library(reshape2)
library(tidytext)

## Warning: package 'tidytext' was built under R version 4.0.5

data_fun <- read.csv("functions.csv", header = TRUE, sep = ",",na.strings = "")

# data_fun <- data_fun[!duplicated(data_fun$Group_ID),] # if the table has multiple lines from teh same protein group, 

target_column_name <- "KEGG.name" # choose the column to test, here we use KEGG.name for testing

df_func_intensity_flatten_mean <- data_fun %>% 
  select(c(target_column_name, starts_with("Intensity"))) %>% 
  drop_na(any_of(target_column_name)) %>% 
  tidytext::unnest_tokens(.,function_name, target_column_name, token = "regex", pattern = ",\\s*") %>% 
  select(-1) %>% 
  group_by(function_name) %>% 
  summarise_all(mean)
  

# only use the first one, aggregate
df_func_intensity_first_mean<- data_fun %>% 
  select(c(target_column_name, starts_with("Intensity"))) %>% 
  drop_na(any_of(target_column_name)) %>% 
  separate(target_column_name, "function_name", extra =  'drop',sep = ",") %>% 
  mutate(function_name = tolower(function_name)) %>% 
  group_by(function_name) %>% 
  summarise_all(mean)

Now we have the 2 data table

df_func_intensity_flatten_mean has 152 unique function items
df_func_intensity_first_mean has 93 unique function items

3 Overal profile of the overlapped data

For better visualization/comparision of the consistency, log10 value is used and data is scaled by row

# check the overlap of the two method 
func_compare_intensity <- inner_join(df_func_intensity_flatten_mean, df_func_intensity_first_mean, by = c("function_name"),suffix = c(".flatten", ".first"))


# prepare the matrix to plot
func_compare_intensity_plot <- func_compare_intensity %>% 
  column_to_rownames('function_name') %>% 
  rename_with(~gsub("Intensity.Experiment_","",.)) %>% 
  +1 %>% 
  log10() %>% 
  filter_all(any_vars(. != 0)) %>% 
  as.matrix()

# plot heatmap without  column clustering
gplots::heatmap.2(func_compare_intensity_plot, scale = "row",Colv = FALSE, dendrogram  = "row", trace  = "none")

pheatmap::pheatmap(func_compare_intensity_plot,scale = "row", cluster_cols = FALSE)

4 Profile of the flatten only

# check what functions are only in flattern way
df_func_intensity_falltern_only <- df_func_intensity_flatten_mean[!df_func_intensity_flatten_mean$function_name %in% func_compare_intensity$function_name,]

df_func_intensity_falltern_only <- df_func_intensity_falltern_only %>% 
  column_to_rownames('function_name') %>% 
  rename_with(~gsub("Intensity.Experiment_","",.)) %>% 
  +1 %>% 
  #log10() %>% 
  filter_all(any_vars(. != 0)) %>% 
  as.matrix()

You will lose all the following flatten-only functions if you only use the first one for your analysis:

2-oxocarboxylic acid metabolism, acarbose and validamycin biosynthesis, aminobenzoate degradation, arachidonic acid metabolism, ascorbate and aldarate metabolism, aspartate and glutamate metabolism, atrazine degradation, base excision repair, biofilm formation - vibrio cholerae, biosynthesis of amino acids, biosynthesis of ansamycins, biosynthesis of secondary metabolites, biosynthesis of vancomycin group antibiotics, carbon fixation in photosynthetic organisms, carbon fixation pathways in prokaryotes, carbon metabolism, carotenoid biosynthesis, chloroalkane and chloroalkene degradation, d-arginine and d-ornithine metabolism, degradation of aromatic compounds, dioxin degradation, drug metabolism - cytochrome p450, drug metabolism - other enzymes, fatty acid metabolism, flavone and flavonol biosynthesis, fluorobenzoate degradation, glucosinolate biosynthesis, glycosphingolipid biosynthesis - ganglio series, glycosphingolipid biosynthesis - globo and isoglobo series, insect hormone biosynthesis, isoquinoline alkaloid biosynthesis, kanamycin and gentamicin biosynthesis, leucine and isoleucine biosynthesis, leucine and isoleucine degradation, limonene and pinene degradation, metabolism of xenobiotics by cytochrome p450, microbial metabolism in diverse environments, naphthalene degradation, neomycin, nitrogen metabolism, nonribosomal peptide structures, novobiocin biosynthesis, peptidoglycan biosynthesis, phenazine biosynthesis, phenylpropanoid biosynthesis, piperidine and pyridine alkaloid biosynthesis, prodigiosin biosynthesis, retinol metabolism, rna polymerase, secondary bile acid biosynthesis, serine and threonine metabolism, sphingolipid metabolism, sulfur relay system, tropane, type i polyketide structures, tyrosine and tryptophan biosynthesis, vancomycin resistance, various types of n-glycan biosynthesis, xylene degradation

Here the orignial intensity data is used to highlight the abundance, showing how many abuandant function items would be lost in this workflow

pheatmap::pheatmap(df_func_intensity_falltern_only, cluster_cols = FALSE)

5 see Abundance Correlation between “flatten” and “first” workflow

# melt the aggretated matrix with intensity values
df_func_intensity_flatten_mean_melt <- melt(df_func_intensity_flatten_mean, id = "function_name")
df_func_intensity_first_mean_melt <- melt(df_func_intensity_first_mean, id = "function_name")

# join the two df
p <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
  mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>% 
  mutate(variable = as.factor(variable)) %>% 
  ggplot() + 
  geom_point(aes(x = value.flatten, y=value.first)) + 
  scale_x_continuous(trans='log10') +
  scale_y_continuous(trans='log10') +
  ggtitle("Abundance Correlation between first item and all")

# facet plot
p+facet_wrap(vars(variable))

Some dots on x axis (with value.first as zero skewed the plot. Let’s replot with remove all these values

# join the two df
p <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
  filter(value.flatten>1, value.first >1) %>% 
  mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>% 
  mutate(variable = as.factor(variable)) %>% 
  ggplot() + 
  geom_point(aes(x = value.flatten, y=value.first)) + 
  scale_x_continuous(trans='log10') +
  scale_y_continuous(trans='log10') +
  ggtitle("Abundance Correlation between first item and all")

p+facet_wrap(vars(variable))

let’s print all the plots for each function check this pdf file here

# print out all KEGG correlation plot

corr_intensity <- inner_join(df_func_intensity_flatten_mean_melt, df_func_intensity_first_mean_melt, by = c("function_name", "variable"),suffix = c(".flatten", ".first")) %>%
  mutate(value.flatten = value.flatten+1, value.first =value.first+1) %>% 
  mutate(variable = as.factor(variable))

func_list <- corr_intensity$function_name %>% unique()

pdf("all_function_KEGG.pdf")

for(i in 1:length(func_list)){
 p <-  corr_intensity %>% filter(function_name == func_list[i]) %>% ggplot() +
  geom_point(aes(x = value.flatten, y=value.first)) +
  scale_x_continuous(trans='log10') +
  scale_y_continuous(trans='log10')+
   ggtitle(func_list[i])
 print(p)
 
}

dev.off()

6 sessionInfo

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidytext_0.3.1  reshape2_1.4.4  forcats_0.5.1   stringr_1.4.0  
##  [5] dplyr_1.0.5     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3    
##  [9] tibble_3.1.0    ggplot2_3.3.3   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6         lubridate_1.7.10   lattice_0.20-41    gtools_3.8.2      
##  [5] assertthat_0.2.1   digest_0.6.27      utf8_1.1.4         R6_2.5.0          
##  [9] cellranger_1.1.0   plyr_1.8.6         backports_1.2.1    reprex_1.0.0      
## [13] evaluate_0.14      highr_0.8          httr_1.4.2         pillar_1.5.1      
## [17] gplots_3.1.1       rlang_0.4.10       readxl_1.3.1       rstudioapi_0.13   
## [21] jquerylib_0.1.3    Matrix_1.3-2       rmarkdown_2.7      pheatmap_1.0.12   
## [25] munsell_0.5.0      broom_0.7.5        compiler_4.0.4     janeaustenr_0.1.5 
## [29] modelr_0.1.8       xfun_0.22          pkgconfig_2.0.3    htmltools_0.5.1.1 
## [33] tidyselect_1.1.0   fansi_0.4.2        crayon_1.4.1       dbplyr_2.1.0      
## [37] withr_2.4.1        bitops_1.0-6       SnowballC_0.7.0    grid_4.0.4        
## [41] jsonlite_1.7.2     gtable_0.3.0       lifecycle_1.0.0    DBI_1.1.1         
## [45] magrittr_2.0.1     scales_1.1.1       tokenizers_0.2.1   KernSmooth_2.23-18
## [49] cli_2.3.1          stringi_1.5.3      farver_2.1.0       fs_1.5.0          
## [53] xml2_1.3.2         bslib_0.2.4        ellipsis_0.3.1     generics_0.1.0    
## [57] vctrs_0.3.6        RColorBrewer_1.1-2 tools_4.0.4        glue_1.4.2        
## [61] hms_1.0.0          yaml_2.2.1         colorspace_2.0-0   caTools_1.18.1    
## [65] rvest_1.0.0        knitr_1.33         haven_2.3.1        sass_0.3.1

How to deal with function redundancy?

Suggestions to techteam.metalab@gmail.com

Report generated @2021-05-25 12:32:41