Bioinformatics

Overview
Previous kjresearch has identified the abnormal enrichment of Fusobacterium nucleatum in colorectal cancer (CRC) patients .^[1-5] In our current project, we employed bioinformatics approaches to analyze the dataset downloaded from the GEO database, identifying key genes characterizing the interaction of Fn with CRC patient cells. We then established a scoring model based on these genes and applied it to a large cohort of CRC patient data from the TCGA database, classifying them into high and low groups. We observed significant prognostic differences between the groups, revealing that the presence of Fusobacterium nucleatum substantially impacts CRC patient disease progression and prognosis.
Guided by these findings, our iGEM project designed a dual-targeting peptide against both CRC and Fn, aiming to simultaneously eliminate both harmful factors and provide a more effective treatment strategy for this subset of CRC patients.

1 Background
Previous research has demonstrated the enrichment level of the Fn genus is highest among various microbial communities in CRC through the analysis of 16sRNA sequence data. ^[6] The classification overview of microbial genera (>0.1%) identified in CRC (T) and adjacent normal (N) tissues at the family level was analyzed. It can be observed that the microbial genus represented by the color red, Fusobacterium, exhibited significant differences between CRC and adjacent normal tissues.

Figure 1.Taxonomic Profile of the Microbiota at Genus level identified in colorectal carcinoma (T) and adjacent normal tissue (N).

Figure 2.Microbiome Landscape of Fusobacterium.

The microbial community landscape of Fusobacterium. The two-dimensional kernel density estimation (2D-kde) displays the abundance of Fusobacterium (log10-transformed relative abundance data) in CRC and adjacent normal (NA) tissues. The z-axis represents the estimated kernel density in a three-dimensional perspective view, showing the population frequency of Fusobacterium in CRC (y-axis) and normal (x-axis) tissues. The abundance gradually increases from normal samples to CRC samples in this population.The microbial community landscape of Fusobacterium. The two-dimensional kernel density estimation (2D-kde) displays the abundance of Fusobacterium (log10-transformed relative abundance data) in CRC and adjacent normal (NA) tissues. The z-axis represents the estimated kernel density in a three-dimensional perspective view, showing the population frequency of Fusobacterium in CRC (y-axis) and normal (x-axis) tissues. The abundance gradually increases from normal samples to CRC samples in this population.

2 Gene Selection

Figure 3. Flowchart

Figure 3. illustrates the workflow chart of data preparation, processing, analysis, and validation. The Sample GSM2417927 and Series GSE90944 were downloaded from the Gene Expression Omnibus Database (https://www.ncbi.nlm.nih.gov/geo/), they were datasets that contained the bulk-RNA sequence data of human colorectal cancer cells with and without cultured F.nucleatum treatment. The RNA expression profiles of Colon Cancer and Colon and Rectal Cancer, from the TCGA database (https://portal.gdc.cancer.gov/). We selected the genes representing the effect of Fn on CRC from GEO datasets by DEG and WGCNA analysis, then established a scoring model based on the TCGA patients dataset. Last, we analyze the different scoring groups with CRC patients’ survival, carrying out the functional enrichment analysis and immune microenvironment analysis to further explore possible mechanisms.

Figure 4. PCA map of GSM2417927 and GSE90944 after removing batch effects

To elucidate whether F. Nucleatum plays a role in colorectal cancer tumorigenesis, RNA-seq analysis was performed to compare the gene expression profiles of F. Nucleatum treated HT-29 cell lines or not in both databases. Furthermore, we use the SCV package in R to remove batch effects ^[7].

Figure 5.The volcano map of DEGs

Setting the criterion of |log2(FC)|>0.1 and adj.P.Value<0.1,1812 DEGs were screened out in the collected CRC datasets via Limma method ^[8], including 449 up-regulated genes and 1363 down-regulated genes. Relevant results were displayed in the volcano plot in Figure 5.

Figure 6. Analysis of the scale-free index for various soft-threshold powers (β) and mean connectivity for various soft-threshold powers.

Figure 7.Clustering and merging of co-expression modules

Figure 8. Correlation heatmap of module genes and clinical traits.

Figure 9. Scatterplot of correlation between modules and clinical traits.

The most correlated modules to Fn were identified through WGCNA ^[9]. We selected 8 as soft threshold, on account of average connectivity and scale independence (Figure 6). Clustering dendrogram of Fn. and control was illustrated in Figure 7. 72 co-expression modules of genes were generated in Figure 8 in different colors and correlations between Fn and co-expression modules were presented in Figure 9. The brown module (2528 genes) expressed the highest correlation with Fn, with a correlation coefficient of 0.62 and p-value lower than 1e-200, and was therefore identified to be a critical module for the following analysis.

Figure 10.10 Venn diagram

On the foundation of intersection of the DEGs and the screened module genes of Fn relating to CRC gene differential expression, we eventually selected 990 genes as our hub genes, and the intersection is demonstrated in Figure 10.

3 Model Establishment
We performed a univariate linear regression analysis on the dataset obtained from TCGA to identify the most significant factor influencing the outcome variable. By estimating the intercept and slope of the best-fitting line, we can determine the strength and direction of the relationship between the predictor and response variables, as well as assess the model's goodness of fit using the coefficient of determination (R²). Table 1 demonstrated the results of summarizing Gene, Hazard ratio, Confidence interval, and P-value, where we selected the genes with P-value<0.1, to further carry out the Lasso analysis.

Figure 11.The variation characteristics of the coefficient of variables;

Figure 12.the selection process of the optimum value of the parameter λ in the Lasso regression model by cross-validation method.

The LASSO (The least absolute shrinkage and selection operator algorithm) method ^[10], which is suitable for the regression of high-dimensional data, was used to select the most useful predictive variables from the primary data set. The “glmnet” package was used to perform the LASSO Cox regression model analysis. We utilized LASSO regression to select the most contributing element affecting the CRC from Fn.

Figure 13.Cox Multiple Factors Analysis

TheCox Multiple Factors Analysis picked out 11 underlying prospective biomarkers (AADAC, SIM2, SC5D, RGS12, PRSS1, LRFN1, HOXB8, HOXB13, HES4, FOXD4, DKK1). Then, we built up a Multivariate Cox model. After analyzing, we chose four genes whose P_value were under 0.01 including DKK1, HOXB13, LRFN1, and RGS12) and evaluated the accuracy of our model by bootstrap test (Fig.14)

4 Prognostic Analysis

Figure 14.ROC curve analysis

We conducted a receiver operating characteristic (ROC) curve to estimate the prediction effectiveness of the Fn risk signature in terms of survival rates from 3 to 5 years. Moreover, the areas under the ROC curve (AUC) ranged from 0.634770319255454 to 0.740664408073566, demonstrating a considerable predictive significance (Fig. 15)

Figure 15.Risk scores distribution analysis

Additionally, we examined the relationship between risk score and survival status. As illustrated in Fig. 14, Our findings revealed that in the low-risk cohort, the number of living statuses substantially elevated in contrast with that of the high-risk cohort.

Figure 16.Survival analysis for salivary in different groups

To assess the prognostic associations of Fn affecting subtypes with overall survival (OS), Kaplan-Meier analysis was conducted to explore the prognostic associations using the “survival” package (version 3.1–11) in R and the Kaplan-Meier curves were plotted by the “service” package (version 0.4.6). The differences in OS among different subtypes were detected by the log-rank test and a P < 0.05 was considered statistically significant.

5 FEA

Figure 17.Gene ontology enrichment analysis of selected genes.

Figure 18. Kyoto Encyclopedia of Genes and Genomes enrichment

On the foundation of the 11 screened-out genes from LASSO, we performed functional enrichment analysis(FEA). KEGG/GO analysis reveals that the “DNA-binding transcription repressor activity, RNA polymerase II-specific process terms”, ''anterior/posterior pattern specification'' (Figure. 17), and “Cytosolic DNA-sensing pathway” (Figure. 18) of molecular function were the pathways that hub genes primarily enriched respectively.

6 TIME

Figure 19. Heatmap of immune infiltration

We further applied immune infiltration based on the CIBERSORT ^[11]to analyse the tumor immune microenvironment(TIME). The “CIBERSORT” package of R was used to quantify the relative proportion of 22 infiltrating immune cells. The heatmap revealed that T cells and monocytes were enriched in the high-risk group of CRC patients, whereas the mast cells activated were up-regulated in the low-risk group, which may suggest the mechanism from an immune perspective.