Background

In this project, we intend to use bacteria as drug carriers that target lung cancer. Different organs host different bacterial species because each organ presents a distinct environment. Our main objective is to identify the ideal chassis bacterium for drug delivery. The optimal chassis bacterium should be compatible with the pulmonary environment, be resilient, target tumor cells strongly, and have low pathogenicity to minimize potential adverse effects.

Challenges

1. Each bacterium is a distinct species, which precludes uniform analysis or assessment of gene interactions using methods such as GO or KEGG.

2. RNAseq data for bacteria in specific environments are limited.

Objectives and Methods

To identify a suitable chassis, we construct a machine learning model that analyzes the differences between bacteria collected from the human lung microbiome and the same bacteria cultured in vitro. We also investigate the differences between bacteria collected from cancerous environments and the same bacteria cultured in vitro. Our objective is to identify the distinctive features, functions, and relevant genes of bacteria in both cancerous and normal lung tissues.

To this end, we aggregate bacterial transcriptomic data derived from RNAseq results into a matrix containing sample information, features, and expression levels. Every sample is also tagged with a categorical label. We then train a random forest model on these data. Through classification training, we expect to identify patterns characteristic of the pulmonary and tumor environments. We then match these patterns against candidate bacteria and select the most suitable one as our chassis bacterium.
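The workflow described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not our actual pipeline: the matrix layout (rows as samples, columns as features), the binary label, the split ratio, and the hyperparameters are all assumptions made for the example. Only RandomForestClassifier itself comes from our method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the expression matrix (NOT the project's data):
# rows are samples, columns are features, values mimic expression levels.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))

# Hypothetical binary label: 1 = collected from the environment,
# 0 = cultured in vitro.
y = rng.integers(0, 2, size=n_samples)

# Make the first three features informative so there is a pattern to learn.
X[y == 1, :3] += 2.0

# Hold out a test split so evaluation uses samples unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are illustrative assumptions, not tuned values.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)

# Rank features by impurity-based importance, highest first.
importances = clf.feature_importances_
top_features = np.argsort(importances)[::-1][:3]

print(f"held-out accuracy: {test_accuracy:.2f}")
print("most important feature indices:", top_features.tolist())
```

The feature importances are what make a random forest convenient here: besides the class prediction, they indicate which features drive the separation between environments.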

Templates

Data Template

This is a segment of our dataset structure related to lung data. There are 18723 rows in total.

Fig 1A. Overview of the data template

Code Template

This is a segment of our machine learning code related to lung data. The complete code can be accessed at this GitLab link.

Fig 1B. Overview of the code template

Results

Accuracy Validation

We verified the model's accuracy and precision in two ways: mathematically and biologically.

To verify the model accuracy and precision from a mathematical perspective, we use the ROC AUC plot and the confusion matrix. The ROC (Receiver Operating Characteristic) is a graphical representation showcasing the performance of a binary classification model across all classification thresholds. Specifically, the y-axis of the ROC represents the True Positive Rate (TPR, also known as sensitivity), and the x-axis represents the False Positive Rate (FPR, or 1-specificity). The curve illustrates how the model's TPR and FPR change with varying classification thresholds. The AUC (Area Under Curve) is essentially the area beneath the ROC curve. AUC values range between 0 and 1, where 1 signifies a perfect classifier, and 0.5 indicates a model that performs no better than random guessing.
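To make the definition concrete, the sketch below computes ROC points by sweeping the threshold over the predicted scores and then integrates the curve with the trapezoidal rule. It is a plain-Python illustration under stated assumptions: the function names and the four-sample toy input are hypothetical, and in practice a library routine would normally be used instead.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    Assumes binary labels (1 = positive, 0 = negative) and that both
    classes are present.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Lower the threshold step by step; each distinct score is one threshold.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: two negatives and two positives with overlapping scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))  # 0.75
```

In this toy case one positive sample scores below one negative sample, so the curve falls short of the top-left corner and the AUC drops from 1 to 0.75.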

The confusion matrix, on the other hand, is a specific table layout useful for visualizing and evaluating the performance of supervised learning algorithms. For binary classification problems, this matrix is of size 2x2, representing the actual versus predicted categories. The four primary components of a confusion matrix are True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP). These components help delineate the types of errors made by the classifier.
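As an illustration, the four components can be counted directly from paired actual and predicted labels. The sketch below is hypothetical: the function names and the eight toy labels are made up for the example and are not taken from our dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary classification problem."""
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            counts["FP" if predicted == positive else "TN"] += 1
    return counts


def summary_rates(counts):
    """Derive common rates; assumes both classes occur in the data."""
    tp, fn, tn, fp = (counts[k] for k in ("TP", "FN", "TN", "FP"))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # 1 - false positive rate
    }


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

counts = confusion_counts(actual, predicted)
rates = summary_rates(counts)
print(counts)
print(rates)
```

Sensitivity and specificity computed this way correspond to one point on the ROC curve, namely the one at the threshold used to produce the predicted labels.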

The rationale behind choosing the ROC AUC and confusion matrix for model evaluation is multi-fold. The ROC AUC provides a comprehensive metric of the model's performance, independent of a specific classification threshold. This is especially beneficial when comparing the performance of multiple models, allowing for an intuitive understanding of which model outperforms across various thresholds. The confusion matrix offers a clear depiction of the types of classification errors made by the model. These metrics offer insights into the model's performance on both positive and negative classes and guide potential model refinement and optimization.

Furthermore, to validate our model biologically, we performed a differential expression analysis using RNAseq results from bacteria that were not included in the training set. From our prediction data, we collected all genes that the model marked as present in a specific environment. These are the environment-specific genes identified through machine learning. Our differentially expressed gene (DEG) dataset, in turn, contains the genes that are differentially expressed when the bacteria are in that environment. We performed an overlap analysis between these two gene sets: the higher the overlap, the greater the biological accuracy and reliability of our model.
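An overlap analysis of this kind can be sketched with plain set operations. The text does not fix how the overlap value is normalized, so the sketch reports several candidate measures; which one corresponds to our reported percentage is left open. The gene names are placeholders, not real identifiers.

```python
def overlap_stats(predicted_genes, deg_genes):
    """Compare model-predicted genes with differentially expressed genes."""
    predicted = set(predicted_genes)
    degs = set(deg_genes)
    shared = predicted & degs
    union = predicted | degs
    return {
        "shared": sorted(shared),
        # Share of the DEGs that the model also identified.
        "fraction_of_degs": len(shared) / len(degs) if degs else 0.0,
        # Share of the model's genes that are confirmed as DEGs.
        "fraction_of_predicted": len(shared) / len(predicted) if predicted else 0.0,
        # Symmetric measure: intersection over union.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder gene names (hypothetical).
model_genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
deg_genes = ["gene_b", "gene_c", "gene_e"]

stats = overlap_stats(model_genes, deg_genes)
print(stats)
```

Reporting more than one normalization is useful because the two gene sets can differ greatly in size, in which case the measures diverge.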

Lung Data

The ROC curve lies close to the top-left corner, which is characteristic of strong classifier performance, and gives an AUC of 0.99 for the RandomForestClassifier. The confusion matrix provides further validation, showing high rates of both true positives and true negatives. Finally, when compared with the actual DEG results, the model's predictions for "Acinetobacter" align closely with the empirical data, indicating consistency between the model's predictions and the real DEGs.

Fig 2A. ROC AUC plot of the model with lung data

Fig 2B. Confusion matrix of the model with lung data

Fig 2C. DEG match result of the model with lung data

Tumor Data

The ROC curve again lies close to the top-left corner, with an AUC of 0.99 for the RandomForestClassifier, indicating strong classification performance. The confusion matrix confirms high rates of true positives and true negatives (98.17% overall). Finally, when compared with the actual DEG results, the model's predictions for 'Acinetobacter' show a consistency of 55% with the real DEGs.

Fig 3A. ROC AUC plot of the model with tumor data

Fig 3B. Confusion matrix of the model with tumor data

Fig 3C. DEG match result of the model with tumor data

From a mathematical perspective, our model has very high accuracy. From a biological perspective, the gene overlap shows that the model has a certain level of accuracy, but it is lower than the mathematical validation suggests. To address this, we plan to use a language model to study the description of each gene, which may reveal more biological links between the genes of different bacteria. We expect this to improve the accuracy and biological reliability of our model.

Matching Results

For the matching step, we selected three bacteria that have previously been reported as engineered bacteria for cancer immunotherapy: Salmonella typhimurium (S. typhimurium), Pseudomonas aeruginosa, and Escherichia coli (E. coli).

Lung Data

Fig 4A. Identification of lung data

Tumor Data

Fig 4B. Identification of tumor data

Taken together, the results show that E. coli and Pseudomonas aeruginosa perform best on the lung environment dataset, while E. coli and S. typhimurium perform best on the tumor environment dataset. These observations are crucial for selecting the appropriate engineered bacterium and provide key guidance for our wet lab experiments. Because E. coli is the only candidate that performs well in both environments, we conclude that it is the most suitable chassis bacterium and should be chosen for our wet lab experiments.

Discussion

After analyzing multiple potential carrier bacteria, we determined that Escherichia coli (E. coli) is the ideal engineered bacterium for drug delivery in the context of lung cancer. Several reasons support this selection. First, E. coli survives and adapts robustly across different environments, which is crucial for its ability to thrive in the lung and deliver drugs to tumor cells. Second, extensive genomic research has made the biological characteristics of E. coli well documented, which helps us understand and control the bacterium and supports precise drug delivery and tumor cell targeting. Third, the engineering potential of E. coli is well established: its metabolic pathways and gene expression systems are highly tunable, so it can be readily edited and engineered to provide the functions required for drug release and cellular targeting. Lastly, E. coli generally has lower pathogenicity than many other bacteria, which reduces the potential risk to patients during drug delivery. In summary, E. coli is a reasoned choice of carrier bacterium for lung cancer drug delivery, given its survivability, well-characterized biology, engineering capabilities, and safety profile. We expect this choice to advance research and development in lung cancer treatment and to offer patients more effective therapeutic options.

To address the challenges described above, a language model that learns the description of each gene could offer an alternative way to find links between the genes of different bacteria. We expect this to enhance the biological significance and accuracy of our model.

Our study has some limitations, including insufficient data for bacteria in specific environments and the simplification of our machine learning model. These limitations could impact the comprehensiveness of our conclusions regarding the ideal chassis bacterium for lung cancer drug delivery, as they do not encompass the full spectrum of bacterial diversity and might overlook certain biological nuances within complex systems.

Looking ahead, further integration of machine learning methods becomes increasingly relevant. We envision developing a machine learning model that comprehensively assesses and quantifies the significance of both the lung and tumor data. We expect this holistic approach to improve our ability to pinpoint and select the most appropriate bacteria, ultimately leading to higher accuracy and precision.

References

[1] scikit-learn developers. (2023). RandomForestClassifier. In scikit-learn: Machine Learning in Python. Retrieved from https://scikit-learn.org/1.0/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[2] Gálvez, Eric J C et al. (2020) “Distinct Polysaccharide Utilization Determines Interspecies Competition between Intestinal Prevotella spp.” Cell host & microbe vol. 28,6: 838-852.e6. doi:10.1016/j.chom.2020.09.012

[3] Galeano Niño, Jorge Luis et al. (2022) “Effect of the intratumoral microbiota on spatial and cellular heterogeneity in cancer.” Nature vol. 611,7937: 810-817. doi:10.1038/s41586-022-05435-0

[4] Simonte, Francesca M et al. (2017) “Investigation on the anaerobic propionate degradation by Escherichia coli K12.” Molecular microbiology vol. 103,1: 55-66. doi:10.1111/mmi.13541

[5] Helliwell, Emily et al. (2023) “Environmental influences on Streptococcus sanguinis membrane vesicle biogenesis.” The ISME journal vol. 17,9: 1430-1444. doi:10.1038/s41396-023-01456-3

[6] McNulty, Ryan et al. (2023) “Probe-based bacterial single-cell RNA sequencing predicts toxin regulation.” Nature microbiology vol. 8,5: 934-945. doi:10.1038/s41564-023-01348-4

[7] Dötsch, Andreas et al. (2015) “The Pseudomonas aeruginosa Transcriptional Landscape Is Shaped by Environmental Heterogeneity and Genetic Variation.” mBio vol. 6,4 e00749. 30 Jun. 2015, doi:10.1128/mBio.00749-15

[8] Sun, Zhenxin et al. (2023) “A virulence activator of a surface attachment protein in Burkholderia pseudomallei acts as a global regulator of other membrane-associated virulence factors.” Frontiers in microbiology vol. 13 1063287. 16 Jan. 2023, doi:10.3389/fmicb.2022.1063287

[9] Langouët-Astrié, Christophe et al. (2022) “The influenza-injured lung microenvironment promotes MRSA virulence, contributing to severe secondary bacterial pneumonia.” Cell reports vol. 41,9: 111721. doi:10.1016/j.celrep.2022.111721

[10] Eijkelkamp, Bart A et al. (2014) “Comparative analysis of surface-exposed virulence factors of Acinetobacter baumannii.” BMC genomics vol. 15,1 1020. 25 Nov. 2014, doi:10.1186/1471-2164-15-1020

[11] Eckweiler, D. and Häussler, S. (2018) 'Antisense transcription in Pseudomonas aeruginosa', Microbiology (Reading, England), 164(6), pp. 889-895. Available at: https://doi.org/10.1099/mic.0.000664.

[12] McClary, J.S. and Boehm, A.B. (2018) 'Transcriptional Response of Staphylococcus aureus to Sunlight in Oxic and Anoxic Conditions', Frontiers in Microbiology, 9, p. 249. Available at: https://doi.org/10.3389/fmicb.2018.00249.

[13] Salmonella enterica Genome sequencing and assembly | Ag Data Commons (no date). Available at: https://data.nal.usda.gov/dataset/salmonella-enterica-genome-sequencing-and-assembly (Accessed: 11 October 2023).

[14] Schmitz-Esser, S. and Wagner, M. (2014) 'Genome sequencing of Listeria monocytogenes', Methods in Molecular Biology (Clifton, N.J.), 1157, pp. 223-232. Available at: https://doi.org/10.1007/978-1-4939-0703-8_19.