INTEGRATED HP

responsible and good for the world

Overview
NJU-China is devoted to creating an easy-to-use AI model and a paradigm specifically for synthetic biology research, and to proposing new solutions to the challenge of applying artificial intelligence broadly in synthetic biology. On this Integrated Human Practices page, we show how we have carefully considered whether our project is responsible and good for the world throughout its whole lifecycle. Along the way, we address both how our project responds to these considerations and how our proposed solution is implemented responsibly and reflectively. The public survey gave us an overall picture of the topic we focus on. Through communication with scholars, we clarified the specific problem to work on and obtained professional guidance and opinions for the design and implementation. We also drew inspiration from enterprises, taking application scenarios, customer needs and expert knowledge on feasibility into consideration, and we are delighted to see our model successfully used in their production and bringing benefits. Thanks to these professionals in different fields, our project has been able to open a new window onto the future of synthetic biology and contribute to the development of society.
Topic Research

1. Background Research

Leveraging immense computational power and intelligent algorithms, AI provides researchers with unprecedented insights. AI can handle vast amounts of biological data, such as genomes and protein interaction networks, accelerating drug development and disease diagnosis. The impact is striking, with AI speeding up gene identification and analysis by at least 100 times. In protein folding prediction, AI achieves an accuracy rate of 90%, greatly reducing the time and resources required compared to traditional methods. Additionally, AI excels in medical imaging diagnostics, with 96% accuracy in breast cancer detection, surpassing human doctors' assessment capabilities. These numbers highlight the enormous potential of AI in biology for medical research and healthcare management, and led our team to ask: how can AI be used more generally in synthetic biology?

2. Stakeholder Analysis

We hope to identify the problem together with potential stakeholders, and screen the initial idea with them. Therefore, our team first brainstormed a list of all the actors in different fields who could be relevant to our project, and determined the order of interaction so that we could plan and design our project from the general to the specific.

Following the spiral line, we engaged with our stakeholders step by step, and gradually constructed and improved our project design according to their suggestions and feedback. The process of these communications and their impact on us is shown in detail below. At the same time, as we clarified exactly what our project intends to do, we continuously extended and refined our stakeholder list and managed it through a power-interest matrix, which helps to prioritize the values of the most relevant stakeholders. All stakeholders are grouped based on Power (their ability to influence our project and our strategy) and Interest (how interested they are in our project succeeding). With the high-power, high-interest group, we engage fully, mainly by discussing all the choices we make and the progress we achieve in our project. We also considered the other stakeholders in ways suited to each group, and contact them when we require expertise on a specific topic.
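
The power-interest grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool our team used: the 0 to 1 scoring scale, the 0.5 threshold, the quadrant labels other than full engagement, and the example stakeholders with their scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stakeholder:
    name: str
    power: float     # ability to influence the project (assumed 0-1 scale)
    interest: float  # interest in the project succeeding (assumed 0-1 scale)


def quadrant(s: Stakeholder, threshold: float = 0.5) -> str:
    """Map one stakeholder to an engagement strategy (labels are assumptions)."""
    high_power = s.power >= threshold
    high_interest = s.interest >= threshold
    if high_power and high_interest:
        return "engage fully"
    if high_power:
        return "keep satisfied"
    if high_interest:
        return "keep informed"
    return "monitor"


# Hypothetical entries and scores, for illustration only.
stakeholders = [
    Stakeholder("academic advisor", 0.9, 0.8),
    Stakeholder("industry partner", 0.8, 0.4),
    Stakeholder("other iGEM team", 0.3, 0.7),
    Stakeholder("general public", 0.2, 0.3),
]

for s in stakeholders:
    print(f"{s.name}: {quadrant(s)}")
```

Keeping the threshold as a parameter makes it easy to re-group the list as the project scope becomes clearer and scores are revised.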

3. Public Survey

To fully understand the real needs of society and create new value in a targeted manner, public opinion is essential: it determines whether our project can actually benefit society and what we should focus on. Therefore, before implementing new technology, we conducted an extensive public questionnaire survey and detailed analyses. Over the past year, we collected 265 questionnaires from all over China, covering a range of educational backgrounds and occupations, which helps make our investigation broadly representative.

Given that the topic of AI for synthetic biology is highly specialized, our questionnaire was divided into three parts. In the first part, we wanted to find out the public's awareness of and attitude towards AI applications in work and daily life. The results show that about 28% of people never use AI in daily life and only 7% use AI at a high frequency. The reasons given for never or rarely using AI include the high threshold for using AI, concerns that AI will not meet individual needs, will invade personal privacy or will provide false information, and the lack of any need to use AI.

Figure 1. Frequency of using AI in daily life


Figure 2. Reasons for never or rarely using AI

When asked about their level of mastery of AI technology, 44% of people only use AI as a tool and 17% know about the principles of AI, which shows that the public's understanding of AI is quite shallow. Given this situation, we asked respondents about the obstacles to learning AI. About 60% of them agreed that AI learning resources and tools are hard to find, and that learning AI is difficult and requires a large investment of time.


Figure 3. Mastery degree of artificial intelligence


Figure 4. Obstacles to learning AI

These results show that the application of AI is still relatively limited, and that further research on AI technology is needed to solve problems effectively in specific fields; we believe this is also true of synthetic biology. In addition, the difficulty of the specialist knowledge involved limits people's further study of AI applications in specific fields, which reminds us that popular science and education activities about AI for synthetic biology are necessary and helpful for bringing the topic to the public.

We then asked users and developers of AI tools separately which aspects of AI they focus on. Among users, about 70% require accuracy, 61% want AI to be easy to operate, and about 50% expect data privacy and fast processing speed. This shows that for AI to become widespread, it must be both accurate and easy to understand, just like web page technology. Developers concentrated most on efficiency and accuracy, as well as the ability of models to transfer to multiple problems, which provides guidance for our AI model design.

Figure 5. Focus as a user on aspects of AI models or algorithms


Figure 6. Focus as a developer on aspects of AI models or algorithms

In terms of the functions of AI, we found that it is widely used in various areas, and nearly 92% of people believe AI will have a positive effect on the fields they work in. However, it has relatively few applications in scientific research, so we see value in advancing new applications of AI technology in synthetic biology research.

Figure 7. Goals of using AI


Figure 8. The impact AI has on the field you work in

In the next part, we aimed to further explore public perceptions of the promise of AI for synthetic biology. We were concerned to find that half of the respondents had never heard of synthetic biology before, and that over 16% of those who know of synthetic biology still have little knowledge of the application of AI in the field. Clearly, we need to introduce and promote AI for synthetic biology to the public in a more efficient and suitable way.

Figure 9. Degree of understanding of synthetic biology


Figure 10. The extent to which AI is used in the field of synthetic biology

Above all, those who know about the application of AI in synthetic biology generally believe that AI has bright prospects in the synthetic biology industry. When we asked about the specific conveniences or changes that AI can bring to biology, improving efficiency and shortening the research cycle received the highest scores among the options we listed. As for the current challenges or constraints that limit the development of AI for synthetic biology, nearly 70% of them agreed that the field lacks uniform standards and specifications (e.g. data formats, sharing platforms, etc.). About 60% believe that the scarcity of AI expertise and skills among biological researchers and the lack of synthetic biology data of high quality and sufficient quantity are also major problems.

Figure 11. The extent to which AI can help synthetic biology


Figure 12. The application prospect of AI in synthetic biology compared with other industries (assuming an average score of 5)


Figure 13. The importance of the different changes that AI brings to synthetic biology


Figure 14. Challenges of the application of AI in synthetic biology

After an extensive literature review, we chose to pre-train and fine-tune an AI based on an existing model, adapting its parameters to a specific problem. We asked the public for their opinion and were encouraged to find that most people think this approach makes practical sense.

Figure 15. (left) The practical value of pre-training and fine-tuning AI based on an existing model (compared with building an AI model from scratch)

Figure 16. (right) The meaning of applying transfer learning to solve specific problems in synthetic biology
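
The idea of pre-training and then fine-tuning can be illustrated with a deliberately tiny example. The sketch below is not our model: it fits a straight line with gradient descent in plain Python, and the datasets, learning rates and step counts are all hypothetical. It only shows the mechanism, namely that parameters learned on a large related dataset give a better starting point than random or zero initialization when only a few target measurements are available.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, b, xs, ys, lr, steps):
    """Full-batch gradient descent on the MSE, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Large "source" dataset (hypothetical): plenty of data for a related task.
src_x = [i / 100 for i in range(100)]
src_y = [2.0 * x + 1.0 for x in src_x]

# Small "target" dataset (hypothetical): only three measurements.
tgt_x = [0.1, 0.5, 0.9]
tgt_y = [2.0 * x + 1.5 for x in tgt_x]

# Step 1: pre-train on the large source dataset.
w_pre, b_pre = train(0.0, 0.0, src_x, src_y, lr=0.5, steps=2000)

# Step 2: fine-tune briefly on the small target dataset,
# starting from the pre-trained parameters.
w_ft, b_ft = train(w_pre, b_pre, tgt_x, tgt_y, lr=0.1, steps=20)

# Baseline: the same short training budget, but starting from zero.
w_sc, b_sc = train(0.0, 0.0, tgt_x, tgt_y, lr=0.1, steps=20)

loss_finetuned = mse(w_ft, b_ft, tgt_x, tgt_y)
loss_scratch = mse(w_sc, b_sc, tgt_x, tgt_y)
print(f"fine-tuned loss:   {loss_finetuned:.4f}")
print(f"from-scratch loss: {loss_scratch:.4f}")
```

With the same small training budget on the target data, the fine-tuned line ends up with a lower error than the line trained from scratch, because the pre-trained slope already matches the target task and only the offset has to be adjusted.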

In short, we are glad to find that the public is optimistic about the application of artificial intelligence in synthetic biology, which strongly motivates us to explore more possibilities for AI in synthetic biology, building on the work of our predecessors. The survey also gave us new ideas for our model design and for education activities to improve public understanding.

Exploration

1. Interview with Prof. Ma

Professor Ma Lijia is currently working at Westlake University, focusing on genomics and systems biology, and has in-depth research on data mining and AI applications in regulatory sequences.

In which area of biological application is data limitation the most significant problem? With this question in mind, we went to Westlake University for an in-depth exchange with Professor Ma Lijia. "The regulatory sequence is critical. In fact, 90% of the human genome is regulatory sequences, and our research group is currently working on the characterization of regulatory sequences. In my opinion, specific regulatory sequences, such as promoter sequences, currently suffer the most from data scarcity, which is mainly due to the limits of the experimental techniques for selecting and characterizing them." A crucial keyword in synthetic biology is expression, which is also closely related to regulatory sequences. Building on the professional insights provided by Professor Ma Lijia, we finally set our sights on the most direct and ubiquitous functional regulatory sequences, promoters, and made them the specific direction of our project.

2. Interview with Prof. Ding

During our communication and promotion efforts with various stakeholders, we have encountered some skepticism. Professor Bi Ding from Fudan University questioned the significance of our project during a discussion, stating that the prevalent use of active learning in the field of biology and AI suggests that data may not be as limiting as we claim. This has prompted us to further contemplate the significance of our project and how to convince more people. We have further confirmed that the availability of data is not only limited by technical development but also constrained by costs. In fact, there are instances where obtaining a sufficient amount of high-quality data is not impossible, but the corresponding high economic and time costs cannot be justified by the expected output.

3. AI Model and Paradigm

After determining our project direction, choosing an appropriate large-scale model was the most important problem to consider. We visited Nanjing GenScript Biotechnology Company for an interview with Sheng Xia, a senior scientist in bioinformatics. Mr. Sheng Xia has long been engaged in biological data mining and analysis, and has considerable professional experience in AI applications. After learning about our project, Sheng Xia suggested that, since what we need to handle is genetic data, language models are generally suitable, the most popular of which are the GPT and BERT models.
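
To make the link between genetic data and language models concrete, the sketch below shows one common way to turn a DNA sequence into "words" that a GPT- or BERT-style model could consume: overlapping k-mers mapped to integer ids. This is an assumption for illustration, since this page does not state which tokenization our model uses; the function names, the special tokens and the example sequence are all hypothetical.

```python
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list:
    """Split a DNA sequence into overlapping k-mers (illustrative tokenizer)."""
    seq = sequence.upper()
    if any(base not in "ACGTN" for base in seq):
        raise ValueError("unexpected character in DNA sequence")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens) -> dict:
    """Assign an integer id to each distinct token, after two special tokens."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


# Hypothetical promoter fragment, for illustration only.
tokens = kmer_tokenize("TATAAAAG", k=3)
vocab = build_vocab(tokens)
ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(tokens)  # ['TAT', 'ATA', 'TAA', 'AAA', 'AAA', 'AAG']
print(ids)     # [2, 3, 4, 5, 5, 6]
```

Once sequences are expressed as token ids, the same pre-training objectives used for natural language text can in principle be applied to them.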

4. Data Source

During preliminary dry lab training on the selected dataset of about 30,000,000 entries, repeated attempts and parameter optimization failed to produce good results, and discussion within the team turned our attention to the characteristics of the training data. When plotting the results, the dry lab members found that a large share of the full dataset had measured intensities that deviated from reasonable experimental values. After we contacted the author of the source publication, who contributed the data, and learned his team's method for screening the data, performance improved significantly. Since then, we have stayed in contact with the author's team, which has played an important role in optimizing and improving our dry lab results.
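
The kind of screening step described above can be sketched as follows. The data contributors' actual screening method is not described on this page, so the rule used here, dropping records whose intensity lies far from the median as measured by the median absolute deviation, is a swapped-in generic technique. The function name, the cut-off of 3 and the example records are hypothetical.

```python
import statistics


def filter_by_mad(records, max_dev=3.0):
    """Keep (sequence, intensity) records whose intensity lies within
    max_dev median absolute deviations of the median intensity."""
    values = [v for _, v in records]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing to screen on.
        return list(records)
    return [(s, v) for s, v in records if abs(v - med) / mad <= max_dev]


# Hypothetical records: (promoter sequence, measured expression intensity).
records = [
    ("ACGTACGT", 10.2),
    ("TTGACATA", 9.8),
    ("GGCTATAA", 10.5),
    ("CATATAAG", 10.0),
    ("AAAAAAAA", 42.0),  # implausible measurement
]

kept = filter_by_mad(records)
print(f"kept {len(kept)} of {len(records)} records")
```

A median-based rule is used in this sketch because, unlike a mean and standard deviation, it is not pulled towards the very outliers it is meant to remove.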

5. Yeast Expression System

After deciding to use yeast as our expression system, we interviewed Professor Sheng Xia at Nanjing GenScript Biotechnology Company. Professor Sheng agreed with our choice, and noted that compared with some engineered bacteria commonly used in industrial fermentation, yeast grows more slowly and high expression levels are harder to achieve. He emphasized that a solution to this issue from our project would be extremely helpful. Professor Sheng further advised that if we plan to express proteins from prokaryotes or viruses in yeast, we should optimize the sequences for the yeast host. He pointed us to a GenScript web platform for yeast sequence optimization, which played a crucial role in our subsequent work.

The next challenge in protein expression in yeast is how to separate, purify, and quantify the proteins. Because of the yeast cell wall, separating and purifying the expressed proteins is relatively difficult. In discussions with Dr. Yiling Hu, a postdoctoral researcher at the School of Life Sciences, Nanjing University, who specializes in yeast cultivation, we were offered a feasible solution: adding a His-tag to the protein, or fusing the protein with a fluorescent protein, to enable purification and quantification through Western blot analysis after yeast lysis. Dr. Hu also noted that there is no precedent for detecting a His-tag in yeast, so detecting it in the yeast lysate would be a significant advance for our project.

As we explored our project further, we also kept in close contact with other iGEM teams and shared progress and challenges with each other. Through these collaborations and partnerships, we received peer support and review, enhanced our creativity and advanced our project. For more information, please click the Partnership link to see how the interaction with other teams has influenced our project.

Application

1. The Visit to Nanjing Yiweisen Biotechnology Co., LTD

After gaining recognition from relevant researchers, we wanted to further explore the industrial significance of our project. We visited Nanjing Yiweisen Biotechnology Co., Ltd. and had a discussion with Professor Zhongchang Wang, the main founder of the company. The company is a high-tech enterprise based on artificial intelligence technology in the field of synthetic biology, incubated by Artificial Intelligence Biomedical Research of Nanjing University. It primarily focuses on microbial genetic modification and downstream industrial applications, researching and developing high-value natural active substances for use in sectors such as food, cosmetics, plant protection, and biopharmaceuticals. Professor Wang acknowledged the significance of our project and pointed out that traditional methods for selecting target strains from a large number of randomly mutated strains are often costly, time-consuming, and limited in scope. In his view, applying artificial intelligence in synthetic biology has greatly improved the accuracy of identifying useful mutant strains and reduced the cost of screening. Additionally, he provided us with some potential application directions from a business perspective.

2. The Visit to Carbon Silicon Institute of Artificial Intelligence Biomedical Research

We hoped that our project could provide solutions to the most realistic and important human health or environmental issues, demonstrating responsibility and a positive impact on the world throughout its entire lifecycle. During our visit and discussions, a possibility that we had never considered before caught our attention: the mucosal vaccine.

Professor Chao Yan, from the institute, has extensive experience in drug development and provided profound insights into using AI for precision drug discovery. He first acknowledged that the pre-training + fine-tuning model we proposed can not only be applied to synthetic biology but also to predictive drug development. He further emphasized that data limitations are not just technical barriers but are greatly influenced by the spatiotemporal factors of data generation, which have high heterogeneity and are difficult to utilize. As an example, he mentioned that AI image recognition models are trained on datasets in the order of billions, while in the field of biology, measurements are typically limited to dozens of patients at a time, with only a few dozen data points per dimension. The disparity between the two is significant. Additionally, he pointed out the issue of the dimensionality of biological data. Biological data often have a high number of dimensions but a low sample size, which is not conducive to leveraging the strengths of AI. For example, AI excels at image processing with low dimensionality and large sample sizes, where there are data points in each dimension, resulting in better predictive models. However, in biology, such as genomics, each gene represents a dimension, but the samples for each gene are relatively scarce.

Professor Yan's interview deepened our understanding of the significance and value of our project and made us more fully aware of the multiple aspects of data limitations. Additionally, we received affirmation and support from him regarding our plans for mucosal vaccine production.

3. Enhance LTB Expression

After receiving such affirmation, we needed to find suitable mucosal vaccine-related proteins as the target product for our project. Our attention turned to LTB (Heat-Labile Enterotoxin B subunit). Through literature review and understanding its production and functions, we discovered that LTB plays a crucial role in mucosal vaccines and indeed faces expression limitations. If our project can address the challenges associated with LTB expression, it would greatly promote the production and dissemination of mucosal vaccines.

4. Feedback

After formulating such a concept, we further engaged in discussions with the aforementioned researchers. Many of them raised concerns about the inconsistency between our project's training data and the downstream product. Specifically, we obtained data from fluorescent protein expression for training, but intended to use the trained sequences for the production of a different protein. As they pointed out, the strength of the promoter is highly likely to be influenced by downstream genes. This prompted us to consider providing feedback from wet lab experiments, specifically the expression data obtained for LTB, to the AI involved in dry lab experiments. This feedback would facilitate the development of a more targeted model for LTB expression.

Ethics

We believe that the use and training of AI models must adhere to ethical guidelines for data, meaning that only publicly available datasets should be used for training. It is important to respect the owners' rights and avoid using private data without permission. When it comes to the pre-training and fine-tuning paradigms, copyright issues pertaining to pre-trained models must be taken into consideration. It is advisable to utilize publicly available pre-trained models. Moreover, during the AI for Synbio Seminar, we engaged in discussions with other teams and reached a consensus that data transparency and sharing of AI model algorithms are essential, particularly for our iGEM team. Other teams should also ensure that they make use of publicly available datasets when collecting data.

In addition, one of the primary concerns related to AI safety is that the training of AI models should be focused on tasks that are beneficial to humans and compliant with legal regulations within that specific domain.

Environment

After achieving some experimental results, we engaged in discussions with AI pharmaceutical companies to seek guidance on the subsequent industrialization process. They recognized the potential utility of our model in industrial production but also highlighted certain considerations. Firstly, for yeast production and larger-scale experiments, it is crucial to pay special attention to preventing biological contamination. Secondly, due to the high level of pollution associated with yeast production, wastewater treatment should be prioritized. In fact, the production of 1 ton of dry yeast can generate 150 tons of wastewater. Building upon these considerations, Zhang Xiaotong, the supervisor of the wet lab experiments, guided the innovative development of a biological wastewater treatment device (Patent No: CN215886592U) and a bioreactor to prevent contamination from miscellaneous bacteria (Patent No: CN216141527U). These developments have prepared us for the industrial production phase of our project.

Entrepreneurship

Based on Pymaker, we plan to develop deep learning software for designing cis-regulatory elements in yeast promoter regions. We have already agreed to collaborate and signed agreements with Nanjing LianDu Biological Technology Co., Ltd. and Nanjing YiWeiSen Biological Technology Co., Ltd. Through this collaboration, our research results and models can be used to guide the design and optimization of yeast fermentation production lines, significantly reducing the cost of screening and validation in industrial production, and ultimately bringing high-quality products to the market.