Motivation
At its heart, synthetic biology offers a new set of design methodologies in which researchers can take advantage of existing functional parts to customize a system or part more easily. Standardized parts therefore lay the foundation for synthetic biology. The iGEM community has been collecting standard biological parts from projects every year, enabling future teams to access and reuse these parts.
However, the massive BioBrick collection is a double-edged sword: its sheer size makes it difficult to pick out the BioBricks that meet a specific need. To address this, we have developed Ask NOX. By describing the desired function in natural language, users can easily find the BioBricks they need with the help of Ask NOX.
Try it Now!
The release version of our website is available HERE. Describe your requirements in detail and press "Enter". It takes only a few seconds for our model to return iGEM part names that satisfy your needs. You can also click on a part name to jump to https://parts.igem.org for more detailed information.
Click to use Ask NOX.
If you have no idea about the input, try copying the example prompts below:
- ATP-independent luciferase, very small (only 19 kDa), from a deep-sea shrimp.
- Quorum sensing, transcription factor, binds AHL.
- Quorum sensing, Synthesizes 3OC6HSL.
- Interacts with p-hydroxybenzoic acid, dual-directional.
For detailed usage, please see the User Manual section.
Project Structure
Software Structure
Behind the scenes, our project can be split into three main modules:
- Data Retrieval and Processing
We retrieve the names and corresponding descriptions of BioBricks from the parts.igem.org website.
We have incorporated the large language model Llama 2 into our workflow, letting it play the role of a professional biologist. Taking the perspective of a synthetic biology researcher, Llama 2 generates natural language queries for BioBricks based on their descriptions.
- BERT Encoder
The BERT encoder is the core of this project. It encodes BioBricks based on their natural language queries. By training on a dataset of existing natural language queries for BioBricks, it builds a semantic space for the BioBricks. The user's natural language query is then encoded by BERT and mapped onto this semantic space, and the most relevant BioBricks are returned based on proximity in that space (see the retrieval sketch after this list).
- User Interface
We have developed a user-friendly web interface around the trained model.
Users just need to describe the BioBricks they want; the webpage then shows a list ranked by relevance, together with concise summaries. The desired BioBricks will appear within this list, and users can click the corresponding links to access further details.
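To make the lookup concrete, below is a minimal sketch of how the semantic-space search could be realized with a BERT encoder and cosine similarity. The checkpoint (here a stand-in `bert-base-uncased`), the tiny `parts` list, and the helper names are illustrative assumptions, not our exact production code.

```python
# Minimal sketch of the semantic-space lookup (illustrative, not our exact code).
# Assumes (part name, description/query) pairs already retrieved from parts.igem.org.
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoint; in practice this would be the fine-tuned Ask NOX encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode(texts):
    """Encode sentences into L2-normalized [CLS] embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

# Tiny illustrative subset of the registry; in practice every BioBrick is indexed.
parts = [
    ("BBa_C0062", "LuxR-family transcription factor that binds AHL (quorum sensing)"),
    ("BBa_C0061", "LuxI synthase that produces the quorum-sensing signal 3OC6HSL"),
]
part_names = [name for name, _ in parts]
part_vecs = encode([desc for _, desc in parts])

def search(query, top_k=10):
    """Return the top_k BioBrick names closest to the query in the semantic space."""
    q = encode([query])
    scores = (q @ part_vecs.T).squeeze(0)      # cosine similarity (embeddings are normalized)
    best = torch.topk(scores, k=min(top_k, len(part_names))).indices
    return [part_names[i] for i in best]

print(search("Quorum sensing, synthesizes 3OC6HSL."))
```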
Iterations of Our Project
Cycle 1
Instead of building an LLM from scratch, we fine-tuned a pre-trained model based on an existing reverse dictionary project (MultiRD) [1] to construct the reverse search model for BioBricks.
Specifically, starting from the BERT deep learning model, which is pre-trained on a substantial amount of unlabeled text, we fine-tuned it using the BioBrick data mentioned before. By encoding natural language descriptions of BioBricks, the BERT model establishes a mapping between BioBricks and their descriptions through training. As a result, one can input a BioBrick description and obtain the corresponding BioBricks as output.
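As a rough illustration of this step, the sketch below fine-tunes BERT with an in-batch contrastive objective that pulls each generated query toward the embedding of its BioBrick's description. This is one plausible realization under our assumptions; the actual MultiRD-based training configuration, dataset format, and hyperparameters may differ.

```python
# Illustrative fine-tuning sketch: align generated queries with their BioBricks'
# description embeddings using an in-batch contrastive loss. Not our exact setup.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Each training pair: (Llama-generated query, registry description of the same BioBrick).
pairs = [
    ("small ATP-independent luciferase from a deep-sea shrimp",
     "19 kDa ATP-independent luciferase derived from a deep-sea shrimp"),
    # ... one or more pairs per BioBrick
]

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

loader = DataLoader(pairs, batch_size=16, shuffle=True)
for epoch in range(3):
    for queries, descriptions in loader:
        q, d = embed(list(queries)), embed(list(descriptions))
        logits = q @ d.T / 0.05                     # similarity matrix with temperature
        labels = torch.arange(len(queries))         # i-th query matches i-th description
        loss = torch.nn.functional.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```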
The first version of the dataset was constructed with the 7B Llama model: for each BioBrick, a single generated query statement was used as training data to fine-tune the BERT model.
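For illustration, the following sketch shows how such a query could be generated per BioBrick with a chat-tuned Llama checkpoint; the exact prompt wording, checkpoint, and decoding settings we used may differ.

```python
# Illustrative sketch of generating one researcher-style query per BioBrick
# description (prompt wording and checkpoint are assumptions).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",   # Cycle 2 later moved to a 13B model
    device_map="auto",
)

PROMPT = (
    "You are a professional synthetic biologist. A colleague is searching for a "
    "BioBrick with the function described below. Write the short natural-language "
    "query they would type, without mentioning the part name.\n\n"
    "Description: {description}\n\nQuery:"
)

def generate_query(description: str) -> str:
    out = generator(
        PROMPT.format(description=description),
        max_new_tokens=60,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,
    )
    return out[0]["generated_text"].strip()

print(generate_query("LuxI synthase that produces the quorum-sensing signal 3OC6HSL"))
```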
Cycle 2
Building on Cycle 1, we tested Ask NOX's query performance on the BioBricks used in our NOX project. After multiple trials, we realized that the training set had deficiencies in both quantity and quality.
Therefore, guided by the test results, we switched to a larger Llama model (13B) and iterated on the prompts to generate a better training set. To prevent overfitting, we also generated multiple training samples for each BioBrick.
We also standardized the model's interface and introduced a web UI that calls it, making the model more convenient to use.
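As a sketch of what this interface might look like, the snippet below wraps the search model in a small web endpoint; the framework (Flask), route, module name `asknox.model`, and response format are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of a web endpoint wrapping the search model (framework, route,
# and module names are illustrative assumptions).
from flask import Flask, jsonify, request

from asknox.model import search  # hypothetical module exposing the lookup sketched earlier

app = Flask(__name__)

@app.route("/api/search", methods=["POST"])
def search_endpoint():
    query = request.get_json().get("query", "")
    names = search(query, top_k=10)
    results = [
        {"name": name, "url": f"https://parts.igem.org/Part:{name}"}
        for name in names
    ]
    return jsonify(results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```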
Accuracy of the Model
To test the accuracy of the model, we divided all the generated summaries into two groups: a "seen" group (the training set) and an "unseen" group (the test set). Every BioBrick is summarized several times, by different Llama 2 models or with different prompts; some of these summaries are randomly assigned to the "seen" group and the rest to the "unseen" group. The BERT model is trained on the "seen" group and tested with inputs from both groups. Each test query corresponds to the BioBrick from which it was generated, so that BioBrick's name serves as the label. By sorting all output BioBricks by relevance, we measured the model's accuracy according to where the correct BioBrick name ranks: in the top 1, the top 10, the top 100, and so on.
The test results of our reverse dictionary model for BioBricks are as follows:
|  | top-1 hit rate | top-10 hit rate | top-100 hit rate |
| --- | --- | --- | --- |
| seen test data | 0.992 | 1.0 | 1.0 |
| unseen test data | 0.39 | 0.7 | 0.856 |
| seen + unseen | 0.691 | 0.85 | 0.928 |
The top-10 hit rate is the probability that the correct BioBrick appears among the top ten items on the webpage; the other metrics are defined analogously.
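The hit-rate computation itself is straightforward; the sketch below shows how it can be computed from ranked model outputs (function names and the example part names are placeholders).

```python
# Illustrative computation of top-k hit rates from ranked model outputs.
def hit_rate(ranked_results, labels, k):
    """Fraction of test queries whose correct BioBrick appears in the top-k results.

    ranked_results: one list of part names per query, sorted by relevance.
    labels:         the correct part name for each query.
    """
    hits = sum(1 for ranked, label in zip(ranked_results, labels) if label in ranked[:k])
    return hits / len(labels)

# Example with two test queries (part names are placeholders).
ranked = [["BBa_C0061", "BBa_C0062", "BBa_B0034"],
          ["BBa_B0015", "BBa_C0012", "BBa_C0061"]]
labels = ["BBa_C0061", "BBa_C0012"]

for k in (1, 10, 100):
    print(f"top-{k} hit rate: {hit_rate(ranked, labels, k):.3f}")
```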
To further validate our model in a convincing way, we constructed a test set generated by Llama 2 and asked our wet-lab teammates to evaluate the output; they judged that the results matched well. In addition, we had the wet-lab teammates make several queries based on their actual needs during their experiments. In most cases, they could find suitable BioBricks within the first ten results.
Related Works
Many previous teams have also noticed the imperative need to find a BioBrick with certain functions efficiently, and novel methods have been developed thanks to their effort. The DiKST project (Leiden, 2021) built a database with Python and optimized its search algorithm so that users could search in a simple and elegant manner. The PartHub project (Fudan, 2022) enabled keyword searching for iGEM parts by ID, name and many other features, and also visualized the relationship of parts in an interactive way.
However, the advent of large language models (LLMs) like ChatGPT has revolutionized the way people search for information online, and NLP models like BERT open up new possibilities for summarizing information. Instead of typing in keywords, can we search for BioBricks with natural languages, just like how we interact with ChatGPT?
This is where we believe our model has a competitive edge over previous methods. By leveraging the power of LLMs, our model allows users to make queries in natural language, which greatly reduces the cognitive burden on users.
This natural language search parallels how people interact with chatbots like ChatGPT, providing an intuitive and user-friendly way to explore the vast database of BioBricks. While previous databases have optimized keyword search, our model is uniquely positioned to understand the underlying meaning of natural language queries.
Future Work
Our project demonstrates a new methodology for BioBrick searching with natural language. Unfortunately, due to the tight schedule of this competition, there is still room for improvement in terms of result accuracy, model structure, and methods of evaluating model performance. Future work can be carried out in the following aspects:
- Stronger Llama model for better training samples
We will utilize a Llama model with more parameters to generate the training set, and we will refine the training set based on user feedback. Additionally, we will consider incorporating more data sources if available.
Extra information, such as the other parts mentioned in a part's description, should also be taken into consideration, since it indicates relationships between different parts.
- Explore better ways of evaluating the performance of the model instead of manual testing.
We shall develop a new model that can be updated online. By providing a chat session for each user, the model will be able to learn and improve through communication with users.
Furthermore, we will develop a scoring program to evaluate the query results from our current model. By pitting this program against our model in adversarial settings for reinforcement learning, we aim to obtain improved training outcomes.
- Apply the workflow of our model to other databases.
The workflow of our model is scalable and can be applied to many other realms of research. It's possible for future teams to apply our workflow to other databases like UniProt.
In the future, we will seek support and assistance from other database providers. By developing Ask NOX on more extensive datasets, we could substantially improve its query capabilities and accuracy.
- Modify the structure of NLP models like BERT and customize these models for the realm of BioBrick searching.
Appendix: User Manual
Our project is fully open source. You can get our source code from our GitLab repo.
General users can simply visit the Ask NOX website.
If you are an experienced user and want to deploy Ask NOX locally, clone our repo and prepare your environment following our instructions.
References
[1] Zhang, Lei, et al. "Multi-Channel Reverse Dictionary Model." arXiv, 18 Dec. 2019, http://arxiv.org/abs/1912.08441.
[2] Yan, Hang, et al. "BERT for Monolingual and Cross-Lingual Reverse Dictionary." arXiv, 30 Sept. 2020, https://doi.org/10.48550/arXiv.2009.14790.