Software
The first microRNAs (miRNAs) were discovered in 1993, yet research on miRNAs did not gain significant momentum until the early 21st century (Bhaskaran & Mohan, 2014). Because the field is still emerging, detailed information about each miRNA remains scattered throughout the scientific literature. Previous attempts by miRNA databases such as miRBase to extract and organize this information with text mining have been insufficient, as their filtering and word cloud techniques capture only a fraction of the data in the literature (see Fig. 1) (Kozomara et al., 2019).
Today’s system of publishing literature is based on journals, the established channel for academic communication and the exchange of research findings. Journal publishing began in the 17th century to facilitate the exchange of scientific information between scholars. What started as around 500 journals publishing 2-3 times per year has grown to at least 30,000 journals publishing more than 5 million papers each year (Tress Academic, 2019). Journals give scholars quick access to high-quality, peer-reviewed academic literature, but as the volume of published papers has grown exponentially, it has become increasingly difficult to keep up with the literature and conduct literature reviews, even within a specialized field. This difficulty is especially relevant to miRNA research: researchers must not only read and compare numerous papers, but also keep track of the different methods that other researchers use to measure miRNAs.
While researching miRNAs for CADlock, Lambert iGEM experienced firsthand the disorganization of data surrounding miRNAs. To confirm whether other individuals faced this issue, we consulted Dr. Christian Delles, a miRNA researcher from the University of Glasgow, who verified this lack of organization. Furthermore, Dr. Charles Searles, a cardiologist from Emory School of Medicine, reaffirmed the need for a database to streamline information on miRNAs related to coronary artery disease (CAD).
To address the need for a database that extracts functional miRNA information and organizes the vast volume of scientific literature on the subject, we developed CADmir: a miRNA search engine that uses large language models to process and organize the extensive information about miRNAs. With CADmir, most of the relevant data about specific miRNAs are consolidated in one place, improving information accessibility and reducing the time needed to obtain high-quality scientific information. CADmir works alongside existing databases such as miRBase to provide links to relevant scientific literature and incorporate sequence and genomic data. CADmir aims to expand research in miRNAs and set a new precedent for the organization of vital scientific information.
In developing CADmir, Lambert iGEM had two goals: a user-friendly interface and an effective organization of the thousands of papers regarding microRNAs (miRNAs) and coronary artery disease (CAD). To achieve these goals, we leveraged new technologies, such as OpenAI’s GPT-3.5 Turbo large language model, to process and analyze thousands of articles (OpenAI, 2023). Furthermore, we employed modern user-experience design principles so that CADmir is human-centric and intuitive. By combining powerful AI tools with a carefully crafted design, CADmir empowers researchers to swiftly search for information regarding heart disease-related miRNAs and retrieve comprehensive details accompanied by proper citations.
CADmir creates an efficient way of interfacing with the vast amount of literature by extracting information from existing scientific papers and embedding it in a format that is easily accessible. Scientific literature, like all forms of text, carries semantic meaning that is hard for computers to understand. Information on a single topic is often spread across numerous papers, making it tedious to compare and compile. Sentence embeddings are numerical representations of sentences that encode semantic information mathematically, allowing computers to work with the meaning of the text (Google, 2022). By using embeddings, CADmir can automate the tedious task of reading and comparing papers manually.
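To make the idea of "encoding meaning mathematically" concrete, the sketch below compares embeddings with cosine similarity, a common closeness measure for vectors. The three-dimensional vectors and their labels are invented for illustration; real embedding models return vectors with hundreds or thousands of dimensions, and the source does not state which similarity measure CADmir uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only.
expression_claim = [0.9, 0.1, 0.2]   # a sentence about miRNA expression levels
paraphrase = [0.8, 0.2, 0.3]         # the same idea in different words
storage_note = [0.1, 0.9, 0.1]       # an unrelated sentence about sample storage

similar = cosine_similarity(expression_claim, paraphrase)
dissimilar = cosine_similarity(expression_claim, storage_note)
print(similar > dissimilar)
```

Under these made-up values, the paraphrase scores much closer to the original claim than the unrelated sentence does, which is the property a search over embeddings relies on.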
CADmir uses an automated pipeline to extract and organize data (see Fig. 3). First, a script gathers a curated collection of miRNA research papers focused on CAD, using a web scraper to obtain PDFs from PubMed with the keywords “miRNA” and “coronary artery”. For each downloaded PDF, CADmir extracts the body of the paper and splits it into individual sentences. Next, each sentence is passed through OpenAI’s embedding model and transformed into an embedding, a mathematical representation that computers can work with, before being added to a vector database, a specialized database that can store and search embeddings (OpenAI, 2023; Schwaber-Cohen, 2023). Embeddings are useful for clustering and organizing large amounts of unstructured data because they capture semantic meaning and relationships between sentences. Embeddings that are closer to each other mathematically are more semantically similar; thus, they can be searched and synthesized to automate various natural language tasks (Google, 2022). OpenAI’s embedding model draws on a large language model’s understanding of natural language to represent the meaning of each sentence mathematically (OpenAI, 2023). In total, CADmir processed 3,845 papers, yielding 98,008 sentence embeddings.
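The split, embed, and store steps of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not CADmir's actual code: `toy_embed` is a hashed bag-of-words stand-in for OpenAI's embedding model so that the example runs offline, a plain Python list stands in for the vector database, and all names (`split_sentences`, `index_paper`, `Record`, `paper-001`) are hypothetical.

```python
import hashlib
import math
import re
from dataclasses import dataclass

DIM = 64  # toy dimensionality; real embedding models use far more

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def toy_embed(sentence, dim=DIM):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

@dataclass
class Record:
    source: str    # identifier of the paper the sentence came from
    sentence: str
    vector: list

def index_paper(store, source, body):
    """Split a paper body into sentences, embed each, and append to the store."""
    for sentence in split_sentences(body):
        store.append(Record(source, sentence, toy_embed(sentence)))
    return store

store = index_paper([], "paper-001",
                    "First sentence about miRNA. Second sentence about CAD.")
print(len(store), store[0].sentence)
```

Keeping the source identifier next to every vector is what later allows each retrieved sentence to be traced back to its paper. A production splitter would also need to handle abbreviations such as "et al." and "Fig.", which this regular expression would split incorrectly.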
After processing all of the CAD-related papers referencing miRNAs and storing them on a vector database, we built a search engine to allow researchers to quickly interface with CADmir’s database of information.
When a user enters a question in the search bar, the query is first embedded into a vector with OpenAI’s embedding model. Once the query is in vector form, CADmir uses the k-nearest-neighbor (KNN) algorithm to find the 15 pieces of information most similar to the search query (IBM, 2023). The query and the 15 search results are then passed as context to OpenAI’s GPT-3.5 Turbo model, which generates the final response in natural language and presents it to the user (see Fig. 4). We designed CADmir with information integrity in mind. Each piece of information in CADmir references a source that is peer-reviewed and credible. When generating a search response, CADmir pinpoints the source that each piece of information came from and lists the sources it used.
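The retrieval and prompt-assembly steps described above can be sketched as a brute-force nearest-neighbor search followed by a numbered context block. This is an illustration under assumptions: the two-dimensional vectors, placeholder sentences, record layout, and prompt wording are all hypothetical, cosine similarity is assumed as the closeness measure, and the call to the chat model is omitted so the example runs offline.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, records, k=15):
    """Brute-force k-nearest-neighbor search, most similar record first."""
    return heapq.nlargest(k, records, key=lambda r: cosine(query_vec, r["vector"]))

def build_prompt(question, hits):
    """Number each retrieved sentence so the generated answer can cite it."""
    context = "\n".join(
        f"[{i}] {h['sentence']} (source: {h['source']})"
        for i, h in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered context and cite the numbers you use.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical records; vectors are hand-made, not real embeddings.
records = [
    {"source": "paper-A", "sentence": "Placeholder sentence about miRNA expression.", "vector": [1.0, 0.0]},
    {"source": "paper-B", "sentence": "Placeholder sentence about sample storage.", "vector": [0.0, 1.0]},
    {"source": "paper-C", "sentence": "Placeholder sentence about miRNA targets.", "vector": [0.8, 0.6]},
]
query_vec = [0.9, 0.1]  # stands in for the embedded search query

hits = top_k(query_vec, records, k=2)
prompt = build_prompt("Which miRNAs are discussed?", hits)
print(prompt)
```

The default `k=15` mirrors the number of results the text describes; the example uses `k=2` only because the toy store holds three records. A dedicated vector database would typically replace the linear scan with an index, but the ranking idea is the same.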
CADmir is a living database of miRNAs and will be regularly updated with new papers. New papers added to the database will be automatically included in future search queries.
To get started, users can search for anything relevant to heart disease-related miRNAs from the home page. Each search generates a response along with the sources that back it up, giving researchers the flexibility to explore CADmir’s repository of information without constraints. CADmir is publicly available for anyone to use here.
CADmir’s code is available as open source under an MIT license on GitHub.
MicroRNA (miRNA) research is relatively new, making it difficult for researchers to find data regarding miRNA functions. With CADmir, researchers can quickly find information regarding their desired miRNA. Lambert iGEM presented this database to miRNA researcher Dr. Charles Searles, a cardiologist at Emory Healthcare Hospital, who was excited about the increased efficiency CADmir brings. In the future, we plan to expand CADmir to support more fields beyond miRNAs related to coronary artery disease.
Bhaskaran, M., & Mohan, M. (2014). MicroRNAs: History, biogenesis, and their evolving role in animal development and disease. Veterinary Pathology, 51(4), 759–774. https://doi.org/10.1177/0300985813502820
Google. (2022). Embeddings. Google for Developers. https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture
IBM. (2023). What is the k-nearest neighbors algorithm? | IBM. Ibm.com. https://www.ibm.com/topics/knn#:~:text=The%20k%2Dnearest%20neighbors%20algorithm%2C%20also%20known%20as%20KNN%20or,of%20an%20individual%20data%20point.
Kozomara, A., Birgaoanu, M., & Griffiths-Jones, S. (2019). miRBase: from microRNA sequences to function. Nucleic Acids Research, 47(D1), D155–D162. https://doi.org/10.1093/nar/gky1141
OpenAI. (2023). OpenAI GPT-3 API [gpt-3.5-turbo]. https://platform.openai.com/docs/models
Schwaber-Cohen, R. (2023). What is a Vector Database? | Pinecone. Pinecone.io. https://www.pinecone.io/learn/vector-database/
Tress Academic. (2019, June 4). #13: Writing journal papers: Pros and cons | Tress Academic. Tress Academic. https://tressacademic.com/writing-journal-papers-pros-and-cons/