Our AI models, developed under the pre-train and fine-tune paradigm, were trained on datasets of seven sizes: 300, 3,000, 30,000, 300,000, 3,000,000, 6,000,000, and 30,000,000 records. For comparison, we used datasets of the same sizes to train, from scratch, a separate baseline model built specifically for the raw data. Finally, we assessed the goodness of fit of all models on a dedicated test dataset. The results demonstrate that our pre-train and fine-tune paradigm can effectively address the issue of limited sample size.
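To make the training setup concrete, the following is a minimal sketch of the fine-tuning stage, assuming a generic pretrained PyTorch regression model; the names (`pretrained_model`, the tensor inputs) are illustrative, not our actual implementation.

```python
# Minimal fine-tuning loop (PyTorch sketch). Assumes `pretrained_model`
# maps a batch of encoded sequences to a scalar prediction; all names
# here are hypothetical.
import torch
from torch.utils.data import DataLoader, TensorDataset

def fine_tune(pretrained_model, train_x, train_y,
              epochs=10, lr=1e-4, batch_size=256):
    """Fine-tune on a labeled subset of any size (300 .. 30,000,000)."""
    loader = DataLoader(TensorDataset(train_x, train_y),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(pretrained_model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    pretrained_model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(pretrained_model(x).squeeze(-1), y)
            loss.backward()
            optimizer.step()
    return pretrained_model
```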
Please refer to the figure below for the specific results, where goodness of fit is measured by the Pearson correlation coefficient (see the "Model" section for further details).
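For concreteness, the metric is computed as follows (a minimal sketch; `y_pred` and `y_true` stand for the model's predictions and the measured values on the test set):

```python
# Goodness of fit: Pearson correlation between predictions and
# measurements on the held-out test set.
from scipy.stats import pearsonr

def goodness_of_fit(y_pred, y_true):
    r, _p = pearsonr(y_pred, y_true)  # r in [-1, 1]; higher is better
    return r
```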
We fitted a natural-logarithm (ln) curve to the data in the table and analyzed it visually.
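As an illustration, the fit can be reproduced with a least-squares fit of r(n) = a·ln(n) + b to the reported points (a sketch using our model's values listed later in this section; the baseline's curve is fitted the same way):

```python
# Fit r(n) = a*ln(n) + b to (training size, Pearson r) points.
import numpy as np
from scipy.optimize import curve_fit

def log_curve(n, a, b):
    return a * np.log(n) + b

sizes = np.array([300, 3_000, 30_000, 300_000,
                  3_000_000, 6_000_000, 30_000_000], dtype=float)
fits = np.array([0.552, 0.701, 0.788, 0.898,
                 0.943, 0.954, 0.963])  # our model's test-set fits
(a, b), _cov = curve_fit(log_curve, sizes, fits)
```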
We found that the baseline models trained from scratch on the raw data are almost unusable when the training set falls below 30,000 records. In contrast, our AI model, built with pre-training and fine-tuning, fits well even at these sizes. Notably, at every training size our model outperforms the baseline, and the advantage is largest for small samples. In general, the baseline requires roughly 10 times as much data to reach the same fit as our model.
Specifically, at a typical sample size of 3,000, our AI model achieves a fit of 0.701, whereas the baseline reaches only 0.217: our model's predictions can serve as a useful reference, while the baseline's are unusable. At a sample size of 30,000, our model's fit is 0.788 versus 0.660 for the baseline; our model's predictions are highly correlated with the actual results, while the baseline still falls short of our model trained on only 3,000 samples (0.701).
To reach a Pearson correlation coefficient of 0.85, a commonly used threshold for a good fit, our AI model requires approximately 75,000 data points, whereas the baseline requires around 745,000. In other words, the baseline needs roughly 670,000 additional data points, about 10 times our model's requirement.
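These two estimates come from inverting the fitted curves: solving a·ln(n) + b = 0.85 gives n = exp((0.85 − b)/a) for each model's fitted (a, b). A sketch:

```python
# Invert r(n) = a*ln(n) + b to estimate the sample size needed for a
# target correlation (0.85 here); (a, b) are the fitted coefficients.
import math

def required_samples(a, b, r_target=0.85):
    return math.exp((r_target - b) / a)
```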
The specific details for each sample size are as follows:
1. 30,000,000 Sample Size
For a sample size of 30,000,000, our AI model achieved a Pearson correlation coefficient of 0.963 on the low-throughput, high-accuracy test set, compared with 0.959 for the baseline model and 0.960 reported in the original paper [1]. This indicates that we successfully replicated the published results, and that our model, built with pre-training and fine-tuning, fits slightly better than the carefully constructed baseline trained on the raw data.
2. 6,000,000 Sample Size
For a sample size of 6,000,000, our AI model achieved a Pearson correlation coefficient of 0.954 on the low-throughput, high-accuracy test set, compared with 0.936 for the baseline. In the figure, "N" denotes the number of data points in the dedicated test dataset (61,150), and "20%" indicates that the training sample is 20% of the original data, i.e., 6,000,000 records. However, to simulate the low-throughput, high-accuracy scenario with small samples, we removed all records with integer-valued measurements from the original data before drawing the training samples. In other words, the 6,000,000 records used here contain no integer values, and the same holds for the smaller samples below (see the sketch that follows).
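The sampling step described above might look like the following (a sketch; the DataFrame and column name are hypothetical):

```python
# Drop records whose measured value is an integer (to mimic continuous
# low-throughput, high-accuracy readouts), then draw the training subset.
import pandas as pd

def sample_non_integer(df: pd.DataFrame, n: int,
                       value_col: str = "expression", seed: int = 0):
    continuous = df[df[value_col] % 1 != 0]  # keep non-integer values only
    return continuous.sample(n=n, random_state=seed)  # e.g. n = 6_000_000
```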
3. 3,000,000 Sample Size
For a sample size of 3,000,000, our AI model achieved a Pearson correlation coefficient of 0.943 on the low-throughput, high-accuracy test set, compared with 0.884 for the baseline. Even at this large sample size, our model fits markedly better than the baseline.
4. 300,000 Sample Size
For a sample size of 300,000, our AI model achieved a Pearson correlation coefficient of 0.898 on the low-throughput, high-accuracy test set, compared with 0.892 for the baseline. Notably, the baseline trained on 1% of the data (300,000 records) appears to fit better than the one trained on 10% (3,000,000 records, 0.884), which contradicts theoretical expectations; we therefore treat this as an anomalous point and exclude it from further analysis.
5. 30,000 Sample Size
For a sample size of 30,000, our AI model achieved a Pearson correlation coefficient of 0.788 on the low-throughput, high-accuracy test set, compared with 0.660 for the baseline.
6. 3,000 Sample Size
For a sample size of 3,000, our AI model achieved a Pearson correlation coefficient of 0.701 on the low-throughput, high-accuracy test set, compared with 0.217 for the baseline. The pre-train and fine-tune paradigm thus shows a clear advantage at small sample sizes: our model's predictions track the actual results, while the baseline is simply unusable. This demonstrates that the paradigm can effectively address the issue of insufficient sample size.
7. 300 Sample Size
For a sample size of 300, our AI model achieved a Pearson correlation coefficient of 0.552 on the low-throughput, high-accuracy test set, compared with 0.097 for the baseline. Even with an extremely small sample, our model's predictions provide some reference value, while the baseline is simply unusable.
[1] Vaishnav, E. D., de Boer, C. G., Molinet, J., et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603(7901), 455-463 (2022).