
Software

CHEERS!

Our Gold Medal requirement, 🔗Excellence in: Software Tool, has been accepted! Check our 🔗Awards page for details!

info

This page mainly describes how we find the genes and weights for our models using Python and Jupyter Notebook. We originally considered this Software, but thanks to the judges who pointed out that it is Modeling, we have added this note. If you are interested in how we simulate the data, please check 🔗Simulation for more details.

We think the best part of our software is the way we simplify the weights to fit the wet lab experiments. It ensures that every generated weight is ready to be used in wet lab experiments.

Also, don't miss the 2-row spot section if you want to know how we visualize our results.


We’ll call it AI to sell it, Machine Learning to build it. Here I'd love to call ours a Machine Learning Model.

Setting Up the Environment​

Choosing which tool to use depends on your specific needs.

If we need precise control over Python versions and package dependencies, conda is the best choice.

  • Installing packages directly with pip can clutter the root environment and make dependency management difficult.

  • Python venv environments isolate packages but don't allow fine-grained control over Python or package versions.

  • In contrast, conda lets you create separate environments with specific versions of Python and packages. This improves repeatability and avoids dependency issues.

Through some experimentation, we ultimately chose Micromamba as the Python environment management tool over Miniconda and Anaconda.

  • Anaconda includes many preinstalled packages and tools we don't need for this project. At over 3GB, it takes up considerable disk space compared to the more lightweight options.

  • In contrast, Micromamba contains only the bare minimum needed - the core conda components and Python. At around 30MB, it has a much smaller footprint.

  • Beyond size, Micromamba also provides faster dependency resolution and environment creation than Miniconda.

  • The biggest advantage of Micromamba over Miniconda is speed. Using Miniconda to install dependencies took 2 hours on a teammate's computer, whereas with Micromamba it took just 10-15 minutes.

  • Micromamba is faster at installing and updating packages due to its more performant package manager and parser. It also supports parallel installs for more flexible multi-environment management, which helps us a lot.

So for our purposes, Micromamba provides the reliable environment management of conda while being sufficiently lightweight. The speed advantage compared to Miniconda is the primary reason we selected it.

Model Training and Feature Selection​

Importing and Preprocessing Data​

First, we download the corresponding miRNA expression data from TCGA using the TCGA identifier. Then we decompress the resulting TSV files and parse them into pandas dataframes.

Next, we transform the expression data back to a linear scale. Since the original dataset applied a log2 transformation to reduce skew, we invert it: normalized expression = 2^expression − 1.

d = np.power(2, datafile) - 1  

To facilitate subsequent training, we also add the sample status (cancer = 1, normal = 0) to the normalized dataframe.
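
A minimal sketch of these preprocessing steps (the file name and the rule for deriving sample status from TCGA barcodes are assumptions for illustration):

import numpy as np
import pandas as pd

# Load the decompressed miRNA expression TSV (samples as columns, miRNAs as rows).
datafile = pd.read_csv("TCGA-LUAD.mirna.tsv", sep="\t", index_col=0).T

# Undo the log2(x + 1) transform used by the source dataset.
d = np.power(2, datafile) - 1

# Sample status: in TCGA barcodes the sample-type code (4th field) is
# 01-09 for tumor and 10-19 for normal tissue (assumed layout).
d["status"] = [1 if int(barcode.split("-")[3][:2]) < 10 else 0 for barcode in d.index]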

Through these preprocessing steps, we transform the data into a form suitable for machine learning and statistical analysis. These preprocessed and curated miRNA expression data will be used in subsequent analyses, such as differential expression identification, feature selection, and classification model training.

Feature Selection step 1​

We first select a subset of miRNAs based on the differential expression analysis.

  • First, calculate the average expression of each miRNA in cancer and normal samples in the preprocessed data, respectively.

    Then, screen for those miRNAs with large differential expression between cancer and normal groups by a chosen threshold (typically a fold change). We consider these highly differentially expressed miRNAs to be good candidates for distinguishing cancer from normal samples.

    Thus, the retained miRNA subset has expression in cancer that is markedly higher than in normal samples. By removing non-differential miRNAs, the feature space is narrowed for more efficient modeling (a minimal sketch of this screening step follows after this list).

  • Next, we train classification models on the prepared training data, perform feature selection with regularization techniques, evaluate model performance on the test set, and select the best model.

    We train various models such as LinearSVC, ElasticNet, LogisticRegression, and Lasso, using miRNA expression as features and the sample status as labels.

    First, L1 regularization is used for feature selection, adjusting regularization strength to select an optimal number of features.

    L1 regularization automatically selects features by driving unimportant coefficients to zero: Lower regularization strength retains more features, higher values remove more features.

    L1 regularization provides an automated, data-driven way to select the most informative, non-redundant features for the model, eliminating noise and irrelevant features. This is critical for identifying small biomarker (miRNA) feature sets optimized for accuracy and generalizability.
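
As referenced above, here is a minimal sketch of the differential-expression screen, assuming the preprocessed dataframe d from the previous step and a hypothetical fold-change threshold:

FOLD_CHANGE_THRESHOLD = 4  # hypothetical cutoff; tune per dataset

expr = d.drop(columns="status")
mean_cancer = expr[d["status"] == 1].mean()
mean_normal = expr[d["status"] == 0].mean()

# Keep miRNAs whose mean expression in cancer exceeds the normal mean
# by at least the chosen fold change (epsilon guards against division by zero).
fold_change = (mean_cancer + 1e-9) / (mean_normal + 1e-9)
selected_mirnas = fold_change[fold_change >= FOLD_CHANGE_THRESHOLD].index

X = expr[selected_mirnas]   # feature matrix for the models below
y = d["status"]             # labels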

Initializing Models​

Models are initialized by specifying L1 regularization, e.g.:

from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.svm import LinearSVC

logisticRegr = LogisticRegression(penalty='l1', solver='liblinear')    # liblinear supports L1
linearSVC = LinearSVC(penalty='l1', loss='squared_hinge', dual=False)  # L1 requires dual=False
lasso = Lasso(alpha=0.1)

Setting Regularization Parameters​

Regularization strength is controlled via hyperparameters like C or alpha. We prepare a range of values to train with varying regularization levels:

C_values = np.logspace(-9, 0, num=100)       # LogisticRegression / LinearSVC: smaller C = stronger regularization
alpha_values = np.logspace(-4, -1, num=100)  # Lasso: larger alpha = stronger regularization

Model Training Loop​

In a loop, models are trained using different hyperparameter values:

for c in C_values:
    logisticRegr.set_params(C=c)
    logisticRegr.fit(X_train, y_train)
    # record the non-zero coefficients and test metrics for this C here

By adjusting regularization strength, coefficients of irrelevant features are driven to zero, removing them from the model and leaving only the most informative features with non-zero coefficients.

Thus, by training with varying regularization levels, the optimal number of features maximizing performance can be found. The selected miRNA subset can then be considered most predictive for the cancer classification task.
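
Expanding the loop above, here is a sketch of how each regularization strength's feature count and test performance might be recorded (the train/test split and metric choice are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

C_values = np.logspace(-9, 0, num=100)
results = []
for c in C_values:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=c)
    model.fit(X_train, y_train)
    n_features = int(np.sum(model.coef_ != 0))                     # surviving miRNAs
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results.append((c, n_features, auc))

The entry in results with the best AUC at an acceptably small n_features then points to the model whose coefficients go into the weight-simplification step.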

Simplifying Weights​

The weights indicate the relative importance of each selected miRNA biomarker for classification. Thus, for each model's selected miRNA subset, we analyze the coefficients to obtain simplified weights.

  • After getting the selected feature coefficients (keeping only the non-zero ones), we compute their mean magnitude and the scaling bounds that map them into the allowed integer range (the pieces below are consolidated into a single sketch after this list):

    coef = coef[coef != 0]                   # keep only the selected (non-zero) coefficients
    weights_avg = np.mean(np.abs(coef))
    min_times = 0.5 / np.min(np.abs(coef))   # lower bound of the scaling factor (smallest |coef| -> 0.5)
    max_times = 25.5 / np.max(np.abs(coef))  # upper bound of the scaling factor (largest |coef| -> 25.5)
  • Attempt to round each coefficient to the nearest allowed weight level:

    coef_rounded = []
    for coef_i in coef:
        match_weight = min(allowed_weights, key=lambda x: abs(x - coef_i))
        coef_rounded.append(match_weight)
  • Keep the rounded coefficients (across all candidate scalings) that minimize the relative difference from the original:

    error = np.mean(np.abs(coef - coef_rounded) / np.abs(coef))  # relative rounding error for this scaling
    best_rounding = rounded_coefs[np.argmin(errors)]             # pick the scaling with the smallest error
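
The snippets above fit together roughly as follows. This is a minimal consolidated sketch, not our exact notebook code: the construction of allowed_weights (signed integers up to 5 and products of two of them, cf. the tip in the Results section) and the linear scan over scaling factors between min_times and max_times are illustrative assumptions.

import itertools
import numpy as np

def simplify_weights(coef, n_candidates=200):
    # Consolidated reading of the snippets above; details are illustrative.
    coef = np.asarray(coef)
    coef = coef[coef != 0]

    # Assumed allowed levels: signed integers up to 5 and products of two of them.
    levels = {a * b for a, b in itertools.product(range(1, 6), repeat=2)}
    allowed_weights = np.array(sorted({s * l for s in (-1, 1) for l in levels}), dtype=float)

    # Candidate scalings: keep the smallest |coef| above ~0.5 and the
    # largest below ~25.5 so every scaled value can round to some level.
    min_times = 0.5 / np.min(np.abs(coef))
    max_times = 25.5 / np.max(np.abs(coef))

    errors, rounded_coefs = [], []
    for times in np.linspace(min_times, max_times, n_candidates):
        scaled = coef * times
        rounded = np.array([allowed_weights[np.argmin(np.abs(allowed_weights - c))]
                            for c in scaled])
        errors.append(np.mean(np.abs(scaled - rounded) / np.abs(scaled)))
        rounded_coefs.append(rounded)

    # Keep the rounding with the smallest mean relative error.
    return rounded_coefs[int(np.argmin(errors))]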

We successfully produce simplified integer weights that preserve relative feature importance while providing:

  1. Greater model interpretability by conveying each biomarker's contribution.

  2. More biological meaning than arbitrary coefficient values. They can indicate related mechanisms like miRNA regulatory strength.

  3. Model compression by reducing high-precision coefficients to a few discrete levels. This makes the model more portable for resource-constrained settings.

  4. The simplified model with integer weights can recreate predictions from the full model fairly well, allowing reproducibility in other contexts.

  5. Rounding coefficients provides some robustness against noise and overfitting.

Test Result of Simplified Model​

Simplifying is good, but it also means losing some accuracy. To avoid errors caused by this loss, we added another layer that re-runs all jobs with the newly simplified weights. We call this step reproduce.

We compute the score for each metric and flag a set of simplified weights as passing only when all important metrics (accuracy, F1 score, AUC) are identical or differ only slightly. Even though our simplification method is quite refined, about 20% of the weight sets still fail the reproduce step.

After the reproduce step, we are confident that every passing set of weights is reproducible.
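
A sketch of what such a reproduce check can look like. The helper below scores samples directly with the fixed simplified weights; the handling of the intercept and the exact tolerance are simplifications for illustration:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def reproduce_check(X_test, y_test, weights, intercept, original_metrics, tolerance=0.02):
    # X_test must contain only the selected miRNA columns, in the same order as weights.
    scores = X_test @ weights + intercept   # linear decision scores
    preds = (scores > 0).astype(int)
    reproduced = {
        'acc': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds),
        'auc': roc_auc_score(y_test, scores),
    }
    # Flag the weight set only if every key metric stays close to the original.
    return all(abs(reproduced[m] - original_metrics[m]) <= tolerance for m in reproduced)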

Model Selection and Analysis​

Model Evaluation​

Each trained model is evaluated on the test set using metrics like accuracy, AUC, F1 score, etc. This provides performance estimates for models trained with different regularization strengths. Metrics at each regularization level were recorded in the previous step to find the best model.

Evaluation is done primarily by comparing predictions to true labels to compute metrics like accuracy, AUC, F1 score, precision, recall, etc. These evaluation metrics quantify the model's generalization ability and provide unbiased estimates of real-world performance.
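
For a fitted classifier that exposes predict_proba (such as LogisticRegression), the metrics can be computed along these lines (names follow the sketches above):

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the cancer class

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred),
    'auc': roc_auc_score(y_test, y_prob),
}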

Visualization​

Model performance is visualized by generating plots like ROC curves, precision-recall curves, confusion matrices, score distributions, etc. This allows intuitive assessment of performance.
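
Such plots can be generated with scikit-learn's display helpers, as in this sketch (requires scikit-learn ≥ 1.0; model and the test split are assumed from the earlier steps):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

fig, (ax_roc, ax_cm) = plt.subplots(1, 2, figsize=(12, 5))
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax_roc)        # ROC curve
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax_cm)  # confusion matrix
fig.tight_layout()
plt.show()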

Below is an example with Elastic Net:

[Figure: evaluation plots for the Elastic Net model]

We also compared different modeling approaches, as shown below:

[Figure: comparison of different modeling approaches]

Plotting and Analysis​

Based on the model evaluation, we summarize the selected features and weights in a table for further analysis.

In the second engineering cycle, we also developed a notebook that creates detailed figures of the data, like the one shown below. It is open source, so anyone can fork it and generate their own. The figure below shows the result of our selection on the lung cancer data provided by TCGA, which combines LUAD and LUSC; our system gives a signal when either cancer is detected.

[Figure: selected miRNAs and weights for the TCGA lung cancer data (LUAD + LUSC)]

2-row spot​

In wet lab experiments, we measured the intensities of two types of fluorescence to comprehensively evaluate infection status.

Directly subtracting the two fluorescence intensities effectively compresses two-dimensional data into one dimension, resulting in information loss.

This insight led us to represent the data not as a simple arithmetic difference, but as X and Y axes in a chart. With this approach, different data points can be better separated, enabling new capabilities like identifying samples with neither bacterial nor viral infection.

Separately train (deprecated)​

The draft version of the code and its actual output are as follows:

import matplotlib.pyplot as plt

# Score each sample with the viral and bacterial models (cal_score is our helper)
viral_score = cal_score(viral_feature, viral_weight, viral_times, gse_csv, viral_model, viral_intercept)
bacterial_score = cal_score(bacterial_feature, bacterial_weight, bacterial_times, gse_csv, bacterial_model, bacterial_intercept)
y = gse_csv['infection_status'].map({'viral': 1, 'non-infectious illness': 0, 'bacterial': 2})

# Scatter the two scores against each other, colored by infection status
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.scatter(viral_score, bacterial_score, c=y)

# Diagonal reference line (drawn in axes coordinates)
ax.plot([0, 1], [0, 1], transform=ax.transAxes)
[Figure: draft 2D scatter of viral vs. bacterial scores]

At that time, we chose to model the two rows independently:

  • Gather all combinations of infection status: healthy, viral, bacterial, and both.

  • Train one linear model to detect viral infection and another to detect bacterial infection, allowing only positive weights.

    • viral model: (healthy, bacterial) vs. (viral, both)
    • bacterial model: (healthy, viral) vs. (bacterial, both)
  • Then compare the selected genes and look for connections between the two models.

  • This turned out not to be very useful, so we set it aside for a while.

New version: Train together, plot in 2 rows​

Before selecting which sets to take forward into experiments, we found that many groups of genes and weights scored identically on the common metrics, yet still behaved differently.

Our initial 1D plots felt lacking, until we recalled previous work with 2D plots.

Sure enough, switching to 2D plots significantly improved separation between negative and positive results. The higher dimensionality enabled clearer distinctions between groups that seemed equivalent in 1D.

The figure below shows the difference.

[Figure: 1D vs. 2D visualization of the same weight sets]

Results​

tip

Thanks to our teammate's ingenious design, the range of weight values we can select has greatly expanded: from ±4 to any integer whose absolute value is at most 5, or any product of such integers. This makes it easier to obtain more accurate results.

Compare to earlier work​

A previous publication at https://doi.org/10.1038/s41565-023-01348-9 1 found that 99.6% of NSCLC patients and 85.2% of healthy individuals were correctly identified. Testing with the genes they screened also produced good agreement, as shown below:

[Figure: results obtained with the genes screened in the referenced study]

Building on their work, we validated through cross-validation and testing on independent datasets that the model maintains its high accuracy on new samples. Our model correctly identifies 99.6% of NSCLC patients and 94.7% of healthy individuals, and it leads on all other metrics as well.

[Figure: our model's performance on the TCGA lung cancer data]

Clinical deployment of our model could reduce invasive diagnostic procedures and costs for the majority of healthy individuals who are correctly identified as non-cancerous, although more effort from wet lab teams is still needed. This demonstrates the translational potential of our approach to improve NSCLC screening and diagnosis.

Adaptation to other cancers​

Moreover, our method can easily be adapted to other cancers, such as kidney clear cell carcinoma (KIRC). We were able to achieve excellent results for KIRC as well, demonstrating the versatility and strong potential of our platform.

[Figure: KIRC results]

Reference​

Another paper 2

Note

All of our software is free and open source. You can find the source code on the iGEM GitLab. If you have any questions about the software, please contact us.

Some notes in the source notebooks may be in Chinese, but most of the code should be self-explanatory.


  1. Yin, Fangfei, et al. “DNA-Framework-Based Multidimensional Molecular Classifiers for Cancer Diagnosis.” Nature Nanotechnology, vol. 18, no. 6, Springer Science and Business Media LLC, 27 Mar. 2023, pp. 677–686. Crossref, doi:10.1038/s41565-023-01348-9. https://doi.org/10.1038/s41565-023-01348-9↩
  2. Lopez, Randolph, et al. “A Molecular Multi-Gene Classifier for Disease Diagnostics.” Nature Chemistry, vol. 10, no. 7, Springer Science and Business Media LLC, 30 Apr. 2018, pp. 746–754. Crossref, doi:10.1038/s41557-018-0056-1. https://doi.org/10.1038/s41557-018-0056-1↩