Model
Introduction Video for SEPIA
A simplified example explaining how SEPIA predicts polypharmacy side effects using a heterogeneous graph neural network.
Polypharmacy Dataset
We use two datasets that provide information about polypharmacy.
The first dataset is a preprocessed train/validation/test dataset provided with the Decagon paper [1]. It incorporates human drug-drug interaction networks, with side effects indicating polypharmacy interactions, all compiled from different sources.
The second dataset uses data from the nSides [2] project databank, which includes drug side effects (OFFSIDES) and drug-drug pair side effects (TWOSIDES). These data are an update of the data released on adverse events reported to the FDA through the FDA Adverse Event Reporting System (FAERS), up to and including 2014; FAERS contains information on adverse events and medication errors. The dataset comprises clinical reports of the drugs taken together with their side effects, with each report covering a combination of 1 to 49 drugs.
Dataset Preprocessing
Decagon Dataset
This dataset has already been preprocessed; see SimVec [3] for details.
nSides Dataset
- Extract Chemical Structure Embeddings
We downloaded the entire STITCH databank, version 5.0 [4], with the names and SMILES [5] strings of STITCH's chemicals. The SMILES strings were canonicalized before the chemical structure embeddings were extracted with MolFormer [6]; chemicals whose extracted embedding is NA were filtered out. We then stored the STITCH ID, drug name, and chemical embedding of each remaining chemical in a table.
- Map Drug Names from nSides with STITCH
The drug concept names from the nSides dataset were mapped to the drug names in the table of STITCH IDs, drug names, and chemical embeddings. Drugs that could not be mapped were filtered out. We then stored the processed drug category from nSides, with the STITCH ID, drug name, and chemical embedding of each drug, in a table.
- Quality Control and Dataset Splitting
In quality control, reports containing any drug that is not in our drug category are removed, and only reports with combinations of 2 to 10 drugs are kept. After quality control, the dataset is split by report ID into training, validation, and test sets in a 0.8/0.1/0.1 ratio.
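The three preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names, the toy IDs such as `CID_A`, the case-insensitive exact name matching, and the fixed random seed are all hypothetical choices made for the example.

```python
import math
import random

def filter_valid_embeddings(chemicals):
    """Keep chemicals whose embedding exists and contains no NaN values."""
    kept = {}
    for stitch_id, (name, embedding) in chemicals.items():
        if embedding is None or any(math.isnan(x) for x in embedding):
            continue
        kept[stitch_id] = (name, embedding)
    return kept

def map_drug_names(nsides_names, chemicals):
    """Map nSides names to STITCH IDs (assumed: case-insensitive exact match)."""
    name_to_id = {name.lower(): sid for sid, (name, _) in chemicals.items()}
    return {n: name_to_id[n.lower()] for n in nsides_names if n.lower() in name_to_id}

def quality_control(reports, known_drugs, min_drugs=2, max_drugs=10):
    """Drop reports with unmapped drugs or with too few / too many drugs."""
    return {
        rid: drugs
        for rid, drugs in reports.items()
        if min_drugs <= len(set(drugs)) <= max_drugs and set(drugs) <= known_drugs
    }

def split_by_report(report_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle report IDs and split them into train / validation / test lists."""
    ids = sorted(report_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Toy data: IDs and embedding values are made up for illustration.
chemicals = {
    "CID_A": ("aspirin", [0.1, 0.2]),
    "CID_B": ("ibuprofen", [0.3, float("nan")]),  # dropped: NaN embedding
    "CID_C": ("warfarin", [0.5, 0.6]),
}
valid = filter_valid_embeddings(chemicals)
mapping = map_drug_names(["Aspirin", "Warfarin", "Ibuprofen", "unknown"], valid)

reports = {f"r{i}": ["Aspirin", "Warfarin"] for i in range(10)}
reports["r_bad"] = ["Aspirin", "Ibuprofen"]  # contains an unmapped drug
reports["r_single"] = ["Aspirin"]            # fewer than 2 drugs
clean = quality_control(reports, set(mapping))
train, val, test = split_by_report(clean)
```

Splitting on report IDs rather than on individual drug and side effect pairs keeps every record of a given report in a single partition.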
Preprocessed Dataset
| | Decagon | nSides |
|---|---|---|
| Number of Drugs | 645 | 2204 |
| Number of Side Effects | 963 | 17552 |
| Max Number of Drugs per Combination | 2 | 10 |
| Polypharmacy Interactions for Training | 4512911 | 2878989 |
| Polypharmacy Interactions for Validation | 19785 | 357294 |
| Polypharmacy Interactions for Testing | 19842 | 357301 |
Graph Structure
We construct a knowledge graph (KG) with two types of nodes: drugs and hypernodes. The edges of the graph connect drugs to hypernodes and so encode the different drug combinations. The node features of the drugs are based on chemical structures, and the features of each hypernode represent the side effects of its drug combination.
Chemical Structure Embeddings
The chemical structure embeddings are extracted using MolFormer [6], a large-scale chemical language model published by IBM that captures molecular structure and properties. It employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. For the nSides dataset, the SMILES string of each drug was used to generate its embedding, which provides meaningful biological information.
Introduction to SAGEConv Architecture
SAGEConv [7] learns node representations by aggregating information from each node's neighborhood. Its degree-normalized aggregation function captures fine-grained information about the local neighborhood of each node, which is particularly useful in tasks such as link prediction and graph classification.
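For reference, the layer-wise update from the GraphSAGE paper can be written as below, here with the mean aggregator. Which aggregator SEPIA configures is not stated in this section, so the mean variant is an assumption for illustration. In the formula, $h_v^{(k)}$ is the representation of node $v$ at layer $k$, $\mathcal{N}(v)$ is its neighborhood, $W^{(k)}$ is a learned weight matrix, $\sigma$ is a nonlinearity, and $\Vert$ denotes concatenation.

```latex
h_{\mathcal{N}(v)}^{(k)} = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)},
\qquad
h_v^{(k)} = \sigma\!\left( W^{(k)} \cdot \left[\, h_v^{(k-1)} \,\Vert\, h_{\mathcal{N}(v)}^{(k)} \,\right] \right)
```

Dividing by $|\mathcal{N}(v)|$ is the degree normalization: a drug that appears in many combinations contributes an averaged, not summed, signal.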
Heterograph Construction
Nodes
- Drugs: Represent the drugs, with their chemical structure embeddings as node features.
graph TD
A((Drug))
style A fill:#FFFFFF,stroke:#00018,color:#000000
[Figure: drug_node_construct]
- Hypernodes: Represent polypharmacy cases. These nodes store aggregated information on the side effects caused by drug combinations.
graph TD
H((Hypernode))
style H fill:#FFFFFF,stroke:#00018,color:#000000
[Figure: hypernode_construct]
Edges
- Hyperedge Part 1: Directed edges connecting drug nodes to hypernodes, indicating that a particular combination of drugs is associated with specific side effects stored in the hypernodes.
- Hyperedge Part 2: Directed edges connecting hypernodes back to drug nodes. In conjunction with Hyperedge Part 1, these edges form hyperedges that encapsulate the relationships between drug combinations and their side effects.
graph TD
A((Drug))
H{{Hypernode}}
A ---> H
H ---> A
style A fill:#FFFFFF,stroke:#00018,color:#000000
style H fill:#FFFFFF,stroke:#00018,color:#000000
Graph Construction Example with Three Drugs
graph TD
A((Drug_1))
B((Drug_2))
C((Drug_3))
H{{Hypernode}}
A --- H
B --- H
C --- H
style A fill:#FFFFFF,stroke:#00018,color:#000000
style B fill:#FFFFFF,stroke:#00018,color:#000000
style C fill:#FFFFFF,stroke:#00018,color:#000000
style H fill:#FFFFFF,stroke:#00018,color:#000000
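The node and edge layout described above can be sketched as plain edge lists. This is a minimal sketch, not SEPIA's actual construction code: the function name `build_hypergraph` and the relation names `"in"` and `"contains"` are hypothetical, and integer IDs stand in for real drugs and side effects.

```python
def build_hypergraph(records):
    """Build both edge directions from (drug_ids, side_effect_ids) records.

    Each record is one drug combination and becomes one hypernode.
    """
    to_hyper, to_drug, hyper_effects = [], [], {}
    for hyper_id, (drugs, effects) in enumerate(records):
        hyper_effects[hyper_id] = sorted(effects)  # side effects kept on the hypernode
        for drug in drugs:
            to_hyper.append((drug, hyper_id))  # Hyperedge Part 1: drug -> hypernode
            to_drug.append((hyper_id, drug))   # Hyperedge Part 2: hypernode -> drug
    edges = {
        ("drug", "in", "hypernode"): to_hyper,
        ("hypernode", "contains", "drug"): to_drug,
    }
    return edges, hyper_effects

# Toy records: a three-drug combination and a two-drug combination.
records = [((0, 1, 2), {7}), ((0, 3), {4, 7})]
edges, hyper_effects = build_hypergraph(records)
```

Because each combination gets its own hypernode, a combination of any size is represented with one node plus one pair of directed edges per drug, instead of a separate pairwise edge for every two drugs.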
Model Architecture
The model architecture, named SEPIA (Side Effect Prediction with Interaction Awareness), is designed for drug interaction and side effect prediction, leveraging the graph representation of polypharmacy data. The model consists of an Encoder and a Decoder.
Encoder:
- Input: Graph representation of the biological data, including drugs, proteins, and their interactions, along with identified side effects.
- Output: Embeddings representing the drug combinational interactions and effects.
- Architecture: It consists of SAGEConv instances for processing heterogeneous graph data and an embedding layer for side effects.
Decoder:
- Input: Embeddings from the encoder.
- Output: Predicted side effects, each expressed as a probability between 0 and 1.
- Architecture: A neural network that maps embeddings to side effect predictions.
Negative Sampler:
- Input: The graph, the batch size, and the number of side effects to generate.
- Output: Randomly generated node IDs and side effects.
- Functionality: Generates negative samples (random side effects) for training the SEPIA model.
  - Compute Distribution: Calculates the distribution of hyperedge sizes in the graph using the hyperedge_size_distribution(graph) function.
  - Random Number of Nodes: Randomly draws the number of nodes to include in each negative sample from the computed distribution, so each negative sample can have a different number of nodes.
  - Generate Negative Samples: For each negative sample in a batch, a set of nodes and an effect are selected at random. The nodes are drawn from the "drugs" node type in the graph, and the effect is drawn from the total number of possible effects.
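The negative sampling procedure described above can be sketched as follows. This is a simplified stand-in, not the project's implementation: here `hyperedge_size_distribution` takes a plain list of hyperedges instead of a graph object, `sample_negatives` is a hypothetical name, and the sketch does not check whether a random sample happens to coincide with a true positive.

```python
import random
from collections import Counter

def hyperedge_size_distribution(hyperedges):
    """Return (sizes, probabilities) for the number of drugs per hyperedge."""
    counts = Counter(len(drugs) for drugs in hyperedges)
    total = sum(counts.values())
    sizes = sorted(counts)
    return sizes, [counts[s] / total for s in sizes]

def sample_negatives(hyperedges, num_drugs, num_effects, batch_size, seed=None):
    """Draw random (drug set, side effect) pairs as negative samples."""
    rng = random.Random(seed)
    sizes, probs = hyperedge_size_distribution(hyperedges)
    negatives = []
    for _ in range(batch_size):
        # Number of drugs follows the observed hyperedge size distribution.
        k = rng.choices(sizes, weights=probs, k=1)[0]
        drugs = rng.sample(range(num_drugs), k)  # distinct random drug node IDs
        effect = rng.randrange(num_effects)      # random side effect ID
        negatives.append((sorted(drugs), effect))
    return negatives

# Toy hyperedges: two pairs and one triple over four drugs.
hyperedges = [[0, 1], [1, 2], [0, 2, 3]]
sizes, probs = hyperedge_size_distribution(hyperedges)
negatives = sample_negatives(hyperedges, num_drugs=4, num_effects=5, batch_size=8, seed=0)
```

Matching the size distribution of the real hyperedges keeps the model from telling positives and negatives apart by combination size alone.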
Training Description
Training involves data and model loading, negative sampling, a forward pass, loss computation, a model update, and validation. The process is iterative and optimizes the model's parameters to better predict the side effects of drug interactions.
Step by Step Training:
1. Custom Graph Data Loading: Load the model and the heterograph data, in batches containing the positive samples.
2. Negative Sampling: Generate negative samples, which add noise to training so that the model learns to separate important features from noise.
3. Forward Pass: The SEPIA model takes the positive and negative graphs, processes them through the encoder to get embeddings, and then through the decoder to get side effect predictions.
4. Loss Computation and Model Update: Compute the loss using a binary cross-entropy loss function over both positive and negative samples. Gradients are then calculated by backpropagation, and the model's parameters are updated using an optimizer.
The model is then validated on a separate dataset to assess how well it predicts the side effects.
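The loss in step 4 can be illustrated with a small, framework-free sketch. It assumes the decoder produces raw scores that are passed through a sigmoid, with label 1 for positive samples and 0 for negative samples; the function names and the `eps` guard are choices made for this example, not taken from the SEPIA code.

```python
import math

def sigmoid(x):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(pos_scores, neg_scores, eps=1e-12):
    """Mean binary cross-entropy over positive (label 1) and negative (label 0) scores."""
    losses = [-math.log(sigmoid(s) + eps) for s in pos_scores]
    losses += [-math.log(1.0 - sigmoid(s) + eps) for s in neg_scores]
    return sum(losses) / len(losses)

# An undecided model (score 0 everywhere) has a loss of about ln(2).
undecided = bce_loss([0.0], [0.0])
# Scoring positives high and negatives low gives a lower loss than the reverse.
good = bce_loss([5.0], [-5.0])
bad = bce_loss([-5.0], [5.0])
```

In a real training loop, an automatic differentiation framework would compute the gradients of this loss and an optimizer would apply the parameter update.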
Key Model Features:
Polypharmacy Modelling:
The hypernode graph structure can handle true polypharmacy interactions.
Heterogeneous Graph Processing:
The model can handle graphs with multiple node and edge types, making it suitable for further incorporation of other biological data.
Chemical Structure Embedding:
Enhances the model's understanding of the drug's chemical structure.
References
1. Decagon. Marinka Zitnik, Monica Agrawal, Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, Volume 34, Issue 13, July 2018, Pages i457–i466. https://doi.org/10.1093/bioinformatics/bty294
2. nSides. Vanguri, Rami; Romano, Joseph; Lorberbaum, Tal; Youn, Choonhan; Nwankwo, Victor; Tatonetti, Nicholas (2017). nSides: An interactive drug--side effect gateway. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5483698.v2
3. SimVec. Lukashina, N., Kartysheva, E., Spjuth, O. et al. SimVec: predicting polypharmacy side effects for new drugs. J Cheminform 14, 49 (2022). https://doi.org/10.1186/s13321-022-00632-5
4. STITCH databank. Damian Szklarczyk, Alberto Santos, Christian von Mering, Lars Juhl Jensen, Peer Bork, Michael Kuhn. STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D380–D384. https://doi.org/10.1093/nar/gkv1277
5. SMILES. David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 1 (February 1988), 31–36. https://doi.org/10.1021/ci00057a005
6. MolFormer. Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, et al. Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. https://arxiv.org/abs/2106.09553
7. SAGEConv. William L. Hamilton, Rex Ying, Jure Leskovec. Inductive Representation Learning on Large Graphs. https://arxiv.org/abs/1706.02216