Research Article
Initial development of web application to support excipient
selection for immediate release tablets using message passing
neural network
Linh Nguyen Tran (a,*), Hang Phan Thi Thu (a), Linh Vu Ngoc Hai (a)
(a) Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam
Journal of Pharmaceutical Research and Drug Information, 2023, 14 (5): 23-31
ARTICLE INFO
Article history
Received 13 May 2023
Revised 15 Sept 2023
Accepted 24 Nov 2023
Keywords
Immediate release tablets
Artificial intelligence
Message passing neural network
Excipient selection
Web application
ABSTRACT
The formulation of immediate release tablets is a challenging task due to
the need to balance multiple factors such as bioavailability and stability.
The selection of appropriate excipients is critical in achieving these
objectives. The recent successful application of Artificial Intelligence (AI)
and Message Passing Neural Networks (MPNNs) in predicting
physicochemical and biological properties of drug molecules suggests that
these models could also be applied to excipient selection.
The aim of this study is to develop an innovative approach for selecting
excipients using AI and MPNN and to create a user-friendly web
application to support excipient selection for immediate release tablets.
The study utilized a database of 13,278 immediate-release tablets to train,
validate, and test the MPNN model on the basis of the Simplified Molecular-
Input Line-Entry System (SMILES) strings of the drug substances. The model
was evaluated on how well it predicted the probability that a given
excipient is selected. A web application named FormAI
was developed using the Streamlit web framework and integrated with the
trained model. The MPNN model demonstrated good performance, with
an average Area Under the Curve (AUC) > 0.98 and R2 > 0.99, indicating that it
predicts the probability of an excipient being selected with reasonable accuracy. The FormAI
application provides a user-friendly platform for excipient selection. The
results of the study demonstrate the potential of using AI and MPNN in
drug formulation design, specifically in excipient selection for immediate-
release tablets. The FormAI application provides a practical solution for
pharmaceutical scientists and formulators.
* Corresponding author: Linh Nguyen Tran; e-mail address: linhnt@hup.edu.vn
https://doi.org/10.59882/1859-364X/134
Journal homepage: jprdi.vn/JP
Journal of Pharmaceutical Research and Drug Information
An official journal of Hanoi University of Pharmacy
Introduction
Excipient selection is one of the most
important but also difficult issues in the
process of researching and developing drug
products. Traditionally, excipient selection
is often based on the “trial and error”
method, which can be time-consuming and
costly and does not always lead to the most
effective formulation. Experimental design
and optimization methods can be applied but
require formulators to have knowledge and
experience in drug design. To address this
issue, researchers have explored the use of
Artificial Intelligence (AI) as a powerful
tool in pharmaceutical research and
development.
Among AI models, Graph Neural Networks
(GNNs), such as Message Passing Neural
Networks (MPNNs, a special type of GNN),
are increasingly being used in pharmaceutical
research and development. An MPNN is a type
of neural network that operates on
graph-structured data (consisting of nodes and
edges linking the nodes), such as the structure
of drug molecules (in this case, atoms in the
molecule act as nodes, while chemical bonds
act as edges). MPNNs have been shown to be
highly effective in predicting drug molecule
properties such as solubility, oil-water
distribution coefficient, biological effects or
toxicity [1], [2], [3]. However, until now, no
studies have been published on the use of AI
models for selecting excipients in drug
formulation research.
The goal of this study was to develop a
web application that uses artificial
intelligence and message passing neural
networks to help formulators shorten the
research and development process for
immediate-release tablets by suggesting
suitable excipients for each drug
substance. Using this application could help
reduce research costs and time, as well as
improve product quality. The results of the
study can provide the pharmaceutical
industry with a useful tool to enhance
competitiveness and meet market demand.
Materials and Methods
Data collection
Data on the ingredients of tablets were
collected from DailyMed, a public
database operated by the US National Library
of Medicine that contains detailed labeling
information on drug products submitted to
the US Food and Drug Administration (FDA). The web scraping tool
BeautifulSoup was used to collect
information on the ingredients of tablets from
DailyMed’s web pages. The collected data
included the name of the drug products, drug
substances, dosage form, strength, and
corresponding excipients. These data were
then stored in a CSV (comma-separated
values) file format for ease of retrieval and
use for model training purposes. Collecting
and storing this data will provide an
important database resource for similar
studies in the future.
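The storage step described above can be sketched as follows. This is only an illustration: the column names, the `;` separator for excipients and the sample record are assumptions made for the example, since the article lists the collected fields but not the exact file layout.

```python
import csv
import io

# Hypothetical column names; the article names the fields, not the headers.
FIELDS = ["product_name", "drug_substance", "dosage_form", "strength", "excipients"]


def write_records(records, handle):
    """Write one CSV row per drug product; excipients are joined with ';'."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["excipients"] = ";".join(record["excipients"])
        writer.writerow(row)


def read_records(handle):
    """Read rows back and split the excipient column into a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["excipients"] = row["excipients"].split(";") if row["excipients"] else []
        rows.append(row)
    return rows


# Made-up record, for illustration only (not taken from the study's data).
sample = [{
    "product_name": "Example Tablets",
    "drug_substance": "acetaminophen",
    "dosage_form": "TABLET",
    "strength": "500 mg",
    "excipients": ["povidone", "magnesium stearate"],
}]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_records(sample, buffer)
buffer.seek(0)
loaded = read_records(buffer)
print(loaded[0]["excipients"])
```

Keeping one row per drug product, with the excipient list in a single column, makes it straightforward to rebuild the per-product excipient labels later for model training.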
Data preparation
The drug substance and excipient names
in the CSV file were standardized using
Unique Ingredient Identifiers
(UNIIs); the drug substance structural
formulas in simplified molecular-input line-
entry system (SMILES) format were added to
the database and transformed into input
tensors for the MPNN model. This
transformation allowed the researchers to use
complex data analysis algorithms and tools to
make reasonable predictions about excipient
selection for immediate-release tablet
formulation. The input features generated
from SMILES ensured the integrity and
accuracy of the chemical structure data of the
drug molecule used in the MPNN model
training process.
Model development
The MPNN model had the following
architecture [4]:
Input Layer: Received input as the
attributes of each atom in the molecule and
the attributes of each bond between atoms.
This layer had sub-layers:
- Atom_features (Input Layer): The input
was a 42-dimensional vector containing
information about all common atoms in the
drug molecule.
- Bond_features (Input Layer): The input
was a 7-dimensional vector containing
information about all types of chemical bonds
between atoms in the molecule.
- Pair_indices (Input Layer): The input
was a 2-dimensional vector containing
information about pairs of atoms linked by a
chemical bond.
Message Passing Layer: Performed the
process of message passing between atoms in
the molecule. This layer had 1 sub-layer:
- Message_passing (Message Passing):
The input was the input layers prepared
earlier and returned the feature vector of
atoms after being passed through bonds.
Graph Pooling Layer: Performed pooling
to reduce the output dimension of feature
vectors of atoms in the molecule. This layer
had 1 sub-layer:
- Global_average_pooling1d: The input was
the feature vector of atoms after being passed
through bonds and returned a global feature
vector of the molecule.
Fully Connected Layer: Passed the global
feature vector of the molecule through fully
connected layers to calculate predictions
about appropriate excipients for immediate-
release tablet formulation. This layer had sub-
layers:
- Dense_2 (Dense): The input was the
global feature vector of the molecule and
returned a feature vector with size determined
when optimizing the model.
- Dense_3 (Dense): The input was a
feature vector with size determined when
optimizing the model and returned the
probability of an excipient being selected for
immediate-release tablet formulation.
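The flow from molecular graph to a single molecule-level vector can be illustrated with a deliberately simplified sketch. It assumes a shortened, hypothetical feature vocabulary (three atom types and three bond types instead of the 42 and 7 features used in the study) and replaces the learned message function with a plain sum over bonded neighbours, so it shows the data flow only, not the trained model.

```python
# Hypothetical, shortened vocabularies; the study's model uses 42 atom
# features and 7 bond features whose exact composition is not listed here.
ATOM_SYMBOLS = ["C", "N", "O"]
BOND_TYPES = ["single", "double", "aromatic"]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_molecule(atoms, bonds):
    """Build atom features, bond features and pair indices for one molecule.

    atoms: element symbols, indexed by position.
    bonds: (i, j, bond_type) tuples, one per chemical bond.
    Each bond is stored in both directions so messages can flow both ways.
    """
    atom_features = [one_hot(symbol, ATOM_SYMBOLS) for symbol in atoms]
    bond_features, pair_indices = [], []
    for i, j, bond_type in bonds:
        for src, dst in ((i, j), (j, i)):
            pair_indices.append([src, dst])
            bond_features.append(one_hot(bond_type, BOND_TYPES))
    return atom_features, bond_features, pair_indices


def message_passing_step(atom_states, pair_indices):
    """Simplified step: each atom adds the states of its bonded neighbours.

    A trained MPNN would use learned, bond-dependent transformations instead.
    """
    dim = len(atom_states[0])
    messages = [[0.0] * dim for _ in atom_states]
    for src, dst in pair_indices:
        for k in range(dim):
            messages[dst][k] += atom_states[src][k]
    return [
        [state[k] + message[k] for k in range(dim)]
        for state, message in zip(atom_states, messages)
    ]


def global_average_pooling(atom_states):
    """Average the atom states into one fixed-size molecule vector."""
    count = len(atom_states)
    dim = len(atom_states[0])
    return [sum(state[k] for state in atom_states) / count for k in range(dim)]


# Ethanol, SMILES "CCO": heavy atoms C0-C1-O2 joined by two single bonds.
atom_features, bond_features, pair_indices = encode_molecule(
    atoms=["C", "C", "O"],
    bonds=[(0, 1, "single"), (1, 2, "single")],
)
updated = message_passing_step(atom_features, pair_indices)
molecule_vector = global_average_pooling(updated)
print(pair_indices)  # [[0, 1], [1, 0], [1, 2], [2, 1]]
print(updated)       # [[2.0, 0.0, 0.0], [2.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
```

The pooled vector has the same length regardless of how many atoms the molecule contains, which is what allows the fully connected layers to work with molecules of different sizes.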
The activation function for the last layer
was a sigmoid function, used to calculate the
probability of an excipient being selected.
The activation function for the other layers
was the ReLU function [5].
The loss function (which represents the
difference between the actual value and the
value predicted by the model) was
BinaryCrossentropy [6]. When training the
model, a lower value of this function was
better.
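As a worked illustration of the sigmoid output and the binary cross-entropy loss, the sketch below computes both in plain Python. The logits and labels are made-up values, and the clipping constant `eps` is an assumption added for numerical safety; the study itself used the Keras implementation.

```python
import math


def sigmoid(x):
    """Map a raw output (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all excipient labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up logits for three excipients and their true labels
# (1 = excipient present in the product, 0 = absent).
logits = [2.0, -1.0, 0.0]
y_true = [1, 0, 1]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(y_true, probs)
print(probs, loss)
```

Because each excipient gets its own sigmoid output, one product can be assigned several excipients at once, and the loss falls as the predicted probabilities move towards the true labels.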
Model training and validation
The model was written in Python 3.9,
a language with simple, readable syntax
that is one of the most popular
programming languages today. Python is
highly flexible and can be used to develop
web applications, computer software,
artificial intelligence, data analysis, and many
other fields. The process of training and
evaluating the model was performed on
Google Colab using TensorFlow and Keras.
Google Colab is a Google service platform
that allows users to access a free Jupyter
Notebook environment for data analysis and
running Python code. TensorFlow is an open-
source library developed by Google for
processing large data and developing
machine learning and deep learning models.
Keras is an application programming
interface (API) for TensorFlow to help
simplify the building and training of deep
learning models.
The database was randomly divided into
a training set (train_dataset) accounting for
60% for model training, a validation set
(valid_dataset) accounting for 30% for model
validation, and a test set (test_dataset)
accounting for 10% to test the model.
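A random 60/30/10 split of this kind can be sketched as below. The function name, the fixed seed and the rounding rule are assumptions for the example; the article does not state how the split was implemented.

```python
import random


def split_dataset(items, train_frac=0.6, valid_frac=0.3, seed=42):
    """Shuffle the items and cut them into train, validation and test parts.

    The test set receives whatever remains after the first two cuts.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# 13,278 product indices stand in for the drug products in the database.
train_dataset, valid_dataset, test_dataset = split_dataset(range(13278))
print(len(train_dataset), len(valid_dataset), len(test_dataset))  # 7967 3983 1328
```

With 13,278 products, these proportions give 7,967, 3,983 and 1,328 products, which matches the set sizes reported in the Results.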
The model's fit method was used to train
the model on the training dataset. During
training, callbacks were used to limit
overfitting and increase model stability. The
model was evaluated using the area under
the curve (AUC), tracked over the training
epochs, and the R2 value.
The trained model was saved and then reloaded
for retesting on the test dataset. The built model
is capable of suggesting excipients for drugs
not in the original database.
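For readers unfamiliar with the AUC metric, the sketch below computes it in plain Python as the fraction of positive-negative pairs that the model ranks correctly. This is an illustrative, quadratic-time calculation on made-up scores, not the Keras metric used in the study.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive example outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = excipient used) and predicted probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every excipient that was actually used received a higher predicted probability than every excipient that was not, while 0.5 corresponds to random ranking.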
Web application deployment
The trained MPNN model was stored on
GitHub (https://github.com) and integrated
into the web application developed on
the Streamlit platform (https://streamlit.io) to
deploy the model.
In order to assist researchers in the
selection of excipients, we present the
Excipient Selection Scale, which is
determined through the following formula:
Excipient Selection Scale = P × (1 − (1 − P) × (1 − Q))    (1)
Where:
P: The probability, predicted by the MPNN
model, of finding the excipient under
consideration in drug products containing the
input drug substance.
Q: Proportion of drug products containing
the excipient under consideration over total
drug products in the database.
The Excipient Selection Scale takes values
from 0 to 1; the larger the value, the more strongly the
excipient should be considered for selection.
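Formula (1) can be checked with a few lines of Python. The function below follows the formula directly; the excipient names and the (P, Q) pairs are hypothetical values chosen only to show how candidates could be ranked.

```python
def excipient_selection_scale(p, q):
    """Excipient Selection Scale from formula (1): P x (1 - (1 - P) x (1 - Q)).

    p: predicted probability from the MPNN model for this excipient.
    q: proportion of products in the database that contain this excipient.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must both lie in [0, 1]")
    return p * (1.0 - (1.0 - p) * (1.0 - q))


# Hypothetical (P, Q) pairs for three placeholder excipients.
candidates = {
    "excipient_a": (0.90, 0.50),
    "excipient_b": (0.60, 0.80),
    "excipient_c": (0.20, 0.05),
}
scores = {name: excipient_selection_scale(p, q) for name, (p, q) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(name, round(scores[name], 3))
```

For a fixed P the scale ranges from P squared (when Q = 0) to P (when Q = 1), so a rarely used excipient is ranked lower than a widely used one with the same predicted probability.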
The Streamlit framework was used to
build a simple and easy-to-use interface.
Users could enter the name or SMILES of the
drug substances and the application would
return a list of excipients along with the
Excipient Selection Scale. In addition, some
information about molecular structure and
physical, chemical, and pharmacokinetic
properties of the drug was also predicted with
the help of SwissADME tool
(http://www.swissadme.ch).
Results and discussion
Results
Data collection and preparation
After collecting and preparing the data, a
significant dataset of immediate-release tablet
ingredients from DailyMed, including 13,278
drug products produced from 622 different
drug substances (including different
derivatives of each drug substance) was
obtained. The number of excipients collected
in this dataset was 322.
Based on the collection results, three
separate datasets for use in the training,
validation, and testing of the model were
created. The training set consisted of 7,967
products, the validation set consisted of 3,983
products, and the test set consisted of 1,328
products. These datasets were considered large
and diverse enough to support the
training, validation, and
testing process.
MPNN model development
The MPNN Model function defined an
MPNN (Message Passing Neural Network)
model in TensorFlow. The input parameters
of this function included:
- Atom_dim: The size of the feature vector
for each atom in the molecule.
- Bond_dim: The size of the feature vector
for each bond between atoms in the molecule.
- Batch_size: The number of samples
(drug products) in each training batch.
- Message_units: The number of units in
the message passing layer.
- Message_steps: The number of message
passing steps performed.
- Num_attention_heads: The number of
attention heads used in the attention layer.
- Dense_units: The number of neuron units
in the fully connected layers.
The Adam optimization algorithm [7] was
used to optimize the model’s parameters. The
initial learning rate was set to 0.001 and
decreased by a factor of 10 after every 20
epochs. The training process stopped if the
model’s error on the validation dataset did not
decrease after 20 consecutive epochs. The
number of neurons in each layer was
configured to achieve the best performance.
The architecture of the optimized MPNN
model is shown in Figure 1.
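The learning-rate schedule and stopping rule described above can be sketched in plain Python. The function names, the zero-based epoch counting and the example loss curve are assumptions for illustration; in the study these rules would have been applied through training callbacks.

```python
def learning_rate(epoch, initial_lr=0.001, drop_factor=10.0, drop_every=20):
    """Step decay: divide the rate by `drop_factor` every `drop_every` epochs.

    Epochs are counted from 0 in this sketch.
    """
    return initial_lr / (drop_factor ** (epoch // drop_every))


def stopping_epoch(val_losses, patience=20):
    """Return the epoch at which training stops, or None if it never stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stops after epoch 1.
val_losses = [1.0, 0.5] + [0.6] * 30
print(learning_rate(0), learning_rate(20), learning_rate(40))
print(stopping_epoch(val_losses))  # 21
```

In this example the best validation loss occurs at epoch 1, so the twentieth consecutive epoch without improvement is epoch 21, where training stops.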
Figure 2 represents the value of the Area
Under Curve (AUC) which represents the
accuracy of the model over the number of
Figure 1. The structure of the MPNN