
Improved Drug Discovery Through Better Machine Learning Models

Name Affiliation Phone Number Email Address
Stan Lazic AstraZeneca matt.butchers@ktn-uk.org
Ola Engkvist AstraZeneca matt.butchers@ktn-uk.org
Industrial Sectors:



As has been widely reported, interest in machine learning has been renewed over the past few years. New methods have brought a step change in image analysis and, very recently, produced a program that beat the world champion at Go, a feat that only a few years ago was thought close to impossible.

It is therefore natural that drug discovery is also seeing renewed interest in machine learning methods. These methods have already proven very useful in QSPR (Quantitative Structure Property Relationships) modeling to predict physicochemical properties such as lipophilicity. There is now interest both in widening the application domain of machine learning and in taking a fresh look at classical machine learning tasks, such as predicting the bioactivity and the ADME (Absorption, Distribution, Metabolism and Excretion) properties of a compound. This is driven not only by progress in machine learning, but also by the increased automation of drug discovery, which generates more data than was historically available.

2.1 Process Inputs

Many decisions within AstraZeneca are based on predicting various properties of molecules. In many instances, we use substructure-based descriptors as an approximation of the molecular graph and combine them with support vector machines (SVM) as the machine learning tool. [1] In the early phases of discovery, we experimentally test tens of thousands of molecules and use this knowledge as a basis for predictive modeling of the corresponding properties.
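To make the idea of a substructure-based description concrete, the following is a minimal sketch in plain Python. It is not AstraZeneca's descriptor set and not the output of any cheminformatics toolkit: the toy molecules, the radius-1 atom environments, the 64-bit fingerprint length and the function names (`substructure_keys`, `fingerprint`, `tanimoto`) are all assumptions chosen for illustration. Each atom is described by its element and its direct neighbours, the resulting strings are hashed into a fixed-size set of bits, and two molecules are compared with the Tanimoto coefficient.

```python
import zlib

# Hypothetical toy molecules, heavy atoms only (hydrogens are implicit).
ETHANOL_ATOMS = ["C", "C", "O"]            # CH3-CH2-OH
ETHANOL_BONDS = [(0, 1), (1, 2)]
PROPANOL_ATOMS = ["C", "C", "C", "O"]      # CH3-CH2-CH2-OH
PROPANOL_BONDS = [(0, 1), (1, 2), (2, 3)]


def substructure_keys(atoms, bonds):
    """Return a set of strings, one per distinct atom environment.

    Radius 0 is the element symbol alone; radius 1 adds the sorted
    symbols of the directly bonded neighbours.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    keys = set()
    for i, symbol in enumerate(atoms):
        keys.add(symbol)
        keys.add(symbol + "(" + "".join(sorted(neighbours[i])) + ")")
    return keys


def fingerprint(atoms, bonds, n_bits=64):
    """Hash each substructure key to a bit position; return the set of on-bits.

    zlib.crc32 is used instead of hash() so the result is the same in every run.
    """
    return {zlib.crc32(key.encode("utf-8")) % n_bits
            for key in substructure_keys(atoms, bonds)}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


ethanol_fp = fingerprint(ETHANOL_ATOMS, ETHANOL_BONDS)
propanol_fp = fingerprint(PROPANOL_ATOMS, PROPANOL_BONDS)
similarity = tanimoto(ethanol_fp, propanol_fp)
```

Because the fingerprint only records which local environments occur, it approximates the molecular graph rather than encoding it exactly, and hashing can map two different environments to the same bit. A bit vector of this kind, or a similarity computed from it, is the sort of input that could be handed to a kernel method such as an SVM.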

2.2 Propagation

We are interested in investigating different methods that could increase our predictive accuracy, and in comparing them with our current standard approach. We see no limitations on what can be tried, provided it relates to small-molecule drug discovery. One approach we would like to pursue is to investigate advances in deep learning and molecular graph convolution. Participants are encouraged to build models either with scikit-learn [2] or, for deep learning, within the DeepChem [3] framework; however, participants can also use other frameworks. Several interesting and inspiring articles using molecular graph convolution have recently been published and might be of interest to participants. [4]
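For readers new to molecular graph convolution, the sketch below shows a single convolution step in plain Python. It is an illustration under stated assumptions, not the DeepChem API and not the method of any cited article: the mean-aggregation rule, the one-hot atom features, the hand-picked identity weight matrix (a real model would learn its weights) and the names `graph_conv_layer` and `readout` are all hypothetical. Each atom's feature vector is replaced by the average over itself and its bonded neighbours, passed through a linear map and a ReLU, and the atom vectors are then summed into one molecule-level vector.

```python
def graph_conv_layer(features, bonds, weights):
    """One mean-aggregation graph convolution step followed by ReLU.

    features: one feature vector per atom
    bonds:    pairs of bonded atom indices
    weights:  n_in x n_out matrix (fixed here; learned in a real model)
    """
    n_atoms = len(features)
    n_in = len(features[0])
    n_out = len(weights[0])

    # Each atom aggregates over itself (self-loop) and its bonded neighbours.
    neighbours = [[i] for i in range(n_atoms)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = []
    for i in range(n_atoms):
        group = neighbours[i]
        mean = [sum(features[j][k] for j in group) / len(group)
                for k in range(n_in)]
        row = [max(0.0, sum(mean[k] * weights[k][o] for k in range(n_in)))
               for o in range(n_out)]
        updated.append(row)
    return updated


def readout(atom_vectors):
    """Sum the atom vectors into a single molecule-level vector."""
    return [sum(column) for column in zip(*atom_vectors)]


# Hypothetical toy input: ethanol with one-hot features [is_carbon, is_oxygen].
atom_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
identity_weights = [[1.0, 0.0], [0.0, 1.0]]

hidden = graph_conv_layer(atom_features, bonds, identity_weights)
molecule_vector = readout(hidden)
```

After one step, the middle carbon already carries information about the neighbouring oxygen; stacking several such layers lets information travel further along the graph. The summed readout does not depend on the order in which atoms are listed, which is one reason this family of models suits molecules.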

2.3 Interpretation and Communication of Results


AstraZeneca will bring data sets to the project for participants to use. There will be some sets for bioactivity prediction downloaded from PubChem. A second type of data is ADME data sets that have been generated internally. The training set is in the public domain. For the external test set, the compounds are in the public domain but the measured results are not. For this set, participants will make predictions and we will feed back the accuracy achieved.