Search for relevant subsets of binary predictors in high dimensional regression for discovering the lead molecule

One of the main problems that the drug discovery research field confronts is to identify small molecules, modulators of protein function, which are likely to be therapeutically useful. Common practices rely on the screening of vast libraries of small molecules (often 1-2 million molecules) in order to identify a molecule, known as a lead molecule, which specifically inhibits or activates the protein function. To search for the lead molecule, we investigate the molecular structure, which generally consists of an extremely large number of fragments. Presence or absence of particular fragments, or groups of fragments, can strongly affect molecular properties. We study the relationship between molecular properties and its fragment composition by building a regression model, in which predictors, represented by binary variables indicating the presence or absence of fragments, are grouped in subsets and a bi-level penalization term is introduced for the high dimensionality of the problem. We evaluate the performance of this model in two simulation studies, comparing different penalization terms and different clustering techniques to derive the best predictor subsets structure. Both studies are characterized by small sets of data relative to the number of predictors under consideration. From the results of these simulation studies, we show that our approach can generate models able to identify key features and provide accurate predictions. The good performance of these models is then exhibited with real data about the MMP-12 enzyme.

Search for relevant subsets of binary predictors in high dimensional regression for discovering the lead molecule

Mameli, Valentina;Slanzi, Debora;Poli, Irene;Green, Darren V S

2021

Abstract

One of the main problems that the drug discovery research field confronts is to identify small molecules, modulators of protein function, which are likely to be therapeutically useful. Common practices rely on the screening of vast libraries of small molecules (often 1-2 million molecules) in order to identify a molecule, known as a lead molecule, which specifically inhibits or activates the protein function. To search for the lead molecule, we investigate the molecular structure, which generally consists of an extremely large number of fragments. Presence or absence of particular fragments, or groups of fragments, can strongly affect molecular properties. We study the relationship between molecular properties and its fragment composition by building a regression model, in which predictors, represented by binary variables indicating the presence or absence of fragments, are grouped in subsets and a bi-level penalization term is introduced for the high dimensionality of the problem. We evaluate the performance of this model in two simulation studies, comparing different penalization terms and different clustering techniques to derive the best predictor subsets structure. Both studies are characterized by small sets of data relative to the number of predictors under consideration. From the results of these simulation studies, we show that our approach can generate models able to identify key features and provide accurate predictions. The good performance of these models is then exhibited with real data about the MMP-12 enzyme.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2021
			
	Titolo della Rivista
	
				PHARMACEUTICAL STATISTICS
			
	N° Volume
	
				20
			
	DOI
	
				https://dx.doi.org/10.1002/pst.2117
			
	Appare nelle tipologie:
	
				2.1 Articolo su rivista

File in questo prodotto:

File	Dimensione	Formato
2021_PS.pdf non disponibili Tipologia: Versione dell'editore Licenza: Accesso chiuso-personale Dimensione 1.53 MB Formato Adobe PDF Visualizza/Apri	1.53 MB	Adobe PDF	Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/3738547

Citazioni

1

2

2

social impact