Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters
Co-Investigator: Mark J. Embrechts
Associate Professor, Department of Decision Sciences & Engineering Systems
Transparent Chemometrics Modeling
In the past we developed machine learning methodologies and software for molecular drug design, or QSAR (quantitative structure-activity relationships), that solve similar problems under the NSF-funded DDASSL project (Embrechts et al., 1999). The DDASSL project (Drug Discovery and Semi-Supervised Learning) was a 5-year, $1.5 million research project under the supervision of Mark Embrechts (with Profs. Curt Breneman and Kristin Bennett as Co-PIs) that came to completion in December 2004. As a product of this research we developed and implemented (direct) kernel partial least squares, or K-PLS (Gao et al., 1998; Gao et al., 1999; Bennett et al., 2003; Rosipal et al., 2001; Lindgren et al., 1993; Embrechts et al., 2004; Shawe-Taylor et al., 2004), for feature identification and model building. This software is currently used at several pharmaceutical companies as their flagship software for drug design. K-PLS is closely related to support vector machines (SVMs) (Cristianini et al., 2000; Vapnik, 1998; Scholkopf et al., 2002; Boser et al., 1992). SVMs are currently one of the main paradigms for machine learning and data mining.
The relevance of K-PLS for chemometrics is twofold. On the one hand, it is a powerful nonlinear modeling and feature selection method that can be formulated as a paradigm closely related (and almost identical) to support vector machines. On the other hand, K-PLS is a natural nonlinear extension of the PLS method (Wold et al., 2001; Wold, 2001), a purely statistical method that has dominated chemometrics and drug design during the past decade. Using K-PLS rather than support vector machines for molecular design can be motivated on several levels: i) extensive theoretical and experimental benchmarking studies have shown that there is little difference between K-PLS and SVMs; ii) unlike SVMs, there is no patent on K-PLS; iii) K-PLS is a statistical method and a natural extension of PLS and principal component analysis, currently the methods of choice in chemometrics and drug design; iv) we developed and implemented a powerful feature selection procedure with K-PLS that is fully benchmarked and ranked 6th out of 80 group entries in the NIPS feature selection challenge (Embrechts et al., 2004); v) PLS is one of the few methods besides Bayesian networks that has proven successful for causality models.
Sensitivity analysis will be used to select relevant descriptors from a predictive model. The underlying idea of sensitivity analysis (Embrechts et al., 2004; Kewley et al., 2000; Embrechts et al., 2003; Breneman et al., 2003) is that once a model is built, all inputs are frozen at their average values, and then the inputs are varied one by one within their allowable ranges. The inputs or features for which the predictions vary little when they are varied are considered less important, and they are gradually pruned from the input data in a series of successive iterations between model building and feature selection. Typically, sensitivity analysis proceeds iteratively, with about 10% of the features (descriptors) dropped during each step.
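The iterative pruning loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: a small ridge regression written in NumPy stands in for K-PLS, and the function names (`fit_ridge`, `sensitivities`, `prune_features`), the number of probe points, and the synthetic data are all hypothetical. Each feature's sensitivity is taken here as the spread of the model's predictions when that feature sweeps its observed range while all other features stay at their mean.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Stand-in model (an assumption, NOT K-PLS): ridge regression with intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    w = np.linalg.solve(A, Xb.T @ y)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

def sensitivities(predict, X, n_steps=7):
    """Prediction spread when one input sweeps its range, the others frozen at their mean."""
    base = X.mean(axis=0)
    lo, hi = X.min(axis=0), X.max(axis=0)
    sens = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        probes = np.tile(base, (n_steps, 1))
        probes[:, j] = np.linspace(lo[j], hi[j], n_steps)
        out = predict(probes)
        sens[j] = out.max() - out.min()
    return sens

def prune_features(X, y, n_keep, drop_frac=0.10):
    """Alternate model building and pruning until n_keep features remain."""
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        predict = fit_ridge(X[:, active], y)
        sens = sensitivities(predict, X[:, active])
        # Drop roughly 10% of the remaining features, but at least one per pass.
        n_drop = max(1, int(drop_frac * active.size))
        n_drop = min(n_drop, active.size - n_keep)
        keep = np.sort(np.argsort(sens)[n_drop:])
        active = active[keep]
    return active

# Synthetic example: only columns 2 and 11 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 2] - 2.0 * X[:, 11] + 0.05 * rng.normal(size=200)
selected = prune_features(X, y, n_keep=2)
print(selected)
```

With a linear stand-in model the sensitivity reduces to the coefficient magnitude times the feature's range; the sweep becomes more informative with a nonlinear model such as K-PLS, where the response to a single input need not be monotone.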
During the past three years we have experimented with identifying a small subset of transparent and explanatory descriptors, based on sensitivity analysis and integrated domain filters based on experiments. The idea here is that we present the domain expert with a comprehensive list of selected molecules cross-linked with “cousin” descriptors, that is, descriptors that have a high correlation with the selected descriptors (typically > 85%). One of the novelties of this proposal is the integration of domain expertise for selecting between alternate sets of descriptors, together with the integration of appropriate chemical domain filters in the descriptor selection phase.
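The "cousin" descriptor listing can be illustrated with a short sketch. It assumes that "high correlation" means an absolute Pearson correlation above 0.85 computed over the data set; the function name `cousin_descriptors`, the descriptor names, and the synthetic data are hypothetical and serve only as an example.

```python
import numpy as np

def cousin_descriptors(X, names, selected, threshold=0.85):
    """Map each selected descriptor to the unselected descriptors highly correlated with it."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are descriptors
    chosen = set(selected)
    cousins = {}
    for i in selected:
        cousins[names[i]] = [
            names[j]
            for j in range(X.shape[1])
            if j not in chosen and abs(corr[i, j]) > threshold
        ]
    return cousins

# Synthetic example: d1 nearly duplicates d0, d3 is anti-correlated with d2.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    -base[:, 1] + 0.1 * rng.normal(size=300),
    base[:, 2],
])
names = ["d0", "d1", "d2", "d3", "d4"]
table = cousin_descriptors(X, names, selected=[0, 2])
print(table)
```

A table of this kind gives the domain expert the alternatives for each selected descriptor, so that a chemically more interpretable cousin can be substituted where appropriate.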