Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters
Causal Analysis of Chemometric Models with Partial Least Squares (PLS)
Once the subset of descriptors that have either a real or a spurious relation to a given property under study has been determined, Partial Least Squares is used to assess causal models that are based on a combination of (1) data mining using nonlinear kernel PLS and (2) expert domain knowledge. Some background is useful for understanding the use of PLS as a tool for both data mining and hypothesis testing. It is followed by a discussion of how PLS is used to test hypotheses and theories put forward in consultation with domain experts.
PLS was initially developed in Sweden by Herman Wold (Wold, 1966) for causal analysis of complex social science problems characterized by one or more of the following: non-normally distributed data, many measured and/or latent variables, and a small sample size. The technique was introduced into chemometrics by Svante Wold (Wold et al., 2001; Wold, 2001) for predictive modeling of chemical systems and spectral analysis (Gao et al., 1998; Gao et al., 1999; Thosar et al., 2001). The differing needs of social science research and chemometrics have taken the technique along different evolutionary paths. In the applied sciences, the focus is on prediction in the face of nonlinearity (Bennett et al., 2003; Rosipal et al., 2001) and on both small and large data sets (Bennett et al., 2003). In the social sciences, the use of PLS and other structural equation modeling (SEM) techniques has focused on hypothesis testing and causal modeling (Fornell, 1982; Kaplan, 2000; Marcoulides et al., 1996). PLS is superior to other structural equation modeling techniques in that it requires neither an assumption of normally distributed data nor independence of the predictor variables (Linton, 2004; Falk et al., 1992; Fornell et al., 1982). It is also possible to obtain solutions with PLS even if there are more variables than observations (Linton, 2004; Falk et al., 1992; Chin et al., 1999). Although PLS may not yield Best Linear Unbiased Estimators (BLUE) when the number of observations is small, the model coefficients quickly converge toward the BLUE criteria as the number of observations increases (Fornell et al., 1982; Chin et al., 1999). The quality and robustness of PLS models are assessed by the magnitude of the explained variance and by whether the relations between the measured and theoretical variables in the proposed model are statistically significant when tested with bootstrapping (resampling) (Efron et al., 1993; Efron, 1982).
These techniques are frequently and successfully used (Linton, 2004; Yoshikawa et al., 2004; Johnston et al., 2000; Yoshikawa et al., 2000; Tiessen et al., 2000; Gray et al., 2004; Croteau et al., 2003; Das et al., 2003; Croteau et al., 2001; Hulland, 1999; Igbaria, 1990; Cook et al., 1989) for evaluating causal models.
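The two quality measures described above, explained variance and bootstrapped significance of the model relations, can be illustrated with a minimal sketch. It is not the procedure used in the cited studies: it uses plain PLS1 regression (one response, NIPALS algorithm) rather than a full PLS path model with latent constructs, and the synthetic data, the function names (`pls1_fit`, `standardize`, `bootstrap_pls`), the two-component setting, and the 95% percentile interval are all assumptions made for illustration.

```python
import numpy as np

def standardize(X, y):
    """Center and scale descriptors and response to unit variance."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    return Xs, ys

def pls1_fit(X, y, n_components=2):
    """PLS1 via NIPALS on already standardized data; returns coefficients."""
    X = X.copy()
    y = y.copy()
    p = X.shape[1]
    W = np.zeros((p, n_components))   # weight vectors
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w                     # latent score vector
        tt = t @ t
        P[:, a] = X.T @ t / tt
        q[a] = y @ t / tt
        W[:, a] = w
        X -= np.outer(t, P[:, a])     # deflate X
        y -= q[a] * t                 # deflate y
    # Coefficients in terms of the original (standardized) descriptors
    return W @ np.linalg.solve(P.T @ W, q)

def bootstrap_pls(X, y, n_components=2, n_boot=500, seed=0):
    """Explained variance plus 95% percentile bootstrap intervals."""
    rng = np.random.default_rng(seed)
    Xs, ys = standardize(X, y)
    beta = pls1_fit(Xs, ys, n_components)
    # ys is centered, so sum(ys**2) is the total sum of squares
    r2 = 1.0 - np.sum((ys - Xs @ beta) ** 2) / np.sum(ys ** 2)
    n = X.shape[0]
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        Xb, yb = standardize(X[idx], y[idx])
        boot[b] = pls1_fit(Xb, yb, n_components)
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    significant = (lo > 0) | (hi < 0)             # interval excludes zero
    return beta, r2, lo, hi, significant

# Hypothetical data: 40 molecules, 5 descriptors; only the first two
# descriptors actually drive the property.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=40)

beta, r2, lo, hi, significant = bootstrap_pls(X, y)
print("explained variance:", round(float(r2), 3))
for j in range(X.shape[1]):
    print(f"descriptor {j}: beta={beta[j]:+.3f}  "
          f"95% CI=[{lo[j]:+.3f}, {hi[j]:+.3f}]  significant={bool(significant[j])}")
```

In this sketch a relation counts as supported when its bootstrap interval excludes zero. A percentile interval is only one of several ways to turn bootstrap replicates into a significance statement; a bootstrap t-statistic is another common choice.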
By reducing the list of possible combinations of descriptors under consideration for a given set of molecules, data mining allows experts with suitable domain knowledge to focus on developing theories and models of likely candidate descriptors and their associated interactions. Once models are developed, causal PLS can be used to determine how much of the variance is explained by the proposed model and whether all or some of the hypotheses supporting the model are statistically significant. Through this process it is possible to combine data mining with domain expertise to gain insight into the relationship between molecular descriptors and the properties under consideration. This process of (1) data mining, followed by (2) hypothesis generation by a domain expert and (3) hypothesis testing, is novel and has potential application to many other fields as well. Both this particular application and others are excellent candidates for future external funding.
