Rensselaer Exploratory Center for Cheminformatics Research (RECCR)

Background and Significance

Regression techniques and machine learning methods

Partial Least Squares : Partial Least Squares (PLS) analysis has the advantage of deriving predictive models when a large number of non-orthogonal descriptor variables is available. PLS simultaneously identifies latent variables and the regression coefficients for the response variable using an iterative approach (Wold et al., 2001). While PLS modeling is equivalent to building linear models in principal planes within property space, kernel PLS builds non-linear models on curved surfaces within data space.
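As an illustration, the sketch below fits a linear PLS model and then a rough kernel-PLS analogue built by passing the inputs through an approximate RBF feature map before the linear PLS step. The use of scikit-learn, the synthetic data, the two latent variables, and the Nystroem kernel approximation are all assumptions made for illustration, not the center's own implementation.

    # Minimal sketch: linear PLS, then an approximate kernel PLS built by
    # mapping inputs through an RBF feature map before linear PLS.
    # Data, component counts, and kernel settings are illustrative only.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.kernel_approximation import Nystroem

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))                    # 50 cases, 20 descriptors
    X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)    # make descriptors non-orthogonal
    y = X[:, 0] ** 2 + 0.1 * rng.normal(size=50)     # non-linear response

    # Linear PLS: latent variables and coefficients found iteratively.
    pls = PLSRegression(n_components=2).fit(X, y)
    print("linear PLS R^2:", pls.score(X, y))

    # Approximate kernel PLS: non-linear RBF feature map, then linear PLS on it.
    phi = Nystroem(kernel="rbf", n_components=25, random_state=0).fit_transform(X)
    kpls = PLSRegression(n_components=2).fit(phi, y)
    print("kernel PLS R^2:", kpls.score(phi, y))

On this kind of non-linear response, the kernelized variant should recover structure that the purely linear model cannot.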

Modeling with Artificial Neural Networks (ANN) : ANNs are non-linear modeling methods that are reasonably well suited to cases with a limited amount of experimental data and a large number of descriptors per case (Embrechts et al., 1998; Embrechts et al., 1999; Kewley et al., 1998). The flexibility of ANN models to learn complex patterns is powerful, but it must be coupled with model validation techniques to avoid overtraining.
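A minimal sketch of this idea follows, using a small feed-forward network with early stopping on an internal validation split to guard against overtraining. The choice of scikit-learn's MLPRegressor, the network size, and the synthetic data are assumptions for illustration.

    # Minimal sketch: a small ANN with validation-based early stopping.
    # All settings here are illustrative assumptions.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 30))                 # few cases, many descriptors
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    ann = MLPRegressor(hidden_layer_sizes=(10,),   # one small hidden layer
                       early_stopping=True,        # stop when validation error rises
                       validation_fraction=0.2,
                       max_iter=5000,
                       random_state=0)
    ann.fit(X_tr, y_tr)
    print("held-out R^2:", ann.score(X_te, y_te))

Scoring on data the network never saw during training is the simplest form of the model validation the paragraph above calls for.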

Modeling with Support Vector Machines (SVM) : Support vector machines (SVMs) are a powerful general approach to non-linear modeling. SVMs are based on the idea that it is not enough to minimize empirical error on training data, as least squares methods do; one must balance training error against the capacity of the model used to fit the data. By introducing capacity control, SVM methods avoid overfitting and produce models that generalize well. The generalization error of an SVM is not tied to the input dimensionality of the problem, because the input space is implicitly mapped to a high-dimensional feature space by means of so-called kernel functions. This explains why SVMs are less sensitive to a large number of input variables than many other statistical approaches. Nevertheless, reducing the dimensionality of a problem can still yield substantial benefits: it can improve prediction accuracy by removing irrelevant features and emphasizing relevant ones, speed up learning by shrinking the search space, and reduce the cost of acquiring data, since some descriptors or experiments may prove unnecessary. To date, SVMs have been applied successfully to a wide range of problems, including classification, regression, time series prediction, and density estimation. The recent literature (Bennett et al., 2000; Cristianini et al., 2000) contains extensive overviews of SVM methods.
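The sketch below shows capacity-controlled SVM regression with an RBF kernel, again using scikit-learn as an assumed library; the values of C and epsilon, which trade training error against model capacity, are illustrative rather than recommended settings.

    # Minimal sketch: SVM regression with an RBF kernel. The kernel implicitly
    # maps inputs to a high-dimensional feature space; C and epsilon control
    # the balance between training error and capacity. Values are
    # illustrative assumptions.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 5))
    y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=100)

    # Standardizing descriptors first keeps the RBF kernel well conditioned.
    svr = make_pipeline(StandardScaler(),
                        SVR(kernel="rbf", C=10.0, epsilon=0.05))
    svr.fit(X, y)
    print("training R^2:", svr.score(X, y))

Raising C lets the model fit the training data more closely at the risk of overfitting; lowering it enforces a smoother, lower-capacity fit.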

