RECCR Rensselaer Exploratory Center for Cheminformatics Research
Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters

Novel Outlier Detection Methods with One-Class SVM and Direct Kernel Methods

In the context of QSAR, it is important to identify outliers and molecules that exhibit novelty in order to assemble a coherent set of molecules for building a predictive and explanatory model. These issues fall under the class of outlier detection and novelty detection problems, both of which are hard for machine learning. Outlier detection is difficult because the outlier class offers very few samples to learn from. An additional hurdle is that the classes are unbalanced, and most machine learning methods tend to be biased towards the majority class. Yet classification problems that require outlier identification are ubiquitous. The general use of support vector machines for outlier detection is described in the machine learning literature (Chang et al., 2001; Chen et al., 2001; Unnthorsson, 2003; Campbell et al., 2001; Schölkopf et al., 2000; Tax et al., 1999).
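As a rough illustration of kernel-based outlier scoring, the sketch below ranks molecules by their squared distance to the centroid of the training set in the feature space of a Gaussian (RBF) kernel. This is a deliberately simpler stand-in, not a one-class SVM and not the Analyze/StripMiner implementation. The descriptor vectors, the `gamma` value, the function names, and the rule of thresholding at the largest training score (which treats every training molecule as an inlier) are all assumptions made for the example.

```python
import math


def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two descriptor vectors."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid_distance_scores(train, queries, gamma=0.5):
    """Squared feature-space distance of each query to the training centroid.

    ||phi(x) - mean(phi)||^2 = k(x, x) - 2 * mean_i k(x, x_i) + mean_ij k(x_i, x_j)
    """
    n = len(train)
    # Constant term: mean of all pairwise kernel values in the training set.
    mean_kk = sum(rbf(a, b, gamma) for a in train for b in train) / (n * n)
    scores = []
    for x in queries:
        mean_kx = sum(rbf(x, t, gamma) for t in train) / n
        scores.append(rbf(x, x, gamma) - 2.0 * mean_kx + mean_kk)
    return scores


# Hypothetical two-dimensional descriptors for five "normal" molecules.
train = [[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1], [0.2, -0.1], [-0.1, -0.2]]
# One query close to the training set and one far away from it.
queries = [[0.05, 0.0], [3.0, 3.0]]

scores = centroid_distance_scores(train, queries)
# Assumed rule: flag anything farther out than the most extreme training molecule.
threshold = max(centroid_distance_scores(train, train))
flags = [s > threshold for s in scores]
```

A one-class SVM replaces the uniform centroid with a weighted combination of support vectors and learns the boundary from the data, so a score of this kind is best read as a baseline.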

Novelty detection is similar to outlier detection, with the additional challenge that the novel pattern is not known a priori; all that is known is that it differs markedly from a normal pattern. There is a fair body of recent literature addressing outlier detection and novelty detection in the context of neural networks (Albrecht et al., 2000; Crook et al., 2002), statistics, and machine learning in general. An interesting approach to novelty detection is the use of auto-associative neural networks, or auto-encoders (Principe et al., 2000). Auto-associative neural networks are feedforward neural networks in which the output layer reproduces the input layer via a bottleneck of a much smaller number of neurons in the inner hidden layer. Monitoring the deviation from typical outputs of the hidden-layer neurons has often proven to be a robust way to detect novelty and outliers with neural networks.
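The bottleneck idea can be sketched with the smallest possible auto-associative network: a linear network with a single hidden neuron and tied encoder and decoder weights, trained by plain gradient descent. The example scores a pattern by its reconstruction error, a common variant that stands in here for monitoring the hidden-layer outputs directly. The data, the initial weights, the learning rate, and the function names are hypothetical, and the sketch does not reproduce the networks used in the cited work.

```python
def train_autoencoder(data, w, lr=0.1, steps=500):
    """Fit a tied-weight linear auto-associative network with one hidden neuron.

    Encoder: h = w . x        Decoder: x_hat = w * h
    Minimizes the mean squared reconstruction error by gradient descent.
    """
    w = list(w)
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            h = sum(wi * xi for wi, xi in zip(w, x))      # hidden activation
            r = [xi - wi * h for xi, wi in zip(x, w)]     # reconstruction residual
            wr = sum(wi * ri for wi, ri in zip(w, r))
            for k in range(len(w)):
                # d/dw_k of ||x - w (w . x)||^2, averaged over the data
                grad[k] += (-2.0 * h * r[k] - 2.0 * wr * x[k]) / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def reconstruction_error(w, x):
    """Squared error between a pattern and its reconstruction through the bottleneck."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return sum((xi - wi * h) ** 2 for xi, wi in zip(x, w))


# Hypothetical "normal" patterns: both inputs always move together.
normal = [[t, t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = train_autoencoder(normal, w=[0.5, 0.1])

err_normal = reconstruction_error(w, [0.8, 0.8])   # follows the learned pattern
err_novel = reconstruction_error(w, [1.0, -1.0])   # violates it
```

Because the single hidden neuron can only encode the direction in which the normal patterns vary, a pattern that breaks that relationship cannot pass through the bottleneck and is reconstructed poorly, which is what makes it detectable.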

A pilot version of outlier detection has recently been implemented in the Analyze/StripMiner code (Embrechts et al., 1999), as illustrated in Figure 1, and we propose to develop this model further into industrial-grade software.

The development of domain-specific filters and hypothesis testing within this methodology makes it an ideal candidate for use in collaborative interactions with all aspects of the Cheminformatics Center community.



Copyright ©2005 Rensselaer Polytechnic Institute
All rights reserved worldwide.