Background and Significance
The importance of cheminformatics has increased dramatically in recent decades, in step with the growth of computer technology. Over that period, the drug design field has made extensive use of computational tools to accelerate the development of new and improved therapeutics (Hall et al., 2002; Wessel et al., 1998; Hansch et al., 1985; Kumar et al., 1974). Researchers have long recognized the need to establish relationships between chemical structures and their properties. The first correlation of this kind was reported in the 19th century by Brown and Fraser in the area of alkaloid activity (Albert, 1975). Subsequently, several researchers have reported correlations for a wide variety of chemical properties (e.g., equilibrium and rate constants, drug absorption, toxicity, and solubility) (Hammett, 1935; Hammett, 1937; Hansch, 2002; Kier, 2002; Guertin, 2002). The term Quantitative Structure Property Relationships (QSPR) is used generically to describe these types of models, while the term Quantitative Structure-Activity Relationships (QSAR) is often used to refer specifically to structural correlations with bioactivity. When a fundamental thermodynamic property is related to molecular features, the correlations are referred to as Linear Gibbs Free Energy Relationships (LFER) (Hammett, 1937).
The cheminformatics analysis tools that have been deployed as part of the industrial drug discovery process are gaining in sophistication and are earning increasing respect as tools crucial for the rapid development of new therapeutics. One factor driving the need for effective chemical data analysis is the tremendous growth of in-house molecular databases as a result of automated combinatorial synthesis techniques and high-throughput screening (HTS) assay systems. Cheminformatics techniques facilitate the analysis and interpretation of the chemical information contained within these sets of complex, high-dimensional molecular data. The reliability of automated methods for the analysis of these data has been plagued by numerous problems related to fortuitous correlations and over-trained models. In spite of these problems, cheminformatic analysis has gained additional credibility as methods for validating predictive models have become available.
QSPR/QSAR methods can be both a valuable source of knowledge about the nature of molecular interactions and a means of predicting molecular behavior. The importance and type of interactions involved in specific situations can be identified with the help of robust machine learning and data mining algorithms. When presented with high-dimensional chemical data, the success of statistical learning models depends strongly on their ability to identify a subset of meaningful molecular descriptors among numerous electronic, geometric, topological, and molecular size-related descriptors. When one begins with a large number of descriptors, relevant features must be identified by a combination of appropriate objective and subjective feature selection routines. The resulting descriptor set can then be employed to generate validated, predictive models using one of several regression or classification modeling methods. Alternatively, some laboratories create structure/property correlation models based on a relatively small number of pre-determined descriptors, each having a subjective chemical meaning. This approach generally yields more interpretable models, but often at the expense of predictive accuracy.
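The descriptor-selection and validation workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any cited method: the descriptor matrix is synthetic random data, the selection rule is a simple ranking by absolute Pearson correlation, the model is ordinary least squares, and the helper names (select_descriptors, cross_validated_q2) are hypothetical. Descriptors are re-selected inside each cross-validation fold so that the held-out molecules never influence feature selection, and a y-scrambling run (one common check for fortuitous correlations, not one the text names) is included for comparison.

```python
import numpy as np

def select_descriptors(X, y, k):
    """Return indices of the k descriptors with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf          # constant descriptors get r = 0
    r = (Xc.T @ yc) / denom
    return np.argsort(-np.abs(r))[:k]

def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def cross_validated_q2(X, y, k, n_folds=5, seed=0):
    """Cross-validated q^2 = 1 - PRESS / SS_total.

    Descriptor selection is repeated on each training fold only.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        cols = select_descriptors(X[train_idx], y[train_idx], k)
        coef = fit_linear(X[train_idx][:, cols], y[train_idx])
        preds[test_idx] = predict_linear(coef, X[test_idx][:, cols])
    press = ((y - preds) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    return 1.0 - press / ss_total

# Synthetic stand-in for a descriptor table (ASSUMPTION, not real data):
# 150 "molecules", 100 "descriptors", of which only the first three
# actually drive the property.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 100))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.3, size=150)

q2_real = cross_validated_q2(X, y, k=3)

# y-scrambling: shuffle the property values so that any apparent
# structure/property relationship can only arise by chance.
y_scrambled = rng.permutation(y)
q2_scrambled = cross_validated_q2(X, y_scrambled, k=3)

print(f"q2 (real property):      {q2_real:.3f}")
print(f"q2 (scrambled property): {q2_scrambled:.3f}")
```

A model that captures a genuine relationship should score far better on the real property than on the scrambled one. Similar scores in the two runs would suggest that the apparent fit is a fortuitous correlation or the result of over-training.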