Rensselaer Exploratory Center for Cheminformatics Research (RECCR)

Materials Informatics || Modeling and Mining || Small Molecules || DNA & RNA || Proteins

Materials Informatics
  • Polymerdesign - NEW!!
    Polymerdesign is an online server that predicts dielectric constant, dielectric loss tangent, band gap, glass transition temperature and suitable solvents for pure polymers. It uses a new set of descriptors designed for polymers, called infinite chain descriptors. This project is supported by the Office of Naval Research (ONR).

  • Polymerizer - NEW!!
    Polymerizer is an online server that predicts surface energy properties of nanocomposites, including the work of adhesion, the wetting angle and the work of spreading. The current website implementation uses selectable particle surface energies taken from the literature. The server also converts property-encoded surfaces into images for display. This project is supported by the Office of Naval Research (ONR).
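
    The three quantities named above are related by standard surface thermodynamics (the Dupre and Young equations). The sketch below is not Polymerizer's implementation; the function names and the example energies are hypothetical, and it assumes the surface and interfacial energies are already known in consistent units (for example mJ/m^2).

```python
import math

def work_of_adhesion(gamma_1, gamma_2, gamma_12):
    """Dupre equation: W_a = gamma_1 + gamma_2 - gamma_12."""
    return gamma_1 + gamma_2 - gamma_12

def work_of_spreading(gamma_sv, gamma_sl, gamma_lv):
    """Spreading coefficient: S = gamma_sv - gamma_sl - gamma_lv."""
    return gamma_sv - gamma_sl - gamma_lv

def wetting_angle_deg(gamma_sv, gamma_sl, gamma_lv):
    """Young's equation: cos(theta) = (gamma_sv - gamma_sl) / gamma_lv."""
    cos_theta = (gamma_sv - gamma_sl) / gamma_lv
    # Clamp so that complete wetting or complete non-wetting maps to 0 or 180 degrees.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

# Illustrative (made-up) energies: solid-vapor, solid-liquid, liquid-vapor.
gamma_sv, gamma_sl, gamma_lv = 40.0, 25.0, 30.0
theta = wetting_angle_deg(gamma_sv, gamma_sl, gamma_lv)
w_adh = work_of_adhesion(gamma_sv, gamma_lv, gamma_sl)
spread = work_of_spreading(gamma_sv, gamma_sl, gamma_lv)
```

    A negative spreading value, as in this made-up example, corresponds to a finite wetting angle, and the work of adhesion agrees with the Young-Dupre form gamma_lv * (1 + cos(theta)).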

  • YAMS - NEW!!
    The Yet Another Modeling System (YAMS) tool is designed to aid in the discovery and creation of the materials of tomorrow. Informatics best practices are demonstrated throughout the design of the tool and establish the basis for robust informatics modeling. The YAMS tool performs a series of balanced, rational decisions in dataset evaluation, parameter/feature selection and choice of modeling method. The focus of the tool is to convey rich information about model quality and predictions to the end user, helping to "close the loop" between modeling and experimental efforts.

Modeling and Mining
  • SOLE - Beta version
    The SVR-based Online Learning Equipment (SOLE) is a web-based machine learning system. SOLE provides three machine learning algorithms based on support vector regression (SVR) that can be used in QSAR/QSPR studies. For comparison, the system also includes the Partial Least Squares (PLS) algorithm. Several feature selection methods are provided for reducing the dimension of the input descriptor space. Two cross-validation methods, leave-one-out (LOO) and K-fold, are provided for model hyperparameter selection.
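
    As a rough illustration of K-fold and LOO hyperparameter selection, the sketch below picks a regularization strength by cross-validated error. It is not SOLE's code: a one-parameter ridge fit stands in for SVR to keep the example dependency-free, and all function names are hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) pairs; k == n_samples gives leave-one-out."""
    indices = list(range(n_samples))
    for fold in range(k):
        test_idx = indices[fold::k]          # every k-th sample, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

def fit_ridge_1d(xs, ys, lam):
    """Slope of a no-intercept ridge fit: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k):
    """Mean squared error over all held-out points."""
    squared_error = 0.0
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], lam)
        squared_error += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
    return squared_error / len(xs)

def select_lambda(xs, ys, candidates, k):
    """Return the candidate with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, k))

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]                     # noise-free toy data, y = 2x
best_kfold = select_lambda(xs, ys, [0.0, 1.0, 10.0], k=3)
best_loo = select_lambda(xs, ys, [0.0, 1.0, 10.0], k=len(xs))
```

    On this noise-free toy data both schemes prefer no regularization; on real descriptor data the two can disagree, and LOO costs one model fit per sample.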

  • RS-WebPredictor - NEW!!
    RS-Predictor is a tool for generating pathway-independent, isozyme-specific P450 regioselectivity QSARs from any set of known substrates and metabolites. Models have been trained using a combination of topological descriptors and SMARTCyp reactivities applied to substrate sets of CYPs 1A2(271), 2A6(105), 2B6(151), 2C19(218), 2C8(142), 2C9(226), 2D6(270), 2E1(145) and 3A4(475), as well as a MERGED(680) set representing every curated reaction, regardless of isozyme. A large proportion of the metabolites in each set were identified within the top two predicted rank-positions: 1A2(82.3%), 2A6(85.7%), 2B6(76.8%), 2C19(86.2%), 2C8(83.8%), 2C9(84.1%), 2D6(83.7%), 2E1(80.7%), 3A4(82.1%), MERGED(86.0%). These models may be quickly applied (~30s per substrate per model) to any set of user-supplied substrates through RS-WebPredictor.
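
    One plausible reading of the "top two rank-positions" percentages is a top-k hit rate: the fraction of substrates for which at least one observed site of metabolism appears among the k highest-ranked predicted sites. The sketch below computes that quantity under this assumption; the data layout, substrate names and atom labels are hypothetical and are not RS-Predictor's input format.

```python
def top_k_hit_rate(ranked_sites, observed_sites, k=2):
    """Fraction of substrates with an observed site among the top-k predictions.

    ranked_sites:   {substrate: [site, ...]} ordered from most to least likely
    observed_sites: {substrate: set of experimentally observed sites}
    """
    hits = 0
    for substrate, ranking in ranked_sites.items():
        if set(ranking[:k]) & observed_sites[substrate]:
            hits += 1
    return hits / len(ranked_sites)

# Made-up example: atom labels stand in for candidate sites of metabolism.
ranked = {
    "substrate_A": ["C7", "C3", "N1"],
    "substrate_B": ["C2", "C9", "C4"],
    "substrate_C": ["N5", "C1", "C8"],
    "substrate_D": ["C6", "C2", "C3"],
}
observed = {
    "substrate_A": {"C3"},          # hit at rank 2
    "substrate_B": {"C4"},          # miss: only found at rank 3
    "substrate_C": {"N5", "C8"},    # hit at rank 1
    "substrate_D": {"C6"},          # hit at rank 1
}
rate = top_k_hit_rate(ranked, observed, k=2)
```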

  • ROE Beta
    This web-based application automates Rank Order Entropy (ROE) analysis using Kendall tau. The ROE process examines an input dataset and gives the user a recommendation on the stability of models built from the dataset, together with an evaluation of the dataset's ability to support a predictive model. The procedure divides the dataset into training and testing sets, tests those sets for their ability to be modeled, models the data, and finally examines the results in terms of rank order (as Kendall tau) and Rank Order Entropy (as Kendall tau over several truncations of the data).
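
    The rank-order step can be sketched with a plain Kendall tau (the tau-a variant, where tied pairs count as neither concordant nor discordant). How ROE chooses its truncations and turns the resulting values into an entropy is not described here, so the truncation rule below (keep the compounds with the highest observed values) is an assumption and the function names are hypothetical.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(a)), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:
            score += 1      # pair ordered the same way in both sequences
        elif product < 0:
            score -= 1      # pair ordered oppositely
    n_pairs = len(a) * (len(a) - 1) // 2
    return score / n_pairs

def tau_over_truncations(observed, predicted, sizes):
    """Tau restricted to the `size` items with the highest observed values."""
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    result = {}
    for size in sizes:
        keep = order[:size]
        result[size] = kendall_tau([observed[i] for i in keep],
                                   [predicted[i] for i in keep])
    return result

observed = [5.0, 4.0, 3.0, 2.0, 1.0]
predicted = [5.1, 4.2, 2.9, 1.0, 2.0]   # last two compounds are swapped
taus = tau_over_truncations(observed, predicted, sizes=[3, 5])
```

    In this toy case the ranking is perfect within the top three compounds and degrades once the swapped pair is included.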

  • ROMS
    The RECCR Online Modeling System (ROMS) is a general web-based machine learning system. Users upload their data set through the web client, generate a model with one of the available learning methods, and visualize its performance. The three learning methods provided are Partial Least Squares (PLS), Kernel-PLS and Support Vector Machine (SVM). In addition to basic modeling functionality, cross-validation methods such as Leave-One-Out (LOO) and Monte Carlo Cross Validation (MCCV) are provided for model parameter selection.
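
    Monte Carlo cross validation repeatedly draws a random train/test split instead of partitioning the data once into fixed folds. The sketch below shows the idea with a least-squares line through the origin standing in for the PLS and SVM learners; it is not ROMS code, and the function names, split fraction and repeat count are assumptions.

```python
import random

def mccv_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield (train_idx, test_idx) for repeated random splits of the data."""
    rng = random.Random(seed)                 # seeded for reproducible splits
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])

def fit_slope(xs, ys):
    """Least-squares slope of a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mccv_mse(xs, ys, n_repeats=20, test_fraction=0.3, seed=0):
    """Average held-out mean squared error over all random splits."""
    errors = []
    for train_idx, test_idx in mccv_splits(len(xs), n_repeats, test_fraction, seed):
        w = fit_slope([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in test_idx) / len(test_idx))
    return sum(errors) / len(errors)

xs = list(range(1, 11))
ys = [3 * x for x in xs]                      # noise-free toy data, y = 3x
score = mccv_mse(xs, ys)
```

    Unlike K-fold, a sample may land in the test set several times or not at all; the number of repeats is chosen independently of the test-set size.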

  • MIRank
    Multiple Instance Ranking (MIRank) is a novel machine learning model that enables ranking to be performed in a multiple instance learning setting. The motivation for MIRank stems from the hydrogen abstraction problem in computational chemistry: predicting the group of hydrogen atoms from which a hydrogen is abstracted (removed) during metabolism. The model predicts the preferred hydrogen group within a molecule by ranking the groups, with the ambiguity of not knowing which hydrogen atom within the preferred group is actually abstracted. The paper formulates MIRank in its general context and proposes an algorithm for solving MIRank problems using successive linear programming. The method outperforms multiple instance classification models on several real and synthetic datasets. This website freely distributes the datasets and source code used in this first study.

  • DMTL
    The Data Mining Template Library (DMTL) supports the mining of increasingly complex and informative pattern types in structured and unstructured datasets, including Itemsets, Sequences, Trees and Graphs (See Fig. 1). DMTL is a C++ library consisting of highly efficient algorithms and data structures, utilizing a generic data mining approach in which all aspects of mining are controlled via a set of properties. Another novel feature of DMTL is that it provides transparent persistency and indexing support for effective computation over massive datasets. We have successfully mined datasets in the 60-100GB range using a desktop PC! DMTL has been publicly released as open-source software on SourceForge, and it has already been downloaded by over 2000 researchers from all over the world.
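
    To make the simplest of these pattern types concrete, the sketch below mines frequent itemsets with a basic level-wise (Apriori-style) search. It does not use or imitate DMTL's C++ API, has none of its persistency or indexing support, and is only meant to show what "frequent itemset" means; the function name and toy transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    size = 2
    while current:
        # Join: unions of two frequent (size-1)-itemsets that contain `size` items.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        # Prune: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        frequent.update(current)
        size += 1
    return frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
patterns = frequent_itemsets(baskets, min_support=3)
```

    With a minimum support of 3, every single item and every pair is frequent in this toy data, while the full triple appears in only two transactions and is dropped.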

Small Molecules
  • RECON
    RECON is an algorithm for the rapid reconstruction of molecular charge densities and charge density-based electronic properties of molecules, using atomic charge density fragments pre-computed from ab initio wavefunctions. These fragments are known as Transferable Atom Equivalents, or "TAEs". The method is based on Bader's quantum theory of Atoms in Molecules.

  • PEST
    PEST Shape/Property hybrid descriptor technology, developed in DDASSL, allows better representation of the kinds of intermolecular interactions that are dependent on molecular shape. The inclusion of PEST descriptors has been found to significantly improve QSPR models where intermolecular interactions play an important role in the chemical effects being modeled. PEST descriptors are generated using TAE molecular surface representations to define property-encoded boundaries similar to the Zauhar "Shape Signature" ray-tracing approach to shape/property convolution.

DNA & RNA
  • DIXEL
    DIXEL is a web-based descriptor generator that provides a TAE-based representation of the electronic properties of the major or minor grooves of DNA. DIXEL represents electron density features such as electrostatic potential (EP) and local average ionization potential (PIP) on the accessible surfaces of the major or minor groove on a grid of rectangles -- the "Dixel" coordinate system. These features can be displayed graphically and/or employed as input to data mining algorithms.

  • Mfold
    The objective of the Mfold web server for nucleic acid folding and hybridization prediction is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software. Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and 'energy dot plots', is available for the folding of single sequences.

Proteins
  • PESDserv
    A server for high-throughput comparison of surfaces of protein-ligand binding sites.

  • Protein Recon
    A version of the RECON/TAE program optimized for use with proteins, allowing users to rapidly produce a set of descriptors that can characterize protein behavior. Protein Recon is an algorithm for the rapid reconstruction of molecular charge density-based electronic properties of proteins, using peptide fragments precomputed from ab initio wavefunctions. These properties can be displayed graphically and/or employed as input to data mining algorithms.

  • WebPDB
    WebPDB is a web-based workflow system that is flexible and capable of semi-automatic protein structure cleaning activities. The protein data may be provided by the user, but can also be directly downloaded from the PDB archive as part of the automated workflow. In its next generation, WebPDB will produce pH-sensitive protein surface descriptors that take into account appropriate protonation states and fractional protonation/deprotonation of basic and acidic side chain groups. WebPDB prepares proteins for use in virtual screening and predictive modeling. It removes gaps (through self-homology with FASTA information), heteroatoms and ligands (for re-use). Coupled with other modeling tools, WebPDB can be useful in probe development and the interpretation of secondary screening results through docking and scoring computations.

  • HMMSTR/Rosetta
    The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction.

  • SCALI
    Proteins of the same class often share a secondary structure packing arrangement but differ in how the secondary structure units are ordered in the sequence. We find that proteins that share a common core also share local sequence-structure similarities, and these can be exploited to align structures with different topologies. In this study, segments from a library of local sequence-structure alignments were assembled hierarchically, enforcing compactness and conserved inter-residue contacts but not sequential ordering. Previous structure-based alignment methods often ignore sequence similarity, local structural equivalence and compactness. SCALI (Structural Core ALIgnment) can efficiently find conserved packing arrangements, even if they are nonsequentially ordered in space. SCALI alignments conserve remote sequence similarity and contain fewer alignment errors. Clustering of our pairwise non-sequential alignments shows that recurrent packing arrangements exist in topologically different structures.

  • MASKER contacts & MASKER voids
    A fast algorithm for computing the solvent accessible molecular surface area (SAS) using Boolean masks (Le Grand, S. M. & Merz, K. M. J. (1993). J. Comp. Chem. 14, 349-52.) has been modified to estimate the solvent excluded molecular surface area (SES), including contact, toroidal and reentrant surface components. Numerical estimates of the arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and intersections between those arcs are used to estimate the reentrant surface area. The new method is compared to an exact analytical method. Boolean molecular surface areas are continuous and pairwise differentiable, and should be useful for molecular dynamics simulations, especially as the basis for an implicit solvent model. MASKER contacts finds the surface area burial by residue in a protein, while MASKER voids finds the locations of empty cavities in proteins (or any molecule).
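
    The starting point, a numerical SAS estimate in which each atom carries a Boolean mask over test points on its probe-expanded sphere, can be sketched as follows. This is a simplified point-counting illustration only: it computes each mask by direct distance tests rather than the precomputed mask look-ups of the cited method, it does not estimate the SES components, and the function names, probe radius and point count are assumptions.

```python
import math

def sphere_points(n):
    """n roughly uniform unit vectors from a golden-spiral lattice."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def sas_areas(atoms, probe_radius=1.4, n_points=500):
    """Per-atom SAS area; atoms is a list of (x, y, z, radius) tuples."""
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        expanded = ri + probe_radius
        exposed = [True] * n_points            # Boolean mask for atom i
        for j, (xj, yj, zj, rj) in enumerate(atoms):
            if i == j:
                continue
            cutoff_sq = (rj + probe_radius) ** 2
            for k, (ux, uy, uz) in enumerate(unit):
                if not exposed[k]:
                    continue
                dx = xi + expanded * ux - xj
                dy = yi + expanded * uy - yj
                dz = zi + expanded * uz - zj
                # A point inside a neighbor's expanded sphere is buried.
                if dx * dx + dy * dy + dz * dz < cutoff_sq:
                    exposed[k] = False
        fraction = sum(exposed) / n_points
        areas.append(4.0 * math.pi * expanded ** 2 * fraction)
    return areas

# Two overlapping atoms of radius 1.6 placed 2.0 apart along z (toy geometry).
pair = sas_areas([(0.0, 0.0, 0.0, 1.6), (0.0, 0.0, 2.0, 1.6)])
single = sas_areas([(0.0, 0.0, 0.0, 1.6)])
```

    For this toy pair, each expanded sphere loses the cap lying inside its neighbor, so about two thirds of an isolated atom's area remains exposed.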

Copyright ©2005 Rensselaer Polytechnic Institute
All rights reserved worldwide.