Modeling and Mining
Polymerdesign - NEW!!
Polymerdesign is an online server that predicts dielectric constant, dielectric loss tangent, band gap, glass transition temperature and suitable solvents for pure polymers. It uses a new set of polymer-specific descriptors called infinite chain descriptors. This project is supported by the Office of Naval Research (ONR).
Polymerizer - NEW!!
Polymerizer is an online server that predicts surface-energy properties of nanocomposites, including the work of adhesion, the wetting angle and the work of spreading. The current website implementation uses selectable particle surface energies taken from the literature. The server also converts property-encoded surfaces into images for display. This project is supported by the Office of Naval Research (ONR).
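The description above does not say how Polymerizer computes these quantities. As a minimal sketch only, the three outputs can be related to surface energies with textbook wetting relations, assuming each surface energy is split into dispersive and polar parts and the interfacial energy is estimated with the Owens-Wendt geometric mean. The function names, the dictionary keys and the choice of the geometric-mean model are assumptions made for illustration, not the server's implementation.

```python
import math

def interfacial_energy(particle, polymer):
    """Owens-Wendt geometric-mean estimate of the interfacial energy.

    Each argument is a (dispersive, polar) pair in mJ/m^2 (assumed split).
    """
    pd, pp = particle
    md, mp = polymer
    return (pd + pp) + (md + mp) - 2 * math.sqrt(pd * md) - 2 * math.sqrt(pp * mp)

def wetting_summary(particle, polymer):
    """Work of adhesion, work of spreading and Young contact angle."""
    gamma_s = sum(particle)   # total surface energy of the particle (solid)
    gamma_l = sum(polymer)    # total surface energy of the polymer (wetting phase)
    gamma_sl = interfacial_energy(particle, polymer)
    # Young's equation; clamp so complete wetting/dewetting maps to 0/180 degrees.
    cos_theta = max(-1.0, min(1.0, (gamma_s - gamma_sl) / gamma_l))
    return {
        "work_of_adhesion": gamma_s + gamma_l - gamma_sl,
        "work_of_spreading": gamma_s - gamma_l - gamma_sl,
        "contact_angle_deg": math.degrees(math.acos(cos_theta)),
    }

# Made-up illustrative values, not literature data.
print(wetting_summary(particle=(10.0, 0.0), polymer=(40.0, 0.0)))
```

A positive work of spreading indicates that the polymer would spread spontaneously over the particle under this model; a negative value corresponds to a finite contact angle.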
YAMS - NEW!!
The Yet Another Modeling System (YAMS) tool is designed to aid in the discovery and creation of the materials of tomorrow. Informatics best practices are demonstrated throughout the design of the tool and establish the basis for robust informatics modeling. The YAMS tool performs a series of balanced, rational decisions in dataset evaluation, parameter/feature selection and choice of modeling method. The focus of the tool is to convey rich information about model quality and predictions to the end user, helping to "close the loop" between modeling and experimental efforts.
SOLE - Beta version
The SVR-based Online Learning Equipment (SOLE) is a web-based machine learning system. It provides three SVR-based machine learning algorithms for use in QSAR/QSPR studies. For comparison, the system also includes the Partial Least Squares (PLS) algorithm. Several feature selection methods are provided for reducing the dimension of the input descriptor space. Two cross-validation methods, leave-one-out (LOO) and K-fold, are provided for model hyperparameter selection.
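The way cross-validation drives hyperparameter selection can be sketched in a few lines. The example below is not SOLE code: `k_fold_splits`, `leave_one_out_splits` and `select_by_cv` are hypothetical helpers, and `fit_and_score` stands in for whatever routine trains a model on the training indices and returns its validation error.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices); fold sizes differ by at most one."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def leave_one_out_splits(n_samples):
    """LOO is K-fold with one sample per fold."""
    yield from k_fold_splits(n_samples, n_samples)

def select_by_cv(candidates, splits, fit_and_score):
    """Return the candidate hyperparameter with the lowest mean validation error."""
    splits = list(splits)  # reuse the same folds for every candidate

    def mean_error(candidate):
        errors = [fit_and_score(candidate, train, test) for train, test in splits]
        return sum(errors) / len(errors)

    return min(candidates, key=mean_error)
```

Using identical folds for every candidate keeps the comparison fair, since differences in mean error then come from the hyperparameter and not from the split.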
RS-WebPredictor - NEW!!
RS-Predictor is a tool for generating pathway-independent, isozyme-specific P450 regioselectivity QSARs from any set of known substrates and metabolites. Models have been trained using a combination of topological descriptors and SMARTCyp reactivities applied to substrate sets of CYPs 1A2(271), 2A6(105), 2B6(151), 2C19(218), 2C8(142), 2C9(226), 2D6(270), 2E1(145) and 3A4(475), as well as a MERGED(680) set representing every curated reaction, regardless of isozyme. A large proportion of the metabolites in each set were identified within the top two predicted rank-positions: 1A2(82.3%), 2A6(85.7%), 2B6(76.8%), 2C19(86.2%), 2C8(83.8%), 2C9(84.1%), 2D6(83.7%), 2E1(80.7%), 3A4(82.1%), MERGED(86.0%). These models may be quickly applied (~30s per substrate per model) to any set of user-supplied substrates through RS-WebPredictor.
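The "top two rank-positions" figures quoted above are a top-k hit rate. A minimal sketch of such a metric follows; it assumes a substrate counts as a hit when any experimentally observed site of metabolism appears among its k highest-ranked sites, which may differ from the exact counting used for the published numbers. The function name and the atom labels are hypothetical.

```python
def top_k_hit_rate(predictions, k=2):
    """Fraction of substrates with an observed site among the top-k ranked sites.

    predictions: list of (ranked_sites, observed_sites) pairs, where
    ranked_sites is ordered from most to least likely.
    """
    if not predictions:
        raise ValueError("need at least one substrate")
    hits = sum(
        1 for ranked, observed in predictions
        if set(ranked[:k]) & set(observed)
    )
    return hits / len(predictions)

# Toy example with made-up atom labels.
toy = [
    (["C3", "C7", "N1"], {"C7"}),        # hit: C7 is ranked second
    (["C1", "C2", "C5"], {"C5"}),        # miss: C5 is ranked third
    (["O2", "C4"], {"C4", "C9"}),        # hit: C4 is ranked second
]
print(top_k_hit_rate(toy, k=2))
```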
This web-based application automates Rank Order Entropy (ROE) analysis using Kendall Tau.
The ROE process examines an input dataset and gives the user a recommendation on the
stability of models built from the dataset, along with an evaluation of the dataset's ability to produce a predictive model.
The procedure divides the dataset into training and testing sets, tests those sets for their ability to be modeled, models the data,
and finally examines the data in terms of Rank Order (as Kendall Tau) and Rank Order Entropy (as Kendall Tau over several truncations of the data).
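Kendall Tau, the rank statistic named above, can be sketched as follows. This is the simple tau-a variant (ties count as neither concordant nor discordant), and `tau_over_truncations` assumes that a "truncation" keeps the first n samples in their stored order; the text does not define truncation, so that helper is only an illustrative guess, not the ROE procedure itself.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs

def tau_over_truncations(observed, predicted, sizes):
    """Kendall tau on the first n samples for each n in sizes (assumed scheme)."""
    return [kendall_tau(observed[:n], predicted[:n]) for n in sizes]

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```

A tau near 1 means the model preserves the experimental rank order; values that swing widely across truncations would suggest an unstable ranking.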
The RECCR Online Modeling System (ROMS) is a general web-based machine
learning system. By using the available learning methods, users can
generate a model and visualize its performance by uploading their data
set through the web client. The three learning methods provided are Partial
Least Squares (PLS), Kernel-PLS and Support Vector Machine (SVM). In
addition to basic modeling functionality, cross validation methods such
as Leave-One-Out (LOO) and Monte Carlo Cross Validation (MCCV) are
provided for model parameter selection.
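Monte Carlo Cross Validation differs from fold-based schemes in that each repeat draws a fresh random train/test split, so test sets from different repeats may overlap. A minimal sketch, with a hypothetical function name and default values chosen only for illustration (ROMS's own settings are not given here):

```python
import random

def monte_carlo_cv_splits(n_samples, n_repeats, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random holdout splits."""
    n_test = max(1, round(n_samples * test_fraction))
    if n_test >= n_samples:
        raise ValueError("test_fraction leaves no training samples")
    rng = random.Random(seed)  # seeded so the splits are reproducible
    for _ in range(n_repeats):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])

for train, test in monte_carlo_cv_splits(10, n_repeats=3, test_fraction=0.3):
    print(test)
```

The number of repeats is independent of the dataset size, which makes MCCV cheaper than LOO on large datasets while still averaging over many splits.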
Multiple Instance Ranking (MIRank) is a novel machine learning model that enables ranking to be performed in a multiple instance learning setting. The motivation for MIRank stems from the hydrogen abstraction problem in computational chemistry: predicting the group of hydrogen atoms from which a hydrogen is abstracted (removed) during metabolism. The model predicts the preferred hydrogen group within a molecule by ranking the groups, with the ambiguity of not knowing which hydrogen atom within the preferred group is actually abstracted. The paper formulates MIRank in its general context and proposes an algorithm for solving MIRank problems using successive linear programming. The method outperforms multiple instance classification models on several real and synthetic datasets. This website freely distributes the datasets and source code used in this first study.
Data Mining Template Library (DMTL) supports the mining of
increasingly complex and informative pattern types, in structured and
unstructured datasets, including Itemsets, Sequences, Trees and Graphs
(See Fig. 1). DMTL is a C++ library consisting of highly efficient
algorithms and data structures, utilizing a generic data mining
approach, where all aspects of mining are controlled via a set of
properties. Another novel feature of DMTL is that it provides
transparent persistency and indexing support for effective computation
over massive datasets. We have successfully mined datasets in the
60-100GB range using a desktop PC!
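Itemsets are the simplest of the pattern types listed above. As an illustration of what frequent itemset mining computes, here is a small level-wise (Apriori-style) search in Python. It is not DMTL's API or its algorithms, which are C++ and far more efficient; the function name and the in-memory approach are assumptions made for this sketch.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level 1 candidates
    result = {}
    while current:
        frequent = {c: n for c in current if (n := support(c)) >= min_support}
        result.update(frequent)
        if not frequent:
            break
        k = len(next(iter(frequent))) + 1
        # Join step: merge pairs of frequent itemsets that differ by one item.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
for itemset, count in frequent_itemsets(baskets, min_support=2).items():
    print(sorted(itemset), count)
```

The prune step relies on the fact that no itemset can be frequent unless all of its subsets are, which is what keeps the candidate space manageable.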
DMTL has been publicly released as open-source software on the
world-wide SourceForge site, and
it has already been downloaded by over 2000 researchers from all over
DNA & RNA
RECON is an algorithm for the rapid reconstruction of molecular
charge densities and charge density-based electronic properties of
molecules, using atomic charge density fragments pre-computed from ab
initio wavefunctions. These are known as Transferable Atom Equivalents,
or "TAEs". The method is based on Bader's quantum theory of Atoms in Molecules.
PEST Shape/Property hybrid descriptor technology, developed in DDASSL,
allows better representation of the kinds of intermolecular interactions
that are dependent on molecular shape. The inclusion of PEST descriptors
has been found to significantly improve QSPR models where intermolecular
interactions play an important role in the chemical effects being modeled.
PEST descriptors are generated using TAE molecular surface representations
to define property-encoded boundaries similar to the Zauhar "Shape
Signature" ray-tracing approach to shape/property convolution.
Web-based descriptor generator that provides a TAE-based
representation of the electronic properties of the major or minor grooves
of DNA. DIXEL represents electron density features such as electrostatic
potential (EP) and local average ionization potential (PIP) on the
accessible surfaces of the major or minor groove on a grid of rectangles
-- the "Dixel" coordinate system. These features can be displayed
graphically and/or employed as input to data mining algorithms.
The objective of the Mfold web server for nucleic acid folding and hybridization prediction is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software.
Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and 'energy dot plots', is available for the folding of single sequences.
A server for high-throughput comparison of surfaces of protein-ligand binding sites.
A version of the RECON/TAE program optimized for use with
proteins, allowing users to rapidly produce a set of descriptors that can
characterize protein behavior. Protein Recon is an algorithm for the
rapid reconstruction of molecular charge density-based electronic
properties of proteins, using peptide fragments precomputed from ab
initio wavefunctions. These properties can be displayed graphically
and/or employed as input to data mining algorithms.
WebPDB is a web-based workflow system that is flexible and
capable of semi-automatic protein structure cleaning activities. The
protein data may be provided by the user, but can also be directly
downloaded from the PDB archive as part of the automated workflow. In its
next generation, WebPDB will produce pH-sensitive protein surface
descriptors that take into account appropriate protonation states and
fractional protonation/deprotonation of basic and acidic side chain groups.
WebPDB prepares proteins for use in virtual screening and predictive modeling.
It removes gaps (through self-homology with FASTA information), heteroatoms
and ligands (for re-use).
Coupled with other modeling tools, WebPDB can be useful in probe
development and the interpretation of secondary screening results through
docking and scoring computations.
The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence-structure motifs and the HMMSTR model for local structure in proteins to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction.
Proteins of the same class often share a secondary structure
packing arrangement but differ in how the secondary structure
units are ordered in the sequence. We find that proteins that share
a common core also share local sequence-structure similarities, and
these can be exploited to align structures with different topologies.
In this study, segments from a library of local sequence-structure
alignments were assembled hierarchically, enforcing the compactness
and conserved inter-residue contacts but not sequential ordering.
Previous structure-based alignment methods often ignore sequence
similarity, local structural equivalence and compactness.
SCALI (Structural Core ALIgnment) can
efficiently find conserved packing arrangements, even if they are non-sequentially
ordered in space. SCALI alignments conserve remote
sequence similarity and contain fewer alignment errors. Clustering of
our pairwise non-sequential alignments shows that recurrent packing
arrangements exist in topologically different structures.
MASKER contacts & MASKER voids
A fast algorithm for computing the solvent accessible molecular surface area (SAS) using Boolean masks (Le Grand, S. M. & Merz, K. M. J. (1993). J. Comp. Chem. 14, 349-52.) has been modified to estimate the solvent excluded molecular surface area (SES), including contact, toroidal and reentrant surface
components. Numerical estimates of arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and
intersections between those arcs are used to estimate the reentrant surface area. The new method is compared to an exact
analytical method. Boolean molecular surface areas are continuous and pairwise differentiable, and should be useful for
molecular dynamics simulations, especially as the basis for an implicit solvent model.
MASKER contacts finds the surface area burial by residue in a protein while
MASKER voids finds the locations of empty cavities in proteins (or any molecule).
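The Boolean-mask idea behind the SAS calculation can be sketched as follows: each atom carries a fixed set of dots on its probe-expanded sphere, each neighboring atom contributes a mask of the dots it leaves exposed, and the masks are combined with a bitwise AND. This sketch computes the masks directly instead of using precomputed lookup tables, covers only the SAS (not the SES extension described above), and uses hypothetical function names, a golden-spiral dot layout and a default probe radius of 1.4 Å, all of which are assumptions.

```python
import math

def sphere_dots(n):
    """Roughly uniform unit vectors from a golden-spiral layout."""
    golden_angle = math.pi * (3 - math.sqrt(5))
    dots = []
    for i in range(n):
        z = 1 - 2 * (i + 0.5) / n
        r = math.sqrt(1 - z * z)
        dots.append((r * math.cos(golden_angle * i), r * math.sin(golden_angle * i), z))
    return dots

def exposure_mask(center, radius, other_center, other_radius, dots):
    """Integer bitmask: bit i is set when dot i lies outside the other sphere."""
    mask = 0
    for i, (dx, dy, dz) in enumerate(dots):
        point = (center[0] + radius * dx, center[1] + radius * dy, center[2] + radius * dz)
        if math.dist(point, other_center) >= other_radius:
            mask |= 1 << i
    return mask

def sas_areas(atoms, probe=1.4, n_dots=500):
    """Per-atom solvent accessible area; atoms is a list of ((x, y, z), radius)."""
    dots = sphere_dots(n_dots)
    all_exposed = (1 << n_dots) - 1
    areas = []
    for i, (ci, ri) in enumerate(atoms):
        expanded = ri + probe
        mask = all_exposed
        for j, (cj, rj) in enumerate(atoms):
            # Only spheres that can overlap need a mask.
            if j != i and math.dist(ci, cj) < expanded + rj + probe:
                mask &= exposure_mask(ci, expanded, cj, rj + probe, dots)
        exposed_fraction = bin(mask).count("1") / n_dots
        areas.append(exposed_fraction * 4 * math.pi * expanded ** 2)
    return areas

print(sas_areas([((0.0, 0.0, 0.0), 1.6), ((0.0, 0.0, 3.0), 1.6)]))
```

Because the masks are combined with a single AND per neighbor, the per-atom cost stays low, which is the property that makes the approach attractive inside simulation loops.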