Rensselaer Exploratory Center for Cheminformatics Research (RECCR)

Rank Order Entropy Analysis Tutorial

Data file requirements

The data file must be either comma-separated or tab-separated. Each line must be a list of values separated by one of those two delimiters, and the chosen delimiter must be used consistently throughout the file. Both delimiters should not appear in the same file, and no other whitespace should appear in the data file.

The data file should contain multiple lines of information, where each line corresponds to a molecule. Each value within the line should be a descriptor (a numerical representation of chemical, biological, or structural data), and the line should end with a numerical activity value.
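As a minimal sketch of this layout, the snippet below parses a small comma-separated example with Python's standard csv module. The molecule names, descriptor values, and header labels are all hypothetical; switch the delimiter to a tab for tab-separated files.

```python
import csv
import io

# Hypothetical example of the expected layout: a header row, a name
# column, descriptor columns, and a final activity column.
raw = """Name,desc1,desc2,desc3,activity
mol_A,0.12,3.4,1.0,6.2
mol_B,0.98,2.1,0.0,5.1
"""

reader = csv.reader(io.StringIO(raw))  # use delimiter="\t" for tab-separated files
header = next(reader)
names, descriptors, activities = [], [], []
for row in reader:
    names.append(row[0])                               # left column: molecule name
    descriptors.append([float(v) for v in row[1:-1]])  # middle columns: descriptors
    activities.append(float(row[-1]))                  # last column: activity value
```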

What the options mean:

Data contains header Does your top line contain labels for each descriptor or activity? In other words, does the top line of your data file look something like this: Names,desc1,desc2,desc3,...descN,activity where activity can be any of a variety of numeric values, including IC50, pIC50, half-life, TD50, etc.

Data contains names Does your left column contain labels for each entry? What's being asked here is whether the left column is not a specific molecular descriptor but instead a unique identifier for the molecule whose data is represented in the descriptors.

Do not optimize using cousin threshold (shortens run time) Cousin descriptors are descriptors that are highly correlated with each other. If this box remains unchecked, correlated (cousin) descriptors can be removed; otherwise, all descriptors are used. Checking the box turns off a time-consuming parameter optimization step. A threshold can be specified to decide which highly correlated descriptors to remove: for example, once the correlations between descriptors are calculated, a threshold of 85% can be set, meaning that one of each pair of descriptors with 85% or higher correlation will be discarded. The theory behind this is that redundant descriptors just add noise.
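The cousin-removal idea can be sketched as follows. This is an illustrative implementation, not the RECCR server's code: it drops one descriptor from each highly correlated pair, keeping the one that correlates better with the activity (the tie-break described under the PLS cousin threshold below).

```python
import numpy as np

def drop_cousins(X, y, threshold=0.85):
    """Drop one of each pair of 'cousin' descriptors whose absolute
    pairwise correlation meets the threshold; keep the descriptor that
    correlates better with the activity y.  A sketch, not RECCR's code."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_desc = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)  # descriptor-descriptor correlations
    act_corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_desc)]
    keep = set(range(n_desc))
    for i in range(n_desc):
        for j in range(i + 1, n_desc):
            if i in keep and j in keep and abs(corr[i, j]) >= threshold:
                # discard whichever of the pair tracks the activity worse
                keep.discard(i if act_corr[i] < act_corr[j] else j)
    return sorted(keep)
```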

Verbose output (default is non-verbose with status bar) Verbose output gives specific details of progress, so more information appears on the final status page than when this option is unchecked.

Number of random dataset splits Normally, the data is split after being sorted by activity value (the rightmost column of the data file). Once the sorting is complete, the even-ranked molecules become the training set and the odd-ranked molecules become the test set. When the number of random dataset splits is nonzero, new training/test splits are created using a random number generator. These splits provide a broader examination of the behavior of the data, in terms of both its lowest and highest potential performance.
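The default even/odd split can be sketched in a few lines. This is an assumption about the exact rank convention (here, rank positions 0, 2, 4, ... are treated as "even"); the server may number ranks from 1.

```python
import numpy as np

def even_odd_split(y):
    """Sort molecules by activity, then assign alternating ranks to the
    training and test sets.  A sketch of the default (non-random) split."""
    order = np.argsort(y)    # molecule indices, lowest to highest activity
    train_idx = order[0::2]  # "even"-ranked molecules -> training set
    test_idx = order[1::2]   # "odd"-ranked molecules  -> test set
    return train_idx, test_idx
```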

Do not optimize parameters; instead, use user-input parameters. This reduces runtime if you already know the optimal parameters for your model. Recommended parameters are pre-selected. This option enables proficient modelers to use specific parameters that they have found to be optimal, and beginning modelers to use the default parameters, which are chosen based on frequent use within the research group.

Number of latent variables for PLS modeling Each latent variable represents a group (a weighted combination) of descriptors within a model. The greater the number of latent variables in a model, the more flexible the model is, and the harder it is to explain in terms of the original descriptors. PLS is Partial Least-Squares Regression, a linear modeling method.

Cousin threshold for PLS modeling The cousin threshold is the cutoff on the maximum percent correlation allowed between descriptors. For each pair of "cousin" descriptors, the descriptor with the higher correlation to the activity is kept and the other is discarded within this method.

Number of latent variables for KPLS modeling KPLS, or Kernel Partial Least-Squares Regression, is a non-linear modeling method included for comparison with PLS.

Sigma value for KPLS modeling The sigma value in KPLS modeling is the number of standard deviations from the mean within which descriptor values must fall. If a descriptor has values outside the sigma limit, that descriptor is discarded.
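The sigma rule as described above can be sketched as a column filter. This is an illustrative reading of the description, not the server's exact rule.

```python
import numpy as np

def sigma_filter(X, sigma=4.0):
    """Keep only descriptors (columns) whose every value lies within
    `sigma` standard deviations of that descriptor's mean.  A sketch of
    the rule as described; the server's implementation may differ."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    z = np.abs(X - mean) / np.where(std == 0, 1.0, std)  # z-score per value
    keep = np.where((z <= sigma).all(axis=0))[0]
    return keep
```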

Cousin threshold for KPLS modeling This is the same as in PLS modeling.

Recommended options for the new modeler:

The options recommended for the new modeler are: check Do not optimize using cousin threshold, Verbose output, and Do not optimize parameters, and set Number of random dataset splits to 5.

The Output, Explained.

If you immediately click through "see your results" you will see minimal output. Don't fret; the page self-refreshes every ten seconds to give you newly calculated information.

The Numbers

The first output restates the parameters that were input for modeling.

The first thing the code does is create a model using all of the data as both the training set and the test set, to see whether there is any signal, using r2 values as the deciding factor. Minimum y-scrambling data is also presented. Y-scrambling involves taking the test set and scrambling the order of the activities without scrambling the order of the molecular data, then creating a model and using r2 to measure performance. Ten such y-scrambled models are built, and the minimum difference between the highest r2 of the y-scrambled models and the r2 value of the unscrambled model is reported for both the PLS and KPLS models. The original hope was that the greater this minimum difference, the better the performance would be.

The program then assembles parameters for the even/odd split modeling. The r2 values given in the next image are the training r2 values, that is, how well the line created by PLS or KPLS fits the data. The minimal y-scrambling difference is reported here but not used as a classifying metric, because its behavior did not correlate with any metric utilized.

This is the test set r2 ("unknown") behavior of the even/odd split modeling. A "good" r2 value for this would be 0.6 or above.

The final analysis of the dataset includes the maximum, minimum, and average Kendall Tau values. Remember that there are six datapoints involved in the creation of these values. The maximum variation is the difference between the minimum and maximum values of Kendall Tau. Shannon entropy is a consistency metric; the more consistent the behavior (the closer the values of Kendall Tau are to each other), the smaller the Shannon entropy. Zero is the optimal value here.

This next piece of output explains the meaning of the grades/recommendations given by the metric.

This link within the output is where to get the images (and the updated statuses).

The link takes you to a new page with more links and downloading options.

The Images

Two images are given to the user at the end of the process, if it completes.

Here's the one labeled "TRAIN VS R2". This is the training set r2 value plotted against the number of molecules remaining in the test set. As the number of molecules decreases (from right to left), the r2 values increase. This is a common behavior seen across many datasets. The red line is PLS data and the blue line is KPLS data.

Here's the figure labeled "TEST VS R2". This is the test set r2 value plotted against the number of molecules remaining in the test set. Across most datasets, when this graph is observed from right to left, there is a constant portion of behavior in r2 (and Kendall Tau), and then the values drop off suddenly. The red line is PLS data and the blue is KPLS data.


Copyright ©2005 Rensselaer Polytechnic Institute
All rights reserved worldwide.