Analysis Options and Usage
A Brief Tutorial

M. Krein, 2009



File Input:


Analysis requires two files: a Training File and a Testing File. Both files are comma-delimited, contain no quotes, and must be identically formatted.
The files should be formatted as follows:

Column Number        Function                Description
First column         Label                   Alphanumeric characters to describe the cases
Columns 2...(n-1)    Independent Variables   Numerics only; the descriptors that will model the response
Last column          Dependent Variable      Numerics only; the response
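
As an illustration of the layout above, a minimal Python sketch of a reader might look like the following. The function name read_dataset and the sample rows are hypothetical, and the sketch assumes the file has no header row, which this tutorial does not specify.

```python
import csv
import io

def read_dataset(text):
    """Parse comma-delimited text into (labels, X, y).

    Assumes no header row: first field is the label, the last field is
    the response, and everything in between is a numeric descriptor.
    """
    labels, X, y = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        labels.append(row[0].strip())
        X.append([float(v) for v in row[1:-1]])  # independent variables
        y.append(float(row[-1]))                 # dependent variable
    return labels, X, y

# Two made-up cases with two descriptors each.
sample = "mol1,0.5,1.2,3.4\nmol2,0.7,0.9,2.8\n"
labels, X, y = read_dataset(sample)
```

Because float() rejects non-numeric text, a malformed descriptor or response value raises a ValueError instead of being read silently.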



R2 Cutoff:

Descriptors are often not orthogonal; many are highly correlated with one another.
Removing one descriptor from a correlated pair can improve model interpretation and quality.
If a pair of descriptor columns is identified as intercorrelated beyond the specified threshold,
the second of the pair will be removed automatically.
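
The removal rule can be sketched roughly as below. This is only an illustration under stated assumptions: R2 is taken to be the squared Pearson correlation, each column is compared only against columns already kept, and the names r_squared and drop_correlated are hypothetical. The program's actual procedure may differ.

```python
from statistics import mean

def r_squared(a, b):
    """Squared Pearson correlation between two equal-length columns."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return sab * sab / (saa * sbb)

def drop_correlated(columns, cutoff):
    """Return indices of the columns kept.

    A column is dropped when its R2 with an earlier, kept column exceeds
    the cutoff, so the second member of each correlated pair is removed.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(r_squared(columns[i], col) <= cutoff for i in kept):
            kept.append(j)
    return kept

# Column 1 is almost a multiple of column 0; column 2 is only weakly related.
cols = [[1, 2, 3, 4], [2, 4, 6, 8.1], [4, 1, 3, 2]]
kept = drop_correlated(cols, cutoff=0.9)
```

Here column 1 is dropped and columns 0 and 2 are kept, since the R2 between columns 0 and 2 is well below the cutoff.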





Scaling:

To be modeled properly, descriptors usually must be scaled.
If your data is already scaled appropriately, select "No Scaling".
"Sigma Scaled" divides each descriptor by its standard deviation.
The default option, "Mean Centered / Sigma Scaled", is also known as Standardization.
"Median Centered / Mean Absolute Deviation Scaled" is appropriate if descriptors are highly non-Gaussian.
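
The four options can be sketched for a single descriptor column as follows. The method labels ("none", "sigma", "mean_sigma", "median_mad") and the function name scale_column are hypothetical. The sketch also assumes the sample standard deviation (n-1 denominator) and a mean absolute deviation measured around the median; the program may use different conventions.

```python
from statistics import mean, median, stdev

def scale_column(values, method="mean_sigma"):
    """Scale one descriptor column; constant columns are not handled."""
    if method == "none":          # "No Scaling"
        return list(values)
    if method == "sigma":         # "Sigma Scaled"
        s = stdev(values)
        return [v / s for v in values]
    if method == "mean_sigma":    # "Mean Centered / Sigma Scaled"
        m, s = mean(values), stdev(values)
        return [(v - m) / s for v in values]
    if method == "median_mad":    # "Median Centered / Mean Absolute Deviation Scaled"
        med = median(values)
        mad = mean([abs(v - med) for v in values])
        return [(v - med) / mad for v in values]
    raise ValueError(f"unknown scaling method: {method}")

# A skewed column: the value 10.0 pulls the mean away from the median.
column = [1.0, 2.0, 3.0, 4.0, 10.0]
z = scale_column(column, "mean_sigma")
r = scale_column(column, "median_mad")
```

With a skewed column like this one, the median-based option centers on the typical value (3.0) instead of the mean (4.0), which is the motivation for preferring it on non-Gaussian descriptors.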





Thresholding:

Thresholding / sigma removal of descriptors can help to control the effects of noisy descriptors.

After the data is standardized, descriptors containing values that exceed the user-specified threshold will be flagged as "noisy" descriptors.
(An example of this might be a sparsely populated binary descriptor.)




Depending on preference, one can choose to leave the features as they are (Do Nothing),
to clamp the offending values to the specified threshold (Cap Values),
or to remove the flagged descriptors entirely (Remove Descriptors).
The last option is useful when there are many descriptors and few cases, and feature selection based on "noisiness" is desired.
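
The three choices can be sketched as follows for columns that have already been standardized. The function names and action labels are hypothetical, and the sketch assumes the threshold is applied symmetrically to absolute values, which this tutorial does not state explicitly.

```python
def flag_noisy(columns, threshold):
    """Indices of columns containing any value with |value| > threshold."""
    return [j for j, col in enumerate(columns)
            if any(abs(v) > threshold for v in col)]

def cap_values(column, threshold):
    """Clamp every value into the range [-threshold, +threshold]."""
    return [max(-threshold, min(threshold, v)) for v in column]

def apply_threshold(columns, threshold, action="nothing"):
    """Apply one of the three choices to a list of standardized columns."""
    noisy = set(flag_noisy(columns, threshold))
    if action == "nothing":   # leave features as they are
        return [list(c) for c in columns]
    if action == "cap":       # clamp offending values to the threshold
        return [cap_values(c, threshold) if j in noisy else list(c)
                for j, c in enumerate(columns)]
    if action == "remove":    # drop flagged descriptors entirely
        return [list(c) for j, c in enumerate(columns) if j not in noisy]
    raise ValueError(f"unknown action: {action}")

# The second column has one extreme value, as a sparse descriptor might.
cols = [[-0.5, 0.2, 0.3], [-0.4, -0.4, 4.5]]
noisy = flag_noisy(cols, 3.0)
capped = apply_threshold(cols, 3.0, "cap")
reduced = apply_threshold(cols, 3.0, "remove")
```

Capping keeps the descriptor but limits the influence of its extreme value, while removal discards the whole column.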





K-means Clustering:

After analysis is performed, a 3D (PCA) visualization of the training data will be presented.
K-means clustering of the training data will also be performed, and the number of clusters to make can be selected here.



This can be used to gauge the population density of cases in descriptor space; a natural clustering of the data
suggests that a classification model would perform well.
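
For readers unfamiliar with K-means, a bare-bones sketch of the standard (Lloyd's) procedure is given below. It is not the program's implementation: the initialization (the first k points), the squared Euclidean distance, and the function name kmeans are all assumptions made for illustration, and the PCA visualization is not covered.

```python
def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns (assignments, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
    return assignments, centroids

# Two well-separated groups of made-up points in a 2D descriptor space.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1), (4.9, 5.1)]
assignments, centroids = kmeans(points, k=2)
```

When the data fall into well-separated groups like these, the cluster assignments recover the groups, which is the kind of natural clustering described above.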