Analysis Options and Usage
A Brief Tutorial
M. Krein, 2009
Analysis requires two files, a Training File and a Testing File. Both files are
comma delimited, without quotes, and should be identically formatted.
The files should use the following format:
|| Alphanumeric characters to describe the cases
|| Numerics only; the descriptors that will model the response
|| Dependent Variable
|| Numerics only; the response
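As an illustration of the layout described above, here is a minimal Python sketch that splits one comma-delimited file into case labels, descriptors, and responses. The column order (label first, response last), the absence of a header row, and the function name read_cases are assumptions made for this example; they are not confirmed by the tool itself.

```python
import csv
import io

def read_cases(text):
    """Split comma-delimited text into labels, descriptor rows, and responses.

    Assumed layout (hypothetical): one case per row, no header row,
    case label in the first column, response in the last column.
    """
    labels, descriptors, responses = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        labels.append(row[0].strip())
        descriptors.append([float(v) for v in row[1:-1]])  # numerics only
        responses.append(float(row[-1]))                   # numerics only
    return labels, descriptors, responses

# Two made-up cases with two descriptors each.
sample = "mol1,0.5,1.2,3.4\nmol2,0.7,0.9,2.8\n"
labels, X, y = read_cases(sample)
```

A non-numeric entry in a descriptor or response column raises ValueError here, which is one simple way to catch a file that does not follow the format.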
Descriptors are often not orthogonal; frequently they are highly
correlated with one another.
Removing one descriptor from a correlated pair can improve model
interpretation and quality.
If a pair of descriptor columns is identified as being
intercorrelated beyond the specified threshold,
the second of the pair will be removed automatically.
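The removal rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code: it assumes the correlation measure is the absolute Pearson coefficient, and that each column is compared only against columns already kept. The names pearson and drop_correlated are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Return indices of the columns to keep.

    Columns are scanned left to right; a column is dropped when its absolute
    correlation with an already-kept column exceeds the threshold, so the
    second member of each correlated pair is the one removed.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(columns[i], col)) <= threshold for i in kept):
            kept.append(j)
    return kept

cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],  # nearly a multiple of column 0
    [4.0, 1.0, 3.0, 2.0],  # correlation with column 0 is -0.4
]
kept = drop_correlated(cols, threshold=0.9)  # kept == [0, 2]
```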
To be modeled properly, descriptors usually must be scaled.
If your data is already scaled appropriately, select "No Scaling".
"Sigma Scaled" divides each
descriptor by its standard deviation.
The default option, "Mean Centered /
Sigma Scaled" is also known as Standardization.
"Median Centered / Mean Absolute
Deviation Scaled" is appropriate if descriptors are highly
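The three scaling options named above might look like this in Python. This is a sketch under stated assumptions: it uses the sample standard deviation (n - 1 denominator), and it takes "Mean Absolute Deviation" to be the mean of absolute deviations from the median. The tool may define either quantity differently, and the function names are hypothetical.

```python
import statistics

def _scale(col, center, spread):
    # Subtract the center and divide by the spread. A zero spread (constant
    # column) is mapped to all zeros instead of raising ZeroDivisionError.
    if spread == 0:
        return [0.0 for _ in col]
    return [(v - center) / spread for v in col]

def sigma_scaled(col):
    """Divide by the standard deviation, without centering."""
    return _scale(col, 0.0, statistics.stdev(col))

def mean_centered_sigma_scaled(col):
    """Standardization: subtract the mean, divide by the standard deviation."""
    return _scale(col, statistics.mean(col), statistics.stdev(col))

def median_centered_mad_scaled(col):
    """Subtract the median, divide by the mean absolute deviation about it."""
    med = statistics.median(col)
    mad = statistics.mean([abs(v - med) for v in col])
    return _scale(col, med, mad)

col = [1.0, 2.0, 3.0]
sigma_only = sigma_scaled(col)                  # approximately [1, 2, 3]
standardized = mean_centered_sigma_scaled(col)  # approximately [-1, 0, 1]
robust = median_centered_mad_scaled(col)        # approximately [-1.5, 0, 1.5]
```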
Thresholding / sigma removal of descriptors can help to control
the effects of noisy descriptors.
After the data is standardized, descriptors that contain values
beyond the user-specified threshold will be flagged as "noisy".
(An example of this might be a sparsely populated binary descriptor.)
Depending on preference, one can choose to leave features as is (Do nothing),
Cap Values, or clamp values of the offending elements to the
or Remove Descriptors in the
case of many descriptors and few cases, where feature selection based
on "noisiness" is desired.
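The thresholding choices could be sketched as below, operating on already-standardized descriptor columns. Two points are assumptions of this example rather than documented behavior: a descriptor counts as noisy when any absolute value exceeds the threshold, and capping clamps a value to plus or minus that threshold. The function names and the default of 4.0 are hypothetical.

```python
def flag_noisy(columns, threshold=4.0):
    """Indices of columns holding any value beyond +/- threshold."""
    return [j for j, col in enumerate(columns)
            if any(abs(v) > threshold for v in col)]

def cap_values(columns, threshold=4.0):
    """Clamp every value into [-threshold, +threshold] (assumed cap)."""
    return [[max(-threshold, min(threshold, v)) for v in col]
            for col in columns]

def remove_descriptors(columns, threshold=4.0):
    """Drop every column flagged as noisy."""
    noisy = set(flag_noisy(columns, threshold))
    return [col for j, col in enumerate(columns) if j not in noisy]

# Two standardized descriptor columns; the second has one extreme value.
scaled = [
    [-0.5, 0.2, 0.3, 0.0],
    [-0.3, -0.3, -0.3, 5.2],
]
noisy = flag_noisy(scaled)            # [1]
capped = cap_values(scaled)           # 5.2 becomes 4.0
reduced = remove_descriptors(scaled)  # only the first column remains
```

Choosing "Do nothing" corresponds to skipping both cap_values and remove_descriptors.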
After analysis is performed, a 3D (PCA) visualization of the training
data will be presented.
K-means clustering of the training data will also be performed, and the
number of clusters to make can be selected here.
This can be used to gauge the population density of cases in
descriptor space; a natural clustering of the data
suggests that a classification model would perform well.
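For readers unfamiliar with K-means, here is a minimal sketch of the standard Lloyd iteration in Python. The tool's own implementation is not described here, so the details are assumptions: Euclidean distance, a fixed iteration limit, and initial centers taken from the first k cases (chosen only to keep the example deterministic). The function name kmeans is hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns (assignment, centers)."""
    centers = [list(p) for p in points[:k]]  # deterministic start (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each case to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the cases assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
    return assignment, centers

# Six made-up cases forming two well-separated groups in descriptor space.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0),
          (5.2, 4.9), (0.2, 0.1), (4.9, 5.1)]
assignment, centers = kmeans(points, k=2)  # assignment == [0, 0, 1, 1, 0, 1]
```

Tight, well-separated clusters like these are the kind of natural grouping the paragraph above refers to; results on real data depend on the initial centers and the chosen number of clusters.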