Multiple Instance Ranking (MIRank)

A novel machine learning paradigm that solves new problems

Overview, Software and Datasets

Context

Multiple instance ranking (MIRank) is a novel machine learning model that enables ranking to be performed in a multiple instance learning setting. The motivation for MIRank stems from the hydrogen abstraction problem in computational chemistry, that of predicting the group of hydrogen atoms from which a hydrogen is abstracted (removed) during metabolism. The model predicts the preferred hydrogen group within a molecule by ranking the groups, with the ambiguity of not knowing which hydrogen atom within the preferred group is actually abstracted. The paper formulates MIRank in its general context and proposes an algorithm for solving MIRank problems using successive linear programming. The method outperforms multiple instance classification models on several real and synthetic datasets. This website freely distributes the datasets and source codes used in this first study.

In this figure, the goal is to find the ranking function (orange arrow) so as to predict, for each box (red rectangle), an item (green star) belonging to the preferred bag (green ellipse). Others bags (blue ellipses) and items within (blue parallelograms) also appear.

Paper

The complete MIRank paper

Charles Bergeron, Jed Zaretzki, Curt Breneman, Kristin P. Bennett. Multiple Instance Ranking. Proceedings of the International Conference on Machine Learning. Helsinki, 5-9 July 2008.

can be found here in PDF format. The Multiple Instance Classification (MIC) algorithm, implemented for comparison, is that of Mangasarian & Wild (2008) described here .

Datasets

The CYP3A4 substrate dataset generated using the weighted Botzmann conformation can be found here in text format. The first column is the molecule (box) identification integer. The second column is the site of metabolism (bag) identification integer. The third column is the hydrogen atom (item) identification integer. The next 36 columns are the features (real numbers). The last column is the preference (1 indicates that the item belongs to a preferred bag, 0 indicates otherwise). Each row corresponds to a single item. All columns are space separated.

The census datasets are adapted from those found in the DELVE repository, using three different combinations of features. Their format follow the same structure. The files are found here, here and here.

Software

Prerequisites

This software runs under Matlab . I used version 7.0.0.19920 (R14) on an IBM laptop running Windows XP. The linear programs are solved using MOSEK . I used version 4.0. A free limited version is available, while the full package is available for free under a student license that must be renewed online every few months.

Outline

  1. The top-level function loads a dataset and processes the data.
  2. For each split, training, validation and testing subsets are randomly generated.
  3. For each attempted value of the tradeoff parameter nu, MIC and MIRank models are learned using the training subset. The validation subset determines which nu to retain and the testing subset determines the performance for that value of nu.
  4. Results are averaged over the splits and displayed.

Source codes

Version 1.0, made available here , is the initial and current release.

Example

The top-level function MultipleInstanceRankingDriver.m performs an N-fold cross-validation to choose tradeoff parameter nu and compare the performance of MIC and MIRank algorithms thereof.

The following call

>> dataset = 'boltzwei';
>> proportionValidation = 0.2;
>> proportionTesting = 0.2;
>> nSplits = 32;

>>
MultipleInstanceRankingDriver(dataset, proportionValidation, proportionTesting, nSplits)

loads the minimum energy conformation CYP3A4 substrate dataset, performs 32-fold cross-validation, assigning 60% of the boxes to the training set, 20% of the boxes to validation set and 20% of the boxes to the testing set. The following information is displayed upon termination

Results, allowing for 1 guess per box
MIC   :  mean: 45.9918%, std: 7.6134
MIRank:  mean: 53.125%, std: 6.9394
paired t-test p-value: 2.5468e-007s

Results, allowing for 2 guesses per box
MIC   :  mean: 67.2554%, std: 6.6476
MIRank:  mean: 70.856%, std: 6.4361
paired t-test p-value: 0.0041009
Execution time: 445.4052 minutes

In this example, all information is stored in file boltzwei.mat for later analysis.

Extensions

Follows are possible directions for future development of this software:

Contact

For all enquires, please contact Charles Bergeron at  b e r g e c   at   r p i   dot   e d u   or one of the other authors of the paper. Thanks!

This page was last updated on 23 June 2008.