Mining Complex Patterns
Co-Investigator: Mohammed Zaki
Associate Professor, Department of Computer Science, Rensselaer Polytechnic Institute
Understanding and making effective use of large-scale data is becoming essential in cheminformatics applications, as well as in other fields. Key research questions are how to mine patterns and knowledge from complex datasets, how to generate actionable hypotheses, and how to provide confidence guarantees on the mined results. There are also critical issues related to the management and retrieval of massive datasets: data mining over large (perhaps multiple) datasets can take a prohibitive amount of time because of the computational complexity and disk I/O cost of the algorithms.

We are currently developing an extensible, high-performance generic pattern mining toolkit (GPMT). Pattern mining is a powerful paradigm that encompasses an entire class of data mining tasks, namely those that extract informative and useful patterns from massive datasets representing complex interactions between diverse entities from a variety of sources. These interactions may also span multiple scales, as well as spatial and temporal dimensions. Our goal is to provide a systematic solution to this whole class of common pattern mining tasks in massive, diverse, and complex datasets, rather than to focus on a specific problem. The prototype large-scale GPMT (Zaki et al., 2005) is designed to be: i) extensible and modular, for ease of use and customization to the needs of analysts; and ii) scalable and high-performance, for rapid response on massive datasets. The extensible GPMT system will be able to seamlessly access file systems, databases, or data archives.
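
To make the idea of a generic, extensible miner more concrete, the following is a minimal Python sketch of a level-wise frequent pattern search in which the pattern type is supplied as a plug-in. It is an illustration only and does not reflect the actual GPMT design or API: the names mine_frequent and itemset_plugin, the three-function plug-in interface (seeds, extend, contains), and the toy data are all assumptions made for this example. The search assumes that support is anti-monotone, i.e. an extension of a pattern is never more frequent than the pattern itself.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def mine_frequent(
    database: List,
    min_support: int,
    seeds: Iterable[Hashable],
    extend: Callable[[Hashable], Iterable[Hashable]],
    contains: Callable[[object, Hashable], bool],
) -> Dict[Hashable, int]:
    """Level-wise search for all patterns with support >= min_support.

    The miner knows nothing about the pattern type. It only uses the
    hypothetical plug-in functions `extend` (grow a pattern by one step)
    and `contains` (does a record contain the pattern?).
    """
    frequent: Dict[Hashable, int] = {}
    level = set(seeds)
    while level:
        # Count the support of every candidate at the current level.
        counts = {p: sum(1 for t in database if contains(t, p)) for p in level}
        survivors = {p: c for p, c in counts.items() if c >= min_support}
        frequent.update(survivors)
        # Only frequent patterns are extended (anti-monotone pruning).
        level = {q for p in survivors for q in extend(p)}
    return frequent


def itemset_plugin(items: Iterable[str]):
    """Plug-in for itemset patterns, represented as frozensets of items."""
    ordered = sorted(items)
    seeds = [frozenset([i]) for i in ordered]

    def extend(pattern):
        # Add only items larger than the current maximum, so that each
        # itemset is generated exactly once.
        top = max(pattern)
        return (pattern | {i} for i in ordered if i > top)

    def contains(record, pattern):
        return pattern <= record

    return seeds, extend, contains


if __name__ == "__main__":
    # Toy records: sets of element symbols (illustrative data only).
    db = [{"C", "N", "O"}, {"C", "N"}, {"C", "O"}, {"N", "O"}, {"C", "N", "O"}]
    seeds, extend, contains = itemset_plugin({"C", "N", "O"})
    for pattern, support in mine_frequent(db, 3, seeds, extend, contains).items():
        print(sorted(pattern), support)
```

In this sketch, supporting another pattern type (sequences, trees, graphs) would mean writing a different plug-in while leaving the search loop unchanged. A production toolkit would also need efficient support counting and out-of-core data access, which are outside the scope of this example.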