Taghi M. Khoshgoftaar, PI; Student: Victor Herrera
From bioinformatics to social computing to document mining, entire research areas exist today that were not possible even 20 years ago. This research demands large-scale systems that can both manage and process huge quantities of data. Many traditional approaches fail when dealing with multi-gigabyte datasets, preventing researchers and practitioners from fully benefiting from the data.
The High-Performance Computing Cluster (HPCC) architecture, which was developed in conjunction with the ECL programming language, is LexisNexis’s answer to this challenge. The system has two essential components for working with Big Data: HPCC, a cluster back end that stores and manages large quantities of data and makes it accessible to the user in a timely manner, and ECL, the language in which the user queries that data. One area where the HPCC platform is not yet fully mature, however, is machine learning (ML). Although HPCC includes some basic ML modules, many of the most commonly used approaches in the field have yet to be implemented. The objective of this project is to extend ECL/HPCC to perform classification and regression using a wider range of ML algorithms. We will also implement our own algorithms in ECL to make them available to a wider user base. With these additions, the HPCC/ECL platform will be better prepared to take on the challenges posed by Big Data and to support research at a new scale.