Wednesday, September 8, 2010

SEEDMiner

SeedMiner – the Scalable Data Mining Framework

Data mining can be defined as the semi-automatic discovery of previously unknown, useful patterns in large data sets. With the emergence of databases capable of handling terabytes of data, data mining has grown into a separate field that is widely used in business, science, engineering, and market surveys to make predictions.

Though many data mining applications are available nowadays, they are either targeted at a very specific data set or application domain and scale well, or generalized to multiple application domains but do not scale.

Our project aims to implement a data mining framework that enables practitioners to build data mining solutions easily and scale them up according to their requirements, while preserving the efficiency and performance of their applications.

Process View of SeedMiner

The process architecture takes into account non-functional requirements such as performance and availability. It addresses concurrency and distribution, system integrity, and fault tolerance. It also shows how the main abstractions from the logical view fit within the process architecture, that is, on which thread of control an operation on an object is actually executed.

Regarding the process view of the system, I can introduce three main processes of our framework. They are introduced as layers in the design.

The first is the Data Feeding Layer (DFL). It is an interface that provides a rich set of methods for feeding data from an external source into the internal storage. This layer holds the data feeders provided by the framework and allows new data feeders to be added. The job of a data feeder is to read data from an external source, tailor it accordingly, and feed it to the storage. Different data feeders can be attached to the DFL to support a multitude of data sources.
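To make the feeder idea concrete, here is a minimal sketch of how attaching feeders to the DFL could look. The names `DataFeeder`, `CsvFeeder`, `DataFeedingLayer`, `attach` and `feed` are hypothetical and only illustrate the idea; they are not the actual SeedMiner API.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataFeeder(ABC):
    """Hypothetical feeder contract: yield one dict per source record."""

    @abstractmethod
    def records(self):
        ...


class CsvFeeder(DataFeeder):
    """Example feeder that reads records from CSV text."""

    def __init__(self, text):
        self._text = text

    def records(self):
        # DictReader uses the first line as the column names.
        yield from csv.DictReader(io.StringIO(self._text))


class DataFeedingLayer:
    """Collects records from every attached feeder into column lists."""

    def __init__(self):
        self._feeders = []
        self.storage = {}  # column name -> list of values (assumed layout)

    def attach(self, feeder):
        self._feeders.append(feeder)

    def feed(self):
        for feeder in self._feeders:
            for record in feeder.records():
                for column, value in record.items():
                    self.storage.setdefault(column, []).append(value)


dfl = DataFeedingLayer()
dfl.attach(CsvFeeder("age,city\n31,Colombo\n45,Kandy\n"))
dfl.feed()
```

Supporting a new data source would then only mean writing another `DataFeeder` subclass; the layer itself stays unchanged.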

The data feeding process has a large impact on the framework's performance. The objective of having a data feeding layer in our framework can be described as follows. The raw data inside a database is laid out horizontally, which means the values of each record are stored together, row by row. If we need the value of a single column, we have to read the whole row and extract the data element we need. This is a serious drawback in data mining, because we have to read entire rows every time we need to extract some data, and that degrades the performance of the framework considerably. What we suggest here is extracting the data from the database first and arranging it in a vertical structure. As the name implies, this structure holds the data column by column. So if we need a data item, we can access it directly without reading values we do not need. This will improve the performance of the framework considerably.
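The difference between the two layouts can be shown with a small sketch. The table contents and the helper `to_columns` are made up for illustration and say nothing about how SeedMiner actually stores data.

```python
# Horizontal layout: each tuple is one complete row of the table.
names = ("id", "item", "price")
rows = [
    (1, "bread", 2.50),
    (2, "milk", 1.20),
    (3, "bread", 2.50),
]


def to_columns(names, rows):
    """Rearrange row-wise data into one list per column (vertical layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = to_columns(names, rows)

# Vertical layout: a single column can be read without touching the others.
prices = columns["price"]
```

With the horizontal layout, reading all prices means visiting every row; after the conversion it is a single lookup.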

Because data feeding can be done column by column, the process is independent for each column, so we can achieve some concurrency here. Having multiple threads inside the data feeding layer will increase the performance of the system.
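A rough sketch of per-column concurrency, using Python's standard thread pool, could look like this. The functions `load_column` and `feed_columns` are hypothetical names. Note that in CPython threads mainly help when loading a column waits on I/O, such as a database or a file, since the global interpreter lock limits the gain for purely CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def load_column(rows, index):
    """Build one column; it depends on no other column."""
    return [row[index] for row in rows]


def feed_columns(names, rows, workers=4):
    """Load every column as a separate task on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(load_column, rows, i)
            for i, name in enumerate(names)
        }
        # result() blocks until the task for that column has finished.
        return {name: future.result() for name, future in futures.items()}


table = [(1, "bread"), (2, "milk"), (3, "eggs")]
fed = feed_columns(("id", "item"), table)
```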

As I mentioned earlier, the data feeding layer will be implemented so that different kinds of data feeders can be attached to the DFL, which lets it connect to a multitude of data sources. This will improve the scalability of the framework.

The next very important part of our framework is the algorithm layer, where the actual mining happens. In this process the vertically structured data is subjected to different operations. The bit-sliced data (dynamic bit-set values) is fed into the algorithmic processes, and operations such as AND, OR, and NOT are applied as appropriate.
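As an illustration of what working on bit slices means, the sketch below slices a small integer column and finds the rows equal to a given value using only AND and NOT. It uses plain Python integers as bit-sets, which is an assumption made for brevity; `bit_slices` and `rows_equal_to` are hypothetical helpers, not SeedMiner functions.

```python
def bit_slices(values, width):
    """Slice k has bit r set when bit k of values[r] is 1."""
    slices = [0] * width
    for row, value in enumerate(values):
        for k in range(width):
            if (value >> k) & 1:
                slices[k] |= 1 << row
    return slices


def rows_equal_to(slices, n_rows, target):
    """Return a bit-set of the rows whose value equals target."""
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for k, bits in enumerate(slices):
        if (target >> k) & 1:
            result &= bits               # bit k must be 1
        else:
            result &= ~bits & all_rows   # NOT, masked to the real rows
    return result


values = [5, 3, 5, 6]
slices = bit_slices(values, width=3)
mask = rows_equal_to(slices, len(values), 5)
matching = [r for r in range(len(values)) if (mask >> r) & 1]

# OR combines two results: rows equal to 5 or to 6.
either = mask | rows_equal_to(slices, len(values), 6)
```

The point is that the query never looks at individual values again; it only combines whole slices.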

This is a process where concurrency can be applied at a higher level. Each bit slice is combined with the adjacent slices, and the operation is independent for each bit position in the data set. So we can assign several threads to the process, split the bit slices into parts, and perform the operations on the parts in parallel. This will enhance the performance of the framework.
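Splitting a bit-slice operation into parts could be sketched as follows. Here a bit-set is assumed to be a list of machine words, and `and_chunk` and `parallel_and` are hypothetical names. In CPython this shows the partitioning rather than a real speed-up, because pure-Python bit operations do not run in parallel across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def and_chunk(a, b, start, stop):
    """AND the words of two bit-sets within [start, stop)."""
    return [x & y for x, y in zip(a[start:stop], b[start:stop])]


def parallel_and(a, b, parts=4):
    """AND two equal-length bit-sets by splitting them into parts."""
    if len(a) != len(b):
        raise ValueError("bit-sets must have the same number of words")
    n = len(a)
    step = max(1, -(-n // parts))  # ceiling division, at least one word
    bounds = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # map() keeps the chunks in their original order.
        chunks = pool.map(lambda se: and_chunk(a, b, se[0], se[1]), bounds)
        return [word for chunk in chunks for word in chunk]


left = [0b1100, 0b1010, 0xFF, 0, 7]
right = [0b1010, 0b0110, 0x0F, 1, 5]
combined = parallel_and(left, right, parts=2)
```

Each part touches a disjoint range of words, so the parts need no locking.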

The SeedMiner framework is mainly targeted at scalability. Algorithm developers should be able to develop their own algorithms on the framework and scale up based on their requirements. To achieve this, the data storage layer and the algorithm layer are handled as separate modules in the framework. As a result, SeedMiner offers a common API to the application layer, which application developers can use to build their own solutions without worrying about the underlying data structures. The interface for application development will follow a standard specification, so developers can easily plug in algorithms that are built to that standard.
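The separation between storage and algorithms might be sketched like this. Everything here is hypothetical: `ColumnStore`, `MiningAlgorithm` and `SupportCount` are illustrative names, and counting how many rows contain a set of items is just one simple algorithm chosen as an example.

```python
from abc import ABC, abstractmethod


class ColumnStore:
    """Hypothetical storage facade: hands out one bit-set per item."""

    def __init__(self, bitsets, n_rows):
        self._bitsets = dict(bitsets)
        self.n_rows = n_rows

    def bitset(self, item):
        return self._bitsets[item]


class MiningAlgorithm(ABC):
    """Hypothetical contract every pluggable algorithm implements."""

    @abstractmethod
    def run(self, store):
        ...


class SupportCount(MiningAlgorithm):
    """Counts the rows that contain all of the given items."""

    def __init__(self, items):
        self.items = list(items)

    def run(self, store):
        mask = (1 << store.n_rows) - 1  # start with every row selected
        for item in self.items:
            mask &= store.bitset(item)
        return bin(mask).count("1")


store = ColumnStore({"bread": 0b1011, "milk": 0b0110, "eggs": 0b1110}, n_rows=4)
bread_and_milk = SupportCount(["bread", "milk"]).run(store)
bread_and_eggs = SupportCount(["bread", "eggs"]).run(store)
```

The algorithm only talks to `ColumnStore`, so the storage layout can change without touching any algorithm.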