From Gppd-wiki



Data-intensive applications like petroleum extraction simulations, weather forecasting, natural disaster prediction, and biomedical research have to process an increasing amount of data. Cloud computing has increasingly been used as a platform for running large business and data processing applications. Although clouds have become extremely popular, when it comes to data processing, their use incurs high costs. Conversely, Desktop Grids have been used in a wide range of projects, and are able to take advantage of the large number of resources provided by volunteers, free of charge.

In view of this, data-intensive applications call for new ways of running Big Data systems at an affordable cost. Merging cloud computing and desktop grids into a hybrid infrastructure can provide a feasible low-cost solution for big data analysis.

MapReduce is a programming framework that abstracts the complexity of parallel applications. Its management architecture is based on the master/worker model, while worker-to-worker data exchange requires a P2P model.
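The master/worker model can be illustrated with a minimal sketch. It is not taken from any of the projects on this page: the names `master` and `worker_task` are hypothetical, a local thread pool stands in for the worker machines, and the worker-to-worker (P2P) data exchange is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(chunk):
    """Work done by one worker: here, sum the numbers in its chunk."""
    return sum(chunk)


def master(data, num_workers=4, chunk_size=3):
    """Master: split the input, hand chunks to workers, combine the results."""
    # Split the input into fixed-size chunks; each chunk is one task.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Assumption: a thread pool plays the role of the worker machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, chunks))
    # The master combines the partial results returned by the workers.
    return sum(partial_results)


print(master(list(range(1, 11))))  # prints 55
```

In a real deployment the master would also track worker failures and reschedule tasks, which this sketch leaves out.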

The simplified computational model for data handling shields the programmer from the complexity of data distribution and management. That complexity is considerable, because large numbers of data sets are scattered across hundreds or thousands of machines and computing runtime must be kept low [6]. The MapReduce architecture consists of a master machine that manages the worker machines. Figure 1, adapted from [6], shows the data flow in MapReduce in three distinct phases. Big Data systems are composed of several frameworks and techniques for data processing, which can be grouped into batch and stream data processing, data mining, and large-scale machine learning.

Figure 1: MapReduce data flow model, adapted from [6].
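The data flow can be sketched in a few lines of code. The example below is a single-process word count, assuming the three phases are map, shuffle, and reduce, as MapReduce is commonly described; the function names are hypothetical and nothing here reflects the Hadoop API or the simulators listed on this page.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of input splits."""
    mapped = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(mapped))


splits = ["big data on clouds", "big data on desktop grids"]
print(word_count(splits))
```

In a distributed run, each call to `map_phase` would execute on a different worker, and the shuffle would move the grouped pairs across the network to the workers running the reduce step.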

Project       Beginning/End    Simplified Description
MRSG          2009/2010        A MapReduce simulator over SimGrid.
MRA++         2010/2012        A MapReduce simulator with algorithms adapted for heterogeneous environments.
BIGHybrid     2012/2015        A toolkit for MapReduce simulation in hybrid environments.
SMART         2014/present     A hybrid platform for Big Data.
SMART-Sent    2015/present     A hybrid platform for Big Data integrated with IoT.

[6] T. White, Hadoop: The Definitive Guide, 4th Edition, Vol. 1, O'Reilly Media, Inc., 2012.

Research Resources

We maintain a MapReduce cluster named Gradep, with both Cassandra and Hadoop installed. The cluster is composed of a variety of machines that provide a realistic heterogeneous environment for processing data-intensive applications. Materials are available at the link below, including a list of commands and tutorials on running Cassandra and Hadoop on Gradep.

Link: Gradep

