Federated Big Data for resource aggregation and load balancing with DIRAC

BigDataDIRAC is a Big Data solution with a Distributed Infrastructure with Remote Agent Control (DIRAC) access point. Users can access multiple Big Data resources scattered across different geographical areas in the same way they access grid resources. This approach makes it possible to offer users not only grid and cloud resources but also Big Data resources from the same DIRAC environment. In this work, we describe a system that uses DIRAC to provide access to a federation of Big Data resources, including load balancing. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by the location of datasets stored in the Hadoop Distributed File System (HDFS) of the distributed Hadoop clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. Performance results demonstrate that BigDataDIRAC load balancing is able to aggregate resources efficiently.

keywords: Big Data federation, DIRAC, MapReduce, Hadoop, Cloud Computing