Big Data and Large-Scale Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures

One of the significant shifts in next-generation computing technologies will certainly be the development of Big Data (BD) platforms. This trend has produced a wide range of revolutionary, state-of-the-art enhancements within data science. Apache Hadoop, the BD landmark, has evolved into a widely deployed large-scale data operating system. In principle, Hadoop is designed to utilize clusters of commodity machines. Over the years, the model has been ported to different types of architectures and paradigms. Newer features, such as federation configuration, give Hadoop the maturity to serve different markets. This dissertation focuses on the architectural elements of BD processing frameworks, and in particular on their scalability aspects, to improve performance and security. We propose a hybrid BD execution environment (called EME) that continually spawns containerized DataNodes for the benefit of BD applications. A dynamic provisioning scheduler (called OPERA) is presented to take advantage of underutilized on-premise and cloud resources. The results demonstrate that OPERA has considerable potential: compared with a native Hadoop cluster, it decreases execution time by up to 74% for CPU-bound jobs (as in the PiEst benchmark) and by up to 26% for HDFS-bound jobs (as in the wordcount benchmark). The privacy and security of BD architectures, a significant concern among practitioners, are also discussed. A BD federation single-sign-on authentication module and a novel access broker framework are introduced. Experimental results demonstrate the efficiency of the proposed access broker, which has only a 1% impact on Hadoop performance compared with a non-secured deployment. Finally, a modern, secure case study of data streaming from edge nodes to the cloud in vehicular clouds is presented to validate the thesis findings.

keywords: Big Data, Cloud Computing, Internet of Things, Security and Privacy, Access Control, Resource Management, Docker Containers, Parallel and Distributed Computing