Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset

ABSTRACT:

Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, targeted marketing, and digital forensics. With the explosion of data in today’s big data era, a major trend for handling clustering over large-scale datasets is to outsource it to public cloud platforms, because cloud computing offers not only reliable services with performance guarantees but also savings on in-house IT infrastructure. However, since datasets used for clustering may contain sensitive information, e.g., patient health records, commercial data, and behavioral data, directly outsourcing them to public cloud servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to cloud servers. Our scheme allows cloud servers to perform clustering directly over encrypted datasets, while achieving computational complexity and accuracy comparable to clustering over unencrypted ones. We also investigate the secure integration of MapReduce into our scheme, which makes it well suited to the cloud computing environment. Thorough security analysis and numerical analysis demonstrate the security and efficiency of our scheme. Experimental evaluation over a dataset of 5 million objects further validates its practical performance.
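
The abstract does not detail the encrypted-domain operations, so the following is only a minimal plaintext sketch of the MapReduce-style K-means iteration such a scheme builds on: the "map" step assigns each point to its nearest centroid, and the "reduce" step recomputes centroids as cluster means. Class names, data, and parameters are illustrative, not taken from the paper.

import java.util.*;

public class KMeansSketch {

    // "Map" step: index of the nearest centroid for one point (squared Euclidean distance).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // "Reduce" step: recompute each centroid as the mean of the points assigned to it.
    static double[][] update(double[][] points, int[] assignment, int k, int dim) {
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignment[i]]++;
            for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
        }
        for (int c = 0; c < k; c++)
            if (counts[c] > 0)
                for (int j = 0; j < dim; j++) sums[c][j] /= counts[c];
        return sums;
    }

    public static void main(String[] args) {
        double[][] points = { {1.0, 1.0}, {1.2, 0.8}, {8.0, 8.0}, {7.5, 8.2} };
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        for (int iter = 0; iter < 10; iter++) {
            int[] assignment = new int[points.length];
            for (int i = 0; i < points.length; i++)
                assignment[i] = nearest(points[i], centroids);   // map phase analogue
            centroids = update(points, assignment, 2, 2);        // reduce phase analogue
        }
        System.out.println("final centroids: " + Arrays.deepToString(centroids));
    }
}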

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Jiawei Yuan and Yifan Tian, “Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset”, IEEE Transactions on Cloud Computing, 2019.

 

PISCES: Optimizing Multi-job Application Execution in MapReduce

ABSTRACT:

Nowadays, many MapReduce applications consist of groups of jobs with dependencies among each other, such as iterative machine learning applications and large database queries. Unfortunately, the MapReduce framework is not optimized for these multi-job applications: it does not exploit execution overlapping opportunities among jobs and can only schedule jobs independently. These issues significantly inflate application execution time. This paper presents PISCES (Pipeline Improvement Support with Critical chain Estimation Scheduling), a critical chain optimization (a critical chain is a series of jobs that will make the application run longer if any one of them is delayed), to provide better support for multi-job applications. PISCES extends the existing MapReduce framework to allow scheduling of multiple jobs with dependencies by dynamically building a job dependency DAG for currently running jobs according to their input and output directories. Using this dependency DAG, it provides an innovative mechanism to facilitate data pipelining between the output phase of an upstream job (the map phase in a map-only job or the reduce phase in a map-reduce job) and the map phase of a downstream job. This offers a new form of execution overlapping between dependent jobs in MapReduce, which effectively reduces application runtime. Moreover, PISCES proposes a novel critical chain job scheduling model based on accurate critical chain estimation. Experiments show that PISCES can increase the degree of system parallelism by up to 68% and improve the execution speed of applications by up to 52%.
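
As a small illustration of the DAG construction idea described above, the sketch below infers dependencies between jobs purely from their input and output directories: a job depends on whichever job produces one of its inputs. Job names, paths, and class names are made up for illustration and are not from the paper.

import java.util.*;

public class JobDagSketch {
    static class Job {
        final String name; final Set<String> inputs; final String output;
        Job(String name, Set<String> inputs, String output) {
            this.name = name; this.inputs = inputs; this.output = output;
        }
    }

    // Job B depends on job A when one of B's input directories is A's output directory.
    static Map<String, List<String>> buildDag(List<Job> jobs) {
        Map<String, String> producerOf = new HashMap<String, String>();
        for (Job j : jobs) producerOf.put(j.output, j.name);
        Map<String, List<String>> upstreamOf = new LinkedHashMap<String, List<String>>();
        for (Job j : jobs) {
            List<String> upstream = new ArrayList<String>();
            for (String in : j.inputs) {
                String producer = producerOf.get(in);
                if (producer != null) upstream.add(producer);
            }
            upstreamOf.put(j.name, upstream);
        }
        return upstreamOf;
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
            new Job("extract", new HashSet<String>(Arrays.asList("/raw/logs")), "/tmp/stage1"),
            new Job("aggregate", new HashSet<String>(Arrays.asList("/tmp/stage1")), "/tmp/stage2"),
            new Job("report", new HashSet<String>(Arrays.asList("/tmp/stage2", "/raw/dims")), "/out/report"));
        // Prints {extract=[], aggregate=[extract], report=[aggregate]}
        System.out.println(buildDag(jobs));
    }
}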

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Qi Chen, Jinyu Yao, Benchao Li, and Zhen Xiao, “PISCES: Optimizing Multi-job Application Execution in MapReduce”, IEEE Transactions on Cloud Computing, 2019.

 

On Scalable and Robust Truth Discovery in Big Data Social Media Sensing Applications

ABSTRACT:

Identifying trustworthy information in the presence of noisy data contributed by numerous unvetted sources from online social media (e.g., Twitter, Facebook, and Instagram) has been a crucial task in the era of big data. This task, referred to as truth discovery, aims to identify the reliability of sources and the truthfulness of the claims they make without knowing either a priori. In this work, we identify three important challenges that have not been well addressed in the current truth discovery literature. The first is “misinformation spread”, where a significant number of sources contribute false claims, making the identification of truthful claims difficult. For example, on Twitter, rumors, scams, and influence bots are common examples of sources colluding, intentionally or unintentionally, to spread misinformation and obscure the truth. The second challenge is “data sparsity”, or the “long-tail phenomenon”, where a majority of sources contribute only a small number of claims, providing insufficient evidence to determine those sources’ trustworthiness. For example, in the Twitter datasets that we collected during real-world events, more than 90% of sources contributed only a single claim. Third, many current solutions are not scalable to large-scale social sensing events because of the centralized nature of their truth discovery algorithms. In this paper, we develop a Scalable and Robust Truth Discovery (SRTD) scheme to address these three challenges. In particular, the SRTD scheme jointly quantifies both the reliability of sources and the credibility of claims using a principled approach. We further develop a distributed framework to implement the proposed truth discovery scheme using Work Queue in an HTCondor system. Evaluation results on three real-world datasets show that the SRTD scheme significantly outperforms state-of-the-art truth discovery methods in terms of both effectiveness and efficiency.
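
The abstract does not give SRTD's exact update rules, so the sketch below only shows the generic fixed-point skeleton that joint source-reliability / claim-credibility estimation methods share: claim credibility is aggregated from the reliabilities of supporting sources, and source reliability is aggregated from the credibilities of supported claims. The toy support matrix and normalization are illustrative assumptions.

import java.util.*;

public class TruthDiscoverySketch {

    public static void main(String[] args) {
        // support[s][c] = 1 if source s asserts claim c (toy data, not from the paper).
        int[][] support = {
            {1, 0, 1},
            {1, 0, 0},
            {0, 1, 0}
        };
        int nSources = support.length, nClaims = support[0].length;
        double[] reliability = new double[nSources];
        double[] credibility = new double[nClaims];
        Arrays.fill(reliability, 1.0);

        for (int iter = 0; iter < 20; iter++) {
            // Claim credibility: sum of the reliabilities of its supporting sources.
            for (int c = 0; c < nClaims; c++) {
                credibility[c] = 0;
                for (int s = 0; s < nSources; s++)
                    if (support[s][c] == 1) credibility[c] += reliability[s];
            }
            normalize(credibility);
            // Source reliability: sum of the credibilities of the claims it supports.
            for (int s = 0; s < nSources; s++) {
                reliability[s] = 0;
                for (int c = 0; c < nClaims; c++)
                    if (support[s][c] == 1) reliability[s] += credibility[c];
            }
            normalize(reliability);
        }
        System.out.println("claim credibility: " + Arrays.toString(credibility));
        System.out.println("source reliability: " + Arrays.toString(reliability));
    }

    // Scale a vector so its largest entry is 1, keeping the iteration bounded.
    static void normalize(double[] v) {
        double max = 0;
        for (double x : v) max = Math.max(max, x);
        if (max > 0) for (int i = 0; i < v.length; i++) v[i] /= max;
    }
}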

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Daniel (Yue) Zhang, Dong Wang, Nathan Vance, Yang Zhang, and Steven Mike, “On Scalable and Robust Truth Discovery in Big Data Social Media Sensing Applications”, IEEE Transactions on Big Data, 2019.

 

K-nearest Neighbors Search by Random Projection Forests

ABSTRACT:

K-nearest neighbors (kNN) search is an important problem in data mining and knowledge discovery. Inspired by the huge success of tree-based methodology and ensemble methods over the last decades, we propose a new method for kNN search, random projection forests (rpForests). rpForests finds nearest neighbors by combining multiple kNN-sensitive trees, each constructed recursively through a series of random projections. As demonstrated by experiments on a wide collection of real datasets, our method achieves remarkable accuracy, with a fast-decaying missing rate of kNNs and a fast-decaying discrepancy in the k-th nearest neighbor distances. As a tree-based method, rpForests has very low computational complexity. Its ensemble nature also makes it easy to parallelize on clustered or multicore computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights on rpForests by showing that the chance of neighboring points being separated by the ensemble of random projection trees decays exponentially as the ensemble size increases. Our theory can also be used to refine the choice of random projections in the growth of rpForests; experiments show that the effect is remarkable.
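
As a rough illustration of the random-projection-tree idea, the sketch below builds a small ensemble of trees, each splitting the data at the median of a random 1-D projection, collects the query's leaf from every tree as a candidate set, and ranks the candidates by exact distance. The tree count, leaf size, and data are arbitrary choices for the example, not values from the paper.

import java.util.*;

public class RpForestSketch {
    static final Random RNG = new Random(42);

    static class Node {
        double[] direction;
        double threshold;
        Node left, right;
        List<Integer> points;   // set only at leaf nodes
    }

    // Recursively split the data at the median of a random 1-D projection.
    static Node build(double[][] data, List<Integer> idx, int leafSize) {
        Node node = new Node();
        if (idx.size() <= leafSize) { node.points = idx; return node; }
        int dim = data[0].length;
        node.direction = new double[dim];
        for (int j = 0; j < dim; j++) node.direction[j] = RNG.nextGaussian();
        double[] proj = new double[idx.size()];
        for (int i = 0; i < idx.size(); i++) proj[i] = dot(data[idx.get(i)], node.direction);
        double[] sorted = proj.clone();
        Arrays.sort(sorted);
        node.threshold = sorted[sorted.length / 2];
        List<Integer> left = new ArrayList<Integer>(), right = new ArrayList<Integer>();
        for (int i = 0; i < idx.size(); i++)
            (proj[i] < node.threshold ? left : right).add(idx.get(i));
        if (left.isEmpty() || right.isEmpty()) { node.points = idx; return node; }
        node.left = build(data, left, leafSize);
        node.right = build(data, right, leafSize);
        return node;
    }

    // Follow the query down to its leaf and return the points stored there.
    static List<Integer> leafOf(Node node, double[] q) {
        if (node.points != null) return node.points;
        return dot(q, node.direction) < node.threshold ? leafOf(node.left, q) : leafOf(node.right, q);
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    public static void main(String[] args) {
        final double[][] data = new double[200][5];
        for (double[] row : data) for (int j = 0; j < 5; j++) row[j] = RNG.nextDouble();
        List<Integer> all = new ArrayList<Integer>();
        for (int i = 0; i < data.length; i++) all.add(i);

        final double[] query = {0.5, 0.5, 0.5, 0.5, 0.5};
        int trees = 10, k = 3;                       // illustrative ensemble size and k
        Set<Integer> candidates = new HashSet<Integer>();
        for (int t = 0; t < trees; t++)
            candidates.addAll(leafOf(build(data, all, 10), query));  // union of the query's leaves

        // Rank only the candidates by exact distance to the query.
        List<Integer> ranked = new ArrayList<Integer>(candidates);
        Collections.sort(ranked, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(dist(data[a], query), dist(data[b], query));
            }
        });
        System.out.println("approximate " + k + "-NN indices: " + ranked.subList(0, k));
    }
}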

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Donghui Yan, Yingjie Wang, Jin Wang, Honggang Wang, and Zhenpeng Li, “K-nearest Neighbors Search by Random Projection Forests”, IEEE Transactions on Big Data, 2019.

Hierarchical Density-Based Clustering using MapReduce

ABSTRACT:

Hierarchical density-based clustering is a powerful tool for exploratory data analysis, which can play an important role in the understanding and organization of datasets. However, its applicability to large datasets is limited because the computational complexity of hierarchical clustering methods has a quadratic lower bound in the number of objects to be clustered. MapReduce is a popular programming model to speed up data mining and machine learning algorithms operating on large, possibly distributed datasets. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelize Single-Linkage clustering using MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. As a preliminary step, we discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To be able to efficiently apply hierarchical density-based clustering to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.
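
For context on the density-based hierarchy being parallelized, the sketch below computes the two quantities HDBSCAN* is built on: the core distance of a point (distance to its k-th nearest neighbor) and the mutual reachability distance between two points, max(core(a), core(b), dist(a, b)). The MapReduce parallelization and data-bubble summarization described in the abstract are not reproduced here; the data and the value of k are illustrative.

import java.util.*;

public class MutualReachabilitySketch {

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0.1, 0}, {0.2, 0.1}, {5, 5}, {5.1, 5.2} };
        int k = 2;   // number of neighbors used for the core distance (illustrative value)
        int n = pts.length;

        // Pairwise Euclidean distances.
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                dist[i][j] = Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]);

        // Core distance: distance to the k-th nearest other point.
        double[] core = new double[n];
        for (int i = 0; i < n; i++) {
            double[] row = dist[i].clone();
            Arrays.sort(row);
            core[i] = row[k];   // row[0] is the distance to the point itself (zero)
        }

        // Mutual reachability distance between every pair of points.
        double[][] mreach = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                mreach[i][j] = Math.max(dist[i][j], Math.max(core[i], core[j]));

        System.out.println("core distances: " + Arrays.toString(core));
        System.out.println("mreach(0,3) = " + mreach[0][3]);
    }
}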

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Joelson A. dos Santos, Talat Iqbal Syed, Murilo C. Naldi, Ricardo J. G. B. Campello, and Jörg Sander, “Hierarchical Density-Based Clustering using MapReduce”, IEEE Transactions on Big Data, 2019.

Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique

ABSTRACT:

The increased use of cyber-enabled systems and the Internet of Things (IoT) has led to a massive amount of data with different structures. Most big data solutions are built on top of the Hadoop ecosystem or use its distributed file system (HDFS). However, studies have shown inefficiencies in such systems when dealing with today’s data. Some research has overcome these problems for specific types of graph data, but today’s data comprise more than one type. Such efficiency issues lead to large-scale problems, including greater space requirements in data centers and wasted resources (such as power consumption), which in turn lead to environmental problems such as higher carbon emissions. We propose a data-aware module for the Hadoop ecosystem. We also propose a distributed encoding technique for genetic algorithms. Our framework allows Hadoop to manage the distribution of data and its placement based on cluster analysis of the data itself. We are able to handle a broad range of data types as well as optimize query time and resource usage. We performed our experiments on multiple datasets generated via LUBM.
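
The abstract does not describe the distributed encoding itself, so the sketch below only conveys the general idea of evolving a data placement: a chromosome assigns each data block to a node, and a fitness function rewards co-locating blocks that a prior cluster analysis has grouped together. It uses a stripped-down (1+1) evolutionary loop rather than a full genetic algorithm with crossover, and all labels and sizes are made up.

import java.util.*;

public class PlacementGaSketch {
    static final Random RNG = new Random(7);
    static final int BLOCKS = 12, NODES = 3;
    // clusterOf[b] = cluster label of block b, as produced by some prior cluster analysis (toy labels).
    static final int[] clusterOf = {0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2};

    // Fitness: number of same-cluster block pairs that land on the same node.
    static int fitness(int[] placement) {
        int score = 0;
        for (int a = 0; a < BLOCKS; a++)
            for (int b = a + 1; b < BLOCKS; b++)
                if (clusterOf[a] == clusterOf[b] && placement[a] == placement[b]) score++;
        return score;
    }

    static int[] randomPlacement() {
        int[] p = new int[BLOCKS];
        for (int b = 0; b < BLOCKS; b++) p[b] = RNG.nextInt(NODES);
        return p;
    }

    static int[] mutate(int[] parent) {
        int[] child = parent.clone();
        child[RNG.nextInt(BLOCKS)] = RNG.nextInt(NODES);   // move one block to another node
        return child;
    }

    public static void main(String[] args) {
        int[] best = randomPlacement();
        for (int gen = 0; gen < 2000; gen++) {
            int[] candidate = mutate(best);
            if (fitness(candidate) >= fitness(best)) best = candidate;  // (1+1)-style selection
        }
        System.out.println("placement: " + Arrays.toString(best) + ", fitness = " + fitness(best));
    }
}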

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Mustafa Hajeer and Dipankar Dasgupta, “Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique”, IEEE Transactions on Big Data, 2019.

Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability

ABSTRACT:

As MapReduce is becoming ubiquitous in large-scale data analysis, many recent studies have shown that the performance of MapReduce could be improved by different job scheduling approaches, e.g., Fair Scheduler and Capacity Scheduler. However, most existing MapReduce job schedulers focus on the scenario in which the MapReduce cluster is stable and pay little attention to clusters with dynamic resource availability. In fact, MapReduce cluster resources may fluctuate, as a growing number of Hadoop clusters are deployed on hybrid systems, e.g., infrastructure powered by a mix of traditional and renewable energy, and cloud platforms hosting heterogeneous workloads. Thus, there is a growing need to provide predictable services to users who have strict requirements on job completion times in such dynamic environments. In this paper, we propose RDS, a Resource and Deadline-aware Hadoop job Scheduler that takes future resource availability into consideration when minimizing job deadline misses. We formulate the job scheduling problem as an online optimization problem and solve it using an efficient receding horizon control algorithm. To aid the control, we design a self-learning model to estimate job completion times. We further extend the design of the RDS scheduler to support flexible performance goals in various dynamic clusters; in particular, we use flexible deadline time bounds instead of a single fixed job completion deadline. We have implemented RDS in the open-source Hadoop implementation and performed evaluations with various benchmark workloads. Experimental results show that RDS substantially reduces the penalty of deadline misses, by at least 36% and 10% compared with Fair Scheduler and the Earliest Deadline First (EDF) scheduler, respectively. In a Hadoop cluster running partially on renewable energy, the experimental results show that the green-power-based resource prediction approach can further reduce the penalty of deadline misses by 16% compared to an Auto-Regressive Integrated Moving Average (ARIMA) prediction approach.
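
RDS itself solves an online optimization with receding horizon control, which the abstract does not spell out; the sketch below only illustrates the simpler bookkeeping involved, using the EDF baseline mentioned above: jobs are ordered by deadline and a lateness penalty is accumulated under an assumed constant capacity (RDS instead predicts time-varying resource availability). All job sizes and deadlines are invented.

import java.util.*;

public class DeadlineSketch {
    static class Job {
        final String name; final double work; final double deadline;
        Job(String name, double work, double deadline) {
            this.name = name; this.work = work; this.deadline = deadline;
        }
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<Job>(Arrays.asList(
            new Job("A", 40, 60), new Job("B", 30, 50), new Job("C", 50, 200)));

        // Earliest Deadline First ordering.
        Collections.sort(jobs, new Comparator<Job>() {
            public int compare(Job x, Job y) { return Double.compare(x.deadline, y.deadline); }
        });

        double capacity = 1.0;   // units of work finished per time unit, assumed constant here
        double clock = 0, penalty = 0;
        for (Job j : jobs) {
            clock += j.work / capacity;                  // completion time of job j
            if (clock > j.deadline)
                penalty += clock - j.deadline;           // simple lateness penalty
            System.out.println(j.name + " finishes at " + clock + " (deadline " + j.deadline + ")");
        }
        System.out.println("total deadline-miss penalty = " + penalty);
    }
}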

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Dazhao Cheng, Xiaobo Zhou, Yinggen Xu, Liu Liu, and Changjun Jiang, “Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability”, IEEE Transactions on Parallel and Distributed Systems, 2019.

 

Cross-cloud MapReduce for Big Data

ABSTRACT:

MapReduce plays a critical role as a leading framework for big data analytics. In this paper, we consider a geo-distributed cloud architecture that provides MapReduce services based on big data collected from end users all over the world. Existing work handles MapReduce jobs with a traditional computation-centric approach, in which all input data distributed across multiple clouds are aggregated to a virtual cluster that resides in a single cloud. Its poor efficiency and high cost for big data support motivate us to propose a novel data-centric architecture with three key techniques, namely, cross-cloud virtual cluster, data-centric job placement, and network-coding-based traffic routing. Our design leads to an optimization framework with the objective of minimizing both computation and transmission cost for running a set of MapReduce jobs in geo-distributed clouds. We further design a parallel algorithm by decomposing the original large-scale problem into several distributively solvable sub-problems that are coordinated by a high-level master problem. Finally, we conduct real-world experiments and extensive simulations to show that our proposal significantly outperforms existing works.
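
To make the cost trade-off concrete, the sketch below evaluates a single job against a few candidate clouds, charging each candidate its computation cost plus the cost of moving the job's distributed input data to it, and picks the cheapest. This is only a toy version of the data-centric placement idea; the paper's joint optimization (with cross-cloud virtual clusters and network-coded routing) is far richer, and all cloud names and numbers below are invented.

import java.util.*;

public class PlacementCostSketch {

    public static void main(String[] args) {
        String[] clouds = {"us-east", "eu-west", "ap-south"};
        double[] computeCost = {1.0, 1.2, 0.9};                 // cost per unit of work in each cloud
        double[][] transferCost = {                             // cost per GB moved from row to column
            {0.00, 0.05, 0.09},
            {0.05, 0.00, 0.08},
            {0.09, 0.08, 0.00}
        };
        double[] inputGb = {120, 30, 10};   // the job's input data, split across the three clouds
        double work = 50;                   // units of computation for the job

        int best = -1;
        double bestCost = Double.MAX_VALUE;
        for (int target = 0; target < clouds.length; target++) {
            double cost = work * computeCost[target];
            for (int src = 0; src < clouds.length; src++)
                cost += inputGb[src] * transferCost[src][target];   // aggregate remote data to the target
            System.out.println(clouds[target] + ": total cost " + cost);
            if (cost < bestCost) { bestCost = cost; best = target; }
        }
        System.out.println("data-centric choice: " + clouds[best]);
    }
}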

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Peng Li, Song Guo, Shui Yu, and Weihua Zhuang, “Cross-cloud MapReduce for Big Data”, IEEE Transactions on Cloud Computing, 2019.

A Survey on Geographically Distributed Big-Data Processing using MapReduce

ABSTRACT:

Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems share a major drawback: they are designed for locally distributed computation, which prevents them from supporting geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industry and academia to rethink current big-data processing systems. Novel frameworks, going beyond the state-of-the-art architectures and technologies of current systems, are expected to process geographically distributed data at their locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study geo-distributed frameworks, models, and algorithms for batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing, along with their overhead issues.

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS: 

  • System : i3 Processor
  • Hard Disk : 500 GB
  • Monitor : 15" LED
  • Input Devices : Keyboard, Mouse
  • RAM : 1 GB

SOFTWARE REQUIREMENTS: 

  • Operating System : Windows 7 / Ubuntu
  • Coding Language : Java 1.7, Hadoop 0.8.1
  • IDE : Eclipse
  • Database : MySQL

REFERENCE:

Shlomi Dolev, Patricia Florissi, Ehud Gudes, Shantanu Sharma, and Ido Singer, “A Survey on Geographically Distributed Big-Data Processing using MapReduce”, IEEE Transactions on Big Data, 2019.