Wide Area Analytics for Geographically Distributed Datacenters


Big data analytics, the process of organizing and analyzing data to extract useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, handling all of it in one datacenter becomes less efficient from a performance point of view. Large cloud service providers are therefore deploying datacenters around the world for better performance and availability. A widely used approach to analytics over geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters into a central datacenter. However, it has been observed that this approach consumes a significant amount of bandwidth, which degrades performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we present a survey of the representative mechanisms proposed in the literature for wide area analytics. We discuss basic ideas, present proposed architectures and mechanisms, and discuss several examples to illustrate existing work. We point out the limitations of these mechanisms, give comparisons, and conclude with our thoughts on future research directions.


  • Existing data-parallel frameworks are designed to process data within a single datacenter, where jobs typically run within the same cluster and the data to be processed is stored locally in the Hadoop Distributed File System (HDFS).
  • The problem of wide-area data analytics has been widely acknowledged in the recent literature, and a number of solutions have been proposed.
  • Analytics over geo-distributed datacenters in the wide area network has several aspects. Some mechanisms target batch processing, while others target stream processing. Bandwidth and latency are the two main optimization concerns in wide area analytics.


  • As the volume of data grows, storing all of it within a single datacenter is no longer feasible, and it naturally needs to be distributed across multiple datacenters.
  • Data movement constraints are not considered, and moving data between datacenters is sometimes slow.
  • The greedy approach is not optimal for general DAGs.
  • The simplifying assumptions hide the complicated situations that arise in real-world data analytics.


  • In this paper, we focus on several representative solutions in the literature in this research direction. Given the pressing need to process large volumes of data across multiple geo-distributed datacenters, these proposed solutions are exciting and highly relevant, and may soon be used in real-world data analytics applications.
  • WANalytics consists of two main components: a runtime layer and a workload analyzer. In the runtime layer, a coordinator in a master datacenter interacts with a datacenter manager at each datacenter. Each datacenter manager includes a caching mechanism. Analysts submit DAGs of queries, and the coordinator then asks the workload analyzer for the best distributed execution plan.
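
The component layout described above can be sketched in Java as follows. This is only an illustrative sketch, not the WANalytics implementation: all class and method names are hypothetical, a single query string stands in for a DAG of queries, and the cache is assumed to be a simple per-datacenter result cache because the text does not specify how the caching mechanism works.

```java
import java.util.HashMap;
import java.util.Map;

final class WanalyticsLayoutSketch {

    /** Per-datacenter manager holding a result cache (the cache policy is an assumption). */
    static final class DatacenterManager {
        final String site;
        private final Map<String, String> cache = new HashMap<String, String>();
        int executions = 0; // number of real executions, i.e. cache misses

        DatacenterManager(String site) {
            this.site = site;
        }

        String run(String query) {
            String cached = cache.get(query);
            if (cached != null) {
                return cached; // served from the local cache
            }
            executions++;
            String result = "result(" + query + ")@" + site; // stand-in for real work
            cache.put(query, result);
            return result;
        }
    }

    /** Stand-in for the workload analyzer: decides which site should run a query. */
    interface WorkloadAnalyzer {
        String siteFor(String query);
    }

    /** Coordinator in the master datacenter: asks the analyzer, then dispatches. */
    static final class Coordinator {
        private final Map<String, DatacenterManager> managers =
                new HashMap<String, DatacenterManager>();
        private final WorkloadAnalyzer analyzer;

        Coordinator(WorkloadAnalyzer analyzer) {
            this.analyzer = analyzer;
        }

        void register(DatacenterManager manager) {
            managers.put(manager.site, manager);
        }

        String submit(String query) {
            DatacenterManager manager = managers.get(analyzer.siteFor(query));
            if (manager == null) {
                throw new IllegalStateException("no manager for the planned site");
            }
            return manager.run(query);
        }
    }
}
```

In this sketch, submitting the same query twice reaches the same datacenter manager, and the second call is answered from its cache without a second execution.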


  • In this paper, we present a number of typical mechanisms for wide area analytics, discuss their high-level ideas, and compare them.
  • The workload analyzer first generates several candidate DAG execution plans for the workload, then measures their costs using pseudo-distributed measurements, then computes a new best plan through optimization, and finally deploys that plan.
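
The four steps listed above (generate candidate plans, measure their costs, compute the best plan, deploy it) can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the actual optimizer: the `Plan` and `CostMeter` types, the plan names, and the cost numbers are all hypothetical, and the cost is assumed to be a single number such as gigabytes sent across datacenters.

```java
import java.util.ArrayList;
import java.util.List;

final class PlanSelectionSketch {

    /** A candidate DAG execution plan together with its measured cost. */
    static final class Plan {
        final String name;
        final double cost; // assumed unit: GB sent across datacenters

        Plan(String name, double cost) {
            this.name = name;
            this.cost = cost;
        }
    }

    /** Stand-in for the pseudo-distributed measurement step. */
    interface CostMeter {
        double measure(String planName);
    }

    /** Step 2: attach a measured cost to every candidate plan name. */
    static List<Plan> measureAll(List<String> planNames, CostMeter meter) {
        List<Plan> measured = new ArrayList<Plan>();
        for (String name : planNames) {
            measured.add(new Plan(name, meter.measure(name)));
        }
        return measured;
    }

    /** Step 3: keep the current plan unless a candidate is strictly cheaper. */
    static Plan chooseNext(Plan current, List<Plan> measured) {
        Plan best = current;
        for (Plan p : measured) {
            if (p.cost < best.cost) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Step 1: candidate plans (hypothetical names).
        List<String> names = new ArrayList<String>();
        names.add("centralized");
        names.add("push-filters-to-sites");

        CostMeter meter = new CostMeter() {
            @Override
            public double measure(String planName) {
                // Made-up numbers, for illustration only.
                return planName.equals("centralized") ? 100.0 : 40.0;
            }
        };

        Plan current = new Plan("centralized", 100.0);
        Plan next = chooseNext(current, measureAll(names, meter));
        // Step 4: a real system would deploy the chosen plan here.
        System.out.println("deploy: " + next.name);
    }
}
```

Keeping the current plan on ties avoids redeploying when a candidate offers no measurable gain; that choice is a design assumption of this sketch.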






  • System: i3 Processor
  • Hard Disk: 500 GB
  • Monitor: 15-inch LED
  • Input Devices: Keyboard, Mouse
  • RAM:



  • Operating System: Windows 7/Ubuntu
  • Coding Language: Java 1.7, Hadoop 0.8.1
  • IDE: Eclipse
  • Database: MySQL


Siqi Ji and Baochun Li, “Wide Area Analytics for Geographically Distributed Datacenters”, IEEE 2016.