As data volumes grow rapidly, most processing needs to be distributed across several
machines. However, data processing on distributed machines brings difficulties such as
parallelism, scalability, machine failures, and large data sizes. To face these challenges, much work has been carried out
in the Big data domain. As a result, an extensive list of processing models and their co-existent technologies has been
proposed for distributed cluster computing. However, there is a lack of comprehensive and comparative studies for
evaluating and choosing among the many options available. Because many distributed computing technologies have
emerged recently, it can also be difficult to understand how these technologies are related. We briefly
discuss the underlying architecture of distributed computing for a better understanding of the relations among various
Big data technologies. This work also surveys the features, strengths, and limitations of existing processing models by
comparing them. Most existing data processing models are specific to a particular application domain. In
this work, we also advocate a generic processing model that works for a variety of workloads as well as new
application domains. The key characteristics mentioned here can guide an educated
decision about the right combination of distributed processing technologies.
Published In: IJCSN Journal, Volume 6, Issue 6
Date of Publication: December 2017
Pages: 728-745
Figures: 01
Tables: --
Harish Mamilla is a research assistant at the
Indian Institute of Technology Delhi. He received his bachelor's
degree in computer science from Jawaharlal Nehru
Technological University, Kakinada, India. He is currently
working as an Information Systems Officer at IndianOil
Corporation Limited, India's largest enterprise. His research
interests include distributed fault-tolerant computing and Big data
platforms.
Sai Pradeep is a research assistant at the
Indian Institute of Technology Delhi. He received his bachelor's
degree in computer science from M. S. Ramaiah Institute of
Technology, Bangalore, India. His research interests include
distributed computing and cluster resource management.
Suresh C Gupta is a visiting faculty member in the
Department of Computer Science and Engineering at the
Indian Institute of Technology Delhi. He worked as a Scientist in
the Computer Group at the Tata Institute of Fundamental Research
and at NCSDCT (now C-DAC Mumbai). Until recently, he worked
as Deputy Director General, Scientist-G, at the National
Informatics Centre, Government of India. His research
interests include Software Engineering, Databases, Cloud
Computing, and Software Defined Storage and Networks.
Keywords: Distributed Computing, Big data, Data processing models, Hadoop, MapReduce, Spark, Flink
In this work, we analyzed the complete distributed
data processing technology stack, along with its
ecosystem, by organizing it into a layered architecture.
We discussed recent projects in each of these layers
and highlighted some design principles. Major
approaches to distributed processing, such as batch,
iterative batch, and real-time streams, were described,
and related insights were presented and discussed. In
this survey, we primarily focused on a
comprehensive review of the data processing engines that
are currently available. In addition, we examined
the components of a generic processing engine. The
intention of this survey is neither to endorse nor to
criticize any particular project, but rather to
compare and explore them briefly. Models should be
chosen purely by finding the best balance
among the computational requirements. Most of these
projects have found ways to co-exist, complementing
each other to create a unique open source environment
for innovative product development in the Big data
distributed computing domain.
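To make the batch approach discussed above concrete, the following is a minimal, single-process sketch of the MapReduce programming model (map, shuffle, reduce) using word count as the classic example. It illustrates the model only, not the distributed Hadoop implementation; all function names here are illustrative, not part of any framework API.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs for each word in the input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the grouped counts for each word.
    return (key, sum(values))

records = ["Spark and Flink", "Hadoop and Spark"]
pairs = [p for r in records for p in map_phase(r)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'spark': 2, 'and': 2, 'flink': 1, 'hadoop': 1}
```

In a real cluster, the map and reduce phases run in parallel across machines and the shuffle moves data over the network; the iterative-batch and streaming engines surveyed here generalize this same pipeline to loops and unbounded inputs.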