As data volumes grow rapidly, most processing needs to be distributed across several
machines. However, data processing on distributed machines brings difficulties such as
parallelism, scalability, machine failures, and large data sizes. To face these challenges, much work has been carried out
in the Big data domain. As a result, an extensive list of processing models and their co-existent technologies has been
proposed for distributed cluster computing. However, there is a lack of comprehensive and comparative studies for
evaluating and choosing among the many options available. Because many distributed computing technologies have
emerged recently, it can also be difficult to understand how these technologies are related. We briefly
discuss the underlying architecture of distributed computing for a better understanding of the relations among various
Big data technologies. This work also surveys the features, strengths, and limitations of existing processing models by
comparing them. Most existing data processing models are specific to a particular application domain. In
this work, we also advocate a generic processing model that works for a variety of workloads as well as new
application domains. The key characteristics mentioned here can guide an educated
decision about the right combination of distributed processing technologies.
Published In: IJCSN Journal, Volume 6, Issue 6
Date of Publication: December 2017
Pages: 728-745
Figures: 01
Tables: --
Harish Mamilla is a research assistant at the
Indian Institute of Technology Delhi. He received his bachelor's
degree in computer science from Jawaharlal Nehru
Technological University, Kakinada, India. He is currently
working as an Information Systems Officer at IndianOil
Corporation Limited, India's largest enterprise. His research
interests include distributed fault-tolerant computing and Big data
platforms.
Sai Pradeep is a research assistant at the
Indian Institute of Technology Delhi. He received his bachelor's
degree in computer science from M. S. Ramaiah Institute of
Technology, Bangalore, India. His research interests include
distributed computing and cluster resource management.
Suresh C Gupta is a visiting faculty member in the
Department of Computer Science and Engineering at the
Indian Institute of Technology Delhi. He worked as a Scientist in
the Computer Group at the Tata Institute of Fundamental Research
and at NCSDCT (now C-DAC Mumbai). Until recently, he worked
as Deputy Director General, Scientist-G, at the National
Informatics Centre, Government of India. His research
interests include Software Engineering, Databases, Cloud
Computing, and Software Defined Storage and Networks.
Keywords: Distributed Computing, Big data, Data processing models, Hadoop, MapReduce, Spark, Flink
In this work, we analyzed the complete distributed
data processing technology stack, along with its
ecosystem, by organizing it into a layered architecture.
We discussed recent projects in each of these layers
and highlighted some design principles. Major
approaches to distributed processing, such as batch,
iterative batch, and real-time streams, were described,
and related insights were presented and discussed. In
this survey, we primarily focused on a
comprehensive review of the data processing engines that
are currently available. In addition, we examined
the components of a generic processing engine. The
intention of this survey is neither to endorse nor to
criticize any particular project, but rather to
compare and explore them briefly. Models should be
chosen purely by finding the best balance
among the computational requirements. Most of these
projects have found ways to co-exist, complementing
each other to create a unique open source environment
for innovative product development in the Big data
distributed computing domain.
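To make the batch approach discussed above concrete, the following is a minimal, single-process sketch of the MapReduce programming model (map, shuffle, reduce) using word count as the classic example. It illustrates the model only, not the distributed Hadoop implementation; all function names here are illustrative, not part of any framework API.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs for each word in the input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the grouped counts for each word.
    return (key, sum(values))

records = ["Spark and Flink", "Hadoop and Spark"]
pairs = [p for r in records for p in map_phase(r)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'spark': 2, 'and': 2, 'flink': 1, 'hadoop': 1}
```

In a real cluster, the map and reduce phases run in parallel across machines and the shuffle moves data over the network; the iterative-batch and streaming engines surveyed here generalize this same pipeline to loops and unbounded inputs.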