PROS & CONS OF APACHE SPARK
Apache Spark is a next-generation Big Data tool: a general-purpose, lightning-fast cluster computing platform.
Pros of Spark:

- Spark is easy to program and does not require much hand coding, whereas MapReduce is harder to program and requires a lot of hand coding.
- Apache Spark processes data in memory, while Hadoop MapReduce persists intermediate results back to disk after each map or reduce step. The trade-off is that Spark needs a lot of memory.
- Spark is a general-purpose cluster computing engine that supports streaming, machine learning, batch processing, and an interactive mode, whereas Hadoop MapReduce supports only batch processing.
- Spark executes batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
- Spark offers a variety of abstractions such as RDDs, DataFrames, Streaming, and GraphX, which make it feature rich, whereas MapReduce offers no comparable abstractions.
- Spark achieves lower latency by caching partial/complete results across distributed nodes, whereas MapReduce is completely disk-based.
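The caching point above can be illustrated with a minimal pure-Python sketch (not the Spark API): recomputing a derived dataset on every access versus computing it once and reusing it, which is what caching an RDD avoids on a real cluster.

```python
# Conceptual sketch (plain Python, not the Spark API): caching a derived
# dataset so repeated actions reuse it instead of recomputing from scratch.

calls = {"n": 0}

def expensive_transform(data):
    """Stand-in for a chain of RDD transformations."""
    calls["n"] += 1
    return [x * x for x in data]

# Without caching: every "action" recomputes the whole lineage.
data = list(range(5))
r1 = expensive_transform(data)
r2 = expensive_transform(data)
assert calls["n"] == 2

# With caching (the effect of rdd.cache() on a cluster): compute once, reuse.
calls["n"] = 0
cached = expensive_transform(data)   # materialized in memory
r3 = cached
r4 = cached
assert calls["n"] == 1
```

The sketch is single-process, so it only models the recomputation cost; on a cluster, the cached results additionally stay distributed across node memory.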
Cons of Spark:

a. No Support for Real-time Processing
In Spark Streaming, the arriving live data stream is divided into batches of a pre-defined interval, and each batch is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed with operations like map, reduce, and join, and the results are returned in batches. Thus it is not true real-time processing; Spark performs near-real-time, micro-batch processing of live data.
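A minimal pure-Python sketch (not the Spark Streaming API) of the micro-batch idea described above: arriving records are grouped into fixed-size batches, and each batch is processed as a whole, so per-record latency is bounded below by the batch interval.

```python
# Conceptual sketch (plain Python): micro-batching a "live" stream.
# Records are only processed once a batch is complete, which is why
# Spark Streaming is near-real-time rather than real-time.

def micro_batches(stream, batch_size):
    """Group an incoming stream into fixed-size batches (a stand-in
    for Spark's fixed time interval)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # each batch becomes one RDD in Spark
            batch = []
    if batch:
        yield batch              # final partial batch

# Process each batch with a map + reduce, as Spark would per RDD.
stream = range(1, 10)            # records 1..9
results = [sum(x * 2 for x in b) for b in micro_batches(stream, 3)]
```

Each element of `results` is the reduced output of one micro-batch; no result is available until its whole batch has arrived.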
b. Problem with Small Files
If we use Spark with Hadoop, we run into the small-files problem. HDFS is designed for a limited number of large files rather than a large number of small files. Another place where Spark lags behind is when data is stored gzipped in S3. This pattern works well except when there are many small gzipped files: Spark must fetch each file over the network and uncompress it, and a gzipped file can only be uncompressed on a single core. A large span of time is therefore spent burning one core unzipping files in sequence.
In the resulting RDD, each file becomes a partition, so there are a large number of tiny partitions within the RDD. For efficient processing, the RDD must be repartitioned into a manageable shape, and this requires extensive shuffling over the network.
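The effect can be sketched in plain Python (not the Spark API): many single-file partitions are merged into a few larger ones, which on a real cluster is what repartitioning does, at the cost of a network shuffle.

```python
# Conceptual sketch: one tiny partition per input file, then coalescing
# into a manageable number of partitions (the effect of repartitioning
# in Spark, which shuffles data across the network).

files = [[i] for i in range(1000)]          # 1000 tiny one-record "files"
num_partitions = len(files)                 # 1000 partitions: far too many

def repartition(partitions, target):
    """Round-robin records into `target` partitions (shuffle stand-in)."""
    out = [[] for _ in range(target)]
    for i, record in enumerate(r for p in partitions for r in p):
        out[i % target].append(record)
    return out

merged = repartition(files, 8)              # 8 manageable partitions
```

On a cluster, each round-robin assignment corresponds to records moving between nodes, which is why the fix for tiny partitions is itself expensive.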
c. No File Management System
Apache Spark does not have its own file management system, so it relies on another platform such as Hadoop (HDFS) or a cloud-based store, which is one of Spark's known issues.
d. Expensive
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data: keeping data in memory is expensive, memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires a lot of RAM to run in memory, so the cost of a Spark cluster is quite high.
e. Fewer Algorithms
Spark MLlib lags behind in the number of available algorithms; for example, Tanimoto distance is not provided.
f. Manual Optimization
Spark jobs must be optimized manually and tuned to specific datasets. To get partitioning and caching right in Spark, they must be controlled by hand.
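As an illustration of what "manual" means here, a common hand-tuning step is to size the partition count from the data volume. The helper below is a hypothetical plain-Python sketch, not a Spark built-in, and the 128 MB target is an assumed rule of thumb.

```python
# Hypothetical helper illustrating manual tuning: pick a partition count
# from the dataset size instead of relying on a default. Spark does not
# ship this function; the 128 MB target is an assumed rule of thumb.

def choose_num_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Aim for roughly one target-sized partition per chunk of input."""
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

n = choose_num_partitions(10 * 1024**3)   # e.g. 10 GB of input
```

The developer would then pass such a number to the job's partitioning and caching calls explicitly; nothing in Spark derives it for them.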
g. Iterative Processing
In Spark, data is iterated over in batches, and each iteration is scheduled and executed separately.
h. Latency
Apache Spark has higher latency as compared to Apache Flink.
i. Window Criteria
Spark does not support record-based window criteria; it only has time-based window criteria.
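The distinction can be sketched in plain Python (not the Spark API): a time-based window groups records by timestamp interval, which is what Spark supports, while a record-based window groups every N records regardless of time, which it lacks.

```python
# Conceptual sketch: time-based windows (what Spark Streaming offers)
# vs. record-count-based windows (what it lacks).

records = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (3.1, "d")]  # (timestamp, value)

def time_windows(recs, width):
    """Group records into windows [0, width), [width, 2*width), ..."""
    out = {}
    for ts, v in recs:
        out.setdefault(int(ts // width), []).append(v)
    return [out[k] for k in sorted(out)]

def count_windows(recs, n):
    """Group every n records, regardless of arrival time."""
    return [[v for _, v in recs[i:i + n]] for i in range(0, len(recs), n)]

tw = time_windows(records, 2.0)   # windows of 2 time units
cw = count_windows(records, 2)    # windows of 2 records each
```

With bursty traffic the two groupings diverge: here the first time window holds three records while a count window never holds more than two.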
j. Back Pressure Handling
Back pressure is the build-up of data at an input/output buffer when the buffer is full and cannot receive additional incoming data; no data is transferred until the buffer is emptied. Apache Spark is not capable of handling back pressure implicitly; it has to be handled manually.
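A plain-Python sketch of the buffer behaviour described above: a bounded buffer rejects records once full, and the producer has to back off and retry on its own, which is what handling back pressure manually amounts to.

```python
# Conceptual sketch: a bounded buffer between a producer and a consumer.
# When it is full, incoming records are rejected; without implicit
# back-pressure handling, the producer must throttle itself manually.

from collections import deque

class BoundedBuffer:
    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity

    def offer(self, record):
        """Return True if accepted, False if the buffer is full."""
        if len(self.q) >= self.capacity:
            return False          # back pressure: caller must retry later
        self.q.append(record)
        return True

    def drain(self):
        """Consumer empties the buffer, making room for new records."""
        drained = list(self.q)
        self.q.clear()
        return drained

buf = BoundedBuffer(capacity=3)
accepted = [buf.offer(i) for i in range(5)]   # only the first 3 fit
```

A system with implicit back-pressure handling would instead slow the producer down automatically when `offer` starts failing.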