
Beginner Level Projects on Apache Spark

Apache Spark: Spark, an open-source framework for general data analytics on distributed computing clusters, was originally designed at the University of California, Berkeley, and later donated to the Apache Software Foundation. Spark's real-time data processing capability gives it a substantial lead over Hadoop's MapReduce. Spark is a multi-stage, RAM-capable compute framework with libraries for machine learning, interactive queries, and graph analytics. It can run on a Hadoop cluster with YARN, on Mesos, or in standalone mode, so comparing it directly to Hadoop is apples and oranges, really. An interesting point to note here is that Spark has no distributed filesystem of its own. For distributed storage, it has to use either HDFS or an alternative such as the MapR File System, Cassandra, OpenStack Swift, Amazon S3, or Kudu. Now that we have caught a glimpse of Hadoop and Spark, it's time to talk about the different types of data processing they perform. For beginner's level use cases in Spark ...

PROS & CONS OF APACHE SPARK

What are the pros and cons of Apache Spark? Shwati Kumar, part of the Apache Software Foundation, writes: Apache Spark is a next-gen Big Data tool, a general-purpose and lightning-fast cluster computing platform.

Pros of Spark:

- Spark is easy to program and doesn't require much hand coding, whereas MapReduce is not that easy in terms of programming and requires lots of hand coding.
- Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after each map or reduce action. (Spark does, however, need a lot of memory.)
- Spark is a general-purpose cluster computation engine with support for streaming, machine learning, and batch processing as well as an interactive mode, whereas Hadoop MapReduce supports only batch processing.
- Spark executes batch processing jobs about 10 to 100 times faster than Hadoop MapReduce.
- Spark uses a variety of abstractions such as RDD, DataFrame, Streaming, and GraphX, which makes Spark feature-rich, whereas...