Beginner Level Projects on Apache Spark

March 27, 2018

Apache Spark:

Spark, also an open-source framework for performing general data analytics on distributed computing cluster, was originally designed at the University of California, and later donated to the Apache Software Foundation. Spark’s real-time data processing capability provides it a substantial lead over Hadoop’s MapReduce.

Spark is a multi-stage RAM-capable compute framework with libraries for machine learning, interactive queries and graph analytics. It can run on a Hadoop cluster with YARN but also Mesos or in standalone mode. Apples and oranges, really. An interesting point to note here is that Spark is devoid of its own distributed filesystem. So, for distributed storage, it has to either use HDFS or other alternatives, such as MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, etc.

Now that we have caught a glimpse of Hadoop and Spark, it’s time to talk about different types of data processing they perform.

For beginner’s level use cases in Spark , refer the below links:

Link 1: HealthCare Use Case With Apache Spark

Link 2: Introduction to Spark RDD and Basic Operations in RDD

Link 3: Analyzing New York Crime Data Using SparkSQL

Link 4: Spark Use Case – Travel Data Analysis

Link 5: Spark Use Case – Uber Data Analysis

Link 6: Spark Use Case – Analyzing MovieLens Dataset

Link 7: Spark Use Case – Social Media Analysis