Why is Spark 10x faster on Disk than Hadoop MapReduce?

Joydip Nath
2 min read · Jul 7, 2022

There are plenty of differences in the way MapReduce and Spark work, but we are going to look at a few of the important ones.

Let’s start.

  1. In Hadoop MapReduce (MR), data is written to disk in every phase (map, shuffle, reduce), and job output goes back to HDFS; these are very expensive operations. Spark, by contrast, evaluates lazily: it does not materialize anything until an action is applied to the data, and even an action like saveAsTextFile runs in parallel, with each executor writing its own partition of the output, which again makes Spark faster (see the first sketch after this list).
  2. Another difference is the architecture. Hadoop MR starts a new JVM for each task, which can take seconds once you account for loading JARs, JIT warm-up, parsing configuration XML, and so on, and that JVM is killed as soon as the job is done. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and handing a Runnable to a thread pool, which takes single-digit milliseconds.
  3. From a software perspective, Spark’s optimizer allows many workloads to avoid significant disk IO by pruning input data that is not needed for a given job (see the pruning sketch after this list). In Spark’s shuffle subsystem, serialization and hashing (which are CPU bound) have been shown to be the key bottlenecks, rather than the raw network throughput of the underlying hardware. All of these trends mean that Spark today is often constrained by CPU efficiency and memory pressure rather than IO, which is why it stays fast even when working from disk.
  4. Spark runs multi-threaded tasks inside long-lived JVM processes, whereas MapReduce runs each task as a separate, heavier-weight JVM process. This gives Spark faster startup, better parallelism, and better CPU utilization.
  5. Apache Spark is good for iterative algorithms. Hadoop MapReduce, by its nature, writes output to HDFS after every MapReduce cycle, and since those IO operations are very costly, it ends up much slower than Spark, which can keep the working set cached in memory between iterations (see the caching sketch after this list).
  6. One of the most significant reasons is Project Tungsten:
    In short, Project Tungsten optimizes Apache Spark’s execution engine to take advantage of modern compilers and CPUs’ ability to efficiently compile and execute simple for loops (as opposed to complex function call graphs).
    This focus on CPU efficiency is motivated by the fact that Spark workloads are increasingly bottlenecked by CPU and memory use rather than IO and network communication, a trend confirmed by recent research on the performance of big data workloads (see the code-generation sketch after this list).
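
A quick illustration of point 1. Below is a minimal Scala sketch of lazy evaluation, assuming a local session; the numbers and the output path are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LazyEvalDemo")
  .master("local[*]") // local mode, just for the sketch
  .getOrCreate()
val sc = spark.sparkContext

// Transformations only build up a lineage graph; nothing executes yet.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n) // not executed
val evens   = squares.filter(_ % 2 == 0)     // still not executed

// The action triggers the whole pipeline, and each executor writes its
// own partition (part-00000, part-00001, ...) of the output in parallel.
evens.saveAsTextFile("/tmp/spark-lazy-demo") // hypothetical output path
```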
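
For point 3, here is a sketch of the optimizer pruning IO, reusing the session above. The Parquet path and column names are hypothetical; the takeaway is what explain() reports about what is actually read from disk:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical Parquet dataset with columns (user, ts, url).
val events = spark.read.parquet("/data/events")

// Catalyst pushes the filter down to the Parquet reader and reads only
// the `ts` and `url` columns from disk, skipping everything else.
val clicks = events
  .filter(col("ts") > "2022-01-01")
  .select("url")

// `PushedFilters` and `ReadSchema` in the physical plan show the pruning.
clicks.explain()
```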
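
For point 5, a toy iterative loop: the cached RDD stays in executor memory across passes, whereas a chain of MapReduce jobs would round-trip through HDFS between every iteration. The algorithm here (halving the error against the mean) is just a stand-in for any iterative computation:

```scala
// Computed once, cached in memory, reused on every iteration.
val nums = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var guess = 0.0
for (_ <- 1 to 10) {
  // Each pass scans the cached in-memory partitions; no disk re-read.
  val err = nums.map(x => x - guess).mean()
  guess += err * 0.5 // move halfway toward the true mean
}
println(s"converged guess = $guess")
```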
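
And for point 6, you can see Tungsten’s whole-stage code generation directly in a query plan; operators prefixed with '*' run inside a single generated function that loops tightly over the data:

```scala
val counts = spark.range(1000000L)
  .selectExpr("id % 100 AS key")
  .groupBy("key")
  .count()

// Stages marked WholeStageCodegen (the '*' prefix) are fused into
// generated code instead of a chain of virtual function calls.
counts.explain()
```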

These are a few compiled points that make Spark faster than its competition, even for disk-heavy operations.

Let me know in the comments if you can think of anything to add to the key points above.
