Sunday, January 25, 2015

#apachespark for Hadoop programmers


Apache Spark provides many advantages over Hadoop MapReduce. The following are the important differences to consider before starting with Spark.

Apache Spark API vs. Hadoop API:

  • Input: in Spark, reading a text file yields an RDD of Strings, not key-value pairs. In Hadoop, Mappers and Reducers always use key-value pairs as input and output.
  • Key-value pairs: in Spark, a Tuple is the equivalent of a key-value pair, and reduceByKey is the equivalent of a Reducer. In Hadoop, a Reducer reduces values per key only.
  • Record counts: a Spark map() always returns exactly one record per input, so filter() has to be used to remove unwanted records. A Hadoop Mapper or Reducer may emit 0, 1 or more key-value pairs for every input.
  • Typing: Spark always returns typed results; functions like flatMap, map and reduce have to be used in combination with groupByKey, and a worker may run out of memory if these functions are applied improperly. Hadoop Mappers and Reducers may emit arbitrary keys or values, not just subsets or transformations of those in the input.
  • Lifecycle: Spark's map() and flatMap() methods operate on one input at a time and provide no means to execute code before or after transforming a batch of values; the nearest equivalent is mapPartitions(). Hadoop Mapper and Reducer objects have a lifecycle that spans many map() and reduce() calls, with setup() and cleanup() methods that can be used to take actions before or after a batch of records is processed.
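To make the tuple/reduceByKey point concrete, here is word count expressed with Spark-style transformations. This is a hedged local sketch: it runs on a plain Scala collection rather than an RDD (the RDD API deliberately mirrors these collection methods), and groupBy-plus-sum stands in for reduceByKey, which on a cluster would combine values per key across partitions.

```scala
// Word count in the Spark style, on a local Scala collection.
val lines = List("spark for hadoop", "hadoop programmers")

// flatMap: one input line may emit 0..n words
// (the role a Hadoop Mapper plays when it emits many records)
val words = lines.flatMap(_.split(" "))

// Tuples play the role of key-value pairs
val pairs = words.map(w => (w, 1))

// Local stand-in for reduceByKey: group by key, then reduce the values per key
val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
// counts contains ("hadoop" -> 2) along with the words that occur once
```

In actual Spark code the last step would simply be `pairs.reduceByKey(_ + _)`; everything else reads the same.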

Beyond the API differences, there are many fundamental differences in the way Apache Spark works.

It provides:

  • Caching and in-memory computation.
  • RDD (Resilient Distributed Dataset): the RDD is Spark's main abstraction. It allows recovery from failed nodes by re-computing the DAG of transformations, while also supporting a recovery style more similar to Hadoop's by way of checkpointing, which reduces an RDD's dependencies. Storing a Spark job as a DAG allows lazy computation of RDDs and lets Spark's optimization engine schedule the flow in ways that make a big difference in performance.
  • Spark API: Hadoop MapReduce has a very strict API that doesn't allow for much versatility. Since Spark abstracts away many of the low-level details, it allows for more productivity. Also, broadcast variables and accumulators are much more versatile than DistributedCache and counters.
  • As a product of in-memory computation, Spark acts as its own flow scheduler, whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows.
  • Scala API: Scala stands for "scalable language" and is, in my view, the best language to choose for parallel processing. Scala is said to cut code down by 2-5x, but in my experience refactoring code from other languages, especially Java MapReduce code, it is more like 10-100x less code. I have refactored hundreds of lines of Java into a handful of lines of Scala/Spark, and the result is much easier to read and reason about. Spark is even more concise and easy to use than Hadoop abstraction tools like Pig and Hive, and better than Scalding.
  • Spark has a REPL/shell. The need for a compile-and-deploy cycle to run simple jobs is eliminated; one can interactively explore data just as one uses Bash to poke around a system.
  • Spark has much lower per-job and per-task overhead, which makes it applicable to cases where Hadoop MR is not, such as when a reply is needed within 1-30 seconds. The low per-task overhead also makes Spark more efficient for big jobs with many short tasks: as a very rough estimate, when a task takes about 1 second, Spark will be roughly twice as efficient as Hadoop MR.
  • Spark has a lower-level abstraction than MR: a graph of computations. As a result, it can implement more efficient processing than MR, specifically in cases where sorting is not needed. In MR we always pay for the sort; in Spark we do not have to.
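The lazy DAG evaluation mentioned above can be felt even without a cluster. As a hedged local sketch, Scala's collection views (assuming Scala 2.13+) behave like lazy RDD transformations: the map below does no work until a terminal operation, the analogue of a Spark action, forces it.

```scala
// Lazy transformations, eager actions: a local analogue of Spark's model.
var evaluated = 0

// A view, like an RDD transformation, records the computation without running it
val squares = (1 to 5).view.map { x => evaluated += 1; x * x }

val lazyCount = evaluated   // still 0: the map has not run yet
val total = squares.sum     // the "action" that forces the whole pipeline
val eagerCount = evaluated  // now 5: every element was computed by the action
```

In Spark the same shape appears as `rdd.map(...)` (lazy) followed by `rdd.sum()` or `rdd.collect()` (the action that triggers execution of the DAG).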

