Wednesday, January 14, 2015

Apache Spark Design Patterns - Using #Scala #apache-spark - Series -1 ; The word count

A simple word count using scala in Spark
Simple word count example - Click to see code
There are many limitations in the above code The objective is to count words in the post, however the Posts.xml has lot of meta-data like OwnerUserId,Title,Tags etc..The info we need is in the Body.
The missing logic is
1) Count words in the Body
2) Error handling
3) Data clean up - we don’t count single quotes, special characters This example uses case classes and xml parsing which in in-built Scala.
Enhanced word count example - Click to see code

No comments: