Life with Apache Spark

As programmers, we are frequently asked to play around with big data to solve problems. Developing a solution for a single user is something developers will do easily. But the fun part comes when the solution goes to production, where a very large number of people try to use it :D. Then we need to program in a way that balances the load and keeps the solution responsive for every user.

The future has already arrived. It’s just not evenly distributed yet. - William Gibson

Apache Spark is an open-source data analytics cluster computing framework. Spark's core abstraction is the RDD (Resilient Distributed Dataset), a fault-tolerant collection of elements partitioned across a cluster that can be cached in memory. Apache Spark is mostly popular as an alternative to MapReduce and Hive for manipulating big data.

Rather than being restricted to maps and reduces, Apache Spark is a wonderful distributed data manipulation framework which supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning (ML), GraphX for graph processing, and Spark Streaming for real-time stream processing.


Lightning-Fast Cluster Computing

Spark can run on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

I’m really impressed with Apache Spark and the way it works in a distributed environment. I’m planning to use it on Hadoop; let’s see how Apache Spark can run like a lightning bolt :D to process data and do machine learning in a distributed environment with real-time data.

Learning Guide


Fast Data Processing with Spark by Holden Karau helped me learn and understand all the fundamentals of Apache Spark. If you follow the book from Chapter 1 to the end, you will understand the core concepts and uses of Apache Spark. It also provides a guide to compiling, building, and running Spark on Hadoop.